In this paper we present a unified approach to vector and scalar computation, using a single register file for both scalar operands and vector elements. The goal of this architecture is to yield improved scalar performance while broadening the range of vectorizable applications. For example, reduction operations and recurrences can be expressed in vector form in this architecture. This approach results in greater overall performance for most applications than does the approach of emphasizing peak vector performance. The hardware required to support the enhanced vector capability is insignificant, but allows the execution of two operations per cycle for vectorized code. Moreover, the size of the unified vector/scalar register file required for peak performance is an order of magnitude smaller than traditional vector register files, allowing efficient on-chip VLSI implementation.
Introduction
Specialized vector hardware has been available on supercomputers since the mid-1970's [13]. Recently, specialized vector hardware has also been appearing on traditional mainframes [7], minisupercomputers [15, 9] and even workstations ("solo supercomputers") [6]. This specialized vector hardware often adds substantial complexity to the scalar machine, especially at the high and low end (i.e., supercomputers and workstations).
Vectorization improves the peak performance of an application. However, even if the portions that do vectorize run arbitrarily fast, the net performance only improves by the percentage of vectorizable code. Since the range of vectorization in general-purpose scientific computing is typically 0.3 to 0.7 [16], infinitely fast vector performance would only improve the performance of the entire benchmark by 1.4 to 3.3 times. Rather than trying to improve performance by increasing the peak vector computation rate, in the MultiTitan FPU we obtained more leverage by improving scalar performance, while broadening the range of vectorizable applications. This leads to better overall performance for most applications with small or modest amounts of classically vectorizable code.
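This is just Amdahl's law: with vectorizable fraction f and vector speedup s,

    overall speedup = 1 / ((1 - f) + f/s),  which approaches 1/(1 - f) as s grows,

so f = 0.3 bounds the gain at 1/0.7 ≈ 1.4x and f = 0.7 bounds it at 1/0.3 ≈ 3.3x.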
In this paper we present a unified approach to vector and scalar floating-point computation.
This floating-point architecture was developed as part of the MultiTitan project from 1985 through 1987. The approach stresses very high performance scalar operation. It then adds an insignificant amount of hardware to provide a relatively small (2x) performance improvement for classically vectorizable code. However, this additional hardware also yields improved performance for operations not normally vectorizable, such as reductions and recurrences. The net result is to improve the non-peak performance as well as broadening the range of peak performance.
The hardware architecture of this approach is given in Section 2. Section 3 presents results of simulation studies for vectorized and non-vectorized versions of various benchmarks. Section 4 concludes the paper. Three key features distinguish our work in floating-point architecture: a unified approach to scalar and vector processing, low latency floating-point functional units, and simplicity of organization.

Hardware Architecture

The FPU contains three functional units: an add unit, a multiply unit, and a reciprocal approximation unit. (Reciprocal approximation, coupled with use of the multiply unit, is used to implement division.) Any functional unit can accept a new set of operands each cycle and produce a new result each cycle. The latency of the functional units is three cycles for all operations (i.e., the result of a computation is available three cycles after it is issued to the functional unit). A register file, containing 52 general-purpose 64-bit registers, sits between the functional units and the data cache. The register file has four ports: two ALU source operands are read from the A and B ports, ALU results are written on the R port, and loads write and stores read the memory (M) port. In addition, the FPU PSW is conceptually in the register file.
There are two separate instruction registers for controlling the operation of the FPU. One holds Load/Store instructions, which are transmitted from the CPU over a 10-bit coprocessor instruction bus. The 10 bits supply an opcode (4 bits) and a source or destination register specifier (6 bits). The second instruction register is 32 bits wide and holds FPU ALU instructions. These are transferred over the address bus from the CPU. The separate Load/Store and ALU instruction registers allow FPU loads and stores to proceed in parallel with the issue of FPU ALU operations.

Traditionally, machines that support vectors and use a load/store architecture (supporting only register-to-register arithmetic) provide separate register sets for vector and scalar data. This creates a distinction between elements of a vector and scalars where none actually exists, and it makes mixed vector/scalar calculations difficult. For example, when vector elements must be operated on individually as scalars, they must be transferred over to a separate scalar register file, only to be transferred back again if they are to be used in another vector calculation. This distinction is unnecessary. The MultiTitan floating-point architecture provides a single unified vector/scalar floating-point register file. Vectors are stored in successive scalar registers. This allows individual vector elements to be addressed and accessed with scalar operations, unlike classical vector machines. Each arithmetic instruction contains a vector length field, and scalar operations are simply vector operations of length one.
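Conceptually, the register file is one flat array: a vector operand is just a starting register and a length, and any element remains directly addressable as a scalar. A minimal sketch (names and types are ours, for illustration only):

    /* Unified vector/scalar register file: a vector is a (start, length)
     * pair over consecutive registers; a scalar is simply length == 1. */
    typedef struct { int start, length; } Vec;   /* length 1..16 */

    double regfile[52];   /* the 52 general-purpose 64-bit registers */

    /* Dot product: the vector-multiply results stay in the register
     * file and are reduced in place with ordinary scalar accesses --
     * no transfer to a separate scalar register file is needed. */
    double dot(Vec a, Vec b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++)
            sum += regfile[a.start + i] * regfile[b.start + i];
        return sum;
    }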
Vector ALU operations
The format of FPU ALU instructions is given in Figure 2-3. Figure 2-4 summarizes the operation of the func and unit fields. Vector instructions are issued by merely incrementing register fields in the instruction register and issuing the resulting instructions with the same mechanism used for scalar operations. The vector length field specifies the number of elements in the vector, from 1 to 16. The only means of specifying vector length is statically in the instruction itself; there is no dynamically loadable vector length register. After issuing the first instruction in the vector, the vector length field is checked. If it is zero, the instruction is cleared from the instruction register. If it is non-zero, the vector length field is decremented and the appropriate register specifiers are incremented.
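The issue mechanism can be modeled in a few lines of C (a behavioral sketch under our own naming; SRa and SRb are the per-source stride bits described in the next paragraph):

    /* Behavioral sketch (ours, not the hardware) of vector ALU issue:
     * one element goes through the normal scalar issue path each cycle.
     * The result specifier always increments. */
    typedef struct {
        int ra, rb, rr;   /* 6-bit register specifiers                  */
        int sra, srb;     /* stride bits: 1 = vector source, 0 = scalar */
        int length;       /* vector length field, 1..16                 */
    } FpuAluInstr;

    void issue_vector_add(FpuAluInstr in, double reg[52]) {
        while (in.length-- > 0) {
            /* scalar issue: scoreboard check, then one element */
            reg[in.rr++] = reg[in.ra] + reg[in.rb];
            if (in.sra) in.ra++;   /* Ra strides only if SRa is set */
            if (in.srb) in.rb++;   /* Rb strides only if SRb is set */
        }
    }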
The resulting instruction is then treated the same as any other instruction newly placed in the instruction register. If the SRa (stride Ra) bit is zero, register source field Ra does not increment (i.e., it is a scalar). If the SRb bit is zero, register source field Rb does not increment. If both bits are zero then "vector := scalar op scalar" is performed.

With this organization, operations that are not vectorizable on other machines can be vectorized. Since the normal scalar scoreboarding is used for each vector element, reduction and recurrence operations can be naturally expressed in vector form. For example, the inner loop of matrix multiplication consists of a dot product in which the elements of a vector multiply must be summed (i.e., a reduction). Since there is no distinction between vector and scalar registers, the reduction can be performed with scalar operations without moving the data from the result registers of the vector multiply. In fact, there are several different ways in which the reduction may be performed.

First, we can sum adjacent pairs of elements of the vector multiply result, placing the partial sums in temporary registers R8 through R11. For each sum we need to transfer an FPU ALU instruction over the address bus from the CPU. (We could also reuse the vector multiply result registers, but extra temporary registers have been used for clarity.) Next we can pairwise sum R8 through R11 into R12 and R13. Note that the sum of R10 and R11 cannot issue in cycle 5 since R11 is not yet available. A simple result reservation mechanism stalls the issue until cycle 6, when both operands are available. If some other independent CPU or FPU instruction is available, it would typically be scheduled before the sum of R10 and R11 to prevent the cycle from being wasted. Similarly, the final addition of R12 and R13 to complete the sum cannot issue until cycle 9. The total time required to sum 8 elements of a vector is 12 cycles.

A second way to implement the reduction is with a vector ALU instruction. (Actually, all of the scalar adds in the previous example were vectors of length 1.) This is illustrated in Figure 2-6. In this example, we initialize R8 to 0 and sum each of the registers R0 through R7 with R8. This is not a particularly time-efficient way of performing the sum, but it illustrates an important point. Since the normal scalar issuing hardware is used for issuing each element of the vector, arbitrary data dependencies between elements of a vector are possible. This is in contrast with classical vector architectures, where arbitrary data dependencies between elements are not allowed. In this case, each element depends on the result of the previous element's computation, so the overall computation takes 24 cycles. While the vector instruction is executing, other instructions may be issued by the CPU. However, while the vector is issuing, the FPU ALU instruction register is occupied, so no other FPU ALU instructions can be issued.

A third way is to perform the reduction as a short sequence of vector adds, summing the two halves of the vector at each step. This computation is identical to the scalar tree version of the computation, with two exceptions. First, since the register specifiers are incremented only by 0 or 1 between elements of a vector, the pairs summed must be (R0,R4), (R1,R5), (R2,R6), and (R3,R7) instead of (R0,R1), (R2,R3), (R4,R5), and (R6,R7) as in the scalar example. Second, only three instructions must be issued by the CPU to perform the sum. This frees the CPU to issue more instructions concurrent with the summation.
In this example there are 9 cycles out of the 12 in which the CPU may issue other instructions. In the matrix multiply example this would allow the 8 elements of the next row to be loaded in parallel with the reduction of the current row. Since data dependencies are allowed between vector elements, recurrences can also be expressed in vector form. For example, the first 10 Fibonacci numbers (i.e., a recurrence) can be computed by initializing R0 and R1 to 1 (Fib1 and Fib2) and executing R2 ← R1 + R0 with vector length 8.
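Behaviorally, that vector recurrence is just the following loop (our sketch; in hardware each element waits three cycles on its predecessor, so the length-8 vector takes about 24 cycles by the same argument as the reduction example above):

    /* Fibonacci as the vector recurrence R2 <- R1 + R0, length 8:
     * each element's result is the next element's source operand. */
    double r[10];
    r[0] = r[1] = 1.0;                  /* Fib(1) and Fib(2) */
    for (int i = 0; i < 8; i++)
        r[2 + i] = r[1 + i] + r[0 + i];
    /* r[0..9] now holds the first 10 Fibonacci numbers */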
Vector Loads and Stores
The FPU registers can be loaded or stored individually, but the MultiTitan does not have vector load or store instructions. Vector register load or store instructions in a virtual memory environment share many problems with multi-word memory references present in CISC machines. For example, the vector load can cross a page boundary, and the machine must save enough state to properly restart it. Vector memory references can result in a significant performance improvement for vectorizable portions of code on machines with large memory bandwidth. However, the MultiTitan has more limited bandwidth than these machines, in keeping with the goal of maximizing average (i.e., scalar) and not peak (i.e., vector) performance. Also, in many applications the most important advantage of vector instructions is the ability to overlap floating-point computations, memory references, and normal loop overhead. In the MultiTitan, this is possible to a large extent without the use of vector memory references. Once a vector arithmetic operation is begun, the CPU is free to issue loads for future computations, stores of previous results, and loop overhead instructions.
For fixed-stride applications, the MultiTitan can issue one load per cycle by folding the stride into the load offset. Combined with the ability to independently issue one FPU ALU operation per cycle during vector operations, this allows a peak issue rate of two operations per cycle. Since the loading and storing of vector elements is performed under program control, full flexibility is retained, and operations such as scatter and gather are easily implemented. Vector elements can even be gathered from a linked list with only a doubling of the time otherwise required, even though loads have a one-cycle delay slot. This is illustrated in Figure 2-9. (Loads that follow the linked list alternate between an even and an odd temporary register, so that the load of the floating-point data can use the pointer concurrently with the load of the next pointer.)
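A sketch of that linked-list gather (ours, not the paper's code; the alternation of pointer temporaries is what hides the one-cycle load delay slot):

    typedef struct Node { double val; struct Node *next; } Node;

    /* Gather vector elements from a linked list. Alternating "even"
     * and "odd" pointer temporaries let each data load proceed
     * concurrently with the load of the next pointer. */
    int gather(Node *even, double out[], int max) {
        int n = 0;
        while (even && n < max) {
            Node *odd = even->next;   /* next pointer, in the delay slot */
            out[n++] = even->val;     /* data load overlaps pointer load */
            if (!odd || n >= max) break;
            even = odd->next;         /* ...and alternate registers */
            out[n++] = odd->val;
        }
        return n;                     /* number of elements gathered */
    }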
Traditional vector register banks, where registers are grouped into vectors of fixed length and operated on as a group, reduce the opcode space required to represent instructions but also limit the flexibility of use of the individual registers. For example, in these static allocation schemes, the user cannot choose between 8 banks of 64 registers and 64 banks of 8 registers. In the MultiTitan, the user can dynamically partition the 52 64-bit registers into any number of 1- to 16-element register groups on an instruction-by-instruction basis.

The MultiTitan FPU register file requires 3.3K bits of dual-ported storage (time-multiplexed to provide four ports). This easily fits on the same chip as the functional units. Other vector register architectures require much larger amounts of storage. For example, 8 64-element 64-bit vector registers would require 32K bits of storage, or about ten times that of the unified vector/scalar register file. It is not possible to put 32K bits of multiported registers (in which each cell is larger than an SRAM cell) on the same chip as the functional units in today's technologies. Systems with large vector register files thus require off-chip accesses, increasing the latency of operations. The benefits of the small unified vector/scalar register file size will continue into the future: when technology has advanced enough to put a large vector register set on the same chip as the functional units, the unified vector/scalar register set will fit on the same chip as the functional units, the integer portion of the CPU, the instruction buffer, etc. A final benefit of the small register file size is that the context switch cost is smaller than that of traditional vector machines, where the vector register state must be saved.
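The storage figures follow directly from the register counts:

    52 registers x 64 bits                 =  3328 bits  (about 3.3K bits)
    8 registers x 64 elements x 64 bits    = 32768 bits  (32K bits, about 10x larger)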
Low Latency Functional Units
The net performance of a vector operation is a function of its peak performance as well as its latency. If a pipelined functional unit has a latency of l cycles, then it is not operating at peak performance unless its pipeline is filled with l operations. At the beginning and end of every vector operation the functional unit will be operating at less than its peak rate, and it will never attain its peak performance if the vectors are shorter than its latency. The vector half-performance length (n1/2) [3] is the vector length required to achieve half of the maximum performance.
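For a single pipelined unit this has a simple closed form (our simplified model, ignoring memory references): a vector of length n completes in about l + n - 1 cycles, so the delivered rate is

    r(n) = n / (l + n - 1)   results per cycle,

and setting r(n1/2) = 1/2 gives n1/2 = l - 1. The MultiTitan figure of about 4 quoted below is larger than l - 1 = 2 because the load/store path contributes latency as well as the l = 3 ALU pipeline.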
Low latency functional units are essential in the MultiTitan for two reasons. The first is specific to the unified vector/scalar register file of the MultiTitan, and the second is from the applications executed.
Latency Constraints of the Unified Vector/Scalar Register File

In a machine with a unified vector/scalar register file, all of the FPU registers must be directly addressable. Thus, there must be a limited number of them in order for 3-operand FPU ALU instructions to be encoded in 32 bits. The 6-bit MultiTitan register addresses form a constraint on the maximum vector size. Since this six-bit field is also used to address registers in other coprocessors, the actual FPU register address space is limited to 52 registers. Often the 52 registers are used as six vectors of length 8 and four scalars.
Thus, in order for good performance to be obtained, the vector half-performance length on the MultiTitan (n1/2) must be kept to less than 8. The vector half-performance length achieved by the MultiTitan is approximately 4. This is due to the single-cycle load/store latency from the cache and the three-cycle latency of FPU ALU operations. This minimum vector length for half performance is much smaller than the minimum for traditional vector machines, such as the CDC Cyber 205 (n1/2 = 100), array processors such as the ICL DAP (n1/2 = 2048), or even the Cray-1 (n1/2 = 15).
Latency Constraints from the Applications
Low latency operations are essential for high performance on scalar applications with data dependencies. The latency of operations also determines the minimum vector half-performance length. Many applications will always have very short vectors. For example, 3-D graphics transforms are expressed as the multiplication of a 4-element vector by a 4x4 transformation matrix.
Implementation Latencies
In the MultiTitan FPU the latency of all floating-point operations is three cycles, including the time required to bypass the result into a successive computation. This is very short in comparison to most machines. (Division is implemented as a series of six 3-cycle operations.) The functional units support only double-precision floating-point operations, simplifying their design. This also enables special cases specific to the double-precision format to be exploited, further reducing functional unit latency.
Each functional unit uses novel structures to reduce the latency of its operations. For example, the add unit uses separate specialized paths for aligned operands and normalized results [2] , as well as specialized paths for positive and negative results. The multiply unit uses a novel "chunky binary tree" which is faster in practice than a Wallace tree. The reciprocal approximation unit uses linear interpolation to develop a 16-bit reciprocal approximation. Additional details of the functional unit design are beyond the scope of this paper, but may be found in other documents [4] .
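The paper does not spell out the division sequence, but a conventional decomposition consistent with "six 3-cycle operations" is a hardware reciprocal seed refined by Newton-Raphson steps in the multiply unit (one seed, two steps of two operations each, and one final quotient multiply). The sketch below is our reconstruction, not the documented microcode; its software seed is much cruder than the hardware's 16-bit one, so it runs extra steps:

    #include <math.h>

    /* Software stand-in for the reciprocal approximation unit's
     * linear-interpolation seed. This crude version gives ~4.5 bits;
     * the hardware delivers about 16. Assumes b > 0 for simplicity. */
    static double approx_recip(double b) {
        int e;
        double m = frexp(b, &e);          /* b = m * 2^e, 0.5 <= m < 1 */
        return ldexp(48.0/17.0 - (32.0/17.0) * m, -e);
    }

    /* Newton-Raphson: x' = x * (2 - b*x) doubles the precision each
     * step. A 16-bit seed needs two steps to reach double precision;
     * our ~4.5-bit software seed needs four. */
    double divide(double a, double b) {
        double x = approx_recip(b);
        for (int i = 0; i < 4; i++)
            x = x * (2.0 - b * x);
        return a * x;                     /* final quotient multiply */
    }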
As a reference, the latency of various operations in the Cray X-MP (with a 9.5ns cycle time) can be compared with the latency of the functional units in the FPU. Because the MultiTitan has a slower cycle time and a less powerful memory subsystem, its peak performance relative to the Cray X-MP can be significantly less than that implied by the latency ratios.

There are two types of control logic in the floating-point architecture: logic that supports both fast scalar execution and vector execution, and logic specific to vector execution. The only vector-specific hardware required by the architecture is three six-bit incrementers for the register specifiers, one four-bit decrementer for the vector length, and a very small amount of pipeline control to reissue instructions whose count is non-zero.
Control Logic for Fast Scalar Execution
The FPU control logic provided in the MultiTitan for scalar execution is much simpler than in most highperformance machines. For example, the FPU has a lock step pipeline like the CPU, greatly simplifying the control logic. Also, since all functional units have the same latency, the functional unit write port to the register file need not be reserved or checked for availability before instruction issue.
Vector instructions that overflow on one element discard all remaining elements after the overflow. The destination register specifier of the first element to overflow is saved in the PSW. Note that vector ALU instructions may continue executing long after an interrupt. For example, in the case of vector recursion (e.g., r[a] := r[a-1] + r[a-2]) of length 16, the last element would be written 48 cycles later, even if an interrupt occurred in the meantime.
The FPU control is split into two parts. The first part manages the operation of FPU loads and stores. The second part manages FPU ALU instructions. These two parts of the machine communicate through the register file, the inter-chip pipeline control signals, and the scoreboard.
Central to the scoreboard is a register write reservation table. This table consists of one bit for each register in the register file. The bit is set when there is an outstanding operation that will write the associated register, and it is used to prevent subsequent instructions from reading the register before it has been written. Five ports are required on the register write reservation table every cycle:

• 2 read with the source operands for ALU operations
• 1 set for the destination upon ALU operation issue
• 1 cleared for the destination of a retired ALU operation
• 1 read for loads and stores

Of the five required scoreboard ports, all are accessed at the same time as their associated data, except for the port that sets a bit on issue of FPU ALU instructions. For example, the ALU source operand reservation bits are read at the same time as the ALU source operands. Moreover, one of the writes is always a set, while the other is always a clear. We take advantage of these restrictions in the following implementation. This implementation has the advantage that it requires only one extra decoder besides those already required for the register file, and for a single reservation bit the decoder area greatly exceeds that of the RAM cell.
Reservation bits are stored as an extra bit on each word in the register file. The register file R port word line of the extra bit is partitioned into two separate word lines. One segment is controlled by the same word line as the rest of the word. The other is controlled by the destination of the provisionally issued instruction. Since we will never want to write a reservation bit with an arbitrary value, but only set it or clear it, we can do both by single-ended writes. The true bitline can be used to clear a bit at the same time as the complement bit line is used to set another bit.
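Behaviorally, the reservation table reduces to a per-register busy bit that is read at issue, set for the destination, and cleared at retire. A minimal model (ours; the actual implementation is the single-ended set/clear circuit just described):

    #include <stdint.h>
    #include <stdbool.h>

    static uint64_t reserved;   /* bits 0..51: one busy bit per register */

    /* Issue check: no source may have an outstanding writer. (With
     * equal-latency functional units, scalar results retire in order,
     * which simplifies the destination case.) */
    bool can_issue(int ra, int rb) {
        return (reserved & ((1ULL << ra) | (1ULL << rb))) == 0;
    }

    void on_issue(int rr)  { reserved |=  (1ULL << rr); }  /* set bit   */
    void on_retire(int rr) { reserved &= ~(1ULL << rr); }  /* clear bit */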
The FPU uses a distributed result bypass in which each functional unit in the FPU does its own bypassing. If the bypass logic were centralized at the register file, results would have to be put out on the global result bus, then transferred to a global source bus. But since the result bus goes to all functional units, they can select between each source and the result bus based on control signals from the scoreboard. Thus, with distributed bypass logic, the delay from driving the result to the latching of a source is only one global wire delay, not two.
Control Logic for Support of Vector Execution
The reservation of vector result registers is a difficult problem. Three approaches exist:
1. Reserve all elements of the result register at once, before issue of the first element.
2. Handle reservations in software.
3. Reserve each element's result register upon issue of that element.

Note that in traditional vector machines like the Cray-1, vector registers are treated as an indivisible resource, and the vector result register reservation problem is simplified to reserving a single resource.
Two difficulties occur if all result registers of a vector operation are reserved at once, before issue of the first element's computation. First, special hardware must be provided beyond that required for scalar operations in order to reserve more than one register at a time. Second, additional hardware must be provided to check for prior reservations on all result registers simultaneously. Otherwise the vector register reservation may reserve an already-reserved register, in which case the second reservation will be lost on the retiring of the first. One solution would be to compute the remaining source and result register ranges of in-progress vector instructions each cycle, and compare load and store register operands against these ranges before issuing them. This would add a fair amount of hardware, and is the approach taken in the recently announced Ardent Titan [6]. (The Ardent Titan should not be confused with the unrelated DEC WRL Titan or the MultiTitan.)

Reservation of vector result registers could also be handled by code reorganization.
In most machines, floating-point operations have relatively long latencies (e.g., 7-30 cycles). Coupled with potentially long vector lengths, this makes it very unlikely that operations can be scheduled well enough to avoid inserting NOPs. Although all operations in the MultiTitan FPU have a three-cycle latency, and the maximum vector length is 16, vector recursion can yield vector execution times of up to 48 cycles. This is still far too long to pad with NOPs. Instead we must rely on a hardware reservation mechanism.
Reservation of individual vector element result registers upon issue of each element can provide hardware interlocks with very low cost. This is the approach used by the MultiTitan.
By reserving result registers at the issue of each element, the reservation hardware already in place for scalar operations can be used. Unfortunately this causes a synchronization problem: while the elements of the vector are issuing one by one, a load or store instruction may issue and retire. In particular, the register operand of the load or store may be the same as a source or result register operand of a vector element that has not yet issued, but whose vector instruction was issued before the load or store. In order to prevent out-of-order execution (with nondeterministic results), execution constraints must be placed between the vector instruction and any following loads and stores that issue before every element of the vector has issued.
If dependencies occur between loads and stores and elements of a vector other than the first, the compiler must break the vector into smaller vectors so that the normal scalar interlocks are effective. However, in most code sequences this will not be necessary: for example, if a vector operation is followed by stores of each result register, the stores can be performed in the same order as the result elements are produced. Only when operands must be stored out of order, or when the first elements of a vector are not stored but later elements are, will the compiler need to break a vector in order for the normal scalar interlocks to be effective. Note that if an entire vector were required to issue before loads or stores were honored, most useful overlap of transfers and computations would be precluded.
In summary, when the scalar issue logic is used to issue vector instructions, only a very small amount of extra logic (a few counters) is required to support vector execution.
Benefits of Additional Vector Support
There are several obvious limits on the performance of this floating-point architecture:

• only 1 FP ALU operation may issue per cycle
• only 1 Load/Store may issue per cycle
• only a total of two operations may issue per cycle
In the MultiTitan, concurrent issue is only supported between loads and ALU operations, or ALU operations and stores, but not between multiple ALU operations. Why isn't the simultaneous issue of multiple ALU operations supported? There are two basic reasons.
First, the cost of issuing multiple ALU operations per cycle is very high. Each additional operation would require three additional register file ports and three additional busses. The three busses already present consume about 10% of the chip area.
Second, and more important, the ability to issue multiple ALU operations per cycle would not significantly improve performance. The additional hardware resources would improve peak performance, but not scalar performance. If the basic vector capability already places the machine past the point of diminishing returns, investing heavily to push it even further past is pointless. The simple vector capability in the MultiTitan obtains a significant portion of the performance improvement available from vectorization, at least for code with low to average amounts of vectorizable operations. Supercomputers, on the other hand, are biased towards higher peak performance: the ratio of peak vector performance to scalar performance is about 10 for the Cray-1S and the Cray X-MP.
Finally, further increasing the size and complexity of the machine to support higher peak performance is dangerous. If it slows down non-peak performance significantly, the overall performance of the machine can easily be reduced.
Simulation Results
To experiment with the MultiTitan vectors, we extended the Mahler [14] intermediate language for our compiler system with a primitive vector capability that corresponds fairly closely to the machine. Vector variables can be declared with a specified constant number of elements, and any consecutive subsection of such a vector can be used in a vector operation, provided that the offset and size of the subset are fixed at compile time. Moreover, memory may be referenced directly as a vector with a size and stride fixed at compile time.
The usual scalar floating-point operations may be performed on two vectors of the same length, or on a vector and a scalar. We also added an operator that summed a vector, by performing a vector sum to add its two halves and then doing the same thing to the resulting smaller vector, until left with one or two scalar additions. Loads and stores of the vector variables could be specified by an assignment from or to a memory vector, respectively. Such an assignment was implemented by a series of loads or stores. Memory vectors could also be used directly as operands or destinations, in which case the values would be loaded into or stored from a series of registers not used for vector variables.
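In C terms, the sum operator's strategy looks like this (our sketch of the reduction the compiler emits, not Mahler itself):

    /* Sum a vector by repeated halving: one vector add combines the
     * two halves, then recurse on the half-length result until one
     * or two scalar additions remain. */
    double vector_sum(double v[], int n) {
        while (n > 2) {
            int h = n / 2;
            for (int i = 0; i < h; i++)    /* one vector add, length h */
                v[i] += v[i + h];
            if (n & 1) v[0] += v[n - 1];   /* odd leftover element */
            n = h;
        }
        return (n == 2) ? v[0] + v[1] : v[0];
    }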
Each vector mapped directly to a group of registers. Registers were allocated on a per-procedure basis, on the assumption that they would be used only for the duration of an inner loop. If the total amount of space needed for the declared vectors and temporaries was too large, a compile error was raised. In most cases this meant that our vector operations had lengths of 4 or 8.
These primitives were then used to manually recode the benchmarks we studied, and the recoded benchmarks were simulated on an instruction-level simulator. The Mahler code was intended to represent what the compiler would generate from a program written in an extended version of Modula-2 that provided vector primitives, not what could be produced from the original FORTRAN sources. Our strip-mining followed standard techniques [5]. (Strip-mining is the process of dividing a vector computation of possibly indeterminate length into a loop that performs independent vector computations of a fixed length. Code must also be provided to compute any remainder of the original vector computation not handled by the loop.) Register allocation was done by checking lifetimes of subexpressions, which gave the number of vector values live at any point in the code. Knowing that value and the number of registers on the FPU allows a compiler to choose vector lengths. The Mahler code was produced without extensive hand optimization other than induction variable analysis, strip-mining, and careful vector register allocation.
Because the vectors had to be short, we were not hampered by the fact that they also had to be of fixed length. When a loop could not be unrolled an integral number of times, the leftover was always small. When the leftover was of known size, it could be done as a shorter vector operation. However, even when it had to be done as a scalar loop, it was still fast because the scalar operations are themselves fast.
Graphics Transform
This section describes a graphics routine that transforms a point by multiplying a vector by a transformation matrix. It is representative of many possible applications for the FPU. The problem and register allocation are given in Figure 3-1. Assume that many points will be transformed by one matrix, so the transformation matrix will already be loaded into R0..R15. (If the transformation matrix is not loaded, this requires an extra 16 cycles, assuming no cache misses.) For each element of the initial point vector, we load it and issue a vector floating-point multiply of the element by a column of the transform matrix, resulting in a total of 16 result elements. Once these multiplications have been issued, we start adding together rows of the 4x4 result elements. Each row is added together in a binary tree, and the four trees are summed in parallel using four-element vectors. Finally, we store the result vector. Figure 3-2 gives the code sequence and cycle timings for this routine. Each instruction requires one cycle, with two exceptions: back-to-back stores require two cycles, and arithmetic operations cannot issue until a previous vector operation has completely issued all of its elements, at a maximum rate of one element per cycle. There is only one scoreboard stall for data dependencies in the routine. It is assumed that there are no instruction buffer misses during the routine. This example has been run on the MultiTitan simulator and achieves 20 MFLOPS. The total latency in this example is 35 x 40ns cycles (1.4us) for double-precision computations.
This performance is better than that often provided by special-purpose graphics hardware [4] .
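In C terms the routine computes the following (our rendering of the register-level sequence in Figure 3-2; the matrix is taken column-major so that each point element multiplies a whole column):

    /* Transform one point: out = M * in, with M stored column-major
     * in m[16] (matching the per-column vector multiplies). */
    void transform(const double m[16], const double in[4], double out[4]) {
        double prod[16];
        for (int j = 0; j < 4; j++)        /* 4 vector multiplies:     */
            for (int i = 0; i < 4; i++)    /* column j times in[j]     */
                prod[4*j + i] = m[4*j + i] * in[j];
        for (int i = 0; i < 4; i++)        /* length-4 vector adds sum */
            prod[i] += prod[4 + i];        /* the four product columns */
        for (int i = 0; i < 4; i++)        /* in a binary tree         */
            prod[8 + i] += prod[12 + i];
        for (int i = 0; i < 4; i++)
            out[i] = prod[i] + prod[8 + i];
    }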
Livermore Loops
The simulation results obtained for the Livermore Loops running on the MultiTitan are shown in Figure 3-3. The performance of each loop was highly dependent on whether the data referenced by the loop was present in the cache. The performance figures for the warm cache were obtained by running the loops twice, thus preloading the code and the data. The numbers shown for the cold cache performance assume that both the instruction and data caches are empty at the start of the simulation.
The performance of the cold cache simulations is quite low compared to the warm cache numbers, by factors of about three to six. Because the MultiTitan lacks the pipelined memory access of the Cray, its performance suffers greatly from cache misses. The actual cache miss rate depends on the size of the cache and the size of the data set. Studies of some scientific workloads indicate that good cache hit ratios (much greater than 90%) can be obtained [10, 11], so we expect numbers closer to the warm cache numbers for real programs.
The primary bottleneck keeping the MultiTitan from obtaining higher performance on these benchmarks is its limited memory bandwidth. Even when a cache hit occurs, only one load can issue per cycle, and stores can only issue every other cycle. For a two-operand vector add this requires about 4 cycles per result: two loads, a compute, and then a partially overlapped store. (Stores take two cycles, but half of the time for the stores can be overlapped with the computation.)
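The four cycles per element break down as:

    2 (loads) + 1 (add issue) + 1 (un-overlapped half of the 2-cycle store) = 4 cycles per result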
However, since memory bandwidth in excess of one operand per cycle is very expensive and primarily improves peak performance, the current design seems to be very cost-effective. The benchmarks that do better than 4 cycles per result do so because operands can be kept in the registers and used multiple times across a single vector expression. For this reason, even some of the cold cache performance figures are good, particularly Livermore Loops 1 through 3 and 7 through 9. In these loops a performance improvement of at most 25-33% would be gained from additional load and store bandwidth. Note that the warm-cache MultiTitan had better performance than the Cray-1S on Livermore Loops 5 and 11, which were not vectorized on the Cray.

Livermore Loops 13-24 in general have larger and more complex kernels than Loops 1-12. The difference in performance between the cold cache numbers and the warm cache results was less than that for the first set of loops. This is because the later loops contain more branching and index calculations, so the relative data bandwidths were lower, and hence the effects of cache misses were diluted. Due to the complexity of the later loops, Loops 13, 15, 17, 19, 20, 22, and 23 were coded in Modula-2 instead of Mahler. Since loop induction-variable elimination was not effective in Modula-2, the performance of these loops on the MultiTitan would improve with more sophisticated compiler technology. Also, because of the complexity of the procedures, some routines written in Mahler are not well tuned; Loops 14, 16, and 18 suffer the most in this regard. The performance of the MultiTitan on Loop 22 is the worst in proportion to the numbers for the Crays, because it contains an exp() call that is vectorized by the Cray but is implemented with a scalar subroutine call on the MultiTitan.
Overall, the warm-cache MultiTitan performance was about one-half that of the Cray-1S and about one-third that of the Cray X-MP.
Linpack
Linpack has been run on the MultiTitan simulator. The scalar Linpack performance obtained was 4.1 MFLOPS, while the vector Linpack performance obtained was 6.1 MFLOPS. The scalar performance is approximately 25 times that of a VAX 11/780 with FPA. However, the vector performance is only 1/4 that of the Cray-1S Coded BLAS and 1/8 that of the Cray X-MP [1]. The MultiTitan vector performance for Linpack is worse in relation to the Cray than for the Livermore Loops because Linpack has a higher degree of vectorization and increased memory bandwidth requirements in comparison to the Livermore Loops.
Conclusions
The MultiTitan unified vector/scalar floating-point architecture is a very powerful yet simple and cost-effective architecture. It emphasizes improved scalar performance while broadening the applicability of vectorization. This results in higher and more cost-effective overall performance for most applications than emphasizing peak vector performance. Only an insignificant amount of additional hardware (a few incrementers) is required to provide a vector capability. Rather than issuing vector operations as a whole, each vector element is issued with the existing scalar issue hardware. This enables reduction operations and recurrences to be expressed as vectors, unlike most traditional vector machines.
The unified vector/scalar register file, coupled with low latency functional units, allows a much smaller register file to be sufficient for peak performance compared to traditional vector register machines. Thus the unified vector/scalar register file can easily fit on the same chip as the functional units in today's CMOS technology. (In the next CMOS technology they could easily fit on the CPU chip.) The pipeline control is extremely simple. For example, all floating-point operations take the same amount of time, simplifying the scoreboard logic.
Separate Load/Store and ALU instruction registers provide load/store bandwidth that is well balanced with the computation bandwidth. This also enables execution of two operations per cycle during vector execution. Sustained execution rates of 15 double-precision MFLOPS with vectorization and 7 MFLOPS without vectorization are attainable for many applications with current CMOS technology.
