1 Introduction
Vector architectures have been used for many years for high performance numerical applications, an area where they still excel.
The traditional approach to vector processor design has been to use an in-order execution engine and achieve high performance by exploiting the natural data-level parallelism embedded in each vector instruction.
Typically, traditional vector architectures have used very limited forms of ILP techniques, only allowing some overlapping of vector and scalar instructions but keeping the scalar and vector instruction streams strictly ordered. To achieve good performance and to be able to tolerate the large latencies associated with supercomputer main memory systems, vector designers have exploited the large number of independent operations present in each vector instruction. When a vector instruction is started, it pays for some initial (potentially long) latency, but then it works on a long stream of elements and effectively amortizes this latency across all elements. A few of these vector instructions running concurrently can yield a very good usage of the available hardware resources.
In this context, it is natural that vector processor designers have striven to implement vector registers as large as budget and technology constraints would allow. Nonetheless, in today's environment, where ILP techniques such as out-of-order execution, decoupling, multithreading, branch prediction and speculation have proved their value as latency tolerance mechanisms, it is less clear that the best way to invest the available register space is to have only a few very large registers.
Large registers have several drawbacks. First, if an application cannot make full use of each register, then a precious hardware resource is being wasted. Second, given a certain budget in terms of transistors, large registers imply that only a few of them can be implemented. A small number of logical registers has a direct impact on the amount of spill code that the compiler and/or programmer must introduce to fit all live variables in the limited register file. Third, introducing ILP techniques in a processor having a few very large logical registers is difficult. For example, out-of-order execution without renaming with only 8 logical vector registers provides little benefit. On the other hand, introducing register renaming can be very costly since many copies of registers that are very large have to be provided.
Reducing the vector register length is certainly a solution to the problems just outlined. If most applications cannot fully use all elements present in each vector register, then reducing the vector register length will reduce cost and increase the fraction of usage of the registers.
The drawback of register length reduction is the associated performance penalty. Each time a vector instruction is executed, its associated latencies are amortized over a smaller number of elements. This can have a significant impact on performance, especially for memory accesses. Moreover, more instructions have to be executed, each with a shorter effective length, and, therefore, the number of times that latencies must be paid is larger.
Unless some extra latency tolerance mechanism is introduced in a vector architecture, vector length cannot be reduced without a severe performance penalty. While many techniques have been developed to tolerate memory latency in superscalar processors, only a few studies have considered the same problem in the context of vector architectures [1, 2, 3]. This paper will present data confirming the fact that traditional vector architectures cannot reduce their vector register length without suffering a severe performance penalty. However, we will show that by combining the vector register length reduction with an ILP technique, decoupling, the performance penalty can be made very small. We will show that the resulting architecture tolerates long memory latencies very well and also makes better use of the available storage space in each vector register. Not only is the performance impact of reducing the vector length small, but when our architecture with short vector registers is compared against a traditional vector machine with large vector registers, performance is in most cases far better across a large memory latency range.
Vector Length usage
The usage of the vector register file elements is determined by both the degree of vectorization of a program and the natural vector lengths associated with the data structures of an application. Many applications have small data sets or iterate over a particular dimension of an iteration space which is smaller than the vector register length. In [3] we evaluated a set of highly vectorizable applications in order to determine the vector lengths used by these programs.
The first thing to note is that, even though this set of programs is highly vectorizable, their average vector lengths are not very high. Investigation of the programs reveals that this is often due to the natural shape of the application data space. In other cases, it is due to the nature of the algorithm; for example, a triangular matrix operation tends to have many small vector lengths.
Figure 1: Adding strip-mining.
Compiling for smaller vector lengths
In order to investigate the effects of reducing the hardware vector register length, we need a set of benchmarks compiled assuming different vector lengths. Unfortunately, no public domain vectorizing compiler is available and, therefore, we are forced to artificially fool the Convex compiler [5] into generating code "as if" the vector length was 16, 32 or 64 (instead of the real 128). To obtain the desired binaries we modified the source benchmarks as follows. Using the vectorization information produced by the Convex compiler, we located in the source code each vectorized loop. For each loop nest, and taking into account loop transformations such as peeling, interchange and skewing, we manually strip-mined the loop being vectorized. This manual strip-mining consisted of adding a strip-mine loop performing steps of length VLZ and modifying the original vectorized loop to do at most VLZ iterations (see figure 1). To prevent the compiler from generating a doubly strip-mined loop (our strip-mining plus the natural strip-mining introduced by the compiler) we used the MAXTRIPS directive [5]. This directive informed the compiler that the inner loop was performing fewer than 128 trips and thus no extra strip-mining was generated.
Using such a procedure we strip-mined most (but not all) vectorized loops present in our ten benchmarks. Loops that escaped this strip-mining were vector loops that are in libraries and loops where introducing one extra level of strip-mining stopped vectorization. Moreover, due to the large number of loops to strip-mine, we first selected those that account for 95% of all execution time. The remaining loops, which form the other 5% of execution time, were not instrumented. For each program, we generated four different binaries, assuming that the maximum hardware vector length was 16, 32, 64 and 128.
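The manual strip-mining described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `strip_mine` and `saxpy_strip_mined` are hypothetical, the parameter `vl` plays the role of VLZ, and Python lists stand in for vector registers. It is not the source transformation that was actually applied to the benchmarks.

```python
def strip_mine(n, vl):
    """Yield (start, length) strips covering indices 0..n-1.

    The outer loop advances in steps of vl; each strip is at most vl long,
    mirroring the added strip-mine loop (hypothetical helper).
    """
    for start in range(0, n, vl):
        yield start, min(vl, n - start)


def saxpy_strip_mined(a, x, y, vl):
    """Compute a*x + y one strip at a time.

    Each strip stands for the work a vector machine with registers of
    vl elements would do with one sequence of vector instructions.
    """
    out = [0.0] * len(x)
    for start, length in strip_mine(len(x), vl):
        # Inner (originally vectorized) loop: at most vl iterations.
        for i in range(start, start + length):
            out[i] = a * x[i] + y[i]
    return out
```

For a loop of 40 iterations and a vector length of 16, the sketch produces two full strips of 16 elements and a final strip of 8, so three vector instruction sequences run where a 128-element machine would need only one.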
For each register length, the percentage of operations that escaped our strip-mining procedure varied, but was below 4% for all programs except arc2d and flo52 where it was close to 10%.
Short Vectors Performance
We start by analyzing the performance of a traditional in-order vector machine when the hardware vector length is varied. We are interested in the effect that different memory latencies have on performance and how this interacts with vector register length.
4.1 Performance on the Reference Architecture
Our reference machine is loosely based on a Convex C3409. The essential characteristics of the reference architecture are a single memory port, two functional units and 8 vector registers. In [6] we give a detailed explanation of this reference architecture. In [4] we studied four different variants of this reference machine. The four models under study were referred to as the REF128, REF64, REF32 and REF16 architectures, with a vector length of 128, 64, 32 and 16 elements respectively.
We noted that the impact of memory latency is very significant. For our unmodified model (REF128) we observed that execution time is degraded by factors of 1.2-1.4 in most programs when we vary the latency from 1 to 100 cycles.
We observed that, when the vector register length is reduced, the performance degradation is very high. Our conclusion was that reducing the vector register length in a traditional vector machine results in a remarkable loss of performance. The cost savings are clearly outweighed by the execution time degradation. Unless some latency tolerance technique is added to a traditional vector machine, vector register length should be kept as long as possible. In the next section we will see how decoupling can compensate for this performance loss.
5 Combining short vectors and decoupling
In this section we will study how a latency tolerance technique such as decoupling can be combined with a vector architecture having short registers to overcome the performance degradation seen in the previous section. As we will see, decoupling with short registers can even provide speedups with respect to a traditional in-order machine.
Decoupled Vector Architecture
For our simulations we used the decoupled vector architecture introduced in [1]. The main idea in this architecture is to use a fetch processor to split the incoming, non-decoupled, instruction stream into three different, decoupled streams. The translation is such that each processor can proceed independently and yet synchronizes through the communication queues when needed. Each of these three streams goes to a different processor: the address processor (AP), which performs all memory accesses on behalf of the other two processors; the scalar processor (SP), which performs all scalar computations; and the vector processor (VP), which performs all vector computations. The three processors communicate through a set of implementational queues and proceed independently. This set of queues is akin to the implementational queues that can be found in the floating point part of the R8000 microprocessor [7]. The main difference between this decoupled architecture and previous scalar decoupled architectures such as the ZS-1 [8] or the MAP-200 [9] is that it has two computational processors instead of just one. These two computation processors, the SP and the VP, have been split due to the very different nature of the operands on which they work (scalars and vectors, respectively).
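The role of the fetch processor can be pictured with a small Python sketch. Everything here is an assumption made for illustration: the instruction kinds, the `fetch_split` function and the sample instruction strings are hypothetical, and the sketch leaves out the synchronization through the communication queues that the real translation described in [1] provides.

```python
from collections import deque

# Hypothetical instruction kinds used only in this sketch.
MEMORY, SCALAR, VECTOR = "mem", "scalar", "vector"


def fetch_split(stream):
    """Route each (kind, text) instruction to one of three decoupled queues.

    Memory accesses go to the address processor (AP), scalar computation
    to the scalar processor (SP), vector computation to the vector
    processor (VP). Program order is preserved inside each queue.
    """
    queues = {"AP": deque(), "SP": deque(), "VP": deque()}
    target = {MEMORY: "AP", SCALAR: "SP", VECTOR: "VP"}
    for kind, text in stream:
        queues[target[kind]].append(text)
    return queues


program = [
    (MEMORY, "vload v0"),
    (SCALAR, "add s1"),
    (VECTOR, "vadd v2"),
    (MEMORY, "vstore v2"),
]
queues = fetch_split(program)
```

Because each queue is drained by its own processor, the AP in this picture can run ahead and issue `vload v0` and later loads while the VP is still busy with earlier vector work, which is the source of the latency tolerance.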
The main parameters of this architecture are the lengths of its queues: the three instruction queues, the inter-processor queues, the scalar queues and the load/store address queues were set at 16 elements. For the vector queues (numbers 1 and 2), each slot is a full vector register and, therefore, their size has to be carefully considered. We start with 4 slots in each of them, as suggested in [1]. Reducing the vector register length benefits a decoupled implementation since each slot in the extra queues required to decouple the machine can be smaller than in the original machine.
The key point in this architecture will be to achieve good performance with relatively few slots in these two queues. This is another point where reducing the vector register length can be very helpful.
Performance of the DVA
What is the performance of the decoupled machine using different vector register lengths (see Figure 2)? We will start by comparing the performance of the decoupled and non-decoupled machines with the maximum vector register length (128). As already presented in [1], the performance improvements due to decoupling are quite substantial. Even with a perfect memory system with latency 1, speedups are in the range 1.10-1.25. When memory latency is increased up to 100 cycles, the DVA experiences some slowdowns, but much smaller than the reference machine.
Comparing both machines at a latency of 100, the DVA yields speedups in the 1.22-1.52 range.
When the register length is reduced we still obtain very good results. Halving the register length (64 elements) yields a machine that performs only [...] worse [...] that the decoupled machine performs better (by factors in the range 1.01-1.32). These results suggest that, even halving the register length, a machine with a slower memory system (thus, a much cheaper memory system) would perform better than a traditional machine.
Reducing the register length to 1/4 of the original length (32 elements), we still see that the performance of the DVA32 is better than the reference machine. Except for programs hydro2d, nasa7 and su2cor, the DVA32 achieves speedups over the REF machine in the range 1.01-1.25 and goes up to 1.42 for dyfesm (at latency 50).
Only when the register length is reduced to 16 elements (1/8 of the original) does performance start to degrade noticeably. Seven out of ten programs perform worse with the DVA16 than with the REF machine, and only dyfesm and tomcatv maintain a good performance. This sudden jump in execution time is due to the combination of several effects: the number of scatter/gather operations, the number of outstanding branches and dependencies in scalar code introduce many stall cycles in a program run. These three types of hazards stall the vector processor very frequently, thereby exposing the full memory latency at each memory load being executed. This explains the steep slopes of each of the DVA16 curves.
Increasing Queue Length
The load and store queue length is a key parameter in a decoupled architecture. It determines the amount of data that can be prefetched ahead of time and, therefore, the queue length puts an upper limit on the maximum memory latency that can be tolerated. For example, a system having 8 slots in the load queue, each corresponding to a 32 element vector, can request up to 8 x 32 = 256 data items from the memory system before blocking. If main memory latency is shorter than 256 cycles, then this decoupled system can establish a continuous flow of data from main memory into the processor without stalls (provided there are enough load instructions to keep the pipeline fed, of course). On the other hand, if memory latency (L) is larger than 256 cycles, no matter how fast we can feed the address processor, the flow of requests to the memory system will be interrupted and a fraction of all memory latency (L - 256) will be exposed to the computation processor.
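The arithmetic in this example can be written down as a short Python sketch. The function names are hypothetical, and the model assumes, as the example in the text does, that one data item is requested per cycle so that the number of items in flight can be compared directly with the latency in cycles.

```python
def max_tolerated_latency(slots, vector_length):
    """Data items that can be requested before the load queue blocks."""
    return slots * vector_length


def exposed_latency(memory_latency, slots, vector_length):
    """Cycles of memory latency left exposed to the computation processor.

    Returns 0 when the queue is deep enough to cover the whole latency,
    and L - slots * vector_length otherwise.
    """
    covered = max_tolerated_latency(slots, vector_length)
    return max(0, memory_latency - covered)
```

With 8 slots of 32 elements, `max_tolerated_latency(8, 32)` gives 256, so a latency of 100 cycles is fully covered while a latency of 300 cycles leaves 44 cycles exposed. The same 8 slots with 16-element registers cover only 128 cycles, which illustrates why queue depth and register length have to be considered together.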
In this section we will look at the performance improvement due to enlarging both the load and store queue lengths. We expected that the longer the queue, the better memory latency would be tolerated. As we will see, this intuition is wrong and there is a limit after which increasing queue length does not yield any significant performance advantage. The overall conclusion is that increasing queue size does not compensate for the reduction in vector register length. This will be further analyzed in the following section.
Limits on performance
In what ways does reducing the vector register length limit performance? As we have seen in the previous section, vector length reduction cannot be compensated by increasing the depth of the queues or trying to augment the ILP inside the computation processor. This section will analyze the causes of this behavior.
Latency Masking
The most important effect of reducing the vector length is that many latencies that were previously hidden underneath the execution of vector code are now exposed in the critical path of the program. This effect is represented in figure 3. On the left, we present a fraction of code from the most important loop of program su2cor. This loop presents a true dependency from instruction 3 into instruction 4. Let's assume that the latency for performing an addition on our architecture [...] of wasted cycles will be 3/(64 x 3 + 2 + 3) = 1.5%.
Finally, in the extreme case, corresponding to a scalar machine (vector length = 1) shown in (c), we would pay 3 stall cycles every 8-cycle iteration, yielding a waste of 37.5%. Another way of looking at figure 4 is to consider that the 3 cycles of stall involved in the dependency will have to be paid for each data item processed in loop (c), for every 64 data items processed in loop (b) or for every 128 items processed in loop (a). Thus, in order to execute a given amount of work, architecture (a) will take less time than architecture (c).
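The two percentages quoted above follow the same pattern, which can be checked with a few lines of Python. The reading of the denominator is an assumption, since the passage defining its terms did not survive: the sketch treats it as 3 cycles per element times the vector length, plus 2 further cycles, plus the 3 stall cycles. The function and parameter names are hypothetical.

```python
def wasted_fraction(vector_length, cycles_per_element=3, overhead=2, stall=3):
    """Fraction of cycles lost to the dependency stall.

    Generalizes 3/(64 x 3 + 2 + 3) to any vector length; the meaning
    given to each term is an assumption of this sketch.
    """
    busy = vector_length * cycles_per_element + overhead
    return stall / (busy + stall)


for vl in (128, 64, 1):
    print(vl, f"{100 * wasted_fraction(vl):.1f}%")
```

Under this reading the formula reproduces both figures in the text: about 1.5% for 64 elements and 37.5% for the scalar case, where the denominator collapses to the 8-cycle iteration. For 128 elements the stall falls below 1% of all cycles.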
The overall lesson is that the more the vector register length is reduced, the more these small latencies are exposed in total execution time. In a vector architecture having 128 elements per register, a 3 cycle latency is almost hidden, whereas on a scalar machine this latency is exposed on every single iteration.
Note that we are not claiming here that the scalar execution model is necessarily worse than the vector model. There are many techniques (loop unrolling, software pipelining, etc.) that could help improve the performance of the loop as executed on (c). We are simply pointing out that, given the binaries as they are, a decrease in vector length will expose more latencies (both from main memory and from functional units) and will increase a program's critical path. The increase in total execution time is proportional to the decrease in vector register length.
Gather-Scatter instructions
Another very important limitation to performance in a decoupled vector architecture is the number of gather/scatter instructions in the code. A gather instruction cannot be characterized with a memory range, and thus imposes a sequential bottleneck in the otherwise out-of-order execution of load/store instructions. Moreover, a gather instruction requires a vector from the VP before being able to proceed to the memory system. Thus, each time a gather has to be executed, a loss of decoupling appears: the VP and the AP have to synchronize to launch the gather instruction. No matter how far ahead the AP was from the VP, it will have to wait until the VP provides the vector register with the required addresses. Figure 5 shows the latency exposure introduced by the gather instruction. A full memory latency plus the latency of an add and a mul operation must elapse before the gather instruction can proceed. This full memory latency cannot be used to dispatch other loads because the decoupled architecture executes loads in-order. It cannot be used to dispatch younger stores precisely because a gather instruction cannot be characterized with a memory range and thus the hardware must conservatively assume a dependency between a gather and all following store instructions.
As already mentioned in the previous case, the number of times that this full memory latency is exposed is inversely proportional to the length of the vector registers. In the DVA16 model, this memory latency will be exposed 8 times more than in the DVA128 model, thus partially contributing to the slowdowns of the DVA16 machine. The longer the memory latency, the worse this effect is in the DVA16 case.
Branch Penalties
Another effect of reducing the vector register length is the increase of mispredicted branches. Table 1 presents the total number of mispredicted branches for each program, for the DVA128 and the DVA16. Note that the table presents the absolute number of mispredictions rather than the misprediction rate because the number of branches in each architecture varies (in fact the misprediction rate is higher for the DVA128 machine because it executes far fewer branches overall).
As can be seen from table 1, the number of mispredictions can greatly increase. For most programs, this effect is due to the following. The number of branch instructions in the program under either model is essentially the same. The effect of reducing vector register length is that each branch is visited more times. Those branches that were difficult to predict or that had conflicts with other branches and resulted in misses in the BTB in the DVA128 model are executed many more times under the DVA16 model. Thus, the total number of mispredictions increases.
This explanation is clearly not enough in the case of trfd and hydro2d, which have an increase of mispredictions of 21.0 and 16.8 respectively. We investigated the two programs and found that the increase was due to a combination of our strip-mining and short vector trips. The real vector length register in the C3 machine works in such a way that, if a value larger than 128 is written to it, it is automatically chopped down to 128. The compiler relies on this hardware behavior to save one test in its strip-mined code. By contrast, our manual strip-mining, although it achieved the desired effect of emulating a machine with a smaller hardware vector length, cannot rely on this effect and requires an extra comparison and jump to implement a MIN(16, J) operation. In trfd, the variable J takes values from 10 to 40 in steps of 5, causing the two-bit saturating counter to mispredict the jump most of the time.
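A toy Python model shows why such a sequence of trip counts is hard on a two-bit saturating counter. The model rests on several assumptions that are not taken from the paper: the extra jump is modeled as a "more than vl iterations remain" test executed once per strip, the counter starts in the strongly-not-taken state, and no other branch shares its entry. The function names are hypothetical.

```python
def strip_branch_outcomes(trip_counts, vl):
    """Outcome of the assumed 'remaining > vl' test for every strip.

    True means the jump is taken (a full strip of vl elements is
    executed and more iterations follow).
    """
    outcomes = []
    for j in trip_counts:
        remaining = j
        while remaining > 0:
            outcomes.append(remaining > vl)
            remaining -= min(vl, remaining)
    return outcomes


def count_mispredictions(outcomes, state=0):
    """Two-bit saturating counter: states 0-1 predict not taken, 2-3 taken."""
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses


trips = range(10, 45, 5)  # J = 10, 15, ..., 40
short = strip_branch_outcomes(trips, 16)
long_ = strip_branch_outcomes(trips, 128)
```

In this toy model the test alternates irregularly between taken and not taken when `vl` is 16, and the counter mispredicts more than half of the 14 tests. With `vl` set to 128 the test is never taken and the counter never mispredicts.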
Summary
This paper has presented data on the tradeoffs involved in choosing an adequate vector register size for vector ISAs. Traditionally, very large vector registers have been chosen to maximize the amount of latency amortized per vector instruction. Nonetheless, this selection was made in an environment where almost all vector architectures executed instructions in strict program order (with some minor overlapping between vector and scalar instructions). Despite the need for very long registers, many highly vectorizable programs cannot make full use of every single element in a register. Our measurements show that in many programs, fewer than 30% of all registers used are completely filled with 128 elements of data. Unfortunately, our simulations confirm that it is not possible to reduce the vector register length in a traditional vector architecture without severely affecting performance: halving the register length, for example, yields slowdowns in the range 1.05-1.8. This paper has shown that when ILP is exploited using decoupling, the negative impact of reducing the register length is substantially reduced. The reduction in vector register length can be used in two different ways: either to decrease processor cost by reducing the total amount of storage devoted to register values or to improve performance by more effectively using the available storage by adding vector queues in a decoupled environment. The overall effect is that very large registers in the decoupled context are no longer needed.
Simulations show that, by combining decoupling and short registers, it is possible to reduce the size of each vector register to 1/2 with a good performance improvement (speedups of 1.05-1.49) and down to 1/4 at a similar level of performance (speedups of 1.01-1.25), although some programs might experience small slowdowns (less than 5%). The overall register space requirements for the DVA32 machine are half those of the original non-decoupled reference machine.
We have seen that there is a limit to the maximum possible reduction of the vector register length. Due to the increase of mispredicted branches and the scheduling limitation imposed by gather/scatter operations, if the register length is reduced down to 16 elements, many stall cycles appear in the critical path of a program. Moreover, our simulations have shown that it is not possible to overcome these effects by enlarging the vector queues. Nonetheless, we are currently working on using the dynamic load/store elimination techniques described in [1] in our decoupled machine with short registers. The results show that in many cases, if bypassing is allowed between the store and the load queue, the performance of the DVA16 machine can be greatly improved.
We believe that the results presented in this paper are not only relevant to the vector processor community but could also be of use in the near term for designers of multimedia instruction sets [10] [11].
