Effective usage of vector registers in decoupled vector architectures by Villa, Luis et al.
Effective Usage of Vector Registers in Decoupled Vector Architectures
Luis Villa *  Roger Espasa †  Mateo Valero
Departament d'Arquitectura de Computadors,
Universitat Politècnica de Catalunya-Barcelona
{luisv,roger,mateo}@ac.upc.es
http://www.ac.upc.es/hpc
Abstract 
This paper presents a study of the impact of reducing the vector
register size in a decoupled vector architecture. In traditional
in-order vector architectures, long vector registers have typically
been the norm. We start by presenting data that shows that, even for
highly vectorizable codes, only a small fraction of all elements of a
long vector register are actually used. We also show that reducing the
register size in a traditional vector architecture in an attempt to
reduce hardware cost and maximize register utilization results in a
severe performance degradation. However, we combine the decoupling
technique with the vector register reduction and show that the
resulting architecture tolerates the register size cuts very well. We
simulate a selection of Perfect Club and Specfp92 programs using a
trace driven approach, and compare the execution time in a
conventional vector architecture with a decoupled vector architecture
using different register sizes. Halving the register size and using
decoupling provides speedups between 1.04-1.49 over a traditional
in-order vector machine. Even reducing the register length to 1/4 the
original size (and, in some cases, to 1/8) the performance of the
decoupled machine is better than a conventional vector model.
Moreover, we observe that the resulting decoupled machine with short
registers tolerates long memory latencies very well.
1 Introduction
Vector architectures have been used for many years for high
performance numerical applications, an area where they still excel.
The traditional approach to vector processor design has been to use
an in-order execution engine and achieve high performance exploiting
the natural data-level parallelism embedded in each vector instruction.
* On leave from the Centro de Investigación en Cómputo, Instituto
Politécnico Nacional, México D.F. This work was supported by the
Instituto de Cooperación Iberoamericana (ICI), Consejo Nacional de
Ciencia y Tecnología (CONACYT).
† This work was supported by the Ministry of Education of Spain under
contract 0429/95, and by the CEPBA.
Typically, traditional vector architectures have used 
very limited forms of ILP techniques, only allowing 
some overlapping of vector and scalar instructions 
but keeping the scalar and vector instruction streams 
strictly ordered. To achieve good performance and to 
be able to tolerate the large latencies associated with 
supercomputer main memory systems, vector design- 
ers have exploited the large number of independent 
operations present in each vector instruction. When 
a vector instruction is started, it pays for some initial
(potentially long) latency, but then it works on a
long stream of elements and effectively amortizes this 
latency across all elements. A few of these vector in- 
structions running concurrently can yield a very good 
usage of the available hardware resources. 
In this context, it is natural that vector processor 
designers have striven to implement vector registers 
as large as budget and technology constraints would 
allow. Nonetheless, in today's environment where ILP 
techniques such as out-of-order execution, decoupling, 
multithreading, branch prediction, speculation, etc, 
have proved their value as latency tolerance mechanisms,
it is less clear that the best way to invest the
available register space consists in having only a few
very large registers.
Large registers have several drawbacks. First, if an application
cannot make full use of each register, then a precious hardware
resource is being wasted. Second, given a certain budget in terms of
transistors, large registers imply that only a few of them can be
implemented. A small number of logical registers has a direct impact
on the amount of spill code that the compiler and/or programmer must
introduce to fit all live variables in the limited register file.
Third, introducing ILP techniques in a processor having a few very
large logical registers is difficult. For example, out-of-order
execution without renaming with only 8 logical vector registers
provides little benefit. On the other hand, introducing register
renaming can be very costly since many copies of registers that are
very large have to be provided.
Reducing the vector register length is certainly a solution to the
problems just outlined. If most applications cannot fully use all
elements present in each vector register, then reducing the vector
register length will reduce cost and increase the fraction of usage of
registers. The drawback of register length reduction is the associated
performance penalty. Each time a vector instruction is executed, its
associated latencies are amortized over a smaller number of elements.
This can have a significant impact on performance, especially for
memory accesses. Moreover, more instructions have to be executed, each
with a shorter effective length, and, therefore, the number of times
that latencies must be paid is larger.
0-8186-8332-5/98 $10.00 © 1998 IEEE
Unless some extra latency tolerance mechanism is introduced in a
vector architecture, vector length cannot be reduced without a severe
performance penalty. While many techniques have been developed to
tolerate memory latency in superscalar processors, only a few studies
have considered the same problem in the context of vector
architectures [1, 2, 3].
This paper will present data confirming the fact that traditional
vector architectures cannot reduce their vector register length
without suffering a severe performance penalty. However, we will show
that by combining the vector register length reduction with an ILP
technique, decoupling, the performance penalty can be made very small.
We will show that the resulting architecture tolerates long memory
latencies very well and also makes better use of the available storage
space in each vector register. Not only is the performance impact of
reducing the vector length small, but when our architecture with short
vector registers is compared against a traditional vector machine with
large vector registers, performance is in most cases far better across
a large memory latency range.
2 Vector Length usage 
The usage of the vector register file elements is determined by both
the degree of vectorization of a program and the natural vector
lengths associated with the data structures of an application. Many
applications have small data sets or iterate over a particular
dimension of an iteration space which is smaller than the vector
register length. In [3] we evaluated a set of highly vectorizable
applications in order to know which vector lengths these programs
actually use. The first thing to note is that, even though this set of
programs is highly vectorizable, their average vector lengths are not
very high. Investigation of the programs reveals that this is often
due to the natural shape of the application data space. In other
cases, it is due to the nature of the algorithm; for example, a
triangular matrix operation tends to have many small vector lengths.
What will the effective register length usage be if we vary the
vector register size? In [4] we answered this question by showing how
the application vector length and the hardware vector register length
are related. We noted that in order to augment the percentage of full
stripes we would have to choose a relatively small vector register
size. The next sections will look into the performance implications of
choosing a small vector register size.
Figure 1: (a) Flo52 loop without strip-mining, (b) adding
strip-mining.
3 Compiling for smaller vector lengths
In order to investigate the effects of reducing the hardware vector
register length we need a set of benchmarks compiled assuming
different vector lengths. Unfortunately, no public domain vectorizing
compiler is available and, therefore, we are forced to artificially
fool the Convex compiler [5] into generating code "as if" the vector
length was 16, 32 or 64 (instead of the real 128). To obtain the
desired binaries we modified the source benchmarks as follows. Using
the vectorization information produced by the Convex compiler, we
located in the source code each vectorized loop. For each loop nest,
and taking into account loop transformations such as peeling,
interchange and skewing, we manually strip-mined the loop being
vectorized. This manual strip-mining consisted in adding a strip-mine
loop performing steps of length VLZ and modifying the original
vectorized loop to do at most VLZ iterations (see figure 1). To
prevent the compiler from generating a doubly strip-mined loop (our
strip-mining plus the natural strip-mining introduced by the compiler)
we used the MAXTRIPS directive [5]. This directive informed the
compiler that the inner loop was performing less than 128 trips and
thus no extra strip-mining was generated.
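The manual strip-mining described above can be illustrated with a small Python sketch. It is not the Fortran code used in the paper: the function name `strip_mined_axpy`, the loop body (a scaled vector add) and the parameter name `vlz` are assumptions chosen only to show the shape of the transformation, with the inner loop standing in for one vector instruction of at most `vlz` elements.

```python
def strip_mined_axpy(a, x, y, vlz):
    """Compute a*x + y in strips of at most vlz elements (illustrative only)."""
    n = len(x)
    out = list(y)
    strips = 0
    # Added strip-mine loop: advances in steps of vlz.
    for start in range(0, n, vlz):
        # The original vectorized loop now performs at most vlz iterations.
        length = min(vlz, n - start)
        for i in range(start, start + length):
            out[i] = a * x[i] + y[i]
        strips += 1
    return out, strips


result, strips = strip_mined_axpy(2, list(range(100)), [1] * 100, 16)
print(strips)
```

With 100 elements and `vlz = 16`, the outer loop runs six full strips and one partial strip of four elements, which is the kind of short final strip that lowers the effective register usage.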
Using such a procedure we strip-mined most (but not all) vectorized
loops present in our ten benchmarks. Loops that escaped from this
strip-mining were vector loops that are in libraries and loops where
introducing one extra level of strip-mining stopped vectorization.
Moreover, due to the large number of loops to strip-mine, we first
selected those that accumulate 95% of all execution time. The
remaining loops that form the other 5% of execution time were not
instrumented. For each program, we generated four different binaries,
assuming that the maximum hardware vector length was 16, 32, 64 and
128. For each register length, the percentage of operations that
escaped our strip-mining procedure varied, but was below 4% for all
programs except arc2d and flo52 where it was close to 10%.
4 Short Vectors Performance
We start by analyzing the performance of a traditional in-order
vector machine when the hardware vector length is varied. We are
interested in the effect that different memory latencies have on
performance and how it interacts with vector register length.
4.1 Performance on the Reference Architecture
Our reference machine is loosely based on a Convex C3400. The
essential characteristics of the reference architecture are a single
memory port, two functional units and 8 vector registers. In [6] we
give a detailed explanation of this reference architecture. In [4] we
studied four different variants of this reference machine. The four
models under study were referred to as the REF128, REF64, REF32 and
REF16 architectures, with a vector length of 128, 64, 32 and 16
elements respectively.
We noted that the impact of memory latency is very significant. For
our unmodified model (REF128) we observed that execution time is
degraded by factors of 1.2-1.4 in most programs when we vary the
latency from 1 to 100 cycles.
We also observed that when the vector register length is reduced,
the performance degradation is very high. Our conclusion was that
reducing the vector register length in a traditional vector machine
results in a remarkable loss of performance. The cost savings are
clearly outweighed by the execution time degradation. Unless some
latency tolerance technique is added to a traditional vector machine,
vector register length should be kept as long as possible. In the next
section we will see how decoupling can compensate for this performance
loss.
5 Combining short vectors and decoupling
In this section we will study how a latency tolerance technique such
as decoupling can be combined with a vector architecture having short
registers to overcome the performance degradation seen in the previous
section. As we will see, decoupling with short registers can even
provide speedups with respect to a traditional in-order machine.
5.1 Decoupled Vector Architecture 
For our simulations we used the decoupled vector architecture
introduced in [1]. The main idea in this architecture is to use a
fetch processor to split the incoming, non-decoupled, instruction
stream into three different, decoupled streams. The translation is
such that each processor can proceed independently and yet
synchronizes through the communication queues when needed. Each of
these three streams goes to a different processor: the address
processor (AP), which performs all memory accesses on behalf of the
other two processors, the scalar processor (SP), which performs all
scalar computations, and the vector processor (VP), which performs all
vector computations. The three processors communicate through a set of
implementational queues and proceed independently. This set of queues
is akin to the implementational queues that can be found in the
floating point part of the R8000 microprocessor [7]. The main
difference between this decoupled architecture and previous scalar
decoupled architectures such as the ZS-1 [8] or the MAP-200 [9] is
that it has two computational processors instead of just one. These
two computation processors, the SP and the VP, have been split due to
the very different nature of the operands on which they work (scalars
and vectors, respectively).
The main parameters of this architecture are the lengths of its
queues: the three instruction queues, the inter-processor queues, the
scalar queues and the load store address queues were set at 16
elements. For the vector queues (numbers 1 and 2), each slot is a full
vector register and, therefore, their size has to be carefully
considered. We start with 4 slots in each of them, as suggested in
[1]. Reducing the vector register length benefits a decoupled
implementation since each slot in the extra queues required to
decouple the machine can be smaller than in the original machine.
The key point in this architecture will be to achieve good
performance with relatively few slots in these two queues. This is
another point where reducing the vector register length can be very
helpful.
5.2 Performance of the DVA 
What is the performance of the decoupled machine using different
vector register lengths? Figure 2 plots the simulated performance for
the decoupled and non-decoupled machines for several memory latencies.
For each program, we plot the baseline performance of the
non-decoupled machine with a register length of 128 and the
performance of the decoupled versions using register lengths of 16,
32, 64 and 128. Note that the Y-axis plots the performance of each
configuration relative to the non-decoupled machine with length 128
and memory latency of 1 cycle. Thus, in figure 2 numbers above 1.0
indicate a slowdown and numbers below 1.0 indicate speedups.
We will start by comparing the performance of the decoupled and
non-decoupled machines with the maximum vector register length (128).
As already presented in [1], the performance improvements due to
decoupling are quite substantial. Even with a perfect memory system
with latency 1, speedups are in the range 1.10-1.25. When memory
latency is increased up to 100 cycles, the DVA experiences some
slowdowns, but much smaller than the reference machine. Comparing both
machines at a latency of 100, the DVA yields speedups in the 1.22-1.52
range.
When the register length is reduced we still obtain very good
results. Halving the register length (64 elements) yields a machine
that performs worse than the DVA128 only by factors of 1.01-1.10 but
that, in all cases, performs much better than the reference machine.
Comparing performance at 100 cycles memory latency, we see speedups of
the DVA64 over the REF machine in the 1.05-1.49 range. Note that, in
three cases, the performance of the DVA64 at 100 cycles latency is
better than the REF machine performance at 1 cycle memory latency. In
all programs but trfd and su2cor, if we compare the DVA64 at 100
cycles and the reference machine at 50 cycles we see that the
decoupled machine performs better (by factors in the range 1.01-1.32).
These results suggest that even halving the register length, a machine
with a slower memory system (thus, a much cheaper memory system) would
perform better than a traditional machine.
Figure 2: Effects of memory latency and vector register length on
performance when using decoupling. [Surviving panel titles: tomcatv,
trfd, dyfesm; latency axis ticks: 1, 50, 100.]
Reducing the register length to 1/4 of the original length (32
elements), we still see that the performance of the DVA32 is better
than the reference machine. Except for programs hydro2d, nasa7 and
su2cor, the DVA32 achieves speedups over the REF machine in the range
1.01-1.25 and goes up to 1.42 for dyfesm (at latency 50).
Only when the register length is reduced to 16 elements (1/8 of the
original) does performance start to degrade noticeably. Seven out of
ten programs perform worse with the DVA16 than with the REF machine,
and only dyfesm and tomcatv maintain good performance. This sudden
jump in execution time is due to the combination of several effects:
the number of scatter/gather operations, the number of outstanding
branches and dependencies in scalar code introduce many stall cycles
in a program run. These three types of hazards stall the vector
processor very frequently, thereby exposing the full memory latency at
each memory load being executed. This explains the steep slopes of
each of the DVA16 curves.
6 Increasing Queue Length 
The load and store queue length is a key parameter in a decoupled
architecture. It determines the amount of data that can be prefetched
ahead of time and, therefore, the queue length puts an upper limit on
the maximum memory latency that can be tolerated. For example, a
system having 8 slots in the load queue, each corresponding to a 32
element vector, can request up to 8 x 32 = 256 data items from the
memory system before blocking. If main memory latency is shorter than
256 cycles, then this decoupled system can establish a continuous flow
of data from main memory into the processor without stalls (provided
there are enough load instructions to keep the pipeline fed, of
course). On the other hand, if memory latency (L) is larger than 256
cycles, no matter how fast we can feed the address processor, the flow
of requests to the memory system will be interrupted and a fraction of
all memory latency (L - 256) will be exposed to the computation
processor.
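The queue-capacity argument above can be written down as a small Python sketch. The function names are hypothetical, and the model rests on the same simplification the text uses: the number of data items that can be requested before blocking (slots times vector length) is compared directly with the memory latency in cycles.

```python
def prefetch_window(slots, vector_length):
    """Data items that can be requested before the load queue blocks."""
    return slots * vector_length


def exposed_latency(memory_latency, slots, vector_length):
    """Cycles of memory latency left uncovered under the simplified model."""
    return max(0, memory_latency - prefetch_window(slots, vector_length))


# The example from the text: 8 slots of 32-element vectors.
print(prefetch_window(8, 32))        # 256 items in flight
print(exposed_latency(100, 8, 32))   # fully hidden
print(exposed_latency(300, 8, 32))   # L - 256 cycles exposed
```

This captures only the upper bound stated in the text; the simulation results in the rest of the section show that real programs do not reach it, because other hazards interrupt the flow of loads.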
In this section we will look at the performance improvement due to
enlarging both the load and store queue lengths. We expected that the
longer the queue, the better memory latency would be tolerated. As we
will see, this intuition is wrong and there is a limit after which
increasing queue length does not yield any significant performance
advantage.
Figure 3 presents, for our ten benchmark programs, the improvements
due to increasing the queue size. Due to lack of space we present only
the data for the DVA16 model, where each vector register is supposed
to hold 16 elements. For each program we plot 4 different bars,
labeled "Q=4" through "Q=32", that indicate the number of slots in the
vector load queue and the vector store queue. For these 4 bars, memory
latency was assumed to be 50 cycles. Moreover, at the top of each bar
there is a white bar representing the execution time for the same
queue size but assuming a memory latency of 100 cycles.
As can be seen from this figure, increasing from 4 slots to 8 slots
does provide some performance improvement, especially at 100 cycles
memory latency, but further increasing the queue to 16 or 32 slots
does not provide any additional benefits. The result is striking if we
compare the total area requirements of the DVA128 and DVA16
architectures. For example, the DVA128 architecture holds a total of
128 x 4 items in each of its load and store queues. Similarly, the
DVA16 architecture with 32 slots in the queues can potentially hold
exactly the same amount of data (16 x 32), and yet it achieves a much
worse performance. Although not presented here, a similar effect
happens with the DVA32 and DVA64 architectures.
The overall conclusion is that increasing queue size does not
compensate for the reduction in vector register length. This will be
further analyzed in the following section.
7 Limits on performance 
In what ways does reducing the vector register length limit
performance? As we have seen in the previous section, vector length
reduction cannot be compensated by increasing the depth of the queues
or trying to augment the ILP inside the computation processor. This
section will analyze the causes of this behavior.
Latency Masking 
The most important effect of reducing the vector length is that many
latencies that were previously hidden underneath the execution of
vector code are now exposed in the critical path of the program. This
effect is represented in figure 4. On the left, we present a fraction
of code from the most important loop of program su2cor. This loop
presents a true dependency from instruction 3 into instruction 4.
Let's assume that the latency for performing an addition on our
architecture is 3 cycles. The schematic code sequence shown in (a)
displays the behavior of this loop under the DVA128 model. Each vector
instruction performs 128 operations and there is a 3 cycle stall
between the start of instruction 3 and instruction 4. We note that
this 3 cycle stall is in the critical path of the loop. Under this
model, each iteration of the loop would take 128 x 3 + 2 + 3 = 389
cycles, assuming that the two scalar operations are executed in a
single cycle each. Therefore, the percentage of wasted cycles (3/389)
is very small (0.77%). Executing the same loop under the DVA64 model
(example shown in (b)), we will still pay the same 3 cycles of stall,
but they will be amortized over fewer elements. The percentage of
wasted cycles will be 3/(64 x 3 + 2 + 3) = 1.5%. Finally, in the
extreme case, corresponding to a scalar machine (vector length = 1)
shown in (c), we would pay 3 stall cycles every 8-cycle iteration,
yielding a waste of 37.5%.
Figure 4: Effects of dependences for different vector register
lengths.
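The arithmetic in the preceding paragraph can be reproduced with a short Python sketch. The function and parameter names are illustrative assumptions; in particular, reading the factor 3 in "128 x 3" as three vector instructions per iteration is an interpretation of the text, not something the paper states explicitly.

```python
def iteration_cycles(vector_length, vector_instructions=3,
                     scalar_cycles=2, stall_cycles=3):
    """Cycles per loop iteration under the simple model used in the text."""
    return vector_length * vector_instructions + scalar_cycles + stall_cycles


def wasted_fraction(vector_length, stall_cycles=3):
    """Fraction of each iteration spent in the dependency stall."""
    return stall_cycles / iteration_cycles(vector_length,
                                           stall_cycles=stall_cycles)


for vl in (128, 64, 1):
    cycles = iteration_cycles(vl)
    print(vl, cycles, f"{100 * wasted_fraction(vl):.2f}%")
```

The loop prints the iteration length and wasted percentage for the three cases (a), (b) and (c), matching the 389-cycle, 197-cycle and 8-cycle iterations discussed above.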
Another way of looking at figure 4 is to consider that the 3 cycles
of stall involved in the dependency will have to be paid for each data
item processed in loop (c), for every 64 data items processed in loop
(b) or for every 128 items processed in loop (a). Thus, in order to
execute a given amount of work architecture (a) will take less time
than architecture (c).
The overall lesson is that the more the vector register length is
reduced, the more these small latencies are exposed in total execution
time. In a vector architecture having 128 elements per register, a 3
cycle latency is almost hidden, whereas on a scalar machine this
latency is exposed on every single iteration.
Note that we are not claiming here that the scalar execution model
is necessarily worse than the vector model. There are many techniques
(loop unrolling, software pipelining, etc.) that could help improve
the performance of the loop as executed on (c). We are simply pointing
out that, given the binaries as they are, a decrease in vector length
will expose more latencies (both from main memory and from functional
units) and will increase a program's critical path. The increase in
total execution time is proportional to the decrease in vector
register length.
Gather-Scatter instructions
Another very important limitation to performance in a decoupled
vector architecture is the amount of gather/scatter instructions in
the code. A gather instruction cannot be characterized with a memory
range, and thus imposes a sequential bottleneck in the otherwise
out-of-order execution of load/store instructions. Moreover, a gather
instruction requires a vector from the VP before being able to proceed
to the memory system. Thus, each time a gather has to be executed, a
loss of decoupling appears: the VP and the AP have to synchronize to
launch the gather instruction. No matter how far ahead the AP was from
the VP, it will have to wait until the VP provides the vector register
with the required addresses.
Figure 3: Performance of the DVA16 architecture for different queue
sizes (4, 8, 16, 32) and two memory latencies (50 and 100 cycles).
LOOP:
1 LOAD.w   ...,v1
2 MUL.w    ...
3 ADD.w    ...
4 GATHER   ...
  ...
9 ...      s2,s1,cc3
  BRCOND   cc3
Figure 5: Structure of gather-scatter code.
Figure 5 shows a typical example of gather/scatter code from program
su2cor. The gather instruction requires the computations carried out
by instructions 2 and 3 before being able to proceed. Unfortunately,
instruction 2 requires a register (v1) which must be loaded from
memory. The time diagram in the lower part of figure 5 shows the
latency exposure introduced by the gather instruction. A full memory
latency plus the latency of an add and a mul operation must elapse
before the gather instruction can proceed. This full memory latency
cannot be used to dispatch other loads because the decoupled
architecture executes loads in-order. It cannot be used to dispatch
younger stores precisely because a gather instruction cannot be
characterized with a memory range and thus the hardware must
conservatively assume a dependency between a gather and all following
store instructions.
As already mentioned in the previous case, the number of times that
this full memory latency is exposed is inversely proportional to the
length of the vector registers. In the DVA16 model, this memory
latency will be exposed 8 times more often than in the DVA128 model,
thus partially contributing to the slowdowns of the DVA16 machine. The
longer the memory latency, the worse this effect is in the DVA16 case.
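The factor of 8 quoted above follows from counting how many gather instructions are needed to cover the same amount of data. The Python sketch below is only an illustration under stated assumptions: the function names are hypothetical, each gather is assumed to expose one full memory latency plus the add and mul latencies, and the 3-cycle mul latency is a made-up value (the text only gives 3 cycles for an addition).

```python
def gather_count(n_elements, vector_length):
    """Number of gather instructions needed to cover n_elements."""
    return (n_elements + vector_length - 1) // vector_length


def gather_exposed_cycles(n_elements, vector_length, memory_latency,
                          add_latency=3, mul_latency=3):
    """Total exposed cycles if every gather waits for load, mul and add."""
    per_gather = memory_latency + add_latency + mul_latency
    return gather_count(n_elements, vector_length) * per_gather


n = 1024  # arbitrary problem size, a multiple of both register lengths
ratio = gather_count(n, 16) // gather_count(n, 128)
print(ratio)
print(gather_exposed_cycles(n, 16, 100), gather_exposed_cycles(n, 128, 100))
```

Because the per-gather cost grows with memory latency while the gather count grows as the register shrinks, the two effects multiply, which is consistent with the steep DVA16 curves reported earlier.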
Programs  ||  128  |  16   | Ratio
SWM256    ||  3.7  |  3.7  |  1.0
HYDRO2D   ||  0.7  | 11.8  | 16.8
ARC2D     ||  2.0  |  4.3  |  2.2
FLO52     ||  2.4  |  5.8  |  2.4
NASA7     ||  7.4  | 10.8  |  1.5
SU2COR    || 15.5  | 19.5  |  1.3
TOMCATV   ||  0.7  |  1.4  |  2.0
BDNA      ||  6.0  |  6.0  |  1.0
TRFD      ||  0.9  | 18.9  | 21.0
DYFESM    || 10.2  | 22.6  |  2.2
Table 1: Absolute number of mispredictions (in millions) for the
DVA128 and DVA16 architectures, and the ratio between them.
Branch Penalties 
Another effect of reducing the vector register length is the
increase of mispredicted branches. Table 1 presents the total number
of mispredicted branches for each program, for the DVA128 and the
DVA16. Note that the table presents the absolute number of
mispredictions rather than the misprediction rate because the number
of branches in each architecture varies (in fact the misprediction
rate is higher for the DVA128 machine because it executes many fewer
branches overall).
As can be seen from table 1, the number of mispredictions can
greatly increase. For most programs, this effect is due to the
following. The number of branch instructions in the program under
either model is essentially the same. The effect of reducing vector
register length is that each branch is visited more times. Those
branches that were difficult to predict or that had conflicts with
other branches and resulted in misses in the BTB in the DVA128 model
are executed many more times under the DVA16 model. Thus, the total
number of mispredictions increases.
This explanation is clearly not enough in the case of trfd and
hydro2d, whose mispredictions increase by factors of 21.0 and 16.8
respectively. We investigated the two programs and found that the
increase was due to a combination of our strip-mining and short vector
trips. The real vector length register in the C3 machine works in such
a way that, if a value larger than 128 is written to it, it is
automatically chopped down to 128. The compiler relies on this
hardware behavior to save one test in its strip-mined code. By
contrast, our manual strip-mining, although it achieved the desired
effect of emulating a machine with a smaller hardware vector length,
cannot rely on this effect and requires an extra comparison and jump
to implement a MIN(16, J) operation. In trfd, the variable J takes
values from 10 to 40 in steps of 5, causing the two-bit saturating
counter to mispredict the jump most of the time.
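To make the interaction between the MIN(16, J) test and a two-bit saturating counter more concrete, here is a hedged Python sketch. It is not the predictor or the code simulated in the paper: the way the branch outcomes are generated (one test of "remaining trips > 16" per strip), the initial counter state, and both function names are assumptions made purely for illustration.

```python
def min_branch_outcomes(trip_counts, vl=16):
    """Hypothetical outcome stream of the branch implementing MIN(vl, J):
    taken when more than vl trips remain, tested once per strip."""
    outcomes = []
    for j in trip_counts:
        remaining = j
        while remaining > 0:
            outcomes.append(remaining > vl)
            remaining -= vl
    return outcomes


def count_mispredictions(outcomes, state=0):
    """Two-bit saturating counter: states 0-1 predict not taken, 2-3 taken."""
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses


outcomes = min_branch_outcomes(range(10, 45, 5))  # J = 10, 15, ..., 40
print(count_mispredictions(outcomes), "of", len(outcomes))
```

Under these assumptions the outcome stream keeps flipping between taken and not taken, so the counter rarely settles in a strongly biased state, which is the qualitative behavior the text describes for trfd.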
8 Summary
This paper has presented data on the tradeoffs involved in choosing
an adequate vector register size for vector ISAs. Traditionally, very
large vector registers have been chosen to maximize the amount of
latency amortized per vector instruction. Nonetheless, this selection
was made in an environment where almost all vector architectures
executed instructions in strict program order (with some minor
overlapping between vector and scalar instructions). Despite the need
for very long registers, many highly vectorizable programs cannot make
full use of every single element in a register. Our measurements show
how in many programs, fewer than 30% of all registers being used are
completely filled with 128 elements of data. Unfortunately, our
simulations confirm that it is not possible to reduce the vector
register length in a traditional vector architecture without severely
affecting performance: halving the register length, for example,
yields slowdowns in the range 1.05-1.8.
This paper has shown that when ILP is exploited using decoupling the
negative impact of reducing the register length is substantially
reduced. The reduction in vector register length can be used in two
different ways: either to decrease processor cost by reducing the
total amount of storage devoted to register values or to improve
performance by more effectively using the available storage by adding
vector queues in a decoupled environment. The overall effect is that
very large registers in the decoupled context are no longer needed.
Simulations show that by combining decoupling and short registers it
is possible to reduce the size of each vector register to 1/2 with a
good performance improvement (speedups of 1.05-1.49) and down to 1/4
at a similar level of performance (speedups of 1.01-1.25), although
some programs might experience small slowdowns (less than 5%). The
overall register space requirement for the DVA32 machine is half that
of the original non-decoupled reference machine.
We have seen that there is a limit to the maximum possible reduction
of the vector register length. Due to the increase of mispredicted
branches and the scheduling limitation imposed by gather/scatter
operations, if the register length is reduced down to 16 elements,
many stall cycles appear in the critical path of a program. Moreover,
our simulations have shown that it is not possible to overcome these
effects by enlarging the vector queues. Nonetheless, we are currently
working on using the dynamic load/store elimination techniques
described in [1] in our decoupled machine with short registers. The
results show that in many cases, if bypassing is allowed between the
store and the load queue, the performance of the DVA16 machine can be
greatly improved.
We believe that the results presented in this paper are not only
relevant to the vector processor community but could also be of use in
the near term for designers of multimedia instruction sets [10] [11].
References 
[1] Roger Espasa and Mateo Valero. Decoupled vector architectures.
In Proceedings of the 2nd International Symposium on High Performance
Computer Architecture, pages 281-290. IEEE Computer Society Press,
Feb 1996.
[2] Roger Espasa and Mateo Valero. Multithreaded vector
architectures. In Proceedings of the 3rd International Symposium on
High Performance Computer Architecture, pages 237-249. IEEE Computer
Society Press, Feb 1997.
[3] Roger Espasa, Mateo Valero, and James E. Smith. Out-of-order
Vector Architectures. In MICRO-30. IEEE Press, 1997.
[4] Luis Villa, Roger Espasa, and Mateo Valero. Effective usage of
vector registers in advanced vector architectures. In International
Conference on Parallel Architectures and Compilation Techniques
(PACT97), San Francisco Cal., 1997.
[5] Convex Press, Richardson, Texas, U.S.A. CONVEX Architecture
Reference Manual (C Series), sixth edition, April 1992.
[6] R. Espasa, M. Valero, D. Padua, M. Jiménez, and E. Ayguadé.
Quantitative analysis of vector code. In Euromicro Workshop on
Parallel and Distributed Processing. IEEE Computer Society Press,
January 1995.
[7] P.Y.T. Hsu. Designing the TFP microprocessor. IEEE Micro,
14(2):23-33, April 1994.
[8] James E. Smith, G.E. Dermer, B.D. Vanderwarn, S.D. Klinger,
C. M. Rozewski, D. L. Fowler, K. R. Scidmore, and J. P. Laudon. The
ZS-1 Central Processor. In 2nd International Conference on
Architectural Support for Programming Languages and Operating Systems,
pages 199-204. CS press, 1987.
[9] E. U. Cohler and J. E. Storer. Functionally parallel
architectures for array processors. Computer, 14:28-36, September
1981.
[10] Alex Peleg and Uri Weiser. MMX Technology Extension to the
Intel Architecture. IEEE Micro, pages 42-50, August 1996.
[11] Krste Asanovic, James Beck, Bertrand Irissou, Brian Kingsbury,
Nelson Morgan, and John Wawrzynek. The T0 Vector Microprocessor. In
Hot Chips VII, pages 187-196, August 1995.
