Quantitative analysis of vector code by Espasa Sans, Roger et al.
Quantitative Analysis of Vector Code * 
Roger Espasa, Mateo Valero, David Padua†, Marta Jiménez, Eduard Ayguadé
Departament d'Arquitectura de Computadors,
Universitat Politècnica de Catalunya.
c/ Gran Capità, Mòdul D6, 08071 Barcelona, SPAIN
e-mail: roger@ac.upc.es
Abstract 
In this paper we present the results of a detailed simulation study of the execution of vector programs on a single processor of a Convex C3480 machine, using a subset of the Perfect Club benchmarks. We are interested in evaluating several cost/performance tradeoffs that the machine designers made, in order to assess which features of the architecture severely limit the attainable performance. We present the detailed usage of the vector functional units and a study of the kinds of resource conflicts that stall the machine. The results obtained show that the resources of the vector architecture are not efficiently used, mainly due to the single bus memory architecture. Other severe limitations of the machine turn out to be the lack of chaining between vector loads and vector computations, and the lack of a second general purpose functional unit. We also present some data about the port pressure on the vector register file and we see that stalls due to port conflicts are relatively high. We also consider the slowdown introduced by spill code and find that the limited number of vector registers also limits performance.
1 Introduction 
In order to design new architectures, one first has to properly understand the behavior of current architectures, so as to analyze their strengths and weaknesses and improve future designs. The analysis of the interaction between architectures, compiler technology and application programs is an active field of research where several studies have been carried out. These studies try to determine the maximum parallelism available in the programs [7, 8, 16, 12], the frequency of execution of instructions [10], bottlenecks and hazards in the architecture [2], etc.

*This work was supported by the Ministry of Education of Spain under contract TIC 880/92, by ESPRIT 6634 Basic Research Action (APPARC) and by the CEPBA (European Center for Parallelism of Barcelona).
†University of Illinois at Urbana-Champaign.
In this paper we are interested in the evaluation of a vector architecture [5], together with its vectorizing compiler [17]. Since the introduction of the first register-register vector computer, the CRAY-1 [11], compilation technology has evolved to maximize the performance that programs written in high level languages can obtain from a vector architecture. Nevertheless, there have been few in-depth studies of the relationship between vector architectures, vectorizing compilers and vector programs. In [15, 14, 13] the CRAY Y-MP architecture is evaluated through a detailed study of program characteristics such as the number and type of instructions executed, basic block size, fraction of code vectorized, etc. [9] presents a study of the register requirements of vector architectures and analyzes which combination of number of registers and number of read/write ports to the register file has the best cost/performance tradeoff.
In this paper we present results obtained from the evaluation of a subset of the Perfect Club [4] programs compiled using the Convex FC version V8.0 and running on a single node of a Convex C3480 vector machine. To perform this research we have implemented a trace generation tool called dixie, which is able to generate basic block traces from the execution of the programs. These basic block traces are fed into a simulator that we have developed and that gives us detailed information, at the cycle level, of every event that happens during the execution of the program. This is in contrast with the tools used in [14, 15, 13], which only provided statistical data to their authors. It is important to note that we are going to evaluate automatically vectorized programs, and thus we will be studying the performance of the architecture together with its compiler.
In section 2 we present the measurement techniques used in this paper. We describe the trace driven approach we have used to simulate the execution of some of the Perfect Club programs. In section 3 we present the benchmarks used in this paper. In section 4 we present the abstract vector machine that we will use to carry out the experiments. In section 5 we will study the parallelism exploited by our abstract machine using the output of the Convex compiler. Section 6 will look in detail into the reasons that prevent the machine from extracting all the parallelism available and section 7 will study the effects of spill code on vector execution. Finally, in section 8 we will present our conclusions.

1066-6192/96 $4.00 © 1996 IEEE
2 Overview of the measurement technique
The machine on which the experiments were performed is a single processor of a Convex C3480. This machine is ranked in the mini-supercomputer class and has 8 processors, each of which has a scalar execution unit and a vector execution unit. We are interested in the vector behavior, and thus all measurements have been taken using a single processor running in single-user mode. The C3400 processor can be described as a register-register vector machine. The compiler used in all cases is the Convex FC version V8.0 with optimization level -O2 (which implies vectorization) [1]. The vector CPU consists of two functional units. The first one handles all vector operations except multiplication, division and square root. The second one handles all vector operations. Each functional unit has access to 8 vector registers. The vector registers are set up in register pairs, so that each pair has two read ports and one write port. Every vector register is capable of holding 128 elements of 64 bits each. The vector CPU implements fully flexible chaining, which means that an operation can be chained to a previous vector operation currently in progress regardless of the cycle at which each operation has started. Due to the variance in response time that the memory system always shows, vector computations can never be chained to vector load instructions. Vector stores can be chained, though, to vector operations because a set of buffers isolates the vector CPU from the memory latency when sending data to memory.
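The cost of forbidding chaining after a load can be illustrated with a very small timing model. The sketch below is not taken from the paper's simulator; it assumes fully pipelined operations that process one element per cycle with a fixed start-up latency (1 cycle by default), and the function names are hypothetical.

```python
def op_cycles(vl: int, latency: int = 1) -> int:
    """Cycles for one fully pipelined vector operation on vl elements."""
    return vl + latency - 1


def pair_cycles(vl: int, chained: bool, latency: int = 1) -> int:
    """Cycles to complete a producer followed by a dependent consumer.

    chained=True:  the consumer starts as soon as the first element of the
                   producer is available (flexible chaining).
    chained=False: the consumer waits until the producer has completed,
                   which is the rule applied after a vector load.
    """
    if chained:
        return latency + op_cycles(vl, latency)
    return 2 * op_cycles(vl, latency)


# A full-length vector (128 elements) under the idealized 1 cycle latency.
print(pair_cycles(128, chained=True))   # 129
print(pair_cycles(128, chained=False))  # 256
```

Under these assumptions, a computation that depends on a load takes roughly twice as long as the same pair of operations would take if chaining were allowed.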
We have taken a trace-driven approach to gather all the data presented in this paper. We have developed a pixie-like tool called dixie [3] that is able to produce a trace of the basic blocks executed as well as a trace of the values contained in the vector length (vl) register. The ability to trace the value of the vector length register is critical to have a detailed simulation of the execution of the programs, since each vector instruction can execute with a potentially different vector length. Thus, our measurements do not suffer from the problem reported in [15].
Dixie is a tool that, given an executable file, will produce 1) a modified executable file with instrumentation code that will generate a trace and 2) a basic block description file that maps basic block identifiers to the actual instructions of each basic block. When the instrumented executable is run, it generates a trace of basic block identifiers and a trace of every value that is assigned to the vector length register. These two parallel traces are consumed by a cycle-level simulator that uses the basic block description file to simulate the execution of every single instruction and measure the dynamic behavior of the program. Each time the simulator finds an instruction that loads the vl register, it consumes a value from the vector length trace. Dixie is able to trace user and library code and, thus, the simulation runs include the user vector code plus all the vector code found in the Fortran math libraries. It is important to note that we simulate the output of a commercial compiler without introducing any modification in it and that this tracing method gives us absolute precision in all of our measurements.
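The way the two parallel traces are consumed can be sketched as follows. This is only an illustration of the idea described above, not dixie's or the simulator's actual code: the instruction encoding, the opcode names and the function `replay` are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Hypothetical instruction record: (opcode, is_vector, sets_vl)
Instr = Tuple[str, bool, bool]


def replay(bb_trace: Iterable[int],
           vl_trace: Iterable[int],
           bb_table: Dict[int, List[Instr]]) -> Iterator[Tuple[str, Optional[int]]]:
    """Expand a basic block trace into a dynamic instruction stream.

    Each time an instruction that loads the vl register is found, the next
    value of the vector length trace is consumed. Vector instructions are
    yielded with the vector length in force; scalar ones with None.
    """
    vl_values = iter(vl_trace)
    vl: Optional[int] = None
    for bb_id in bb_trace:
        for opcode, is_vector, sets_vl in bb_table[bb_id]:
            if sets_vl:
                vl = next(vl_values)
            yield opcode, (vl if is_vector else None)


# Tiny made-up example: block 0 is a vector loop body, block 1 a branch.
bb_table = {
    0: [("mov_vl", False, True), ("ld", True, False), ("add", True, False)],
    1: [("br", False, False)],
}
stream = list(replay([0, 1, 0], [128, 37], bb_table))
```

Because the vector length is replayed from the trace, the last iteration of a strip-mined loop (37 elements in this made-up example) is simulated with its real length rather than with the maximum of 128.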
The simulator we have developed is based on an abstract version of the Convex architecture and will be described in section 4.
3 Benchmark Programs
We have selected a subset of the Perfect Club programs as our benchmark programs. The Perfect Club application codes are considered to be representative of large typical scientific and engineering programs. For our study, we have executed the thirteen codes on the C3400 in scalar and vector mode and we have obtained the speedups presented in table 1. Column two presents the cpu time (in seconds) of each one of the programs when run in scalar mode. Column three presents the cpu time of the programs (also in seconds) when run in vector mode. Column four is the speedup of the vector version over the scalar version. Column five presents an estimation of the fraction of time spent executing vector code. This estimation has been obtained by instrumenting all those basic blocks that had some vector code with code that reads the hardware timers (TTR) of the Convex machine. Finally, column six presents the percentage of time that these vector basic blocks represent over the execution time of each program. This column can be taken as a rough indicator of the degree of vectorization that each program allows.
As can be seen from this table, most of the programs do not benefit much from vector execution, and we believe that for the purposes of our study we should only examine the subset of programs that really exploit the vector CPU. Including programs that have very low speedups in the study would give us no insight into the behavior of the vector cpu, because these programs make very little use of the vector functional units and have very little instruction level parallelism to offer. Thus we have selected the five programs that have the greatest speedups: ARC2D, FLO52, BDNA, TRFD and SPEC77. We should really have included program MG3D instead of SPEC77, but due to the long running time of MG3D and the extremely high computation costs of the trace-driven simulator, we have not been able to simulate this program in its full length and thus we have dropped it from the study.
4 The abstract vector machine

The simulator used to gather the performance data for the benchmark programs models an idealized version of the C3400 machine. We feel that in order to better understand vector machines it is important to abstract low level details (like functional unit latencies, technology imposed hardware restrictions, memory delays and so on) from our study and concentrate on the general behavior of the programs. While the details omitted in the simulator are very important and deserve several studies in their own right, the conclusions obtained from the data gathered with our simulator will still be valid. Since we will be looking at the relative frequency of several different events, the inclusion of the aforementioned low level details would not introduce significant differences in our results.

The architecture studied consists of a scalar part, that we shall refer to as the SCAL functional unit, and an independent vector part. The scalar portion is able to execute one instruction per cycle regardless of dependencies, functional unit hazards or branching delays. The vector part consists of two computation units (FU1 and FU2) and one memory accessing unit (the LD/ST unit). The FU2 unit is a general purpose arithmetic unit capable of executing all vector instructions. The FU1 unit is a restricted functional unit that executes all vector instructions except multiplication, division and square root. Both functional units are considered fully pipelined and with a latency of 1 cycle. This asymmetric behavior of the functional units is important when the control unit has to schedule the different operations. Whenever the control unit has to issue an instruction that can be executed both in FU1 and FU2, the decoder will always try to send it first to FU1 and, if that unit is busy, it will try to send it to FU2.

The vector unit has 8 vector registers, each of which holds up to 128 elements of 64 bits. These eight vector registers are connected to the functional units through a restricted crossbar. Every two vector registers are grouped in a register bank and share two read ports and one write port that link them to the functional units. The compiler is responsible for scheduling the vector instructions and allocating the vector registers so that no port conflicts arise. The machine modeled implements fully flexible chaining [5]. Flexible chaining allows two dependent vector operations to be executed simultaneously without imposing restrictions on the issue time of the two instructions. Older vector designs, like the CRAY-1, had a fixed chaining scheme in which chaining could only occur if the second operation of a dependent pair was issued at a particular point in time. The chaining implementation we have chosen to model has two read pointers and one write pointer for each one of the vector registers. These read/write pointers control the next element that has to be sent to the functional units and allow the same physical vector register to be shared by different instructions that have started at different cycles.

The LD/ST unit can only service one request to/from memory at a time, because the simulated architecture has only one bus connecting the CPU to memory. The memory system simulated is an ideal one that has a 1 cycle latency and delivers one datum per cycle, regardless of the stride used. The real C3400 architecture has one additional limitation regarding the memory system that we have chosen to simulate. Despite the fact that the memory delivers one datum per cycle, we will not allow the result of a vector load instruction to be chained with a vector computation instruction. This limitation is a common problem that multiprocessor vector architectures have to face, because the variance in response time of real memory systems makes it difficult to predict the arrival time of a datum at the processor. Thus, while it is not impossible to chain vector computations to vector loads, a reasonable tradeoff is to restrict this chaining. We believe it is important to simulate this feature of the Convex C3400 architecture in order to evaluate its impact on performance.
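The issue rule for the two asymmetric computation units can be written down in a few lines. The following is a minimal sketch of that rule as described in this section; the function name `pick_unit` and the opcode strings are assumptions made for the example, not part of the simulator.

```python
from typing import Optional

# Operations that only the general purpose unit (FU2) can execute.
FU2_ONLY = {"mul", "div", "sqrt"}


def pick_unit(opcode: str, fu1_busy: bool, fu2_busy: bool) -> Optional[str]:
    """Return the unit that receives the instruction, or None on a stall.

    Instructions that both units can execute are sent to FU1 first and
    fall back to FU2 only when FU1 is busy. Multiplications, divisions
    and square roots can only go to FU2.
    """
    if opcode not in FU2_ONLY and not fu1_busy:
        return "FU1"
    if not fu2_busy:
        return "FU2"
    return None


print(pick_unit("add", fu1_busy=False, fu2_busy=False))  # FU1
print(pick_unit("add", fu1_busy=True, fu2_busy=False))   # FU2
print(pick_unit("mul", fu1_busy=False, fu2_busy=True))   # None (stall)
```

The last call shows the asymmetry: a multiplication stalls even though FU1 is idle.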
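Table 1: Execution times and speedups of the Perfect Club programs. Columns: program, scalar cputime, vector cputime, speedup, vector bb cputime, vector execution %.

5 Instruction Level Parallelism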
We run each one of the five programs from the Perfect Club and simulate its execution cycle by cycle. The simulator reads a trace of basic block addresses and executes each instruction in the basic block following the issue rules stated in section 4. At every single cycle we keep track of how many functional units are simultaneously busy. This number gives us an idea of the amount of parallelism present in the program that we are actually exploiting. Note that in our simulated architecture the maximum parallelism achievable (ignoring the scalar unit) is 3. In this section we will look into the parallelism that the architecture is able to exploit, and in the following sections we will consider the factors that limit this parallelism.
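The per-cycle bookkeeping can be sketched as a histogram of the number of vector units that are busy in each cycle. The interval representation and the function name below are assumptions chosen for illustration; the actual simulator is not shown in the paper.

```python
from collections import Counter
from typing import Iterable, Tuple


def busy_histogram(intervals: Iterable[Tuple[str, int, int]],
                   total_cycles: int) -> Counter:
    """Count, for each degree of parallelism, how many cycles exhibit it.

    intervals holds (unit, start, end) triples, with end exclusive, that
    describe when each vector unit (FU1, FU2 or LD) is busy.
    """
    busy = [set() for _ in range(total_cycles)]
    for unit, start, end in intervals:
        for cycle in range(start, min(end, total_cycles)):
            busy[cycle].add(unit)
    return Counter(len(units) for units in busy)


# Made-up activity over 8 cycles.
hist = busy_histogram([("LD", 0, 4), ("FU1", 2, 6), ("FU2", 3, 5)], 8)
print(sorted(hist.items()))  # [(0, 2), (1, 3), (2, 2), (3, 1)]
```

Only one of the eight cycles in this toy example reaches the maximum parallelism of 3.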
Table 2 presents the utilization of the vector units. Each row in the table presents a different “state” of the vector functional units. A value of 1 in columns two to five indicates that the corresponding unit is active. Thus, row number 0, with code 0000, corresponds to the machine being idle, and row number 1, with code 0001, corresponds to pure scalar execution. The last row, with code 1111, corresponds to maximum efficiency: all three vector units (FU1, FU2 and LD/ST) and the scalar unit are working simultaneously. Columns six to nine present the fraction of cycles that the machine was in each one of the states. Columns six and seven present this fraction using the weighted mean (each program contributes to the mean proportionately to its running time) and the arithmetic mean, respectively. Columns eight and nine are obtained when considering only the vector functional units and ignoring the state of the scalar unit. Notice how every two consecutive rows are identical in their vector portion and only differ in the activity of the scalar unit. We have “collapsed” every such pair of rows by adding them, obtaining the results presented in columns eight (the result of collapsing column six) and nine (the result of collapsing column seven).
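The state encoding, the collapsing of row pairs and the two kinds of mean can be sketched as follows. The bit order (FU2, FU1, LD/ST, scalar) is inferred from the row descriptions in the text, and all function names and numbers in the example are illustrative assumptions rather than the paper's data.

```python
from typing import Dict, List, Tuple


def state_code(fu2: bool, fu1: bool, ldst: bool, scal: bool) -> int:
    """Row number of a state: bits are FU2, FU1, LD/ST, scalar (LSB)."""
    return (int(fu2) << 3) | (int(fu1) << 2) | (int(ldst) << 1) | int(scal)


def collapse_scalar(fractions: Dict[int, float]) -> Dict[int, float]:
    """Add each pair of rows that differ only in the scalar bit."""
    collapsed: Dict[int, float] = {}
    for code, fraction in fractions.items():
        key = code >> 1  # drop the scalar bit
        collapsed[key] = collapsed.get(key, 0.0) + fraction
    return collapsed


def means(per_program: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Weighted and arithmetic mean of the fraction of cycles in a state.

    per_program holds (cycles_in_state, total_cycles) for each program.
    """
    weighted = sum(s for s, _ in per_program) / sum(t for _, t in per_program)
    arithmetic = sum(s / t for s, t in per_program) / len(per_program)
    return weighted, arithmetic


# Made-up numbers: rows 2 and 3 collapse into the "LD/ST only" vector state.
print(collapse_scalar({2: 0.25, 3: 0.5}))  # {1: 0.75}
weighted, arithmetic = means([(10, 100), (300, 900)])
```

In the weighted mean, long-running programs dominate; in the arithmetic mean, every program counts equally regardless of its running time.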
Row number 1 in table 2 corresponds to pure scalar execution. Even for our benchmark programs, which are highly vectorizable, there will always be some portion of scalar code, mostly related to library code for input/output, scalar code generated to set up the environment for vector computation and code corresponding to portions of the program that current vectorization technology cannot handle.
Row 2 represents the situation where the vector memory unit is the only functional unit working, while in row 3 we have the percentage of cycles that the load/store vector unit and the scalar unit have been running simultaneously. If we look at either column 8 or 9 we can see that the fraction of cycles spent in these two states is extremely high. Let’s assume that scalar code is not useful for the computation, in the sense that scalar code is just overhead code to compute addresses, perform calls, jumps, control loops, etc. Table 2 shows us that in 35.98% of the cycles we are not producing any results. We are either moving data or doing “setup” work. This high number of unproductive cycles is due to several reasons. First, all applications have initialization loops that just initialize the data structures to be used during the program and that only contain vector memory operations. Second, the architecture only has one memory bus, so whenever the instruction issue stage finds two consecutive vector memory operations in the code, it will stall waiting for the first memory operation to complete. At best, the decoder will be able to issue a few scalar instructions found between the two memory operations, but this only happens in a small number of cycles (1%). Third, the architectural limitation of not being able to chain vector computation instructions to vector loads is also responsible for stalling the machine to wait for a memory operation. In the next section we will quantify each one of these effects.

Table 2: Utilization of the vector functional units.
Rows 4 and 5 present rather unusual situations. In these rows the only vector unit working is FU1, and the scalar unit can be working (row 5) or not (row 4). This can happen whenever a) there is no parallelism in the code, that is, we have a single vector computation isolated within a long scalar section of code, or b) there has been a port conflict between the vector instruction and its sequential follower. Both cases are rather unusual in the programs studied, as we can see from the low fraction of cycles (0.61%) that they represent.
In rows 6 and 7 we see some degree of overlapping between computation and memory accessing instructions. Both the restricted vector functional unit and the load/store vector unit are working concurrently. The typical sequences of code that put the machine in these two states are sequences where we have issued an instruction to functional unit 1 and a memory instruction to the LD/ST unit (they can be related or unrelated) and we encounter in the instruction stream a) a computation instruction that might be dependent on the memory instruction, in which case a “load chain” conflict arises and we have to stall the machine, or b) a second memory instruction that will have to wait until the one running completes, or c) no more vector instructions to be issued, which happens at the end of loops, for example.
Rows 8 and 9 are similar but not equivalent to rows 4 and 5. They represent situations where the only vector functional unit working is the general purpose functional unit. But the reasons that lead us to states 8/9 or states 4/5 are rather different. In rows 4/5 we were talking about lack of parallelism or port conflicts. Rows 8/9, as we will see in more depth in the next section, can also be the result of a port conflict, but more usually they will be the result of the presence of parallelism that the machine cannot exploit. If we have two consecutive vector instructions that require the FU2 unit (for example, any mix of consecutive multiplications, divisions and square roots will do), the second instruction will have to stall waiting for the first one to free the functional unit. This situation is rather frequent, and corresponds to compute bound loops that have a lot of vector instructions that can only execute in FU2. Note the difference in the percentage of cycles between rows 8 and 9 (5.67%) and rows 4 and 5 (0.61%). The decision to include a second functional unit that can only perform a subset of all operations, instead of a general purpose one, has a significant negative impact on performance. The next section will provide more data to discuss this tradeoff in depth.
Rows 10 and 11 represent the overlapping of FU2 instructions with vector memory operations. Again, these two rows have a higher percentage of cycles (13.96%) than rows 6 and 7, mostly because of the same reasons explained for rows 8 and 9. See how the four rows together represent almost 20% of all executed cycles. It is very important to remark that our simulator treats all vector operations as being fully pipelined. Had we decided to take into account the real latencies of instructions like division or square root (that could be well beyond 10 cycles with current technology) we would see how the number of cycles and the percentage that rows 8 through 11 represent would be much higher. For example, if we chose a 10 cycle latency for division, the execution of a vector division of a given vector length would take as much time as 10 vector additions of the same vector length. This means that the consequences of having a second functional unit that cannot perform certain frequent operations would actually be worse in a real machine than what we have found in this paper. For the sake of simplicity, we have chosen the 1 cycle latency approach to highlight the relative importance of the architectural decisions involved in an ideal vector architecture without getting into implementation issues.
The last four rows represent the states where the maximum parallelism is achieved by the architecture. In all of them, both vector computation units are working concurrently and producing two results per cycle. Nevertheless, this peak efficiency is only reached in 26.14% of all executed cycles. The rest of the cycles are divided between a) approximately 50% of all cycles in which the vector computation units are idle, that is, we are executing scalar code or just moving data around, and b) 26.42% of the cycles in which only one of the functional units is producing some result. If, again, we assume that scalar code is not useful for the computation because it’s just overhead code to compute addresses, perform calls, jumps, control loops, etc., and we also assume that vector loads and stores are not part of the computation, we have a very low average number of results computed per cycle. In 26.14% of cycles we are producing 2 results, in 26.42% of cycles we are producing just 1 result and in 47.43% of the cycles we are producing 0 results. This gives us an average of 0.79 results per cycle. Even taking into account that this is a lower bound and that, in fact, there is also scalar code that cannot be considered “overhead” since it is actually producing results, it is still rather far from the peak performance of 2 results per cycle.
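The average number of results per cycle is simply the mean of the distribution quoted above. As a check, the short sketch below recomputes it from the three percentages given in the text; the function name is an assumption made for the example.

```python
from typing import Dict


def results_per_cycle(distribution: Dict[int, float]) -> float:
    """Average results per cycle.

    distribution maps the number of results produced in a cycle to the
    percentage of cycles in which that many results were produced.
    """
    return sum(results * pct for results, pct in distribution.items()) / 100.0


# Percentages quoted in the text: 2 results, 1 result and 0 results.
avg = results_per_cycle({2: 26.14, 1: 26.42, 0: 47.43})
print(round(avg, 3))  # 0.787
```

With these figures, the machine produces roughly 39% of the peak rate of 2 results per cycle.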
Looking at table 2 globally, we can note some other interesting points. First, the fraction of scalar code that is hidden by vector operations is rather high. Adding the seven rows where the scalar unit is active and at least one of the vector units is active, we see that this overlapping happens in 6.55% of all cycles. The percentage of cycles where we are executing scalar code without overlapping is 11.45%. As in our simulator one cycle is equivalent to one scalar instruction, we can see that 56.3% of all scalar instructions have been hidden by vector instructions. For the rest of the instructions, which are not overlapped, we believe that modern vector processors should have a superscalar processor to execute the scalar portion of programs, and thus the 11.45% of cycles could be further reduced.
Second, the total number of cycles where the vector LD/ST functional unit is accessing memory is very high. Since we assume a perfect memory system, we can consider that in every single cycle that this functional unit was busy, a datum was being transferred from/to memory. Adding all rows where the LD/ST unit is busy, we get that in 73.13% of all executed cycles we were accessing memory. On the other hand, we have already seen how in 79.7% of all cycles a result was produced. The comparison between these two numbers gives us insight into the balance of the programs studied. They turn out to be slightly compute bound in terms of the total number of “abstract” operations performed. Nevertheless, we have to insist on the fact that neither memory latencies nor functional unit latencies have been taken into account, and these are two very important factors that could change this balance. This is a subject that deserves further investigation.
Finally, we need to better understand the factors that limit the machine and prevent it from achieving full efficiency. We have argued that some of these factors could lie in the single bus architecture, the load chain problem or the asymmetry between the two functional units. In the next section we will quantify the relative importance of all these factors.
6 Limitations to instruction level parallelism
Our architecture introduces several resource limitations that can be classified into: a) not enough functional units, b) the lack of the ability to chain a computation to a load instruction and c) conflicts in the ports of the vector register file. There is still another, indirect limitation introduced by the limited number of vector registers: d) the spill code instructions used to move data between registers and memory when no free registers are available to perform a certain computation. In this section we will deal with limitations a), b) and c), and in the next section we will look into the spill code problem.
The simulator is able to collect statistics about all the hazards that occur during the execution of the programs. When an instruction is not issued we determine all the reasons that prevented the instruction from executing and store in a table the total number of cycles that the decoder had to stall due to those reasons. For example, in the following code:
1. add v0,v1,v2
2. mul v2,v0,v3
3. add v0,v3,v1
we see how the third instruction cannot be issued for two reasons. First, it has a read port conflict with the two previous instructions. Instructions 1 and 2 use the two read ports available in register bank 0. Thus, no other instruction can simultaneously access any of the registers in that bank. In particular, instruction 3 has a read port conflict in its first operand (v0). Second, the two computational units in the vector cpu are busy and the third instruction will have to wait until one of the previous instructions finishes its execution and releases its functional unit. When this situation arises, the simulator stores in a table the total number of cycles that the third instruction had to wait prior to execution due to this combination of hazards.
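The bookkeeping of hazard combinations for this example can be sketched as below. This is an illustration and not the simulator's code. In particular, the mapping of registers to banks (v0/v4, v1/v5, and so on, that is, bank = register number mod 4) is an assumption inferred from the example, in which v0 is read by instructions 1 and 2 and those two reads exhaust the read ports of bank 0. The function and label names are hypothetical.

```python
from collections import Counter
from typing import FrozenSet, Iterable

NUM_BANKS = 4            # 8 vector registers, two per bank
READ_PORTS_PER_BANK = 2


def bank(reg: int) -> int:
    # Assumption: registers are paired as v0/v4, v1/v5, v2/v6, v3/v7.
    return reg % NUM_BANKS


def stall_reasons(src_regs: Iterable[int], in_flight_reads: Iterable[int],
                  fu2_only: bool, fu1_free: bool,
                  fu2_free: bool) -> FrozenSet[str]:
    """Return every hazard that prevents a computation from issuing."""
    reasons = set()
    ports_used = Counter(bank(r) for r in in_flight_reads)
    for reg in src_regs:
        if ports_used[bank(reg)] >= READ_PORTS_PER_BANK:
            reasons.add("PRT")          # no read port left in this bank
        ports_used[bank(reg)] += 1
    if fu2_only:
        if not fu2_free:
            reasons.add("FU2")          # mul/div/sqrt need FU2
    elif not (fu1_free or fu2_free):
        reasons.add("FU1")              # both computation units are busy
    return frozenset(reasons)


# Instruction 3 (add v0,v3,v1) while instructions 1 and 2 are in flight:
# instruction 1 reads v0 and v1, instruction 2 reads v2 and v0.
stall_table: Counter = Counter()
reasons = stall_reasons(src_regs=[0, 3], in_flight_reads=[0, 1, 2, 0],
                        fu2_only=False, fu1_free=False, fu2_free=False)
stall_table[reasons] += 1   # one more stalled cycle for this combination
```

Keying the table by the whole set of reasons, rather than by each reason separately, is what allows stalls caused by a single hazard to be told apart from stalls in which several hazards coincide.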
As another example, consider the following two instructions:
1. ld effa1,v0
2. st v0,effa2
This piece of code is just moving data in memory. The second instruction stalls the decoder because it has two conflicts with the previous one. First, our machine only has one load/store unit, so it’s not possible to start the store in parallel. Second, even if we had more buses connecting the CPU and memory, we have decided that we could not chain instructions to the result of a load, so the store instruction would have to wait until the load completed. We term this latter situation a “load chain conflict”.
The effects of limitations a), b) and c) can be seen in table 3. Each row in the table corresponds to a different situation that stalled the machine. To understand this table, consider all vector computation instructions divided into two classes. A FU2-class instruction is an instruction that can only execute in the FU2 functional unit. That is, we have three FU2-class instructions: mul, div and sqrt. The rest of the vector computation instructions are FU1-class instructions because they can execute in both functional units. The first five columns in table 3 have the following meanings. In the column labeled FU2, a 1 indicates that the machine was stalled because there was no functional unit available to execute a FU2-class instruction. In the column labeled FU1, a 1 indicates that the machine was stalled because there was no functional unit available to execute a FU1-class instruction, which means that both functional units were busy. In the column labeled LD, a 1 indicates that the machine was stalled because the vector load/store unit was busy and could not accept more memory instructions. In the column labeled PRT, a 1 indicates that a certain instruction was not issued due to conflicts in the read/write ports of the vector register banks. Finally, a 1 in the column labeled LDX (load chain) indicates that the machine was stalled because a computation was dependent on a load instruction and, thus, could not be chained to it. The next two columns show the percentage that each one of these hazard combinations represents over the total number of executed cycles (using the weighted and arithmetic means).
Row 0 presents the percentage of cycles that the
decoder stalled due to the inability to chain a computation
to a load. Having a 1 only in column LDX
means that we were ready to issue a computation
instruction, we had a free functional unit, we had ports
to access the vector registers needed as operands, but
one of those operands was the result of a previous
load still in progress. The percentage of cycles lost
in this situation is very high (16.42%). If we add together
all rows where LDX is 1, we see that the lack
of chaining with loads is co-responsible for stalls in as
much as 25% of all executed cycles. Other vector machines,
like the Cray-YMP C90, do not have this chaining restriction
and can better utilize the functional units. Alternatively,
if the machine had more registers, these cycles
could probably be filled with useful computations by
unrolling the loops and scheduling the operations such
that vector loads and dependent computations are
moved as far apart as possible. Bear in mind, however,
that except for row 0, in all other rows the LDX
problem is not the only cause that stalls the machine.
Thus, if we removed the “load chain” restriction we
would be sure of eliminating the 16.42% of cycles
accounted for in row 0, but probably not many more.
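The difference between rows where a hazard is the only cause of a stall and rows where it is one of several causes can be checked against the weighted means of table 3. The Python sketch below transcribes that column; the helper names `exclusive` and `co_responsible` are ours, and treating the two sums as rough lower and upper limits on what removing a hazard could recover is a first-order reading that ignores any change in program behavior once the hazard is gone.

```python
COLUMNS = ("FU2", "FU1", "LD", "PRT", "LDX")

# (FU2, FU1, LD, PRT, LDX) flags and weighted mean in % of executed cycles,
# transcribed from rows 0 to 14 of table 3.
TABLE3 = [
    ((0, 0, 0, 0, 1), 16.42),
    ((0, 0, 0, 1, 0), 1.62),
    ((0, 0, 0, 1, 1), 2.59),
    ((0, 0, 1, 0, 0), 30.06),
    ((0, 0, 1, 0, 1), 0.65),
    ((0, 0, 1, 1, 0), 3.18),
    ((0, 0, 1, 1, 1), 0.01),
    ((0, 1, 0, 0, 0), 2.94),
    ((0, 1, 0, 0, 1), 1.28),
    ((0, 1, 0, 1, 0), 2.11),
    ((0, 1, 0, 1, 1), 0.36),
    ((1, 0, 0, 0, 0), 8.29),
    ((1, 0, 0, 0, 1), 2.18),
    ((1, 0, 0, 1, 0), 3.48),
    ((1, 0, 0, 1, 1), 1.51),
]


def co_responsible(column):
    """Sum of every row in which `column` is flagged."""
    i = COLUMNS.index(column)
    return sum(pct for flags, pct in TABLE3 if flags[i])


def exclusive(column):
    """Sum of the rows in which `column` is the only flag set."""
    i = COLUMNS.index(column)
    return sum(pct for flags, pct in TABLE3 if flags[i] and sum(flags) == 1)


for column in COLUMNS:
    print(f"{column:3s} exclusive {exclusive(column):5.2f}%   "
          f"co-responsible {co_responsible(column):5.2f}%")
```

For LDX the two sums are 16.42% and 25.00%, the figures quoted in the text; the same helpers reproduce the totals given later for the LD, PRT, FU1 and FU2 columns.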
Row 1 presents the number of lost cycles due exclusively
to port contention. Even though this row does
not represent a very large fraction of all executed cycles,
if we add together all rows where column PRT has
a 1, we see that there is a high number of wasted cycles
(14.86%) due to port contention. The architecture
provides 8 read ports and 4 write ports to the vector
registers and the maximum possible simultaneous requests
are 4 reads and 2 writes for the functional units
and one additional read or write for the LD/ST unit,
but still this seems not to be enough to sustain the
bandwidth required by the vector functional units. We
have looked in detail at these port conflicts and found
that the overwhelming majority of conflicts are due to
reads. In particular, in 15.02% of all executed cycles
the machine was stalled due to a conflict in one or two
of the read ports required by the instruction that was
the candidate for issue. In contrast, in only 1.19% of all
executed cycles was a write port the reason for a stall.
Note that these two percentages are not independent,
because there are instructions that cause both read
and write port conflicts simultaneously.

        Hazard combination       % of execution cycles
Row   FU2  FU1  LD   PRT  LDX    weighted    arithmetic
                                 mean        mean
 0     0    0    0    0    1     16.42       18.01
 1     0    0    0    1    0      1.62        1.57
 2     0    0    0    1    1      2.59        3.19
 3     0    0    1    0    0     30.06       31.37
 4     0    0    1    0    1      0.65        0.61
 5     0    0    1    1    0      3.18        3.30
 6     0    0    1    1    1      0.01        0.01
 7     0    1    0    0    0      2.94        3.14
 8     0    1    0    0    1      1.28        1.06
 9     0    1    0    1    0      2.11        1.82
10     0    1    0    1    1      0.36        0.29
11     1    0    0    0    0      8.29        7.02
12     1    0    0    0    1      2.18        1.70
13     1    0    0    1    0      3.48        2.98
14     1    0    0    1    1      1.51        1.18

Table 3: Relative importance of the hazards occurred during the execution of the six benchmark programs.
Row 3 is the one that accounts for the largest percentage
of stall cycles (30.06%). It represents the hazards
that occur when the decoder tries to issue two consecutive
vector memory instructions. This is a very
common situation in vector loops for several reasons.
If the compiler does not perform software pipelining
and/or loop unrolling, the typical starting code of a
loop is a sequence of loads that bring in the data
to be operated on. This sequence of loads will always
conflict in our architecture. Also, at the end of the
loop there is usually a sequence of several stores that
will conflict among themselves and will also conflict with
the load instructions at the beginning of the next iteration
of the loop. Note that in this row the only conflict
is the memory unit; thus, increasing the number
of busses to memory should decrease the percentage
of cycles stalled for this reason. If we also add all
rows where LD is 1, we find that the lack of more memory
units is co-responsible for stalling the machine in
33.9% of all executed cycles.
Row 4 presents a very special case. Whenever we
have a load followed by a dependent store, we have two
conflicts. First, we have a 1 in column LD because we
need the load/store unit to issue the store but it is busy
servicing the load. Second, even if we had a second
memory unit, the load chain restriction would prevent
us from issuing the store. Thus, we also have a 1 in
column LDX. Note that this situation is fairly rare,
and corresponds mostly to loops that do initialization
work, like copying arrays or initializing memory. Row
6 corresponds to a still more special case in which the
store following the load could also not be issued
because the read ports of the corresponding
register bank were both busy (most probably, some
other previous instruction is using them).
Row 5 is a more common situation. Consider the
following code:
    1. add v2,v3,v0
    2. st  v0,effa
    3. ld  effa2,v0
This code can easily arise in loops. The result of
the addition is stored in some array but is no longer
needed in the rest of the loop body. After instruction
2, the compiler knows that register v0 is dead,
and reuses it to bring some other value from memory.
When trying to issue instruction number 3, we have
a LD/ST functional unit conflict, and also a write port
conflict on register v0, because the only write port to
the register bank where v0 belongs is busy servicing
instruction number 1. Obviously, there are other situations
where this combination of conflicts can arise.
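To make the two conflicts in this example concrete, here is a toy Python model of the issue check. It uses what the text states (a single LD/ST unit, two read ports and one write port per register bank) plus one thing the text does not specify: the register-to-bank mapping, taken here as `reg % 4` purely for illustration. Timing is ignored; instruction 1 is simply assumed to still hold its write port, and instruction 2 to still occupy the LD/ST unit, when instruction 3 reaches the decoder. The class and method names are hypothetical.

```python
from collections import Counter

READ_PORTS_PER_BANK = 2   # the text mentions both read ports of a bank
WRITE_PORTS_PER_BANK = 1  # the text mentions the only write port of a bank


def bank(reg):
    """Hypothetical register-to-bank mapping (an assumption, not from the paper)."""
    return reg % 4


class Machine:
    def __init__(self):
        self.reads = Counter()   # busy read ports per bank
        self.writes = Counter()  # busy write ports per bank
        self.ldst_busy = False

    def conflicts(self, srcs, dst, uses_ldst):
        """Return the table 3 columns that would stall this instruction."""
        found = set()
        if uses_ldst and self.ldst_busy:
            found.add("LD")
        need = Counter(bank(r) for r in srcs)
        if any(self.reads[b] + n > READ_PORTS_PER_BANK for b, n in need.items()):
            found.add("PRT")
        if dst is not None and self.writes[bank(dst)] >= WRITE_PORTS_PER_BANK:
            found.add("PRT")
        return found

    def issue(self, srcs, dst, uses_ldst):
        """Mark the ports and units used by an issued instruction as busy."""
        for r in srcs:
            self.reads[bank(r)] += 1
        if dst is not None:
            self.writes[bank(dst)] += 1
        self.ldst_busy = self.ldst_busy or uses_ldst


m = Machine()
m.issue(srcs=[2, 3], dst=0, uses_ldst=False)          # 1. add v2,v3,v0
m.issue(srcs=[0], dst=None, uses_ldst=True)           # 2. store of v0
stall = m.conflicts(srcs=[], dst=0, uses_ldst=True)   # 3. load into v0
print(sorted(stall))
```

Under these assumptions the third instruction is blocked by both the LD/ST unit and the write port of v0's bank, which is the LD plus PRT combination of row 5.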
Let us consider rows 7 through 10. A 1 in column
FU1 means that both functional
units were busy and so we could not issue a FU1-class
instruction. These four rows show us that
the programs had some more parallelism available that
could be exploited by adding more functional units. If
we ignore for the moment that rows 8, 9 and 10 present
port/load chain conflicts, the lost cycles that could
have been successfully used with a third functional
unit amount to at least 6.69% of all executed cycles. Actually,
if a third functional unit were added we could probably
save more execution cycles, but this remains
a topic for future research.
As we have also noted in the previous sections, the
pressure on the FU2 unit is very high due to the fact
that FU1 is not able to execute multiplications, divisions
and square roots. This can easily be seen in rows
11 to 14, where the need for another functional unit of
type FU2 stalls the machine in 15.46% of all executed
cycles. It is also true that rows 12 and 14 also have the
load chain conflict, but they represent only 3.69% of
the executed cycles.
7 Spill code
When the compiler is allocating registers in a basic
block it may find itself without any free register in
which to store a result. In this situation, the compiler
has to insert special instructions, called spill code, that
save a certain register to memory (spill it) to be
able to reuse that register to store some other value.
The contents of the register spilled to memory will be
reloaded at some later point in time into some other
free register. These spill load/store instructions are not
part of the computation but are an overhead introduced
by the compiler due to the limited size of the
register file. However, spill code is a negative contribution
to performance only if it actually increases the
minimum number of chimes required to execute one
iteration of the loop where it appears. If a loop with
spill is highly compute bound, there will
most probably be many cycles where the LD/ST unit
is free to be used by the spill instructions, and
they will effectively be hidden by the computation
instructions. On the other hand, if a loop is memory
bound, every single spill instruction will lengthen its
initiation interval by at least vl cycles, where vl
depends on the length of the vector registers.
If we had an architecture with infinite registers
there would be no spill code at all. In this section
we study the effect of spill code on performance.
There has been some previous work on the effects of
spill code in vector processors. In [9] it is suggested
that the effect of spill code is not very bad in vector
processors because spill can be hidden in the cycles
where the load/store unit is not busy. Nevertheless,
in [9] the architecture under study was a CRAY Y-MP,
which has the same number of functional units as
our architecture but has 3 paths to memory (two load
busses and one store bus) and thus has many more
opportunities to hide those spill instructions. We believe
that in our single bus architecture the effect of
spill code will be significantly higher.

Program    original cycles    nospill
           (millions)         speedup
BDNA          1053.7           1.49
ARC2D         2236.8           1.07
FLO52          726.3           1.07
TRFD           875.4           1
SPEC77        2472.1           1.01

Table 4
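The chime argument made earlier in this section (a spill only costs time when it lengthens the loop's initiation interval) can be illustrated with a deliberately crude Python model. It assumes that one iteration takes `vl` cycles per chime, that the single LD/ST unit handles one memory instruction per chime and the two functional units handle two computation instructions per chime, and it ignores start-up latencies, the FU2-only restriction, chaining limits and port conflicts. The function name, the vector length and the instruction counts are hypothetical example values.

```python
import math


def initiation_interval(n_mem, n_compute, vl, n_fus=2, n_ldst=1):
    """Rough lower bound, in cycles, on one loop iteration (simplified model)."""
    chimes = max(math.ceil(n_mem / n_ldst), math.ceil(n_compute / n_fus))
    return chimes * vl


VL = 128     # example vector length
SPILL = 2    # one spill store plus one reload

# Compute-bound loop: 2 memory and 8 computation instructions.
compute_bound_extra = (initiation_interval(2 + SPILL, 8, VL)
                       - initiation_interval(2, 8, VL))

# Memory-bound loop: 6 memory and 4 computation instructions.
memory_bound_extra = (initiation_interval(6 + SPILL, 4, VL)
                      - initiation_interval(6, 4, VL))

print("extra cycles, compute-bound loop:", compute_bound_extra)
print("extra cycles, memory-bound loop:", memory_bound_extra)
```

In this model the spill pair is free in the compute-bound loop and costs `vl` cycles per spill instruction in the memory-bound one, which is the behavior the text describes.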
To study the effects of spill on performance we have
modified the simulator so that each time it finds a vector
spill instruction it ignores it. This way we create
the effect of having an infinite register file and
eliminate the negative effect of spill code. We
ran the programs again with this version of the simulator
and obtained the total number of cycles needed
to execute them. We can see the effect of spill code by
comparing these results with the results obtained with
the original simulator.
Table 4 presents the results. The first column in table 4
is the total number of cycles (in millions) needed
to execute the original program and the second column
is the speedup obtained when eliminating all the
spill from the programs. We can see the great variance
in speedup between the different programs. BDNA happens
to have extremely long basic blocks, coming from
very long loops, and the compiler has to introduce a
lot of spill code to execute them with only eight registers.
We have measured the mean size of BDNA
vector basic blocks and it is close to 120 instructions
(not including the scalar instructions). From this table
we can conclude that in a single bus architecture spill
code has a negative effect on performance, especially
for those programs that have a high register pressure.
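The per-program figures can be turned into an aggregate estimate with a few lines of Python. The sketch below takes the cycle counts and speedups as printed in table 4, derives the cycles each program would need without spill, and compares the totals. Because the published speedups are rounded to two decimals the result is only approximate, and weighting by cycles is our choice rather than something the text specifies; the variable names are ours.

```python
# program: (original cycles in millions, speedup without spill), from table 4
TABLE4 = {
    "BDNA":   (1053.7, 1.49),
    "ARC2D":  (2236.8, 1.07),
    "FLO52":  (726.3, 1.07),
    "TRFD":   (875.4, 1.00),
    "SPEC77": (2472.1, 1.01),
}

original_total = sum(cycles for cycles, _ in TABLE4.values())
nospill_total = sum(cycles / speedup for cycles, speedup in TABLE4.values())
aggregate_speedup = original_total / nospill_total

for name, (cycles, speedup) in TABLE4.items():
    spill_cycles = cycles - cycles / speedup
    print(f"{name:7s} ~{spill_cycles:6.1f} million cycles attributable to spill")
print(f"aggregate speedup without spill: {aggregate_speedup:.2f}")
```

Under these assumptions the aggregate comes out at about 1.08, in line with the average slowdown of roughly 8% quoted in the conclusions.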
8 Conclusions and Future Work
In this paper we have presented quantitative measurements
of the execution of vector code produced by
a commercial compiler on a vector supercomputer. We
have chosen a subset of the Perfect Club benchmarks
and executed it using a simulator. Results show that
the fraction of time executing vector computations is
not as high as one would expect (around 50%). We
have presented data about the fraction of utilization
of the vector functional units and found that only in
roughly 5% of the cycles are all vector computation
units busy, while in 25% of the cycles
only one of the two arithmetic units is working and the
last 50% of the cycles are either load/store cycles or
purely scalar cycles. The major limitation that prevents
full utilization of the machine is the single bus to
memory architecture, which even in the case of adding
an infinite number of functional units and ports, would
be responsible for stalling the machine in 33.9% of all
executed cycles. Related to the memory problems
is the inability to chain loads and computation. This
restriction is responsible for stalling the machine in a
combined 25% of all executed cycles. Additional limitations
found were the lack of a second functional unit
able to perform multiplication and division (15.46% of
executed cycles) and the limited number of ports to access
the vector register file (14.86%). We have also
seen that vector spill code is not negligible and that it
produces an average slowdown of 8%.
The tools we have developed for this study are currently
being used to evaluate solutions to the problems
reported in this paper. We are currently looking at different
alternatives to solve the spill problem, as well
as to reduce the number of conflicts in the vector register
file ports. We are also investigating new functional
unit schemes to reduce the number of lost cycles.
References 
[1] Convex Press, Richardson, Texas, USA. CONVEX
Architecture Reference Manual (C Series),
sixth edition, April 1992.
[2] Zarka Cvetanovic and Dileep Bhandarkar. Characterization
of ALPHA AXP performance using
TP and SPEC workloads. International Symposium
on Computer Architecture, pages 60-70,
1994.
[3] Roger Espasa and Xavier Martorell. Dixie:
a trace generation system for the C3480.
Technical Report CEPBA-RR-94-08, Universitat
Politècnica de Catalunya, 1994.
[4] M. Berry et al. The Perfect Club benchmarks:
Effective performance evaluation of supercomputers.
The International Journal of Supercomputer
Applications, pages 5-40, Fall 1989.
[5] John L. Hennessy and David A. Patterson.
Computer Architecture: A Quantitative Approach.
Morgan Kaufmann Publishers, 1990.
[6] Kai Hwang and Zhiwei Xu. Multipipeline networking
for compound vector processing. IEEE
Transactions on Computers, 37(1), January 1988.
[7] Norman P. Jouppi. The nonuniform distribution
of instruction-level and machine parallelism and
its effect on performance. IEEE Transactions on
Computers, 38(12):1645-1658, 1989.
[8] Norman P. Jouppi and David W. Wall. Available
instruction level parallelism for superscalar and
superpipelined machines. ASPLOS, pages 272-282,
1989.
[9] Corinna G. Lee. Code Optimizers and Register
Organizations for Vector Architectures. PhD thesis,
University of California at Berkeley, 1992.
[10] Larry McMahan and Ruby Lee. Pathlengths
of SPEC benchmarks for PA-RISC, MIPS, and
SPARC. COMPCON, 1993.
[11] R. M. Russell. The CRAY-1 computer system.
Communications of the ACM, 21(1):63-72, January
1978.
[12] Michael D. Smith, Mike Johnson, and Mark A.
Horowitz. Limits on multiple instruction issue.
ASPLOS, pages 290-302, 1989.
[13] Sriram Vajapeyam. Instruction-Level Characterization
of the Cray Y-MP processor. PhD thesis,
University of Wisconsin, Madison, 1991.
[14] Sriram Vajapeyam and Wei-Chung Hsu. On the
instruction-level characteristics of scalar code in
highly-vectorized scientific applications. IEEE
Micro 25, pages 20-28, 1992.
[15] Sriram Vajapeyam, Gurindar S. Sohi, and Wei-Chung
Hsu. An empirical study of the Cray Y-MP
processor using the PERFECT Club benchmarks.
International Symposium on Computer
Architecture, pages 170-179, 1991.
[16] David W. Wall. Limits of instruction level parallelism.
ASPLOS, pages 176-188, 1991.
[17] H. Zima and B. Chapman. Supercompilers for
parallel and vector computers. ACM Press, New
York, NY, 1991.
