In this paper we present the results of a detailed simulation study of 
Introduction
In order to design new architectures, one first has to properly understand the behavior of current architectures, so as to analyze their strengths and weaknesses and improve future designs. In this paper we are interested in the evaluation of a vector architecture [5], together with its vectorizing compiler [17]. Since the introduction of the first register-register vector computer, the CRAY-1 [11], compilation technology has evolved to maximize the performance that programs written in high level languages can obtain from a vector architecture. Nevertheless, there have been few in-depth studies of the relationship between vector architectures, vectorizing compilers and vector programs. In [15, 14, 13] the CRAY Y-MP architecture is evaluated through a detailed study of program characteristics such as number and type of instructions executed, basic block size, fraction of code vectorized, etc.
[9] presents a study of the register requirements of vector architectures and analyzes which combination of number of registers and number of read/write ports to the register file has the best cost/performance tradeoff.
In this paper we present results obtained from the evaluation of a subset of the Perfect Club [4] programs compiled using the Convex FC compiler, version V8.0, running on a single node of a Convex C3480 vector machine. To perform this research we have implemented a trace generation tool called dixie that generates basic block traces from the execution of the programs. These basic block traces are fed into a simulator that we have developed, which gives us detailed information, at the cycle level, about every event that happens during the execution of the program. This is in contrast with the tools used in [14, 15, 13], which only provided statistical data to their authors. It is important to note that we evaluate automatically vectorized programs, and thus we study the performance of the architecture together with its compiler.
In section 2 we present the measurement techniques used in this paper and describe the trace driven approach we have used to simulate the execution of some of the Perfect Club programs. In section 3 we present the benchmarks used in this paper. In section 4 we present the abstract vector machine that we will use to carry out the experiments. In section 5 we study the parallelism exploited by our abstract machine using the output of the Convex compiler. Section 6 looks in detail into the reasons that prevent the machine from extracting all the parallelism available, and section 7 studies the effects of spill code on vector execution. Finally, in section 8 we present our conclusions.
108&6192@6 $4.00 © 1996 IEEE
Overview of the measurement technique
The machine on which the experiments were performed is a Convex C3480. This machine is ranked in the mini-supercomputer class and has 8 processors, each with a scalar execution unit and a vector execution unit. We are interested in the vector behavior, and thus all measurements have been taken using a single processor running in single-user mode. The C3400 processor can be described as a register-register vector machine. The compiler used in all cases is the Convex FC compiler, version V8.0, with optimization level -O2 (which implies vectorization) [1]. The vector CPU consists of two functional units. The first one handles all vector operations except multiplication, division and square root. The second one handles all vector operations. Each functional unit has access to 8 vector registers. The vector registers are set up in register pairs, so that each pair has two read ports and one write port. Every vector register is capable of holding 128 elements of 64 bits each. The vector CPU implements fully flexible chaining, which means that an operation can be chained to a previous vector operation currently in progress regardless of the cycle at which each operation started. Due to the variance in response time that the memory system always shows, vector computations can never be chained to vector load instructions. Vector stores can be chained, though, to vector operations because a set of buffers isolates the vector CPU from the memory latency when sending data to memory.
We have taken a trace-driven approach to gather all the data presented in this paper. We have developed a pixie-like tool called dixie [3] that is able to produce a trace of the basic blocks executed as well as a trace of the values contained in the vector length (vl) register. The ability to trace the value of the vector length register is critical for a detailed simulation of the execution of the programs, since each vector instruction can execute with a potentially different vector length. Thus, our measurements do not suffer from the problem reported in [15].
Dixie is a tool that, given an executable file, produces 1) a modified executable file with instrumentation code that will generate a trace and 2) a basic block description file that maps basic block identifiers to the actual instructions of each basic block. When the instrumented executable is run, it generates a trace of basic block identifiers and a trace of every value that is assigned to the vector length register. These two parallel traces are consumed by a cycle-level simulator that uses the basic block description file to simulate the execution of every single instruction and measure the dynamic behavior of the program. Each time the simulator finds an instruction that loads the vl register, it consumes a value from the vector length trace. Dixie is able to trace user and library code and, thus, the simulation runs include the user vector code plus all the vector code found in the Fortran math libraries. It is important to note that we simulate the output of a commercial compiler without introducing any modification to it, and that this tracing method gives us absolute precision in all of our measurements.
The simulator we have developed is based on an abstract version of the Convex architecture and is described in section 4.
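The way the two parallel traces are consumed can be illustrated with a minimal sketch. It assumes a hypothetical in-memory representation: the `Instr` record, the `sets_vl` flag and the `replay` function are names introduced here for illustration and do not describe the actual file formats or code of dixie or the simulator.

```python
from collections import namedtuple

# Hypothetical instruction record: an opcode plus a flag marking
# instructions that load the vl register.
Instr = namedtuple("Instr", ["opcode", "sets_vl"])

def replay(bb_trace, vl_trace, bb_table):
    """Yield (instruction, vector_length) pairs in execution order.

    bb_trace: iterable of basic block identifiers (first trace)
    vl_trace: iterable of values assigned to vl (second trace)
    bb_table: dict mapping block id -> list of Instr (description file)
    """
    vl_values = iter(vl_trace)
    vl = None  # unknown until the first instruction that sets vl
    for bb_id in bb_trace:
        for instr in bb_table[bb_id]:
            if instr.sets_vl:
                vl = next(vl_values)  # consume one traced value
            yield instr, vl

# Made-up example: block 0 sets vl and runs two vector instructions,
# block 1 holds a single store.
bb_table = {
    0: [Instr("mov_vl", True), Instr("ld", False), Instr("add", False)],
    1: [Instr("st", False)],
}
events = list(replay([0, 1, 0], [128, 37], bb_table))
```

Keeping the vector length values in a separate trace means the basic block trace stays compact, while every vector instruction is still replayed with the vector length it actually had at run time.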
Benchmark Programs
We have selected a subset of the Perfect Club programs as our benchmark programs. The Perfect Club application codes are considered to be representative of large typical scientific and engineering programs. For our study, we have executed the thirteen codes on the C3400 in scalar and vector mode and we have obtained the speedups presented in table 1. Column two presents the CPU time (in seconds) of each one of the programs when run in scalar mode. Column three presents the CPU time of the programs (also in seconds) when run in vector mode. Column four is the speedup of the vector version over the scalar version. Column five presents an estimation of the fraction of time spent executing vector code. This estimation has been obtained by instrumenting all those basic blocks that had some vector code with code that reads the hardware timers (TTR) of the Convex machine. Finally, column six presents the percentage of time that these vector basic blocks represent over the execution time of each program. This column can be taken as a rough indicator of the degree of vectorization that each program allows.
As can be seen from this table, most of the programs do not benefit much from vector execution, and we believe that for the purposes of our study we should only examine the subset of programs that really exploit the vector CPU
The abstract vector machine
The simulator used to gather the performance data for the benchmark programs models an idealized version of the C3400 machine. We feel that in order to better understand vector machines it is important to abstract low level details (like functional unit latencies, technology imposed hardware restrictions, memory delays and so on) from our study and concentrate on the general behavior of the programs. While the details omitted in the simulator are very important and deserve several studies in their own right, the conclusions obtained from the data gathered with our simulator will still be valid. Since we will be looking at the relative frequency of several different events, the inclusion of the aforementioned low level details would not introduce significant differences in our results.
The architecture studied consists of a scalar part, that we shall refer to as the SCAL functional unit, and an independent vector part. The scalar portion is able to execute one instruction per cycle regardless of dependencies, functional unit hazards or branching delays. The LD/ST unit can only service one request to/from memory at a time, because the architecture simulated has only one bus connecting the CPU to memory. The memory system simulated is an ideal one that has a 1 cycle latency and delivers one datum per cycle, regardless of the stride used. The real C3400 architecture has one additional limitation regarding the memory system that we have chosen to simulate. Despite the fact that the memory delivers one datum per cycle, we do not allow the result of a vector load instruction to be chained with a vector computation instruction. This limitation is a common problem that multiprocessor vector architectures have to face, because the variance in response time of real memory systems makes it difficult to predict the arrival time of a datum at the processor. Thus, while it is not impossible to chain vector computations to vector loads, a reasonable tradeoff is to restrict this chaining. We believe it is important to simulate this feature of the Convex C3400 architecture in order to evaluate its impact on performance.
We run each one of the five programs from the Perfect Club and simulate its execution cycle by cycle. The simulator reads a trace of basic block addresses and executes each instruction in the basic block following the issue rules stated in section 4. At every single cycle we keep track of how many functional units are simultaneously busy. This number gives us an idea of the amount of parallelism present in the program that we are actually exploiting. Note that in our simulated architecture the maximum parallelism achievable (ignoring the scalar unit) is 3.
In this section we will look into the parallelism that the architecture is able to exploit, and in the following sections we will consider the factors that limit this parallelism.
FU2 and LD/ST) and the scalar unit are working simultaneously. Columns six to nine present the fraction of cycles that the machine was in each one of the states. Columns six and seven present the fraction of cycles that the machine was in each one of the states, using the weighted mean (each program contributes to the mean proportionately to its running time) and the arithmetic mean. Columns eight and nine are obtained when considering only the vector functional units and ignoring the state of the scalar unit. Notice how every two rows are identical in their vector portion and only differ in the activity of the scalar unit. We have "collapsed" every two rows by adding them, and the results are presented in columns eight (the result of collapsing the sixth column) and nine (the result of collapsing column seven).
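The per-cycle bookkeeping described above can be sketched as follows. This is only an illustration under stated assumptions: the per-cycle sets of busy unit names are a hypothetical simulator output format, and `parallelism_histogram` is a name introduced here, not part of the simulator.

```python
from collections import Counter

# The three vector units of the abstract machine; SCAL is ignored here.
VECTOR_UNITS = ("FU1", "FU2", "LD/ST")

def parallelism_histogram(busy_per_cycle):
    """Count cycles by the number of simultaneously busy vector units.

    busy_per_cycle: iterable of sets naming the units busy in each cycle
    (a hypothetical format chosen for this sketch).
    """
    hist = Counter()
    for busy in busy_per_cycle:
        hist[sum(1 for unit in VECTOR_UNITS if unit in busy)] += 1
    return hist

# Made-up five-cycle run, not measured data.
cycles = [
    {"SCAL"},
    {"LD/ST"},
    {"LD/ST", "SCAL"},
    {"FU1", "FU2", "LD/ST"},
    {"FU2", "LD/ST"},
]
hist = parallelism_histogram(cycles)
```

Dividing each count by the total number of cycles gives the fraction of time spent at each degree of parallelism, from 0 up to the maximum of 3.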
Row number 1 in table 2 corresponds to pure scalar execution. Even for our benchmark programs that are highly vectorizable there will always be some portion of scalar code mostly related to library code for input/output, scalar code generated to set up the environment for vector computation and code corresponding to portions of the program that current vectorization technology can not handle.
Row 2 represents the situation where the vector memory unit is the only functional unit working, while in row 3 we have the percentage of cycles in which the load/store vector unit and the scalar unit have been running simultaneously. If we look at either column 8 or 9 we can see that the fraction of cycles spent in these two states is extremely high. Let's assume that scalar code is not useful for the computation, in the sense that scalar code is just overhead code to compute addresses, perform calls, jumps, control loops, etc. Table 2 shows us that in 35.98% of the cycles we are not producing any results. We are either moving data or doing "setup" work. This high number of unproductive cycles is due to several reasons. First, all applications have initialization loops that just initialize the data structures to be used during the program and that only have vector memory operations. Second, the architecture only has one memory bus, so whenever the instruction issue stage finds two consecutive vector memory operations in the code, it will stall waiting for the first memory operation to complete. At best, the decoder will be able to issue a few scalar instructions found between the two memory operations, but this only happens in a small number of cycles (1%). Third, the architectural limitation of not being able to chain vector computation instructions to vector loads is also responsible for stalling the machine to wait for a memory operation. In the next section we will quantify each one of these effects.
Rows 4 and 5 present rather unusual situations. In these rows the only vector unit working is FU1, and the scalar unit can be working (row 5) or not (row 4). This can happen whenever a) there is no parallelism in the code, that is, we have a single vector computation isolated within a long scalar section of code, or b) there has been a port conflict between the vector instruction and its sequential follower. Both cases are rather unusual in the programs studied, as we can see by the low (0.61%) fraction of cycles that they represent.
In rows 6 and 7 we see some degree of overlapping between computation and memory accessing instructions. Both the restricted vector functional unit and the load/store vector unit are working concurrently. The typical sequences of code that put the machine in these two states are sequences where we have issued an instruction to functional unit 1 and a memory instruction to the LD/ST unit (they can be related or unrelated) and we encounter in the instruction stream a) a computation instruction that
If we assume that scalar code is not useful for the computation because it is just overhead code to compute addresses, perform calls, jumps, control loops, etc., and we also assume that vector loads and stores are not part of the computation, we have a very low average number of results computed per cycle. In 26.14% of cycles we are producing 2 results, in 26.42% of cycles we are producing just 1 result, and in 47.43% of the cycles we are producing 0 results. This gives us an average of 0.79 results per cycle. Even taking into account that this is a lower bound and that, in fact, there is also scalar code that cannot be considered "overhead" since it is actually producing results, it is still rather far from the peak performance of 2 results per cycle.
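The average number of results per cycle is a weighted sum of the three fractions quoted above. The short sketch below reproduces that arithmetic; the function and variable names are introduced here for illustration only.

```python
# Fraction of cycles (in %) producing 2, 1 and 0 results, as quoted in the text.
cycle_share = {2: 26.14, 1: 26.42, 0: 47.43}

def mean_results_per_cycle(share):
    """Weighted mean of results per cycle; shares are percentages of cycles."""
    total = sum(share.values())
    return sum(results * pct for results, pct in share.items()) / total

avg = mean_results_per_cycle(cycle_share)

# Peak is 2 results per cycle, reached only when FU1 and FU2 are both busy.
peak_fraction = avg / 2.0
```

Normalizing by the sum of the shares rather than by 100 keeps the mean well defined even though the rounded percentages do not add up to exactly 100.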
Looking at table 2 globally, we can note some other interesting points. First, the fraction of scalar code that is hidden by vector operations is rather high. Adding the seven rows where the scalar unit is active and at least one of the vector units is active, we see that this overlapping happens in 6.55% of all cycles. The percentage of cycles where we are executing scalar code without overlapping is 11.45%. As in our simulator one cycle is equivalent to one scalar instruction, we can see that 56.3% of all scalar instructions have been hidden by vector instructions. For the rest of the instructions, which are not overlapped, we believe that modern vector processors should have a superscalar processor to execute the scalar portion of programs, and thus the 11.45% of cycles could be further reduced.
Second, the total number of cycles where the vector LD/ST functional unit is accessing memory is very high. Since we assume a perfect memory system, we can consider that in every single cycle that this functional unit was busy, a datum was being transferred from/to memory. Adding all rows where the LD/ST unit is busy, we get that in 73.13% of all executed cycles we were accessing memory. On the other hand, we have already seen how in 79.7% of all cycles a result was produced. The comparison between these two numbers gives us insight into the balance of the programs studied. They turn out to be slightly compute bound in terms of the total number of "abstract" operations performed. Nevertheless, we have to insist on the fact that neither memory latencies nor functional unit latencies have been taken into account, and these are two very important factors that could change this balance. This is a subject that deserves further investigation.
Finally, we need to better understand the factors that limit the machine and prevent it from achieving full efficiency. We have argued that some of these factors could lie in the single bus architecture, the load chain problem or the asymmetry between the two functional units. In the next section we will quantify the relative importance of all these factors.
Limitations to instruction level parallelism
Our architecture introduces several resource limitations that can be classified into: a) not enough functional units, b) the lack of the ability to chain a computation to a load instruction and c) conflicts in the ports of the vector register file. There is still another indirect limitation introduced by the limited number of vector registers. It is d) the spill code instructions used to move data between registers and memory when no free registers are available to perform a certain computation. In this section we will deal with limitations a), b) and c) and in the next section we will look into the spill code problem.
The simulator is able to collect statistics about all the hazards that occur during the execution of the programs. When an instruction is not issued we determine all the reasons that prevented the instruction from executing and store in a table the total number of cycles that the decoder had to stall due to those reasons. For example, in the following code:
1. add v0,v1,v2
2. mul v2,v0,v3
3. add v0,v3,v1

we see how the third instruction cannot be issued for two reasons. First, it has a read port conflict with the two previous instructions. Instructions 1 and 2 use the two read ports available in register bank 0. Thus, no other instruction can simultaneously access any of the registers in that bank. In particular, instruction 3 has a read port conflict in its first operand (v0). Second, the two computational units in the vector CPU are busy and the third instruction will have to wait until one of the previous instructions finishes its execution and releases its functional unit. When this situation arises, the simulator stores in a table the total number of cycles that the third instruction had to wait prior to execution due to this combination of hazards.
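A minimal sketch of such a stall table, keyed by the full combination of hazards, might look as follows. The hazard names and the `record_stall` helper are hypothetical and chosen only for this illustration; the cycle counts are made up.

```python
from collections import Counter

# Maps a combination of hazards (a frozenset of names) to stalled cycles.
stall_cycles = Counter()

def record_stall(reasons, cycles):
    """Charge `cycles` stalled cycles to the whole combination of reasons."""
    stall_cycles[frozenset(reasons)] += cycles

# Made-up events. The first combination mirrors instruction 3 in the
# example above: a read port conflict plus no free functional unit.
record_stall({"read_port", "no_free_fu"}, 40)
record_stall({"read_port"}, 10)
record_stall({"read_port", "no_free_fu"}, 5)
```

Keying the table by the combination, rather than by each hazard separately, preserves the distinction between cycles lost to a single cause and cycles where several causes overlap.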
As another example, consider the following two instructions:
This piece of code is just moving data in memory. The second instruction stalls the decoder because it has two conflicts with the previous one. First, our machine only has one load/store unit, so it is not possible to start the store in parallel. Second, even if we had more busses connecting the CPU and memory, we have decided that we could not chain instructions to the result of a load, so the store instruction would have to wait until the load completed. We term this latter situation a "load chain conflict".
The effects of limitations a), b) and c) can be seen in table 3. Each row in the table corresponds to a different situation that stalled the machine. To understand this table, consider all vector computation instructions divided into two classes: a FU2-class instruction is an instruction that can only execute in the FU2 functional unit. That is, we have three FU2-class instructions: mul, div and sqrt. The rest of the vector computation instructions are FU1-class instructions because they can execute in both functional units. The first five columns in table 3 have the following meanings. In the column labeled FU2, a 1 indicates that the machine was stalled because there was no functional unit available to execute a FU2-class instruction. In the column labeled FU1, a 1 indicates that the machine was stalled because there was no functional unit available to execute a FU1-class instruction, which means that both functional units were busy. In the column labeled LD, a 1 indicates that the machine was stalled because the vector load/store unit was busy and could not accept more memory instructions. In the column labeled PRT, a 1 indicates that a certain instruction was not issued due to conflicts in the read/write ports of the vector register banks. Finally, a 1 in the column labeled LDX (load chain) indicates that the machine was stalled because a computation was dependent on a load instruction and, thus, could not be chained to it. The next two columns show the percentage that each one of these hazard combinations represents over the total number of executed cycles (using the weighted and arithmetic means).
Row 0 presents the percentage of cycles that the decoder stalled due to the inability to chain a computation to a load. Having a 1 only in column LDX means that we were ready to issue a computation instruction, we had a free functional unit, we had ports to access the vector registers needed as operands, but one of those operands was the result of a previous load still in progress. The percentage of cycles lost in this situation is very high (16.42%). If we add together all rows where LDX is 1, we see that the lack of chaining with loads is co-responsible for as much as 25% of stall cycles. Other vector machines, like the Cray Y-MP C90, do not have this chaining restriction and can better utilize the functional units. Alternatively, if the machine had more registers, these cycles could probably be filled with useful computations by unrolling the loops and scheduling the operations such that vector loads and dependent computations are moved apart as much as possible. In any case, bear in mind that except for row 0, in all other rows the LDX problem is not the only cause that stalls the machine. Thus, if we removed the "load chain" restriction we would be sure of eliminating the 16.42% of cycles accounted for in row 0, but probably not many more.
Row 1 presents the number of lost cycles due exclusively to port contention. Even though this row does not represent a very large fraction of all executed cycles, if we add together all rows where column PRT has a 1, we see that there is a high number of wasted cycles (14.86%) due to port contention. The architecture provides 8 read ports and 4 write ports to the vector registers and the maximum possible simultaneous requests are 4 reads and 2 writes for the functional units
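The difference between a hazard being the exclusive cause of a stall and being co-responsible for it can be made concrete with a small sketch. The rows below are illustrative only: apart from the 16.42% quoted for row 0, the percentages are invented for this example and are not the contents of table 3; the helper names are likewise hypothetical.

```python
COLUMNS = ("FU2", "FU1", "LD", "PRT", "LDX")

def co_responsible(rows, column):
    """Sum the cycle share of every row whose hazard set includes `column`."""
    if column not in COLUMNS:
        raise ValueError(f"unknown column: {column}")
    return sum(pct for hazards, pct in rows if column in hazards)

def exclusive(rows, column):
    """Cycle share of the rows where `column` is the only hazard present."""
    if column not in COLUMNS:
        raise ValueError(f"unknown column: {column}")
    return sum(pct for hazards, pct in rows if hazards == {column})

# Each row: (set of columns holding a 1, % of executed cycles).
# Only 16.42 comes from the text; the other values are made up.
rows = [
    ({"LDX"}, 16.42),
    ({"LD", "LDX"}, 3.0),
    ({"PRT", "LDX"}, 2.0),
    ({"LD"}, 10.0),
]
ldx_total = co_responsible(rows, "LDX")
ldx_only = exclusive(rows, "LDX")
```

Removing a single restriction is only guaranteed to recover the exclusive share; the cycles in the remaining rows stay blocked by the other hazards in the same combination.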
If the compiler does not perform software pipelining and/or loop unrolling, the typical starting code of a loop is a sequence of loads that will bring the data to be operated on. This sequence of loads will always conflict in our architecture. Also, at the end of the loop there is usually a sequence of several stores that will conflict among themselves and will also conflict with the load instructions at the beginning of the next iteration of the loop. Note that in this row the only conflict is the memory unit, thus increasing the number of busses to memory should decrease the percentage of cycles stalled due to this reason. If we also add all rows where LD is 1, we have that the lack of more memory units is co-responsible for stalling the machine in 33.9% of all executed cycles.
Row 4 presents a very special case. Whenever we have a load followed by a dependent store, we have two conflicts. First, we have a 1 in column LD because we need the load/store unit to issue the store but it is busy servicing the load. Second, even if we had a second memory unit, the load chain restriction would prevent us from issuing the store. Thus, we also have a 1 in column LDX. Note that this situation is fairly rare, and corresponds mostly to loops that do initialization work, like copying arrays or initializing memory. Row 6 corresponds to a still more special case in which the store following the load could also not be issued due to the fact that the read ports of the corresponding register bank were both busy (most probably, some other previous instruction is using them).
Row 5 is a more common situation. Consider the following code:
1. add v2,v3,v0
2. st v0,effa1
3. ld effa2,v0
This code can easily arise in loops. The result of the addition is stored in some array but is no longer needed in the rest of the loop body. After instruction 2, the compiler knows that register v0 is dead, and reuses it to bring some other value from memory. When trying to issue instruction number 3, we have a LD/ST functional unit conflict, and also a write port conflict on register v0, because the only write port to the register bank where v0 belongs is busy servicing instruction number 1. Obviously, there are other situations where this combination of conflicts can arise.
Let's consider rows 7 through 10. When we have a 1 in column FU1, its meaning is that both functional units were busy and so we could not issue a FU1-class instruction. These four rows show us that the programs had some more parallelism available that could be exploited by adding more functional units. If we ignore for the moment that rows 8, 9 and 10 present port/load chain conflicts, the lost cycles that could have been successfully used with a third functional unit amount to at least 6.69% of all executed cycles. Actually, if a third functional unit were added we could probably save more cycles of execution, but this will remain a topic for future research.
As we have also noted in the previous sections, the pressure on the FU2 unit is very high due to the fact that FU1 is not able to execute multiplication
Spill code
When the compiler is allocating registers in a basic block it may find itself without any free register in which to store a result. In this situation, the compiler has to insert special instructions, called spill code, that save a certain register to memory (spill it) so that the register can be reused to store some other value. The contents of the register spilled to memory will be reloaded at some later point in time into some other free register. These spill load/store instructions are not part of the computation but are an overhead introduced by the compiler due to the limited size of the register file. However, spill code is a negative contribution to performance only if it actually increases the minimum number of chimes required to execute one iteration of the loop where it appears. If a loop with spill is highly compute bound, there will most probably be many cycles where the LD/ST unit is free to be used by the spill instructions, and they will effectively be hidden by the computation instructions. On the other hand, if a loop is memory bound, every single spill instruction will lengthen its initiation interval by at least vl cycles, where vl will depend on the length of the vector registers.
If we had an architecture with infinite registers there would be no spill code at all. In this section we will study the effect of spill code on performance. There has been some previous work MP which has the same number of functional units as our architecture but has 3 paths to memory (two load busses and one store bus) and thus has many more opportunities to hide those spill instructions. We believe that in our single bus architecture the effect of spill code will be significantly higher.
To study the effects of spill on performance we have modified the simulator so that each time it finds a vector spill instruction it ignores it. This way we create the effect of having an infinite register file, and we eliminate the negative effect of spill code. We run the programs again with this version of the simulator and obtain the total number of cycles needed to execute them. We can see the effect of spill code by comparing these results with the results obtained with the original simulator. Table 4 presents the results. The first column in table 4 is the total number of cycles (in millions) needed to execute the original program and the second column is the speedup obtained when eliminating all the spill code from the programs. We can see the great variance in speedup between the different programs. BDNA happens to have extremely long basic blocks, coming from very long loops, and the compiler has to introduce a lot of spill code to execute them with only eight registers. We have measured the mean size of BDNA vector basic blocks and it is close to 120 instructions (not including the scalar instructions). From this table we can conclude that in a single bus architecture spill code has a negative effect on performance, especially for those programs that have a high register pressure.
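The argument about compute bound and memory bound loops can be illustrated with a deliberately simple resource model. The model is an assumption made for this sketch and is not the paper's simulator: it supposes one LD/ST unit, two identical arithmetic units, perfect overlap between units, and no port or load chain conflicts.

```python
import math

def chimes(mem_ops, compute_ops, spill_ops=0, n_fus=2):
    """Lower bound on chimes per loop iteration under a simple resource model.

    Assumptions (not from the paper): a single LD/ST unit, `n_fus`
    identical arithmetic units, perfect overlap, no other conflicts.
    """
    return max(mem_ops + spill_ops, math.ceil(compute_ops / n_fus))

def spill_penalty_cycles(mem_ops, compute_ops, spill_ops, vl):
    """Extra cycles per iteration caused by spill code at vector length vl."""
    extra = chimes(mem_ops, compute_ops, spill_ops) - chimes(mem_ops, compute_ops)
    return extra * vl

# Compute bound loop: the two spill operations fit in idle LD/ST slots.
compute_bound = spill_penalty_cycles(mem_ops=2, compute_ops=10, spill_ops=2, vl=128)

# Memory bound loop: each spill operation adds a full chime of vl cycles.
memory_bound = spill_penalty_cycles(mem_ops=4, compute_ops=4, spill_ops=2, vl=128)
```

Under these assumptions the spill operations are free in the first loop and cost one chime each in the second, which is the behavior described in the paragraph above.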
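The experiment can be pictured with the following sketch, which filters spill instructions out of an instruction stream and computes the resulting speedup. The `spill` flag is a hypothetical marker, since the text does not say how the simulator recognizes spill instructions, and the cycle counts in the usage lines are placeholders rather than values from table 4.

```python
def strip_spill(instructions):
    """Drop vector spill loads/stores, mimicking an infinite register file.

    Each instruction is a dict; the "spill" key is a hypothetical marker
    introduced for this sketch.
    """
    return [ins for ins in instructions if not ins.get("spill", False)]

def speedup(cycles_with_spill, cycles_without_spill):
    """Speedup obtained when spill code is eliminated."""
    return cycles_with_spill / cycles_without_spill

# Made-up instruction stream and cycle counts.
trace = [
    {"op": "ld", "spill": False},
    {"op": "st", "spill": True},
    {"op": "add"},
    {"op": "ld", "spill": True},
]
kept = strip_spill(trace)
example_speedup = speedup(108.0, 100.0)
```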
Conclusions and Future Work
In this paper we have presented quantitative measurements of the execution of vector code produced by a commercial compiler on a vector supercomputer. We have chosen a subset of the Perfect Club benchmarks and executed it using a simulator. Results show that the fraction of time executing vector computations is not as high as one would expect (around 50%). We have presented data about the utilization of the vector functional units and found that only in roughly 5% of the cycles are all vector computation units busy, while in 25% of the cycles only one of the two arithmetic units is working and the last 50% of the cycles are either load/store cycles or purely scalar cycles. The major limitation that prevents full utilization of the machine is the single bus to memory architecture, which even in the case of adding an infinite number of functional units and ports would be responsible for stalling the machine in 33.9% of all executed cycles. Related to the memory problems is the inability to chain loads and computation. This restriction is responsible for stalling the machine in a combined 25% of all executed cycles. Additional limitations found were the lack of a second functional unit able to perform multiplication and division (15.46% of executed cycles) and the limited number of ports to access the vector register file (14.86%). We have also seen that vector spill code is not negligible and that it produces an average slowdown of 8%.
The tools we have developed for this study are currently being used to evaluate solutions to the problems reported in this paper. We are currently looking at different alternatives to solve the spill problem, as well as to reduce the number of conflicts in the vector register file ports. We are also investigating new functional unit schemes to reduce the number of lost cycles.
