Characterization of machines, by studying program usage of their architectural and organizational features, is an essential part of the design process. In this paper we report an empirical study of a single processor of the CRAY Y-MP, using as benchmarks long-running scientific applications from the PERFECT Club benchmark set. Since the compiler plays a major role in determining machine utilization and program execution speed, we compile our benchmarks using the state-of-the-art Cray Research production FORTRAN compiler. We investigate instruction set usage, operation execution counts, sizes of basic blocks in the programs, and instruction issue rate. We observe, among other things, that the vectorized fraction of the dynamic program operation count ranges from 4% to 96% for our benchmarks.
1. INTRODUCTION
An understanding of instruction-level program behavior is essential in the architecture design process. However, experimental studies which collect data that provide such an understanding have, to date, been published for only a few machines. Studies of a non-vector CISC architecture, the VAX [4, 6], have been reported. Detailed studies of vector processors, using as benchmarks long-running programs compiled aggressively for performance, have not been reported.
In this paper we present a study of the CRAY Y-MP [2], using as benchmarks the PERFECT Club [5] set of application programs. The programs are compiled by the Cray Research production FORTRAN compiler, CFT77, version
3.0, for a single processor of the CRAY Y-MP. We report and analyze dynamic instruction and operation frequencies, frequencies of basic blocks of various sizes, and instruction issue rates. The data shed light on the use of a vector processor by a highly optimizing compiler.
In section 2 we describe our measurement methodology, our benchmarks, and some caveats that should be observed in using the data reported. In section 3 we discuss the usage of various instructions by the programs. In section 4 we discuss the sizes and the vectorization of basic blocks of our benchmark programs. In section 5 we briefly discuss instruction issue rates. In section 6 we draw some conclusions from the study.
STUDY ENVIRONMENT

Measurement Methods
We use two methods of data collection in our study. First, most of the data presented in this paper are dynamic counts such as instruction frequencies, basic block sizes, etc., which do not involve measurement of the time taken by the processor to execute the program. Such data can be collected very quickly by the following simple, widely used technique. Programs are composed of basic blocks of instructions, each basic block being a maximal sequence of instructions with a single entry point (the first instruction) and a single exit point (the last instruction). The branch instructions in the program determine the number of times each basic block in the program text is executed, and the sequence in which the basic blocks are executed. Statistics that are affected only by the frequency of execution of individual basic blocks, and not by the sequence in which they are executed, can be gathered easily and quickly by first collecting data for the individual basic blocks in the program and then scaling the statistics for each basic block by the execution frequency of that basic block. The dynamic execution frequencies of the basic blocks need to be collected first; instrumentation of programs to collect this information is routinely done (for example, [14]). We use a Cray Research production software tool, JUMPTRACE [11], to obtain execution frequencies of the basic blocks in our benchmarks. Another software tool analyzes the basic blocks in CRAY machine code and uses the basic block execution frequencies to scale the data collected for each basic block.

Second, we present some data regarding the time taken for program execution. Such data are collected using the Hardware Performance Monitor (HPM) available on the CRAY Y-MP. The HPM is a set of hardware counters that can be turned on during program execution to collect certain statistics in an unobtrusive manner. For example, HPM monitors program execution time, instruction issue stage utilization, and the number of floating-point, integer, and memory operations executed.
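The scaling step described above can be illustrated with a minimal Python sketch. It is not the tool used in the study; the block names, opcode mnemonics, and counts are hypothetical and serve only to show how per-block static counts are multiplied by block execution frequencies.

```python
from collections import Counter


def dynamic_counts(static_counts, exec_freq):
    """Scale per-block static opcode counts by block execution frequency."""
    totals = Counter()
    for block, counts in static_counts.items():
        freq = exec_freq.get(block, 0)  # blocks never executed contribute nothing
        for opcode, n in counts.items():
            totals[opcode] += n * freq
    return totals


# Hypothetical data: two basic blocks with made-up opcode mixes.
static_counts = {
    "B0": {"A_ADD": 2, "S_LD": 1, "BRANCH": 1},
    "B1": {"V_LD": 2, "V_FADD": 1, "V_ST": 1, "BRANCH": 1},
}
# Hypothetical execution frequencies, as a block profiler would report them.
exec_freq = {"B0": 10, "B1": 1000}

totals = dynamic_counts(static_counts, exec_freq)
for opcode, count in sorted(totals.items()):
    print(opcode, count)
```

Because only frequencies are used, the sketch cannot recover anything that depends on the order in which blocks execute, which matches the limitation stated in the text.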
Benchmarks
We use the PERFECT Club [5] programs as benchmarks in this paper. Briefly, the PERFECT Club benchmark set is the result of a large-scale benchmarking effort toward aiding supercomputer evaluation, and comprises thirteen long-running supercomputer application programs chosen to represent the spectrum of characteristics of scientific applications. These benchmarks are becoming widely accepted as standard benchmarks for supercomputer evaluation.
Program performance on any system is determined by the abilities of both the compiler and the processor. We compile our programs, for a single processor of the CRAY Y-MP, using the Cray Research production FORTRAN compiler, CFT77, version 3.0. CFT77 is an aggressive compiler that optimizes code and vectorizes it for high performance. For example, the compiler makes special efforts to hide scalar memory latency: memory instructions for operands that are consumed in future loop iterations are issued towards the end of the current loop iteration to hide memory latency. Since we study compiled code, we view our study as one of the PERFECT Club benchmarks executing on the CRAY Y-MP system comprising a state-of-the-art vectorizing and optimizing compiler, and the fine-grain parallel CRAY Y-MP processor.
In this paper we study the user routines of the benchmark programs. We do not include in our study the library routines executed by the benchmark programs. A practical reason for this is the fact that the basic block profiler currently available to us does not profile library routines¹. However, a more important reason is the fact that many of the CRAY library routines are hand-coded in assembly language due to performance considerations. This implies that library routines will utilize the processor better than compiled code. By not including the library routines in the study, we focus here on the performance of compiled FORTRAN code.

Table 1 presents the execution time, in CRAY Y-MP clock cycles or pulses (CPs), and the number of instructions executed for each of the benchmark programs. These numbers are for the execution of the entire program, i.e., they include both user and library routines. We observe that each program executes hundreds of millions of CRAY Y-MP instructions, and takes hundreds of millions of clock cycles to run. Thus, the benchmarks we study are long-running applications, as opposed to the kernels or small benchmarks that are commonly used for vector machine studies.
Caveats
All the data collected for the CRAY Y-MP using our technique of profiling at the basic block level are accurate, but for one exception that is due to the vector architecture of the CRAY Y-MP.
A vector instruction executes identical operations on a number of data elements, the number being determined at run time by the contents of a special Vector Length (VL) register. Since we do not simulate program execution, the content of the VL register during the execution of each individual vector instruction is not available to us. Assuming that each vector instruction operates on the maximum vector length of 64 elements results in overestimation of several measurements; in reality, several vector instructions operate on many fewer data elements. To overcome this problem, we use the average vector lengths reported for three different vector instruction classes (floating-point, integer, and memory instructions) by the HPM, for each program. The data collected using the average vector lengths are far more accurate than data obtained using the maximum vector length. However, there still exists a small margin for error, due to problems associated with using averages of numbers. One, the average vector lengths of the individual instructions that form each instruction class could be significantly different from the average for the whole class. Two, the average vector lengths reported by HPM are averages over the execution of the entire program, while we use these averages to study only the user routines of the program. The average vector length of the user routines could be quite different from the average for the library routines, and thus quite different from the average for the entire program execution.

¹ The current version of the profiler instruments code during compilation; the library routines, being either hand-coded in assembly language or being compiled directly into object form, are therefore not instrumented.
Program measurements² (some of which are not reported here) and investigation of important basic blocks of some of the benchmark programs lead us to believe that the error introduced by this approximation is not significant, at least for the study reported in this paper. However, one should keep this assumption in mind when using certain results.
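The effect of this approximation can be sketched in a few lines of Python. The instruction counts and average vector lengths below are hypothetical, not measurements from the study; only the maximum vector length of 64 and the three instruction classes come from the text.

```python
MAX_VL = 64  # maximum vector length on the CRAY Y-MP, per the text


def operation_count(scalar_instrs, vector_instrs, vl_by_class):
    """Expand each vector instruction into the vector length assumed for its class."""
    ops = scalar_instrs  # a scalar instruction counts as one operation
    for cls, n in vector_instrs.items():
        ops += n * vl_by_class[cls]
    return ops


# Hypothetical dynamic vector instruction counts for one program.
vector_instrs = {"float": 1000, "integer": 100, "memory": 2000}
# Hypothetical per-class average vector lengths, of the kind the HPM reports.
avg_vl = {"float": 40.0, "integer": 20.0, "memory": 48.0}
max_vl = {cls: MAX_VL for cls in vector_instrs}

estimated = operation_count(50000, vector_instrs, avg_vl)
upper_bound = operation_count(50000, vector_instrs, max_vl)
print(estimated, upper_bound)
```

With these made-up inputs, assuming the maximum length inflates the operation count by roughly a third, which illustrates why the per-class averages are preferred.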
We note in passing that simulating the programs in their entirety would require enormous amounts of CPU time, since simulations can be two to three orders of magnitude slower than the programs being simulated, and the [...]

² For example, we use a methodology (not reported here) for quickly estimating the execution times of programs. [...]

The processor can be partitioned into vector and scalar portions; the memory interface of the processor consists of four ports: three for data transfers, and one for I/O and for fetching instructions. The vector and scalar portions of the processor share floating-point functional units, but have separate functional units otherwise. The scalar portion can be viewed as consisting of an address-computation unit (an address unit, henceforth) and a scalar-computation unit, each having its own set of functional units. The processor has three register sets: a set of eight vector (V) registers with 64 elements each, a set of eight scalar (S) registers, and a set of eight address (A) registers. Furthermore, the S and A register sets have corresponding backup register sets, T and B, of 64 registers each. A backup register is used to temporarily hold values when the corresponding primary register set is full and a register needs to be spilled to make room for another value. The functional units of the processor are fully pipelined.
The CRAY Y-MP processor has one-parcel (16-bit), two-parcel, and three-parcel instructions. The architecture is a LOAD/STORE architecture: memory is accessed explicitly, and only by data-transfer instructions. All computation instructions are register-register instructions. The processor can issue one-parcel instructions at a peak rate of one per clock cycle; two-parcel and three-parcel instructions need two clocks for issue. The processor has an instruction cache (called I-buffers), but no data cache.
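As a small worked example of the issue rules just stated, the following Python sketch computes the minimum number of clocks the issue stage needs for an instruction stream, ignoring all stalls. The instruction stream is hypothetical; the per-size issue costs and the 16-bit parcel size are taken from the paragraph above.

```python
# Clocks needed to issue an instruction, keyed by its size in parcels (per the text).
ISSUE_CLOCKS = {1: 1, 2: 2, 3: 2}
BYTES_PER_PARCEL = 2  # a parcel is 16 bits


def min_issue_clocks(parcel_sizes):
    """Lower bound on issue time: no dependence or resource stalls are modeled."""
    return sum(ISSUE_CLOCKS[p] for p in parcel_sizes)


def code_bytes(parcel_sizes):
    """Code size of the stream in bytes."""
    return BYTES_PER_PARCEL * sum(parcel_sizes)


# Hypothetical stream: sizes in parcels of six consecutive instructions.
stream = [1, 1, 2, 1, 3, 1]

clocks = min_issue_clocks(stream)
peak_rate = len(stream) / clocks  # instructions per clock, best case
print(clocks, code_bytes(stream), peak_rate)
```

The best-case rate falls below one instruction per clock as soon as multi-parcel instructions appear in the stream, even before any stall is considered.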
Overall, the processor is highly pipelined to exploit fine-grain parallelism. The compiler attempts to identify and increase fine-grain parallelism in the code to take advantage of this hardware. For example, the compiler unrolls loops to increase parallelism, and software pipelines memory operations to tolerate long memory latencies (i.e., it pre-loads in the current iteration some of the memory values used in the next iteration). Increasing parallelism in code, however, results in a need for more registers. For scalar code, the compiler can take advantage of the backup registers provided, to tackle the need for registers. Loop unrolling also results in a need for large instruction caches. The Cray Research compiler limits the amount of loop unrolling to the size of the instruction cache (512 parcels or 1K bytes) to avoid repeated I-buffer misses for loop instructions [16]. (The amount of unrolling is dependent on the size of the original loop body.) All these factors affect the dynamic execution characteristics of an application program; we identify some of the effects of the above factors in the data presented in the next few sections.
Program Vectorization
The vectorization of a program is important to its performance. Vector instructions execute on a series of data elements in a pipelined fashion; they hide functional-unit latency with pipeline parallelism, and perform better than a corresponding scalar loop (see, for example, [8], Chapter 7). Thus, the fraction of program execution time spent executing vector instructions is a measure of the efficiency of program execution. This fraction is related to the dynamic frequency of vector operations in the program, although the relationship may not be linear, in our experience. The former is the best metric of the vectorization of a program since it directly represents program performance. However, since we do not study statistics related to execution time in this paper, we will rely on the dynamic frequency of operations executed with vector instructions to measure program vectorization.
Folklore has it that many scientific programs have as many as 90% of their operations executed by vector instructions (for example, this is the number quoted along with the Lawrence Livermore Loops [13]). However, scientific programs could be inherently unvectorizable due to recurrences (or data dependences across the iterations of a loop), ambiguous array subscripts, data-dependent branches, and/or subroutine calls inside loop bodies, etc. Furthermore, current limitations of state-of-the-art compilers prevent vectorization of some code that could actually be run in vector mode if compiled by hand. Here we characterize the vectorization of a scientific workload by a state-of-the-art compiler, to determine the current relative importance of the scalar and vector portions of a supercomputer.
We use the fraction of all program operations vectorized [...] We also observe that the fraction of the memory operations vectorized is quite high for ten of the thirteen programs;
³ We distinguish between operations and vector instructions throughout this paper. A vector instruction executes several operations, one on each vector element. An operation is equivalent to a scalar instruction.

We note that the fraction of all operations vectorized spans almost the entire range for the benchmarks. Insofar as the benchmarks are representative of scientific workloads, we can conclude that the average fraction of operations vectorized is much less than the usually assumed 90%. For our benchmark set, the average⁴ vectorization is 62%. Considering the clustering of numbers in the overall vectorization column of Table 2, we partition the programs into three subclasses: highly-vector, moderately-vector, and scalar programs. We will present data only for these three subclasses of programs in the rest of this paper; space constraints and the volume of data prevent us from presenting information for the individual programs. We classify the programs as follows. QCD, SPICE, and TRACK make up the scalar benchmarks; ARC3D, BDNA, FLO52, MDG, MG3D, and SPEC77 are the vector benchmarks; and ADM, DYFESM, OCEAN, and TRFD make up the moderately-vector benchmarks. (DYFESM and TRFD are on the borderline between two subclasses; we use information about instruction-issue stall times, presented in section 5, to push them into the moderately-vector category.) We believe that classifying the programs as above minimizes the loss of information caused by considering only averages of numbers, since the groups contain programs with very similar characteristics. Additionally, the classification helps us identify certain characteristics of the three subclasses.

⁴ We use the arithmetic mean of the vectorization ratios of the individual benchmarks. This ensures that all programs are given equal weight, irrespective of the number of instructions or operations executed by them individually.
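The difference between the equal-weight mean used here and a mean pooled over all operations can be sketched as follows. The program names and operation counts are hypothetical and chosen only to make the contrast visible; they are not taken from Table 2.

```python
def vectorized_fraction(vector_ops, total_ops):
    """Fraction of a program's dynamic operations executed by vector instructions."""
    return vector_ops / total_ops


# Hypothetical (vector_ops, total_ops) pairs for three programs of different length.
programs = {"P1": (90, 100), "P2": (500, 1000), "P3": (1000, 10000)}

fractions = {name: vectorized_fraction(v, t) for name, (v, t) in programs.items()}

# Equal-weight arithmetic mean of the per-program ratios.
equal_weight_mean = sum(fractions.values()) / len(fractions)

# Pooled ratio: long-running programs dominate the result.
pooled = sum(v for v, _ in programs.values()) / sum(t for _, t in programs.values())

print(round(equal_weight_mean, 3), round(pooled, 3))
```

In this made-up case the longest program is also the least vectorized, so the pooled ratio is far lower than the equal-weight mean; averaging the ratios keeps any single program from dominating.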
Instruction Usage and Operation Counts
In this section, we discuss the CRAY Y-MP instruction set usage and the operation mix in our benchmarks. We determine a benchmark subclass's usage of an instruction by averaging the normalized usage of the instruction by each of the programs in the subclass; thus, all the programs of a subclass are given equal importance, irrespective of the number of instructions executed by them individually. Since vector instructions execute several operations each, the count of various operations executed presents a better picture of a program's utilization of machine resources. The numbers for operation usage are computed by expanding each vector instruction into the average number of operations that it executes (as reported by HPM) for each program⁵. Table 3 classifies the operations executed by the benchmarks into various broad operation classes that are present in several architectures. For example, 27.82% and p.52% of all operations in the moderately-vector benchmarks are floating-point and branch operations, respectively. Data thus classified could be compared to data presented for other machines. Table 4 further subdivides the information presented in Table 3 into classes that correspond to the functional units present in the CRAY Y-MP.
⁵ If a vector instruction were replaced by scalar code, the number of operations needed to implement an equivalent loop could be 2xVL or 3xVL, instead of the VL operations needed for the vector instruction. We assume each vector instruction executes VL operations.

From Table 3, we observe that the vector benchmarks are almost entirely floating-point operations and memory references. On the other hand, the scalar benchmarks have comparable amounts of floating-point and integer operations; for the moderately-vector programs, floating-point operations are three times as frequent as integer operations.
By integer operations we mean only those operations that are executed by the scalar-computation unit. Address-computation instructions, while also being integer operations, are executed by the address-computation unit and are classified separately. However, scalar integer operations are also used to perform some address computation work, since the address unit is only 32 bits wide while the integer datatype supported by the architecture is 64 bits long. For example, when array indices are passed as arguments to subroutines they are stored as (64-bit) integers, and they hence have to be manipulated by the scalar unit. Since there is no easy way to determine whether an operation carried out in the scalar unit is for address computation, we classify integer operations carried out in the scalar unit as (non-address-computation) integer operations.
The address computation instructions, executed in the address unit, are used for generating the memory addresses needed by all scalar memory operations; in addition, they are also used for maintaining loop counters on the CRAY Y-MP. Address operations are comparatively less frequent than scalar integer operations in the scalar benchmarks, while the two are comparable in number for the moderately-vector benchmarks. On the whole, scalar programs can be expected to have a higher proportion of address arithmetic operations since they also have a higher proportion of scalar memory operations. When memory references are vectorized, the vector instruction implicitly does address arithmetic, and hence we see fewer explicit address arithmetic operations for highly vectorized code. The data presented bear this out: scalar programs have about 75599address operations, while the vector programs have less than 2%.
Register Transfers
Most strikingly, more than one-third of all operations of the scalar benchmarks are operations used to transfer values between the various register sets in the processor (miscellaneous category in Tables 3 and 4). The proportion of these register transfer operations decreases as we move to more vectorized programs; the pressure on the non-vector registers is of course higher in scalar code. Although scalar programs have a high proportion of these miscellaneous operations, the contribution to execution time of these operations could be quite small, and disproportional to their number. The deep pipelines in the CRAY Y-MP for computation instructions cause long waits in the instruction-issue stage for dependences to be resolved. Scalar code has a lot of dependences and hence has frequent instruction-issue stalls. The compiler can hide the cost of the single-cycle register-register move instructions by scheduling them for execution during these data-dependence stalls, thus essentially executing them for free. This is one example of the possibility of a large difference between the dynamic frequency of an instruction and its contribution to program execution time. This difference is all the more important in a machine that has several parallel, pipelined functional units, since several instructions can be executing simultaneously, thus making the attribution of program time to instructions more difficult.
Larger register sets would naturally decrease the number of spill instructions. However, larger register sets would not be worthwhile if they increase the machine clock cycle, especially if the spill instructions incur little cost in execution time anyway. MOV instructions, on the other hand, are mainly a result of the register file and functional-unit architecture of the processor. The functional units and the register files have been separated into A and S sets to provide more parallel, decoupled execution, and MOV instructions are a necessary part of this separation. Also, in the current Y-MP architecture, the A unit does not have shift, logic, and 64-bit integer calculation functionalities. Therefore, many of the MOV instructions move data from [...]

⁶ Register spilling refers to moving a temporary value in a register to memory, to make room in the register set for another value that will be accessed before the value being spilled. In the CRAY Y-MP, register spilling usually only results in a movement of values between the primary (A or S) registers and the backup (B or T) registers, and we term these moves spill instructions.

[...] (up to 64 words long) between the scalar backup registers and memory; the number of words transferred is determined at runtime by the contents of a general-purpose scalar register. Since we do not simulate program execution, and since the HPM does not monitor general-purpose registers, we do not have access to this value. Hence we are unable to expand each BLK_LD/BLK_ST instruction into the equivalent multiple operations it executes.

[...] overlap hazards between block reads and block writes of memory to be detected in software, and the CMR instruction can be used to ensure sequentiality of such memory references.) Hence BLK_STs are much more frequent than BLK_LDs for the vector programs.
Floating-Point Operations
The scalar benchmarks have an equal number of floating-point additions and multiplications. The vector benchmarks also have equal numbers of these operations, with the difference that almost all of them are executed by vector instructions. We see a fairly good balance of these operations again in the moderately-vector benchmarks.

From Table 4, we observe that memory operations are the single most frequent class of operations for the vector and moderately-vector benchmarks, and they are second only to the register-register move operations in the scalar benchmarks. Considering that memory access is not a short-latency operation, this justifies the extra attention paid to the memory system in the CRAY X-MP and the CRAY Y-MP (the CRAY-1 had a single memory port, while the X-MP and the Y-MP have three data memory ports). We note from Table 2 that for the non-scalar benchmarks, usually more than 90% of the memory operations are executed in vector mode (V-LD and V-ST). When executed in vector mode, the memory latency for an individual operation is hidden by the pipelined nature of operations, and this is significant for performance. Therefore the machine has less need to rely on a data cache for fast memory accesses. Vector scatter/gather instructions (V-GATH and V-SCAT), which transfer data to or from a set of memory locations specified in a vector register, are used quite infrequently even in the highly-vectorized benchmarks. The need for these instructions is, however, dependent upon the nature of programs.
Address Computation
Most of the address computation instructions are additions (A-ADDS). Address computation instructions are used, for example, to add an index to a base register. Address computation instructions are also commonly used in the CRAY Y-MP for incrementing the loop counter. Address multiplication operations are infrequent.
Branch Instructions
As expected, branch instructions are most frequent in scalar code. However, the branch frequency in Table 4 is much less than the usual 20% of all instructions or so reported for general-purpose programs [6, 9, 12]. A significant reason for this is the fact that the compiler unrolls loops. First, this eliminates several loop-control branches. Second, loop unrolling results in the compiler generating several spill instructions because of the increased pressure on the primary registers. These instructions are not present in other architectures, and they decrease the proportion of branch instructions on the CRAY Y-MP. We also note that scientific code inherently has fewer branches than nonscientific code. We discuss the frequency of branches in more detail in the section on basic blocks.

In addition to the frequency of branches, the nature of the branches is important to machine efficiency in executing programs. Unconditional branches, for example, do not have to cause bubbles in the pipeline since the branch destination is known at compile time itself. Unconditional branches form a non-negligible 7.4% of all branch instructions for scalar code; their proportion is much less in the other two benchmark subclasses.

Subroutine calls are implemented in the CRAY Y-MP by a special branch instruction that saves the current program counter (PC) at a specific location and branches to the subroutine. Having a large number of subroutine calls can result in code that is less vectorizable. For example, loops with subroutine calls are usually not vectorizable (except for some vector intrinsic function calls where a vector can be passed as an argument to the function and then the function is executed in vector mode). The data bear this out: close to 25% of the branches in the scalar programs are subroutine calls, while their proportion is around 10% of all branches for the other two benchmark sets. We also note from Table 4 that branches themselves are less frequent in the vector benchmarks.
Conditional branches are the most frequent of all branches, across all programs. Although conditional branches are detrimental to performance, the more predictable loop-control conditional branches can be handled efficiently. Conditional branches in the CRAY machines are decided based on the contents of a register; the register used could be either A0 or S0. Usually the compiler uses conditional branching based on A0 for loop-control branches, since loop counters are maintained and incremented in the A unit. Conditional branching based on S0 is used for implementing data-dependent branches, such as if-then-else constructs. Table 5 splits conditional branches into the above two classes. The data indicate that 50% of all branches in the scalar benchmarks are data-dependent conditional branches. Also, data-dependent branches are about two-and-one-half times as frequent as loop-control branches.
Given that scientific code is dominated by loops, one can expect most of these branches to occur within loop bodies. Loops with data-dependent branches within them are likely to be scalar; it is thus natural to find scalar code having a significant fraction of these branches.
We observe that for the moderately-vector and vector benchmarks the proportion of loop-control branches has increased, indicating fewer data-dependent branches per loop. This is a good reason why these benchmarks are more vectorizable than the scalar set.
BASIC BLOCKS
A basic block is defined to be a straight-line fragment of code with a single entry point (the first instruction) and a single exit point (the last instruction). Once program control enters a basic block, all the instructions in it are executed. The entry point could be the start of a program or either the destination or fall-through location of a branch; the exit point is either a branch or an instruction preceding the destination of a branch (since the destination of a branch is the start of a new basic block).
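The definition above corresponds to the standard way of finding block boundaries by marking block entry points. The following Python sketch is illustrative only and is not the profiler used in the study; it assumes a hypothetical program given as a count of instructions (at least one) plus a map from each branch instruction's index to its destination index.

```python
def basic_blocks(n_instrs, branches):
    """Partition instruction indices 0..n_instrs-1 into basic blocks.

    `branches` maps the index of each branch instruction to its target index.
    Returns a list of (first_index, last_index) pairs, one per block.
    """
    leaders = {0}  # the start of the program is an entry point
    for src, target in branches.items():
        leaders.add(target)  # a branch destination starts a new block
        if src + 1 < n_instrs:
            leaders.add(src + 1)  # so does the fall-through location
    starts = sorted(leaders)
    ends = starts[1:] + [n_instrs]
    return [(start, end - 1) for start, end in zip(starts, ends)]


# Hypothetical 10-instruction program: a forward branch at 3 and a backward branch at 8.
blocks = basic_blocks(10, {3: 7, 8: 1})
print(blocks)
```

Each block produced this way ends either at a branch or at the instruction just before a branch destination, as in the definition.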
The nature and sizes of basic blocks play an important role in determining program performance, because several compiler optimizations (such as local register assignment and code scheduling) are conducted within basic block boundaries unless the hardware or the compiler supports speculative execution of instructions that lie beyond as-yet-unexecuted branches. Larger basic blocks provide better opportunities for effective code scheduling. Folklore has it that basic blocks are small: papers in the literature report average branch instruction frequencies of 15% to 20%, and thus small basic blocks, in general-purpose programs [8, 9, 12]. In addition to the size of the average basic block, the distribution of basic blocks with respect to their sizes is important, since if both small and large blocks exist in significant numbers, the compiler could incorporate different techniques to tackle the two varieties of basic blocks.

For scalar programs however, 90% of the blocks are shorter than 21 instructions. Scalar programs, with more frequent branches, expectedly have smaller basic blocks.
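Statements about the distribution of block sizes, such as the share of blocks below a given size, can be computed from a size histogram weighted by execution counts. The sketch below assumes such a histogram is available; the sizes and counts are hypothetical and do not reproduce the paper's data.

```python
def fraction_below(size_freq, limit):
    """Fraction of dynamically executed blocks with fewer than `limit` instructions."""
    total = sum(size_freq.values())
    below = sum(freq for size, freq in size_freq.items() if size < limit)
    return below / total


def median_size(size_freq):
    """Smallest block size at which the cumulative execution share reaches 50%."""
    total = sum(size_freq.values())
    running = 0
    for size in sorted(size_freq):
        running += size_freq[size]
        if 2 * running >= total:
            return size


# Hypothetical histogram: block size in instructions -> number of executions.
size_freq = {1: 100, 5: 400, 12: 300, 40: 150, 130: 50}

print(fraction_below(size_freq, 21), median_size(size_freq))
```

The median is read off the cumulative distribution rather than computed as a mean, which keeps a few very large blocks from skewing the summary.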
Blocks that are larger than 125 instructions are non-negligible in number for all three program subclasses: they form about 3% of the instructions for vector programs, about 2% for the moderately-vector programs, and about 1.5% for the scalar programs. The scalar programs have a large number of single-instruction blocks (about 11%). Surprisingly, the vector programs also have about 9% single-instruction blocks [...]

In addition to the nature of the application, the nature of the compiler and the machine play a significant role in determining the size of the basic blocks. A major reason for the large basic blocks is that the compiler unrolls loops to exploit parallelism, and to tolerate the long latencies of the CRAY Y-MP, eliminating several⁸ loop branches in the [...]

⁷ Due to the large variance in the data, the arithmetic mean is somewhat undesirable in characterizing the data. We prefer to use the median instead. This is where the graph crosses the 50% mark on the y-axis.
⁸ The exact amount of loop unrolling, and hence the number of branches eliminated, is dependent on the size of the original loop body, since the compiler unrolls loops up to the size of the instruction buffer (1K bytes).

This is of course due to the fact that the deep pipelines exploit the parallelism in the programs even with the small issue rate. For example, the floating-point add unit has 7 pipeline stages; thus, after issuing a floating-point add, 6 other instructions can be issued before the floating-point add completes. Other factors, discussed later in this section, are also responsible for the low utilization of the instruction issue stage. We note, however, that there may be phases during program execution when the amount of parallelism is high and the issue stage is a bottleneck.
As expected, the vector benchmarks have lower issue rates than the other benchmarks. Vector instructions execute several operations, and hence fewer instructions need to be issued for vectorized programs; if we were to consider operation issue rates instead, the numbers would be much higher for the vector benchmarks. However, another important reason for the low issue rate is the CRAY Y-MP hardware organization. There exists more parallelism among the vector instructions of programs than among the scalar instructions, but the CRAY Y-MP has a relatively limited amount of vector resources (vector registers, memory ports, functional units), resulting in instruction issue stalls due to resource conflicts. The limited resources result in long instruction issue stall times because each vector instruction reserves all necessary resources for time proportional to the number of operations it executes.
The stalls in instruction issue seen in scalar code are mainly due to data dependences between instructions, since each instruction holds registers only for time proportional to the latency of the instruction's functional unit. The functional units themselves are pipelined and are not a bottleneck, since they can collectively accept scalar instructions at a much higher rate than the issue stage can issue them. We note that despite having a large proportion of spill instructions, the issue stage is not very highly utilized by scalar code. This indicates that the spill instructions are indeed being issued for free during clock periods which would otherwise have been stalls for data dependences to be resolved. The highly optimizing compiler would in most cases be able to schedule the spills in this manner; we note that in some cases the spill instructions could be on the critical path and hence may not be executed for free. We also note that although the issue rate is less than 60% on the whole, considering the highly pipelined functional units in the CRAY Y-MP and the long latency for memory, the utilization of the issue stage is quite high.
The fraction of time spent issuing the second and third parcels is small for most of the non-scalar benchmarks. The scalar benchmarks, however, keep the issue stage busy for an additional 20% of the time to issue second and third parcels of instructions. This is because the two-parcel and three-parcel instructions of the CRAY Y-MP are all scalar memory operations, which are found in high numbers only in scalar code.
In conclusion, the issue stage does not appear to be a bottleneck in the CRAY Y-MP for the PERFECT Club benchmarks. Improving the issue stage utilization and program execution speed can be achieved by increasing the resources in the vector unit (i.e., vector registers, functional units, and possibly memory ports) and by cutting down the latency of scalar operations (i.e., the functional units and memory).
CONCLUSIONS
We presented a study of the single processor of the CRAY Y-MP using as benchmarks programs from the PERFECT Club set. We characterized the processor by presenting the instruction usage pattern for our benchmarks. We observed, among other things, that on the CRAY Y-MP the vectorized fraction of the dynamic operation count ranges from 4% to 96% for our benchmarks. The average fraction of all operations vectorized is 62% for our benchmark set. Instructions that move values between the scalar registers and their backup registers constitute a significant fraction of all the instructions executed. Very large basic blocks (greater than 100 instructions in size) are significant in number for the benchmarks studied. Furthermore, both small and large basic blocks contribute significantly to the operation count of the programs. Thus, it is worthwhile to concentrate on speeding up both small and large basic blocks. The instruction issue rate is less than 0.5 instructions per cycle for our benchmarks, leading us to believe that the issue stage is not a bottleneck during program execution.
