A model of high-performance computers is derived from instruction timing formulas, with compensation for pipeline and cache memory effects. The model is used to predict the performance of the IBM 370/168 and the Amdahl 470 V/6 on specific programs, and the results are verified by comparison with actual performance. Data collected about program behavior is combined with the performance analysis to highlight some of the problems with high-performance implementations of such architectures.
Introduction

General Goals
One of the most important tasks for a computer designer is the evaluation of a computer architecture and its implementation. As two specific instances of that task, we consider (1) a comparison of the performance of the IBM 370/168 Model 1 and the Amdahl 470 V/6, which are two machines with the same architecture but different implementations, and (2) an analysis of some of the properties of the IBM 370 instruction set.
The basic goal is to apportion the time spent by an executing program among the various system components such as the cache memory, the instruction pipeline and the individual instructions, so that resource utilization and system bottlenecks will appear. This is achieved by using models of the CPU of each machine which also provide estimates of the total CPU times.
The total time is important insofar as it is used to verify the accuracy of the model, since the predicted times are compared to the actual performance of the machines.
The decision to make implementation dependent measures of CPU performance for two members of a specific architecture family has several advantages: (1) Some of the traditionally difficult problems encountered when comparing two different architectures are not present, since many confounding factors relating to performance evaluation have the same effect on both machines. (2) The success of one of the levels of a complex system can often be measured by the characteristics of the levels below. Performance evaluation which is close to the implementation level of a computer gives valuable design information at the architecture level. (3) The speed of collection and the precision of the results are greatly enhanced by having tools that are tailored for a specific instruction set. (4) Practical and useful results can be obtained quickly, paving the way for more general studies.
* Work supported in part by the U.S. Energy and Research Development Administration under contract E(043)515.
+ Work done while a Visiting Scientist at the Stanford Linear Accelerator Center.
Previous Studies
The evaluation of computer systems from the buyer's point of view has traditionally received a great deal of attention.
The system software often requires careful and tender tuning, and bottlenecks which can have dramatic effects on performance must be identified and removed. An abundant literature addresses these problems and provides techniques for solution [AGA75].
The computer system designer has similar problems to solve, but most of the existing literature is not written for his viewpoint. One explanation for this phenomenon is the lack of feedback; users seldom complain about hardware design because they feel that their complaints will have little effect.
The result is a scarcity of information for use by the designer. Most of the studies closest to this work deal with the collection of data on instruction frequencies. The most frequent objectives involve (1) benchmark studies, (2) computer design, (3) language design, and (4) general programmer curiosity.
Some studies leave all interpretation to the reader, and become a useful source of primary data [GIB, CON].
The studies most applicable to the computer designer's point of view often provide instruction frequencies, register utilization, opcode pairs, and static vs dynamic frequency comparisons, but little timing or performance information [LUN, FLY, WIN, HAN, AGA73, ANA, FOS71ab]. The language-oriented studies have provided similar information for specific languages, studying the match between the language and the machine code to which they must be translated [ALE72, HEN, ALE75].
When their interest is only in performance evaluation, users have generally been advised to use benchmark runs instead of instruction mixes based only on instruction frequencies [ARB, SNI]. The use of timing information with these instruction mixes is made difficult by the lack of published information from the manufacturers, in particular for the high-performance machines. (Amdahl is an exception in this regard [AMD].) This has forced users to produce their own documents [LIP, EME].
The manufacturers themselves must have studied these questions, and some expurgated papers reveal glimpses of large-scale efforts and sophisticated tools but offer few results [VAN, HUG, MUR].
The previous studies have shown that very few instructions (often four or five) represent 50% of those executed, and a few more (often 20 to 30) represent 90%.
This would seem to justify the idea that a few instructions will account for most of a program's behaviour and one can neglect instructions whose frequencies are below a certain threshold. Unfortunately this applies only to a specific program. No trend has been shown in the importance of instructions, because the instructions which make up the 50%, 90% and 100% groups of a program are dependent on the program, the programmer, and the language used. The only instructions which seem universally important are the branches, which most often account for about 15-30% of the instruction counts, but which still show wide variation.
The difficulty with the frequency analysis approach is that for performance evaluation the designer needs information about the instructions which account for most of the execution time. Attempting to derive performance conclusions from an instruction frequency list yields poor results because some instructions can be hundreds of times slower than others. To obtain acceptable performance results the designer needs to consider machine dependent variables because they are required for precise evaluation of the instruction execution time.
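The gap between a frequency ranking and a time ranking can be illustrated with a small sketch. The opcode mnemonics are real 370 instructions, but every count and timing below is invented for illustration and is not taken from the measurements reported here; the helper names `shares` and `top_group` are likewise hypothetical.

```python
# Hypothetical per-instruction data: (dynamic count, average time in usec).
# All numbers are invented for illustration only.
mix = {
    "L":   (300_000, 0.2),
    "BC":  (250_000, 0.2),
    "ST":  (150_000, 0.2),
    "MVC": (20_000, 5.0),
    "DP":  (5_000, 40.0),
}

def shares(mix):
    """Return {opcode: (fraction of instruction count, fraction of time)}."""
    total_count = sum(c for c, _ in mix.values())
    total_time = sum(c * t for c, t in mix.values())
    return {op: (c / total_count, c * t / total_time)
            for op, (c, t) in mix.items()}

def top_group(fractions, threshold):
    """Opcodes taken in decreasing order until they cover `threshold`."""
    group, covered = [], 0.0
    ranked = sorted(fractions.items(), key=lambda kv: kv[1], reverse=True)
    for op, f in ranked:
        group.append(op)
        covered += f
        if covered >= threshold:
            break
    return group

s = shares(mix)
by_count = top_group({op: v[0] for op, v in s.items()}, 0.5)
by_time = top_group({op: v[1] for op, v in s.items()}, 0.5)
print("50% group by count:", by_count)
print("50% group by time: ", by_time)
```

With these made-up numbers the two 50% groups have no instruction in common, which is the kind of mismatch that makes a frequency list alone a poor basis for performance conclusions.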
The Instruction Timing Model
The Methodology
The models of the CPUs used here are based on the instruction timing formulas available from the manufacturers' documents which describe their computers [AMD, IBM]. These documents sometimes sacrifice details for ease of exposition (which is not to say that they are easy to read!) and represent only the best efforts of an engineer to describe the existing machine.
(In deriving the model for the Amdahl machine we were quite fortunate to get some help from the designers.)
The programs to be measured were traced in user state, and all the information required to compute the instruction execution time from the formulas was collected. A record was made of counts of occurrences, values of instruction variables used in the formulas, and information about memory performance. Typical variables depend on the specific instruction but may also depend on the implementation details. For example, the number of bytes moved is implementation independent, but measures of pipeline interlocks and timing delays are not.
Some variables depend on instruction environment and therefore require information about instruction pair and triple distributions.
Two primary constraints caused us to trace only user-state instructions.
(1) Tracing system software, with the attendant performance degradation of at least 50 to 1, would modify operating system behavior in timing dependent I/O sections. By tracing only in user mode, which is basically not speed dependent, we eliminate a source of error which would necessitate a complicated interpretation of the results.
(2) Tracing the operating system introduces a large number of problems involving the recording of the trace data. One standard solution is the use of samples rather than complete traces, but then the verification of the predicted CPU time is nearly impossible.
Since the timing formulas do not include the effects of cache memory misses, the cache memory is simulated for each machine. The cache penalty is added to the instruction execution time to obtain the expected program execution time. To verify the model the expected time is compared to the operating system accounting time corrected to compensate for the differences between the measurement methods.
The effects of instruction interaction, which can generally be attributed to pipeline resource interlocks, are rather explicitly accounted for in the Amdahl formulas. For IBM, however, the pipeline effects seem to have been averaged into the formulas in a way which was not clearly indicated.
This was a potential source of difficulty, but the effort required to obtain this information from the logic diagrams and microcode listings was prohibitive, and unjustified when an error of a few percent is acceptable.
The techniques used here are much more complex than benchmarking, but not as costly as total hardware simulation. The tools are general enough that they can be, and have been, used for other studies. The importance, however, lies in the ability to change the model variables to reflect proposed changes to the existing hardware and to accurately predict the performance effects of those changes.
Choice of Factors
The development of the CPU model has been greatly influenced by the idea of an evolving system of tools: development by successive refinement. A crude model and simple tools were first assembled and by successive iteration new tools, new measurements, and a more refined model were designed. We think this approach reduced the number of false starts and the elapsed time of the whole study by allowing us to concentrate quickly on the most important factors.
The CPU model used is an intermediate one between full simulation at the hardware register level and a machine-independent representation of performance. The decision to include some factors and exclude others was based on our estimation, often supported by experimentation, of the effect of those factors on the final results.
Some of the justifications for the decisions are presented below.
The accuracy of the model is supported by the match between the program execution time as predicted by the model and the same time measured by the operating system during actual runs. Performance evaluation by benchmarking is repeatable only within 2-3% because of the large number of uncontrollable variables, and this therefore defines the required precision of the model. An examination of previously published instruction frequencies might suggest that the more frequent instructions are those whose duration is constant and therefore do not heavily depend on execution variables like the length of operands. If this were true, then those variables could be set to program-independent values without introducing a significant error in the result. To test this hypothesis, the program which computes execution times was given three sets of execution variables with which to predict program running time. One was a programmer's best guess of the true values, and the other two were the smallest and largest extremes which could realistically be expected. The results showed that an instruction could jump from 4% to 50% of the total time depending on the value of its variables with all others remaining the same. This is an unacceptable error, especially since errors in the variables for many instructions could combine to form large systematic errors. Most of the variables which affect execution time were therefore measured exactly or estimated from related measurements.
The predicted execution time is composed of the aggregate instruction timing results and a penalty for cache memory misses. The aggregate instruction timing results have already taken into account the instruction counts and basic execution speed, as well as the pipeline interlocks. The cache miss penalty depends on the reference pattern of the program, the cache organization, and the data flow pattern within the machine. The two machines differ rather markedly in those respects: the 370/168 uses aligned doubleword (8-byte) accesses and an associative set size of 8, while the 470 accesses unaligned fullwords (4 bytes), uses a set size of 2, but has the same total amount of data (16K bytes).
There are also rather significant differences in the amount and type of instruction lookahead performed.
To accurately measure the cache penalty, the trace analysis program has a detailed simulation of the cache and instruction fetch mechanism of both machines.
Although cache memory miss ratios are known to be low [MER], it is easily shown that the contribution of the time penalty for the misses is too large to be neglected.
If the miss ratio is 5%, with a 480 nsec penalty for a miss, 2 memory requests per instruction, and an average instruction execution time of 300 nsec (reasonable values for the 370/168), then the cache misses cost 0.05 x 2 x 480 = 48 nsec per instruction, which represents 16% of the execution time.
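The arithmetic behind this estimate can be checked with a few lines. The function name is hypothetical; the inputs are the values quoted in the text. The sketch also shows the slightly smaller figure obtained if the penalty is expressed as a share of the total (instruction plus penalty) time, which is our own alternative reading and not a number from the paper.

```python
def cache_penalty_per_instruction(miss_ratio, refs_per_instruction, miss_penalty_ns):
    """Expected cache miss time charged to one instruction, in nsec."""
    return miss_ratio * refs_per_instruction * miss_penalty_ns

t_ins_ns = 300.0                       # average instruction time from the text
penalty_ns = cache_penalty_per_instruction(0.05, 2, 480.0)

# Penalty relative to the 300 nsec instruction time (the 16% in the text).
relative_to_instruction_time = penalty_ns / t_ins_ns

# Penalty as a share of instruction time plus penalty (our alternative view).
share_of_total_time = penalty_ns / (t_ins_ns + penalty_ns)

print(f"penalty per instruction: {penalty_ns:.1f} nsec")
print(f"relative to instruction time: {relative_to_instruction_time:.1%}")
print(f"share of total time: {share_of_total_time:.1%}")
```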
Two other cache organization features must be considered in the cache penalty correction. For IBM, stores always access main memory ("store-through") which may cause extra delays. For Amdahl, there is an extra penalty when a 4-byte access crosses a cache line boundary.
These and the other cache corrections are not attributed to the instructions which caused them, but rather accumulated separately.
The execution time reported by the operating system includes all user-state and some supervisor-state instructions [BEN], whereas the trace program measures only user-state instructions.
The time attributed to these supervisor-state instructions executed in the processing of user-initiated supervisor calls (SVCs) must be subtracted from the reported CPU time.
Measurements were made of the charged time for all the relevant SVCs as the programs were traced. The correction is very significant for almost all programs, since both the number and cost of the SVCs are high. For the 168, for example, the time charged varies from 107 usec for an I/O operation to 26 msec for opening a file.
Although the SVC time correction could have been measured for the original benchmark programs, they were somewhat modified in view of the substantial correction required (as much as 20%).
Wherever possible, the number of I/O operations was reduced by increasing the file blocking factors, but we did not otherwise alter the operation of the programs.
Despite this effort, the SVC time correction remained the factor which introduced the largest error in the measurements. We also added a FORTRAN numerical analysis program from which the I/O parts were excised, so that few supervisor services were requested.
Since supervisor-state and user-state instructions share the same cache, there will be some displacement of the user's "working set" from the cache in response to an SVC, which will manifest itself as a lower than normal hit ratio when the user's program is resumed. An unpublished note by Rossman suggested that this would have a significant effect [ROS]. To verify this we simulated the cache activity for one job with a large number of SVCs first assuming a 100% cache flush for each SVC, and then again with no flush; the number of cache misses changed by a factor of 10. Measurements showed that the actual fraction of the cache displaced by an SVC varies from 0.16 to 1.0, and that almost all non-trivial requests completely replace the cache.
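The flush experiment can be sketched with a toy set-associative cache using LRU replacement. This is only an illustration and is far simpler than the simulator used in the study: it ignores instruction fetch, store policy and alignment. The 16K size and the set sizes of 8 and 2 come from the text; the 32-byte line size, the class and function names, and the synthetic address sweep are our assumptions.

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Toy LRU set-associative cache that only counts hits and misses."""

    def __init__(self, total_bytes, ways, line_bytes):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = total_bytes // (ways * line_bytes)
        # One OrderedDict per set; insertion order tracks recency of use.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        line = address // self.line_bytes
        lines = self.sets[line % self.num_sets]
        if line in lines:
            lines.move_to_end(line)        # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)      # evict the least recently used
        lines[line] = True
        return False

    def flush(self):
        """Model an SVC that displaces the whole cache."""
        for lines in self.sets:
            lines.clear()

def count_misses(trace, cache, flush_every=None):
    """Run a trace, optionally flushing after every `flush_every` references."""
    for i, address in enumerate(trace):
        if flush_every and i and i % flush_every == 0:
            cache.flush()
        cache.access(address)
    return cache.misses

LINE_BYTES = 32                            # assumed line size, not from the text
sweep = list(range(0, 16 * 1024, 4)) * 2   # two passes over 16K bytes of data

no_flush = count_misses(sweep, SetAssociativeCache(16 * 1024, 8, LINE_BYTES))
with_flush = count_misses(sweep, SetAssociativeCache(16 * 1024, 2, LINE_BYTES),
                          flush_every=len(sweep) // 2)
print(no_flush, with_flush)
```

The synthetic sweep fits the cache exactly, so it shows only the mechanism; the size of the effect on a real program depends on its reference pattern and on how often supervisor calls occur.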
Interrupts which occur during the execution of the program do not account for a significant increase in accounted time (since the user-state CPU timer is disabled during interrupt processing) but there could be an effect due to cache displacement caused by the interrupt routine.
On a heavily loaded machine interrupt rates as high as 4000 per minute are common, representing 16.4 ms of extra time (1.7% for IBM) to completely refill the cache for each second of CPU time. Since most of those interrupts are due to other jobs, this effect was reduced to a negligible level by running the job on an otherwise idle system, so that only the few interrupts caused by the benchmark job itself could cause interference. This is unlike the SVC correction, for which no change in the number of cache flushes is possible simply by controlling the environment of the benchmark run. Similar calculations for the effect of channel I/O transfers to memory show that they have even less effect on CPU performance. This is true both for IBM, where the channels transfer directly to main memory and invalidate corresponding cache entries, and for Amdahl, where the channels transfer into the cache.
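The 16.4 ms figure can be reproduced under one assumption that is not stated in the text: a cache line of 32 bytes, so that the 16K-byte cache holds 512 lines. With the 480 nsec miss penalty and 4000 interrupts per minute, a complete refill after every interrupt gives the quoted value. The function name is hypothetical.

```python
def refill_time_per_cpu_second(interrupts_per_minute, cache_bytes,
                               line_bytes, miss_penalty_ns):
    """Seconds spent refilling the whole cache, per second of CPU time,
    if every interrupt displaces the entire cache."""
    lines = cache_bytes // line_bytes
    refills_per_second = interrupts_per_minute / 60.0
    return refills_per_second * lines * miss_penalty_ns * 1e-9

# 32-byte lines are an assumption chosen because it reproduces the text's figure.
extra = refill_time_per_cpu_second(4000, 16 * 1024, 32, 480.0)
print(f"{extra * 1e3:.1f} ms per second of CPU time")
```

Compared with the remaining useful time, this is roughly 1.7%, which may be how the percentage in the text was obtained; that reading is our guess.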
Overview of the Measurement Programs
An interpretive trace program (TRACE) generates a record for each user-state instruction of the measured program. The record contains the instruction type, memory addresses referenced, and the other required information. These records are processed by a trace analysis program (ANALYSIS) which generates instruction counts, variable values, and memory access statistics such as cache memory miss counts, which are stored in a summary file. In order to avoid saving massive amounts of intermediate trace information (25 megabytes per traced second), the TRACE and ANALYSIS programs execute as coroutines. The combined overhead of the trace and trace analysis programs amounts to 300 seconds per second of real time. This compares favorably to other more detailed hardware simulations, where the overhead has been as high as 6000 seconds per second of real time [VAN].
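The coroutine arrangement can be pictured with Python generators, used here only as a stand-in for whatever mechanism the original TRACE and ANALYSIS programs employed. Each trace record is handed directly to the analysis routine and then discarded, so no intermediate trace file is ever written. The record layout (opcode, list of addresses) and all names are hypothetical.

```python
from collections import Counter

def analysis(summary):
    """Consumer coroutine: receives one trace record at a time and keeps
    only running totals, never the records themselves."""
    while True:
        opcode, addresses = yield
        summary[opcode] += 1
        summary["memory_refs"] += len(addresses)

def trace(program, sink):
    """Producer: emits one record per executed instruction to the sink."""
    for record in program:
        sink.send(record)

summary = Counter()
sink = analysis(summary)
next(sink)                      # advance the coroutine to its first yield

# A made-up four-instruction "program": (opcode, memory addresses referenced).
program = [("L", [0x1000]), ("ST", [0x2000]), ("L", [0x1008]), ("BC", [])]
trace(program, sink)
sink.close()

print(dict(summary))
```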
The summary file is converted into a count file by an intermediate program (CONVERT). The count file contains all the information required to compute the timing formulas for both machines condensed into about 500 numbers. An instruction statistics program (INSTAT) uses the count file and files of encoded instruction timing formulas to produce the final timing and performance information.
We devised several test programs for verifying the formulas and understanding the measurement factors.
A general instruction timing program (LTIMER) was designed for precise measurements of instruction times, cache memory miss penalties, SVC times, and the effects of SVCs on cache memory contents.
The Instruction Timing Formulas
An instruction may have several timing formulas associated with it, corresponding to different modes of execution.
Each individual timing formula may depend linearly on the variables (the most common case) or have a more complicated dependence. In general, three types of linear formulas are encountered. where R is the number of registers loaded.
Some formulas may involve variables which are concerned with the general environment of the instruction. These are often measures of the effect of pipeline interference which causes a delay in the execution of an instruction. Examples are the Amdahl variables S1 and DWD. S1 accounts for some cases of pipeline interlocks, and ranges from 0 to .065 usec depending on the "number of execution cycles attributable to the three words of the instruction stream following the instruction of interest" [AMD]. DWD, which is either 0 or .0325 usec, compensates for the occurrence of a doubleword result instruction before the subject instruction, because the machine is fundamentally single word oriented.
Store (ST), Amdahl: .065 + S1 + DWD
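As an illustration of how such a linear formula is used, the sketch below evaluates the Amdahl Store formula for individual executions and for an aggregate described only by a count and average variable values. The constant and the ranges of S1 and DWD are those quoted above; the function names and the three sample executions are hypothetical.

```python
def amdahl_store_time(s1, dwd):
    """Time in usec for one ST on the 470, from the formula .065 + S1 + DWD."""
    if not 0.0 <= s1 <= 0.065:
        raise ValueError("S1 ranges from 0 to .065 usec")
    if dwd not in (0.0, 0.0325):
        raise ValueError("DWD is either 0 or .0325 usec")
    return 0.065 + s1 + dwd

def aggregate_store_time(count, mean_s1, mean_dwd):
    """Total time for `count` executions using only average variable values.
    This is exact because the formula is linear in S1 and DWD."""
    return count * (0.065 + mean_s1 + mean_dwd)

# Three made-up executions of ST: (S1, DWD) in usec.
samples = [(0.0, 0.0), (0.0325, 0.0325), (0.065, 0.0)]

exact = sum(amdahl_store_time(s1, dwd) for s1, dwd in samples)
mean_s1 = sum(s1 for s1, _ in samples) / len(samples)
mean_dwd = sum(dwd for _, dwd in samples) / len(samples)
from_averages = aggregate_store_time(len(samples), mean_s1, mean_dwd)

print(f"{exact:.4f} usec vs {from_averages:.4f} usec")
```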
When several formulas are associated with one instruction, each formula applies only to a specific case of its execution. For example, the Move Character instruction execution formulas depend in important ways on the degree of overlap of the two operands. In these formulas,
B = number of bytes moved
W = number of words moved
WB = number of bytes which must be moved to have the destination on a word boundary when B > 63.
For all the individual linear formulas, we need only accumulate the counts and average variable values for each of the timing formula cases.
Unfortunately, some formulas are not linear in their variables.
Typical examples are the decimal arithmetic instructions, where the duration depends on the product of the lengths or the average value of the digits used. For these we compute the appropriate products of variables at the time the program is analyzed, and average these values for use by the other programs in an equivalent linear form. These cases of non-linear formulas are sufficiently infrequent to justify this special treatment, but the effect on timing values is too important to ignore them. A simpler approach would assume that the product of the averages is a sufficient estimate of the average product, but the potential error is great.
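A small numerical sketch shows why the product of the averages can be a poor estimate of the average product. The operand lengths below are invented: four executions of a hypothetical decimal instruction whose time is taken to be proportional to the product of its two operand lengths.

```python
# Made-up operand lengths (in digits) for four executions.
len1 = [1, 1, 15, 15]
len2 = [1, 1, 15, 15]

# Average of the products, accumulated while the program is analyzed.
mean_of_products = sum(a * b for a, b in zip(len1, len2)) / len(len1)

# Product of the averages, the simpler but less accurate shortcut.
product_of_means = (sum(len1) / len(len1)) * (sum(len2) / len(len2))

relative_error = (product_of_means - mean_of_products) / mean_of_products
print(mean_of_products, product_of_means, f"{relative_error:.0%}")
```

The shortcut underestimates the true average whenever the two lengths tend to be large together, as they are in this example.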
The formulas are encoded as a string of records, each corresponding to the coefficient of a term in a subcase of a timing formula for a particular instruction; there are a total of 3200 variable names and coefficient values. A numbering and naming scheme was devised that allows variables which are common to many formulas to be propagated to all appropriate places, as well as giving individual identities to variables which are more specific.
Verification of the Model
Measurement of Cache Miss Penalty
Although cache miss penalty information is available from the manufacturers, it was difficult to interpret precisely what the effect on instruction time is. Since measurements are not difficult and the correction could be significant, the values were verified experimentally. To determine the cost of a cache miss, a test program simply fills the cache with known data. A second loop is then timed, in which either the same data is reloaded, or new data displaces the old.
The difference in time between the two versions of the second loop, divided by the number of cache misses caused by the loop which displaces the data, provides the cache miss time.
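The calculation described above amounts to a single division. In the sketch below the loop times and the miss count are invented so that the result equals the 480 nsec reported for IBM; they are not measured values, and the function name is hypothetical.

```python
def miss_penalty_ns(t_displacing_ns, t_reloading_ns, misses):
    """Cache miss penalty from two timings of the second loop:
    one that displaces the old data and one that reloads the same data."""
    if misses <= 0:
        raise ValueError("the displacing loop must cause at least one miss")
    return (t_displacing_ns - t_reloading_ns) / misses

# Hypothetical timings: 512 misses added 245,760 nsec to the loop time.
t_reloading = 100_000.0
t_displacing = 345_760.0
penalty = miss_penalty_ns(t_displacing, t_reloading, misses=512)
print(f"{penalty:.0f} nsec per miss")
```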
The value found for IBM is 480 nsec, which is not inconsistent with information from the hardware manuals.
For Amdahl, cache misses are found to cost 650 nsec, which also agrees with information from the designers.
Once the cache miss penalty is established, the effect of a supervisor request on the user data in the cache can be measured easily. In a similar fashion the cache is filled with known data, the SVC is issued, and the cache is refilled with the same data.
The second loop is timed, and compared to the identical loop when the SVC is not present. The time difference divided by the cache miss penalty gives the number of cache lines that were displaced by the SVC. Note that the second loop must fill the cache in the opposite order from the first loop, otherwise the LRU replacement algorithm would cause the original data to be removed instead of the data added by the SVC. Table 1 shows the fraction of cache displacement for some of the more common supervisor requests.
One of the most interesting differences of implementation between the two machines is the effect of data stores on the cache. The IBM approach is to always store data directly into main memory, and to update the cache only if the line already exists. The Amdahl machine updates the cache line if the data is present without storing into main memory. If the data is not in the cache, the line will be read from memory. If the replacement algorithm must remove a line which was modified in the cache, the memory is updated at the time the line is replaced.
The IBM method, called "store-through", has often been criticized because it requires a main memory access for all stores [KAP]. Although the store can proceed in parallel with subsequent instructions, any subsequent main memory accesses must be suspended until the memory becomes available. Since the timing formulas do not explicitly account for this effect, it is important to determine its magnitude.
There are three factors which combine to minimize the possible deleterious effects of the store-through policy used by IBM. The first is that the memory is organized with four-way interleaving of adjacent doublewords, so that consecutive stores may well reference separate memory banks. The second is simply that based on the opcode pair distribution we have accumulated, consecutive instructions which store data into memory are relatively infrequent.
The third is that even for pairs of such instructions, there appears to be a level of buffering for data that must be written to main memory, at least for the case when that data is also in the cache. A penalty appears only for the third consecutive store, and then is 360 nsec. The full write cycle time penalty of 640 nsec occurs only for the fourth and subsequent store. These factors are sufficient to justify not including a difficult-to-compute correction for store-through writes.
SVC Time Measurement
As previously discussed, the CPU time charged for SVCs was measured in order to be able to correct the time given by the operating system. The time charged for each SVC is often large and varies from program to program even for the same SVC type.
To account for these variations we measured the time charged to the user for each SVC as the benchmark programs were being traced.
The SVC correction computed by summing the measured SVC times is therefore quite accurate for the 168 because it was the machine used for the tracings. For the 470, the timing program LTIMER was used to give estimates of the average SVC costs. This latter method does not take into account the variation from program to program and the SVC corrections are much less accurate than for the 168. Table 1 shows the time charged for some important SVCs averaged over all programs.
It is interesting that the time charged for supervisor services is often comparable to what would be required if there were no operating system. For I/O operations, previous measurements have shown that the hardware I/O instructions (SIO, TIO, etc.) are incredibly expensive; 100 usec is not unusual [JAY]. This is to be compared with, for instance, the measured charge of 107 usec for the request to the operating system for an I/O operation. Note that both of these are more than two orders of magnitude larger than, for example, the 0.61 usec needed for a double precision floating point multiplication.
It would seem that improvements in the arithmetic units of computers have not been accompanied by similar improvements in the I/O interface despite the existence of I/O channels.
The Benchmark Jobs
The results presented here are derived from the analysis of seven benchmark jobs written at SLAC. Except for one (LINSY2) they were all production jobs written for purposes other than performance evaluation. To avoid biasing the results with artifacts from specific languages or programs, we purposely chose the three most used language compilers and programs compiled by them.
(1) FORTC is a compilation by the IBM Fortran-H optimizing compiler.
(2) FORTGO is the execution of the FORTRAN program compiled by FORTC. It is a numerical analysis program which solves partial differential equations.
(3) PL1C is a compilation by the IBM PL/I-F compiler.
(4) PL1GO is the execution of a PL/I program which accumulates and prints accounting summaries from computer use information.
(5) COBOLC is a compilation by the IBM ANSI Standard COBOL compiler.
(6) COBOLGO is the execution of a COBOL program which reformats and prints computer use accounting information. 
Model Validation
Verification basically consists of comparing the time predicted by our model for each benchmark job with the corrected real execution time. The time predicted for each benchmark, Tpred, consists of the following terms:
Tins, the total time predicted from the timing formulas, which does not include the cache miss penalty.
M * Tmiss, where M is the number of cache misses as reported by the cache simulator, and Tmiss is the cache miss penalty. The number of cache misses includes the effect of SVC execution on the cache contents.
Tcross, the time penalty, for Amdahl only, paid when references to the cache cross a line boundary. The penalty is two cycles (.065 usec) for reads and three cycles (.0975 usec) for writes, and is computed using numbers provided by the cache simulator. Virtually all the penalty arises from instruction fetch, since none of the programs access unaligned data. There is no equivalent penalty for IBM because its larger instruction buffer prefetches enough so that two successive doublewords can be accessed without introducing an additional delay.
The corrected time for the actual execution, Trun, consists of the following terms:
Tacc, the time as given by the standard IBM accounting routines.
Tsvc, the time attributed to the user for the execution of all the supervisor calls, which must be subtracted from Tacc.
Table 3 provides the values for each of these times for each of the benchmarks. For Tpred and Trun, the relative percentage of each of their components is given. The absolute error, Trun-Tpred, and the percent error, (Trun-Tpred)/Trun, appear on the last lines. The verification process points to large discrepancies between raw execution speed (Tins) and the speed as perceived by the user (Tacc).
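The comparison can be written out as a short calculation using the terms defined above: Tpred = Tins + M * Tmiss + Tcross and Trun = Tacc - Tsvc. The function names are hypothetical, and the sample numbers are invented; they are not values from Table 3.

```python
def predicted_time(t_ins, misses, t_miss, t_cross=0.0):
    """Tpred = Tins + M * Tmiss + Tcross (Tcross applies to Amdahl only)."""
    return t_ins + misses * t_miss + t_cross

def corrected_run_time(t_acc, t_svc):
    """Trun = Tacc - Tsvc."""
    return t_acc - t_svc

def model_error(t_run, t_pred):
    """Return (absolute error, relative error) as defined in the text."""
    absolute = t_run - t_pred
    return absolute, absolute / t_run

# Invented example, times in seconds: 2 million misses at 480 nsec each.
t_pred = predicted_time(t_ins=8.0, misses=2_000_000, t_miss=480e-9)
t_run = corrected_run_time(t_acc=10.0, t_svc=1.0)
abs_err, rel_err = model_error(t_run, t_pred)
print(f"Tpred={t_pred:.2f} s, Trun={t_run:.2f} s, error={rel_err:.2%}")
```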
The results for IBM are generally extremely good; for all except one program the differences between the predicted and actual running time are less than 2%. The agreement for Amdahl is not as good, but we attribute most of the error to the crude method for measuring the SVC time correction. A factor of two in the SVC correction, which is certainly conceivable when an OPEN as measured on the 168 can vary from 6 to 33 msec, could easily account for all the error.
Table 4 gives the opcodes which account for at least 50% of all instructions executed for each of the benchmark jobs. In addition to the frequencies of execution, the table gives the fraction of execution time attributable to each of the instructions listed. Note that it is time. Divide Decimal (DP) accounts for 18.65% of the Amdahl time for COBOLGO, and Translate and Test (TRT) accounts for 5.38% of the IBM time for PL1C. The particular strengths and weaknesses of the implementations are apparent; the Amdahl implementation of DR suffers in comparison to IBM (FORTGO), whereas IBM fares rather poorly on STM. Certain dips in performance are clearly evident, and two such examples appear in COBOLC. The Execute (EX) instruction, which the Amdahl designers expected not to be important, is a particularly obvious problem, and has been noted before [EME]. The Exclusive Or Character (XC) instruction, which accounts for 8.31% of the execution time, is almost always a case of overlap discussed later, which IBM optimized but Amdahl did not.
Instruction Length
The 370 architecture has three instruction lengths: 2, 4, and 6 bytes, which loosely correspond to register to register, register to memory, and memory to memory instructions. Table 6 gives the fraction of each type encountered and the average instruction length.
The average instruction length does not vary considerably from program to program; the range is 2.92 to 4.49, with most programs around 3.6 bytes. The only exceptions are the COBOL programs, for which 6-byte storage to storage instructions predominate, and the LINSY2 program, for which 2-byte register to register instructions predominate.
Although the average does not vary considerably, the proportion of 4-byte instructions varies from 46% to 81%, and similarly 2-byte instructions vary from 15% to 60%.
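The average instruction length follows directly from the three fractions. The sketch below uses an invented mix (30% 2-byte, 60% 4-byte, 10% 6-byte), chosen only because it lands near the 3.6 bytes typical of the programs studied; it is not a row of Table 6, and the function name is hypothetical.

```python
def average_instruction_length(f2, f4, f6):
    """Mean instruction length in bytes from the fractions of 2-, 4- and
    6-byte instructions, which must sum to 1."""
    if abs(f2 + f4 + f6 - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return 2 * f2 + 4 * f4 + 6 * f6

# Invented instruction mix for illustration.
avg = average_instruction_length(0.30, 0.60, 0.10)
print(f"{avg:.2f} bytes")
```

Because the 4-byte term dominates and the 2-byte and 6-byte terms pull in opposite directions, quite different mixes can yield similar averages, which is consistent with the observation that the average varies less than the individual fractions.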
The high fraction of 2-byte instructions for LINSY2 results from the fact that most of the instructions executed are part of a short (26 byte) inner loop that was highly optimized by the compiler.
For most programs studied, branch instructions represent a considerable fraction of all instructions executed (usually 15% to 30%). In five of the seven programs traced, at least one of the branch instructions (usually the simple conditional branch BC) appears in the 50% group.
In Table 7, the column marked '% Count' indicates the fraction of all instructions executed that were potential branch instructions. The column marked '% Success' which follows, shows the fraction of those potential branches that were successful.
In the 370 architecture there are two classes of branches: unconditional branches, and conditional branches whose success depends on values at execution time. Each class contains both successful and unsuccessful branches.
The only unusual subclass is the unconditionally unsuccessful branch, which is a no-op instruction. The second part of Table 7 shows the fraction of branches in each of these four subclasses as a fraction of all potential branches encountered.
Branch instructions can create difficulties for pipelined implementations of computer architectures. The instruction fetch mechanism is often a stage in the pipeline which is independent of the instruction decoder, and therefore does not recognize branch instructions. A naive implementation results in a large number of unnecessary instruction fetches following a branch instruction, since the recognition of the need to fetch instructions from the branch target comes too late.
To address this problem the 168 has a rather sophisticated mechanism by which both the instructions following the potential branch and the instructions at the branch target are fetched into two separate sets of instruction buffers. Although the fraction of success for potential branches seems to be a fairly consistent 60-80%, ...
In contrast, the 470 simply treats branch instructions as if they had memory operands, and uses the normal memory operand fetch mechanism to fetch the first two words at the branch target location. Pipeline complexity is minimized by having the execution unit determine the results required for conditional branches as early as possible. This is consistent with the very successful philosophy of the Amdahl designers to keep the pipeline as simple as possible.
Since we generally find that branch instructions represent a smaller percentage of the execution time for the 470 than the 168, it appears as though the decision to use a simpler mechanism was a good one.
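The four branch subclasses tabulated in Table 7 can be illustrated for the Branch on Condition (BC) instruction, whose 4-bit mask selects the condition codes on which the branch is taken. The sketch below is an assumption about how a trace tool might classify such branches, not a description of the TRACE program; the function names are hypothetical.

```python
def bc_taken(mask: int, condition_code: int) -> bool:
    """A BC branch is taken when the mask bit for the current condition code is 1.

    Mask bit values 8, 4, 2, 1 correspond to condition codes 0, 1, 2, 3.
    """
    return bool(mask & (8 >> condition_code))


def classify_bc(mask: int, condition_code: int) -> str:
    """Place one executed BC into one of the four subclasses."""
    if mask == 15:
        return "unconditional, successful"
    if mask == 0:
        return "unconditional, unsuccessful (no-op)"
    if bc_taken(mask, condition_code):
        return "conditional, successful"
    return "conditional, unsuccessful"


print(classify_bc(8, 0))   # mask 8 tests condition code 0
print(classify_bc(0, 2))   # mask 0 never branches
```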
Branch and Execution Distances
One of the common criticisms of the 370 architecture involves the absence of program-counter-relative branch instructions. Table 9 is a typical branch distance distribution which supports this criticism, since 75-85% of the branch distances are within 2048 bytes of the program counter. The displacement of 12 bits used in RX branch instructions could therefore have been used for most branches so that base registers would have been unnecessary for most program references. The fact that 50-60% of the branch distances are within 128 bytes of the program counter indicates that even an 8-bit displacement could be used to considerable advantage.
Although 95-99% of the longer branch distances are within 32K bytes, there are still a substantial number of longer branches (8M bytes and above) representing calls to supervisor routines far from the user's program area.
Most programs show a few important peaks in the branch distance distribution corresponding to the important program loops.
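The correspondence between displacement width and reach (12 bits for 2048 bytes, 8 bits for 128 bytes) assumes a signed, byte-granular displacement. Under that assumption, the coverage of a hypothetical relative branch can be estimated from a list of measured branch distances, as in this sketch. The sample distances are invented for illustration and are not from Table 9.

```python
def fits_signed_displacement(distance: int, bits: int) -> bool:
    """True if a branch distance in bytes fits a signed displacement of `bits` bits."""
    limit = 1 << (bits - 1)
    return -limit <= distance < limit


def coverage(distances, bits: int) -> float:
    """Fraction of branch distances reachable with a signed `bits`-bit displacement."""
    if not distances:
        return 0.0
    return sum(fits_signed_displacement(d, bits) for d in distances) / len(distances)


# Hypothetical distances (target address minus program counter), in bytes.
sample = [-6, 40, -100, 300, -1500, 5000, 9_000_000, 12]
print(coverage(sample, 8), coverage(sample, 12))
```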
Note that the asymmetry around the program counter is not sufficient to justify other than a symmetric signed displacement for relative branch instructions.
Table 10 shows information related to execution distance, which is defined to be the number of bytes of instructions executed between successful branch instructions. The last column gives the equivalent distance in number of instructions, obtained by dividing the average execution distance by the average instruction length for that program. It would seem to be a reasonable estimate of the true average number of instructions between successful branches.
For most programs, the average execution distance is surprisingly small (less than 32 bytes, which is the cache line size) but the standard deviation is large. There are often isolated peaks for relatively large execution distances (see Table 11). With the exception of the PL1GO program, which has the highest average execution distance, 77% to 85% of execution distances are less than 32 bytes. Distances less than 16 bytes account for 40-60% of the execution distances. This tends to justify the choice of 32 bytes for the line size of the cache on both machines, at least as far as instruction fetch is concerned. This is also consistent with older designs for instruction fetch buffers, such as the IBM 360/91 which has a 64 byte instruction stack.
The measurement of opcode pair frequencies confirms that the overall frequency of an opcode is not independent of the surrounding instructions. Pair occurrences are also important in performance analysis because of pipeline interlocks and other miscellaneous issues such as memory store-through. Table 12 gives the five most frequent opcode pairs for each program. It is not uncommon for the measured frequency of those pairs to be 4 to 9 times greater than the product of the individual opcode frequencies.
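The comparison between a measured pair frequency and the product of the individual opcode frequencies can be sketched as follows. This is a minimal illustration on a made-up trace, assuming the trace is available as a list of opcode mnemonics; it is not the instruction statistics program used for the measurements.

```python
from collections import Counter


def pair_dependence(trace):
    """Map each adjacent opcode pair to (measured pair frequency) divided by
    (product of the two individual opcode frequencies).

    A ratio near 1 would indicate independence; larger values indicate that
    the pair occurs more often than independence would predict.
    """
    n = len(trace)
    if n < 2:
        return {}
    single = Counter(trace)
    pairs = Counter(zip(trace, trace[1:]))
    n_pairs = n - 1
    return {
        (first, second): (count / n_pairs)
        / ((single[first] / n) * (single[second] / n))
        for (first, second), count in pairs.items()
    }


# Hypothetical trace of a small loop executed twice.
toy_trace = ["L", "A", "ST", "C", "BC", "L", "A", "ST", "C", "BC"]
ratios = pair_dependence(toy_trace)
print(round(ratios[("C", "BC")], 2))
```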
An examination of the frequent opcode pairs fails to discover any pair which occurs frequently enough to suggest creating additional instructions to replace it. Many of the instruction pairs which do occur frequently are those that when combined would save only one opcode field since the other instruction fields would still be required. Examples of this nature are test or compare instructions followed by conditional branches (TM/BC, C/BC). Many other frequent pairs are artifacts of the program structure; a simple example is the pair which consists of a loop branch and its target instruction. Alexander [ALE75] mentions the load-branch pair as an extremely frequent one for the XPL compiler (L-BC is 12.4% of the count). We find no pairs with such high frequencies, and in particular find the load-branch combination to be significant only in two of the seven programs.
Frequent pairs often result from peculiarities of software conventions; the subroutine-call instruction (BALR) is often followed by the unconditional branch (BC) because the first instruction in almost all subroutines is a branch around the name of the program.
For the FORTGO program, the extra branches (which could be easily eliminated by putting the name before the first instruction of the subroutine) cost 0.70% of the execution time of the entire program.
Many of the programs have a similar extra cost of between 0.5% and 1.0% due to the same convention.
The distinction between the distribution of instruction pairs executed and the static distribution of instruction pairs in the program text should be carefully made. Our results do not contradict findings based on static analysis [FOS71a, HEH] that certain pairs of instructions might be frequent enough to justify replacement by a single instruction to improve code density.
Registers and Address Calculation
The 370 architecture expresses addresses as the sum of a 24-bit base value in a register with a 12-bit displacement in the instruction.
Some instructions allow an additional 24 bit quantity in another register to be used as an index. In all cases specification of register 0 for the base or index indicates that a value of zero is to be used in lieu of the contents of the register.
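The address calculation just described can be sketched in a few lines. The function below is an illustrative model rather than a hardware description: the name `effective_address` and the register-file representation are assumptions, and the result is truncated to 24 bits on the assumption that carries out of the address are ignored.

```python
ADDRESS_MASK = 0xFFFFFF  # 24-bit addresses


def effective_address(regs, base: int, displacement: int, index: int = 0) -> int:
    """Sum base, optional index, and 12-bit displacement into a 24-bit address.

    `regs` is a sequence of 16 general register values. Register number 0 in
    the base or index field means "use zero", not the contents of register 0.
    """
    if not 0 <= displacement < 4096:
        raise ValueError("displacement must fit in 12 bits")
    base_value = 0 if base == 0 else regs[base] & ADDRESS_MASK
    index_value = 0 if index == 0 else regs[index] & ADDRESS_MASK
    return (base_value + index_value + displacement) & ADDRESS_MASK


# Hypothetical register contents.
registers = [0x111111] + [0] * 15
registers[12] = 0x001000   # base register
registers[3] = 0x000010    # index register
print(hex(effective_address(registers, base=12, displacement=0x234, index=3)))
```

Calling the function with `index=0` models the non-indexed case that, according to Table 13, dominates in practice.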
The hardware does not distinguish between registers which contain addresses and registers which contain index values, so statistics about base and index register utilization are difficult to relate to the program organization. Nevertheless, information about the occurrence of zero in the register fields can be easily interpreted. Table 13 shows that it is very infrequent for instructions to specify the use of both index and base registers. Except for the program LINSY2, which is known to have many array references, 80% to 95% of the indexed instructions do not use both base and index registers.
A reorganization of the 370 addressing modes could profitably include a non-indexed mode in which the space saved is used for a longer displacement.
The distribution of register utilization for address calculation shows that no more than 3 registers account for most of the use. The others are used for address calculation less frequently, or are used for program accumulators.
Operand Lengths
The TRACE program accumulates the distribution of the lengths of all the operands for instructions for which the operand lengths are not implied by the opcode.
These operand lengths are either fixed and defined in other fields of the instructions (like the number of registers specified in the Load Multiple instruction), or are data dependent (like the number of bytes which must be referenced before an inequality is detected in a Compare Character instruction). These variables are required to calculate the instruction execution times.
For the purposes of exposition we have divided the variable operand length instructions into three classes: (1) the multiple register load and store instructions (LM and STM), (2) the character instructions, and (3) the decimal instructions.
LM/STM.
The STM and LM instructions save and load a contiguous set of registers designated by a starting and ending register. From one to sixteen registers may be moved by a single instruction. Table 14 shows a typical distribution (from FORTGO) of the number of registers stored and loaded. It is common for there to be two peaks, one for a low value of about 2 to 3 registers for accessing data stored in consecutive words, and another at a high value of 11 to 15 registers for saving and restoring registers across procedure calls. The LM and STM are not used symmetrically: for a given number of registers loaded or stored the frequency counts are often quite different. For the FORTGO program, the average number of registers used for STM is 13.23, and for LM is 5.99. For both machines, the marginal cost of storing one more register is smaller than the execution time of a load or store instruction, but there is a higher overhead for starting each instruction for IBM than for Amdahl. In both cases it is faster to use several store or load instructions when 3 or fewer registers are involved. Despite the fact that these instructions are never among the most frequent, they contribute much more to the CPU time than their frequency would suggest because of their long execution time. For the FORTGO program for example, the 0.67% of instructions which are STM account for 6.66% of the IBM execution time and 4.59% of the Amdahl execution time.
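The disproportion between frequency and time can be expressed as a single ratio: an instruction's share of execution time divided by its share of the instruction count equals its average time relative to the mean time per executed instruction on the same machine. The sketch below applies this to the FORTGO figures for STM quoted above; the function name is an illustrative assumption.

```python
def relative_cost(time_share_pct: float, count_share_pct: float) -> float:
    """Average time of one instruction type, in units of the mean time per
    executed instruction, derived from its time share and count share."""
    if count_share_pct <= 0:
        raise ValueError("count share must be positive")
    return time_share_pct / count_share_pct


# STM in FORTGO: 0.67% of the count, 6.66% (IBM) and 4.59% (Amdahl) of the time.
stm_ibm = relative_cost(6.66, 0.67)
stm_amdahl = relative_cost(4.59, 0.67)
print(f"IBM: {stm_ibm:.1f}x mean, Amdahl: {stm_amdahl:.1f}x mean")
```

On these figures an average STM costs roughly ten mean instruction times on the 168 and roughly seven on the 470.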
Character Instructions.
The second group of storage-to-storage (SS) instructions are those which specify a source and destination location for a character string and a single length for both operands in the range 1 to 256. One of the characteristics of these instructions that makes their implementation very difficult is that overlapped operands are allowed and must be treated a byte at a time.
This allows, for example, a single byte to be propagated throughout a string by a move instruction whose destination address is one greater than the source address, since the fields are processed left to right. Lower performance machines in the 370 family implement these instructions in all cases by processing each byte individually, but for high performance machines this would be too slow. Therefore both computers exhibit execution speeds for the non-overlapped cases which are much higher than that for overlapped.
For the IBM Move Character instruction, for example, the non-overlapped case takes 40 nsec per byte moved, but 240 nsec per byte of overlapped move.
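The propagation effect of an overlapped move follows directly from the byte-at-a-time, left-to-right definition, as the following sketch shows. It models storage as a Python `bytearray`; the function names are illustrative assumptions. The per-byte times are the IBM figures quoted above and exclude the startup overhead, which is not given here.

```python
def mvc(storage: bytearray, dest: int, src: int, length: int) -> None:
    """Move `length` bytes from src to dest one byte at a time, left to right."""
    for i in range(length):
        storage[dest + i] = storage[src + i]


def ibm_mvc_byte_time_nsec(length: int, overlapped: bool) -> int:
    """Per-byte component of the IBM MVC time (startup overhead excluded)."""
    return length * (240 if overlapped else 40)


# Fill a field with '*' by moving it onto itself shifted right by one byte.
buffer = bytearray(b"*ABCDEFG")
mvc(buffer, dest=1, src=0, length=7)
print(buffer)
print(ibm_mvc_byte_time_nsec(80, overlapped=True), "nsec for an 80-byte fill")
```

A block-at-a-time copy would give a different result for the same operands, which is why the overlapped case cannot simply reuse the fast path.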
On jobs for which MVC is a frequent instruction (PL1C and COBOLC) we find that the nonoverlapped case occurs about 50 times more frequently than the overlapped case. However, the average number of bytes
The overlapped MVC instructions are used primarily to fill a work area with a specific character, and are probably most used to initialize I/O buffers. This is confirmed by the peaks near 80 and 133 which correspond to card and line printer buffers. For programs which don't otherwise use MVC but still do I/O, the overlapped case is an even higher fraction of all occurrences of MVC. For FORTC, for example, the 6% overlapped MVCs account for 52% of the MVC time.
Table 15 is the distribution of operand length for the MVC instruction in FORTC. It is representative of the other distributions in the presence of large peaks for small values, and an overall average of 10.06 bytes. Since the startup overhead for these instructions is large, there is almost always a less expensive way to do the equivalent operation for a small number of bytes. For one byte, an IC/STC combination takes less than half the time of a one-byte MVC on both machines.
Most of the other instructions in this variable operand class are much less frequent than MVC. Among them are the instructions for which the number of bytes processed may be much smaller than indicated in the instruction, such as Compare Character (CLC) and Translate and Test (TRT). For these instructions, the distribution of the length specified in the instructions is a poor indicator of the length actually used.
A typical example is COBOLC, where the average CLC instruction specifies 4.53 bytes, but an average of
Another instruction of note is the Exclusive Or Character (XC) which is predominantly used in total overlap mode in order to zero fields. This fact was used to advantage in the 168, where the total overlap case is specially optimized to be 15 times faster than the other overlap cases. This was not done for the 470, which explains why XC accounts for 9.6% of the COBOLC program for the 470, but only 3.0% for the 168.
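The total overlap case works because any byte exclusive-ORed with itself is zero. A minimal sketch, using the same byte-at-a-time model of storage as a `bytearray` (the function name is an illustrative assumption):

```python
def xc(storage: bytearray, dest: int, src: int, length: int) -> None:
    """Exclusive-OR `length` bytes at src into dest, one byte at a time."""
    for i in range(length):
        storage[dest + i] ^= storage[src + i]


# Total overlap: source and destination are the same field, so it is zeroed.
field = bytearray(b"PAYROLL")
xc(field, dest=0, src=0, length=len(field))
print(field)
```

Since the outcome of the total overlap case does not depend on the data, an implementation can recognize it and clear the field directly, which is presumably the kind of shortcut the 168 optimization exploits.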
Decimal Instructions.
The third group of storage-to-storage instructions consists primarily of those for decimal arithmetic. They appear in significant numbers only in the COBOLGO program. For that program, however, they account for 26.29% of the count, and represent 66.39% of the IBM execution time and 64.30% of the Amdahl execution time. These instructions can vary in execution time by as much as 16 to 1 depending on the operand lengths, but the large execution time arises despite the fact that relatively short operands are common. Most operands are 2 to 6 bytes long even though the maximum possible is 16. The average execution time of the Divide Decimal (DP) instruction is about 15 usec for both machines. Not surprisingly, the average instruction execution rate for the COBOLGO program (.810 MIPS for IBM, 1.353 MIPS for Amdahl) is drastically smaller than the average for all the programs (3.519 MIPS for IBM, 5.518 MIPS for Amdahl). Considering the popularity of COBOL as a programming language, these instructions, which require slow serial byte processing, represent a major degradation of the speed of the machines.
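The MIPS figures can be converted into mean times per instruction to put the 15 usec Divide Decimal time in perspective. The sketch below uses only the rates quoted above; the function name is an illustrative assumption.

```python
def mean_instruction_time_usec(mips: float) -> float:
    """Mean time per instruction in microseconds for a rate given in MIPS."""
    if mips <= 0:
        raise ValueError("MIPS rate must be positive")
    return 1.0 / mips


rates = {
    "COBOLGO": {"IBM": 0.810, "Amdahl": 1.353},
    "all programs": {"IBM": 3.519, "Amdahl": 5.518},
}
for job, by_machine in rates.items():
    for machine, mips in by_machine.items():
        print(f"{job:12s} {machine:6s} {mean_instruction_time_usec(mips):.3f} usec")

# Speed ratio of the 470 over the 168 on each workload.
cobolgo_ratio = rates["COBOLGO"]["Amdahl"] / rates["COBOLGO"]["IBM"]
overall_ratio = rates["all programs"]["Amdahl"] / rates["all programs"]["IBM"]
```

On these figures a single DP takes more than ten times the mean COBOLGO instruction time on either machine.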
In view of the poor performance of many of the variable operand length instructions, their inclusion in the architecture of a high-performance computer is questionable. The absence of such instructions in machines like the CDC 7600 and the CRAY-1 is indicative of their emphasis on high speed. The arithmetic which must occur before these instructions begin their data transfer suggests that it is quite difficult to optimize them for short operands. A compromise, if the execution of these instructions cannot be optimized, may be to supply simpler instructions from which the more complex character and decimal instructions can be composed, as illustrated by the byte instructions of the PDP-10. An immediate improvement could be obtained if compilers were to replace these instructions by faster equivalents when they are available, but this would require tailoring the compilers to specific models of the computer series.
Cache Effects
The correction due to cache misses ranges from 1% to 5% for IBM, but from 3% to 19% for Amdahl, indicating that the memory subsystem is a major bottleneck for the Amdahl machine. In some sense the memory architecture forces the 470 to lose some of the raw speed advantage of the CPU. There are two factors which contribute to the problem. The cache organization of the Amdahl machine produces from 1.7 to 3 times the number of cache misses, and the penalty for each miss is 1.56 times that for IBM. Thus the overall cache penalty for Amdahl is 2.5 to 4 times more than IBM, whereas the raw execution speed, defined as Tins (the time required to execute the instructions with no cache misses) is 1.9 times faster than IBM.
The loss due to the cache organization could have been eliminated, but to maintain the raw speed advantage would have required a cache miss penalty of 250 nsec, which would not have been economically feasible at the time. The dilemma of Amdahl may result from a mismatch between the MOS memory chips available commercially and its proprietary ECL LSI technology which is far more advanced.
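The way a larger cache penalty erodes a raw speed advantage can be illustrated with a two-term model, total time = Tins + cache time. The sketch below is a simplified assumption, not the timing model of this paper: it takes the raw speed ratio of 1.9 quoted above, a cache penalty ratio of 3 (inside the quoted 2.5 to 4 range), and a hypothetical reference machine that spends 3 time units on cache misses for every 100 units of Tins.

```python
def cache_fraction(t_ins: float, t_cache: float) -> float:
    """Fraction of total time spent on cache misses (assumed two-term model)."""
    return t_cache / (t_ins + t_cache)


def observed_speedup(t_ins_ref: float, t_cache_ref: float,
                     raw_speedup: float, cache_penalty_ratio: float) -> float:
    """Speedup over the reference machine when instruction time shrinks by
    `raw_speedup` but total cache time grows by `cache_penalty_ratio`."""
    t_fast = t_ins_ref / raw_speedup + t_cache_ref * cache_penalty_ratio
    return (t_ins_ref + t_cache_ref) / t_fast


# Hypothetical reference machine: 100 units of Tins, 3 units of cache time.
speedup = observed_speedup(100.0, 3.0, raw_speedup=1.9, cache_penalty_ratio=3.0)
print(f"raw 1.90x becomes {speedup:.2f}x")
print(f"cache share on faster machine: {cache_fraction(100.0 / 1.9, 9.0):.1%}")
```

Even with these modest hypothetical inputs, part of the raw advantage disappears and the cache share of total time on the faster machine is several times that of the reference machine.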
Pipeline Effects for the 470
Because the timing formulas for the Amdahl machine include specific pipeline variables, we can assess their effect on the execution. The pipeline is optimized for 4-byte instructions which have single word operands, and any deviation causes potential conflicts with subsequent instructions.
The seven pipeline variables depend upon local instruction sequences (for example Sl and DWD described earlier), and therefore cannot be computed from global averages.
The exact evaluation of these variables would require a complete and complex simulation of the pipeline at the time the program is traced. As a compromise, we use the pair and triple frequency data collected while tracing to reconstruct instruction sequences and average the variable value for each sequence.
In general, the speed degradation due to pipeline conflicts seems to be quite small. For most programs, each of the variables contributes less than 0.5% to the total execution time.
The only cases of a larger contribution are when the variables affect specific instructions which occur frequently. For the COBOLGO job, an average additional 1.1 cycles (35.75 nsec) is added to each decimal instruction. This represents a 1.35% increase in execution time. For PL1GO, the doubleword store instructions result in an additional 1.17%. For LINSY2, the delay caused by late setting of the condition code needed for conditional branches adds 0.3%.
Although there are wide variations, these worst case examples demonstrate the overall good design of the pipeline.
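The COBOLGO figure can be cross-checked roughly from numbers quoted elsewhere in the text. The sketch below infers the cycle time from the statement that 1.1 cycles is 35.75 nsec, and assumes that the extra delay applies to the 26.29% of instructions in the decimal group and that the mean instruction time is the reciprocal of the 1.353 MIPS rate. These are assumptions about how the quantities relate, so exact agreement with the 1.35% reported is not expected.

```python
# Cycle time implied by "1.1 cycles (35.75 nsec)".
CYCLE_NSEC = 35.75 / 1.1


def added_time_fraction(instr_fraction: float, extra_cycles: float,
                        mean_instr_nsec: float) -> float:
    """Extra delay per executed instruction, as a fraction of the mean
    instruction time, when `instr_fraction` of instructions each incur
    `extra_cycles` additional pipeline cycles."""
    return instr_fraction * extra_cycles * CYCLE_NSEC / mean_instr_nsec


mean_cobolgo_nsec = 1000.0 / 1.353          # from 1.353 MIPS
estimate = added_time_fraction(0.2629, 1.1, mean_cobolgo_nsec)
print(f"cycle time {CYCLE_NSEC:.1f} nsec, estimated contribution {estimate:.2%}")
```

The estimate comes out at roughly 1.3%, the same order as the figure reported above.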
Summary
A verifiable model of CPU performance using simple and reusable tools shows that basic CPU speed as seen by the user is significantly degraded by memory and operating system effects. This performance analysis, based on instruction timing rather than frequency data, shows also that a few instructions can be disproportionately costly.
Many traditional problem areas for high performance computers seem to be under control. The instruction pipeline functions well and branching has little deleterious effect. Memory can be a bottleneck, but the effects of cache store-through policies are negligible. No popular instruction pairs cause particular difficulties, and they are often program-specific artifacts.
Program usage seems to be inconsistent with high-performance implementations in some areas. Decimal arithmetic may be convenient for some applications but is disastrously slow. Storage to storage instruction operands are almost always short and those instructions have high startup costs. Some special cases allowed by the architecture (such as totally overlapped Exclusive-Or) must be individually optimized or performance will suffer. Interaction with the operating system is visible not only because of the time charged for its services, but also because it seriously affects the program miss ratio by disturbing cache memory contents.
These conclusions suggest that designers of high-performance computers should consider the following items to be important: (1) faster memory, (2) more efficient cache, (3) simple pipelines, (4) avoidance of instructions which require serial processing of small data elements, and (5) high-speed decimal arithmetic if it must be included at all.
Conclusion
The performance evaluation techniques described in this paper allow us to draw conclusions about the architecture and the implementation of two high-performance computers with the same architecture. The time spent by an executing program can be apportioned among the various system components. The confidence in the results derives from the verification of the model with actual performance. The accuracy exhibited by these techniques and the ability to change the timing formulas to reflect changes in an implementation allow the designer to predict the performance effects of those changes on future machines.
ACKNOWLEDGEMENTS
The considerable assistance and advice of Forest Baskett was essential to this work. John Banning was very helpful in criticizing an early version of the paper.
We thank Amdahl Corporation, and specifically Kornel Spiro, Manager of Computer Architecture, for their cooperation and for the generous use of an early version of the instruction statistics program originally developed at Amdahl.
We are indebted to Chuck Gray at the University of Michigan for running benchmark jobs on their Amdahl 470.
The original incentive for the analysis of machine traces is due to Harry Saal. It should be emphasized that the results and discussions are strictly unrelated to any current or future architectural efforts of the manufacturers involved.
