A model of high-performance computers is derived from instruction timing formulas, with compensation for pipeline and cache memory effects. The model is used to predict the performance of the IBM 370/168 and the Amdahl 470 V/6 on specific programs, and the results are verified by comparison with actual performance. Data collected about program behavior is combined with the performance analysis to highlight some of the problems with high-performance implementations of such architectures.
Introduction

General Goals
One of the most important tasks for a computer designer is the evaluation of a computer architecture and its implementation.
As two specific instances of that task, we consider (1) a comparison of the performance of the IBM 370/168 Model 1 and the Amdahl 470 V/6, which are two machines with the same architecture but different implementations, and (2) an analysis of some of the properties of the IBM 370 instruction set.
The basic goal is to apportion the time spent by an executing program among the various system components such as the cache memory, the instruction pipeline and the individual instructions, so that resource utilization and system bottlenecks will appear. This is achieved by using models of the CPU of each machine which also provide estimates of the total CPU times.
The total time is important insofar as it is used to verify the accuracy of the model, since the predicted times are compared to the actual performance of the machines.
The decision to make implementation dependent measures of CPU performance for two members of a specific architecture family has several advantages: (1) Some of the traditionally difficult problems encountered when comparing two different architectures are not present, since many confounding factors relating to performance evaluation have the same effect on both machines.
(2) The success of one of the levels of a complex system can often be measured by the characteristics of the levels below. Performance evaluation which is close to the implementation level The computer system designer has similar problems to solve, but most of the existing literature is not written for his viewpoint.
One explanation for this phenomenon is the lack of feedback; users seldom complain about hardware design because they feel that their complaints will have little effect. The result is a scarcity of information for use by the designer. Most of the studies closest to this work deal with the collection of data on instruction frequencies. The most frequent objectives involve (1) benchmark studies, (2) computer design, (3) language design, and (4) general programmer curiosity.
Some studies leave all interpretation to the reader, and become a useful source of primary data [GIB, CCTJ]. The studies most applicable to the computer designer's point of view often provide instruction frequencies, register utilization, opcode pairs, and static vs dynamic frequency comparisons, but little timing or performance information [LUN, FLY, WIN, HAN, AGA73, ANA, FOS71a,b]. The language-oriented studies have provided similar information for specific languages, studying the match between the language and the machine code to which they must be translated [ALE72, HEN, ALE75 because the instructions which make up the 50%, 90% and 100% groups of a program are dependent on the program, the programmer, and the language used. The only instructions which seem universally important are the branches, which most often account for about 15-30% of the instruction counts, but which still show wide variation.
The difficulty with the frequency analysis approach is that for performance evaluation the designer needs information about the instructions which account for most of the execution time. Attempting to derive performance conclusions from an instruction frequency list yields poor results because some instructions can be hundreds of times slower than others. To obtain acceptable performance results the designer needs to consider machine dependent variables because they are required for precise evaluation of the instruction execution time.
The Instruction Timing Model
The Methodology
The models of the CPUs used here are based on the instruction timing formulas available from the manufacturers' documents which describe their computers [AMD, IBM]. These documents sometimes sacrifice details for ease of exposition (which is not to say that they are easy to read!) and represent only the best efforts of an engineer to describe the existing machine.
(In deriving the model for the Amdahl machine we were quite fortunate to get some help from the designers.)
The programs to be measured were traced in user state, and all the information required to compute the instruction execution time from the formulas was collected.
A record was made of counts of occurrences, values of instruction variables used in the formulas, and information about memory performance. Typical variables depend on the specific instruction but may also depend on the implementation details. For example, the number of bytes moved is implementation independent, but measures of pipeline interlocks and timing delays are not.
Some variables depend on instruction environment and therefore require information about instruction pair and triple distributions.
Two primary constraints caused us to trace only user-state instructions.
(1) Tracing system software, with the attendant performance degradation of at least 50 to 1, would modify operating system behavior in timing dependent I/O sections.
By tracing only in user mode, which is basically not speed dependent, we eliminate a source of error which would necessitate a complicated interpretation of the results.
(2) Tracing the operating system introduces a large number of problems involving the recording of the trace data. One standard solution is the use of samples rather than complete traces, but then the verification of the predicted CPU time is nearly impossible.
Since the timing formulas do not include the effects of cache memory misses, the cache memory is simulated for each machine. The cache penalty is added to the instruction execution time to obtain the expected program execution time. To verify the model the expected time is compared to the operating system accounting time corrected to compensate for the differences between the measurement methods.
The effects of instruction interaction, which can generally be attributed to pipeline resource interlocks, are rather explicitly accounted for in the Amdahl formulas.
For IBM, however, the pipeline effects seem to have been averaged into the formulas in a way which was not clearly indicated. This was a potential source of difficulty, but the effort required to obtain this information from the logic diagrams and microcode listings was prohibitive, and unjustified when an error of a few percent is acceptable.
The techniques used here are much more complex than benchmarking, but not as costly as total hardware simulation.
The tools are general enough that they can be, and have been, used for other studies.
The importance, however, lies in the ability to change the model variables to reflect proposed changes to the existing hardware and to accurately predict the performance effects of those changes.
Choice of Factors
The development of the CPU model has been greatly influenced by the idea of an evolving system of tools: development by successive refinement.
A introducing a significant error in the result.
To test this hypothesis, the program which computes execution times was given three sets of execution variables with which to predict program running time. One was a programmer's best guess of the true values, and the other two were the smallest and largest extremes which could realistically be expected. The results showed that an instruction could jump from 4% to 50% of the total time depending on the value of its variables with all others remaining the same. This is an unacceptable error, especially since errors in the variables for many instructions could combine to form large systematic errors.
Most of the variables which affect execution time were therefore measured exactly or estimated from related measurements.
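The sensitivity described above can be illustrated with a minimal sketch. The function name and all numeric values are hypothetical, chosen only so that the share of one instruction moves from 4% to 50% of the total when its per-execution time is varied and everything else is held fixed; they are not measurements from the traced programs.

```python
def time_share(count, time_per_exec, other_time):
    """Fraction of total execution time spent in one instruction type."""
    own = count * time_per_exec
    return own / (own + other_time)

# Hypothetical values (arbitrary time units), not taken from the traces.
OTHER_TIME = 96.0  # time of all other instructions, held constant
COUNT = 1          # executions of the instruction under study

for label, per_exec in [("smallest extreme", 4.0), ("largest extreme", 96.0)]:
    share = time_share(COUNT, per_exec, OTHER_TIME)
    print(f"{label}: {share:.0%} of total time")
```

The point of the sketch is only that a wide range in a single timing variable changes both the numerator and the total, so the apparent importance of an instruction cannot be judged from guessed variable values.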
The predicted execution time is composed of the aggregate instruction timing results and a penalty for cache memory misses. The aggregate instruction timing results have already taken into account the instruction counts and basic execution speed, as well as the pipeline interlocks.
The cache miss penalty depends on the reference pattern of the program, the cache organization, and the data flow pattern within the machine. The two machines differ rather markedly in those respects: the 370/168 uses aligned doubleword (8-byte) accesses and an associative set size of 8, while the 470 accesses unaligned fullwords (4 bytes), uses a set size of 2, but has the same total amount of data (16K bytes).
There are also rather significant differences in the amount and type of instruction lookahead performed.
To accurately measure the cache penalty, the trace analysis program has a detailed simulation of the cache and instruction fetch mechanism of both machines.
Although cache memory miss ratios are known to be low (MER), it is easily shown that the contribution of the time penalty for the misses is too large to be neglected.
If the miss ratio is 5%, with a 480 nsec penalty for a miss, 2 memory requests per instruction, and an average instruction execution time of 300 nsec (reasonable values for the 370/168), then the time for the cache misses represents 16% of the execution time.
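The arithmetic behind this estimate can be checked with a short sketch. The function name is ours; the inputs are the values quoted in the text, and the 16% figure is taken relative to the 300 nsec average instruction execution time.

```python
def cache_miss_overhead(miss_ratio, miss_penalty_ns, refs_per_instr, instr_time_ns):
    """Return (miss penalty per instruction in ns, penalty / instruction time)."""
    penalty_ns = miss_ratio * refs_per_instr * miss_penalty_ns
    return penalty_ns, penalty_ns / instr_time_ns

# Values quoted in the text for the 370/168.
penalty_ns, fraction = cache_miss_overhead(
    miss_ratio=0.05, miss_penalty_ns=480.0, refs_per_instr=2, instr_time_ns=300.0
)
print(f"{penalty_ns:.0f} ns per instruction, {fraction:.0%} of the instruction time")
```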
Two other cache organization features must be considered in the cache penalty correction.
For IBM, stores always access main memory ("store-through") which may cause extra delays.
For Amdahl, there is an extra penalty when a 4-byte access crosses a cache line boundary.
These and the other cache corrections are not attributed to the instructions which caused them, but rather accumulated separately. to these supervisor-state instructions executed in the processing of user-initiated supervisor calls (SVCs) must be subtracted from the reported CPU time.
Measurements were made of the charged time for all the relevant SVCs as the programs were traced. The correction is very significant for almost all programs, since both the number and cost of the SVCs are high. For the 168, for example, the time charged varies from 107 usec for an I/O operation to 26 msec for opening a file.
Although the SVC time correction could have been measured for the original benchmark programs, they were somewhat modified in view of the substantial correction required (as much as 20%). Wherever possible, the number of I/O operations was reduced by increasing the file blocking factors, but we did not otherwise alter the operation of the programs.
Despite this effort, the SVC time correction remained the factor which introduced the largest error in the measurements. We also added a FOW numerical analysis program from which the I/O parts were excised, so that few supervisor services were requested.
Since supervisor-state and user-state instructions share the same cache, there will be some displacement of the user's "working set" from the cache in response to an SVC, which will manifest itself as a lower than normal hit ratio when the user's program is resumed. An unpublished note by Rossman suggested that this would have a significant effect [ROS]. To verify this we simulated the cache activity for one job with a large number of SVCs, first assuming a 100% cache flush for each SVC, and then again with no flush; the number of cache misses changed by a factor of 10.
Measurements
To determine the cost of a cache miss, a test program simply fills the cache with known data. A second loop is then timed, in which either the same data is reloaded, or new data displaces the old.
The difference in time between the two versions of the second loop, divided by the number of cache misses caused by the loop which displaces the data, provides the cache miss time.
The value found for IBM is 480 nsec, which is not inconsistent with information from the hardware manuals.
For Amdahl, cache misses are found to cost 650 nsec, which also agrees with information from the designers.
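The computation described for the test program can be sketched as follows. The function name and the two loop timings are hypothetical; the timings are invented so that the result matches the 480 nsec value reported for IBM, and the count of 512 misses assumes one miss per 32-byte line of a 16K-byte cache.

```python
def cache_miss_time_ns(t_displacing_ns, t_reloading_ns, misses):
    """Per-miss cost: extra time of the displacing loop divided by its misses."""
    if misses <= 0:
        raise ValueError("the displacing loop must cause at least one miss")
    return (t_displacing_ns - t_reloading_ns) / misses

# Hypothetical loop timings in nanoseconds, not measured values.
LINES = (16 * 1024) // 32       # 512 lines, assuming one miss per line
t_reload = 100_000.0            # second loop reloads the same data
t_displace = 345_760.0          # second loop displaces the old data
print(cache_miss_time_ns(t_displace, t_reload, LINES))
```

Taking the difference of two otherwise identical loops removes the loop overhead from the estimate, which is why the reloading version is timed as well.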
Once the cache miss penalty is established, the effect of a supervisor request on the user data in the cache can be measured easily.
In a similar fashion the cache is filled with known data, the SVC is issued, and the cache is refilled with the same data. The second loop is timed, and compared to the identical loop when the SVC is not present.
The time difference divided by the cache miss penalty gives the number of cache lines that were displaced by the SVC. Note that the second loop must fill the cache in the opposite order from the first loop, otherwise the LRU replacement algorithm would cause the original data to be removed instead of the data added by the SVC.
One of the most interesting differences of implementation between the two machines is the effect of data stores on the cache. The IBM approach is to always store data directly into main memory, and to update the cache only if the line already exists. The Amdahl machine updates the cache line if the data is present without storing into main memory. If the data is not in the cache, the line will be read from memory. If the replacement algorithm must remove a line which was modified in the cache, the memory is updated at the time the line is replaced.
The IBM method, called "store-through", has often been criticized because it requires a main memory access for all
SVC Time Measurement
As previously discussed, the CPU time charged for SVCs was measured in order to be able to correct the time given by the operating system. The time charged for each SVC is often large and varies from program to program even for the same SVC type.
To account for these variations we measured the time charged to the user for each SVC as the benchmark programs were being traced.
The SVC correction computed by summing the measured SVC times is therefore quite accurate for the 168 because it was the machine used for the tracings. For the 470, the timing program LTIMER was used to give estimates of the average SVC costs. This latter method does not take into account the variation from program to program and the SVC corrections are much less accurate than for the 168.
Tins, the total time predicted from the timing formulas, which does not include the cache miss penalty.
M * Tmiss, where M is the number of cache misses as reported by the cache simulator, and Tmiss is the cache miss penalty.
The number of cache misses includes the effect of SVC execution on the cache contents.
Tcross, the time penalty, for Amdahl only, paid when references to the cache cross a line boundary. The penalty is two cycles (.065 usec) for reads and three cycles (.0975 usec) for writes, and is computed using numbers provided by the cache simulator. Virtually all the penalty arises from instruction fetch, since none of the programs access unaligned data. There is no equivalent penalty for IBM because its larger instruction buffer prefetches enough so that two successive doublewords can be accessed without introducing an additional delay.
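A minimal sketch of how these three terms might be combined, assuming the predicted time is simply their sum (consistent with the earlier statement that the cache penalty is added to the instruction execution time). The function and parameter names are ours, and the example inputs are hypothetical; only the .065 and .0975 usec crossing penalties come from the text.

```python
READ_CROSS_US = 0.065    # line-crossing penalty for a read (Amdahl only)
WRITE_CROSS_US = 0.0975  # line-crossing penalty for a write (Amdahl only)

def predicted_time_us(t_ins_us, misses, t_miss_us, cross_reads=0, cross_writes=0):
    """Assumed form: Tins + M * Tmiss + Tcross, all in microseconds."""
    t_cross = cross_reads * READ_CROSS_US + cross_writes * WRITE_CROSS_US
    return t_ins_us + misses * t_miss_us + t_cross

# Hypothetical inputs: 1000 usec of instruction time, 100 misses at 0.48 usec,
# 10 reads and 2 writes that cross a cache line boundary.
print(predicted_time_us(1000.0, 100, 0.48, cross_reads=10, cross_writes=2))
```

For the IBM model the crossing counts would be left at zero, since the text states there is no equivalent penalty.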
The corrected time for the actual execution, Trun, consists of the following terms:
Tact, the time as given by the standard 131 accounting routines.
Tsvc, the time attributed to execution of all the supervisor subtracted from Tact. In addition to the frequencies of execution, the table gives the fraction of execution time attributable to each of the instructions listed.
Note that it is common for an instruction to have a ratio of 2 to 5 in -__-.
The particular strengths and weaknesses of the implementations are apparent;
the Amdahl implementation of DR suffers in comparison to IBM (FORTGO), whereas IBM fares rather poorly on ml.
Certain dips in performance are clearly evident, and two such examples appear in COBOLC. The Execute (EX) instruction, which the Amdahl designers expected not to be important, is a particularly obvious problem, and has been noted before [EME]. The Exclusive Or Character (XC) instruction, which accounts for 8.31% of the execution time, is almost always a case of overlap discussed later, which IBM optimized but Amdahl did not.
Instruction Length
The 370 architecture has three instruction lengths: 2, 4, and 6 bytes, which loosely correspond to register to register, register to memory, and memory to memory instructions. In Table 7, the column marked '% Count' indicates the fraction of all instructions executed that were potential branch instructions.
The column marked '% Success' which follows, shows the fraction of those potential branches that were successful.
In the 370 architecture there are two classes of branches: unconditional branches, and conditional branches whose success depends on values at execution time. Each class contains both successful and unsuccessful branches. The only unusual subclass is the unconditionally unsuccessful branch, which is a no-op instruction.
The second part of Table 7 shows the fraction of branches in each of these four subclasses as a fraction of all potential branches encountered.
Branch instructions can create difficulties for pipelined implementations of computer architectures. The instruction fetch mechanism is often a stage in the *et.** For most programs, the average execution distance is surprisingly small (less than 32 bytes, which is the cache line size) but the standard deviation is large. There are often isolated peaks for relatively large execution distances (see Table 11). With the exception of the PLIGO program, which has the highest average execution distance, 77% to 85% of execution distances are less than 32 bytes. Distances less than 16 bytes account for 40-60% of the execution distances. This tends to justify the choice of 32 bytes for the line size of the cache on both machines, at least as far as instruction fetch is concerned. This is also consistent with older designs for instruction fetch buffers, such as the IBM 360/91 which has a 64 byte instruction stack.
The measurement of opcode pair frequencies confirms that the overall frequency of an opcode is not independent of the surrounding instructions. Pair occurrences are also important in performance analysis because of pipeline interlocks and other miscellaneous issues such as memory store-through. Table 12 gives the five most frequent opcode pairs for each program. It is not uncommon for the measured frequency of those pairs to be 4 to 9 times greater than the product of the individual opcode frequencies.
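The comparison between a measured pair frequency and the product of the individual opcode frequencies can be sketched as below. The function name and the toy trace are hypothetical; the mnemonics are ordinary 370 opcodes used only as labels, and a ratio of 1 would indicate that the two opcodes occur independently.

```python
from collections import Counter

def pair_dependence(trace):
    """Map each adjacent opcode pair to (pair frequency) / (product of opcode frequencies)."""
    n = len(trace)
    op_freq = {op: count / n for op, count in Counter(trace).items()}
    pair_counts = Counter(zip(trace, trace[1:]))
    n_pairs = n - 1
    return {
        pair: (count / n_pairs) / (op_freq[pair[0]] * op_freq[pair[1]])
        for pair, count in pair_counts.items()
    }

# Toy trace: a four-instruction loop body repeated five times (made-up data).
trace = ["L", "A", "ST", "BC"] * 5
ratios = pair_dependence(trace)
print(round(ratios[("L", "A")], 2))
```

In this toy trace each opcode has frequency 0.25, yet the pair L, A follows every L, so its measured frequency is several times the independent estimate; tight loops in real programs produce the same kind of dependence.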
An examination of the frequent opcode pairs fails to discover any pair which occurs frequently enough to suggest creating additional instructions to replace it. Many of the instruction pairs which do occur frequently are those that when combined would save only one opcode field since the other instruction fields would still be '***** replacement by a single instruction to improve code density.
Registers and Address Calculation
The 370 architecture expresses addresses as the sum of a 24 bit base value in a register with a 12 bit displacement in the instruction.
Some instructions allow an additional 24 bit quantity in another register to be used as an index.
In all cases specification of register 0 for the base or index indicates that a value of zero is to be used in lieu of the contents of the register.
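The address calculation just described can be written as a short sketch. The function name and register contents are hypothetical, and the final masking to 24 bits is our assumption about how a sum that overflows the address width is handled.

```python
ADDRESS_MASK = 0xFFFFFF  # assumption: results wrap to the 24 bit address width

def effective_address(regs, base, displacement, index=0):
    """Base + index + displacement, where register number 0 contributes zero."""
    if not 0 <= displacement < 4096:
        raise ValueError("displacement must fit in 12 bits")
    base_value = regs[base] & ADDRESS_MASK if base != 0 else 0
    index_value = regs[index] & ADDRESS_MASK if index != 0 else 0
    return (base_value + index_value + displacement) & ADDRESS_MASK

# Hypothetical register contents; register 0 holds a nonzero value on purpose.
regs = [0xDEAD] + [0] * 15
regs[12] = 0x001000
regs[3] = 0x000020
print(hex(effective_address(regs, base=12, displacement=0x10, index=3)))
print(hex(effective_address(regs, base=0, displacement=0x10)))
```

The second call shows the register 0 convention: the contents of register 0 are ignored, so the address is the displacement alone.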
From one to sixteen registers may be moved by a single instruction. Table 14 shows a typical distribution (from FORTGO) of the number of registers stored and loaded. It is common for there to be two peaks, one for a low value of about 2 to 3 registers for accessing data stored in consecutive words, and another at a high value of 11 to 15 registers for saving and restoring registers across procedure calls. The LM and STM are not used symmetrically: for a given number of registers loaded or stored the frequency counts are often quite different.
For the FORTGO program, the average number of registers used for STM is 13.23, and for LM is 5.99. For both machines, the marginal cost of storing one more register is smaller than the execution time of a load or store instruction, but there is a higher overhead for starting each instruction for IBM than for Amdahl. In both cases it is faster to use several store or load instructions when 3 or fewer registers are involved.
Despite the fact that these instructions are never among the most frequent, they contribute much more to the CPU time than their frequency would suggest because of their long execution times. For the FORTGO program for example, the 0.67% of instructions which are STM account for 6.66% of the IBM execution time and 4.59% of the Amdahl execution time.
Character Instructions.
The second group of storage-to-storage (SS) instructions are those which specify a source and destination location for a character string and a single length for both operands in the range 1 to 256. One of the characteristics of these instructions that makes their implementation very difficult is that overlapped operands are allowed and must be treated a byte at a time.
This allows, for example, a single byte to be propagated throughout a string by a move instruction whose destination address is one greater than the source address, since the fields are processed left to right.
Lower performance machines in the 370 family implement these instructions in all cases by processing each byte individually, but for high performance machines this would be too slow. Therefore both computers exhibit execution speeds for the non-overlapped cases which are much higher than that for overlapped.
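The byte-at-a-time semantics, and the reason a faster block copy cannot be used when the operands overlap, can be illustrated with a small sketch. The function name and buffer contents are ours; this models only the left to right processing rule stated above, not either machine's implementation.

```python
def move_characters(memory, dest, src, length):
    """Copy `length` bytes one at a time, left to right, as the architecture requires."""
    if not 1 <= length <= 256:
        raise ValueError("length must be in the range 1 to 256")
    for i in range(length):
        memory[dest + i] = memory[src + i]

# Destination is one byte past the source: the first byte propagates.
buf = bytearray(b"*abcdefg")
move_characters(buf, dest=1, src=0, length=7)
print(buf)

# A block copy reads the whole source before writing, giving a different result.
block = bytearray(b"*abcdefg")
block[1:8] = block[0:7]
print(block)
```

The two results differ, so a high performance implementation that moves several bytes per cycle has to detect the overlapped case and fall back to the slower byte-serial behavior.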
For were to replace these instructions by faster equivalents when they are available, but this would require tailoring the compilers to specific models of the computer series.
Cache Effects
The correction due to cache misses ranges from 1% to 5% for IBM, but from 3% to 19% for Amdahl, indicating that the memory subsystem is a major bottleneck for the Amdahl machine. In some sense the memory architecture forces the 470 to lose some of the raw speed advantage of the CPU. There are two factors which contribute to the problem. The cache organization of the Amdahl machine produces from 1.7 to 3 times the number of cache misses, and the penalty for each miss is 1.56 times that for IBM. Thus the overall cache penalty for Amdahl is 2.5 to 4 times more than IBM, whereas the raw execution speed, defined as Tins (the time required to execute the instructions with no cache misses) is 1.9 times faster than IBM. The loss due to the cache organization could have been eliminated, but to maintain the raw speed advantage would have required a cache miss penalty of 250 nsec, which would not have been economically feasible at the time. The dilemma of Amdahl may result from a mismatch between the MOS memory chips available commercially and its proprietary ECL LSI technology which is far more advanced.
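The 250 nsec figure can be recovered with a short calculation, under the assumption (ours) that both machines incur the same number of misses once the loss due to cache organization is eliminated. The function names and the sample Tins and miss count are hypothetical; the 480 nsec penalty and the 1.9 speed ratio are the values given in the text.

```python
def break_even_miss_penalty_ns(ibm_penalty_ns, raw_speed_ratio):
    """Miss penalty at which the overall speed ratio equals the raw speed ratio."""
    return ibm_penalty_ns / raw_speed_ratio

def overall_speed_ratio(t_ins_ibm_ns, misses, ibm_penalty_ns, amdahl_penalty_ns, raw_speed_ratio):
    """(IBM total time) / (Amdahl total time), assuming equal miss counts."""
    t_ibm = t_ins_ibm_ns + misses * ibm_penalty_ns
    t_amdahl = t_ins_ibm_ns / raw_speed_ratio + misses * amdahl_penalty_ns
    return t_ibm / t_amdahl

penalty = break_even_miss_penalty_ns(480.0, 1.9)
print(round(penalty, 1))  # close to the 250 nsec quoted in the text

# Hypothetical workload: 1e9 ns of IBM instruction time and 100,000 misses.
print(round(overall_speed_ratio(1e9, 100_000, 480.0, penalty, 1.9), 3))
```

With equal miss counts, both terms of the Amdahl time are then exactly 1/1.9 of the corresponding IBM terms, so the overall ratio equals the raw ratio regardless of the workload chosen.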
Pipeline Effects for the 470
Because the timing formulas for the Amdahl machine include specific pipeline variables, we can assess their effect on the execution. The pipeline is optimized for 4-byte instructions which have single word operands, and any deviation causes potential conflicts with subsequent instructions.
The seven pipeline variables depend upon local instruction sequences (for example Sl and LWD described earlier), and therefore cannot be computed from global averages. The exact evaluation of these variables would require a complete and complex simulation of the pipeline at the time the program is traced.
As a compromise, we use the pair and triple frequency data collected while tracing to reconstruct instruction sequences and average the variable value for each sequence.
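One possible reading of this averaging step, restricted to pairs for brevity, is sketched below. Everything here is an assumption made for illustration: the function name, the pair counts, and the per-sequence delay values are hypothetical, and the actual variables and their dependence on triples are not reproduced.

```python
def average_pipeline_variable(pair_counts, delay_for_pair, opcode):
    """Frequency-weighted average delay for `opcode`, given the preceding opcode."""
    total = 0
    weighted = 0.0
    for (previous, current), count in pair_counts.items():
        if current == opcode:
            total += count
            # Sequences with no listed conflict contribute a delay of zero.
            weighted += count * delay_for_pair.get((previous, current), 0.0)
    return weighted / total if total else 0.0

# Hypothetical pair counts from a trace and a made-up one-cycle conflict.
pair_counts = {("L", "A"): 30, ("ST", "A"): 10, ("A", "BC"): 40}
delay_for_pair = {("L", "A"): 1.0}
print(average_pipeline_variable(pair_counts, delay_for_pair, "A"))
```

The average is taken per opcode over the sequences in which it actually occurred, rather than over global opcode frequencies, which is the distinction the text draws.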
In general, the speed degradation due to pipeline conflicts seems to be quite small. For most programs, each of the variables contributes less than 0.5% to the total execution time.
The only cases of a larger contribution are when the variables affect specific instructions which occur frequently. 
