Abstract-Five major DFT algorithms were evaluated on seven different computers. The relative performances of these algorithms were related to the architecture of each computer by finding a relationship between the execution time and the instruction counts. The relative performance of these algorithms on other computers is predicted, based on knowledge of the computer architecture. On certain implementations, data transfers are more important than floating-point additions and multiplications when comparing DFT algorithms. On the average, data transfers account for a greater percentage of the execution time than floating-point operations.
The radix-4 algorithm was executed for a sequence length of 1024, and the radix-2 algorithm was executed for sequence lengths of 512, 1024, and 2048, while the other three were executed for sequence lengths of 504, 630, 1008, 1260, and 2520.
Execution Time: Table I lists the time in milliseconds needed for each algorithm to run on each computer. The times are average values obtained by repeated execution of the algorithms. The inputs for the runs were taken from a digitization of $e^{-t} \cos 2t$, where t is a multiple of 0.01. The same input data were used for every algorithm. When the sequence lengths of 1024 and 1008 are compared, the ordering of the algorithms by execution time is different for each computer. Executing the algorithms on a faster computer did not decrease the execution times of all the algorithms equally. For example, the radix-2 algorithm ran 2.4 times faster on the Cray-1 than on the Cyber 750, while the WFTA ran 5.5 times faster. The reasons for the unequal increases in performance are related to the computer architecture implementation and will be presented later.
Memory Size: The memory requirements and data array sizes have been determined in other studies and the results are available in the literature [4], [6]-[9].
Instruction Counts: For all computers, an Assembly language listing of the Fortran source code was obtained and analyzed. The number of instructions executed was determined for each of the following categories: floating add/subtract, floating multiply/divide, integer add/subtract/multiply/divide, and data transfers. These results were obtained by dividing the Assembly language into sections executed as a unit. Then the number of each type of instruction in each section was counted. The total number of executions of a particular instruction type was obtained by multiplying the number of instructions of that type in each section by the number of times that section was executed, and then adding the results over all sections. The number of executions for each type of instruction for a sequence length of either 1024 (radix-2 and radix-4) or 1008 (MFFT, WFTA, and PFA) is listed in Table II.
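The counting procedure described above can be sketched in a few lines of Python. This is only an illustration: the section data, the category names, and the function `total_instruction_counts` are hypothetical and are not taken from the Assembly listings analyzed in this study.

```python
from collections import Counter

def total_instruction_counts(sections):
    """Weight each section's static instruction counts by the number of
    times the section is executed, then sum over all sections."""
    totals = Counter()
    for section in sections:
        for category, count in section["counts"].items():
            totals[category] += count * section["executions"]
    return dict(totals)

# Hypothetical sections (made-up numbers, for illustration only).
sections = [
    {"counts": {"float_add": 4, "float_mul": 2, "integer": 6, "transfer": 10},
     "executions": 512},
    {"counts": {"float_add": 0, "float_mul": 0, "integer": 3, "transfer": 5},
     "executions": 10},
]

totals = total_instruction_counts(sections)
print(totals)
```

Each section contributes its per-pass counts multiplied by its execution count, which is the sum-of-products rule stated in the text.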
The instruction counts for the other sequence lengths have been determined [10], [11]. Typically, over 99.5 percent of the floating multiply/divide instructions were multiplies, while over 90 percent of the integer operations were additions or subtractions. The number of floating operations depends on the DFT algorithm and is approximately equal to the number predicted in theory, while the numbers of integer operations and data transfers depend on the compiler and the computer architecture and are different for each computer. The relationship between the instruction counts, the execution times, and the computer architecture will be analyzed next.
COMPUTER ARCHITECTURE AND EFFECTS ON PERFORMANCE
For each architecture implementation and algorithm, the percentage of time taken by each instruction category was determined. This information can be used to determine which instructions have the greater effect on the execution time. In addition, the correlation coefficients between the different instruction categories and execution times were determined from the equation

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$

where $x_i$ is the instruction count for a particular algorithm and sequence length, $y_i$ is the execution time for the corresponding algorithm and sequence length, $\bar{x}$ is the average instruction count, and $\bar{y}$ is the average execution time. The correlation coefficient indicates how linear the relationship is between the two variables. The closer the correlation coefficient is to one, the more linearly dependent the execution time is on that type of instruction. These correlation coefficients are listed in Table III. Next, a discussion is presented of the basic architecture of the seven computers and the relative performance of the five algorithms. This is followed by a comparison of the algorithms on the different machines.
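As an illustration, the correlation coefficient described above can be computed as follows, assuming it is the standard Pearson sample correlation between instruction counts and execution times. The function name and the sample data are hypothetical and do not come from Tables I to III.

```python
import math

def correlation_coefficient(x, y):
    """Pearson correlation between instruction counts x and execution times y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n          # average instruction count
    y_bar = sum(y) / n          # average execution time
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: data transfer counts and execution times (ms).
counts = [10_000, 20_000, 30_000, 40_000]
exact_times = [1.0, 2.0, 3.0, 4.0]   # perfectly proportional to counts
noisy_times = [1.1, 1.9, 3.2, 3.8]   # nearly, but not exactly, linear

r_exact = correlation_coefficient(counts, exact_times)
r_noisy = correlation_coefficient(counts, noisy_times)
print(r_exact, r_noisy)
```

A perfectly linear relationship gives a coefficient of one, and scatter about the line pulls the value below one.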
Cray-1:
The Cray-1 has eight address registers and eight scalar registers, as well as eight 64-element vector registers. The CPU also contains a high-speed buffer of 64 operand registers and 64 address registers, and four instruction buffers each containing 64 registers. The Cray-1 CPU has separate pipelined functional units for scalar addition, floating addition, and floating multiplication, each of which can operate in parallel with the others. The Fortran source code was compiled using the CFT 1.09 compiler. The main CPU features are listed in Table IV [12]. Fig. 1(a) shows the percentage of execution time taken by each instruction category. The WFTA and radix-2 have the greatest percentage of time taken by data transfers with 83 percent, while the MFFT has the smallest with 76 percent. The high percentage of time for data transfers is related to the instruction timings. An operand load requires 137.5 ns, a store or register transfer requires 12.5 ns, a floating multiply requires 87.5 ns, and a floating add requires 75.0 ns. On the average, 22 percent of the data transfers are loads from memory; thus the average data transfer time is 40 ns. The ratio of floating multiply time to data transfer time is 2.2. The values given here, and those for the CDC Cyber 750, assume no instruction overlap. The question arises as to which category to assign the time when instructions execute simultaneously. The method used here assigns the time to each category, thus counting the time more than once. Since Fig. 1 is given in percentages, adding the additional time does not significantly affect the results. Only the WFTA Assembly language uses vector instructions [10], which gives the WFTA the shortest execution time. Thus, the availability of vector operations on the Cray-1 benefited the WFTA. The correlation coefficients range from 0.88 for the floating multiplications and divisions to 0.97 for the integer operations, with the correlation coefficient for the data transfers being 0.95.
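The 40 ns average data transfer time and the ratio of 2.2 follow from the Cray-1 timings quoted above. The short Python check below reproduces the arithmetic; the variable and function names are illustrative, and the calculation assumes that every data transfer that is not a load costs the 12.5 ns store or register transfer time.

```python
# Cray-1 timings quoted in the text (nanoseconds).
LOAD_NS = 137.5               # operand load from memory
STORE_OR_REGISTER_NS = 12.5   # store or register-to-register transfer
FLOAT_MULTIPLY_NS = 87.5
LOAD_FRACTION = 0.22          # share of data transfers that are loads

def average_transfer_time(load_fraction, load_ns, other_ns):
    """Weighted mean transfer time, assuming all non-load transfers cost other_ns."""
    return load_fraction * load_ns + (1.0 - load_fraction) * other_ns

avg_transfer_ns = average_transfer_time(LOAD_FRACTION, LOAD_NS, STORE_OR_REGISTER_NS)
multiply_to_transfer = FLOAT_MULTIPLY_NS / avg_transfer_ns

print(f"average data transfer time: {avg_transfer_ns:.1f} ns")
print(f"floating multiply / data transfer: {multiply_to_transfer:.1f}")
```

The weighted mean is 0.22 x 137.5 + 0.78 x 12.5 = 40.0 ns, and 87.5 / 40.0 is approximately 2.2, in agreement with the values in the text.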
CDC Cyber 750:
The CDC Cyber 750 has eight operand registers, eight address registers, and eight index registers. Each operand register has a corresponding address and index register. Six of the register sets read from memory, while two write into memory. Placing an address into an address register causes the computer to initiate the desired fetch or store. The CPU has separate pipelined functional units for integer addition, floating addition, floating multiplication, and floating division that can operate in parallel. An instruction stack of 12 registers provides high-speed buffer storage. The Fortran source code was compiled using the FTN 4.8 compiler with option 2, which optimizes the execution time. Table IV lists the main features of the computer [13]. Fig. 1(b) shows the percentage of execution time taken by the data transfers. The radix-2 has the smallest percentage of time taken by data transfers, which is consistent with the fact that it is the fastest, while the WFTA has the largest percentage of time taken by data transfers. Thus, the speeds of the algorithms on the Cyber 750 are limited by the data transfer rate. The reason the data transfers are so important is clear from the instruction timings. Fetching an operand from memory requires 475 ns, while multiplying two floating-point numbers requires only 125 ns [13]. The correlation coefficients were calculated and are listed in Table III [...] the fewest floating operations. For a sequence length of 1008, the MFFT has the most data transfers and is the slowest. The correlation coefficient for the integer operations is large because integer operations are used to perform the address calculations for data transfers.
IBM 370/155: The IBM 370/155 has 16 integer and address registers and four floating-point accumulators. The CPU can prefetch up to three instructions, and the fetch and execution cycles can be overlapped. High-speed buffer storage of 8K bytes increases the average data transfer rate. The source code was compiled using the Fortran H level 21.8 compiler with option 2, which gives the most optimized object code. Fig. 1(c) shows that the WFTA has the largest percentage of execution time taken by data transfers with 52 percent, while the radix-2 has the least with 25 percent. The correlation coefficients vary from 0.78 for the integer operations to 0.95 for the floating add/subtract and data transfers. The execution time is most closely related to the number of floating-point additions and subtractions, which is exemplified by the fact that the PFA is the fastest algorithm on the IBM and has the fewest floating-point additions and subtractions. Since the computer has a multipurpose functional unit, the floating operations have a greater effect on the execution time, whereas the high-speed buffer memory decreases the dependence of the execution time on the number of data transfers.
DEC VAX 11/780:
The VAX 11/780 has 16 32-bit registers, four of which are used by the operating system. The system has the optional floating-point accelerator and 8K bytes of cache memory. The source code was compiled using the UNIX f77 compiler. Table IV lists the main features of the VAX 11/780 [15]. The percentage of time taken by each instruction category could not be determined because DEC will not release the CPU instruction times [16]. However, the correlation coefficients were determined and ranged from 0.66 for the floating multiplies and divides to 0.97 for the data transfers.
DEC PDP 11/60: The DEC PDP 11/60 has eight integer registers, but only six are usable by the program for operand storage, with the other two used as a program counter and stack pointer. The computer has an optional floating-point processor for performing floating operations, which can operate in parallel with the main CPU and contains six floating-point accumulators, of which only four are memory accessible. The CPU contains 2K bytes of high-speed cache memory. The source code was compiled using the F4P version 3.0 compiler. Table IV lists the main features of the DEC PDP 11/60 [17]. Fig. 1(d) shows that the WFTA has the largest percentage of execution time taken by data transfers with 57 percent, while the MFFT has the smallest percentage with 49 percent. The correlation coefficients for the PDP 11/60 range from 0.97 for the floating multiplications and divisions to near 1.00 for the data transfers. The execution time decrease provided by the addition of the floating-point processor is limited by the fact that two memory cycles are required to transfer one operand [17]. In addition, computational overhead is required to set up the addressing for the floating-point operand.
DEC PDP 11/50: For the purposes of this study, the PDP 11/50 is identical to the PDP 11/60 except that the PDP 11/50 has no cache memory and it has a slower version of the optional floating-point processor. The source code was compiled using the F4P version 2.4 compiler. Table IV lists the main features of the PDP 11/50 [18].
As shown in Fig. 1(e), the radix-2 has the greatest percentage of execution time taken by data transfers with 66 percent, while the MFFT has the smallest percentage with 53 percent. The values of the correlation coefficients in Table III for [...]
Cromemco Z-2D: The Cromemco Z-2D microcomputer uses a Z-80 microprocessor as its CPU. The Z-80 contains 12 8-bit registers which can be combined into six 16-bit registers. The CPU cannot perform multiplication or division and must rely on software routines for these functions. Memory is accessed through an S-100 bus. The source code was compiled using the version 3.21 compiler. Table IV lists the main features of the Cromemco Z-2D. Fig. 1(f) shows the percentage of time taken by each instruction category. The largest percentage of time is taken by the floating operations, which are performed using software routines that are slower than floating-point hardware. In particular, the execution time is closely related to the number of floating-point multiplications, since they take six times longer to perform than floating-point additions. Thus the PFA is the fastest algorithm, while the MFFT is the slowest. The WFTA could not be run on the Cromemco because of limited computer memory. Since the WFTA has the fewest floating-point multiplications, it should have been the fastest algorithm on this machine.
Comparison of Algorithm Performance: Table V lists the execution time ratios for four DFT algorithms between the seven computers for a sequence length of either 1008 or 1024. The values for the other sequence lengths can be similarly obtained from Table I. For example, comparing the Cyber 750 and the Cray-1, the WFTA has the greatest decrease in execution time. The WFTA is 5.5 times faster on the Cray-1 than on the Cyber 750, while the radix-2 is only 2.4 times faster. Since the WFTA has a matrix structure, it benefited most from the vector operations available in the Cray-1. If the Cyber 750 and IBM 370/155 are compared, then the decreased execution times of the Cyber over the IBM for three of the algorithms are about the same. Since the Cyber data transfer time is longer than its floating operation time, algorithms with fewer data transfers, such as the radix-2, showed a decreased execution time on the Cyber. The decrease of the execution time of the IBM 370/155 over the PDP 11/60 is greatest in the WFTA and PFA. The ratio of the floating multiply time to the data transfer time is about 7.4 on the IBM 370/155, while on the PDP 11/60 it is only 2.0. Thus, the IBM 370/155 performs better than the PDP 11/60 on algorithms which have fewer floating operations but more data transfers, such as the WFTA. In comparing the PDP 11/60 and PDP 11/50, the WFTA executes 2.2 times faster, while the MFFT is 2.7 times faster. Thus the PDP 11/60 shows a relative improvement on algorithms which have more floating operations, such as the MFFT. The ratio of floating multiply time to data transfer time is 2.7 for the PDP 11/50. The radix-2 algorithm shows the greatest decrease in execution time when run on the PDP 11/50 instead of the Cromemco Z-2D. The radix-2 is 28.4 times faster, while the PFA is 19.5 times faster. Thus, a greater improvement occurs in the algorithms with more floating operations.
PREDICTION OF ALGORITHM PERFORMANCE
Knowledge of the computer architecture implementation on which an algorithm will run can be used to predict the relative performance of the algorithms. To accomplish this, we divide computer architectures into three different types: floating-point processors, data transfer processors, and vector processors. Also, the effect of the compiler on the execution time must be determined.
Floating-Point Processors:
Floating-point processors are those which execute floating-point operations well and whose execution time is limited mainly by the data transfer rates of the operands. These architectures typically have an average data transfer time which is greater than half of the floating operation time and a high percentage of time for data transfers. The CDC Cyber 750 is an example of a floating-point processor. These architectures execute algorithms with a minimum number of data transfers most efficiently. Therefore, when the algorithms are ranked according to the number of data transfers, the radix-2 or PFA would have the minimum execution time, depending on the compiler, followed by the radix-4, WFTA, and MFFT. In fact, the execution time could be roughly predicted using only knowledge of the number of data transfers, independent of the algorithm used. For example, the execution times for the algorithms run on the Cyber 750 are plotted against the number of data transfers in Fig. 2. The execution time is directly proportional to the number of data transfers. The average error between the points and the line of best fit is 4.1 percent. The proportionality constant will vary with different computers, but is closely related to the data transfer rate of the computer. Thus, for an architecture like the Cyber 750, the fastest algorithm is the one with the fewest data transfers for that particular sequence length.
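A proportional fit of this kind can be sketched as follows. The sketch assumes a least-squares line through the origin and a mean absolute percentage error; the exact fitting and error definitions used for Fig. 2 are not stated in the text, and the data points below are hypothetical rather than the Cyber 750 measurements.

```python
def fit_proportional(transfers, times):
    """Least-squares slope k for the model time = k * transfers
    (a line through the origin, which is an assumption of this sketch)."""
    sxy = sum(x * y for x, y in zip(transfers, times))
    sxx = sum(x * x for x in transfers)
    return sxy / sxx

def mean_percent_error(transfers, times, k):
    """Average of |measured - predicted| / measured, in percent."""
    errors = [abs(y - k * x) / y for x, y in zip(transfers, times)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical data: data transfer counts and execution times (ms).
transfers = [20_000, 30_000, 40_000, 50_000]
times_ms = [10.0, 15.5, 19.5, 25.5]

k = fit_proportional(transfers, times_ms)
error_pct = mean_percent_error(transfers, times_ms, k)
print(f"proportionality constant: {k:.3e} ms per transfer")
print(f"average error: {error_pct:.1f} percent")
```

Under this model the slope k plays the role of the proportionality constant, that is, an effective time per data transfer for the machine.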
Data Transfer Processors: Data transfer processors have their execution speed limited by the speed of their floating-point operations. These processors have a single multipurpose functional unit and usually a high-speed buffer. The IBM 370/155, VAX 11/780, PDP 11/60, PDP 11/50, and Cromemco Z-2D are examples of data transfer processors. The percentage of time for the floating operations is the greatest. Thus, the number of data transfers does not predict the execution time as well as it does for the floating-point processors, as can be seen in Fig. 3. The average percentage of error between the actual execution times and those predicted by the line of best fit is 16.0 percent, which is about four times greater than the error for the floating-point processors. Data transfer processors have a data transfer time which is less than half of the floating operation time, and execute algorithms with the fewest floating operations most efficiently. For this architecture implementation, either the radix-4 or PFA would be the fastest, followed by the radix-2, MFFT, and WFTA, with the particular order of the last three dependent on the ratio of floating add to floating multiply time. Fig. 4 shows a plot of the execution time versus data transfers for all the computers studied.
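The half-of-floating-operation-time rule used to separate floating-point processors from data transfer processors can be expressed as a small decision function. This is a hedged sketch, not a procedure given in the paper: the function name is hypothetical, the floating multiply time stands in for the "floating operation time", and the boundary case (transfer time exactly half the floating time), which the text does not cover, is arbitrarily assigned to the data transfer class.

```python
def classify_processor(transfer_time, float_op_time, has_vector_units=False):
    """Classify an architecture using the rule of thumb from the text.

    Both times must be in the same units; only their ratio matters.
    """
    if has_vector_units:
        return "vector"
    if transfer_time > 0.5 * float_op_time:
        return "floating-point"   # limited by data transfer rate
    return "data transfer"        # limited by floating operation speed

# Cyber 750: 475 ns operand fetch and 125 ns floating multiply (from the text).
cyber = classify_processor(475.0, 125.0)
# IBM 370/155: floating multiply is about 7.4 times the data transfer time.
ibm = classify_processor(1.0, 7.4)
# PDP 11/50: floating multiply is about 2.7 times the data transfer time.
pdp_11_50 = classify_processor(1.0, 2.7)
# Cray-1: vector functional units take precedence in this sketch.
cray = classify_processor(40.0, 87.5, has_vector_units=True)

print(cyber, ibm, pdp_11_50, cray)
```

With the timings and ratios quoted in the text, the Cyber 750 falls in the floating-point class and the IBM 370/155 and PDP 11/50 fall in the data transfer class, matching the grouping given above.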
Vector Processors: Vector processors, or array processors, have functional units specifically for vector operations. The Cray-1 is an example of a vector processor. Vector processors execute the WFTA, with its nested structure, most efficiently, followed by the radix-4, PFA, radix-2, and MFFT, with the order of the last four dependent on the other features available in the processor.
Compilers: In addition to the hardware employed, the compiler used by each computer must be considered. The CDC and IBM compilers have three levels of optimization to improve the execution times of the algorithms [20], [21]. The DEC compiler has a fixed level of optimization [22]. All three of the compilers recognize and replace common expressions, remove invariant computations from within loops, retain frequently referenced variables in registers, and assign frequently referenced variables to registers across loops. The DEC and CDC compilers evaluate constant expressions at compile time, and the CDC and IBM compilers simplify subscript calculations by using additions instead of multiplications where possible. Compiler optimization will not affect the number of floating-point operations, but it will affect the number of data transfers and integer operations. The object code produced by the optimizing compiler (option 2) on the Cyber 750 is three times faster than the code produced by the nonoptimizing compiler (option 0). When the code was compiled by the Fortran V compiler, the execution times were nearly equal to those produced by the Fortran IV compiler. Throughout this work we compared the optimized object code for each of the computers. Improvements made to the compiler could affect the number of data transfers and consequently the results presented here.
CONCLUSIONS
This paper presented an evaluation of five major DFT algorithms on seven different computers. The data clearly show that no algorithm is the fastest in all cases. The reasons why certain algorithms perform better on certain computers were explained in terms of the architecture, hardware, and software. The execution times were related to four different instruction categories: floating add/subtract, floating multiply/divide, integer operations, and data transfers. The average correlation coefficients were 0.95, 0.86, 0.94, and 0.98, respectively. In all cases, the number of data transfers was highly correlated with the execution times. For floating-point processors, the number of data transfers was the best predictor of the algorithm performance, with a 4.1 percent error, and the radix-2 was the fastest algorithm. For data transfer processors, the number of data transfers did not predict algorithm performance as well, with a 16.0 percent error, and either the radix-4 or the PFA was the fastest algorithm. Computers with vector operations executed the WFTA fastest. This information can be used to determine which algorithm should be used for a particular computer in order to minimize the DFT computation time.
