In this paper, we present the design and implementation of a source-to-source High Performance Fortran assistant Tool (HPFT) on DEC 3000 workstations. For a given sequential program written in Fortran 77, HPFT generates a vectorized, reuse-exploited, and/or parallelized version for vector computers. HPFT incorporates several new compilation schemes for vectorization, reuse exploitation, and multi-threading. A performance evaluator measures the system performance, and a user interface lets the programmer view information related to the compilation and execution of a program. Experimental results on the Convex C3840 vector computer show that HPFT enhances system performance and usually reduces program execution time.
1. Introduction. Vector computers such as the Cray family and the Convex series are equipped with hierarchical memory and several CPUs to speed up program execution. Multiple CPUs can work together to parallelize the execution of vector operations that are defined by the programmer or by vector compilers. When multiple CPUs concurrently process a program, parallelism appears in both vectorization and parallelization. Synchronizations are needed among these CPUs if their references to array data have dependence relations. Parallelism, memory management, and synchronization are the main factors that determine the execution time of a sequential program. To utilize the hardware efficiently, these factors must be considered carefully.
In most vector computers, vector compilers are able to analyze the dependence relations in a program and automatically perform vectorization and parallelization. However, implicit vector operations or parallelism in programs cannot be fully exploited by vector compilers. Most current compilers [10, 11, 14] provide compiler directives that let the user manually define vector operations and multiple threads in a program. Improper use of these directives causes not only semantic errors but also inefficient execution. Unless users are skilled in parallel program design, it is difficult for them to write an efficient program with explicit definitions of vector and/or parallel operations.
Another critical issue in improving the performance of supercomputers is reducing data movement between shared memory and vector registers. Exploiting reuse opportunities of vector register data not only reduces the workload of the load/store functional units but also brings three further advantages. First, arithmetic functional units avoid waiting for load/store operations and can start earlier. Second, the time spent on load operations is saved, so the execution time of the program is significantly improved. Third, fewer load/store operations reduce shared-memory traffic. Most current compilers, however, are not capable of exploiting implicit reuse opportunities. This motivates us to design and implement a High Performance Fortran assistant Tool (HPFT) that assists vector compilers in vectorization, reuse exploitation, and synchronization reduction.
Related translators developed in the last decade include Parafrase [18], PFC [2, 3], and SUPERB [23]. Parafrase is a source-to-source translator applied to Fortran 66 or Fortran 77 programs. The system can be retargeted to produce code for different classes of concurrent architectures, including register-to-register and memory-to-memory vector machines, array processors, and shared-memory multiprocessors. PFC is a source-to-source vectorizer that translates Fortran 66 or 77 into Fortran 90. It applies standard transformations, including if-conversion, induction variable recognition and substitution, constant propagation, and dead code elimination [2]. SUPERB was developed by Zima and coworkers. It translates Fortran 77 programs into concurrent SUPRENUM Fortran programs for the SUPRENUM machine [23].
Unlike these designs, HPFT is an assistant tool that vectorizes, parallelizes, and exploits reuse opportunities for vector compilers. HPFT performs π-block decomposition such that maximum benefit can be gained from vectorization and reuse exploitation. In the π-block decomposition and vectorization phase, HPFT determines a vectorization vector for each statement in the loop body. The vectorization vector is then used to reconstruct the parsing tree so that vector operations are defined automatically. Several procedures in HPFT reconstruct the vectorized loop so that maximum benefit is obtained from the reuse opportunities of vector data. In addition, HPFT provides a multi-threading technique that partitions the exploited vector operations into multiple threads such that the reuse opportunities are preserved and the synchronizations are as few as possible. A sequential program can thus be translated into high-performance code.
Before execution, the performance evaluator assesses the translated code, reporting its degree of vectorization, reuse exploitation, and parallelization. After execution, it summarizes information including the execution time and speedup. A user interface lets the user easily set the options provided by the HPFT system and view the information offered by the performance evaluator. Several benchmarks, libraries, and scientific application programs are taken as input for measuring the performance improvement. Experimental results on the Convex C3840 vector computer show that our implementation enhances system performance and usually reduces program execution time.
2. Design overview. The HPFT system is implemented in C in an X Window based environment. It consists of three main parts: the HPFT kernel, the user interface, and the performance evaluator. The HPFT kernel consists of several modules, including the dependence analysis module, the π-block decomposition and vectorization module [8], the reuse exploitation module [8], and the multi-threading module [9].
The HPFT kernel is a source-to-source translator implemented on DEC 3000 workstations. It can be roughly divided into six phases, as shown in Fig. 1. The output of each phase is stored in a database and becomes the input of the next phase. In the first phase, a sequential program written in Fortran 77 is taken as input, and syntax analysis constructs the parsing tree. HPFT uses the abstract syntax tree as its intermediate representation, and several procedures are provided for manipulating it. The syntax tree design is similar to that of the Adaptor tool [6]. Adaptor is a source-to-source translator designed mainly for message-passing systems. It transforms data-parallel programs written in Fortran into host and node programs with message passing that can be executed on most parallel architectures. We modified Adaptor's syntax tree design to guarantee that a parsing tree can be generated for any source program written in standard Fortran 77.
After the syntax analysis, the second phase investigates the data dependence relations. In this phase, the GCD [24] and Banerjee [5] tests are used as the principal dependence tests. HPFT then constructs the extended dependence graph (EDG) for use in later phases. The EDG is a graphical representation of the dependence restrictions in the original program. Unlike the conventional data dependence graph (DG) [24, 19], the EDG additionally captures the statement order for each dependence relation in the program. In later phases, vectorization can therefore be processed independently for each statement by examining the EDG. In the third phase, HPFT decomposes the EDG into strongly connected components. HPFT then extracts vector operations and combines the components that have reuse opportunities into a π-block [24]. Subject to the EDG's restrictions, HPFT automatically inserts the proper compiler directives into the parsing tree to define the vector operations. In the fourth phase, HPFT applies heuristic rules to reconstruct the parsing tree so that implicit reuse can be exploited. In the fifth phase, HPFT further optimizes the vectorized and reuse-exploited abstract syntax tree and automatically inserts directives to define multiple threads. Lastly, in the sixth phase, HPFT transforms the parsing tree into high-performance source code for the vector compiler of the Convex C3840.
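The GCD test named above can be illustrated with a minimal sketch. It is written in Python purely for illustration (HPFT itself is implemented in C), and the function name `gcd_test` and its parameter layout are assumptions, not part of HPFT. For two references `X(a*i + b)` and `X(c*j + d)` to the same array dimension, a dependence requires an integer solution of `a*i - c*j = d - b`, which exists only if `gcd(a, c)` divides `d - b`. The test ignores loop bounds, so a `True` result only means that a dependence cannot be ruled out.

```python
from math import gcd


def gcd_test(a, b, c, d):
    """Return True if a dependence between X(a*i + b) and X(c*j + d)
    cannot be ruled out by the GCD test (loop bounds are ignored)."""
    g = gcd(a, c)
    if g == 0:
        # Both subscripts are constants: they conflict only if equal.
        return b == d
    # An integer solution of a*i - c*j = d - b exists iff g divides d - b.
    return (d - b) % g == 0


# X(2*i) and X(2*j + 1) touch even and odd elements: no dependence.
print(gcd_test(2, 0, 2, 1))
# X(i) and X(j - 1) may overlap: dependence cannot be ruled out.
print(gcd_test(1, 0, 1, -1))
```

A bounds-aware test such as the Banerjee test would then be applied to the pairs that the GCD test cannot eliminate.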
The user interface provides a friendly environment in which the user can select compilation options and view the information sent by the performance evaluator. Many library functions offered by Motif [13] are used to build the menu-driven procedures. Fig. 2 displays the main functions of the user interface. The user interface provides several switches with which users specify their needs for the target machine and source programs. For example, if the target machine is equipped with multiple CPUs, the user may set the 'vector-parallel' switch and specify the number of CPUs so that HPFT generates a multi-threaded version. Fig. 3(a) shows a snapshot of the optimizations that the user can select, including index shifting, loop peeling, loop unrolling, and array reorganization. Through the user interface, the HPFT system cooperates with the user in determining the best transformation strategy.
Before execution, the performance evaluator offers the user information including the estimated CPU time and the degree of vectorization, reuse exploitation, and parallelization. After execution, the performance evaluator reports the results and performance information to the user. Fig. 3(b)-(f) display some snapshots of HPFT execution. First, a sequential program written in Fortran 77 is analyzed. According to the EDG translation, HPFT generates another version of the program, as shown in Fig. 3(b). By setting the 'compilation information' switch in the user interface, as shown in Fig. 3(c), users may view the processing of the EDG and the corresponding program in each HPFT processing phase. The menu-driven procedures built into the user interface module are activated, and the EDG is displayed by graphic and tabular viewers. Fig. 3(d) and (e) display the EDG before and after π-block decomposition, respectively. By setting the 'comparison' switch, the user obtains the speedup and execution time of both the HPFT version and the original version, as shown in Fig. 3(f). The HPFT system thus assists the vector compiler of the Convex C3840 in generating better object code so that users' programs execute with high performance.

3. Data structure and algorithms in HPFT design. In this section, we illustrate the new compilation techniques designed in HPFT for vectorization, reuse exploitation, and multi-threading. To describe the concepts of the HPFT design clearly, several artificial examples are used. The improvement that HPFT achieves for real applications and benchmarks is measured in a later section.
For a sequential program written in Fortran 77, HPFT parses the program and constructs the abstract syntax tree. HPFT then analyzes the data dependence relations and constructs the EDG of the original sequential program. [...] [21] to represent the data dependence relations in a program. In the HPFT design, we use the EDG to represent the data dependence restrictions of a loop program. We combine the dependence vector and the statement order into an extended dependence vector so that vectorization can be performed on each statement independently and reuse exploitation can be made easily. In what follows, we first define the extended dependence graph. Algorithms and examples are then given to illustrate the design of HPFT in vectorization, reuse exploitation, and multi-threading.
Definition 3.1 (Extended Dependence Graph). An extended dependence graph EDG(N, E) of an n-nested loop L consists of a set of nodes N and a set of edges E. [...] [24] techniques to eliminate the output dependences and anti-dependences. The preprocessing also renames the equivalence declarations for constructing the EDG. A program with variable extended dependence vectors has very few opportunities for vectorization, reuse exploitation, and multi-thread extraction. Thus, in HPFT, only loops with constant extended dependence vectors are considered for reconstruction.

Several vector compilers offer compiler directives with which users manually optimize a program. We implemented our methods for the Convex vector compiler. To help the vector compiler of the Convex C38 series extract more or better vector operations from a program, HPFT automatically inserts the directive C$DIR Force Vector. On the Convex C38 series, the Force Vector directive [10] forces the vector compiler to vectorize the loop that follows. If the Force Vector directive is applied to an outer loop, the Convex compiler moves the specified loop to the innermost position and runs it in vector mode. Similar directives are also found on the CRAY X-MP and IBM ES/9000 supercomputers [11, 14]. In this paper, for brevity, we sometimes use the vectorized statement A(1:128) = B(1:128) + C(1:128) to denote the generated code
C$DIR Force Vector
      DO 1 I = 1, 128
The design of phases 2 to 6 of HPFT is described in what follows.
π-block decomposition and vectorization. Before vectorization, HPFT partitions the EDG into several strongly connected components. Each component is called a π-block [24]. If statement $S_i$ is not strongly connected to other statements, isolating $S_i$ relaxes the dependence constraints on $S_i$. Thus, π-block decomposition helps extract more vector operations from sequential loops.
However, π-block decomposition may distribute two statements that have reuse opportunities into two different π-blocks, in which case the reuse opportunities are lost. In the HPFT design, π-block decomposition is combined with vectorization so that reuse opportunities can be preserved. Statements that are vectorized in the same loop and have reuse opportunities are not decomposed into two π-blocks. The HPFT approach has two advantages. First, the reuse opportunities are preserved. Second, additional loop heading tests are saved. [...] [24] were introduced in previous studies, and some of them were implemented in compilers for vector machines. In the HPFT design, the π-block decomposition and vectorization phase integrates loop distribution, loop interchange, and loop fusion techniques to maintain the reuse opportunities. The π-block decomposition and vectorization design of HPFT is outlined in the following steps.
(1) Find all strongly connected components of the EDG.
(2) Topologically sort these strongly connected components according to the data dependence relations among them.
[...] If the answer is 'yes', a zero value is assigned to the first element of $V_i$ and the vectorization is made on the outermost loop. Similarly, the values of the second element ($k = 2$), ..., and the $n$th element ($k = n$) of $V_i$ can be determined. The resulting vectorization vector $V_i$ may contain two or more zeros. Because Fortran is a column-major language, the loop that controls the leftmost subscript position of an array variable has the shortest memory stride. To obtain the shortest memory stride, HPFT preserves one of these zero elements and sets the other elements to 1.
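Steps (1) and (2) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not HPFT's C implementation: it uses Tarjan's algorithm (the text does not say which algorithm HPFT uses), the graph is a plain dictionary from statement name to the statements that depend on it, and the example graph is hypothetical. Tarjan's algorithm emits components in reverse topological order, so reversing the list yields an order in which every component precedes the components that depend on it.

```python
def strongly_connected_components(graph):
    """Return the SCCs of `graph` (dict: node -> list of successors)
    in topological order of the condensed graph."""
    index, low = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a component: pop its members off the stack.
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            components.append(sorted(comp))

    for v in graph:
        if v not in index:
            visit(v)
    # Tarjan emits components in reverse topological order.
    components.reverse()
    return components


# Hypothetical EDG: S2 and S3 form a dependence cycle.
edg = {"S1": ["S2"], "S2": ["S3"], "S3": ["S2", "S4"], "S4": []}
print(strongly_connected_components(edg))
```

In this hypothetical graph, S1 and S4 are isolated components and could be vectorized independently, while S2 and S3 stay together because of their cycle.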
Consider the following example. The EDG of loop L1 is depicted in Fig. 4. [...]
Combine components $SC_{k-1}$ and $SC_k$ into a π-block.
Endif
Endfor
According to the decomposed EDG, reconstruct the parsing tree by applying the loop distribution and statement reordering techniques.
According to $V_i$, for $1 \le i \le s$, insert C$DIR Force Vector directives into the parsing tree.

In the π-block decomposition phase, HPFT does not separate two consecutive statements if they will be vectorized in the same loop and have a reuse opportunity. The EDG of loop L2 is constructed in phase 2 of HPFT, as shown in Fig. 5. HPFT looks ahead and finds that $S_2$ and $S_3$ are both vectorized on loop J and that the reuse distance of array B is one iteration gap of loop I. Thus, HPFT does not decompose the EDG into two π-blocks. According to the EDG, as described in a later subsection (phase 4), loop L2 can be translated by HPFT into the following vectorized and reuse-exploited code. [...] For instance, A(2, 2:65), generated in a vector register, say VR1, by $S_1$ at instance I = 3, is immediately reused by $S_2$ at instance I = 3 without being loaded again from memory into a vector register.

Consider another example, extracted from subroutine S084 of the vector benchmark in NETLIB at NCHC (National Center for High Performance Computing). In the π-block decomposition phase, HPFT decomposes the EDG. The EDG of L3 and its data structure are shown in Fig. 6(a) and (b), respectively. Since $S_1$ and $S_2$ can be vectorized in different loops, HPFT decomposes the EDG as shown in Fig. 6(c). In phase 3, HPFT performs vectorization and modifies the parsing tree to be equivalent to the following program L3'.

(L3')
      DO 11 J = 2, n
S1:     AA(2:n, J) = AA(2:n, J-1) + CC(2:n, J) / AA(2:n, J-1)
11    CONTINUE
      DO 2 I = 2, n
S2:     BB(I, 2:n) = BB(I-1, 2:n) + CC(I, 2:n) / BB(I-1, 2:n)
2     CONTINUE
The decomposition of the statement set $\{S_1, S_2\}$ into the two sets $\{S_1\}$ and $\{S_2\}$ allows statement $S_1$ to be vectorized in loop I. Executing $S_1$ thus benefits from both a shorter memory stride and vectorization. Arrays AA and BB have reuse opportunities in $S_1$ and $S_2$, respectively. These reuse opportunities are further exploited in the reuse exploitation phase of HPFT. HPFT automatically determines how to decompose the original EDG into several π-blocks such that maximum benefit is obtained from vectorization and reuse exploitation.
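The stride benefit mentioned above follows from Fortran's column-major storage. The following minimal Python sketch (the helper names `column_major_offset` and `stride` are illustrative assumptions) computes the distance in array elements between consecutive accesses when a vector operation runs along the first or the second dimension of a two-dimensional array.

```python
def column_major_offset(i, j, n_rows):
    """Element offset of X(i, j) in a column-major array with
    n_rows rows, using 1-based Fortran-style indices."""
    return (i - 1) + (j - 1) * n_rows


def stride(dim, n_rows):
    """Distance in elements between consecutive accesses when the
    vector index runs along dimension `dim` (1 or 2)."""
    base = column_major_offset(1, 1, n_rows)
    if dim == 1:
        return column_major_offset(2, 1, n_rows) - base
    return column_major_offset(1, 2, n_rows) - base


n = 100
# A section such as AA(2:n, J) walks down a column: unit stride.
print(stride(1, n))
# A section such as BB(I, 2:n) walks along a row: stride of n elements.
print(stride(2, n))
```

Under this layout, vectorizing along the leftmost subscript gives unit stride, which is why a zero in the vectorization vector that corresponds to the leftmost subscript is the preferred one to keep.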
Reuse exploitation of vector data. A considerable amount of research has been done on this topic [4, 7, 12]. In the HPFT design, a reuse distance equal to or larger than one iteration gap can be further reduced to zero. The reduction of the reuse distance is determined by several heuristic rules that operate on the EDG. In what follows, we use an example to illustrate the process of reuse exploitation in HPFT. Improvements in real application programs are measured in a later section.

In phase 4, HPFT exploits the reuse opportunities in the parsing tree. Consider the following vectorized loop L4, whose parsing tree is the input of phase 4 of HPFT. [...] generated by statement $S_1$ at instance J = 3 will be used by $S_2$ at instance J = 4. However, most supercomputers generate A(1:64, 3) in a vector register, say VR1, at J = 3 and then load A(1:64, 3) again into another vector register, say VR2, at J = 4. This is because VR1 must swap out A(1:64, 3) before executing J = 4 in order to hold the newly generated data A(1:64, 4). Likewise, arrays B and C generated by $S_2$ and $S_3$ will be used by statements $S_3$ and $S_1$, respectively. The reuse distance can be reduced by applying the index shift method [15] or the loop alignment method [1]. HPFT performs the index shift [15] procedure Shift on the EDG of L4 to exploit the reuse opportunities of arrays A, B, and C. Let the sets IN(S) and OUT(S) denote the sets of edges incoming to and outgoing from node S in the EDG, respectively. In the following, we first give the definition of the shift vector $d^e_{\max}$, which is used to reduce the dependence distance between two references. [...] The dependence distance of $d^e_2$ is accumulated onto $d^e_3$, as shown in Fig. 9(b). The iteration ranges of J for statements $S_1$, $S_2$, and $S_3$ are shown in Fig. 9(c). HPFT then gathers the intersection of these statements' iteration ranges into one loop.
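The effect of an index shift can be sketched on a small hypothetical loop; this is not loop L4 from the paper, and the two-statement loop, the function names, and the scalar elements standing in for vector sections are all assumptions made for illustration. In the original form, the second statement consumes a value that the first statement produced one iteration earlier. After shifting the second statement by one iteration and peeling the iterations outside the common range, producer and consumer meet in the same iteration, so the value could stay in a register.

```python
def original_loop(x, a0):
    """Reuse distance of one iteration: S2 reads a[j - 1]."""
    n = len(x)
    a = [a0] + [0] * (n - 1)
    b = [0] * n
    for j in range(1, n):
        a[j] = x[j] + 1      # S1 produces a[j]
        b[j] = a[j - 1] * 2  # S2 consumes last iteration's value
    return a, b


def shifted_loop(x, a0):
    """Same computation after shifting S2 by one iteration."""
    n = len(x)
    a = [a0] + [0] * (n - 1)
    b = [0] * n
    if n > 1:
        b[1] = a[0] * 2          # peeled first instance of S2
    for j in range(1, n - 1):    # intersection of both iteration ranges
        a[j] = x[j] + 1          # S1 produces a[j]
        b[j + 1] = a[j] * 2      # shifted S2 reuses it immediately
    if n > 1:
        a[n - 1] = x[n - 1] + 1  # peeled last instance of S1
    return a, b


x = [5, 1, 4, 2, 8, 3]
print(original_loop(x, 7) == shifted_loop(x, 7))
```

The peeled statements before and after the loop correspond to the iterations that fall outside the intersection of the statements' iteration ranges.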
After translation by HPFT, the parsing tree of loop L4 is reconstructed into another parsing tree equivalent to the following loop L4'. [...] HPFT automatically replaces array C in $S_1$ by an additional array variable TEMP. The Convex C3840 vector compiler is capable of eliminating the load operation of $S_1$ and the store operation of $S'_3$ for array TEMP, since the variable TEMP is loop invariant. Thus, the load operation for TEMP in $S_1$ and the store operation for TEMP in $S'_3$ are performed only once during the execution of loop J running from K to 639. Loop L4''' is superior to L4'' since the load operation of array C in statement $S_1$ of L4'' is saved in L4'''. On the Convex C3840, loop L4''' improves the execution time by 14.65% compared to L4''. As a result, the reuse opportunities of arrays A, B, and C are fully exploited by HPFT. Compared to L4, loop L4''' saves 45.71% of the execution time.
For a given complex EDG, the order in which the index shift operation is applied to the nodes (or statements) may affect the degree of reuse exploitation. Several heuristic rules [8] have been developed in the Select procedure. These rules are designed to achieve the following two principles. First, HPFT selects the node S for which applying the index shift operation Shift(S, $d^e_{\max}$, dir) reduces the maximum number of edges to a distance of zero or one iteration gap and marks the fewest nodes as 'unadjustable'. Since applying the index shift operation to an adjustable node can exploit at least one reuse opportunity, an adjustable node can be treated as a resource for reuse exploitation. Second, for vector register reuse, we want the dependence distance to be reduced to zero so that the memory load operation can be saved. Even if some reuse distances cannot be reduced to zero when the index shift operation is applied to node S, the reduced nonzero distance can still benefit cache or memory access since the reuse distance has been shortened. Thus, when applying the index shift operation to node S, we want the operation to shorten the reuse distance of as many dependence vectors in the set dir(S) as possible and to accumulate their distances onto as few dependence vectors in dir(S) as possible.
HPFT calls the Select procedure to determine which node should be selected first for the shift operation. For example, consider the EDG shown in Fig. 10(a). HPFT first selects $S_2$ for the index shift operation. Applying the index shift operation to $S_2$ first, to minimize dependence vector $d^e_1$, exploits the reuse opportunity given by $d^e_1$ and marks node $S_2$ as 'unadjustable'. Statement $S_3$ is still adjustable. In contrast, applying the index shift operation to $S_3$ first, to minimize $d^e_2$, exploits the reuse opportunity given by $d^e_2$ and marks nodes $S_2$ and $S_3$ as 'unadjustable'. As a result, statements $S_1$, $S_2$, and $S_3$ all become unadjustable. Since the adjustable nodes are the resources for reuse exploitation, HPFT first selects node $S_2$ for the index shift operation. Similarly, in Fig. 10(b), selecting $S_2$ first exploits the two reuse opportunities introduced by $d^e_1$ and $d^e_2$, whereas selecting $S_3$ first exploits only one opportunity. Thus, HPFT selects $S_2$ first.
In the reuse exploitation phase, HPFT translates the parsing tree into a form in which the reuse distance is reduced. Other techniques related to reuse exploitation, such as loop unrolling [24], loop rerolling [24], and temporary variable replacement [7], are also designed to cooperate with the index shift process. The reuse exploitation designed in the HPFT kernel also benefits multi-thread extraction, as discussed in a later subsection. [...] The 4 partitioned sets, starting at I = 1, I = 17, I = 33, and I = 49, can be assigned to 4 CPUs for concurrent execution, as shown in Fig. 11(a). Since the outermost loop J is a sequential loop, the 4 CPUs must be synchronized at each instance of J. The multi-threaded version of loop L5 has two disadvantages. First, no reuse opportunity is exploited in the vector registers of each CPU. Second, it requires 640 synchronizations, which can be reduced in another multi-threaded version translated by HPFT.
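The partition into four sets can be reproduced with a short Python sketch. It assumes a vector length of 64 (consistent with the sections A(1:64, ...) used in the examples) that divides evenly among the CPUs; the function name `partition_ranges` is hypothetical.

```python
def partition_ranges(vector_length, num_cpus):
    """Split the 1-based index range 1..vector_length into
    num_cpus contiguous chunks of equal size."""
    if vector_length % num_cpus != 0:
        raise ValueError("sketch assumes an evenly divisible length")
    chunk = vector_length // num_cpus
    return [(1 + k * chunk, (k + 1) * chunk) for k in range(num_cpus)]


# Four CPUs, vector length 64: chunks start at I = 1, 17, 33, 49.
for first, last in partition_ranges(64, 4):
    print(first, last)
```

Each CPU then executes the vector operations on its own chunk, and when the outer loop is sequential the CPUs must meet at a synchronization point once per outer iteration.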
HPFT first exploits the reuse opportunities of arrays A, B, and C, producing loop L4''' as discussed before. HPFT then transforms the abstract syntax tree of loop L4''' into another syntax tree equivalent to the following multi-threaded version. [...] The execution of loop L5' is shown in Fig. 11(b). L5' is superior to L5 in both the degree of reuse exploitation and the number of synchronizations. In loop L5', the reuse of arrays A, B, and C has been exploited. In total, 3 × (639 − 3 + 1) = 1911 vector loads are saved. In addition, loop L5' has only one synchronization. Compared to L5, loop L5' saves 69.6% of the execution time.
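The count of saved vector loads can be checked with a one-line calculation. The sketch below assumes, as the figure in the text suggests, that one vector load is saved per reused array in every iteration of J from 3 to 639; the function name is hypothetical.

```python
def vector_loads_saved(num_arrays, first_iter, last_iter):
    """One load saved per reused array per loop iteration
    (an assumption matching the count given in the text)."""
    return num_arrays * (last_iter - first_iter + 1)


# Arrays A, B, and C reused over iterations J = 3 .. 639.
print(vector_loads_saved(3, 3, 639))
```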
If the compiler applies the loop interchange technique to loop L5, the number of synchronizations can be reduced to one. However, even when loop interchange cannot be applied during multi-threading, as in the following example, HPFT can still improve the execution time by reducing the number of synchronizations. Consider the following program, which is extracted from the vector benchmark of NETLIB at NCHC.
Example 6: [...] An attempt to interchange loops I and J causes a semantic error, for the following reason. The vector data EQV4(3:25, 6) are used by a CPU, say $P_1$, during the execution of J = 2. The vector data EQV4(24:46, 6) are generated by another CPU, say $P_2$, during the execution of J = 6. To guarantee that the data element EQV4(24, 6) used by $P_1$ is the old value, loop J must be kept in the outermost position to ensure sequential execution. Partitioning the vector length of 90 into 4 sets produces 79 synchronizations. In the vectorization phase, HPFT translates loop L6 into the following vector form. [...]
Interchange the $m$th-level loop to the outermost position in the parsing tree.
Apply loop spreading to the parsing tree and automatically insert a TEMP variable to increase the degree of reuse exploitation.
Insert the multi-threading directive into the parsing tree so that the generated program has the directive Force Parallel before the outermost loop.
For a given program, HPFT first performs π-block decomposition and vectorization such that maximum benefit can be obtained from reuse exploitation. HPFT then inserts directives to explicitly define more or better vector operations from the sequential program. In the reuse exploitation phase, HPFT reconstructs the original program so that the number of vector load operations is reduced. Finally, HPFT inserts directives to define multiple threads with fewer synchronizations and generates another high-performance version of the loops for vector compilers. As discussed in the next section, with the assistance of HPFT, the compiler of the Convex C3840 usually obtains better performance.

4. Experimental results. In this section, the performance improvement of sequential programs translated by HPFT is measured on the Convex C3840 supercomputer. The current version of the Fortran compiler on the Convex C3840 system is 7.0. Two versions of each program are compared. For the first version, the original program written in Fortran 77 is taken as the input of the Convex vector compiler. The object code generated by the vector compiler is referred to as the original version. The other version is generated in two steps. First, the sequential program is taken as the input of HPFT. Second, the high-performance code translated by HPFT is taken as the input of the vector compiler of the Convex C3840. The object code generated by these two steps is referred to as the HPFT version.
The loops selected as original programs for testing can be roughly classified into three classes. The first class consists of vector benchmarks extracted from NETLIB at NCHC (National Center for High Performance Computing). Two benchmarks are measured. The first benchmark (referred to as benchmark 1) consists of 107 subroutines of loops that were originally designed for testing the vectorization capability of PFC [2, 3]. In total, 65 subroutines can be vectorized by the Convex vector compiler. These 65 subroutines are taken as the original programs, and the execution times of the two versions, the original version and the HPFT version, are compared. In total, 11 subroutines are improved by applying the HPFT system.
The second benchmark (referred to as benchmark 2) consists of 45 simple vectorization tests, 50 subscript tests, 8 tests that rearrange the structure of nested loops, 36 tests involving branching, 35 ambiguity-checking tests, 28 tests for external routines, and 49 other tests. Because the 45 simple vectorization tests focus mainly on general vectorization capabilities, most current vector compilers obtain good performance on them without the assistance of HPFT. Only the 8 tests that rearrange the structure of nested loops, which relate to the main focus of the HPFT design, are selected for testing. Two of the 8 selected loops show a significant improvement in execution time for the HPFT versions.
The second class of original programs consists of several libraries, including the subroutines of BLAS1 and BLAS2. The level 1 BLAS (Basic Linear Algebra Subprograms) and the level 2 BLAS perform vector/vector and matrix/vector operations, respectively. All subroutines of BLAS1 and BLAS2 are provided as library routines on most supercomputers. The BLAS1 and BLAS2 subroutines used as the original source are also stored in NETLIB at NCHC.
The third class consists of several application programs. Most of these programs are selected from numerical computation programs [17, 22]. In what follows, only those programs that are restructured by HPFT or into which HPFT inserts directives are listed in Table 1.
The experimental results for execution time and speedup for these three classes of programs are summarized in Table 2. Subroutines in the three measured classes whose two versions have the same execution time (HPFT is not activated when translating these subroutines) are not listed in Table 2.
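The relation between speedup and saved execution time can be made explicit with a short Python sketch. It assumes the conventional definition of speedup as the execution time of the original version divided by that of the HPFT version; the text does not state its definition, and the function names and timings below are hypothetical.

```python
def speedup(t_original, t_hpft):
    """Assumed definition: original time divided by HPFT time."""
    return t_original / t_hpft


def time_saved_percent(t_original, t_hpft):
    """Percentage of the original execution time that is saved."""
    return (1.0 - t_hpft / t_original) * 100.0


def speedup_from_saving(saved_percent):
    """Speedup implied by a given percentage of time saved."""
    return 1.0 / (1.0 - saved_percent / 100.0)


# Hypothetical timings: 10 s for the original, 5 s for the HPFT version.
print(speedup(10.0, 5.0), time_saved_percent(10.0, 5.0))
# A saving of 45.71% corresponds to a speedup of roughly 1.84.
print(round(speedup_from_saving(45.71), 2))
```

Under this definition, a subroutine whose two versions run in the same time has a speedup of 1 and a saving of 0%.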
Compared with the original version, the main factors that the HPFT version improves are classified into vectorization, memory conflict, locality, and the number of synchronizations. The effect of these factors is analyzed in Table 3. In vectorization, HPFT helps the vector compiler of the Convex C3840 extract more or better vector operations from several subroutines. For instance, the two statements $S_1$ and $S_2$ of subroutine S084 are decomposed into two π-blocks by HPFT. HPFT then vectorizes the first dimension of $S_1$ and the second dimension of $S_2$. In the original version, however, these two statements are not decomposed, and the dependence relation in $S_2$ forces $S_1$ to execute sequentially.
Some programs, such as S029, S030, S084, Loop 1 and Loop 2 of vector benchmark 2, FLSA, SLEGE2, and FDFT, are vectorized in the second dimension of their array operations. HPFT applies the memory conflict reduction scheme to reorganize the memory allocation. Consequently, array accesses during vector operations incur memory conflicts less frequently. The speedup of these programs thus improves significantly, as shown in Table 2.
In reuse exploitation, HPFT applies the index shifting, temporary vector variable, and loop spreading schemes to exploit the reuse opportunities. Reuse exploitation reduces the number of loads from shared memory to vector registers and frees the load/store functional units for other use. The execution times of subroutines S022, S023, S029, S030, S047, S048, S049, S084, S100, loops 1 and 2 in benchmark 2, BLAS2, and the application programs are thus improved. In synchronization, for some subroutines and programs the HPFT version executes with fewer synchronizations among CPUs than the original version.
The HPFT system extracts not only the implicit vector operations but also the reuse opportunities of a sequential program written in Fortran 77. The exploited vector operations and threads are explicitly defined for vector compilers by automatically inserting the proper compiler directives into the translated code. Experimental results show that a vector compiler working with the HPFT system usually produces more efficient code, so that users' programs complete earlier.

5. Conclusion. In this paper, we have described the design and implementation of HPFT. HPFT performs source-to-source translation and code tuning, transforming a sequential program into a form favorable for execution. Sequential programs thus complete earlier owing to more vector operations, a higher degree of reuse exploitation, and fewer synchronizations among CPUs. A performance evaluator and a menu-driven user interface are also provided in the HPFT system to offer users a friendly environment.
