We propose a new universal High-Level Information (HLI) 
Introduction
High-performance microprocessors increasingly rely on parallel operations to speed up program execution. Recent superscalar processors fetch multiple instructions, dynamically find independent instructions from a set of reservation stations (or a window of instructions), and issue them in parallel to multiple function units [12, 13, 14]. Extensive research is under way to increase the exploitable instruction-level parallelism (ILP) in a program and to widen the issue bandwidth from 4 to 8 instructions per cycle, or even up to 16 [19, 23]. Additionally, researchers have begun to explore thread-level parallelism, in which multiple threads of instructions can be simultaneously fetched by different thread-execution units for processing [22, 27, 30, 31]. These thread-execution units are more tightly coupled than processors in a multiprocessor system in that the order of instruction dispatching and retiring is tightly synchronized among different units.
A low-level intermediate representation (LIR) has served microprocessor compilers reasonably well so far, mainly because sufficient ILP can often be exploited among scalar operations within a relatively narrow program scope. Such parallelism can either be detected by hardware without code transformation, or it can be exposed relatively simply through code motion done by the instruction scheduler in the compiler.
To uncover additional parallel operations to feed the increasing hardware parallelism in the future, however, the compiler needs to analyze a wider program scope and higher-level data structures. The compiler must perform high-level program semantic analysis regarding arrays and pointers, such as analysis of data dependences, aliases, data flow, loop level parallelism, and summary use-modification information for procedure calls. Only a high-level IR (HIR) contains the necessary abstract syntax information to support this extensive analysis.
The LIR must continue to exist since the low-level machine operations are the instruction scheduling target. Indeed, some low-level operations may not even have a direct equivalent in the HIR. Hence, the problem becomes one of passing high-level semantic information from the HIR to the LIR.
We have designed and implemented a format, called High-Level Information (HLI) [4], to facilitate the propagation of high-level semantic information from the HIR to the LIR. Several considerations have influenced our design:
Transportability: The HLI can be exported from a high-level analyzer, such as those used in sophisticated parallelizing compilers, to a microprocessor compiler that does not contain HIR and thus lacks a high-level analyzer.
Hierarchy: The information regarding data dependences and aliases is organized in a hierarchy corresponding to the loop structures in the program. This reduces the complexity of the represented information and makes access to such information easier for a back-end compiler.
Flexibility: The HLI can be updated if the program is modified after the HLI is produced, as occurs with many back-end optimizations, such as statement reordering. We have implemented and experimented with a maintenance utility for the HLI to perform this updating.
In the remainder of the paper, Section 2 presents the formal definition of the HLI format, showing what information is extracted using the HIR and how it is condensed and passed to the LIR. Section 3 then describes our prototype implementation of this HLI within the SUIF parallelizing compiler and the GCC compiler back-end. Experiments with the SPEC benchmark programs [29] are presented in Section 4, showing how the use of the high-level information provided by SUIF improves the dependence information available to GCC and the execution time of benchmark programs. Some related work is discussed in Section 5, with our results and conclusions summarized in Section 6.
High-Level Information Definition
A High-Level Information (HLI) file for a program includes information that is important for back-end optimizations, but is only available or computable in the front-end (see Figure 1). The HLI focuses on memory references and function calls, which are called items in the HLI representation. In the line table, each line entry corresponds to a source line of the program unit in the source file, and includes an item list for the line. In the item list, each item entry consists of an ID field and a type field. The ID field stores a number, unique within the scope of the program unit, that is used to reference the item. The type field stores the access type of the item, which may be a load, a store, or a function call.
Line table
Groups of items from the front-end are mapped to the back-end instructions by matching their source line numbers. However, this mapping information may not be precise enough to map items inside a group (i.e., a single source line) from the front-end to the back-end. To perform precise mapping, the front-end needs to know the instruction generation rules of the back-end and the order of items associated with each source line. Specifically, the order of items listed in the line table must match the order of the items appearing in the instruction list in the back-end.
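The order-based mapping described above can be illustrated with a minimal sketch. The names `Item`, `LineEntry`, and `map_items`, and the opaque instruction labels, are hypothetical and are not taken from the HLI implementation; the sketch only assumes that the items of a line and the back-end memory references of the same line appear in the same order.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    item_id: int  # unique within the program unit
    access: str   # "load", "store", or "call"


@dataclass
class LineEntry:
    line: int                                  # source line number
    items: list = field(default_factory=list)  # items in back-end order


def map_items(line_table, backend_refs):
    """Pair HLI items with back-end memory references, line by line.

    backend_refs maps a source line number to the back-end instructions
    (opaque labels here) in the order they appear in the instruction list.
    """
    mapping = {}
    for entry in line_table:
        refs = backend_refs.get(entry.line, [])
        if len(refs) != len(entry.items):
            raise ValueError(f"line {entry.line}: item/instruction count mismatch")
        # Matching is purely positional, which is why the orders must agree.
        for item, ref in zip(entry.items, refs):
            mapping[item.item_id] = ref
    return mapping


line_table = [
    LineEntry(10, [Item(1, "load"), Item(2, "store")]),
    LineEntry(11, [Item(3, "call")]),
]
backend_refs = {10: ["insn_a", "insn_b"], 11: ["insn_c"]}
mapping = map_items(line_table, backend_refs)
```

Because the pairing is positional, a count mismatch on any line is treated as an error here rather than guessed around.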
Region table
To simplify the representation of the high-level information while maintaining precise data dependence information for each loop, we represent the high-level information of a program unit with scopes of regions. A region can be a program unit or a loop and can include sub-regions. The basic idea of using region scopes in the HLI is to partition all of the memory access items in a region into equivalent access classes and then describe data dependences and alias relationships among those equivalent access classes with respect to the region. As mentioned above, equivalent access classes use the IDs of sub-regions' equivalent access classes to refer to the items residing in their sub-regions. For example, the equivalent access class of sum in Region 1 uses the equivalent access class of sum defined in Region 3 to refer to memory access items 13 and 14 enclosed by Region 3.
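As a rough illustration of how a class in an outer region can refer to items through the classes of its sub-regions, consider the following sketch. The class name `AccessClass` and the directly owned item 5 are assumptions made for the example; only items 13 and 14 in Region 3 come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class AccessClass:
    name: str
    items: list = field(default_factory=list)        # item IDs directly in this region
    sub_classes: list = field(default_factory=list)  # classes defined in sub-regions

    def all_items(self):
        """Collect every item this class covers, including sub-regions."""
        found = list(self.items)
        for sub in self.sub_classes:
            found.extend(sub.all_items())
        return sorted(found)


# Class of `sum` in Region 3 holds items 13 and 14.
sum_r3 = AccessClass("sum@region3", items=[13, 14])
# Class of `sum` in Region 1 refers to them through the Region 3 class.
# Item 5 is a hypothetical access located directly in Region 1.
sum_r1 = AccessClass("sum@region1", items=[5], sub_classes=[sum_r3])
```

The outer class never lists items 13 and 14 itself, which keeps each region's table small.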
Alias table
Two memory references referring to two distinct names in the same program unit may actually access the same memory location in the same execution instance of the unit. Such names are known as aliases. For convenience, we say that those two memory references are aliased. If the two references belong to different equivalent access classes in the same region, we likewise say that the two classes are aliased. An alias table is created for each region to describe the possible alias relationships among the equivalent access classes of that region. If two equivalent access classes are marked as aliased, all of the memory access items represented by the two equivalent access classes are considered aliased.
Recall that an immediate reference and an embedded reference are not placed in the same equivalent access class in the same region. If they may access the same memory locations in that region, they are also marked as aliased. Note that aliasing is a binary relation between two equivalent access classes: A and B being aliases and B and C being aliases does not imply that A and C must be aliases. This is the primary reason that the HLI does not place all aliased references in a single equivalent access class. For a loop region, the alias table only describes the alias relationships among the equivalent access classes within a loop iteration. Loop-carried data dependences are described in the LCDD table.
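The non-transitivity of aliasing can be made concrete by storing the alias table as a set of unordered pairs rather than as merged groups. This is only a sketch; `AliasTable` and its methods are hypothetical names, not part of the HLI interface.

```python
class AliasTable:
    """Per-region alias table stored as unordered pairs of class names."""

    def __init__(self):
        self._pairs = set()

    def mark_aliased(self, a, b):
        # frozenset makes the pair symmetric: (a, b) equals (b, a).
        self._pairs.add(frozenset((a, b)))

    def may_alias(self, a, b):
        # Only explicitly recorded pairs count; nothing is inferred.
        return frozenset((a, b)) in self._pairs


table = AliasTable()
table.mark_aliased("A", "B")
table.mark_aliased("B", "C")
```

Here `table.may_alias("A", "C")` is false even though both classes are aliased with B, which is the behavior a single merged class could not express.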
In the example in 
Loop-carried data dependence (LCDD) table
If the region is identified as a loop, the LCDD table will list all of the LCDDs at that loop level.
Loop-carried data dependences are represented by pairs of equivalent access classes defined at the region. Each pair specifies a data dependence arc caused by the loop. The data dependence type can be definite or maybe. In addition, each dependence pair includes a distance field. To simplify the representation of the dependence distance, the direction of a dependence is always normalized to be '>' (forward), that is, from an earlier iteration to a later iteration.
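One possible way to hold an LCDD entry with a normalized direction is sketched below. The `LCDD` record and the `normalize` helper are hypothetical, and the rule of swapping the pair when a raw distance is negative is our assumption about how such normalization could be done, not a description of the implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LCDD:
    source: str    # class accessed in the earlier iteration
    sink: str      # class accessed in the later iteration
    kind: str      # "definite" or "maybe"
    distance: int  # iterations from source to sink, always > 0


def normalize(first, second, kind, distance):
    """Build an entry whose direction is always earlier -> later."""
    if kind not in ("definite", "maybe"):
        raise ValueError("unknown dependence type")
    if distance == 0:
        raise ValueError("distance 0 is not loop-carried")
    if distance < 0:
        # A backward arc is the same dependence seen from the other end.
        first, second, distance = second, first, -distance
    return LCDD(first, second, kind, distance)


entry = normalize("a_store", "a_load", "definite", -2)
```

After normalization every entry can be read the same way, so a consumer never has to check the sign of the distance.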
For the example shown in Figure 2 
Function call REF/MOD table

A Prototype Implementation
A version of the HLI described in the previous section has been implemented in the SUIF parallelizing compiler [33] and the GCC back-end compiler [28]. This section discusses some of the implementation details. Note, however, that the HLI format is platform-independent, and many of the implemented functions are portable to other compilers. Figure 3 shows an overview of our HLI implementation [4] in the SUIF compiler and GCC.
Front-end implementation
The HLI generation in the front-end contains two major phases: memory access item generation (ITEMGEN) and HLI table construction (TBLCONST). The ITEMGEN phase generates memory access items and assigns a unique number (ID) to each item. The memory access items for a source line, ordered by the ID, can be one-to-one matched to the memory reference instructions in the GCC RTL chain (footnote 1) for the same line. These items are annotated in the SUIF expression nodes to be passed to the TBLCONST phase.
1 RTL (Register Transfer Language) is an intermediate representation used by GCC that resembles Lisp lists [28]. An RTL chain is the linked list of low-level instructions in the RTL format.
The TBLCONST phase first collects the memory access item information from the SUIF annotations to produce the line table for each program unit. It then generates the equivalent access table, alias table, and LCDD table for each region. Because only the ITEMGEN phase depends on both the back-end compiler and the target machine, separating the HLI generation into these two phases allows us to reuse the TBLCONST code across different back-end compilers or target machines.
Memory access item generation (ITEMGEN)
The ITEMGEN phase traverses the SUIF internal representation (IR) to generate memory access items.
It passes this memory access item information to the TBLCONST phase by annotating the SUIF IR.
To guarantee that the mapping between the generated memory access items and the GCC RTL instructions is correct, the RTL generation rules in GCC must be considered in the HLI generation by SUIF.
Most of the memory access items correspond to variable accesses in the source program. However, when the optimization level is above -O0, GCC assigns a pseudo register for a local scalar variable or a variable used for temporary computation results. An access to this type of variable does not generate a memory access item. Since GCC does not assign pseudo registers to global variables and aggregate variables, they generate memory access items.
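The rule in this paragraph can be written as a small predicate. This is a simplified sketch: the function name and parameters are hypothetical, compiler temporaries are not modeled, and the behavior at -O0 (local scalars stay in memory) is inferred from the wording above rather than stated explicitly.

```python
def generates_memory_item(is_global, is_aggregate, opt_level):
    """Decide whether a variable access yields a memory access item.

    opt_level is the numeric GCC optimization level (0 for -O0).
    """
    if is_global or is_aggregate:
        # No pseudo register is assigned, so the access goes to memory.
        return True
    # Local scalar: held in a pseudo register above -O0 (assumed to be
    # a memory access at -O0).
    return opt_level == 0


local_scalar_at_o2 = generates_memory_item(False, False, 2)
global_scalar_at_o2 = generates_memory_item(True, False, 2)
```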
There are some memory access items produced in GCC that do not correspond to any actual variable accesses in the source program. These memory accesses are used for parameter and return value passing in subroutine calls. The actual number of parameter registers available is machine dependent. For each subroutine, GCC uses the parameter registers to pass as many parameters as possible, and then uses the stack to pass the remaining parameters. Hence, at a subroutine call site, if a memory value is passed to the subroutine via a parameter passing register, a memory read is used to load the value into the register. If a register value is passed to the subroutine via the stack, however, a memory write is generated to store the value to the stack. Similarly, at a subroutine entry point, if a memory value is passed into the subroutine via a register, a memory write is generated to store the value. If a register value is passed into the subroutine via the stack, though, a memory read is again used to load the value from the memory to the register.
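The four cases above can be summarized in a short sketch. The function names and string labels are hypothetical, and the number of parameter registers is left as an argument because it is machine dependent. Combinations that the text does not describe are rejected rather than guessed.

```python
def assign_params(n_params, n_regs):
    """Registers are used first; remaining parameters go to the stack."""
    return ["register" if i < n_regs else "stack" for i in range(n_params)]


def param_access(site, value_home, passed_via):
    """Return the extra memory access for one parameter, or None.

    site:       "call" (call site) or "entry" (subroutine entry point)
    value_home: "memory" or "register", where the value lives at that site
    passed_via: "register" or "stack", how the parameter is transferred
    """
    if value_home == "memory" and passed_via == "register":
        # Call site loads into the register; entry stores it back out.
        return "load" if site == "call" else "store"
    if value_home == "register" and passed_via == "stack":
        # Call site stores to the stack; entry loads from it.
        return "store" if site == "call" else "load"
    if value_home == "register" and passed_via == "register":
        return None  # no memory traffic is needed
    raise ValueError("case not described in the text")


placement = assign_params(6, 4)
```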
A subroutine return value can also generate memory accesses that do not correspond to any variable accesses in the source program. One register is available to handle return values in the MIPS architecture [14], which we target in our implementation. When the returned value is a structure, the address of the structure is stored in that register at the subroutine call site. In this case, the return statement generates a memory write to store the return value to the memory location indicated by the value return register. If the return value is a scalar, the value return register directly carries the value, so no memory access is generated.
HLI table construction (TBLCONST)
The HLI table construction phase traverses the SUIF IR twice. The first traversal creates a line table for each routine by collecting the memory access item information from the SUIF annotations. It also creates a hierarchical region structure for each routine and groups all the memory access items in a region into equivalent access classes.
The second traversal of the IR visits the hierarchical region structure of each routine in a depth-first fashion. At each node, it gathers the LCDD information for each pair of equivalent access classes and calculates the alias relationship between each pair of equivalent access classes. All of the information propagates from the bottom up. If the SUIF data dependence test for a pair of array equivalent access classes in a region returns zero distance, the two equivalent access classes are merged. Otherwise, the test results are stored into the LCDD table. Then, all the pointer references that may refer to multiple locations are determined. An alias relationship is created between the equivalent access class for each pointer reference and the equivalent access class to which the pointer reference may refer. Next, the equivalent access class information and alias information is propagated to the immediate parent region. At the completion of these two phases, the HLI is ready to be exported to the back-end.
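The bottom-up flow of the second traversal can be sketched as follows. This is a toy model under several assumptions: `Region`, `build_tables`, and `toy_dep_test` are hypothetical names, the dependence test is a lookup table standing in for the SUIF test, and the pointer alias step is omitted.

```python
from itertools import combinations


class Region:
    def __init__(self, name, classes, children=()):
        self.name = name
        self.classes = [frozenset([c]) for c in classes]  # equivalent access classes
        self.children = list(children)
        self.lcdd = []                                    # (class_a, class_b, distance)


def build_tables(region, dep_test, visit_order):
    # Visit sub-regions first so that information flows bottom-up.
    for child in region.children:
        build_tables(child, dep_test, visit_order)
        region.classes.extend(child.classes)  # propagate to the parent
    visit_order.append(region.name)

    # Merge any two classes whose dependence test reports distance zero.
    merged = True
    while merged:
        merged = False
        for a, b in combinations(region.classes, 2):
            if dep_test(region.name, a, b) == 0:
                region.classes.remove(a)
                region.classes.remove(b)
                region.classes.append(a | b)
                merged = True
                break

    # Remaining non-zero distances become loop-carried dependences.
    for a, b in combinations(region.classes, 2):
        distance = dep_test(region.name, a, b)
        if distance is not None:
            region.lcdd.append((a, b, distance))


# Hypothetical test results, keyed by region and unordered reference pair.
DISTANCES = {
    ("inner", frozenset({"a[i]:load", "a[i]:store"})): 0,
    ("inner", frozenset({"a[i]:store", "a[i-1]:load"})): 1,
}


def toy_dep_test(region_name, a, b):
    found = [DISTANCES[(region_name, frozenset({m, n}))]
             for m in a for n in b
             if (region_name, frozenset({m, n})) in DISTANCES]
    return min(found) if found else None


inner = Region("inner", ["a[i]:load", "a[i]:store", "a[i-1]:load"])
outer = Region("outer", ["n:load"], [inner])
order = []
build_tables(outer, toy_dep_test, order)
```

In this toy run the two `a[i]` references are merged in the inner loop, and the merged class carries a distance-1 dependence with the `a[i-1]` load.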
Back-end implementation
Importing and mapping HLI into GCC
The HLI file is read on demand as GCC compiles a program function by function. This approach eliminates the need to keep all of the HLI in memory at the same time, relieving the memory space requirements on the back-end. The imported information is stored in a separate, generic data structure to enhance portability. Mapping the items listed in the line table onto memory references in the GCC RTL chain is straightforward since the ITEMGEN phase in the front-end (Section 3. 
Using HLI
Information in the HLI can be utilized by a back-end compiler in various ways. Accurate data dependence information allows aggressive scheduling of a memory reference across other memory references, for example. Additionally, LCDD information is indispensable for a cyclic scheduling algorithm such as software pipelining [18] . Interprocedural information provides the back-end compiler more freedom to move memory references around function calls. High-level program structure information, such as the line type, may provide hints to guide heuristics for efficient code scheduling.
To provide a common interface across different back-ends, the stored HLI can be retrieved only via a set of query functions. There are five basic query functions that can be used to construct more complex query functions [5]. There is also a set of utility functions that simplify the implementation of the query and maintenance functions (Section 3.
In GCC's Common Subexpression Elimination (CSE) pass, subexpressions are stored in a table as the program is compiled, and, when they appear again in the code, the already calculated value in the table can directly replace the subexpression. Without interprocedural information, however, all the subexpressions containing a memory reference will be purged from the table when a function call appears in the code, since GCC pessimistically assumes that the function can change any memory location. In Figure 4, an HLI query function to obtain call REF/MOD information is used to remedy the situation by selectively purging the subexpressions on a function call.
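The idea of selective purging can be illustrated with a simplified model that is independent of GCC's actual data structures. The table layout, the function `purge_on_call`, and the way the MOD set is supplied are all assumptions made for this sketch; it is not the code of Figure 4.

```python
def purge_on_call(cse_table, mod_set=None):
    """Return the CSE table entries that survive a function call.

    cse_table maps a subexpression to the set of memory locations it reads.
    mod_set is the set of locations the callee may modify, as a REF/MOD
    query might report; None models the case where no HLI is available.
    """
    if mod_set is None:
        # Pessimistic: every entry that touches memory is purged.
        return {e: locs for e, locs in cse_table.items() if not locs}
    # Selective: purge only entries that read something the callee may modify.
    return {e: locs for e, locs in cse_table.items() if locs.isdisjoint(mod_set)}


cse_table = {
    "r1 + r2": set(),         # registers only
    "a[i] * 2": {"a"},        # reads array a
    "b[j] + 1": {"b"},        # reads array b
}
without_hli = purge_on_call(cse_table)
with_hli = purge_on_call(cse_table, mod_set={"b"})
```

With the MOD set available, the entry that reads `a` survives the call, while the pessimistic version keeps only the register-only entry.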
The example in Figure 5 shows how the HLI provides memory dependence information to the instruction scheduler. It is used in Section 4.2 to measure the effectiveness of using HLI to improve the code scheduling pass. 
Maintaining HLI
As GCC performs various optimizations, some memory references can be deleted, moved, or generated. These changes break the links between HLI items and GCC memory references set up at the mapping stage, requiring appropriate actions to reestablish the mapping to respond to the change. Further, some of the HLI tables may need updating to maintain the integrity of the information. Typical examples of such optimizations include:
In the CSE pass, an item may be deleted. The corresponding HLI item must then be deleted.
In the loop invariant removal optimization, an item may be moved to an outer region. The HLI item must be deleted and inserted in the outer region. All the HLI tables must be updated accordingly.
In loop unrolling, the loop body is duplicated and preconditioning code is generated. All of the HLI components (tables) must be reconstructed using the old information, and the old information must then be discarded.
The HLI maintenance functions have been written to provide a means to update the HLI in response to these changes [5]. The functions allow a back-end compiler to generate or delete items, let one item inherit the attributes of another, insert an item into a region, and update the HLI tables.
Changes such as CSE or loop invariant code removal call for a relatively simple treatment: either deleting an item, or generating, inheriting, moving, and deleting an item. Loop unrolling, however, requires more complex steps to update the HLI. First, new items need to be generated as the target loop body is duplicated multiple times. The generated items are inserted in different regions, based on whether they belong to the new (unrolled) loop body or the preconditioning code. Data dependence relationships between the new items are then computed using the information from the original loop.
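To give a feel for how dependences among the duplicated items can be derived from the original loop, the sketch below applies standard index arithmetic: if original iteration i maps to copy k of unrolled iteration I (i = factor * I + k), an arc of distance d from copy k reaches copy (k + d) mod factor, at unrolled distance (k + d) div factor. This is our own illustration, not necessarily the procedure used by the HLI maintenance functions, and it ignores the preconditioning code. The function name is hypothetical.

```python
def unroll_dependences(lcdds, factor):
    """Derive dependences for a loop body unrolled `factor` times.

    lcdds: list of (source, sink, distance) arcs with distance > 0.
    Returns (carried, intra): arcs still carried by the unrolled loop, and
    arcs that now fall inside a single unrolled iteration.
    Items are named (original_name, copy_index).
    """
    carried, intra = [], []
    for source, sink, distance in lcdds:
        for k in range(factor):
            target_copy = (k + distance) % factor
            new_distance = (k + distance) // factor
            arc = ((source, k), (sink, target_copy))
            if new_distance == 0:
                intra.append(arc)                      # same unrolled iteration
            else:
                carried.append(arc + (new_distance,))  # still loop-carried
    return carried, intra


carried, intra = unroll_dependences([("S", "L", 1)], factor=2)
```

For a distance-1 arc and an unroll factor of 2, one copy of the arc becomes a dependence inside the unrolled body and the other stays loop-carried with distance 1.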
An example of updating the HLI tables for the loop unrolling pass is given in Figure 6 .
Benchmark Results
To demonstrate how the HLI can be used, we experimented with several programs taken from the SPEC benchmark suite and other sources. High-level information is passed from the SUIF front-end to the GCC back-end for each program. We then measured the reduction in the number of dependence arcs identified by the GCC memory dependence checking routines when using HLI compared to using only GCC's normal dependence checking capabilities. We also measured the improvement in execution time made possible when GCC used the HLI to improve several of its back-end optimization passes within basic blocks.
2 Our implementation uses the SUIF parser twice (see Figure 3). After the program foo.c is compiled and optimized by SUIF, the optimized C file foo.opt.c is generated. This code is then used as the input to the HLI generation and GCC. When foo.opt.c is fed into the SUIF parser again for the HLI generation, it causes unrecoverable errors in some cases. We are currently developing a front-end compiler that will eliminate such difficulties.
Program characteristics
Aiding GCC's dependence analysis
The HLI can potentially enhance several back-end optimizations by providing more accurate memory dependence information when GCC would otherwise have to make a conservative assumption due to its simple dependence analysis algorithm. Four optimizations in GCC were instrumented with the HLI to utilize this more accurate memory dependence information to improve the performance of the resulting code.
Instruction scheduling (Sched) is an important code optimization in a back-end compiler. With this optimization, instructions in a code segment are reordered to minimize the overall execution time. A crucial step in instruction scheduling is to determine if there is a dependence between two memory references when at least one is a memory write operation. Accurately identifying such dependences can reduce the number of edges in the data dependence graph, thereby giving the scheduler more freedom to move instructions around to improve the quality of the scheduled code.
In the Common Subexpression Elimination (CSE) pass, all subexpressions that reference memory are removed from the table since GCC must assume that any memory reference will change these subexpressions. Distinguishing memory references according to the data dependence information provided by the HLI keeps in the table those subexpressions whose memory is independent of the current memory reference.
In the loop invariant code removal (Loop) pass, a memory reference can be moved out of a loop only when there are no other memory references in the loop that could possibly be aliased with the current memory reference. Since HLI will potentially reduce the number of data dependences within the loop, more memory instructions could be taken out of the loop. This then increases the likelihood of a memory operation becoming loop invariant.
In the local register allocation (Local) pass, the first step is to find the symbolic registers that are equivalent to a single value throughout the compilation. These are grouped into a single register. In this step, there is one special case: after a register is defined, it is stored to a memory location within the same basic block. If no other memory store between these two instructions could be aliased with that memory location, the two instructions can be combined into one and the register can be eliminated completely. This reduces the total number of instructions and allows the register allocator more degrees of freedom. The more accurate data dependence information provided by the HLI can help in finding more of these types of instruction pairs.
For the programs tested, Figure 7 compares the total number of dependence queries made (i.e., do A and B refer to the same memory location?), the number of times the GCC analyzer answers yes (meaning that it must assume there is a dependence), the number of times HLI answers yes, and the number of times both GCC and HLI answer yes. The values shown are normalized to the total number of dependence queries. The subsections of each bar correspond to each of the optimization passes studied. Since the height of the bars in the figure corresponds to the number of data dependences that must be assumed, the lower the bar, the more accurate the corresponding analyzer. While the figure shows the normalized number of queries, Table 2 provides the absolute number of total queries and the number of queries made in each pass.
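The normalized quantities compared in Figure 7 can be computed from a log of query answers as in the sketch below. The function `tally` and the sample answers are hypothetical and serve only to show the arithmetic; they are not measured data.

```python
def tally(queries):
    """Normalize yes-counts to the total number of dependence queries.

    queries: list of (gcc_yes, hli_yes) booleans, one pair per query.
    """
    total = len(queries)
    gcc = sum(1 for g, h in queries if g)
    hli = sum(1 for g, h in queries if h)
    both = sum(1 for g, h in queries if g and h)
    return {"gcc": gcc / total, "hli": hli / total, "both": both / total}


# Made-up answers for four queries, purely for illustration.
sample = [(True, True), (True, False), (True, False), (False, False)]
ratios = tally(sample)
```

A dependence has to be assumed only when both analyzers answer yes, so the "both" ratio is the one that bounds what the optimizer must respect.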
The results show that using HLI can reduce the number of data dependences substantially. Most of the programs exhibited a reduction of over 60% in the number of dependences that must be assumed when using HLI compared to GCC's standard analysis. Floating-point programs obtain larger reductions than integer programs on average. Table 2 shows that for both integer and floating-point programs, over half of the data dependence 
Impact on program execution times
To study the performance improvement attributable to using HLI in GCC's four optimization passes, the benchmark programs were compiled in six different ways: without HLI, using HLI in all four optimization passes (All), using HLI in the Sched pass only (Sched), using HLI in the CSE pass only (CSE), using HLI in the Loop pass only (Loop), and using HLI in the Local pass only (Local). The integer programs achieved relatively small speedups compared to the floating-point programs.
It is known that the basic blocks in integer programs are usually small, containing only 5-6 instructions on average. Furthermore, it is likely that each basic block contains few memory references. This is indirectly evidenced by comparing the total number of dependence queries made in the different programs tested (Table 2). Typically, an integer program requires fewer than one fourth the number of dependence tests needed by a floating-point program.
To summarize, using the HLI in the four optimization passes tested reduced the number of dependence edges by over 60% in the programs tested. This reduction translated into a moderate improvement in execution time. Expanding the scope of these optimizations beyond basic blocks can be enhanced by using the HLI. This expansion should lead to a more substantial reduction in execution times.
[3], Panorama [11], and PTRAN [25], and commercial Fortran compilers, such as KAP [16] and VAST [32], have taken such a source-to-source approach.
Related Work
Computer vendors generally provide their own compilers to take a source program, which has been parallelized by programmers or by a parallelizing compiler, and generate multithreaded machine code, i.e., machine code embedded with thread library calls. These compilers usually spend their primary effort on enhancing the efficiency of the machine code for individual processors. Once the thread assignment to individual processors has been determined, parallelizing compilers have little control over the execution of the code by each processor.
Over the past years, both machine independent and machine specific compiler techniques have been developed to enhance the performance of uniprocessors [6, 7, 15, 20, 21] . These compiler techniques rely primarily on dataflow analysis for symbolic registers or simple scalars that are not aliased.
Advanced data dependence analysis and data flow analysis regarding array references and pointer dereferences are generally not available to current uniprocessor compilers. The publicly available GCC [28] and LCC [9] compilers exemplify the situation. They both maintain low-level IRs of the input programs, keeping no high-level program constructs for array data dependence and pointer-structure analysis.
With the increased demand for ILP, the importance of incorporating high-level analysis into uniprocessor compilers has been generally recognized. Recent work on pointer and structure analysis aims at accurate recognition of aliases due to pointer dereferences and pointer arguments [8, 34] .
Experimental results in this area have been limited to reporting the accuracy of recognizing aliases.
Compared with these studies, this paper presents new data showing how high-level array and pointer analysis can improve data dependence analysis in a common uniprocessor compiler. the low-level analysis and optimizations are largely unavailable today. Our effort has taken a different approach by providing a mechanism to transport high-level analysis results to uniprocessor compilers using a format that is relatively independent of the particular parallelizing compiler and the particular uniprocessor compiler.
Conclusions and Future Work
Instead of integrating the front-end and back-end into a single compiler, this paper proposes an approach that provides a mechanism to export the results of high-level program analysis from the front-end to a standard back-end compiler. This high-level information is transferred using a well-defined format (HLI) that condenses the high-level information to reduce the total amount of data that must be transferred. Additionally, this format is relatively independent of the particular front-end and back-end compilers.
We have demonstrated the effectiveness of this approach by implementing it within the SUIF front-end and the GCC back-end compilers. Our experiments with the SPEC benchmarks show that using this information in four optimization passes of GCC substantially reduces the number of data dependences compared with using only GCC's standard dependence analysis algorithm. The increased flexibility provided by this reduction allowed GCC to improve execution time compared to using only the low-level information normally available in GCC.
We expect that the HLI mechanism proposed in this paper will make it relatively easy to integrate any existing front-end parallelizing compiler with any existing back-end compiler. In fact, we are currently developing a new front-end parallelizing compiler that will use the HLI mechanism to export high-level program information to the same GCC back-end implementation used in these experiments. The HLI will be used more extensively in back-end optimizations besides those done within basic blocks. We believe that global optimizations will give the HLI more potential to improve the performance of applications. Furthermore, compilers for future wide-issue processor architectures, such as the Multiscalar architecture [27], the Superthreading architecture [30], and the Trace processor [22], may benefit substantially from HLI when generating highly optimized code to exploit the available hardware parallelism.
