Abstract
Introduction
As designers migrate toward the more economical software/hardware co-design paradigm, the role of the compiler in application-specific systems becomes more and more prominent. The increasingly popular ASIP (Application Specific Instruction-Set Processor), for example, becomes a much more effective tool if it is accompanied by a compiler capable of taking advantage of its application-specific properties. The ASIP offers a balance between the two extremes of ASICs (Application Specific Integrated Circuits) and general programmable processors. It offers the advantage of custom hardware for certain tasks (like the multiply-adder found in many DSP chips) as well as the flexibility of an instruction set. With a compiler capable of generating efficient code for its customized instruction set, the ASIP becomes a highly flexible application-specific tool, capable of being reconfigured in a short turnaround time.
The typical design of ASIP systems begins with the processor. Application-specific hardware extensions, which the designer anticipates will improve the performance of a system, are added to extend a base processor. Next, the compiler is developed (if the ASIP is even to have a compiler), using various techniques to try to take advantage of the new hardware extensions. There are two problems with this process. The first problem is that the designer must decide what hardware extensions will be most beneficial for the system without knowing how well the compiler is going to be able to take advantage of those extensions. The second problem is that the compiler writer is left with the difficult task of translating a high-level language program into efficient code that takes advantage of the dedicated features of the ASIP, a task that may not always be possible to a satisfactory degree.
We are advocating a different kind of relationship between an ASIP and its compiler. This relationship should be a symbiotic one, with hardware and compiler working together cooperatively to produce the most efficient system possible for the least cost. To this end, it is imperative that the compiler become integrated into the design process, assisting the designer in determining how best to customize the processor for a given set of applications. This is particularly true for optimizing compilers today, which may be capable of much more than simple translation. Optimizations like loop pipelining and beyond-basic-block scheduling are capable of altering the program graph in non-obvious ways, potentially exposing opportunities for performance enhancement that were not visible before.
In this paper, we present a framework for relating advanced compiler optimizations to the design of an ASIP. The compiler is used to assess the hardware needs of a suite of applications to which the ASIP is to be tuned, providing feedback to the designer to aid in the selection of hardware extensions to the processor. By incorporating the compiler into the design process, the designer is provided with more information about the potential performance of the system at an earlier stage.
Related Work
There have been several projects which incorporated the compiler into the design process. In [1], the high-level synthesis component (called Piper) of the ADAS design system generates a reorder table for use by the compiler of the system. This reorder table is a way to communicate information about the application-specific properties of the chip being designed to the compiler in an automatic fashion. This system helps make the compiler development an easier task (even automatic), but the communication is one way: there is no feedback from the compiler provided in the design process. The PEAS-I system, presented in [2], begins with a set of candidate instructions and, based on results of an application program analyzer, chooses the subset of instructions which provides the best performance for the applications under consideration, under the constraints of chip area and power consumption. This system uses the compiler as part of the design process, but only in a limited way: it restricts the candidate instructions to be those that could be generated by the compiler as it was written.
There have also been several different approaches proposed recently on how best to automatically customize an instruction set processor to a particular application type. One approach, taken by Holmer in [3], is to begin with a completely defined datapath, and then design the instruction set by combining the micro-operations defined by the datapath into instructions to optimize the cycle count and code size of a set of benchmarks. In [4], an instruction-set matching and selection methodology for DSP and ASIP code generation is presented. This methodology provides a way for the compiler to take advantage of the application-specific properties of an ASIP, but does not look at communication from the compiler to the ASIP design. Finally, in [5], a technique called 'bundling' is presented for generating an instruction set for an ASIP based on the results of an analysis tool. This work also looks at chainable sequences that can be detected in the control-data flow graph of a program. It does not, however, look at the effect of compiler optimizations on these sequences.
Overview of the proposed approach
Our approach is based on incorporating the compiler into the design process. Using modern optimization techniques, compilers are capable of program-graph-altering transformations which may expose properties of an application that are not immediately obvious. These exposed properties can then potentially be exploited by providing application-specific hardware to take advantage of them. The complete process, as depicted in Figure 1, starts with an optimizing compiler gathering information about a sample benchmark set. This information is provided to the ASIP design stage, where application-specific hardware is synthesized based on the information provided by the compiler. The final product is the customized ASIP, as well as an optimizing compiler customized to the ASIP by incorporating the optimizations that were used in the analysis phase. This general scheme provides a framework for the integration of the compiler into the design process. The next step is to define more precisely the flow of information between the compiler and the ASIP design. Because of the diversity of compiler optimizations and customized hardware for ASIPs, it is difficult to define this information flow in a general fashion. We will start, instead, by looking at one hardware optimization in particular, and exploring how that optimization can be used more effectively with input from the compiler.
Operator chaining: an initial study
As an initial study into the effectiveness of incorporating compiler feedback into the ASIP design process, we have chosen to take a common application-specific optimization, operator chaining [6], and use a compiler to aid in the detection of operation sequences which would be best implemented as chained operations. The MAC (multiply and accumulate) instruction found in many DSP processors (like the TMS320C5x from Texas Instruments [7]) is an example of a chained operation. Data is passed directly from one operation to the next, avoiding the overhead of storing the intermediate result back to a register file, as well as the fetching and decoding of an additional instruction.
In order for operator chaining to be an effective optimization for a given application, the application must have operation sequences with data flow between each operation, matching the chained operations implemented in the ASIP. For example, DSP applications often have a frequently occurring sequence consisting of a multiply operation whose result is used as an operand to an addition operation. Thus, DSP applications can often take advantage of the MAC instruction available in many DSP processors.
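The data-flow condition described above can be illustrated with a minimal Python sketch. It is not the analyzer used in this work: the `Op` record, the register names, and the rule that only the first reader of a result is considered are all assumptions made for illustration, and the code only scans a single straight-line block of 3-address operations.

```python
from typing import NamedTuple, Tuple

class Op(NamedTuple):
    """Hypothetical 3-address operation: dest = opcode(srcs)."""
    opcode: str
    dest: str
    srcs: Tuple[str, ...]

def find_chainable_pairs(block, first="mul", second="add"):
    """Return (i, j) index pairs where op j is the first reader of op i's
    result, op i has opcode `first`, and op j has opcode `second`."""
    pairs = []
    for i, producer in enumerate(block):
        if producer.opcode != first:
            continue
        for j in range(i + 1, len(block)):
            consumer = block[j]
            if producer.dest in consumer.srcs:
                # Data flows from the producer's result into this operand.
                if consumer.opcode == second:
                    pairs.append((i, j))
                break  # simplification: only the first reader is examined
            if consumer.dest == producer.dest:
                break  # result overwritten before any use
    return pairs

block = [
    Op("mul", "t1", ("a", "b")),
    Op("add", "acc", ("acc", "t1")),   # consumes t1: a MAC candidate
    Op("mul", "t2", ("c", "d")),
    Op("sub", "x", ("t2", "e")),       # consumes t2, but is not an add
]
print(find_chainable_pairs(block))
```

Only the first multiply qualifies here, because its result feeds an addition directly; the second multiply feeds a subtraction and would need a different chained instruction.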
Since the chaining of operations depends on the ordering of the instructions, we have chosen to (initially) explore the relationship between advanced parallelizing compiler optimizations and the detection of chainable operation sequences. By taking advantage of the compiler's ability to move operations around in the program, we provide the designer with a much broader range of possibilities when selecting which operations to implement as chained sequences. The approach will be to use a parallelizing compiler to compile a suite of application programs, and then perform an analysis on the operation sequences that are available for implementation as chained sequences. Once the analysis is complete, the results can be used by the system designer to determine which operation sequences to implement as chained instructions.
By using parallelizing compiler optimization techniques, we are able to perform much more extensive sequence analysis. Previous efforts to identify frequently occurring sequences in programs were restricted to the operation ordering created by the compiler, which is derived from the sequential statements in the high-level language in which the application was programmed [8].
Our analysis adds the ability to alter the program graph of the compiled application through the use of advanced instruction-level parallelizing scheduling techniques. In particular, we utilize a technique called percolation scheduling [9], which provides a set of semantics-preserving transformations allowing the movement of operations both within and across the basic blocks of a program (constrained, of course, by the data dependencies in the program). This allows us to search a much broader set of possibilities for potential sequences because we are no longer constrained by the sequential nature of the source program.
Experiments
Our initial experiments to test the effectiveness of incorporating compiler feedback into the ASIP design process concentrated on relating parallelizing compiler optimizations to the detection of chainable sequences. Using feedback from the optimizing compiler, we were able to uncover a large number of potential sequences with relatively high frequencies which would be suitable for implementation as chained operations. For an operation sequence to be suitable, it must have data flow from the result of a preceding operation to an operand of a succeeding operation. Only those sequences exhibiting this property were considered during these experiments.
The analysis of the benchmark programs was conducted as shown in Figure 2. In step 1, benchmark source programs were first compiled by a front-end compiler, a version of the GNU C Compiler (gcc) which was modified to generate 3-address code. This 3-address code, together with the sample data, was then used as input to a simulator (step 2) to provide profile information for each of the applications. The resulting 3-address code with profile information was then optimized in step 3 using the UCI VLIW compiler [10]. Three levels of optimization were performed: 1) no optimization, 2) full optimization with loop pipelining and percolation scheduling but without register renaming, and 3) full optimization with loop pipelining, percolation scheduling, and register renaming. Each of the resulting optimized program graphs was then fed to the sequence analyzer (step 4), which performed a branch and bound search on the graph to detect all potential operation sequences.
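To make step 4 more concrete, the following Python sketch enumerates operation sequences of a fixed length by following data-flow edges in a small graph. It is a plain depth-first enumeration standing in for the branch and bound search, whose bounding criteria are not described here; the graph encoding (`opcodes`, `flow_edges`) is a hypothetical one chosen for illustration.

```python
def enumerate_sequences(opcodes, flow_edges, length):
    """List opcode sequences of the given length along data-flow paths.

    opcodes:    dict mapping node id -> opcode string
    flow_edges: dict mapping node id -> list of node ids that consume
                the node's result as an operand
    """
    found = []

    def extend(path):
        if len(path) == length:
            found.append(tuple(opcodes[n] for n in path))
            return
        for succ in flow_edges.get(path[-1], []):
            if succ not in path:          # do not revisit a node (loops)
                extend(path + [succ])

    for node in opcodes:
        extend([node])
    return found

# A multiply feeding an add, whose result feeds a shift.
opcodes = {1: "mul", 2: "add", 3: "shift"}
flow_edges = {1: [2], 2: [3]}
print(enumerate_sequences(opcodes, flow_edges, 2))
print(enumerate_sequences(opcodes, flow_edges, 3))
```

Exhaustive enumeration grows quickly with sequence length and fan-out, which is presumably why a bounded search is preferable on full program graphs.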
Benchmarks
For our initial experiments, we selected a set of DSP benchmarks on which to perform our analysis. These benchmarks constitute a wide range of DSP applications, ranging from a complete implementation of edge detection using two-dimensional convolution, to a simple b-spline stream filter. Table 1 lists each benchmark along with its description and the data used as input to the benchmark. Several of these benchmarks were adapted from examples presented in [11].
Results
The sequence detection analysis outlined in section 5 was performed for each of the benchmarks in Table 1. The analysis was performed for sequences of length two, three, four, and five. The result of the analysis for each benchmark was a set of sequences suitable for implementation as chained operations. Each sequence has an associated dynamic frequency: the percentage of execution time accounted for by that sequence, calculated from the profile information collected in step 2 of the sequence detection process (see Figure 2).
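One way such a dynamic frequency could be computed from profile data is sketched below in Python. The exact formula is not given in the text, so this is an assumption: each occurrence of a sequence is weighted by its profiled execution count and by the number of operations it contains, with a uniform (hypothetical) cost of one cycle per operation, and the result is expressed as a percentage of the total cycle count.

```python
from collections import Counter

def dynamic_frequencies(occurrences, total_cycles, cycles_per_op=1):
    """Percentage of execution time attributed to each sequence.

    occurrences:   iterable of (sequence_tuple, execution_count) pairs,
                   one per static occurrence found in the program graph
    total_cycles:  total cycles executed in the profiled run
    cycles_per_op: assumed uniform cost of one operation
    """
    cycles = Counter()
    for seq, exec_count in occurrences:
        cycles[seq] += len(seq) * cycles_per_op * exec_count
    return {seq: 100.0 * c / total_cycles for seq, c in cycles.items()}

occurrences = [
    (("mul", "add"), 100),    # e.g. inside an inner loop
    (("mul", "add"), 50),     # a second static occurrence
    (("add", "shift"), 25),
]
freqs = dynamic_frequencies(occurrences, total_cycles=1000)
print(freqs)
```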
Combined benchmark sequence analysis results
This first set of figures shows the frequencies of all sequences detected across all of the benchmarks combined. This information was collected by performing sequence detection for each individual benchmark, and then combining the results of all the benchmarks together. The analysis was performed using three levels of optimization in the compiler so that the effects of the optimizations on sequence detection could be assessed. Figures 3 and 4 show the effects of the three levels of optimization which were used during the sequence analysis. In these graphs, the horizontal axis represents the individual sequences found, sorted in order of decreasing frequency. The vertical axis represents the dynamic frequency of each sequence, or how much of the computation time this particular sequence accounted for during the simulations. Results from the length three and five sequence detection analyses are omitted to save space.
We found that performing percolation scheduling and loop pipelining during the analysis phase significantly improved the detection of operation sequences. The code motions allowed us to see data flow occurring across the boundaries of basic blocks and detect that data flow as potential sequences. We also found, however, that register renaming tended to have a negative effect on the detection of sequences. Register renaming is an effective optimization for moving operations as high as possible in a program, so that they may be scheduled earlier and thus take advantage of any parallel resources available. When this optimization was used during sequence analysis, however, it tended to move operations which had data flow between them (i.e., those operations suitable for chained operation implementation) away from each other, leaving them to communicate only through the renamed register and thus eliminating the potential operation sequence.
The actual sequences which were detected for the complete set of benchmarks during sequence analysis confirmed some widely held beliefs regarding what operation sequences are most beneficial for DSP applications, in addition to uncovering some surprises which may not have been considered before as potential chained instructions for DSP applications. Table 2 shows several example sequences and their corresponding frequencies using the three levels of optimization.
As expected, the multiply-add sequence occurred with relatively high frequency, verifying that the MAC instruction is indeed a good choice for DSP processors. In addition to the MAC instruction, however, there were several other operation sequences which also occurred with high frequency. The add-multiply sequence, although it did not occur very often naturally in the code, did exist with high frequency after using code motions to expose it. The majority of these sequences were found in loops which had been pipelined. There was often an addition in one iteration of the loop whose result was then used by a multiply in the next iteration of the loop. This data flow was not detected by the straightforward analysis, but was uncovered by using loop pipelining.
Individual benchmark results
The next set of results shows the number and frequencies of the sequences detected for each benchmark individually. The results for each benchmark are listed separately within each graph, showing the individual sequences detected for that benchmark. Sequences whose dynamic frequency was less than 5% were not reported (again, results from length three and five sequences are omitted to save space).
Sequence Coverage
Another way to measure the effectiveness of sequence detection is to determine the coverage obtained by implementing a set of chained operation sequences. The highest coverage using the fewest number of operation sequences will be the best solution, both in terms of area and speed. In order to compare the coverage that could be obtained both with and without compiler optimizations, we used the sequence detection analyzer tool to iteratively uncover the sequences with the highest frequency. Once the sequence with the highest frequency was found for a given benchmark, the sequence detection analyzer tool was run again, this time ignoring any occurrences of the high-frequency sequence already found. This process continued iteratively until no sequences of any significant percentage were left to uncover. The analysis was performed both with and without the parallelizing optimizations in order to assess the impact of the optimizations on sequence detection.
We found that by using feedback from our optimizing compiler, we were able to achieve higher coverage rates with fewer operation sequences than could have been achieved without the compiler input. The results of these analyses (on a subset of the benchmarks) are presented in Table 3.
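The iterative procedure can be approximated by a greedy loop, sketched here in Python under several assumptions: occurrences are precomputed rather than rediscovered by rerunning the analyzer, each occurrence records the set of operation ids it uses, occurrences of the same sequence do not overlap one another, and the 5% cutoff for a "significant" sequence is a hypothetical default.

```python
def greedy_coverage(occurrences, total_cycles, threshold=5.0):
    """Pick sequences in order of decreasing frequency.

    occurrences: list of (sequence, frozenset_of_op_ids, cycles) tuples.
    Assumes occurrences of the same sequence do not share operations.
    Returns (chosen, coverage): chosen is a list of (sequence, percent).
    """
    remaining = list(occurrences)
    chosen, covered_ops, coverage = [], set(), 0.0
    while remaining:
        totals = {}
        for seq, _, cycles in remaining:
            totals[seq] = totals.get(seq, 0) + cycles
        best = max(totals, key=totals.get)
        pct = 100.0 * totals[best] / total_cycles
        if pct < threshold:
            break                      # nothing significant is left
        chosen.append((best, pct))
        coverage += pct
        for seq, ops, _ in remaining:
            if seq == best:
                covered_ops |= ops     # these ops are now chained
        # Drop every occurrence that reuses an already chained operation.
        remaining = [o for o in remaining if not (o[1] & covered_ops)]
    return chosen, coverage

occurrences = [
    (("mul", "add"), frozenset({1, 2}), 300),
    (("add", "shift"), frozenset({2, 3}), 200),   # shares op 2 with mul-add
    (("sub", "mul"), frozenset({5, 6}), 100),
    (("mul", "add"), frozenset({7, 8}), 100),
]
chosen, coverage = greedy_coverage(occurrences, total_cycles=1000)
print(chosen, coverage)
```

In this example the add-shift occurrence is discarded once multiply-add is selected, since its addition has already been absorbed into the chained instruction; this mirrors the step of ignoring occurrences of sequences already found.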
Conclusion
We have presented a framework for providing feedback from the compiler to the design of ASIPs. This framework entails relating individual compilation techniques to hardware optimizations and extensions in the development of ASIPs. By using the compiler in the design process, the ASIP designer is presented with a wider range of possibilities, and the assurance that the hardware extensions chosen will be used effectively by the compiler. We also presented results of an initial study on using parallelizing compiler techniques to detect operation sequences suitable for implementation as chained instructions. This study showed that the use of the compiler in assessing the hardware needs of an application can be particularly effective.
We are currently exploring the relationship between an ASIP and its compiler by looking at additional compiler optimizations and how they can impact the choice of hardware extensions for ASIP designs. In particular, we are interested in providing feedback on the use of multiple-issue instruction-set architectures by characterizing the instruction level parallelism of an application suite using compiler optimizations.
Table 3 column headings: Benchmark | Opt | Sequences | Frequency | Coverage
