Instruction-level parallelism in a single stream of code for non-numerical applications has been the subject of much recent research. This work extends the analysis to symbolic applications described with logic programming.
Introduction
Architectural approaches that exploit instruction-level parallelism from a single instruction stream play an important role in the improvement of uniprocessor performance.
In particular, advances in compilation techniques [3, 11] have demonstrated that VLIW architectures [4] can reach interesting performance on numerical code, but show considerable limitations when applied to non-numerical applications. This has led to the development of dynamic scheduling approaches [7] that try to identify parallelism at execution time through the use of more complex control parts.
Instruction-level parallelism in non-numerical applications has been the subject of recent research [1, 8, 12, 14], which has produced different results about the amount of exploitable concurrency, depending on the adopted computational paradigm and the hypotheses considered on data and control dependencies.
As most research has focused on imperative languages (e.g. C and Fortran), the purpose of this work is to extend the analysis to another computational paradigm applied to the class of non-numerical applications. In particular, we have chosen the logic programming paradigm, and Prolog as the target language, since it represents an interesting alternative way to approach a symbolic problem.
As recent studies have shown [6, 13], the performance of sequential Prolog is getting closer to that of imperative languages. From this standpoint, there is a renewal of interest in using Prolog for non-numerical applications. On the other hand, the trend in general purpose computation toward instruction-level parallelism poses the question whether a logic programming approach can expose the same amount of parallelism as imperative languages.
Previous work [2] on static parallelism in Prolog has demonstrated that global compilation techniques (i.e. Trace Scheduling) can only reach degrees of parallelism between 2 and 3. This is due to the nature of the abstract execution model of the language and, in particular, to:
• the difficulty of operating a successful alias analysis, because of the absence of array data structures;
• the difficulty of managing loops that operate on pointer data structures, and in particular the impossibility of knowing the effect of a traversal;
• the high frequency of branch instructions (14% [2]), which requires aggressive speculative execution to exploit concurrency.
The purpose of this work is to report a set of evaluations and to show tradeoffs among the various options available in a dynamic scheduling approach to Prolog. We are also interested in finding out what the real advantages of dynamic over static scheduling are, and under which hypotheses one of the two approaches is better. Finally, we want to determine whether the conclusions reported for imperative languages about instruction-level parallelism can be applied to other computational paradigms like logic programming.
Although the research on compilation techniques to restructure and parallelize programs with pointer data structures is producing interesting results [5], there is still much work to be done. Alternatively, we can evaluate whether dynamic scheduling approaches can reach a worthwhile speed-up under realistic hypotheses.
1.1
Dynamic Scheduling Architectures
Dynamic scheduling machines are architectures that include a mechanism to identify the instructions that can be executed in parallel at run-time [7]. The current interest in dynamically scheduled machines arises from both commercial and theoretical reasons, as parallelism extraction is transparent to the user and more precise information is available at run-time than at compile time.
On the other hand, the complexity of the control circuitry for the dynamic scheduling algorithm, the time limitation for its execution, and the fact that parallelism can come from instructions far away in the code are problems still to be solved.
We can identify two phases in the dynamic scheduling process:
• Parallelism Extraction: operates transformations to increase the possibility of picking independent operations, according to the program dependencies.
• Resource Mapping: maps the operations to the machine resources, according to the architectural constraints.
In this work, we only consider the Parallelism Extraction phase, assuming that the target architecture is always able to execute all the parallel operations that we can generate, up to a certain amount. This simplification is necessary to distinguish between the factors deriving from the lack of parallelism and those generated by the machine resources.
2
The Simulation Framework
Datapath Model
We consider a simplified datapath model of the architecture, equivalent to the datapath of a VLIW machine, and composed of a given number of units that can execute any instruction in a single cycle.
Fetch: we follow the flow graph of the program and copy instructions to the fetch buffer. When we encounter a branch, we follow the flow along the most probable direction. To this end, we use static information about the branch probability obtained by a previous run of the program with the same input data. When an instruction is an indirect jump, we normally stop the fetch process, unless the corresponding prediction option is enabled, as we will detail in the following paragraphs.
Dependency Analysis: we build the data dependency graph of the fetched instructions, by considering all dependencies (true, anti, output and memory), except control dependencies, which are handled by the speculative execution mechanism.
Issue: we identify the instructions that are data-ready and assign them to the resources until the resource limit is reached, thus building a parallel microinstruction that will be executed in a single step. The selection of the instructions to be executed is guided by a simple linear heuristic (based on the original program position). As we assume a simplified resource model, we do not need sophisticated heuristics (like list-scheduling), which should however be considered with more realistic resource configurations.
The following paragraphs detail the different available simulation options and the techniques that are used to implement them.
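The dependency analysis and issue phases described above can be sketched as follows. This is a minimal illustration and not the simulator itself: the `Instr` record and the functions `depends` and `schedule_window` are hypothetical names, branches and memory dependencies are left out, and renaming is assumed to be disabled, so that true, anti and output register dependencies all constrain the schedule.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    pos: int                      # original program position
    dst: Optional[str]            # register written, or None
    srcs: Tuple[str, ...] = ()    # registers read


def depends(later, earlier):
    """True if `later` must wait for `earlier` (true, anti or output dependency)."""
    true_dep = earlier.dst is not None and earlier.dst in later.srcs
    anti_dep = later.dst is not None and later.dst in earlier.srcs
    out_dep = later.dst is not None and later.dst == earlier.dst
    return true_dep or anti_dep or out_dep


def schedule_window(window, units):
    """Return a list of cycles; each cycle lists the instructions issued together."""
    pending = sorted(window, key=lambda i: i.pos)   # linear heuristic: program order
    cycles = []
    while pending:
        issued = []
        for ins in pending:
            if len(issued) == units:                # resource limit reached
                break
            # Data-ready: no dependency on an earlier instruction still pending.
            blocked = any(p.pos < ins.pos and depends(ins, p) for p in pending)
            if not blocked:
                issued.append(ins)
        cycles.append(issued)
        pending = [p for p in pending if p not in issued]
    return cycles
```

An instruction that depends on another one issued in the same cycle is held back until the next cycle, which matches the single-cycle unit model; the earliest pending instruction is always ready, so the loop terminates.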
Speculative Execution
The simulator supports speculative execution of instructions. If an instruction originally after a branch is executed before the branch, the values of its output registers are saved in a "checkpoint stack" corresponding to the branch. If the instruction is a memory write operation, we also save the content of the accessed memory location. With this approach, when a branch is mispredicted, it is possible to recover the correct state before continuing execution. The recovery process involves two activities:
• Bookkeeping: we must terminate the execution of all the instructions originally before the branch, to preserve the program semantics. This is necessary since the issue procedure might execute the branch before other instructions that precede it in the original program and that generate variables live off the branch. The name "bookkeeping" derives from the similarity of this phase to the copy mechanisms in a global compiler [3].
• Rollback: we must "undo" the actions of all the instructions that have been speculatively executed before the branch. This is done by restoring the values stored in the branch checkpoint stack. This operation requires a time overhead that is parameterized in the simulator. When not specified, we assume an (optimistic) overhead of 1 cycle for each failed branch and 1 cycle for each recovered memory operation.
This algorithm enables out-of-order execution of any instruction operating on registers, including out-of-order branches with no additional mechanism.
Note that, to achieve the same behavior with a compiled approach, we risk an exponential growth of the number of copied instructions, as shown in [3]. For this reason, many compiled techniques constrain the branches to be executed in-order (sometimes in parallel).
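The checkpoint and rollback mechanism can be illustrated with a small sketch. It assumes that the checkpoint records the values overwritten by each speculative write (the text does not spell out the exact layout), and the names `Checkpoint`, `Machine`, `spec_write_reg`, `spec_write_mem` and `rollback` are hypothetical. The default penalties mirror the optimistic overhead of 1 cycle per failed branch and 1 cycle per recovered memory operation.

```python
_ABSENT = object()   # marks a register or location that had no value before


class Checkpoint:
    """Values overwritten by instructions speculated past one branch."""

    def __init__(self):
        self.saved = []   # (kind, key, old_value), in execution order


class Machine:
    def __init__(self):
        self.regs = {}
        self.mem = {}

    def spec_write_reg(self, cp, reg, value):
        cp.saved.append(("reg", reg, self.regs.get(reg, _ABSENT)))
        self.regs[reg] = value

    def spec_write_mem(self, cp, addr, value):
        cp.saved.append(("mem", addr, self.mem.get(addr, _ABSENT)))
        self.mem[addr] = value

    def rollback(self, cp, branch_penalty=1, mem_penalty=1):
        """Undo speculative writes in reverse order; return the cycle overhead."""
        cycles = branch_penalty
        for kind, key, old in reversed(cp.saved):
            store = self.regs if kind == "reg" else self.mem
            if old is _ABSENT:
                store.pop(key, None)
            else:
                store[key] = old
            if kind == "mem":
                cycles += mem_penalty
        cp.saved.clear()
        return cycles
```

Restoring in reverse order matters when the same register or location is written more than once after the branch: the oldest saved value is the one that must survive.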
Memory Alias Analysis
The algorithm for speculative execution does not yet allow memory operations to be executed out of order. In fact, we need to maintain data dependencies among memory accesses every time we cannot determine that there is no alias. As we have outlined in the previous section, there are no established techniques that can be successfully applied to alias analysis of Prolog code. For this reason, it is unrealistic to assume a perfect alias detection at compile time.
With dynamic scheduling, we can operate a run-time alias analysis: we eliminate the memory dependency between two memory instructions as soon as we have computed their addresses and have observed that they are not the same. This technique is less powerful than a perfect static alias analysis, as the decision can be taken only after the address computation (see Figure 2); however, it can be extremely useful when a compiler cannot disambiguate.
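The run-time test can be sketched as a predicate on a pair of memory operations. The `MemOp` record and the function `must_stay_ordered` are hypothetical names; the rule that two loads never conflict is a standard assumption added here, not something stated in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemOp:
    is_store: bool
    addr: Optional[int] = None   # None until the address has been computed


def must_stay_ordered(first: MemOp, second: MemOp) -> bool:
    """True while the memory dependency between the two operations must be kept."""
    if not (first.is_store or second.is_store):
        return False             # two loads never conflict
    if first.addr is None or second.addr is None:
        return True              # address unknown: assume a possible alias
    return first.addr == second.addr
```

The dependency is dropped only once both addresses are known and differ, which is why the decision comes later than with a perfect static analysis.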
Renaming
An important issue in the exploitation of instruction-level parallelism is the reduction of output dependencies. The simulator supports a renaming option, to remove all output dependencies among instructions of the fetch window before the dependency analysis. In our model, we assume an infinite number of registers.
We note that the combination of speculative execution and dynamic renaming can be the source of potential semantic errors, unless handled properly. In fact, when a branch is mispredicted, it is necessary to recover the effect of renaming so that the rest of the code receives the correct values in the correct registers. To do this, we need to record the lifetime of every renamed register and to move the last correct value to the original location before proceeding to the rollback phase and the subsequent execution.
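Renaming over a fetch window can be sketched as below. The function `rename`, the `(dst, srcs)` instruction encoding and the `p0, p1, ...` naming scheme are hypothetical, and the unbounded counter stands in for the infinite register assumption.

```python
def rename(window):
    """Rename every destination in a window of (dst, srcs) tuples, in program order.

    Returns the renamed window and the final map from each original register
    to the name holding its latest value.
    """
    current = {}    # original register -> latest renamed register
    counter = 0
    renamed = []
    for dst, srcs in window:
        # Sources are mapped first, so `r1 = r1 + 1` reads the old r1.
        new_srcs = tuple(current.get(s, s) for s in srcs)
        new_dst = None
        if dst is not None:
            new_dst = f"p{counter}"
            counter += 1
            current[dst] = new_dst
        renamed.append((new_dst, new_srcs))
    return renamed, current
```

After renaming, no two instructions in the window write the same register, so only true dependencies remain. The returned map is the kind of record needed to copy the last correct value back to its original register when a branch is mispredicted.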
Predicting Indirect Jumps
Prediction of branches is fundamental but not sufficient to expose all parallelism in a program. A valuable amount of concurrency could be lost if we are forced to stop fetching instructions when we encounter an indirect jump operation. This phenomenon is emphasized in Prolog due to the massive presence of recursive procedures that are difficult to expand. The different techniques that have been proposed to predict indirect jumps [9] are typically based on the correspondence of CALL/RETURN statements. Unfortunately, in Prolog it is possible that the return address of a procedure is changed within the called procedure, in case of selective backtracking or "cut" mechanisms.
As the research on efficient prediction techniques for indirect jumps in Prolog is beyond the purposes of our work, we have implemented a simple algorithm that predicts the address based on the value of the jump target present in the corresponding register during the fetch. The guess is therefore correct every time the called procedure does not change its return address.
Indirect jumps derive mainly from Prolog proceed instructions in case of success and fail procedures in case of failure. Proceed instructions jump to the content of the Continuation Pointer register, which is set by the caller and gives way to correct guesses. Fail instructions jump to the content of the Failure Address, which is loaded from the stack and causes a misprediction.
This simple technique reaches a misprediction probability of 2&50% (increasing with the fetch window size). The average branch misprediction probability is much lower (around 10%) and approximately independent of the window size.
Combining Arithmetic Instructions
The last technique we consider involves a simple program restructuring. Sometimes, renaming is not sufficient to expose all the parallelism: this happens, for instance, in loops where the iterations cannot be completely overlapped due to dependencies among the induction variables of different iterations. In this case the sequentiality can be eliminated only if we generate the induction variables in parallel for all the iterations contained in the fetch window. This procedure is well-known and widely adopted in the unrolling phase of a vectorizing compiler, and it can also be implemented in a dynamically scheduled architecture. We can recognize the associativity of pairs of operations and combine them in a semantically equivalent form to increase the parallelism (Figure 3). We call this operation "combining" of operations, and, obviously, combining without renaming is meaningless. The simulator is able to combine simple arithmetic operations (additions and subtractions) when one operand is constant, during the renaming procedure every time an output dependency is encountered.
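A minimal sketch of combining, under the assumption that the window has already been renamed so that every register is written once. The function `combine` and the `(dst, src, const)` encoding for `dst = src + const` are hypothetical; a subtraction is represented by a negative constant.

```python
def combine(window):
    """Rewrite a renamed chain of add-constant instructions.

    Each instruction is re-expressed in terms of the root register of its
    chain, using the associativity of addition.
    """
    origin = {}   # register -> (root register, accumulated constant)
    out = []
    for dst, src, const in window:
        root, acc = origin.get(src, (src, 0))
        origin[dst] = (root, acc + const)
        out.append((dst, root, acc + const))
    return out
```

For example, the chain `p1 = r1 + 1; p2 = p1 + 1; p3 = p2 + 1` becomes `p1 = r1 + 1; p2 = r1 + 2; p3 = r1 + 3`: the three instructions now depend only on `r1` and can issue in the same cycle instead of in three successive cycles.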
The Oracle machine
As a term of comparison for our experiment, we also support an ideal execution model, known as oracle [8, 10], in the simulator. The Oracle model performs perfect memory alias analysis, perfect renaming, perfect branch/jump prediction and has an unlimited fetch window.
In the simulator, the performance of the oracle machine is computed by assigning the execution time to an operation as soon as its operands are ready. In this way, we eliminate all control-flow dependencies, as well as all output and anti-dependencies. Obviously, the oracle model is unrealistic from an implementation point of view.
The only optimization that we do not consider in the Oracle computation is the combining of arithmetic operations, since it involves a restructuring of the program.
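The oracle computation amounts to a dataflow-limited schedule of the executed instruction trace. The sketch below assumes single-cycle operations and a trace of `(dst, srcs)` tuples; the function name `oracle_cycles` and the encoding are hypothetical. Memory locations could be handled in the same way by treating each address as a name, which corresponds to perfect alias analysis.

```python
def oracle_cycles(trace):
    """Return the number of cycles needed when only true dependencies constrain.

    `trace` is the sequence of executed instructions, each a (dst, srcs) tuple.
    """
    ready = {}   # name -> cycle in which its latest value becomes available
    last = 0
    for dst, srcs in trace:
        # An operation executes one cycle after its latest operand is ready.
        t = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        if dst is not None:
            ready[dst] = t
        last = max(last, t)
    return last
```

Because a later write simply replaces the entry for its destination, reuse of a register never delays an instruction, which is the effect of perfect renaming. The oracle speed-up is then the trace length divided by the returned cycle count.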
The Experiment
This section describes the experimental framework. To measure parallelism we used a set of small-sized Prolog benchmarks, and we ran the simulations on a wide range of configurations, obtained by varying the fetch window size and the different optimization options.
Our benchmark set is extracted from the standard Aquarius Prolog Suite [13] and consists of programs that cover a wide range of symbolic applications, including list management, theorem proving, rule-based systems and database search.
The benchmarks are compiled from Prolog to an intermediate code for the Berkeley Abstract Machine (BAM) execution model by means of the Aquarius compiler [6, 13] .
The BAM code of each program is translated to our instruction set, and we run a profiling simulation (for a sequential machine) to collect the statistics about branch probability to be used for dynamic prediction.
Results and Analysis
Following the guidelines of [1, 14], we have defined a set of configurations of increasing complexity, which we indicate with the acronyms described in Table 1. We also include two static approaches: St-bb, limited to basic blocks, and St-ts, with the Trace Scheduling technique. The simulations were run for each of the six dynamic models with different fetch window sizes ranging from 4 to 256. All the configurations assume a maximum number of instructions per cycle (40) that can be considered unlimited for our applications.
In the following paragraphs we present the results of the simulations of the benchmarks and the different machine configurations. We show performance measurements in terms of speed-up and the influence of the overhead for mispredicted branches.
Speed-up
As we are not proposing a particular machine model, we express performance in terms of speedup versus a purely sequential execution, measured by dividing the number of sequential cycles by the number of parallel cycles. Figure 4 shows a log-linear graph of the harmonic mean of the speed-up of all benchmarks for each considered configuration as a function of the fetch window size. Table 2 shows the details for most of the configurations.
Compilation techniques constrained to basic blocks are outperformed by a dynamic approach even with small fetch windows and simple configurations. The performance of global compilation techniques (i.e. Trace Scheduling) is competitive with some of the dynamic approaches. With the base dynamic configuration (B) we need a window size of at least 128 instructions to get an improvement. Alternatively, with more complex configurations (BR, BMR), fetching 8-16 instructions is sufficient to perform better than a static approach. In both cases, if we consider implementation cost issues, we need to introduce a remarkable complexity (either in terms of dynamic code optimizations or fetch window size), in contrast with the architectural simplicity of the static compilation approach. The most important optimizations seem to be Renaming and the combination of Renaming with Memory alias analysis. If we do not operate Renaming, performance is not substantially better than a static 
Effects of Branch Misprediction Overhead
All previous data have been obtained by assuming that we lose only one cycle for every mispredicted branch and for every recovered memory instruction incorrectly speculated. These hypotheses imply a rather complex machine with the capability of changing simultaneously the state of all the internal registers to a previously saved checkpoint (one for every branch in the fetch window). To identify a possible performance bottleneck, we can evaluate the influence of a larger overhead (in terms of cycles) for the rollback phase. We have computed the values of speedup for the best configuration (BMRIC) by assuming a cost for every mispredicted branch/jump of 1 (default), 5, 10 and 20 cycles. The average results on the benchmarks (Figure 5) show that there is a dramatic performance degradation as the overhead increases. Moreover, in the presence of heavy overheads (10 and 20 cycles), an increase in the fetch window size causes a performance decrease. This can be explained by the fact that, although the global prediction probability remains substantially unaltered, the absolute number of mispredicted branches increases with a wider window.
The performance degradation factor is rather disappointing, and indicates that even the most complex optimizations are in vain if the architecture does not enable a fast recovery of mispredicted branches.
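The sensitivity to the rollback cost can be seen with a deliberately simplified model, which is an assumption made for illustration and not the simulator's accounting: the parallel execution time is taken as the number of issue cycles plus a fixed penalty for each misprediction. The function name and its parameters are hypothetical.

```python
def speedup_with_penalty(seq_cycles, issue_cycles, mispredictions, penalty):
    """Speed-up when every misprediction adds `penalty` cycles to the parallel run."""
    parallel_cycles = issue_cycles + mispredictions * penalty
    return seq_cycles / parallel_cycles
```

In this model the penalty term grows with the absolute number of mispredictions, so a wider window that issues more wrong-path work can lose more cycles in recovery than it gains in issue cycles.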
Conclusions
In this paper we have presented an analysis of different dynamic scheduling techniques to increase the instruction-level parallelism of single-stream architectures for Prolog applications.
We have measured that it is possible to reach a speedup between 2 and 7 for the considered benchmarks, with an average sustained parallelism around 5 for the most complex configurations. We have also observed that there is a saturation for most benchmarks when the window size is around 128. Saturation happens with much smaller window sizes (32-64) without the important optimizations of renaming and memory alias analysis.
A significant consideration that emerges from the analysis of the results is the importance of the accuracy in the prediction algorithm. The combination of an imperfect prediction technique and the high frequency of branch instructions in Prolog establishes a solid barrier to parallelism. Regardless of the fetch window size, the actual window where we can find useful parallelism will always be limited by the number of instructions dynamically executed between two successive mispredictions. This value is around 70 with our hypotheses and gives an intuitive justification of the limited speedup obtained with respect to the oracle performance.
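As a rough consistency check, and not the computation used to obtain the value above, assume a branch frequency $f_b$ of 14%, as reported in [2], and an average branch misprediction probability $p_m$ of about 10%, ignoring indirect jumps. The expected number of instructions between two successive mispredictions is then:

```latex
N \approx \frac{1}{f_b \, p_m} = \frac{1}{0.14 \times 0.10} \approx 71
```

This is of the same order as the value of about 70 quoted above.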
The analysis has also shown the importance of renaming with memory alias analysis as the key technique to boost performance, and the necessity of keeping the time penalty of recovering a mispredicted branch as low as possible, so as not to nullify all the other optimizations.
Furthermore, we have observed that, to reach a parallelism that is considerably higher than that of a static compilation approach, we need to introduce a remarkable architectural complexity, either in terms of dynamic code optimization techniques (like renaming, dynamic memory alias analysis and combining of arithmetic instructions), or in terms of an increase of the fetch window size.
The most important result of our work is the discovery of a worthwhile amount of parallelism in Prolog, in the same range as imperative languages for the same class of non-numerical applications. This is significant if we consider that Prolog represents a real challenge for instruction-level parallelism extraction, due to the particular characteristics of the language.
Many questions still remain open (not only for Prolog), in particular about what optimizations we can implement in hardware at acceptable costs, and what kinds of interactions between the compiler and the architecture have to be exploited to optimize non-numerical applications.
