Competitive abstract machines for Prolog are usually large, intricate, and incorporate sophisticated optimizations. This makes them difficult to code, optimize, and, especially, maintain and extend. This is partly due to the fact that efficiency considerations make it necessary to use low-level languages in their implementation. Writing the abstract machine (and ancillary code) in a higher-level language can help harness this inherent complexity. In this paper we show how the semantics of basic components of an efficient virtual machine for Prolog can be described using (a variant of) Prolog which retains much of its semantics. These descriptions are then compiled to C and assembled to build a complete bytecode emulator. Thanks to the high level of the language used and its closeness to Prolog, the abstract machine descriptions can be manipulated using standard Prolog compilation and optimization techniques with relative ease. We also show how, by applying program transformations selectively, we obtain abstract machine implementations whose performance can match and even exceed that of highly-tuned, hand-crafted emulators.
Introduction
Designing and implementing competitive "abstract" (or "virtual") machines is not without difficulties. In particular, the extensive code optimizations required for performance make development and, especially, maintenance and further with compilation to native code and just-in-time systems, where a sizable part of the emulator machinery is still there in the form of runtime libraries.
We started with an efficient, WAM-based abstract machine for Prolog initially coded in C and we rewrote parts of it in a variant of Prolog ($\mathcal{L}_r^s$) which we have termed ImProlog and which both extends and restricts Prolog. ImProlog can be translated into very efficient C and at the same time its semantics is close enough to Prolog so as to be able to reuse many compilation techniques (certain analyses, specialization, etc.). This allows obtaining highly optimized and specialized emulators while avoiding obscure, redundant implementations or overuse of C macros. In addition, the combination of this approach with an emulator generator makes it possible to carry out non-trivial optimizations, such as instruction merging, automatically.
A Prolog Variant to Describe Virtual Machines
In this section we will describe our $\mathcal{L}_r^s$ language, ImProlog, and the analysis and code generation techniques used to produce highly efficient code from it.
New Features in the Language
ImProlog adds two features to Prolog that can be modeled as new language constructs (expressible, however, within standard Prolog):
Native types and operations on them: They are opaque ("hidden" types in terms of the Ciao module system and assertion language [4]), and used to reflect in $\mathcal{L}_r^s$ the basic data representations of $\mathcal{L}_c$ and the data types required by the abstract machine (e.g., integers, floats, tagged words, etc.).
Mutable variables (mutvars):
They associate an identifier (which can be any first-order ground term) with an arbitrary term. Two operations are defined over mutable variables:
Access: @MutVar acts as a function which returns the value previously stored in MutVar.
Assignment: MutVar <= Value assigns Value to the identifier MutVar.
The assignment is imperative and non-backtrackable. If MutVar is a free variable then a new, unique identifier is allocated for it. If it is a ground term, it is used as identifier. Its behavior remains unspecified otherwise. Figure 1 shows an example of ImProlog code which defines how to dereference a variable to reach a term. Similarly to the standard algorithm, it follows a reference chain and stops when the value pointed to is the same as the pointing term. Note the use of mutable variables and the operations on native types tagof/2 and tagval/2, which check the tag of a tagged word and retrieve the value of the tagged word, respectively.
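The dereferencing loop just described might look as follows once mapped to C. This is only a sketch under stated assumptions: the tag layout (low three bits), the tag names, the use of heap indices as reference values, and the helper make_tagged are all introduced here for illustration and are not the encoding used by the actual emulator.

```c
#include <stdint.h>

/* Hypothetical tagged-word layout (an assumption of this sketch):
   the low TAG_BITS bits hold the tag, the remaining bits the value. */
typedef uintptr_t tagged_t;

enum { TAG_BITS = 3, TAG_REF = 0, TAG_ATOM = 1, TAG_INT = 2 };

static unsigned tagof(tagged_t t) {
    return (unsigned)(t & ((1u << TAG_BITS) - 1u));
}

static uintptr_t tagval(tagged_t t) {
    return t >> TAG_BITS;
}

static tagged_t make_tagged(unsigned tag, uintptr_t val) {
    return (val << TAG_BITS) | tag;
}

/* Follow a chain of references stored in `heap` (a reference's value is
   a heap index here). Stop at a non-reference, or at a cell that points
   to itself, which stands for an unbound variable. */
static tagged_t deref(const tagged_t *heap, tagged_t t) {
    while (tagof(t) == TAG_REF) {
        tagged_t next = heap[tagval(t)];
        if (next == t)
            break;          /* pointed-to value equals the pointing term */
        t = next;
    }
    return t;
}
```

The local variable t plays the role of a mutable variable whose scope is the predicate body, which is why it can live in a plain C variable.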
The extensions included in ImProlog can easily be defined in full Prolog, as shown in Figure 2 (we assume that new_id/1 returns a new, unique identifier in each call and that a trivial syntactic transformation makes goals @(X, Y) and Y = @X equivalent). As @/2 can be expressed in Prolog, we would not need any additional machinery to write (and run) our virtual machine in a Prolog system and as a Prolog program, should we want to make that experiment. But that would clearly not be without an immense performance penalty (at least without complex optimizations), which is against our initial aims. By making these new constructs natively known by the compiler, and restricting their application to the cases which are useful to describe the virtual machine, we can compile them efficiently time- and memory-wise, and they become easy to map onto low-level primitive constructs of $\mathcal{L}_c$.
Conditions to Ensure Efficient Code Generation
As shown in [5, 6, 7] and other work (see [8] for more references), generation of highly efficient executables from logic programs heavily depends on reducing the computational overhead that supports the extended semantic capabilities of Prolog for the specific cases in which the full power of the language is not needed. This generally requires a wealth of compile-time information regarding types, modes, determinism, non-failure, and other properties of the program. This information is generally inferred by means of static analysis. When such information can be inferred, optimizations are performed, and less efficient code is generated otherwise. However, since our initial goal was to ensure efficiency, we will, instead of allowing the generation of suboptimal code, impose a number of constraints on the ImProlog code that can be written when describing the abstract machine: precisely those that will allow an almost direct (often one-to-one) translation to $\mathcal{L}_c$ code. The compiler will raise an (efficiency-related) error while processing the code that describes the virtual machine and abort its generation if the necessary conditions are not met. This is obviously too drastic a solution for general programs, but a good compromise in our application.
Program analysis combined with program assertions allows the compiler to identify when it is safe (or possible) to generate code based on these constraints. The conditions that must hold after analysis are that code must be deterministic (with optional support for failure continuations, as in if-then-else constructs, but not for full non-determinism), and that no garbage collection, trailing, or boxing should be required. The analyses used to ensure that those restrictions hold are listed in the next section.
Analysis
Following the order in which they are applied in the compiler, the analyses used can be divided into three main groups.
Traditional Prolog Analyses: These include analyses for types, modes, determinism, and non-failure. They are instrumental to decide the best data representation and to detect which pieces of code may require choice points or failure continuations. They are performed using the abstract interpretation-based analyzer in CiaoPP [9]. As CiaoPP was designed with extensibility in mind, knowledge about ImProlog native types and associated operations can be given to CiaoPP via (Ciao) assertions, without having to actually change the analyzer. Assertions are also used to state the types, modes, etc. of externally defined facilities and routines (so that they can be taken into account by the analyzers) and to declare properties to be met at the entry point of each abstract machine instruction, which is typically written as a predicate. This information includes implementation decisions such as the use of short or long native integers, etc.
In addition to assertions, the type of some mutable variables may be further restricted by knowledge about the location they refer to or by type-constraining program calls. For example, mutables for X(i) registers are always bound to elements of type 'tagged'. A typed specification of the assignment operation could be written as follows:
where id_type/2 relates an identifier with the name of its type, and Type(Val) is a higher-order call which states the type of Val. As we will see later, this knowledge helps in unboxing and analysis of mutables. Type analysis can ensure that Type(Val) always holds and it can therefore be harmlessly removed. This additional information makes mapping to C much easier.
Imperative State Analysis: Analysis of the value of mutable variables requires tracking their (imperative) state, which is updated using rules that reflect the actual operational semantics (i.e., sequential execution of OR-alternatives, etc.). Since $\mathcal{L}_r^s$ programs are limited to the deterministic case, the complexity of this analysis is reduced with respect to a more general case. The domains used are precise enough to identify an abstraction of some properties of mutable variables (e.g., whether they represent an X register, a Y register, a heap location, etc.).
Analysis for Unboxing:
This analysis tries to determine whether the type of some variable is known at all points where it is reachable. If so, then there is no need to reserve space for a tag to check its type at runtime. This requires a previous pass to determine the scope of the identifiers for mutable variables in order to establish in which program points they may be accessed. This is also needed in order to assign memory locations at compile time to the mutable variables created within the body of a predicate and which are not allocated on the heap. Since non-determinism is not allowed, and according to the compilation scheme we follow, if a variable name cannot be reached outside the scope of a predicate it can be safely mapped to a (local) C variable. A conservative approximation, which is easier to check and precise enough, is the following: the variable name can be read from, assigned to, and passed as argument to other predicates, but it cannot be assigned to anything else than other local variables.
Code Generation
The information provided by the analysis is used to optimize code generation, especially in order to partially evaluate away whole sections of code (e.g., simplifying conditionals, reducing calls to true/noop, etc.). The algorithm extends that of ciaocc [7] to support ImProlog and also simplifies it in view of the constraints on the code specified in Section 2.2.
Predicates that may or may not fail are mapped to C functions with boolean or void return types, respectively. Generation of code for several clauses or predicates in the same C function and jumping to C labels is also supported (e.g., to transform recursions into loops). Additionally, an interface to internal compiler modules is provided. This makes it possible to invoke instruction compilation from within the emulator generator.
Schematically, compilation distinguishes among control constructs, external C functions, and builtins. Compilation of control is as follows:
-A block G1, G2 is translated to the code for G1 having its success continuation pointing to G2, followed by the code for G2.
-The construct G1 -> G2 ; G3 is compiled into an if-then-else, where G1 is compiled in a context where the failure continuation points to G3. G2 and G3 are compiled in the same context where the whole construct appeared (i.e., success / failure continuations point to where G1 -> G2 ; G3 did).
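To make the two rules above concrete, the following sketch shows one plausible shape of the C emitted for an if-then-else whose condition is itself a block. The predicates is_positive/1 and is_small/1 and the clause classify/2 are hypothetical, and the use of labels for continuations is an assumption of this illustration rather than a transcript of the compiler's output.

```c
#include <stdbool.h>

/* Hypothetical semi-deterministic predicates: they may fail, so they
   are mapped to C functions returning a boolean. */
static bool is_positive(int x) { return x > 0; }
static bool is_small(int x)    { return x < 100; }

/* Sketch of the C for the (hypothetical) clause
     classify(X, R) :- ( is_positive(X), is_small(X) -> R = 1 ; R = 0 ).
   The condition is the block G1 = (is_positive(X), is_small(X)); both of
   its goals share the failure continuation that starts G3. */
static int classify(int x) {
    int r;                                   /* R is unbound: declared here */
    if (!is_positive(x)) goto else_branch;   /* failure continuation of G1 */
    if (!is_small(x))    goto else_branch;   /* same failure continuation  */
    r = 1;                                   /* G2 */
    goto done;                               /* continuation of the construct */
else_branch:
    r = 0;                                   /* G3 */
done:
    return r;
}
```

Because both branches continue at the same point, no choice point or backtracking machinery is needed: failure inside the condition is an ordinary jump.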
For a goal G which calls a C function f(), arguments are compiled (see later) and then f() is called. If the predicate is semi-deterministic, the emitted code checks the return code and, if necessary, a jump to the failure continuation is made. When G corresponds to a built-in, its compilation proceeds as follows:
-true does nothing.
-fail is translated to a jump to the failure continuation.
-A <= B is translated into assignment instructions. If A was not initialized it is declared.
-A = B is handled as follows:
• When A is unbound and B is ground (and also for the symmetrical case), the builtin is translated into the declaration of A plus an assignment statement that moves the value of executing the compiled code corresponding to B to the memory location associated with A.
• When A and B are both ground, the builtin is translated into a comparison of the values resulting from executing the compiled code of both expressions.
Note that although full unification may be assumed during program transformations, it is ultimately reduced to the two cases above. This has to be possible in order to avoid bootstrapping problems: e.g., (full) unification, also defined in ImProlog, should not be based itself on a full unification built-in.
Prolog logical variables and mutable variables are mapped to C variables (which can be global, local, or be passed as function arguments). The type of those C variables is extracted from the declarations and using type inference. Due to the determinism of ImProlog, trailing is unnecessary.
During compilation a symbol table keeps track of the type and memory location (or C variable) associated with each variable. All variables must have an associated type in order to perform unboxing (an error is flagged otherwise), and all types are either native types or mutables whose value is of a native type. For a variable whose associated C type is Tc, a declaration of a variable named V, with C type Vt, is emitted, and the associated memory location is set to Mem, as follows:
-If the variable is not mutable, Vt is Tc and Mem is V.
-If the variable is mutable:
• if its scope is local, then Vt is Tc and Mem is V, or
• Vt is (Tc *) and Mem is *V, otherwise.
For simplicity we assume that goal arguments have been normalized and only variables or @ expressions appear. Compilation of arguments, assuming that the memory location for A is Mem, is as follows:
-@A is translated to Mem (and A must be a mutable variable in this case).
-A is translated to &Mem (if A is mutable), or
-A is translated to Mem otherwise.
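The mapping rules for variables and arguments can be illustrated with a small sketch. The predicates incr/1 and count_twice/1 are hypothetical, as is the arithmetic on the right-hand side of the assignment; the point is only how a local mutable and a mutable received as an argument end up with different C types and memory locations.

```c
/* Hypothetical predicate  incr(C) :- C <= @C + 1.
   The mutable C comes from the caller, so its scope is not local:
   Vt is (int *) and Mem is *c. */
static void incr(int *c) {
    *c = *c + 1;          /* @C reads Mem, C <= ... writes Mem */
}

/* Hypothetical predicate
     count_twice(R) :- C <= 0, incr(C), incr(C), R = @C.
   C is created in this body and never escapes it: Vt is int, Mem is c. */
static int count_twice(void) {
    int c;                /* declaration emitted at the first assignment */
    c = 0;                /* C <= 0 */
    incr(&c);             /* mutable passed as argument: &Mem */
    incr(&c);
    return c;             /* @C: Mem */
}
```

Note that passing the non-local mutable of incr on to another predicate would yield &*c, that is, c itself, so the two rules compose as expected.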
Generating Emulators with ImProlog
We now sketch how WAM instructions can be described using ImProlog and how the full emulator is assembled using a generic abstract machine generator.
Defining WAM Instructions in ImProlog
The definition of every WAM instruction in ImProlog looks just like a regular predicate, and the types, modes, etc. of each of their arguments have to be declared using (Ciao) assertions. Figure 3 shows the definition of an instruction which tries to unify a term and a constant. The pred/1 declaration states that the first argument is a mutable variable and that the second is a tagged word containing a constant. The predicates deref/1 (from Figure 1) and bind/2 (also a defined predicate) are used in the instruction definition. The general compilation process to C, described later, is able to unfold (if so desired) the definition of the predicates called by u_cons/2 and to propagate information from the code inside the instruction in order to optimize the resulting piece of the emulator. After the set of transformations instruction definitions are subject to, the generated C code is of high quality.
Our approach has been to define a reduced number of instructions (50 is a ballpark figure) and let the merging and specialization process (see Section 4) generate all instructions needed to have a competitive emulator. Note that efficient emulators tend to have a large number of instructions (hundreds or even thousands) and many of them are variations (obtained through specialization, merging, etc.) on common blocks [10, 11]. These common blocks are the simple instructions we aim at representing explicitly in ImProlog.
In the experiments we performed (Section 5) the emulator with the largest number of instructions had 199 different opcodes (not counting those which result from padding some other instruction with zeroes to ensure a correct alignment in memory). Starting with a simple instruction set makes it easier to maintain instruction sets and to make sure that they are consistent. Complex instructions are generated automatically in a (by construction) correct way.
Assembling the Emulator
To avoid the burden associated with the coding and $\mathcal{L}_c$-dependent details of the emulator, we chose to use here the framework previously described in [3], where instruction semantics and bytecode representation are independently handled and assembled together using an emulator compiler. Using the terminology of [3] we define the relation between $\mathcal{L}_a$ and $\mathcal{L}_b$ by means of several pieces:
-$\mathcal{M}_{enc}$, which declares how bytecode encodes $\mathcal{L}_a$ instructions and data (e.g., X(0) is encoded as the number 0).
-$\mathcal{M}_{dec}$, which declares how bytecode should be decoded to return the initial instruction format in $\mathcal{L}_a$ (e.g., for an instruction which uses as argument an X register, a 0 means X(0)).
-$\mathcal{M}_{arg}$, which expresses how $\mathcal{L}_a$ expressions are translated to $\mathcal{L}_c$, e.g., how X(0) goes to x[0] (assuming X registers end up in an array).
Higher-level instruction definitions in $\mathcal{L}_r^s$ (which abstract away bytecode representation issues) and program assertions are processed to generate:
-$\mathcal{M}_{def}$, which contains the definition of each instruction in the language $\mathcal{L}_a$ in terms of $\mathcal{L}_c$ code.
-$\mathcal{M}_{ins'}$, which describes the instruction set with opcode numbers and the format of each instruction, i.e., the type in $\mathcal{L}_a$ for each instruction argument.
The instruction set $\mathcal{M}_{ins'}$ is generated by reading the information for each instruction contained in the assertions, interpreting types as $\mathcal{L}_a$ elements, and assigning opcodes to each instruction, either automatically or via user annotations. The definition of $\mathcal{M}_{def}$ is based on cgen, which generates $\mathcal{L}_c$ code from $\mathcal{L}_r^s$ as defined in Figure 4. In this figure, memstorage stands for a look-up table which relates each $\mathcal{L}_r^s$-level variable $arg_i$ with its type and location in $\mathcal{L}_c$, $a_i$. The pseudo-instruction failure_ins takes care of causing a failure. Some $\mathcal{L}_a$ instructions are not supposed to fail (e.g., pushing a choicepoint), while others, such as performing a unification, can fail. In the former case cgen is able to discard the else part and simplify the then part; in the latter case, jumps to failure_ins are inserted in the appropriate places.
The components $\mathcal{M}_{enc}$ and $\mathcal{M}_{ins'}$ are used to generate the $\mathcal{L}_a$ to $\mathcal{L}_b$ compiler back-end. The rest of the components and $\mathcal{M}_{ins'}$ are used by the emulator compiler. The emulator has to understand $\mathcal{L}_b$ and therefore it has to agree in its format with what the compiler back-end emits. Note that the overall emulator structure is largely independent of the code of the instructions. A summarized definition of the emulator compiler and how it uses the different pieces in $\mathcal{M}$ can be found in Figure 4. The scheme of the generated emulator code is somewhat similar to what the Janus compilation scheme [12] produces, although in the Janus case the continuation to every call (in the source code) is known statically. The compiler can therefore generate a direct jump to a fixed label, while in our case the continuation can in principle be any program point which comes from the bytecode program itself and is not known until the emulator is being executed.
Example 1. Code for a specialized instruction. From the instruction in Figure 3, which unifies a term living in some variable with a constant, we can derive a specialized version in which the term is supposed to live in an X register. The corresponding declaration states precisely that, assigns the (symbolic) name ux_cons to the new instruction, and specifies that the first argument lives in an X register. The declaration:
:- ins_entry(97, ux_cons).
indicates that the emulator has an entry with opcode 97 for that instruction. Figure 5 shows the code generated for the instruction (right) and a fragment of the emulator generated by the emulator compiler in Figure 4.
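As a rough picture of what such an emulator entry could look like in standard C, consider the sketch below. It is not the code of Figure 5: the bytecode cell type, the UNBOUND marker, the OP_HALT opcode, and the simplified treatment of failure (the loop just returns false instead of backtracking) are assumptions made for this illustration. Only the opcode 97 and the name ux_cons come from the text.

```c
#include <stdbool.h>
#include <stdint.h>

typedef intptr_t tagged_t;   /* hypothetical: tags are not modelled here */
typedef uint16_t bcode_t;    /* hypothetical bytecode cell */

enum { OP_HALT = 0, OP_UX_CONS = 97 };   /* 97 as in ins_entry(97, ux_cons) */

#define UNBOUND ((tagged_t)-1)   /* hypothetical marker for an unbound X register */

/* Switch-based emulator loop. Returns true if OP_HALT is reached and
   false if some instruction fails. */
static bool emulate(const bcode_t *p, tagged_t *x) {
    for (;;) {
        switch (*p) {
        case OP_UX_CONS: {
            tagged_t *reg = &x[p[1]];          /* X(i) is decoded to x[i] */
            tagged_t c = (tagged_t)p[2];       /* the constant operand */
            if (*reg == UNBOUND)
                *reg = c;                      /* bind the register */
            else if (*reg != c)
                goto failure_ins;              /* unification fails */
            p += 3;                            /* continuation: next instruction */
            break;
        }
        case OP_HALT:
            return true;
        default:
            goto failure_ins;
        }
    }
failure_ins:
    return false;
}
```

The continuation of each instruction is whatever opcode follows in the bytecode, which is why dispatch goes through the switch rather than through a jump to a fixed label.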
We want to note that we deliberately stay within standard C: the use of C extensions (such as storing labels in variables, which are provided by gcc and used, for example, in [13, 14]) is outside the scope of this paper.
Automatic Generation of Abstract Machine Variations
Substantial work has been devoted to abstract machine generation strategies such as, e.g., [10, 11] , which explore different design variations with the objective of achieving highly optimized emulators. By making the semantics of the abstract machine instructions explicit in a language like ImProlog, which can be easily processed automatically, such variations can be formulated mostly as automatic transformations. Adding new transformation rules and testing them together with the existing ones becomes a relatively easy task.
We will briefly describe some of these transformations, which will be experimentally evaluated in Section 5. Each transformation is identified by a two-letter code. We make a distinction between transformations which change the instruction set (e.g., creating new instructions) and those which only affect the way code is generated.
Instruction Set Transformations
New instructions are currently synthesized from existing ones by explicitly unfolding shared pieces of code, by merging instructions (different or not), and by performing specialization for some operand values, types, or locations.
Instruction Merging [om]:
Merging generates larger instructions from sequences of smaller ones, and aims at saving fetch cycles at the expense of an increased switch size. This technique has been used extensively in high-performance systems (e.g., Quintus Prolog, SICStus, Yap, etc.). The performance of different combinations has been studied empirically [10], but in that paper new instructions were generated by hand, although deciding which instructions had to be created was done by means of profiling. In our framework all that is needed in order to emit code for a merged instruction is a single declaration. Merging is done automatically through code unfolding based on the definitions of the component instructions. This makes it possible to define a set of optimal user rules for merging.
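The trade-off behind merging (fewer fetches, a larger switch) can be seen in a toy emulator. The instructions MOVE, INC, and MOVE_INC below are hypothetical and far simpler than WAM instructions; the merged body is written by hand here as the unfolding of its two components, which is the step the framework performs automatically.

```c
#include <stdint.h>

typedef uint16_t bcode_t;    /* hypothetical bytecode cell */

enum { OP_HALT, OP_MOVE, OP_INC, OP_MOVE_INC };   /* hypothetical opcodes */

/* Runs a program over the registers x and returns how many opcode
   fetches (switch dispatches) were needed. */
static int run(const bcode_t *p, int *x) {
    int fetches = 0;
    for (;;) {
        fetches++;
        switch (*p) {
        case OP_MOVE:                 /* move X(a) to X(b) */
            x[p[2]] = x[p[1]];
            p += 3;
            break;
        case OP_INC:                  /* increment X(a) */
            x[p[1]]++;
            p += 2;
            break;
        case OP_MOVE_INC:             /* merged: MOVE body followed by INC body */
            x[p[2]] = x[p[1]];
            x[p[3]]++;
            p += 4;
            break;
        default:                      /* OP_HALT (or an unknown opcode) */
            return fetches;
        }
    }
}
```

Running the sequence MOVE, INC, HALT takes three fetches, while the equivalent MOVE_INC, HALT takes two and leaves the registers in the same state.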
Instructions with a Variable Number of Operands [vo]:
For some instruction families a number of instructions (e.g., unify with void) can be collapsed into a single instruction with a variable number of operands. Code generation emits a loop whose internal iteration code comes directly from the single instruction definition.
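A minimal sketch of this idea, assuming a write-mode unify_void that pushes one fresh unbound variable, represented here as a self-referencing heap cell (both the representation and the function names are assumptions of this example):

```c
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t cell_t;    /* hypothetical heap cell */

/* Body of the single instruction: push one unbound variable, i.e., a
   cell that points to itself, and return the new heap top. */
static cell_t *unify_void(cell_t *h) {
    *h = (cell_t)h;
    return h + 1;
}

/* Variable-operand version: one instruction with operand n replaces a
   sequence of n unify_void instructions. The loop body is exactly the
   single-instruction body. */
static cell_t *unify_void_n(cell_t *h, size_t n) {
    for (size_t i = 0; i < n; i++)
        h = unify_void(h);
    return h;
}
```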
Instructions for Built-ins [ib]:
Calling external library code or built-ins often requires ad-hoc instructions (to make the appropriate parameter conversion, etc.). A single family of instructions that call a foreign C function can be used to do that, and this is the default option. The same instruction can then be specialized for a predefined set of built-ins, thus generating a special instruction set that includes faster calls to, e.g., arithmetic operations.
Transformations of Instruction Code
Some transformations do not create new instructions, but perform instead different optimizations on already existing instructions by manipulating the code or choosing alternative translation schemes.
Unfolding Rules [ur]:
Simple predicates are unfolded throughout the code before compilation. In the case of instruction merging, unfolding is used to merge the code of two or more instructions into a single piece of code. In some cases unfolding can be limited so that common pieces of instructions can be shared. This transformation enables or disables a set of predefined unfolding rules.
Different Tag Switching Schemes [ts]:
Tags are used to detect dynamically the type of basic data (atom, structure, number, variable, etc.) contained in a machine word, so that different actions can be taken depending on this type. The corresponding tag switching code is a heavily-used operation which is worth optimizing as much as possible. This option generates either an automatic C switch (when enabled) or a set of predefined switch patterns based on tag encodings (when disabled).
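The two alternatives can be contrasted on a toy encoding. The two-bit tag layout below, in which the high tag bit marks atomic data, is an assumption chosen so that a hand-written pattern has something to exploit; the actual tag encodings and switch patterns of the emulator are not shown in the text.

```c
#include <stdint.h>

typedef uintptr_t tagged_t;

/* Hypothetical 2-bit tag encoding: bit 1 set means "atomic". */
enum { TAG_REF = 0, TAG_STR = 1, TAG_ATOM = 2, TAG_NUM = 3 };
enum { K_VAR, K_COMPOUND, K_ATOMIC };

static unsigned tagof(tagged_t t) { return (unsigned)(t & 3u); }

/* Scheme 1: a plain C switch; the C compiler chooses how to dispatch. */
static int kind_switch(tagged_t t) {
    switch (tagof(t)) {
    case TAG_REF:  return K_VAR;
    case TAG_STR:  return K_COMPOUND;
    case TAG_ATOM:
    case TAG_NUM:  return K_ATOMIC;
    }
    return K_ATOMIC;   /* not reached: tagof yields 0..3 */
}

/* Scheme 2: a predefined pattern based on the encoding; a single bit
   test covers both atomic tags. */
static int kind_pattern(tagged_t t) {
    if (t & 2u)
        return K_ATOMIC;
    return (t & 1u) ? K_COMPOUND : K_VAR;
}
```

Both functions compute the same classification; which one is faster depends on the C compiler and the target architecture, which is why it is offered as a generation option.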
Connected Continuations [je]:
Tests (or other actions) are sometimes unnecessarily repeated because they appear at the end of an operation and at the beginning of the next one. They are redundant at this point, because they are bound to fail or succeed depending on their behavior in the previous operation. For example, in the fragment deref(T), (ref(T) -> A ; B), T is checked to test whether it is a reference just before exiting deref/1. Code can be generated that jumps directly to the implementation of A or B depending on the result of this test. This option enables or disables the optimization.
Read/Write Mode Specialization [rw]:
WAM-based implementations sometimes use a flag to test whether heap structures (i.e., the memory representation of functors) are being read (matched against) or written (created). According to the value of this flag, several instructions adapt their behavior with an if-then-else. A common optimization is to partially evaluate the switch inside the emulator loop to generate two different, parallel switch structures, one for each of the read/write possibilities. We can generate instruction sets (and emulators) where this optimization has been turned on or off.
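The difference between the two schemes can be sketched with a toy instruction set. The opcodes READ and WRITE (which merely set the mode, standing in for instructions such as get_structure) and the simplified UNIFY_CONST are hypothetical; the sketch only shows how partially evaluating the mode flag yields two parallel switch structures.

```c
#include <stdbool.h>

enum { OP_HALT, OP_READ, OP_WRITE, OP_UNIFY_CONST };   /* hypothetical opcodes */

/* Scheme 1: one switch; the mode flag is tested inside the instruction. */
static bool run_flag(const int *p, int *s) {
    bool write = false;
    for (;;) switch (*p) {
        case OP_READ:  write = false; p++; break;
        case OP_WRITE: write = true;  p++; break;
        case OP_UNIFY_CONST:
            if (write) *s = p[1];                  /* create the argument */
            else if (*s != p[1]) return false;     /* match against it */
            s++; p += 2; break;
        default: return *p == OP_HALT;
    }
}

/* Scheme 2: the flag is evaluated away, giving one switch per mode.
   Changing mode is a jump between the two switch structures. */
static bool run_two_switches(const int *p, int *s) {
read_mode:
    for (;;) switch (*p) {
        case OP_READ:  p++; break;
        case OP_WRITE: p++; goto write_mode;
        case OP_UNIFY_CONST:
            if (*s++ != p[1]) return false;
            p += 2; break;
        default: return *p == OP_HALT;
    }
write_mode:
    for (;;) switch (*p) {
        case OP_READ:  p++; goto read_mode;
        case OP_WRITE: p++; break;
        case OP_UNIFY_CONST:
            *s++ = p[1];
            p += 2; break;
        default: return *p == OP_HALT;
    }
}
```

The second version removes a test from every mode-dependent instruction, but every such instruction now appears twice in the emulator code.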
Experimental Evaluation
We will report here on experimental data regarding the performance which was achieved on a set of benchmarks by a collection of emulators, all of them automatically generated through different combinations of options. In particular, by using all compatible possibilities for the transformation and generation options given in Section 4 we generated 96 different emulators (instead of $2^7 = 128$, as not all options are independent; for example, vo needs om to be performed).
This bears a close relationship with [11], but here we are not changing the internal data structure representation (and of course our instructions are all coded in ImProlog). It is also related to the experiment reported in [10], but the tests we perform are more extensive and cover more variations. Additionally, [10] starts off by being selective about the instructions to merge; this is a point we want to address in the future by using instruction-level profiling.
Our initial point was a "bare" instruction set comprising the "common basic blocks" of a relatively efficient abstract machine (the "stock" abstract machine of Ciao 1.10, itself an independent branch off the original SICStus Prolog 0.5/0.7 emulator, and with performance currently just below modern SICStus versions). Figures 6 and 7 summarize overall results for the experiments, as the data gathered (96 emulators × 13 benchmarks = 1248 performance figures) is too large to be examined in detail here. In those figures we plot, for three different cases, the resulting speed of every emulator using a dot per emulator. Every benchmark was run several times on each emulator to arrive at meaningful time measures, on a Linux machine with a Pentium 4 processor and using gcc 3.4 as C compiler. Although most of the benchmarks we used are relatively well known, we include a brief description in [15].
In order to encode emulator generation options in the corresponding dots, each available option in Sections 4.1 and 4.2 is assigned a bit in a binary number (a '1' means activating the option and a '0' means deactivating it). Every value in the y axis of the figures corresponds to a combination of the three options in Section 4.1, but only 6 combinations are plotted due to dependencies among options. Options in Section 4.2, which correspond to transformations in the way code is generated, are represented with four bits which are encoded as 16 different dot shapes (shown in each figure). Every combination of emulator generation options is thus assigned a different 7-bit number and a different dot shape and location. The x coordinate represents the relative speed w.r.t. the hand-coded emulator currently in Ciao 1.10, which is assigned speedup 1.0.
Of course, different selections for the bits assigned to the y coordinate and to the dot shapes would yield a different picture. However, our selection seems intuitively appropriate, as it addresses separately two different families of transformations. Indeed, Figure 6, which uses the geometric average of all benchmarks to determine the overall performance, shows a quite well defined clustering around eight centers. Although it is not immediate from the picture (it has to be "decoded"), poorer speedups come from not activating some instruction creation options (which, for the stock emulator, really means deactivating them, since merging and specialization was made by hand quite some time ago, and the resulting instructions are already part of the emulator).
As a side note, while this figure portrays an average behavior, there were benchmarks whose results actually tracked this average behavior quite faithfully. An example is the doubly recursive Fibonacci, which is often disregarded as unrealistic but which, for this particular experiment, turns out to predict very well the (geometric) average behavior of all benchmarks. All in all, this picture (or, rather, the method which led to it) tries to reveal families of optimization options which give similar speed by showing dot clusters. Interestingly enough, once a set of generation options for $\mathcal{L}_b$ is fixed, the changes in the generation of $\mathcal{L}_c$ have (in general, see below) a relatively low impact. The general question of which options should be used for the "stock" emulator to be offered to general users is answered by selecting a set of options somewhere in the topmost, rightmost cluster.
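The counts quoted in the text (6 plotted combinations of the Section 4.1 options, 96 emulators overall) follow from the single dependency that is stated explicitly, namely that vo needs om. The sketch below enumerates the 7-bit option numbers and discards the incompatible ones; the particular bit positions and the enumerator names are assumptions of this example, and OPT_CC simply stands for the connected continuations option.

```c
#include <stdbool.h>

/* One bit per option. The bit positions are an assumption of this sketch. */
enum {
    OPT_OM = 1 << 0, OPT_VO = 1 << 1, OPT_IB = 1 << 2,   /* Section 4.1 */
    OPT_UR = 1 << 3, OPT_TS = 1 << 4, OPT_CC = 1 << 5,   /* Section 4.2 */
    OPT_RW = 1 << 6
};

/* The dependency stated in the text: vo can only be used together with om. */
static bool compatible(unsigned opts) {
    return !(opts & OPT_VO) || (opts & OPT_OM);
}

/* Counts the compatible combinations of the lowest nbits option bits. */
static int count_valid(unsigned nbits) {
    int n = 0;
    for (unsigned opts = 0; opts < (1u << nbits); opts++)
        if (compatible(opts))
            n++;
    return n;
}
```

With the three instruction set options alone, count_valid(3) gives 6; with all seven options, count_valid(7) gives 6 × 16 = 96.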
In any case, there are combinations of code generation options which achieve a speedup of 1.05, on average. While this may appear modest, consider that by starting with a simple instruction set (coded in ImProlog!) and applying systematically a set of transformation and code generation options, we have managed to match (and exceed) the time performance (memory performance was untouched) of an emulator which was hand-coded by very proficient programmers, and in which decisions were thoroughly tested along several years. Moreover, the transformation rules we have applied in our case are of course not the only ones, and we look forward to performing a more aggressive merging guided by profiling (merging is right now limited in depth to avoid a combinatorial explosion in the number of instructions). Similar work, with more emphasis on the production of languages for microprocessors, is presented in [16], where a set of benchmarks is used to guide the (constrained) synthesis of such a set of instructions.
Figure 7 shows two cases of particular interest. The plot for queens11 is a typical case which departs from the average behavior but which still resembles it. As a relevant difference, a much better speedup (around 1.25) is achieved with some combinations of flags (which of course means that some benchmarks do not get any speedup). On the other hand, the plot for crypt presents a completely different landscape: a plot where variations on the code generation scheme are as relevant as variations on the bytecode itself. This points to the need to find other clustering arrangements which shed some light on the interactions among different emulator code and bytecode generation schemes. Our experiments, however, lead us to think that in some cases the behavior tends to be almost chaotic, as the lack of registers in the target architecture (i86) makes optimization a difficult task for the C compiler.
This is supported by similar experiments on a PowerPC architecture, which has more general purpose registers, and in which the results are notably more stable across benchmarks. The overall conclusions for the best options and speedups remain roughly the same, only with less variance.
Table 1 tries to isolate the effects of separate options. It does so by listing, for each benchmark, including the geometric average, which options produced the best and the worst results time-wise. While there is no obvious conclusion, instruction merging is a clear winner, probably followed by having a variable number of operands, and then by specialized calls to built-ins. The first and second options save fetch cycles, while the third one saves processing time in general.
It can come as a surprise that using separate switches for read/write modes, instead of checking the mode in every instruction which needs to do so, does not seem to bring any advantage. A similar result was already observed in [11], and was attributed to modern architectures performing branch prediction and speculative work with redundant units. Therefore, short if-then-else statements might get both branches executed in parallel with the evaluation of the condition. Besides, implementing read/write modes with two switches basically doubles the size of the core of the emulator. A similar size growth happens when extensive merging is performed. In both cases a side effect is that of an increased cache miss ratio and the corresponding reduced performance.
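Table 1. Options which gave the best and worst performance for each benchmark (only the fragment "Best performance: vo ib om" survives).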
We have designed a language (ImProlog, a variation of Prolog with some imperative features) and used it to describe the semantics of instructions of a bytecode interpreter. ImProlog, with the proposed constraints, makes it possible both to perform non-trivial transformations (e.g., partial evaluation, unfolding, merging, etc.) and to generate efficient low-level code (using the cgen compiler) for each of the emulator instructions. Different transformations and code generation options can be applied, which result in different grades of optimization / specialization and different bytecode languages.
The low-level code for each instruction and the definition of the bytecode can be taken as input by a previously developed emulator generator to assemble full, high-quality emulators. Since the process of generating instruction code and bytecode format is automatic, we were able to produce and test different versions thereof to which several combinations of code generation options were applied.
We have also studied how these combinations perform with a series of benchmarks in order to find, e.g., what is the "best" average solution and how independent coding rules affect the overall speed. One of the cases obtained in this way is the regular emulator we started with (which was decomposed to break complex instructions into basic blocks). However, we also found out that it is possible to outperform it by using some code patterns and optimizations not explored in the initial emulator and, what is more important, starting from abstract machine definitions written in ImProlog. We intend to continue this line of exploration of improved abstract machines and to incorporate them in the standard Ciao distributions.
