Most Prolog machines have been based on specialized architectures. Our goal is to start with a general purpose architecture and determine a minimal set of extensions for high performance Prolog execution. We have developed both the architecture and optimizing compiler simultaneously, drawing on results of previous implementations. We find that most Prolog specific operations can be done satisfactorily in software; however, there is a crucial set of features that the architecture must support to achieve the best Prolog performance. In this paper, the costs and benefits of special architectural features and instructions are analyzed. In addition, we study the relationship between the strength of compiler optimization and the benefit of specialized hardware. We demonstrate that our base architecture can be extended to include explicit support for Prolog with a modest increase in chip area (13%) and yet attain a significant performance benefit (60 to 70%). Experiments using optimized code that approximates the output of future optimizing compilers indicate that special hardware support can still provide a performance benefit of 30 to 35%. The microprocessor described here, the VLSI-BAM, has been fabricated and incorporated into a working test system.

This paper is a revised and expanded version of "Fast Prolog with an Extended General
INTRODUCTION
Logic programming in general and Prolog [27] in particular have become popular for rapid software prototyping, natural language translation, and expert system programming. Prolog's use of dynamic typing, backtracking, and unification places heavy computational demands on general purpose computers. In an attempt to achieve ever higher performance, several special purpose architectures have been proposed and built. Early Prolog architectures [37] were microcoded interpreters. Because no compilation was done, performance was disappointing. Higher performance processors [2, 9, 20, 25] have since been based on the Warren Abstract Machine (WAM) [36]. Their instruction sets were derived from the WAM to support execution of Prolog programs. These processors are special purpose, microcoded engines that depend on parallel execution of operations within each relatively coarse-grained instruction for high performance. Initial designs implemented only the instructions that supported the WAM and depended on a host processor for non-WAM computations. To support Prolog built-ins (primitive Prolog operations provided by the system) and system I/O, newer designs incorporate general purpose instructions to minimize dependence on a host. Alternatively, the use of a simple, non-WAM instruction set better supports compiler optimization. Several such special purpose reduced instruction set architectures have been proposed for logic programming [11, 18, 19, 24]. These architectures include primitives that support the use of tagged data, pointer dereference, and multi-way branches. Our hypothesis is that providing support for both compiler optimization and low-level operations can best be accomplished by extending a simple general purpose architecture to support Prolog without compromising the general purpose performance.
The performance improvements of recent general purpose architectures over older architectures can be traced to research in which both the compiler and architecture were developed together [13, 16, 22]. Architectural features that cannot be used by the compiler or that cannot demonstrate performance improvement are not included. Likewise, architectural features are added that support often used primitive operations. We have adopted this approach from the beginning of our project.
It has been conjectured that commercial special purpose symbolic processing architectures are doomed because they are not commodity items, and consequently, economics prevent them from staying on the leading edge of implementation technology. However, if the architectural features necessary to improve symbolic performance are modest and do not interfere with the general purpose architecture, then as more chip area becomes available, future implementations of general purpose processors can deliver high performance symbolic computing in a standard product. We hope that our work is a step towards this result.
The VLSI-BAM design was begun at the University of California, Berkeley as part of the Aquarius project. The Aquarius project group relocated to the University of Southern California in 1989, where the VLSI-BAM chip and a custom cache board were completed, fabricated, and tested.
This paper presents the design of a processor based on the Berkeley Abstract Machine (BAM) and motivates its design with the results of our preliminary studies. We also present a discussion of the optimizing compiler, a cost/benefit analysis of the architectural features, and the simulated performance. Section 2 summarizes the processor architecture and hardware implementation. Section 3 presents the instruction set along with the results of our studies which motivated instruction selection. The compilation of Prolog programs is described in Section 4, and in Section 5 we present a cost/benefit analysis of the special features and instructions. Section 6 gives the performance results. The final section concludes with a summary of our results.
PROCESSOR ARCHITECTURE AND IMPLEMENTATION
The VLSI-BAM processor is a general purpose, single chip, pipelined processor with extensions to support Prolog execution (Figure 2.1). Both data and instruction words are 32 bits, and most instructions execute in a single cycle. The main features for Prolog are tag manipulation (integrated into arithmetic and the memory system), a double-word data port to memory, special branch on tag support, and several instructions to support our execution model for Prolog. The architecture is presented in detail along with our motivations in the subsections below. Retaining a core general purpose architecture imposes constraints on the symbolic extensions. For example, the processor should be able to handle tagged data items as single entities, with no special treatment for the tags. We discuss the ramifications of this on the word format and the virtual memory system. Then we present the architecture's register structure and memory interface. Finally, we present some details of the implementation such as the pipeline structure and our mechanism for multiple-cycle instructions.
Word Format
Prolog does not require the user to specify the type of a data item. This requires that run-time type checking be implemented by adding a tag to each data item to encode the type of that item. Many Prolog processors handle the tag and value fields separately. This approach does not satisfy our goal of integrating tagging into a general purpose architecture. Instead, we use a standard 32-bit word length and place the tag in the most significant four bits of the word. Arithmetic computations and addresses, however, use the entire 32-bit word, so general purpose computations are not affected by Prolog's use of tags. Tag values fixed by the hardware are those for non-negative integers (0000) and negative integers (1111). This selection of tags for integers is a common technique used by Lisp implementations on general purpose machines [26]. We have also fixed the tag value for variable pointers (tvar = 0001) to increase the number of bits available for branch displacements in several Prolog specific instructions. All other tag values are software defined. Our Prolog implementation uses tags similar to those of the WAM.
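The word format can be illustrated with a minimal Python sketch. Only the tag position (the top four bits) and the three hardware-fixed tag values come from the text; the helper names are hypothetical. The sketch shows why the integer tags need no separate encoding step: a two's complement integer in the 28-bit range already carries 0000 or 1111 in its top four bits.

```python
# Illustrative model of the 32-bit tagged word; helper names are hypothetical.
TAG_SHIFT = 28            # tag occupies the most significant four bits
WORD_MASK = 0xFFFFFFFF

TAG_NONNEG_INT = 0b0000   # fixed by hardware (from the text)
TAG_NEG_INT = 0b1111      # fixed by hardware (from the text)
TAG_VAR = 0b0001          # tvar, fixed by hardware (from the text)


def to_word(n):
    """Encode a Python int as a 32-bit two's complement word."""
    return n & WORD_MASK


def tag_of(word):
    """Return the four-bit tag held in the top bits of a word."""
    return (word & WORD_MASK) >> TAG_SHIFT


def is_integer(word):
    """True when the word carries one of the two integer tags."""
    return tag_of(word) in (TAG_NONNEG_INT, TAG_NEG_INT)
```

For example, `to_word(-5)` is `0xFFFFFFFB`, whose tag is 1111, while `to_word(2**28)` has tag 0001 and is therefore no longer a valid tagged integer.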
Segmented Virtual Addresses
One consequence of using both the tag and value as an address is that each data type is mapped into its own area of virtual memory; however, Prolog's execution model places data with several data types on the same stack or heap. One possible solution is to mask (zero) the tag bits of the address before using it to access memory. This solution is not satisfactory when applied to applications not using tags (for example, C programs). To avoid this difficulty, we have introduced a segment table that maps the most significant six bits of an address to a twelve-bit value (Figure 2.2). An address before mapping is referred to as a short virtual address (SVA), and the 38-bit address resulting from the mapping is referred to as a long virtual address (LVA). This memory segmentation scheme is similar to the segmentation used in the 801 processor [7]. The 801 uses segmentation to extend the virtual address space; however, our primary motivation for using segmentation is to allow multiple data types to be mapped to the same LVA segment. Mapping two bits in addition to the tag bits allows the use of several memory areas for a given data type, each area using a different mapping. At one extreme all data types can be mapped to the same LVA segment (this is equivalent to masking the most significant six address bits). At the other extreme, all SVA segments can be mapped to distinct LVA segments. In our current implementation of Prolog, variable, list, and structure pointers are mapped to the same LVA segment, whereas the environment/choice point stack, the trail stack, and the symbol table are mapped to separate segments.
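The mapping from SVA to LVA can be sketched in Python as follows. The bit widths (six index bits, a twelve-bit segment value, and hence a 26-bit offset and a 38-bit result) follow from the text; the function name, the table contents, and the list and structure tag values are assumptions chosen only for illustration.

```python
# Illustrative SVA -> LVA translation; table contents are hypothetical.
OFFSET_BITS = 26                       # 32-bit SVA minus 6 index bits
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def sva_to_lva(sva, segment_table):
    """Replace the top six bits of a 32-bit SVA with a 12-bit segment value."""
    index = (sva >> OFFSET_BITS) & 0x3F          # 64 table entries
    segment = segment_table[index] & 0xFFF       # twelve-bit mapped value
    return (segment << OFFSET_BITS) | (sva & OFFSET_MASK)


# Start with every SVA segment mapped to a distinct LVA segment.
example_table = [0x100 + i for i in range(64)]

# Assumed tags: 0001 is tvar (from the text); 0010 and 0011 stand in for
# software-defined list and structure tags. With the two extra index bits
# set to zero, the table index is simply tag << 2.
for tag in (0b0001, 0b0010, 0b0011):
    example_table[tag << 2] = 0x005              # shared LVA segment
```

With this table, `sva_to_lva(0x10000040, example_table)` and `sva_to_lva(0x20000040, example_table)` give the same LVA, so a variable pointer and a list pointer with the same offset refer to the same location.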
Another use of segmentation is for sharing data in a multiprocessor system. In this case the 38-bit LVA is used as the global virtual address and sharing of data by cooperating processes is done at the segment level.
Memory Interface
The high memory bandwidth requirement of Prolog dictates separate instruction and data buses (Figure 2.1). In addition, we have expanded the data bus to double-word width. A double-word data bus is motivated by Carlson's study [5] of the ar- His results show that the best performance/cost tradeoff occurs when the architecture provides a double-word port to data memory. A double-word memory port improves the performance of term creation and speeds block transfers to and from environments and choice points. Some previous Prolog processors support fast choice point creation and restoration through the use of specialized buffers or shadow registers [9, 24]. Such hardware solutions are costly and do not fit our goal of maintaining a general purpose architecture. Instead, we rely on double-word memory operations and on compiler optimization to minimize shallow backtracking [33].
Our processor design is tightly coupled with the cache design. We decided against on-chip caches since, in our case, it is more appropriate to use processor chip area for architectural features and use fast, dense static RAM chips for large caches. To speed cache accesses, however, protection violation and consistency checks and address tag comparison are done on-chip. More details about the cache interface are given in [6].
Base Architecture
All programmer visible processor registers are accessed as two sets of 32 registers: the general purpose register set and the special register set. The general purpose registers are used for procedure argument passing, temporary storage, and as stack pointers. The only general purpose register with a preassigned use is the continuation pointer (r31). This register is implicitly set to the return address by the call instruction. All other uses of the general purpose registers are defined by software convention.
The special registers provide access to the processor status word (PSW), program counter (PC), partial product/quotient register (PQ), segment mapping table, cache interface configuration registers, and a set of fifteen extra registers (s0-s14).
Implementation Details
The execution pipeline consists of five stages (Figure 2.3). All instructions that modify registers or memory do so in the last pipeline stage. Register bypassing forwards results of the ALU and memory read pipeline stages to instructions in earlier stages of the pipeline. Pipeline stalls are provided in hardware to ensure correct execution of both load and store operations. If data from a load instruction is used by the next instruction, then the next instruction is delayed by a cycle. Also, memory instructions immediately following a store are delayed by a cycle. Store instructions require access to the cache during both the M and W pipeline stages: M to provide the address to determine the cache hit/miss and W to provide the data to be stored.
FIGURE 2.3. VLSI-BAM Processor Execution Pipeline (stages: I, instruction fetch; R, register read; A, ALU; M, memory read; W, register/memory write)

All instructions are 32 bits with a 6-bit opcode and fixed source register format. Instruction execution is controlled by an opcode pipeline that operates in parallel with the execution pipeline. Each stage of the opcode pipe decodes the opcode associated with that stage of the execution pipeline. Multi-cycle instructions and conditional instructions are implemented using "internal opcodes" [21]. The internal opcodes of multi-cycle instructions are fetched from a PLA and inserted into the opcode pipeline. When an internal opcode is inserted, no instruction is fetched during that cycle. Thus a single external opcode can invoke a sequence of internal opcodes to provide for often used complex operations (for example, pointer dereferencing). Internal opcode insertion is also used for atomic synchronization operations, for pipeline interlock delays, and for trap and interrupt handling. Conditional execution is implemented by conditionally replacing an opcode in the opcode pipe with an internal opcode. Our design uses 55 external opcodes and 24 internal opcodes; of the internal opcodes, nine are related to traps (trap, rft), 13 implement multi-cycle instructions (dref, stx, std, pushd, las, jmpr), and two implement conditional operation instructions (uni, pusht).
"Fast tag logic" is used to implement single-cycle tag-compare-and-branch instructions. The fast tag logic consists of an extra register file that duplicates the tag portion of the general purpose register file and special tag comparison logic that allows quick tag comparison and branch. Previous Prolog processors [9] have also duplicated tag bits to accelerate branching on tag value.
The general purpose register file has two read ports (one single-word and one double-word) and two write ports (both single-word). This port structure provides the bandwidth required by single-cycle double-word memory accesses without greatly increasing the complexity of the register file design.

FIGURE 2.4. Photomicrograph of VLSI-BAM Chip Die

Figure 2.4 is a photomicrograph of the VLSI-BAM chip. The layout consists of three rows of functional units. In the top row, the left-most quarter contains the register index latches, register bypass logic, fast tag comparison logic and fast tag register file. The middle half of the row contains instruction address circuitry that consists of three branch offset adders (two for the three-way branches, swt and swb, and one for btg), an incrementer, multiple latches that hold the PC value of each pipeline stage, latches for trap handling, and the partial product/quotient register. The right-most quarter of the top row contains the pipeline control logic that consists of decode PLAs and latches for holding the opcode of each pipeline stage.
The left three quarters of the middle row contain the main data path. The general purpose register file is on the left end, the barrel shifter and ALU are toward the right end, and pipeline latches lie in between. The right quarter of the middle row contains condition code logic, logic for trap detection and prioritization, and random logic for pipeline control signals.
At the left end of the bottom row is the special purpose register file. The middle and right end of the row contain the cache interface logic that consists of (moving left to right) the protection specification latches, protection comparison logic, tag comparison logic, instruction and data address latches, and the segment table.
The design is pad limited, so no extra effort was expended to compact the layout beyond what was necessary to fit into the pad frame.
INSTRUCTION SET
In this section we present the VLSI-BAM instruction set. The instructions are divided into three groups: general purpose, Prolog inspired general purpose, and Prolog specific. The general purpose instructions are those which can be found in typical processors. The Prolog inspired instructions are those which are not often present in general purpose processors, but which can still be used for general computation. The remaining instructions are tailored specifically to the requirements of Prolog execution.
The general purpose instructions are summarized in Table 3.1. It is important to point out that all arithmetic and logic operations operate on the full 32-bit word. Also, conditional branches consist of separate compare and branch instructions. Compare instructions set or clear the TF (true-false) condition code bit, and the branch instructions take the branch when TF is set. Branches, jumps, and calls are delayed by one instruction. The instruction in a branch delay slot can always be executed (bt), annulled (turned into a nop) if the branch is taken (btat), or annulled if the branch is not taken (btan). Both directions of annulling are included because Prolog often favors annulling when the branch is taken (for example, branching out of straight-line code to the unification failure routine), whereas conditional branches to the top of a loop (common in procedural languages) favor annulling when the branch is not taken.
Special load and store instructions (ldl and stu) activate a signal sent to the cache (external to the VLSI-BAM chip) which can be used by the cache controller logic to implement the "lock" and "unlock" multiprocessor synchronization operations [3]. Indexed load and store instructions (ldx and stx) are useful for the Prolog built-in predicate arg/3 and for matrix operations in procedural languages.
Limited support is provided for detecting 32-bit two's complement overflow for addition and subtraction (add32 and sub32). Addition by a constant (addi) is most often used for pointer arithmetic and so is available only as a non-trapping instruction. The divide step (divs) and multiply step (mpys) instructions perform a single bit shift with conditional subtract/add. Thirty-two consecutive divs (mpys) instructions perform a complete 32-bit divide (multiply).
The special register file provides extra storage for system routines. Access to the special register file is accomplished using the rd and wr instructions.
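The idea of building a full divide from 32 single-bit steps can be sketched in Python. This is a generic unsigned restoring (shift and conditionally subtract) division, not the documented semantics of divs: the text does not specify whether divs is restoring or non-restoring or exactly how the PQ register is used, so the register roles and function names below are assumptions.

```python
# Generic shift-and-conditional-subtract division, one quotient bit per step.
# Assumed register roles: `remainder` accumulates the partial remainder and
# `quotient` starts out holding the dividend.
MASK32 = 0xFFFFFFFF


def div_step(remainder, quotient, divisor):
    """One divide step: shift the register pair left, then try a subtract."""
    remainder = (remainder << 1) | (quotient >> 31)   # bring in next dividend bit
    quotient = (quotient << 1) & MASK32
    if remainder >= divisor:                          # conditional subtract
        remainder -= divisor
        quotient |= 1                                 # record a 1 quotient bit
    return remainder, quotient


def divide32(dividend, divisor):
    """Unsigned 32-bit divide built from 32 consecutive divide steps."""
    remainder, quotient = 0, dividend & MASK32
    for _ in range(32):
        remainder, quotient = div_step(remainder, quotient, divisor)
    return quotient, remainder
```

For instance, `divide32(100, 7)` yields quotient 14 and remainder 2 after exactly 32 steps.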
Because the VLSI-BAM processor supports code which has branch instructions in branch delay slots, to correctly restart the pipeline after a trap, two PCs must be saved by a trap. The software trap instruction (trap) transfers control to a table of jmp instructions in low memory. Two instructions for each table entry are required for the jmp and its delay slot.
The remainder of this section motivates and presents our extensions to the general purpose instruction set. A major influence on the design of these extensions was the simultaneous development of an optimizing Prolog compiler. The abstract machine used by the compiler was initially designed using a top-down approach [31]. We assumed a set of data structures similar to those used by the WAM. Knowledge of possible compiler optimizations was applied to the semantics of Prolog to decompose Prolog's general operations into their components. These components, the abstract instruction set, are the instructions and addressing modes required to compile Prolog operations into efficient code. Efficient translation of abstract machine instructions into the architectural instruction set was a prime influence in the first pass of the instruction set design.
Tables 3.1-3.3 summarize the VLSI-BAM processor instruction set. The first two columns give the instruction mnemonic and operands. The third column gives the instruction's register transfer description. R(i) denotes general purpose register i; s(i) denotes special register i; dispn is a sign-extended n-bit displacement; immn is a sign-extended n-bit immediate; addr26 is a 26-bit segment offset; off1_8 and off2_8 are zero-extended 8-bit displacements; tag is a four-bit immediate tag value; and cond is one of twenty comparison conditions. M[x] is the memory location at address x. Tag^value specifies the tag insertion operation. Tvar represents the value of the unbound variable tag (0001). Cycle counts assume no pipeline stalls due to load or store delays. All branch and jump instructions are delayed, and the following instruction is executed unless it is annulled. The cycle count of dref depends on the number of memory operations (l) performed.
For example, although we had included cdr-coding of lists in our earlier processor [9, 25], we found that it is not of sufficient benefit to justify hardware support. Therefore cdr-coding support was not provided in the VLSI-BAM. In the following subsections we provide additional WAM performance measurements that we found useful as a basis for making design decisions during the design of the VLSI-BAM processor.
In addition to our studies of abstract instruction sets, we investigated the microarchitectural requirements for high performance Prolog [5] and gathered execution statistics for the VLSI-PLM, a microcoded implementation of the WAM [25]. These investigations pointed out those microarchitectural features that would give the greatest performance gains and the Prolog operations that most need instruction set support.
Prolog Inspired General Purpose Instructions
Prolog inspired general purpose instructions are those instructions which support Prolog and which also may be useful in the implementation of other languages (Table 3.2). These instructions include load and store of immediates, single-cycle double-word load and store, and push and pop memory operations.
Immediates can be loaded, stored, or used in a comparison (ldi, sti, stid, cmpi). The immediates are tagged and are created by sign-extending a 12 or 17-bit immediate and replacing the four most significant bits with an immediate tag. Load immediate (ldi) is used for creating integers and atoms. Store immediate (sti) is an optimization of a ldi, st sequence and is used to bind an atom to a variable that is known at compile time to be unbound. Double-word memory operations (ldd, std, stdc, pushd, pushdc) are motivated by Prolog's large memory bandwidth requirements. A double-word store or push is single-cycle (stdc, pushdc) only if the source registers form a consecutive, even/odd register pair, because only three registers, two of which must be adjacent, can be read from the register file per cycle. The std and pushd instructions allow the use of non-consecutive registers. They are two-cycle instructions, but this is offset by the absence of a pipeline stall when they are immediately followed by a memory operation.
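The construction of a tagged immediate described above (sign-extend, then overwrite the top four bits with the tag) can be sketched in Python. The function names are hypothetical, and the tag 0101 used in the usage note is an assumed stand-in for a software-defined atom tag.

```python
# Illustrative tagged-immediate construction; names are hypothetical.
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def tagged_immediate(imm, bits, tag):
    """Sign-extend a 12- or 17-bit immediate to 32 bits, then insert the tag."""
    word = sign_extend(imm, bits) & 0xFFFFFFFF       # 32-bit two's complement
    return ((tag & 0xF) << 28) | (word & 0x0FFFFFFF) # replace the top four bits
```

For example, `tagged_immediate(0xFFF, 12, 0b1111)` gives `0xFFFFFFFF`, the tagged integer -1, and `tagged_immediate(3, 17, 0b0101)` gives `0x50000003` under the assumed atom tag.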
Push instructions are included to support compound term creation. Using branch-and-bound search techniques, we determined an optimal set of single-cycle instructions for creation of all possible two and three-word structures [14]. This set of instructions is optimal in the sense that, for our microarchitecture, each structure is created in the smallest number of cycles. The resulting "compound term creation instruction set" favors the idiom of placing two words of data in registers and then moving them to memory using a double-word push. The VLSI-BAM chip also provides the external cache controller with a "push instruction" signal. With a properly designed external cache controller, push operations can skip the fill of the cache line from memory if a push incurs a cache miss and also refers to the first word of the cache line [6]. This optimization has been used in a previous Prolog design [20]. The push instructions allow the amount of the increment to be specified, and any general purpose register can be used as a stack pointer.
Prolog requires that variable assignment be undone on backtracking. This unbinding of variables is implemented by recording variable addresses on a "trail" stack. The original WAM model requires two pointer comparisons to determine if trailing is necessary. Our implementation restricts variables to the global stack (which reduces the number of comparisons to one) and uses a compare instruction followed by a conditional push (pusht). The pop instruction is used during backtracking to retrieve variable addresses from the trail stack. The compiler can reduce the amount of trailing and detrailing through the use of flow analysis to determine when uninitialized variables [1] can be used (our use of uninitialized variables is different from [1]: we use the same tag for both initialized and uninitialized variables and determine at compile time when destructive assignment is safe).
The location of, and interaction between, the environment and choice point stacks is software defined. However, unsigned maximum (umax) is provided to simplify the management of the environment and choice point stack pointers when these stacks are intermixed. In this case, allocation occurs at the maximum of the two stack pointer values.
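The single-comparison trail check and the corresponding undo step can be sketched in Python. The sketch assumes, following the usual WAM convention, that a binding needs trailing when the variable's address is below a heap backtrack pointer (called `hb` here); memory is modeled as a dictionary and all function names are hypothetical.

```python
# Illustrative trailing and detrailing; memory is a dict of address -> value.
def bind(memory, trail, var_addr, value, hb):
    """Bind a variable, trailing it only if it is older than the choice point.

    Models a compare (var_addr < hb, unsigned) followed by a conditional push.
    """
    if var_addr < hb:
        trail.append(var_addr)        # conditional push onto the trail stack
    memory[var_addr] = value


def undo_to(memory, trail, mark, make_unbound):
    """On backtracking, pop trailed addresses and reset them to unbound."""
    while len(trail) > mark:
        addr = trail.pop()            # pop retrieves a variable address
        memory[addr] = make_unbound(addr)
```

A variable created after the most recent choice point (address at or above `hb`) is not trailed, because backtracking discards it anyway.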
Prolog Specific Instruction Set Support
Prolog specific instructions are those instructions tailored specifically for efficient execution of Prolog (Table 3.3). These instructions support tagged pointer creation, two and three-way branch on tag, pointer dereferencing, and unification of atoms.
Tagged Data Support
Pointer creation is accomplished by the load effective address (lea) instruction which calculates an address and then replaces the most significant four bits with an immediate tag. This instruction is used to create pointers to unbound variables and compound terms (lists and structures).
Type checking built-ins are supported with single-cycle compare-and-branch-on-tag instructions (btgeq and btgne). These instructions also allow the compiler to replace shallow backtracking with a conditional branch on an argument's tag.
Prolog allows unbound variables to be bound together. The resulting reference chain must be dereferenced before subsequent variable binding. WAM instructions always dereference their operands, often resulting in superfluous dereferencing. However, our optimizing compiler keeps track of which variables are dereferenced and generates explicit dereferences only when necessary. Implementing dereference as a single instruction reduces static code size and allows dereference memory reads to be pipelined, resulting in a tighter loop than the equivalent assembly code [11, 24]. We use the same tag value for both unbound variables and reference pointers (unbound variables are self referential). The dereference instruction (dref) is implemented as a sequence of internal opcodes. Because a single dref instruction could potentially require many execution cycles, it is interruptible and restartable.
All of the basic arithmetic and compare instructions (add, sub, and, or, xor, cmp) have a version that traps on 28-bit overflow. These instructions operate on the full 32-bit word, but 28-bit overflow occurs if either of the sources or the result does not have an integer tag (0000 or 1111). The trap on 28-bit overflow allows Prolog arithmetic operations to be compiled to fast, safe code that avoids extra instructions for tag overflow checking. If a 28-bit overflow does occur, the trap routine can signal an overflow error or convert the data into an alternative representation.
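The dereference loop can be sketched in Python. The sketch relies on two facts from the text: reference pointers and unbound variables share the tag tvar (0001), and an unbound variable is a cell that points to itself. Memory is modeled as a dictionary keyed by the full tagged word, the function names are hypothetical, and the tag 0101 in the usage note is an assumed atom tag. The returned read count loosely corresponds to the number of memory operations on which the cycle count of dref depends.

```python
# Illustrative dereference; memory is a dict keyed by the full tagged word.
TVAR = 0b0001   # shared by unbound variables and reference pointers


def tag_of(word):
    """Return the four-bit tag in the top bits of a 32-bit word."""
    return (word >> 28) & 0xF


def dref(word, memory):
    """Follow a reference chain; return (final word, number of memory reads)."""
    reads = 0
    while tag_of(word) == TVAR:
        target = memory[word]
        reads += 1
        if target == word:        # self-referential cell: unbound variable
            break
        word = target
    return word, reads
```

With `memory = {0x10000000: 0x10000004, 0x10000004: 0x50000003}`, `dref(0x10000000, memory)` follows two links and returns `(0x50000003, 2)`.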
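The 28-bit overflow condition can be modeled in Python as a check on the tags of both sources and the result of a full 32-bit addition. The exception class and function names are hypothetical; the trap is modeled as a Python exception.

```python
# Illustrative model of an add that traps on 28-bit overflow.
class Overflow28(Exception):
    """Stands in for the hardware trap on 28-bit overflow."""


def has_int_tag(word):
    """True when the top four bits are 0000 or 1111."""
    return ((word >> 28) & 0xF) in (0b0000, 0b1111)


def add_trap28(a, b):
    """Full 32-bit add that raises unless sources and result are tagged integers."""
    result = (a + b) & 0xFFFFFFFF
    if not (has_int_tag(a) and has_int_tag(b) and has_int_tag(result)):
        raise Overflow28(f"28-bit overflow: {a:#010x} + {b:#010x}")
    return result
```

Adding 1 to `0x0FFFFFFF` produces `0x10000000`, whose tag is 0001, so the sketch raises; a handler could then report an error or switch to another number representation, as the text describes for the trap routine.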
Unification Support
Unification is one of the primary operations of Prolog; it is used for argument passing, structure creation, structure decomposition, and pattern matching. Although general unification is a complex algorithm, if one is given information about the arguments being unified, the general algorithm can be greatly simplified. This is one of the advantages of the WAM instruction set over an interpreter. Our compiler takes this principle further and propagates information to simplify unification as much as possible.
Analysis of the primitives necessary to support unification of a Prolog variable with an atom [31] motivates the single-cycle unify-immediate instruction (uni) which binds the atom to the variable if the variable is unbound, and otherwise tests them for equality.
Unification of a Prolog variable with a compound term also benefits from special support. Analysis of the primitives necessary to support unification of a Prolog variable with a list or structure [31] motivates the switch-tag instruction (swt), a three-way branch based on the tag of one register. One direction of the branch is taken if the tag is an unbound variable; a second direction is taken if the tag matches a specified immediate tag (usually list or structure); and a third direction is taken for all other tags. The three-way branch could be implemented using two two-way branches; however, WAM execution statistics (Table 3.4) show that there is a small but significant performance advantage to the three-way branch.
The LOW RISC processor [19] provides a 5-way branch and the Carmel-2 processor [11] provides a 10-way branch based on the tag of a single register. WAM execution statistics show that such generality is unnecessary for unification of a Prolog variable with a compound term.
When the compiler cannot determine any information about the types of the arguments to be unified, then general unification must be used. In this case one can still take advantage of dynamic properties of the argument types. The common cases of general unification should be done quickly in-line and infrequent cases passed to a general unification subroutine. Analysis of WAM execution (Table 3.5) indicates that about 70% of all general unifications are simple bindings of an unbound variable with a non-variable. These statistics motivate the switch-bind instruction (swb), a three-way branch based on the tags of two registers. The conditions of the three branch directions are: variable/non-variable, non-variable/variable, and otherwise (order of the arguments matters). This allows the common cases of variable/non-variable and non-variable/variable to be done in-line. A general unification subroutine is called for all other cases. Note that although the quick success and quick failure cases are simple to check for, their execution frequency is low enough that we have chosen not to do these checks in-line.
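The three-way decision made by swb, and the in-line handling of its two common cases, can be sketched in Python. The sketch assumes both operands are already dereferenced, omits trailing for brevity, and takes the general unification routine as a caller-supplied function; all names are hypothetical.

```python
# Illustrative three-way branch on the tags of two dereferenced words.
TVAR = 0b0001


def tag_of(word):
    """Return the four-bit tag in the top bits of a 32-bit word."""
    return (word >> 28) & 0xF


def swb(a, b):
    """Classify a pair of dereferenced words into one of three directions."""
    a_var = tag_of(a) == TVAR
    b_var = tag_of(b) == TVAR
    if a_var and not b_var:
        return "var/nonvar"
    if b_var and not a_var:
        return "nonvar/var"
    return "otherwise"


def unify(a, b, memory, general_unify):
    """Bind in-line for the two common cases; defer the rest to a subroutine."""
    direction = swb(a, b)
    if direction == "var/nonvar":
        memory[a] = b             # bind variable a to non-variable b
        return True
    if direction == "nonvar/var":
        memory[b] = a             # bind variable b to non-variable a
        return True
    return general_unify(a, b, memory)
```

Two variables, or two non-variables, both fall into the third direction, which matches the choice in the text not to test the quick success and quick failure cases in-line.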
The Pegasus processor [24] supports general unification with a 16-way branch based on two tag bits from each of two registers. The LIBRA processor [18] has a "partial unify" instruction. This single-cycle instruction performs either a nop, a store, a call, or a branch depending on the tags and comparison of the two arguments. It executes the variable/non-variable case of general unification in four cycles (not counting dereferencing of the arguments). Using switch-bind (swb),

Figure 3.1 provides an example of the use of several of the Prolog specific instructions. The predicate (created just as an example) succeeds for only certain combinations of variables and atoms for the arguments. When the second argument is an atom, the arguments are unified. When compiling, we assume that mode analysis cannot deduce any information about the types of the arguments when the predicate is entered.
The VLSI-BAM code places the two predicate arguments in registers r0 and r1. Before the type can be determined using a swt instruction, each argument must be fully dereferenced (using dref ). The rst swt instruction branches to label example_2_1 when r0 is an atom. Execution falls through when r0 is neither an atom or an unbound variable. In this case, the fall-through corresponds to failure of the predicate, and so the fail routine is called. As an optimization, the delay slot of the jmp(fail) is lled with the rst instruction of the fail routine and we now jump to the second instruction of fail.
The delay slot of the swt is executed when it branches to its rst destination. Typically, the rst instruction at the destination (dref(r1)) is replicated in the delay slot and the destination address incremented in order to reduce the execution time by a cycle.
When the two arguments are an unbound variable and an atom (in that order), then the unification reduces to the binding of a variable to a non-variable (st(r1,r0)). The variable being bound may require trailing, and this is done with a cmp(ltu,r0,hb), pusht(r0,tr,1) sequence. When both arguments are atoms, the unification simplifies to an equality comparison (cmp(ne,r0,r1), btat(fail)).
We will return to this example in Section 5 to help illustrate the methodology behind our cost-benefit analysis.
COMPILATION OF PROLOG
A significant aspect of our project was the simultaneous development of an optimizing Prolog compiler [31, 35]. Compilation of Prolog is done in three stages. First, the compiler produces code for its abstract machine. Second, this code is macro-expanded into the VLSI-BAM instruction set. Finally, the VLSI-BAM code is optimized by a peephole optimizer and an instruction reordering stage that maximizes the use of the double-word bus and minimizes the number of nops and pipeline stalls.
COST/BENEFIT ANALYSIS OF ARCHITECTURAL FEATURES AND INSTRUCTIONS
In Section 3 we motivated our instruction selection based on several sources of information: work on abstract instruction sets for compilers, bottom-up analysis of microarchitectural requirements for high performance Prolog, and analysis of WAM execution statistics. In this section we give a more rigorous validation of the architectural design and instruction selection by analyzing the cost and performance benefits of each special purpose feature and instruction. There has been some work to determine such results for other designs [10, 11, 24, 26], but the analysis presented here is more complete.

Table 5.1 shows the implementation cost of those features that extend the VLSI-BAM beyond a general purpose architecture. Implementation cost is expressed in terms of chip area required to implement the feature and in terms of VLSI design effort required. The chip area is measured in percent of total active area, which includes both transistor and wiring area. The chip contains approximately 110,000 transistors, and the total active area is 91 square millimeters using 1.2 µm CMOS (two metal layers). The VLSI layout was done using a symbolic layout editor with custom designed, parameterized cells. The building blocks were assembled into larger units using a data path compiler, PLA compiler, tiler, and router. The design effort for each feature is given as a percentage of its design that was automatically performed by the design tools. The last column of Table 5.1 lists those instructions that depend on a given feature. We do not give each feature's effect on the cycle time, since the microarchitecture and logic were designed carefully to prevent these features from being on the critical path.
Cost of Features
Segment mapping requires the greatest area of the special features. This area is primarily due to the 32 by 24-bit register file which contains the segment map. This register file is used to extend the address space as well as perform tag mapping. A smaller register file tailored to tag mapping alone would take less area. The next greatest area consuming feature is the tagged-immediate generation circuitry. This is due in part to the use of three distinct instruction formats for tagged-immediates.

Table 5.1: For each special feature of the VLSI-BAM processor, this table gives the percentage of active area (transistors and wires) required to implement the feature, the design complexity of the layout, and a list of instructions that depend on the feature. The design complexity is given as a percentage of the layout that was automatically generated (using tilers, routers, etc.) and the percentage that was laid out by hand. 100% compiled indicates that fewer than 30 gates were placed by hand. Multi-cycle/conditional is a subset of internal opcodes; the 0.1% active area refers to the entire internal opcode implementation.
The three-way branch instructions, swt and swb, use a unique destination offset format and require two additional displacement adders to allow the destination address calculation to overlap with the opcode decode. The double-word memory port requires extra ports on the general purpose register file to support the increased bandwidth. The area listed is the difference in size between our four/five-port register file and the more usual three-port register file. The extra pads required by the double-word bus are not included in the cost. After the fast tag logic, the remaining features use a very small portion of the total active area.
Benefits of Features
To determine the performance benefit of each feature, we calculated the cycle count increase caused by omitting the use of all instructions that depend on the feature [30]. For example, if omitting the instructions ldd, std, stdc, pushd, and pushdc increases execution time from 100 cycles to 111 cycles, then the performance benefit due to the double-word memory port is 11%. An instruction is omitted by replacing it with its macro-expansion into instructions that still remain in the instruction set. An effort was made to determine optimal expansions, and after macro-expansion, peephole optimization and instruction reordering are performed. Omission of segment mapping requires that explicit instructions be inserted to mask tag bits before tagged pointers are used as addresses. The combined benefit of two or more architectural features is determined by omitting all instructions affected by at least one of the features. A description of a preliminary version of this analysis is given in [23].

Figure fragment: opening instructions of example/2 using dref and swt.

    procedure(example/2).
    dref(r0).
    swt(r0,tatm=example_2_1+1, tvar=example_2_2).
    dref(r1).            % delay slot of swt
    jmp(fail+1).
    ldd(b-2,t0/t1).      % delay slot of jmp: first instruction of fail
    label(example_2_1).

Figure fragment: opening instructions of example/2 with dref and swt macro-expanded.

    procedure(example/2).
    label(expand_0).
    btgeq(tvar,r0,expand_2).
    label(expand_1).
    btgeq(tatm,r0,example_2_1).
    btgeq(tvar,r0,example_2_2).
    jmp(fail+1).
    ldd(b-2,t0/t1).
    label(expand_2).
    ld(r0,r14).
    cmp(eq,r0,r14).
    btat(expand_1).
    jmp(expand_0).
    addi(r14,0,r0).
    label(example_2_1).
To illustrate the technique used to measure performance benefits, Figure 5.1 shows how the code from Figure 3.1 changes when the dref and swt instructions are removed from the instruction set. The work done by these two instructions must now be done using the btg instruction along with several general purpose instructions.
The dref instruction is replaced with a btg that jumps to an explicit dereference loop. The dref instruction takes 1 cycle when the initial tag is non-variable and 4 cycles when only one memory load is required for dereferencing. The explicit loop requires 1 cycle when the initial tag is non-variable and 7 or 9 cycles when one memory load is required. To gain performance (at the cost of code size), a separate dereference loop is generated for each dref instruction replaced. Use of a single dereference subroutine would require saving the current PC and would take extra execution cycles.
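The cycle counts above can be turned into a rough average-cost model. The sketch below is a back-of-the-envelope illustration under stated assumptions, not a measurement: it assumes every dereference needs either zero or exactly one memory load, uses the 7-cycle figure for the explicit loop, and treats the fraction `p_one_load` as a free parameter (the function name and the value 0.2 are hypothetical).

```python
def average_cycles(cost_no_load, cost_one_load, p_one_load):
    """Expected cycles per dereference when a fraction p_one_load needs one load."""
    return (1.0 - p_one_load) * cost_no_load + p_one_load * cost_one_load

# Assumed fraction of dereferences that follow exactly one pointer.
p = 0.2

dref_avg = average_cycles(1, 4, p)   # dref instruction: 1 or 4 cycles
loop_avg = average_cycles(1, 7, p)   # explicit loop: 1 or 7 cycles
print(f"dref: {dref_avg:.1f} cycles, explicit loop: {loop_avg:.1f} cycles")
```

With this assumed fraction the model gives roughly 1.6 and 2.2 cycles, which happens to agree with the average dereference times reported later in this section. Longer reference chains and the 9-cycle case are ignored.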
The swt(reg,tag1=label_1,tag2=label_2) instruction requires 1, 2, and 2 cycles for the label_1, label_2, and fall-through branch directions. The btg instruction pair requires 2, 3, and 2 cycles, which is, on average, almost one cycle more for each dynamic occurrence.

Fast tag logic, double-word memory port, segment mapping, multi-cycle support, and tagged-immediate support are consistently important features. Tag overflow detection is important only in programs that make heavy use of integer arithmetic. The overall Prolog support column is determined by using only the instructions from Table 3.1 (and non-tagged versions of ldi and cmpi), omitting segment mapping and all instructions in Tables 3.2 and 3.3.

Each of the benefits listed in Table 5.2 represents a performance change with respect to the full VLSI-BAM architecture and instruction set. In practice, however, the benefit of a feature depends on what other features are also present. To study the interaction of the architectural features, the performance of every combination of features was determined. Figure 5.2 is a plot of the cost versus benefit of all 48 meaningful combinations of the special architectural features (except tag overflow detection; each of these 48 points assumes that tag overflow detection is supported). The upper left-hand portion of the plot contains two additional points that represent the base case and the base case plus the instructions push, umin, and umax. The line is the lower half of the convex hull of the points and represents one strategy for adding features to the general purpose base. Each line segment connects the points so that the best benefit per cost is achieved as one moves from left to right. In this case, it turns out that each point on the line represents a set of features that adds one new feature to that of the point on its left.
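The lower convex hull used to pick out the best feature combinations can be computed with a standard monotone-chain scan. The sketch below is illustrative only: it assumes each point is a `(cost, execution_time)` pair with lower execution time being better, and the sample points are made up rather than taken from Figure 5.2.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive means a left (counter-clockwise) turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def lower_hull(points):
    """Return the lower convex hull of 2-D points, ordered by increasing cost."""
    pts = sorted(set(points))
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or above the line to p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

# Hypothetical (cost, execution time) pairs for feature combinations.
combos = [(0, 10), (1, 8), (2, 7), (3, 4), (4, 5), (5, 6), (6, 3)]
frontier = lower_hull(combos)
print(frontier)
```

Every combination that is not on the returned frontier is dominated: some mix of frontier points achieves a lower execution time at the same cost.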
Starting from the base architecture, we first add the three instructions (push, umin, umax) that require no special architectural support, but which were not included in our base instruction set. Then tag overflow hardware is the next best feature to add, since the amount of hardware is negligible. It is interesting to note that the SPARC architecture also added tag overflow support as its one addition for supporting tagged languages. The remaining features, moving left to right, are

To summarize, the specialized support added for Prolog does not require unreasonable amounts of chip space or hand layout (13% active area for all Prolog related features), and it provides a performance benefit of 60 to 70%.
Comparison with Korsloot and Mulder
Korsloot and Mulder [17] present a study of the benefits of special architectural support for Prolog. They conclude that support for tags, auto increment/decrement addressing, and special choice point buffers can provide a 20 to 25% performance advantage. This is a smaller advantage than given in Table 5.2 (64% for all Prolog support), so it is important to investigate the source of the discrepancy. In this subsection we modify the VLSI-BAM benefit analysis to match as closely as possible the Korsloot and Mulder study. The primary source of the discrepancy is tag placement. A secondary contribution to the discrepancy arises from the benefit of specialized VLSI-BAM instructions that are not considered in the Korsloot and Mulder study.
One of the main differences between the assumptions of the two studies is the placement of the tag bits in the 32-bit word. The VLSI-BAM places the tags in the most significant bits of the word, whereas Korsloot and Mulder assume tags in the least significant bits. Korsloot and Mulder break tag support down into three categories: tag masking (to use tagged data as a memory address), tag insertion (to create tagged data), and tag extraction (primarily to branch based on the tag value). When the tag bits are in the least significant bits, tag masking can be done using displacement addressing. Tag insertion is simply addition of a small constant (for immediates, this can be done at compile time). These differences result in a different conclusion about the performance benefits for tag masking and tag insertion hardware support.

Table 5.3 gives the cycle count reduction resulting from architectural support for tag masking and tag insertion. Both of these results are relative to our base architectures. The results show a significant difference in the benefit of hardware support for tag masking and tag insertion. The benefit is small when the tag bits are in the least significant bits (2% for masking and 2% for insertion), but more sizable when the tags are in the most significant bits (4% and 12%, respectively). Tag placement in the least significant bits shows less benefit due to specialized architectural support; therefore this tag placement has better performance when there is no specialized support.
When there is no architectural support, the essential advantage of placing the tag in the lower bits is that tag insertion and tag masking can be done as the addition of a small constant that can be combined with immediates and displacement constants. When the tag is in the most significant bits of the word, then combining the constant for the tag operation with the offset requires an immediate with bit fields specified at both ends of the word, which is not an immediate format that is usually supported.
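The difference between the two tag placements can be illustrated with a small model of tagged-pointer loads. This is a sketch under assumptions that are not taken from the paper: a 2-bit low tag on word-aligned addresses, a 4-bit high tag, a dictionary standing in for memory, and hypothetical tag values and helper names.

```python
MASK32 = 0xFFFFFFFF

# --- Tags in the least significant bits (assumed 2-bit tag, aligned addresses) ---
def make_low(addr, tag):
    """Tag insertion is the addition of a small constant."""
    assert addr % 4 == 0 and 0 <= tag < 4
    return (addr + tag) & MASK32

def load_low(memory, ptr, tag, offset):
    """Tag masking is folded into the displacement: one address add, no mask."""
    displacement = offset - tag          # a compile-time constant
    return memory[(ptr + displacement) & MASK32]

# --- Tags in the most significant bits (assumed 4-bit tag) ---
HIGH_SHIFT = 28
ADDR_MASK = (1 << HIGH_SHIFT) - 1

def make_high(addr, tag):
    """Tag insertion needs a constant in the upper bits of the word."""
    return ((tag << HIGH_SHIFT) | (addr & ADDR_MASK)) & MASK32

def load_high(memory, ptr, offset):
    """Without hardware support, an explicit mask precedes the address add."""
    addr = ptr & ADDR_MASK               # extra masking operation
    return memory[addr + offset]

memory = {0x1000: "head", 0x1004: "tail"}
LIST_LOW, LIST_HIGH = 0b01, 0x3          # hypothetical tag values
p = make_low(0x1000, LIST_LOW)
q = make_high(0x1000, LIST_HIGH)
```

In the low-bit version the tag never has to be removed as a separate step, because `offset - tag` is an ordinary small displacement. In the high-bit version the mask is an additional operation unless hardware such as segment mapping removes it.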
Overflow detection during tagged arithmetic is simpler with the tags in the low bits, because the value of the tagged integer lies in the upper part of the word, and an overflow of this value is detected by the overflow hardware already present for 32-bit arithmetic. This advantage is not of the greatest importance, since the time required for overflow detection even in the case of most-significant bit tags with no hardware support is modest (2.6% from Table 5.2).
The proper choice for tag placement appears to depend strongly on whether architectural support will be provided for tag operations. Obviously if a Prolog system is to be implemented on hardware that includes no support for tags, then there are compelling reasons to place the bits in the least significant bits. But once one is willing to commit hardware for tag support, the choice becomes more difficult. Tags in the most significant bits can be of uniform width, simplifying the branch on tag support. For tags in the least significant bits the advantage of combining tag constants with immediates and offsets becomes less significant because formats for immediates can be changed to include bits from both the high and low ends of the word. Also, the amount of hardware required to support tagged-arithmetic overflow detection is minimal in either case.
Returning to the comparison of our analysis with that of Korsloot and Mulder, we feel that the best way to compare architectural support, other than for tag masking and tag insertion, is to assume that our base architectures already include support for tag masking and tag insertion. Although this is not exactly what the results of Korsloot represent, their numbers should be very close since the incremental effect of tag masking and tag insertion is only 1%. Specifically, the base architecture that we assume consists of the VLSI-BAM base plus segment mapping and instructions needed to support tag masking and insertion. This configuration is labeled "KM base" in Table 5.4. Using this base, we present results in Table 5.4 for the advantages of adding architectural features to support tag extraction, address modification, and choice point support.
For tag extraction Korsloot adds a branch on tag instruction. This is comparable to the btgeq and btgne instructions in the VLSI-BAM. Adding btg instructions to the KM base results in similar cycle count reduction as observed by Korsloot and Mulder.
Address modification consists of adding auto increment and auto decrement addressing modes. This is the same as adding the VLSI-BAM push and pop instructions (excluding the double-word versions). In this case, we observed less reduction in cycle count, apparently because our instruction reorderer is able to combine multiple addi instructions arising from macro expanding push instructions (push is replaced with st and addi). Specifically, for boyer and chat parser, the increase in dynamic addi count (resulting from disallowing auto address mode) is 3.3 times less than the dynamic count of push and pop. This ratio roughly matches the ratio of the reductions in cycle count: $(1 - 0.96)/(1 - 0.989) = 3.6$ for boyer.
Korsloot and Mulder provide choice point support by adding a shadow buffer that allows quick transfer between the buffer and the register file. The VLSI-BAM design supports choice points by relying on compiler optimization and the use of double-word loads and stores. The "choice point support" column in Table 5.4 shows that double-word loads and stores provide a slightly better performance improvement (3%) over a specialized choice point buffer. This is because double-word memory operations can also be used for creating procedure environments and accessing adjacent arguments of lists and structures.
Table 5.4 (caption, beginning lost): The top rows are data collected with our tools. The column labeled "KM base" represents the VLSI-BAM general purpose base plus segment mapping, ldi, cmpi, lea, and 28-bit arithmetic. The following instruction additions to this base are made for each column: "tag support", btgeq and btgne; "address modification", push and pop; "choice point support", ldd, std, and stdc; "SRISC", btgeq, btgne, push, pop, ldd, std, stdc, pushd, and pushdc; and "VLSI-BAM", SRISC plus all the remaining VLSI-BAM instructions.

When the support for tags, address modification, and choice points are all combined (the "SRISC" column), the total benefit is a cycle count reduction of 14 to 23%. For boyer and chat parser, we find a slightly smaller reduction (3 to 5%) than Korsloot and Mulder. The additional VLSI-BAM instructions beyond "SRISC" (ldx, sti, stid, pusht, umin, umax, dref, uni, swb, and swt) provide an additional 3 to 10% performance advantage. Section 5.3 provides more detail on which instructions contribute the most to this improvement.
Note that the total benefits for special purpose Prolog support found here average to 37.6% (1/0.727 = 1.376) compared to 64.2% (Table 5.2). This difference, as we discussed earlier, is due to the effects of tag placement on the apparent architectural advantage of tag masking and tag insertion.

Two primary conclusions can be drawn from the results of the comparison with Korsloot and Mulder. First, if there is no architectural support for tag operations, placing the tags in the least significant bits gives better performance. With architectural support, however, there is no performance advantage to either tag position. The choice is primarily dictated by aesthetics or historical preference. Second, use of double-word load and store instructions gives as much or more performance improvement than specialized choice point buffers.
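The conversion between cycle counts and the performance benefit percentages quoted in this section (for example 100 versus 111 cycles, or 1/0.727 = 1.376) can be written out explicitly. The function names below are hypothetical and the sketch simply restates the arithmetic used in the text.

```python
def benefit_from_cycles(cycles_with_feature, cycles_without_feature):
    """Performance benefit of a feature as a fraction of the time with the feature."""
    return cycles_without_feature / cycles_with_feature - 1.0

def benefit_from_ratio(cycle_ratio):
    """Benefit when the cycle count with the features is cycle_ratio times the base."""
    return 1.0 / cycle_ratio - 1.0

double_word = benefit_from_cycles(100, 111)   # double-word memory port example
km_total = benefit_from_ratio(0.727)          # average over the KM base
print(f"{double_word:.1%}, {km_total:.1%}")
```

Note that a cycle count reduction and a performance benefit are not the same number: a reduction to 0.727 of the base cycle count is a 27.3% reduction but a 37.6% benefit.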
Effect of Compiler Optimizations
Andrew Taylor has reported performance results for an optimizing Prolog compiler (PARMA) targeted for the MIPS processor [28]. His work raises the question of whether specialized hardware can be completely replaced with compiler optimization. The combination of PARMA and the MIPS R3000 is roughly equivalent in speed with the combination of Aquarius and the VLSI-BAM (after adjusting for differences in clock rates). This implies that the PARMA compiler is able to find more performance enhancing optimizations than the Aquarius compiler. But special purpose architectural features could also speed up the PARMA/MIPS combination. Most likely, however, the performance advantages of special features would be reduced.
In this section we attempt to estimate the usefulness of our special architectural features given an improved compiler. A version of the PARMA compiler which generates VLSI-BAM code is not available, so to approximate the quality of code given by such a compiler, we modify the assembly code produced by the Aquarius compiler based on an analysis of the execution trace of each benchmark. This analysis determines which operations, at least for the specific inputs used, are not productive for achieving the solution. Because the analysis is based only on a limited set of the possible inputs, the resulting optimizations are "unsafe" as far as producing correct code is concerned, but the analysis allows us to approximate the code produced by an extremely good compiler.
Analysis of the trace determines which operations are not taking one closer to the solution. For example, if a particular branch is never taken, then that branch can be removed. If the new value written to a register is always identical to the old value it replaces, then that register update is unnecessary (the same holds true for memory locations). A specific case of this is when a dereference instruction does not result in a new value being loaded into the register; such a dereference instruction would be removed by our hypothetical "optimizing compiler." Also, register and memory liveness can be computed from the trace, and if a memory location (or register) is not read between the time of two writes, then the first write is unnecessary.
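The kinds of trace analysis described above can be sketched as a single pass over a list of events. This is a simplified illustration, not the instrumented simulator used for the measurements: the event tuple format, the function name, and the conservative treatment of values still unread at the end of the trace are assumptions made for the example.

```python
from collections import defaultdict

def analyze_trace(trace):
    """Find removal candidates in an execution trace.

    Events (hypothetical format):
      ("branch", pc, taken)
      ("write",  pc, location, old_value, new_value)
      ("read",   pc, location)
    """
    outcomes = defaultdict(set)  # branch pc -> set of observed directions
    changed = set()              # write pcs that changed a value at least once
    used = set()                 # write pcs whose value was read at least once
    writers = set()              # all pcs that performed a write
    last_writer = {}             # location -> pc of the most recent write

    for event in trace:
        kind = event[0]
        if kind == "branch":
            _, pc, taken = event
            outcomes[pc].add(taken)
        elif kind == "write":
            _, pc, loc, old, new = event
            writers.add(pc)
            if old != new:
                changed.add(pc)
            last_writer[loc] = pc
        elif kind == "read":
            _, pc, loc = event
            if loc in last_writer:
                used.add(last_writer[loc])

    # Values still pending when the trace ends may be program results: keep them.
    used.update(last_writer.values())

    return {
        "never_taken": sorted(pc for pc, seen in outcomes.items() if seen == {False}),
        "always_taken": sorted(pc for pc, seen in outcomes.items() if seen == {True}),
        "silent_writes": sorted(writers - changed),  # new value always equals old
        "dead_writes": sorted(writers - used),       # value never read before overwrite
    }

trace = [
    ("write", 10, "r1", 0, 5),
    ("write", 11, "r1", 5, 7),
    ("read", 12, "r1"),
    ("branch", 13, False),
    ("write", 14, "r2", 3, 3),
    ("read", 15, "r2"),
    ("branch", 13, False),
    ("branch", 16, True),
    ("branch", 17, True),
    ("branch", 17, False),
]
report = analyze_trace(trace)
```

As in the text, the result is only valid for the inputs that produced the trace: a branch reported as never taken may still be taken on other inputs.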
Unfortunately, our analysis does not approximate some important optimizations which, if implemented, could possibly have a large effect on the performance benefit of specialized architectural features. For example, we do not take into account the effects of global register allocation, determinism extraction, multiple specialization, and compile-time garbage collection [32]. Consequently, our results do not bound the improvements possible with compiler optimizations.
We performed the following steps to create optimized code. First, the program was executed and a trace of procedure calls was obtained. This trace contains information on the types of each of the arguments. Using the data types of the arguments, mode declarations were added to the program to supplement the compiler's static analysis. Then an instrumented instruction-level simulator was used to gather information on register (and memory) writes that did not change the value of the write's destination. Instructions that never place a new value in the destination of the write were eliminated. Next, all branch instructions that use the fast tag logic (btgeq, btgne, swt, and swb) were analyzed to determine if one direction was always taken. If the branch was always taken, it was replaced with a jmp. If the branch was never taken, it was removed. Also, register (and memory location) liveness was determined. If an instruction always wrote a value that was never used, the instruction was removed.

Relative to Table 5.2, we see that the fast tag logic becomes much less important (it drops from 15% to about 3% performance benefit). This arises mostly from the near elimination of the dereference operation (see Tables 5.7 and 5.8). All of the other architectural features have roughly the same benefit as before. Thus, the primary architectural needs for Prolog, using the improved compiler, are high memory bandwidth (multi-word loads and stores) and fast tag masking and tag insertion. For the VLSI-BAM these tag operations are done using segment mapping and tagged-immediate formats. If the architecture is byte-addressed and the implementation places the tags in the least significant bits, then these tag operations can be performed using displacement addressing. The need for multi-word memory loads and stores, however, remains. Table 5.6 shows that specialized hardware support is still significant (31% for nand and 34% for meta) with the improved compiler.
In Section 5.2.1 we saw that the KM base roughly corresponds to a general purpose base with a Prolog implementation that places tags in the lower bits. With the Aquarius compiler, specialized hardware of the VLSI-BAM provides slightly more speed up (36 and 43%, respectively). The reason that the speed up due to specialized hardware changes only slightly as compiler optimizations improve is that in spite of the extra optimizations, Prolog still requires significant bandwidth to and from memory. Also, branch-on-tag support will always be useful due to the imperfect knowledge at compile time (for example, we do not know the data types of some inputs until run-time).
The performance benefit of the improved compiler over the Aquarius compiler is 21 and 26% (Table 5.6). Using data from Taylor's thesis [28], PARMA's improvement over Aquarius could be as much as 40% for nand. The extra optimization capabilities of PARMA over our hypothetical compiler point to the need for additional benefit studies using a compiler as good as or better than PARMA.

Table 5.6 (caption, beginning lost): (Table 5.4) and the VLSI-BAM instruction set, both using the "improved" compiler optimizations. The "Compiler optimizations" column gives the performance difference of the "improved" compiler and the Aquarius compiler with static mode analysis, both using the VLSI-BAM instruction set. Cache effects are not included.

Benefits of Individual Instructions
Table 5.7 lists performance benefits of individual instructions or instruction groups. Significant (greater than one percent) performance benefit is obtained from a majority of the special purpose instructions (dref, umin/umax, btgeq/ne, push/d/c, lea, and swt). The multi-cycle pointer dereference instruction (dref) has an average execution time of 1.6 cycles. Macro-expansion of dref into an explicit loop increases the average dereference time to 2.2 cycles. Although the benefit of dref per dereference is only 0.6 cycle, the total performance benefit is significant because of its frequent use. Some of the smaller benchmarks (not listed in the table), however, show no benefit for dref due to the complete elimination of dereferencing by compiler optimization. Unsigned maximum (umax) is used during environment and choice point creation. Omission of umax causes the time to determine the top of stack to increase from one to three cycles. Tagged-pointer creation (lea) is a frequent operation, and its omission adds an extra cycle for tag insertion (using or). Elimination of auto-increment addressing (push, pushd, pushdc) requires one extra cycle for each block allocation. The three-way branch on tag (swt) can be replaced by two btgeq instructions, adding an extra cycle to two of the branch directions. Elimination of the two-way branch on tag (btgeq/ne) would require a two instruction compare and branch.
Compared with our previously published results [15], the performance benefit of swt has dropped significantly. The difference is due to improvements in the compiler and instruction reorderer that allow btg instructions to replace swt in special cases.
The remaining instructions have less than one percent average performance benefit. Because the VLSI-PLM spends about 5% of its time trailing variable addresses, we included special support in the VLSI-BAM (pusht). However, due to the compiler's use of uninitialized variables, which do not have to be trailed, trailing time is reduced to 1.4% in the VLSI-BAM. Omitting pusht causes a slow down of 0.6%, which corresponds to trail time increasing from 2 to 3 cycles. Preliminary analysis using macro-expanded WAM for the chat parser benchmark indicated that the benefit for pop would be 1.5%. Compiler optimization of trailing has reduced this result. Similarly, compiler optimization reduces the number of general unifications, minimizing the benefit of swb. Our initial studies also overestimated the benefits of special support for unification of atoms (uni, sti, stid). Every time one of the atom unification instructions is executed, one cycle is gained when compared to not having the instruction. However, they simply do not have a high enough dynamic execution frequency to make a significant impact on the performance. Although pusht, swb, pop, uni, sti, and stid provide marginal performance benefit, their implementation uses only features already required by other instructions.

An interesting conclusion about the number of directions needed in multi-way branches can be made from these measurements. Multi-way branches are implemented in the VLSI-BAM with the swt and swb instructions, which are both single-cycle three-way branches (Table 3.3). Swt is used for unification of compound terms, for which greater than a three-way branch is not needed (Table 3.4 and [31]). Swb is used for unification of terms whose types are unknown at compile time. It takes care of 70% of these cases (Table 3.5), which gives a 0.5% execution time improvement (Table 5.7). If some single-cycle branch took care of 100% of these cases, we calculate the further improvement would be about 0.2%.
Given the additional complexity that such a branch implies, we conclude that a multi-way branch with more than three directions is not effective for Prolog.

Table 5.8 gives the performance benefit of the special purpose instructions assuming an improved compiler. As mentioned before, the benefit of the dref instruction is drastically reduced because most dereference instructions do no useful work and are eliminated. The benefit of the btgeq and btgne instructions is reduced by approximately half, since a sizable number of these instructions always take the same branch direction. All of the remaining instructions remain at about the same level of benefit. Thus, a set of push instructions remains important along with an unsigned maximum for stack pointer manipulation.

Macro expansion of WAM code into SPUR instructions causes the large code size of the SPUR. Static code size for the VLSI-BAM is surprisingly small, only slightly larger than that of the KCM. Measurements of the code size resulting when just using the VLSI-BAM's general purpose base instructions show that without the special architectural features for Prolog, the code size would be 2.1 times that of the VLSI-BAM. This is exactly the ratio observed between VLSI-BAM code and Aquarius Prolog SPARC code. Thus, the compactness of the VLSI-BAM code is due to the success of flow analysis in reducing code size (overcoming simple, but verbose macro-expansion) and the appropriateness of the VLSI-BAM instruction set for Prolog.

But when swt is eliminated from the instruction set, it is replaced by two consecutive btg instructions (see Figure 5.1). The btg instructions have a much greater offset range, and so are not likely to overflow and need replacement. A swt instruction that overflows for both destinations requires more code space than just two btg instructions.
Summarizing the Costs and Benefits
In this section we have looked at special purpose support for Prolog from several perspectives. Here we summarize our findings by giving our opinion on what hardware support is best for Prolog, taking into account trends toward further compiler improvement.
If we assume that the implementation places the tags in the lower bits of the word, then hardware support for tag masking and tag insertion is supplied by base plus displacement addressing and normal load immediate and add immediate instructions.
The greatest need, given both current and future compilers, is adequate bandwidth to and from memory. This can be provided by multiple-word loads and stores. Fortunately, multi-word memory loads and stores are now part of most high-performance microprocessor instruction sets.
The most important specialized support for Prolog can be reduced to a fast (single cycle) branch on tag (equality or non-equality comparison with an immediate value). When the tags are placed in the low bits of the word, this usually forces the tags to be variable size. Unfortunately, this complicates a special branch on tag instruction. One solution is to support only the most common (or the two most common) tag placement and width. Less common tag formats must be checked using a multi-instruction sequence of a mask operation followed by a normal compare and branch.
The VLSI-BAM supports three-way branch-on-tag instructions, but the indications are that they are currently at the margin of usefulness (swt provides only a 1% performance advantage and swb is less), and future compilers will further reduce this usefulness. Also against three-way branches is the need for additional displacement adders and the restricted range of branch offsets.
The single most useful specialized instruction in the VLSI-BAM instruction set is dref. Not only does it provide a 4% performance benefit, but also a 24% reduction in static code size. However, trace analysis shows that most dref instructions could possibly be removed by a better compiler. Whether a dereference instruction should be included in future Prolog instruction sets is an open question, and depends on the effort of compilation that goes into the majority of code executed. To achieve a more interactive system, one may not spend much time on static analysis, and so a dereference instruction would be useful for reducing code size.
Although the performance benefit is small, the cost of supporting a tag check on tagged arithmetic operations is so small (a few gates) that such support should be considered. It is not surprising, then, that the SPARC instruction set supports tagged add and subtract instructions.
There are two other instructions that improve Prolog performance that can be considered "general purpose," but that are not present in many of the high-performance microprocessor instruction sets. Auto-increment addressing mode for stores (push instructions) speeds up stack operations and heap data creation. Management of interleaved stacks can be done more efficiently with the unsigned maximum operation. Both push and unsigned maximum maintain their performance enhancement as compiler optimization improves. The utility of the unsigned maximum instruction, however, remains to be verified for the case when the environment and choice point stacks are not interleaved.
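Both operations are easy to state in C. The sketch below assumes an upward-growing stack and a store-then-increment push; the names `push`, `umax`, and `frame_base` are illustrative and the actual VLSI-BAM semantics may differ.

```c
#include <stdint.h>

typedef uint32_t word_t;

/* Push: store a word and advance the pointer, as one instruction would. */
static void push(word_t **sp, word_t value) {
    *(*sp)++ = value;
}

/* Unsigned maximum of two addresses. */
static uintptr_t umax(uintptr_t a, uintptr_t b) {
    return a > b ? a : b;
}

/* When environments and choice points share one stack area, a new frame
 * must start above whichever of the two current tops is higher. */
static word_t *frame_base(word_t *env_top, word_t *choice_top) {
    return (word_t *)umax((uintptr_t)env_top, (uintptr_t)choice_top);
}
```

Without an unsigned maximum instruction, `frame_base` needs a compare, a branch, and a move; the comparison must be unsigned because addresses in the upper half of the address space would otherwise compare as negative.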
In summary, future high-performance instruction sets for Prolog should include the following additions to a general purpose base: multi-word memory loads and stores, a single cycle branch on tag, tagged arithmetic support, push instructions, and (possibly) unsigned maximum.
PERFORMANCE RESULTS
), of which query is modified to use integer division in place of the original floating point; mu, which proves a theorem of Hofstadter's "mu-math"; prover, a simple theorem prover; queens_8, which solves the eight queens problem using an incremental generate-and-test strategy; meta_qsort, a meta-interpreter running Warren's qsort; nand, a logic synthesis program using branch-and-bound search; simple_analyzer, a flow analyzer analyzing Warren's qsort; poly_10, which symbolically raises a polynomial to the tenth power; chat_parser, which parses a set of English sentences; boyer, an extract from a Boyer-Moore theorem prover; and peep, the VLSI-BAM peephole optimizer processing meta_qsort. Further information about many of the benchmarks may be found in [12].
Table 6.1 compares the performance of the VLSI-BAM processor to that of two other Prolog systems. A more complete comparison of the performance of various Prolog systems can be found in [34, 32]. The results for VLSI-BAM are simulated assuming a 20 MHz clock and include overhead due to cache misses [6]. A clock speed of 20 MHz is used because it is the speed at which several VLSI-BAM chips successfully executed all the benchmark programs. The simulated system has 128 KB instruction and data caches. The caches are direct mapped and use a write-back policy. They are run in warm start, that is, each benchmark is run twice and the results of the first run are ignored. Cache effects are significant only for the last six programs in Table 6.1. The cache overhead is greatest for simple_analyzer and poly_10; for these programs the overhead ranges from 11% to 38%. For meta_qsort and chat_parser the overhead is less than 3%.
The KCM [2], one of the fastest WAM implementations, has a relatively large amount of specialized hardware to execute a WAM-like instruction set efficiently, whereas the VLSI-BAM processor uses modest hardware to support an optimizing compiler. We find that the speed advantage of the VLSI-BAM over the KCM is equal to or greater than the cycle time ratio.
Although the same compiler is used for both the SPARC and VLSI-BAM machines, the VLSI-BAM outperforms the SPARC for several reasons. First, there is the improvement due to specialized hardware support (this accounts for between 30 and 40%), but the majority of the difference in performance is due to the SPARCstation 1+'s relatively slow load and store instructions. Because Prolog programs use memory loads and stores heavily, a slow memory system has a dramatic effect on performance.
A common measure of Prolog speed is logical inferences per second (LIPS). In general this quantity is ambiguous; however, it is well defined for the naive reverse benchmark. The VLSI-BAM processor correctly executes naive reverse at 27.5 MHz, giving a measured performance of 3.37 million LIPS.
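Dividing the clock rate by the inference rate gives the average number of cycles per logical inference implied by the naive reverse figure. The helper below is a hypothetical illustration of that arithmetic only; it is not part of the measurement procedure.

```c
/* Average clock cycles spent per logical inference. */
static double cycles_per_inference(double clock_hz, double lips) {
    return clock_hz / lips;
}
```

Calling `cycles_per_inference(27.5e6, 3.37e6)` gives roughly 8.2 cycles per inference for naive reverse.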
CONCLUSIONS
The primary goal of our research has been to determine a minimal set of extensions to a general purpose architecture necessary for achieving high performance logic programming. At the same time, however, performance of the general purpose architecture has not been compromised. When tags are placed in the most-significant end of the word, we have identified tagged-immediate support, segment mapping, double-word memory bus, special logic for fast branch on tag, and multi-cycle instruction support as important Prolog-specific features. This hardware support gives a 64% performance benefit and costs 13% of the VLSI-BAM chip area. When tags are placed in the least-significant end of the word, double-word memory bus and fast branch on tag logic continue to be very important. In this case, specialized hardware support gives a 38% performance benefit. Even when we use an improved compiler, specialized instruction set support gives about a 32% performance benefit.
Our special instructions for trailing and unification of atoms are of marginal performance benefit. We conclude that branches with three or more directions are not effective for Prolog, especially as compilers improve. Our measurements, however, justify the utility of multi-word memory loads and stores, fast branch-on-tag instructions, push instructions, and tagged arithmetic. Such instruction set support would not only improve Prolog performance, but would also be advantageous to the implementation of other dynamically typed languages.
