Design and analysis of hardware for high-performance prolog  by Holmer, Bruce K. et al.
NORTH-HOLLAND
DESIGN AND ANALYSIS OF HARDWARE FOR
HIGH-PERFORMANCE PROLOG*
BRUCE K. HOLMER, BARTON SANO, MICHAEL CARLTON, 
PETER VAN ROY, AND ALVIN M. DESPAIN 
Most Prolog machines have been based on specialized architectures. Our
goal is to start with a general-purpose architecture and determine a minimal
set of extensions for high-performance Prolog execution. We have
developed both the architecture and optimizing compiler simultaneously,
drawing on results of previous implementations. We find that most
Prolog-specific operations can be done satisfactorily in software; however, there is
a crucial set of features that the architecture must support to achieve the
best Prolog performance. In this paper, the costs and benefits of special
architectural features and instructions are analyzed. In addition, we study
the relationship between the strength of compiler optimization and the
benefit of specialized hardware. We demonstrate that our base architecture can
be extended to include explicit support for Prolog with a modest increase in
chip area (13%), and yet attain a significant performance benefit (60-70%).
Experiments using optimized code that approximates the output of future
optimizing compilers indicate that special hardware support can still
provide a performance benefit of 30-35%. The microprocessor described here,
the VLSI-BAM, has been fabricated and incorporated into a working test
system.
*This paper is a revised and expanded version of "Fast Prolog with an Extended General 
Purpose Architecture," presented at the 17th International Symposium on Computer Architecture,
1990. This work was partially funded by the Advanced Research Projects Agency (ARPA) and 
monitored by the FBI under Contract J-FBI-91-194 and the Office of Naval Research under 
Contract N00014-88-K-0579. B. Holmer was funded by the National Science Foundation under 
the Research Initiation Award MIP-9210692. 
Address correspondence to Bruce Holmer, Siemens Components, Inc., 10950 North Tantau 
Ave., Cupertino, CA 95014. 
Received March 1995; accepted March 1996. 
THE JOURNAL OF LOGIC PROGRAMMING 
© Elsevier Science Inc., 1996 0743-1066/96/$15.00
655 Avenue of the Americas, New York, NY 10010 PII S0743-1066(96)00068-4 
1. INTRODUCTION 
Logic programming in general and Prolog [27] in particular have become popular 
for rapid software prototyping, natural language translation, and expert system
programming. Prolog's use of dynamic typing, backtracking, and unification places
heavy computational demands on general-purpose computers. In an attempt to
achieve ever higher performance, several special-purpose architectures have been 
proposed and built. Early Prolog architectures [37] were microcoded interpreters. 
Because no compilation was done, performance was disappointing. Higher per- 
formance processors [2, 9, 20, 25] have since been based on the Warren Abstract 
Machine (WAM) [36]. Their instruction sets were derived from the WAM to support 
execution of Prolog programs. These processors are special-purpose, microcoded en- 
gines that depend on parallel execution of operations within each relatively coarse- 
grained instruction for high performance. Initial designs implemented only the 
instructions that supported the WAM, and depended on a host processor for non- 
WAM computations. To support Prolog built-ins (primitive Prolog operations pro- 
vided by the system) and system I/O, newer designs incorporate general-purpose 
instructions to minimize dependence on a host. Alternatively, the use of a sim- 
ple, non-WAM instruction set better supports compiler optimization. Several such 
special-purpose reduced instruction set architectures have been proposed for logic
programming [11, 18, 19, 24]. These architectures include primitives that support
the use of tagged data, pointer dereference, and multiway branches. Our hypothesis
is that providing support for both compiler optimization and low-level operations 
can best be accomplished by extending a simple general-purpose architecture to 
support Prolog without compromising the general-purpose performance.
The performance improvements of recent general-purpose architectures over older 
architectures can be traced to research in which both the compiler and architecture 
were developed together [13, 16, 22]. Architectural features that cannot be used 
by the compiler or that cannot demonstrate performance improvement are not included.
Likewise, architectural features are added that support often-used primitive
operations. We have adopted this approach from the beginning of our project. 
It has been conjectured that commercial special-purpose symbolic processing
architectures are doomed because they are not commodity items, and consequently,
economics prevent them from staying on the leading edge of implementation technology.
However, if the architectural features necessary to improve symbolic performance
are modest and do not interfere with the general-purpose architecture, then
as more chip area becomes available, future implementations of general-purpose
processors can deliver high-performance symbolic computing in a standard product.
We hope that our work is a step towards this result.
The VLSI-BAM design was begun at University of California, Berkeley, as part 
of the Aquarius project. The Aquarius project group relocated to the University of 
Southern California in 1989 where the VLSI-BAM chip and a custom cache board 
were completed, fabricated, and tested. 
This paper presents the design of a processor based on the Berkeley Abstract 
Machine (BAM) and motivates its design with the results of our preliminary studies. 
We also present a discussion of the optimizing compiler, a cost/benefit analysis of 
the architectural features, and the simulated performance. Section 2 summarizes 
the processor architecture and hardware implementation. Section 3 presents the 
instruction set, along with the results of our studies which motivated instruction 
selection. The compilation of Prolog programs is described in Section 4, and in 
Section 5, we present a cost/benefit analysis of the special features and instructions. 
Section 6 gives the performance results. The final section concludes with a summary
of our results. 
2. PROCESSOR ARCHITECTURE AND IMPLEMENTATION 
The VLSI-BAM processor is a general-purpose, single-chip, pipelined processor with
extensions to support Prolog execution (Figure 1). Both data and instruction words
are 32 bits, and most instructions execute in a single cycle. The main features for
Prolog are tag manipulation (integrated into arithmetic and the memory system),
a double-word data port to memory, special branch on tag support, and several
instructions to support our execution model for Prolog.
The architecture is presented in detail, along with our motivations in the sub- 
sections below. Retaining a core general-purpose architecture imposes constraints 
on the symbolic extensions. For example, the processor should be able to handle 
tagged data items as single entities, with no special treatment for the tags. We dis- 
cuss the ramifications of this on the word format and the virtual memory system. 
Then we present the architecture's register structure and memory interface. Finally,
we present some details of the implementation such as the pipeline structure
and our mechanism for multiple-cycle instructions. 
2.1. Word Format 
Prolog does not require the user to specify the type of a data item. This requires 
that run-time type checking be implemented by adding a tag to each data item to 
encode the type of that item. Many Prolog processors handle the tag and value 
fields separately. This approach does not satisfy our goal of integrating tagging into 
FIGURE 1. Block diagram of the VLSI-BAM processor (labels: instruction interface, address, control, instruction cache, data cache).
a general-purpose architecture. Instead, we use a standard 32-bit word length and 
place the tag in the most significant 4 bits of the word. Arithmetic computations 
and addresses, however, use the entire 32-bit word, so general-purpose computa- 
tions are not affected by Prolog's use of tags. Tag values fixed by the hardware 
are those for nonnegative integers (0000) and negative integers (1111). This selec- 
tion of tags for integers is a common technique used by Lisp implementations on 
general-purpose machines [26]. We have also fixed the tag value for variable pointers 
(tvar = 0001) to increase the number of bits available for branch displacements in
several Prolog-specific instructions.¹ All other tag values are software-defined. Our
Prolog implementation uses tags similar to those of the WAM. 
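
The word layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`to_word`, `tag_of`, `with_tag`, `is_integer_word`) are ours and do not come from the paper; only the 4-bit tag position and the three hardware-fixed tag values are taken from the text.

```python
MASK32 = 0xFFFFFFFF
TAG_SHIFT = 28                 # the tag is the most significant 4 bits

# The three tag values the text says are fixed by hardware.
TAG_NONNEG_INT = 0b0000
TAG_NEG_INT = 0b1111
TVAR = 0b0001

def to_word(n):
    """Encode a Python int as a 32-bit two's-complement word."""
    return n & MASK32

def tag_of(word):
    """Return the 4-bit tag held in bits 31:28."""
    return (word & MASK32) >> TAG_SHIFT

def with_tag(tag, word):
    """Replace bits 31:28 of word with tag, keeping the low 28 bits."""
    return ((tag & 0xF) << TAG_SHIFT) | (word & 0x0FFFFFFF)

def is_integer_word(word):
    """True when the tag marks a nonnegative or negative integer."""
    return tag_of(word) in (TAG_NONNEG_INT, TAG_NEG_INT)
```

Because the integer tags coincide with the sign bits of a two's-complement word, any integer from -2^28 to 2^28 - 1 already carries an integer tag without a separate encoding step, which is why ordinary 32-bit arithmetic can operate on tagged integers directly.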
2.2. Segmented Virtual Addresses 
One consequence of using both the tag and value as an address is that each data type
is mapped into its own area of virtual memory; however, Prolog's execution model 
places data with several data types on the same stack or heap. One possible solution 
is to mask (zero) the tag bits of the address before using it to access memory. 
This solution is not satisfactory when applied to applications not using tags (for 
example, C programs). To avoid this difficulty, we have introduced a segment table 
that maps the most significant 6 bits of an address to a 12-bit value (Figure 2). 
An address before mapping is referred to as a short virtual address (SVA), and the 
38-bit address resulting from the mapping is referred to as a long virtual address 
(LVA). This memory segmentation scheme is similar to the segmentation used in the 
801 processor [7]. The 801 uses segmentation to extend the virtual address space;
however, our primary motivation for using segmentation is to allow multiple data 
types to be mapped to the same LVA segment. Mapping two bits in addition to 
the tag bits allows the use of several memory areas for a given data type, each area 
using a different mapping.² At one extreme, all data types can be mapped to the
same LVA segment (this is equivalent to masking the most significant six address
bits). At the other extreme, all SVA segments can be mapped to distinct LVA 
segments. In our current implementation of Prolog, variable, list, and structure 
pointers are mapped to the same LVA segment, whereas the environment/choice 
point stack, the trail stack, and the symbol table are mapped to separate segments. 
¹The dref, uni, swb, and swt instructions all assume tvar = 0001 as part of their definition (see
Table 3).
²Although mapping more than 6 bits may increase the flexibility of this scheme, mapping
additional bits would drastically increase the size of the on-chip segment mapping table.
FIGURE 2. Segmentation of virtual address space. The segment map translates the upper bits of the short virtual address into a 12-bit segment number, which is joined with the 26-bit segment offset to form the long virtual address.
Another use of segmentation is for sharing data in a multiprocessor system. In 
this case, the 38-bit LVA is used as the global virtual address, and sharing of data 
by cooperating processes is done at the segment level.
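
The SVA-to-LVA mapping described in this subsection can be sketched as a table lookup. This is a minimal illustration, not the chip's logic: the function name `sva_to_lva`, the identity-initialized table, the list tag value `TLST`, and the segment number `0x005` are all assumptions made for the example. Only the field widths (6 mapped bits, 12-bit table entries, 38-bit result) and `tvar = 0001` come from the text.

```python
OFFSET_BITS = 26        # low bits of the SVA pass through unchanged
SEGMENT_ENTRIES = 64    # the 6 most significant bits index the table

def sva_to_lva(sva, segment_table):
    """Map a 32-bit short virtual address to a 38-bit long virtual address."""
    segment = (sva >> OFFSET_BITS) & 0x3F
    offset = sva & ((1 << OFFSET_BITS) - 1)
    return ((segment_table[segment] & 0xFFF) << OFFSET_BITS) | offset

TVAR = 0b0001           # fixed by hardware
TLST = 0b0010           # hypothetical software-defined list tag

# Assumed starting point: every SVA segment maps to a distinct LVA segment.
segment_table = list(range(SEGMENT_ENTRIES))

# Map area 0 of both data types onto the same (arbitrary) LVA segment,
# so a variable pointer and a list pointer can address the same heap.
segment_table[TVAR << 2] = 0x005
segment_table[TLST << 2] = 0x005
```

With this table, the words `0x10000040` (a variable pointer) and `0x20000040` (a list pointer under the assumed tag) resolve to the same long virtual address even though their tags differ, while untouched segments keep their own address ranges.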
2.3. Memory Interface 
The high memory bandwidth requirement of Prolog dictates separate instruction
and data buses (Figure 1). In addition, we have expanded the data bus to double-word
width. A double-word data bus is motivated by Carlson's study [5] of the architectural
requirements of high-performance Prolog processors. Carlson compiled
Prolog programs into basic register transfer level operations, and then compacted 
them into more complex instructions while enforcing microarchitectural constraints. 
His results show that the best performance/cost tradeoff occurs when the architec- 
ture provides a double-word port to data memory. 
A double-word memory port improves the performance of term creation and 
speeds block transfers to and from environments and choice points. Some previous 
Prolog processors support fast choice point creation and restoration through the use 
of specialized buffers or shadow registers [9, 24]. Such hardware solutions are costly, 
and do not fit our goal of maintaining a general-purpose architecture. Instead, we 
rely on double-word memory operations and on compiler optimization to minimize 
shallow backtracking [33]. 
Our processor design is tightly coupled with the cache design. We decided against 
on-chip caches since, in our case, it is more appropriate to use processor chip area
for architectural features and use fast, dense static RAM chips for large caches. 
To speed cache accesses, however, protection violation and consistency checks and 
address tag comparison are done on-chip. More details about the cache interface 
are given in [6]. 
2.4. Base Architecture 
All programmer-visible processor registers are accessed as two sets of 32 registers: 
the general-purpose register set and the special register set. The general-purpose
registers are used for procedure argument passing, temporary storage, and as stack
pointers. The only general-purpose register with a preassigned use is the continuation
pointer (r31). This register is implicitly set to the return address by the call
instruction. All other uses of the general-purpose registers are defined by software 
convention. 
The special registers provide access to the processor status word (PSW), pro- 
gram counter (PC), partial product/quotient register (PQ), segment mapping table, 
cache interface configuration registers, and a set of 15 extra registers (s0-s14).
2.5. Implementation Details 
The execution pipeline consists of five stages (Figure 3). All instructions that 
modify registers or memory do so in the last pipeline stage. Register bypassing 
forwards results of the ALU and memory read pipeline stages to instructions in 
earlier stages of the pipeline. Pipeline stalls are provided in hardware
to ensure correct execution of both load and store operations. If data from a load
instruction are used by the next instruction, then the next instruction is delayed
by a cycle. Also, memory instructions immediately following a store are delayed
I  instruction fetch
R  register read
A  ALU
M  memory read
W  register/memory write
FIGURE 3. VLSI-BAM processor execution pipeline.
by a cycle. Store instructions require access to the cache during both the M and 
W pipeline stages--M to provide the address to determine the cache hit/miss and 
W to provide the data to be stored. 
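
The two interlock rules stated above (a load followed by a use of its result, and a memory instruction directly after a store) can be counted with a small sketch. The `Instr` record, the instruction kinds, and the register names are assumptions for illustration, and the sketch ignores everything else about the pipeline, including multicycle instructions.

```python
from collections import namedtuple

# Hypothetical instruction record: kind is "load", "store", or "alu".
Instr = namedtuple("Instr", "kind dest srcs")

def stall_cycles(program):
    """Count one-cycle stalls implied by the two rules in the text."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.kind == "load" and prev.dest in cur.srcs:
            stalls += 1      # load result used by the very next instruction
        elif prev.kind == "store" and cur.kind in ("load", "store"):
            stalls += 1      # memory instruction immediately after a store
    return stalls

example = [
    Instr("load", "r1", ("r2",)),        # r1 <- M[r2]
    Instr("alu", "r3", ("r1", "r4")),    # uses r1 right away: one stall
    Instr("store", None, ("r3", "r5")),  # M[r5] <- r3
    Instr("load", "r6", ("r5",)),        # follows a store: one stall
]
```

Under these two rules alone, `stall_cycles(example)` counts two stall cycles, one for each rule.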
All instructions are 32 bits with a 6-bit opcode and fixed source register format. 
Instruction execution is controlled by an opcode pipeline that operates in parallel 
with the execution pipeline. Each stage of the opcode pipe decodes the opcode as- 
sociated with that stage of the execution pipeline. Multicycle instructions and con- 
ditional instructions are implemented using "internal opcodes" [21]. The internal 
opcodes of multicycle instructions are fetched from a PLA and inserted into the opcode
pipeline. When an internal opcode is inserted, no instruction is fetched during
that cycle. Thus, a single external opcode can invoke a sequence of internal opcodes
to provide for often-used complex operations (for example, pointer dereferencing).
Internal opcode insertion is also used for atomic synchronization operations, for
pipeline interlock delays, and for trap and interrupt handling. Conditional execu- 
tion is implemented by conditionally replacing an opcode in the opcode pipe with 
an internal opcode. Our design uses 55 external opcodes and 24 internal opcodes; 
of the internal opcodes, nine are related to traps (trap, rft), 13 implement multicycle
instructions (dref, stx, std, pushd, las, jmpr), and two implement conditional
operation instructions (uni, pusht).
"Fast tag logic" is used to implement single-cycle tag-compare-and-branch instructions.
The fast tag logic consists of an extra register file that duplicates the
tag portion of the general-purpose register file and special tag comparison logic that 
allows quick tag comparison and branch. Previous Prolog processors [9] have also 
duplicated tag bits to accelerate branching on tag value. 
The general-purpose register file has two read ports (one single-word and one 
double-word) and two write ports (both single-word). This port structure pro- 
vides the bandwidth required by single-cycle double-word memory accesses without 
greatly increasing the complexity of the register file design. 
Figure 4 is a photomicrograph of the VLSI-BAM chip. The layout consists of 
three rows of functional units. In the top row, the leftmost quarter contains the 
register index latches, register bypass logic, fast tag comparison logic, and fast tag 
register file. The middle half of the row contains instruction address circuitry that 
consists of three branch offset adders (two for the three-way branches, swt and swb, 
and one for btg), an incrementer, multiple latches that hold the PC value of each 
pipeline stage, latches for trap handling, and the partial product/quotient register.
The rightmost quarter of the top row contains the pipeline control logic that
consists of decode PLAs and latches for holding the opcode of each pipeline stage. 
The left three quarters of the middle row contain the main data path. The
general-purpose register file is on the left end, the barrel shifter and ALU are
toward the right end, and pipeline latches lie in between. The right quarter of the 
middle row contains condition code logic, logic for trap detection and prioritization, 
and random logic for pipeline control signals. 
At the left end of the bottom row is the special-purpose register file. The middle
and right end of the row contain the cache interface logic that consists of (moving
left to right) the protection specification latches, protection comparison logic, tag
comparison logic, instruction and data address latches, and the segment table.
The design is pad-limited, so no extra effort was expended to compact the layout
beyond what was necessary to fit into the pad frame.
FIGURE 4. Photomicrograph of VLSI-BAM chip die.
3. INSTRUCTION SET
In this section, we present the VLSI-BAM instruction set. The instructions are
divided into three groups: general-purpose, Prolog-inspired general-purpose, and 
Prolog-specific. The general-purpose instructions are those which can be found in 
typical processors. The Prolog-inspired instructions are those which are not of- 
ten present in general-purpose processors, but which can still be used for general 
computation. The remaining instructions are tailored specifically to the require- 
ments of Prolog execution. 
The general-purpose instructions are summarized in Table 1. It is important to
point out that all arithmetic and logic operations operate on the full 32-bit word.
Also, conditional branches consist of separate compare and branch instructions.
Compare instructions set or clear the TF (true-false) condition code bit, and the
branch instructions take the branch when TF is set. Branches, jumps, and calls are
delayed by one instruction. The instruction in a branch delay slot can always be
executed (bt), annulled (turned into a nop) if the branch is taken (btat), or annulled
TABLE 1. General-purpose instructions.

Instruction            Operands             Action                              Cycles
ld, ldl                r(i), disp16, r(k)   r(k) ← M[r(i) + disp16]             1
                                            (ldl distinguishable to cache
                                            for synchronization operations)
ldx                    r(i), r(j), r(k)     r(k) ← M[r(i) + r(j)]               1
st, stu                r(i), r(k), disp16   M[r(k) + disp16] ← r(i)             1
                                            (stu distinguishable to cache)
stx                    r(i), r(k), r(l)     M[r(k) + r(l)] ← r(i)               2
las                    r(i), disp16, r(k)   r(k) ← M[r(i) + disp16];            2
                                            M[r(i) + disp16] ← -1
add, sub, and, or, xor r(i), r(j), r(k)     r(k) ← r(i) op r(j)                 1
add32, sub32           r(i), r(j), r(k)     r(k) ← r(i) op r(j)                 1
                                            (trap on signed 32-bit overflow)
addi, andi, ori, xori  r(i), imm16, r(k)    r(k) ← r(i) op imm16                1
sll, sra, srl          r(i), r(j), r(k)     r(k) ← r(i) op r(j)(4:0)            1
slli, srai, srli       r(i), imm5, r(k)     r(k) ← r(i) op imm5                 1
divs, mpys             r(i), r(j), r(k)     (r(k), PQ, TF) ←                    1
                                            op(r(i), r(j), PQ, TF)
cmp                    cond, r(i), r(j)     TF ← (r(i) cond r(j))               1
bt                     addr26               if (TF) PC(25:0) ← addr26           1
btan                   addr26               if (TF) PC(25:0) ← addr26;          1
                                            else annul next instruction
btat                   addr26               if (TF) {                           1
                                            PC(25:0) ← addr26;
                                            annul next instruction }
jmp                    addr26               PC(25:0) ← addr26                   1
jmpr                   r(i), disp16         PC ← r(i) + disp16                  2
call                   addr26               r(31) ← PC + 1;                     1
                                            PC(25:0) ← addr26
rd                     s(i), r(k)           r(k) ← s(i)                         1
wr                     r(i), s(k)           s(k) ← r(i)                         1
trap                   imm5                 save PCs and PSW;                   6
                                            set supervisor bit;
                                            PC ← 2*(32 + imm5)
rft                                         restore saved PSW;                  4
                                            fetch at saved PCs

Tables 1-3 summarize the VLSI-BAM processor instruction set. The first two columns give the
instruction mnemonic and operands. The third column gives the instruction's register transfer
description. r(i) denotes general-purpose register i; s(i) denotes special register i; dispn is a sign-extended
n-bit displacement; immn is a sign-extended n-bit immediate; addr26 is a 26-bit segment offset; off1_8
and off2_8 are zero-extended 8-bit displacements; tag is a 4-bit immediate tag value; and cond is one
of 20 comparison conditions. M[x] is the memory location at address x. tag^value specifies the tag
insertion operation. tvar represents the value of the unbound variable tag (0001). Cycle counts assume
no pipeline stalls due to load or store delays. All branch and jump instructions are delayed, and the
following instruction is executed unless it is annulled. The cycle count of dref depends on the number
of memory operations (l) performed.
if the branch is not taken (btan). Both directions of annulling are included because 
Prolog often favors annulling when the branch is taken (for example, branching out 
of straight-line code to the unification failure routine), whereas conditional branches 
to the top of a loop (common in procedural languages) favor annulling when the 
branch is not taken. 
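
The three delay-slot policies can be summarized in a small sketch that follows the register transfer descriptions of bt, btan, and btat in Table 1. The function name and its tuple result are ours and serve only as an illustration.

```python
def branch_outcome(mnemonic, tf):
    """Return (branch_taken, delay_slot_executed) for a conditional branch.

    The branch is taken when the TF condition bit is set; the mnemonic
    decides whether the instruction in the delay slot is annulled.
    """
    taken = bool(tf)
    if mnemonic == "bt":        # delay slot always executed
        slot_executed = True
    elif mnemonic == "btat":    # annul the delay slot if the branch is taken
        slot_executed = not taken
    elif mnemonic == "btan":    # annul the delay slot if the branch is not taken
        slot_executed = taken
    else:
        raise ValueError("not a conditional branch: " + mnemonic)
    return taken, slot_executed
```

For a branch to a unification failure routine, btat lets the delay slot hold the next instruction of the straight-line path, since the slot runs only when the branch falls through. For a loop-closing branch, btan lets the slot hold an instruction from the loop body, since the slot runs only when the loop repeats.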
Special load and store instructions (ldl and stu) activate a signal sent to the cache
(external to the VLSI-BAM chip) which can be used by the cache controller logic to
implement the "lock" and "unlock" multiprocessor synchronization operations [3].
Indexed load and store instructions (ldx and stx) are useful for the Prolog built-in 
predicate arg/3 and for matrix operations in procedural languages. 
Limited support is provided for detecting 32-bit two's complement overflow for 
addition and subtraction (add32 and sub32). Addition by a constant (addi) is 
most often used for pointer arithmetic, and so is available only as a nontrapping 
instruction.³ The divide step (divs) and multiply step (mpys) instructions perform a
single-bit shift with conditional subtract/add. Thirty-two consecutive divs (mpys) 
instructions perform a complete 32-bit divide (multiply). 
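
To show why thirty-two single-bit steps are enough for a full divide, here is a generic unsigned shift-and-subtract (restoring) division. It is a sketch of the general technique only: the text does not specify how divs uses the PQ register and the TF bit, or how signed operands are handled, so the function names and the register pairing below are assumptions.

```python
def div_step(rem, quo, divisor):
    """One step: shift the (rem, quo) pair left one bit, then subtract if possible."""
    rem = (rem << 1) | (quo >> 31)     # top bit of quo moves into rem
    quo = (quo << 1) & 0xFFFFFFFF
    if rem >= divisor:                 # conditional subtract
        rem -= divisor
        quo |= 1                       # record a quotient bit of 1
    return rem, quo

def divide32(dividend, divisor):
    """Unsigned 32-bit divide built from 32 steps; divisor must be nonzero."""
    rem, quo = 0, dividend & 0xFFFFFFFF
    for _ in range(32):
        rem, quo = div_step(rem, quo, divisor)
    return quo, rem
```

Each step consumes one dividend bit and produces one quotient bit, so the quotient register holds the complete result after the 32nd step. A multiply built from mpys would follow the same pattern with a conditional add in place of the conditional subtract.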
The special register file provides extra storage for system routines. Access to the 
special register file is accomplished using the rd and wr instructions. 
Because the VLSI-BAM processor supports code which has branch instructions 
in branch delay slots, to correctly restart the pipeline after a trap, two PCs must be
saved by a trap. The software trap instruction (trap) transfers control to a table of 
jmp instructions in low memory. Two instructions for each table entry are required 
for the jmp and its delay slot. 
The remainder of this section motivates and presents our extensions to the 
general-purpose instruction set. A major influence on the design of these extensions 
was the simultaneous development of an optimizing Prolog compiler. The abstract 
machine used by the compiler was initially designed using a top-down approach [31]. 
We assumed a set of data structures similar to those used by the WAM. Knowledge
of possible compiler optimizations was applied to the semantics of Prolog to
decompose Prolog's general operations into their components. These components, 
the abstract instruction set, are the instructions and addressing modes required 
to compile Prolog operations into efficient code. Efficient translation of abstract 
machine instructions into the architectural instruction set was a prime influence in 
the first pass of the instruction set design. 
Touati and Despain [29] present numerous measurements of WAM execution 
behavior. Results of their studies were part of our starting point in the VLSI-BAM 
design. For example, although we had included cdr-coding of lists in our earlier 
processor [9, 25], we found that it is not of sufficient benefit to justify hardware 
support. Therefore cdr-coding support was not provided in the VLSI-BAM. In 
the following subsections, we provide additional WAM performance measurements 
that we found useful as a basis for making design decisions during the design of the 
VLSI-BAM processor. 
In addition to our studies of abstract instruction sets, we investigated the microarchitectural
requirements for high performance Prolog [5], and gathered execution
statistics for the VLSI-PLM, a microcoded implementation of the WAM [25].
These investigations pointed out those microarchitectural features that would give 
³A trapped version of addition by a constant can be synthesized from ldi followed by add32.
the greatest performance gains and the Prolog operations that most need instruc- 
tion set support. 
3.1. Prolog-Inspired General-Purpose Instructions
Prolog-inspired general-purpose instructions are those instructions which support 
Prolog and which also may be useful in the implementation of other languages 
(Table 2). These instructions include load and store of immediates, single-cycle
double-word load and store, and push and pop memory operations. 
Immediates can be loaded, stored, or used in a comparison (ldi, sti, stid, cmpi). 
The immediates are tagged and are created by sign-extending a 12- or 17-bit im- 
mediate and replacing the four most significant bits with an immediate tag. Load 
immediate (ldi) is used for creating integers and atoms. Store immediate (sti) is 
an optimization of a ldi, st sequence, and is used to bind an atom with a variable 
that is known at compile time to be unbound. 
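
The construction of a tagged immediate (sign-extend, then overwrite the top four bits with the tag) can be sketched as follows. The function names are ours, and the atom tag value used in the usage note is hypothetical, since the text says such tags are software-defined.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value

def tagged_immediate(tag, imm, bits=17):
    """Sign-extend a 12- or 17-bit immediate and replace bits 31:28 with tag."""
    word = sign_extend(imm, bits) & 0xFFFFFFFF
    return ((tag & 0xF) << 28) | (word & 0x0FFFFFFF)
```

With the hardware-fixed integer tags, `tagged_immediate(0b0000, 5)` is simply the word 5, and `tagged_immediate(0b1111, -3)` is the ordinary two's-complement encoding of -3. With an assumed atom tag of `0b0100`, an atom whose table index is 42 would be the word `0x4000002A`.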
Double-word memory operations (ldd, std, stdc, pushd, pushdc) are motivated 
by Prolog's large memory bandwidth requirements. A double-word store or push is 
single-cycle (stdc, pushdc) only if the source registers form a consecutive even/odd
register pair because only three registers, two of which must be adjacent, can be
read from the register file per cycle. The std and pushd instructions allow the use
of nonconsecutive registers. They are two-cycle instructions, but this is offset by
TABLE 2. Prolog-inspired general-purpose instructions.

Instruction  Operands                   Action                              Cycles
ldi          tag, imm17, r(k)           r(k) ← tag^imm17                    1
sti          tag, imm17, r(k)           M[r(k)] ← tag^imm17                 1
stid         tag, imm12, r(k), disp5    M[r(k) + disp5] ← tag^imm12         1
cmpi         cond, r(i), tag, imm12     TF ← (r(i) cond tag^imm12)          1
ldd          r(i), disp11, r(k), r(l)   r(k) ← M[r(i) + disp11];            1
                                        r(l) ← M[r(i) + disp11 + 1]
                                        (r(i) + disp11 even)
std          r(i), r(j), r(k), disp11   M[r(k) + disp11] ← r(i);            2
                                        M[r(k) + disp11 + 1] ← r(j)
                                        (r(k) + disp11 even)
stdc         r(i), r(k), disp16         M[r(k) + disp16] ← r(i);            1
                                        M[r(k) + disp16 + 1] ← r(i + 1)
                                        (i and r(k) + disp16 even)
push         r(i), r(k), disp16         M[r(k)] ← r(i);                     1
                                        r(k) ← r(k) + disp16
pusht        r(i), r(k), disp16         if (TF) {                           1
                                        M[r(k)] ← r(i);
                                        r(k) ← r(k) + disp16 }
pushd        r(i), r(j), r(k), disp11   M[r(k)] ← r(i);                     2
                                        M[r(k) + 1] ← r(j);
                                        r(k) ← r(k) + disp11
                                        (r(k) even)
pushdc       r(i), r(k), disp16         M[r(k)] ← r(i);                     1
                                        M[r(k) + 1] ← r(i + 1);
                                        r(k) ← r(k) + disp16
                                        (i and r(k) even)
pop          r(i), disp16, r(k)         r(k) ← M[r(i) - disp16];            1
                                        r(i) ← r(i) - disp16
umin, umax   r(i), r(j), r(k)           r(k) ←                              1
                                        unsigned_min/max(r(i), r(j))

the absence of a pipeline stall when they are immediately followed by a memory 
operation. 
Push instructions are included to support compound term creation. Using
branch-and-bound search techniques, we determined an optimal set of single-cycle 
instructions for creation of all possible two- and three-word structures [14]. This set 
of instructions is optimal in the sense that, for our microarchitecture, each structure
is created in the smallest number of cycles. The resulting "compound term
creation instruction set" favors the idiom of placing two words of data in registers 
and then moving them to memory using a double-word push. The VLSI-BAM 
chip also provides the external cache controller with a "push instruction" signal. 
With a properly designed external cache controller, push operations can skip the 
fill of the cache line from memory if a push incurs a cache miss and also refers to
the first word of the cache line [6]. This optimization has been used in a previous 
Prolog design [20]. The push instructions allow the amount of the increment to be 
specified, and any general-purpose register can be used as a stack pointer. 
Prolog requires that variable assignment be undone on backtracking. This 
unbinding of variables is implemented by recording variable addresses on a "trail" 
stack. The original WAM model requires two pointer comparisons to determine 
if trailing is necessary. Our implementation restricts variables to the global stack 
(which reduces the number of comparisons to one) and uses a compare instruction 
followed by a conditional push (pusht). The pop instruction is used during back- 
tracking to retrieve variable addresses from the trail stack. The compiler can reduce 
the amount of trailing and detrailing through the use of flow analysis to determine 
when uninitialized variables [1] can be used (our use of uninitialized variables is different
from [1]: we use the same tag for both initialized and uninitialized variables
and determine at compile time when destructive assignment is safe).
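
The compare-then-pusht idiom and the use of pop during backtracking can be sketched as follows. The text does not state the exact comparison, so this sketch assumes the usual WAM-style test: a binding is trailed only when the variable is older than the heap top saved at the most recent choice point (called `hb` here). The class and method names, and the use of a Python list as the trail stack, are also assumptions for illustration.

```python
TVAR = 0b0001   # unbound-variable tag, fixed by hardware

def unbound(addr):
    """An unbound variable is a self-referential pointer with the tvar tag."""
    return (TVAR << 28) | (addr & 0x0FFFFFFF)

class TrailSketch:
    def __init__(self, hb):
        self.mem = {}        # address -> word
        self.trail = []      # trail stack of variable addresses
        self.hb = hb         # assumed: heap top at the latest choice point

    def bind(self, var_addr, value):
        tf = var_addr < self.hb          # single compare sets TF
        if tf:                           # pusht: push only when TF is set
            self.trail.append(var_addr)
        self.mem[var_addr] = value

    def undo_to(self, mark):
        """Backtracking: pop trailed addresses and reset those variables."""
        while len(self.trail) > mark:
            addr = self.trail.pop()
            self.mem[addr] = unbound(addr)
```

Bindings to variables created after the choice point are not trailed in this sketch, because the memory holding them is discarded on backtracking anyway.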
The location of, and interaction between, the environment and choice point 
stacks is software defined. However, unsigned maximum (umax) is provided to 
simplify the management of the environment and choice point stack pointers when
these stacks are intermixed. In this case, allocation occurs at the maximum of the 
two stack pointer values. 
3.2. Prolog-Specific Instruction Set Support 
Prolog-specific instructions are those instructions tailored specifically for efficient 
execution of Prolog (Table 3). These instructions support tagged pointer creation,
two- and three-way branch on tag, pointer dereferencing, and unification of atoms. 
3.2.1. Tagged Data Support. Pointer creation is accomplished by the load effec- 
tive address (lea) instruction which calculates an address and then replaces the 
most significant four bits with an immediate tag. This instruction is used to create 
pointers to unbound variables and compound terms (lists and structures). 
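
As an illustration of how lea combines with the double-word push of Table 2, the sketch below builds a two-word list cell on a heap. It models memory as a Python dict of word addresses and registers as a dict of names. The list tag `TLST`, the register names, and the word values are hypothetical; only the register transfer behavior of lea and pushd follows the tables.

```python
TLST = 0b0010   # hypothetical software-defined list tag

def lea(tag, base, disp):
    """r(k) <- tag^(r(i) + disp): compute an address, then insert the tag."""
    return ((tag & 0xF) << 28) | ((base + disp) & 0x0FFFFFFF)

def pushd(mem, regs, i, j, k, disp):
    """M[r(k)] <- r(i); M[r(k)+1] <- r(j); r(k) <- r(k) + disp (r(k) even)."""
    addr = regs[k]
    if addr % 2 != 0:
        raise ValueError("double-word push needs an even address")
    mem[addr] = regs[i]
    mem[addr + 1] = regs[j]
    regs[k] = addr + disp

def make_list_cell(mem, regs, head, tail, heap):
    """Create [Head|Tail] at the heap top and return a tagged list pointer."""
    pointer = lea(TLST, regs[heap], 0)   # tagged pointer to the new cell
    pushd(mem, regs, head, tail, heap, 2)
    return pointer
```

For example, with `regs = {"h": 0x100, "a": 7, "b": 9}`, calling `make_list_cell({}, regs, "a", "b", "h")` writes the two words at addresses `0x100` and `0x101`, advances `h` to `0x102`, and returns `0x20000100` under the assumed tag.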
Type checking built-ins are supported with single-cycle compare-and-branch-on- 
tag instructions (btgeq and btgne). These instructions also allow the compiler to 
replace shallow backtracking with a conditional branch on an argument's tag. 
Prolog allows unbound variables to be bound together. The resulting reference 
chain must be dereferenced before subsequent variable binding. WAM instruc- 
tions always dereference their operands, often resulting in superfluous dereferencing. 
However, our optimizing compiler keeps track of which variables are dereferenced 
TABLE 3. Prolog instructions.

Instruction    Operands                 Action                                  Cycles
lea            tag, r(i), disp12, r(k)  r(k) ← tag^(r(i) + disp12)              1
btgeq, btgne   tag, r(i), disp16        if (r(i)(31:28) =/≠ tag) {              1
                                        PC ← PC + disp16;
                                        annul next instruction }
dref           r(i)                     if (r(i)(31:28) = tvar)                 l = 0: 1
                                        do {                                    l ≠ 0: 2 + 2l
                                        tmp ← r(i);
                                        r(i) ← M[r(i)]
                                        } until ((r(i)(31:28) ≠ tvar) or
                                        (r(i) = tmp))
                                        (l = number of memory refs)
add28, sub28,  r(i), r(j), r(k)         r(k) ← r(i) op r(j)                     1
and28, or28,                            (trap on non-integer tags)
xor28
cmp28          cond, r(i), r(j)         TF ← (r(i) cond r(j))                   1
                                        (trap on non-integer tags)
uni            tag, imm17, r(i)         if (r(i)(31:28) = tvar) {               1
                                        M[r(i)] ← tag^imm17;
                                        TF ← 0
                                        } else if (r(i) = tag^imm17)
                                        TF ← 0;
                                        else TF ← 1
swb            r(i), r(j),              if ((r(i)(31:28) = tvar) and            1
               off1_8, off2_8           (r(j)(31:28) ≠ tvar))
                                        PC ← PC + off1_8;
                                        else if ((r(i)(31:28) ≠ tvar) and
                                        (r(j)(31:28) = tvar)) {
                                        PC ← PC + off2_8;
                                        annul next instruction
                                        } else annul next instruction
swt            r(i), tag1, tag2,        if (r(i)(31:28) = tag1)                 1
               off1_8, off2_8           PC ← PC + off1_8;
                                        else if (r(i)(31:28) = tag2) {
                                        PC ← PC + off2_8;
                                        annul next instruction
                                        } else annul next instruction
                                        (tag1 or tag2 is tvar)

and generates explicit dereferences only when necessary. Implementing dereference
as a single instruction reduces static code size and allows dereference memory reads
to be pipelined, resulting in a tighter loop than the equivalent assembly code [11,
24]. We use the same tag value for both unbound variables and reference pointers
(unbound variables are self-referential). The dereference instruction (dref) is
implemented as a sequence of internal opcodes. Because a single dref instruction
could potentially require many execution cycles, it is interruptible and restartable.
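The behavior of dref, including its cycle count from Table 3 (1 cycle with no memory reference, otherwise 2 + 2l for l memory references), can be sketched as below. The tag value `TVAR`, the dictionary standing in for memory, and the use of the low 28 bits as the address are assumptions made for illustration only.

```python
TVAR = 0x1  # hypothetical tag value for variables/references (encoding not given here)

def tag_of(word):
    return (word >> 28) & 0xF

def address_of(word):
    # Assumption: the low 28 bits select the memory cell.
    return word & 0x0FFFFFFF

def dref(word, memory):
    """Follow a reference chain; return (result, memory_refs, cycles)."""
    refs = 0
    if tag_of(word) == TVAR:
        while True:
            previous = word
            word = memory[address_of(word)]
            refs += 1
            # Stop at a nonvariable or at a self-referential (unbound) variable.
            if tag_of(word) != TVAR or word == previous:
                break
    cycles = 1 if refs == 0 else 2 + 2 * refs
    return word, refs, cycles
```

A nonvariable costs one cycle; a variable bound directly to a nonvariable costs four cycles, since one memory reference is needed.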
All of the basic arithmetic and compare instructions (add, sub, and, or, xor,
cmp) have a version that traps on 28-bit overflow. These instructions operate on
the full 32-bit word, but 28-bit overflow occurs if either of the sources or the result
does not have an integer tag (0000 or 1111). The trap on 28-bit overflow allows Prolog
arithmetic operations to be compiled to fast, safe code that avoids extra instructions
for tag overflow checking. If a 28-bit overflow does occur, the trap routine can signal
an overflow error or convert the data into an alternative representation.
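The trap condition can be illustrated with a small model of add28. The exception class is a hypothetical stand-in for the hardware trap; the integer tags 0000 and 1111 are taken from the text.

```python
MASK32 = 0xFFFFFFFF
INTEGER_TAGS = (0x0, 0xF)  # 0000 and 1111, as stated in the text

class TagOverflowTrap(Exception):
    """Hypothetical stand-in for the hardware trap."""

def tag_of(word):
    return (word >> 28) & 0xF

def add28(a, b):
    """Full 32-bit add that traps unless both sources and the result carry integer tags."""
    result = (a + b) & MASK32
    for word in (a, b, result):
        if tag_of(word) not in INTEGER_TAGS:
            raise TagOverflowTrap(hex(word))
    return result
```

Because small negative integers are sign-extended into the tag field (tag 1111), ordinary 32-bit addition already gives the right answer whenever no trap occurs, e.g. `add28(0xFFFFFFFF, 1)` returns `0`.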
3.2.2. Unification Support. Unification is one of the primary operations of Pro- 
log; it is used for argument passing, structure creation, structure decomposition, 
and pattern matching. Although general unification is a complex algorithm, if 
one is given information about the arguments being unified, the general algorithm 
can be greatly simplified. This is one of the advantages of the WAM instruction 
set over an interpreter. Our compiler takes this principle further, and propagates 
information to simplify unification as much as possible. 
Analysis of the primitives necessary to support unification of a Prolog vari- 
able with an atom [31] motivates the single-cycle unify-immediate instruction (uni) 
which binds the atom to the variable if the variable is unbound, and otherwise tests 
them for equality. 
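A sketch of the unify-immediate behavior described above, following the action given for uni in Table 3 (TF is 0 on success and 1 on failure). The tag values, the zero-extended 17-bit immediate in the low bits, and the dictionary used as memory are assumptions for illustration.

```python
TVAR = 0x1  # hypothetical tag value for unbound variables
TATM = 0x3  # hypothetical tag value for atoms

def tag_of(word):
    return (word >> 28) & 0xF

def uni(tag, imm17, reg, memory):
    """Unify the dereferenced word `reg` with the atom (tag, imm17); return TF."""
    atom = ((tag & 0xF) << 28) | (imm17 & 0x1FFFF)  # assumed layout of the tagged immediate
    if tag_of(reg) == TVAR:
        memory[reg & 0x0FFFFFFF] = atom  # bind the variable to the atom
        return 0
    return 0 if reg == atom else 1       # otherwise a plain equality test
```

The single instruction therefore covers both the binding case and the comparison case that would otherwise need a branch on tag followed by a store or a compare.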
Unification of a Prolog variable with a compound term also benefits from special 
support. Analysis of the primitives necessary to support unification of a Prolog 
variable with a list or structure [31] motivates the switch-tag instruction (swt), a 
three-way branch based on the tag of one register. One direction of the branch 
is taken if the tag is an unbound variable; a second direction is taken if the tag 
matches a specified immediate tag (usually list or structure); and a third direction 
is taken for all other tags. The three-way branch could be implemented using two 
two-way branches; however, WAM execution statistics (Table 4) show that there is 
a small but significant performance advantage to the three-way branch. 
The LOW RISC processor [19] provides a five-way branch and the Carmel-2 
processor [11] provides a ten-way branch based on the tag of a single register. 
WAM execution statistics show that such generality is unnecessary for unification
of a Prolog variable with a compound term. 
When the compiler cannot determine any information about the types of the 
arguments to be unified, then general unification must be used. In this case, one 
can still take advantage of dynamic properties of the argument types. The common
cases of general unification should be done quickly in-line and infrequent cases
passed to a general unification subroutine. Analysis of WAM execution (Table 5) in- 
dicates that about 70% of all general unifications are simple bindings of an unbound 
variable with a nonvariable. These statistics motivate the switch-bind instruction 
(swb), a three-way branch based on the tags of two registers. The conditions of 
the three branch directions are: variable/nonvariable, nonvariable/variable, and 
otherwise (order of the arguments matters). This allows the common cases of 
TABLE 4. WAM variable/compound term unification statistics.

                        Argument type (%)            Cost (cycles)
Program             Variable  List       Other      swt    Two-way
------------------  --------  ---------  -----      ----   -------
get_list
  prover              18.7      80.5       0.8      1.20    1.40
  meta_qsort          42.1      42.0      16.0      1.58    2.32
  simple_analyzer     24.4      67.4       8.3      1.33    1.74
  chat_parser          8.8      84.8       6.4      1.15    1.37
  average             23.5      68.7       7.9      1.32    1.71

                    Variable  Structure  Other      swt    Two-way
get_structure
  prover              26.7      73.3       0.0      1.27    1.53
  meta_qsort          37.6      62.4       0.0      1.38    1.75
  simple_analyzer     13.5      86.5       0.0      1.14    1.27
  chat_parser         44.0      52.5       3.5      1.48    1.98
  average             30.4      68.7       0.9      1.31    1.64
TABLE 5. WAM general unification statistics.

                                    Argument type (%)
                   Quick     Quick     Var/     Nonvar/   Var/
Program            success   failure   nonvar   var       var     Recursive
-----------------  -------   -------   ------   -------   ----    ---------
prover               15.6      15.6      0.0      61.4     0.0       7.5
meta_qsort            0.0       0.0      0.0      50.5    49.5       0.0
simple_analyzer       0.1       2.3     13.3      70.5    11.5       2.1
chat_parser           0.3      11.8     13.6      69.3     2.3       2.5
average               4.0       7.4      6.7      62.9    15.8       3.0
variable/nonvariable and nonvariable/variable to be done in-line. A general unifi- 
cation subroutine is called for all other cases. Note that although the quick success 
and quick failure cases are simple to check for, their execution frequency is low 
enough that we have chosen not to do these checks in-line. 
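The three-way dispatch performed by swb, and the share of general unifications it lets the compiler handle in-line, can be sketched as follows. The tag value and the direction names are hypothetical; the two percentages are the average-row values from Table 5.

```python
TVAR = 0x1  # hypothetical tag value for unbound variables

def tag_of(word):
    return (word >> 28) & 0xF

def swb_direction(r_i, r_j):
    """Classify a pair of dereferenced words the way swb branches on them."""
    i_is_var = tag_of(r_i) == TVAR
    j_is_var = tag_of(r_j) == TVAR
    if i_is_var and not j_is_var:
        return "bind_i_to_j"    # variable/nonvariable: in-line binding
    if j_is_var and not i_is_var:
        return "bind_j_to_i"    # nonvariable/variable: in-line binding
    return "general_unify"      # everything else goes to the subroutine

# Average row of Table 5 (percent of general unifications).
VAR_NONVAR, NONVAR_VAR = 6.7, 62.9
inline_share = VAR_NONVAR + NONVAR_VAR  # about 70%, as stated in the text
```

Note that the variable/variable case also falls into the third direction, which is why the order of the two register operands matters only for the first two.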
The Pegasus processor [24] supports general unification with a 16-way branch 
based on two tag bits from each of two registers. The LIBRA processor [18] has 
a "partial unify" instruction. This single-cycle instruction performs either a nop, 
a store, a call, or a branch, depending on the tags and comparison of the two 
arguments. It executes the variable/nonvariable case of general unification in four 
cycles (not counting dereferencing of the arguments). Using switch-bind (swb), 
VLSI-BAM executes this case in five cycles. Although the partial unify instruction 
of the LIBRA has a slight performance advantage, its complexity does not fit with 
our goal of minimally extending a general-purpose architecture. 
3.2.3. Example Use of Prolog Specific Instructions. Figure 5 provides an exam- 
ple of the use of several of the Prolog-specific instructions. The predicate (created 
just as an example) succeeds for only certain combinations of variables and atoms 
for the arguments. When the second argument is an atom, the arguments are 
unified. When compiling, we assume that mode analysis cannot deduce any infor- 
mation about the types of the arguments when the predicate is entered. 
The VLSI-BAM code places the two predicate arguments in registers r0 and 
r1. Before the type can be determined using a swt instruction, each argument
must be fully dereferenced (using dref). The first swt instruction branches to label
example_2_1 when r0 is an atom. Execution falls through when r0 is neither an
atom nor an unbound variable. In this case, the fall-through corresponds to failure
of the predicate, and so the fail routine is called. As an optimization, the delay
slot of the jmp(fail) is filled with the first instruction of the fail routine, and we
now jump to the second instruction of fail.
The delay slot of the swt is executed when it branches to its first destination. 
Typically, the first instruction at the destination (dref(r1)) is replicated in the
delay slot and the destination address incremented in order to reduce the execution 
time by a cycle. 
When the two arguments are an unbound variable and an atom (in that or- 
der), then the unification reduces to the binding of a variable to a nonvariable 
(st(r1, r0)). The variable being bound may require trailing, and this is done with
a cmp(ltu, r0, hb), pusht(r0, tr, 1) sequence. When both arguments are atoms,
the unification simplifies to an equality comparison (cmp(ne, r0, r1), btat(fail)).
We will return to this example in Section 5 to help illustrate the methodology 
behind our cost-benefit analysis. 
example(X, Y) :- (var(X); atom(X)), atom(Y), X = Y.
example(X, Y) :- atom(X), var(Y).

procedure(example/2).
dref(r0).
swt(r0,tatm=example_2_1+1,tvar=example_2_2).
dref(r1).
jmp(fail+1).
ldd(b-2,t0/t1).
label(example_2_1).
dref(r1).
swt(r1,tatm=example_2_3+1,tvar=example_2_4).
cmp(ne,r0,r1).
jmp(fail+1).
ldd(b-2,t0/t1).
label(example_2_2).
dref(r1).
btgne(tatm,r1,example_2_0).
st(r1,r0).
cmp(ltu,r0,hb).
jmpr(cp+1).
pusht(r0,tr,1).
label(example_2_0).
jmp(fail+1).
ldd(b-2,t0/t1).
label(example_2_3).
cmp(ne,r0,r1).
btat(fail).
jmpr(cp+1).
nop.
label(example_2_4).
jmpr(cp+1).
nop.

FIGURE 5. Example translation of Prolog into VLSI-BAM instructions.
4. COMPILATION OF PROLOG
A significant aspect of our project was the simultaneous development of an optimiz- 
ing Prolog compiler [31, 35]. The compiler incorporates techniques for determinism 
extraction and use of destructive assignment. The compiler accepts standard Prolog
and produces code for a simple non-WAM abstract machine. Although the compiler 
uses stacks and data structures similar to WAM implementations, it does not use 
the WAM during compilation, but instead directly compiles to its own abstract 
machine. Automatic mode generation (type inferencing) is implemented using
abstract interpretation [8]. It derives ground, uninitialized variable [1], and
dereference modes. Differences between the numbers listed in the following sections and
those in a previous version of the paper [15] reflect improvements in the compiler 
and back-end instruction reorderer. 
Compilation of Prolog is done in three stages. First, the compiler produces code 
for its abstract machine. Second, this code is macro-expanded into the VLSI-BAM 
instruction set. Finally, the VLSI-BAM code is optimized by a peephole optimizer 
and instruction reordering stage that maximizes the use of the double-word bus 
and minimizes the number of nops and pipeline stalls.
5. COST/BENEFIT ANALYSIS OF ARCHITECTURAL FEATURES
AND INSTRUCTIONS
In Section 3, we motivated our instruction selection based on several sources of 
information: work on abstract instruction sets for compilers, bottom-up analysis 
of microarchitectural requirements for high-performance Prolog, and analysis of
WAM execution statistics. In this section, we give a more rigorous validation of the 
architectural design and instruction selection by analyzing the cost and performance 
benefits of each special-purpose feature and instruction. There has been some work 
to determine such results for other designs [10, 11, 24, 26] but the analysis presented 
here is more complete. 
5.1. Cost of Features 
Table 6 shows the implementation cost of those features that extend the VLSI- 
BAM beyond a general-purpose architecture. Implementation cost is expressed in 
terms of chip area required to implement the feature and in terms of VLSI design 
effort required. The chip area is measured in percent of total active area, which 
includes both transistor and wiring area. The chip contains approximately 110,000 
transistors, and the total active area is 91 square millimeters using 1.2 µm CMOS
(two metal layers). The VLSI layout was done using a symbolic layout editor with
TABLE 6. Cost of special architectural features.

Feature                   Active area   Design complexity   Instructions affected
------------------------  ------------  ------------------  ------------------------------
Segment mapping           4.8%          ~100% compiled      --
Tagged-immediate          2.2%          100% compiled       ldi, cmpi, sti, stid, lea, uni
Switch offset adders      2.0%          100% compiled       swt, swb
Double-word memory port   1.9%          95% compiled;       ldd, std, stdc, pushd, pushdc,
                                        5% by hand          pop
Fast tag logic            1.6%          ~100% compiled      btgeq, btgne, swt, swb, dref,
                                                            uni
Multicycle/conditional    0.1%          100% compiled       stx, std, pushd, pusht, dref,
                                                            uni
Tag overflow detect       ~0.0%         100% by hand        cmp28, add28, sub28, and28,
                          (10 gates)                        or28, xor28
Total special features    12.6%         99% compiled;
                                        1% by hand

For each special feature of the VLSI-BAM processor, this table gives the percentage of active area
(transistors and wires) required to implement the feature, the design complexity of the layout, and a list
of instructions that depend on the feature. The design complexity is given as a percentage of the layout
that was automatically generated (using tilers, routers, etc.) and the percentage that was laid out by
hand. ~100% compiled indicates that fewer than 30 gates were placed by hand. Multicycle/conditional
is a subset of internal opcodes: the 0.1% active area refers to the entire internal opcode implementation.
custom designed, parameterized cells. The building blocks were assembled into 
larger units using a data path compiler, PLA compiler, tiler, and router. The design 
effort for each feature is given as a percentage of its design that was automatically 
performed by the design tools. The last column of Table 6 lists those instructions 
that depend on a given feature. We do not give each feature's effect on the cycle 
time since the microarchitecture and logic were designed carefully to prevent these
features from being on the critical path. 
Segment mapping requires the greatest area of the special features. This area 
is primarily due to the 32- by 24-bit register file which contains the segment map. 
This register file is used to extend the address space as well as perform tag mapping.
A smaller register file tailored to tag mapping alone would take less area. The next
greatest area consuming feature is the tagged-immediate generation circuitry. This 
is due in part to the use of three distinct instruction formats for tagged-immediates. 
The three-way branch instructions, swt and swb, use a unique destination offset format
and require two additional displacement adders to allow the destination address
calculation to overlap with the opcode decode. The double-word memory port
requires extra ports on the general-purpose register file to support the increased
bandwidth. The area listed is the difference in size between our four/five-port reg- 
ister file and the more usual three-port register file. 4 The extra pads required by 
the double-word bus are not included in the cost. After the fast tag logic, the 
remaining features use a very small portion of the total active area. 
5.2. Benefits of Features 
To determine the performance benefit of each feature, we calculated the cycle count 
increase caused by omitting the use of all instructions that depend on the
feature [30]. For example, if omitting the instructions ldd, std, stdc, pushd, and pushdc
increases execution time from 100 cycles to 111 cycles, then the performance benefit 
due to the double-word memory port is 11%. An instruction is omitted by replacing 
it with its macro-expansion into instructions that still remain in the instruction set. 
An effort was made to determine optimal expansions, and after macro-expansion, 
peephole optimization and instruction reordering are performed. Omission of seg- 
ment mapping requires that explicit instructions be inserted to mask tag bits before 
tagged-pointers are used as addresses. The combined benefit of two or more ar- 
chitectural features is determined by omitting all instructions affected by at least 
one of the features. A description of a preliminary version of this analysis is given 
in [23]. 
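The measurement can be summarized in a few lines. The function and variable names are ours; the instruction lists are copied from Table 6 for two of the features, and the cycle counts are the illustrative 100 and 111 used in the text.

```python
# Instructions that depend on each feature (two rows of Table 6).
FEATURE_INSTRUCTIONS = {
    "double_word_memory_port": {"ldd", "std", "stdc", "pushd", "pushdc", "pop"},
    "multicycle_conditional": {"stx", "std", "pushd", "pusht", "dref", "uni"},
}

def performance_benefit(cycles_with, cycles_without):
    """Percent cycle-count increase caused by omitting a feature."""
    return 100.0 * (cycles_without - cycles_with) / cycles_with

def instructions_to_omit(features):
    """Combined benefit: omit every instruction affected by at least one feature."""
    omitted = set()
    for feature in features:
        omitted |= FEATURE_INSTRUCTIONS[feature]
    return omitted
```

With the numbers from the text, `performance_benefit(100, 111)` gives 11.0. Because features share instructions (std and pushd appear in both rows above), the combined benefit of two features is measured on the union and is not simply the sum of the individual benefits.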
To illustrate the technique used to measure performance benefits, Figure 6 shows 
how the code from Figure 5 changes when the dref and swt instructions are removed 
from the instruction set. The work done by these two instructions must now be 
done using the btg instruction along with several general-purpose instructions. 
The dref instruction is replaced with a btg that jumps to an explicit dereference 
loop. The dref instruction takes one cycle when the initial tag is nonvariable and 
four cycles when only one memory load is required for dereferencing. The explicit 
loop requires one cycle when the initial tag is nonvariable and seven or nine cycles 
when one memory load is required. To gain performance (at the cost of code size), 
4 The pop instruction is included on the double-word memory port instruction list since it also
requires the extra register file port.
procedure(example/2).
dref(r0).
swt(r0,tatm=example_2_1+1,tvar=example_2_2).
dref(r1).
jmp(fail+1).
ldd(b-2,t0/t1).

procedure(example/2).
label(expand_0).
btgeq(tvar,r0,expand_2).
label(expand_1).
btgeq(tatm,r0,example_2_1).
btgeq(tvar,r0,example_2_2).
label(example_2_1).
jmp(fail+1).
ldd(b-2,t0/t1).
label(expand_2).
ld(r0,r14).
cmp(eq,r0,r14).
btat(expand_1).
jmp(expand_0).
addi(r14,0,r0).
label(example_2_1).

FIGURE 6. Result of removing dref and swt instructions.
a separate dereference loop is generated for each dref instruction replaced. Use of a 
single dereference subroutine would require saving the current PC and would take 
extra execution cycles. 
The swt(reg,tag1=label_1,tag2=label_2) instruction requires 1, 2, and 2
cycles for the label_1, label_2, and fall-through branch directions. The btg
instruction pair requires 2, 3, and 2 cycles, which is, on average, almost one cycle
more for each dynamic occurrence. 
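The "almost one cycle" figure follows directly from the per-direction costs: the btg pair is one cycle slower on both taken directions and equal on the fall-through. A short sketch, in which the branch probabilities are hypothetical values chosen only for illustration:

```python
SWT_CYCLES = (1, 2, 2)        # label_1, label_2, fall-through (from the text)
BTG_PAIR_CYCLES = (2, 3, 2)   # same three directions using two btg instructions

def expected_cycles(costs, probabilities):
    """Average cycles per dynamic occurrence for a given direction mix."""
    return sum(c * p for c, p in zip(costs, probabilities))

def extra_cycles(probabilities):
    """Average penalty of the btg pair relative to swt."""
    return (expected_cycles(BTG_PAIR_CYCLES, probabilities)
            - expected_cycles(SWT_CYCLES, probabilities))

# Hypothetical mix: 70% label_1, 25% label_2, 5% fall-through.
example_mix = (0.70, 0.25, 0.05)
```

For any mix the penalty equals one minus the fall-through probability, so it approaches a full cycle whenever fall-through is rare.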
Table 7 lists the performance benefit of the features given in Table 6. Fast tag 
logic, double-word memory port, segment mapping, multicycle support, and tagged- 
immediate support are consistently important features. Tag overflow detection 
is important only in programs that make heavy use of integer arithmetic. The 
overall Prolog support column is determined by using only the instructions from 
Table 1 (and nontagged versions of ldi and cmpi), omitting segment mapping and
all instructions in Tables 2 and 3. 
TABLE 7. Performance benefit of special architectural features.

                                     Feature performance benefit (%)
                  Fast tag  Double-word  Segment  Multicycle/  Tagged-    Tag overflow  All Prolog
Benchmark         logic     memory port  mapping  conditional  immediate  detect        support
----------------  --------  -----------  -------  -----------  ---------  ------------  ----------
boyer               16.1        6.4        13.9      11.0         1.3         5.6          64.9
chat_parser         12.2       18.6        10.8       9.6         8.8         0.0          66.1
meta_qsort          15.2       18.6        13.2      11.9         9.4         0.6          71.8
nand                 9.0       18.0         5.8       4.1         3.7         2.8          58.5
peep                 8.3       13.2         6.0       5.1         4.7         4.9          57.8
prover              12.5       17.6         9.2       7.7         8.7         0.0          67.6
simple_analyzer     13.5        9.6        13.5      11.6         4.7         4.9          62.7
average             12.4       14.6        10.4       8.7         5.9         2.7          64.2

This table gives the percent performance benefit for each special feature of the VLSI-BAM processor.
The last column lists the performance benefit of segment mapping and all instructions given in
Tables 2 and 3. All benchmarks are compiled with automatic mode generation, and cache effects are
not included.
[Figure 7: scatter plot. Horizontal axis: Chip Area (%), from 88 to 100. Vertical axis
ticks run from 1.0 to 1.7. Labeled points along the line, from upper left to lower right:
BASE, push/umin/umax, tag overflow, fast tag logic, multicycle/cond, double-word,
tagged-immediate, segment mapping, switch offset adders (VLSI-BAM).]
FIGURE 7. Benefit versus cost of architectural features.
Each of the benefits listed in Table 7 represents performance changes with respect
to the full VLSI-BAM architecture and instruction set. In practice, however, the 
benefit of a feature depends on what other features are also present. To study the 
interaction of the architectural features, the performance of every combination of 
features was determined. 
Figure 7 is a plot of the cost versus benefit of all 48 meaningful combinations 
of the special architectural features (except tag overflow detection; each of these
48 points assumes that tag overflow detection is supported). The upper left-hand
portion of the plot contains two additional points that represent the base case and 
the base case plus the instructions push, umin, and umax. The line is the lower half 
of the convex hull of the points, and represents one strategy for adding features to 
the general-purpose base. Each line segment connects the points so that the best 
benefit per cost is achieved as one moves from left to right. In this case, it turns out 
that each point on the line represents a set of features that adds one new feature 
to that of the point on its left. 
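The lower half of a convex hull can be computed with the standard monotone chain construction, sketched below. How the hull in Figure 7 was actually computed is not stated, so this is only one way to obtain such a line, and the sample points are hypothetical (cost, cycle count) pairs rather than data from the figure.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def lower_hull(points):
    """Lower convex hull of 2-D points, ordered from left to right."""
    hull = []
    for p in sorted(set(points)):
        # Drop points that would make the chain turn clockwise or go straight.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

# Hypothetical (cost, relative cycle count) configurations.
sample = [(0, 30), (10, 10), (20, 20), (30, 5), (40, 0)]
frontier = lower_hull(sample)
```

Each segment of the resulting chain has a steeper (or equal) descent than the one after it, which is the sense in which moving left to right always takes the best remaining benefit per unit cost.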
Starting from the base architecture, we first add the three instructions (push, 
umin, umax) that require no special architectural support, but which were not in- 
cluded in our base instruction set. Then tag overflow hardware is the next 
best feature to add since the amount of hardware is negligible. It is interest- 
ing to note that the SPARC architecture also added tag overflow support as its 
one addition for supporting tagged languages. The remaining features, moving 
left to right, are fast tag logic, multicycle and conditional instructions, double-word
memory port, tagged immediates, segment mapping, and finally switch offset
adders. 
To summarize, the specialized support added for Prolog does not require un- 
reasonable amounts of chip space or hand layout (13% active area for all Prolog 
related features), and it provides a performance benefit of 60-70%. 
5.2.1. Comparison with Korsloot and Mulder. Korsloot and Mulder [17] present 
a study of the benefits of special architectural support for Prolog. They conclude 
that support for tags, auto-increment/decrement addressing, and special choice 
point buffers can provide a 20-25% performance advantage. This is a smaller 
advantage than given in Table 7 (64% for all Prolog support), so it is important 
to investigate the source of the discrepancy. In this subsection, we modify the 
VLSI-BAM benefit analysis to match as closely as possible the Korsloot and Mulder
study. The primary source of the discrepancy is tag placement. A secondary
contribution to the discrepancy arises from the benefit of specialized VLSI-BAM 
instructions that are not considered in the Korsloot and Mulder study. 
One of the main differences between the assumptions of the two studies is the 
placement of the tag bits in the 32-bit word. The VLSI-BAM places the tags in
the most significant bits of the word, whereas Korsloot and Mulder assume tags in
the least significant bits. Korsloot and Mulder break tag support down into three 
categories: tag masking (to use tagged data as a memory address), tag insertion 
(to create tagged data), and tag extraction (primarily to branch based on the tag
value). When the tag bits are in the least significant bits, tag masking can be 
done using displacement addressing. Tag insertion is simply the addition of a small 
constant (for immediates, this can be done at compile time). 
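Why low-bit tags need no masking hardware can be seen in a small model. The two-bit tag width, the word-aligned byte addresses, the tag value `LIST`, and the dictionary used as memory are all assumptions of this sketch; the tag layout used by Korsloot and Mulder is not given here.

```python
TAG_BITS = 2                 # assumption: word-aligned byte addresses leave two free low bits
TAG_MASK = (1 << TAG_BITS) - 1
LIST = 0x1                   # hypothetical tag value for list pointers

def insert_tag(address, tag):
    """Tag insertion is just the addition of a small constant."""
    assert address & TAG_MASK == 0, "address must be word aligned"
    return address + tag

def load(memory, tagged_pointer, known_tag, displacement=0):
    """Tag masking folded into displacement addressing: use (displacement - tag)."""
    return memory[tagged_pointer + (displacement - known_tag)]
```

When the compiler knows the tag of a pointer, it folds `-tag` into the displacement constant at compile time, so accessing the head and tail of a list cell costs one load each with no separate mask instruction.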
These differences result in a different conclusion about the performance benefits 
for tag masking and tag insertion hardware support. Table 8 gives the cycle count 
reduction resulting from architectural support for tag masking and tag insertion. 
Both of these results are relative to our base architectures. The results show a 
significant difference in the benefit of hardware support for tag masking and tag 
insertion. The benefit is small when the tag bits are in the least significant bits (2% 
for masking and 2% for insertion), but more sizable when the tags are in the most 
significant bits (4 and 12%, respectively). Tag placement in the least significant 
bits shows less benefit due to specialized architectural support; therefore, this tag 
placement has better performance when there is no specialized support. 
When there is no architectural support, the essential advantage of placing the tag
in the lower bits is that tag insertion and tag masking can be done as the addition 
TABLE 8. Tag support comparison with Korsloot and Mulder.

                             Cycle count ratio
                            Tag support   Tag support
Benchmark           Base    (masking)     (insertion)
------------------  -----   -----------   -----------
boyer               1.000     0.963         0.897
chat_parser         1.000     0.961         0.894
meta_qsort          1.000     0.955         0.880
nand                1.000     0.971         0.897
peep                1.000     0.974         0.893
prover              1.000     0.959         0.879
simple_analyzer     1.000     0.964         0.891
average             1.000     0.964         0.890
KM boyer            1.00      0.98          0.98
KM chat_parser      1.00      0.98          0.98

This table compares our architectural studies with the results of Korsloot and Mulder [17]. The last
two rows of data are from [17]. The top rows are data collected with our tools. The column labeled
"Base" represents the VLSI-BAM general-purpose base. The following instruction additions to this base
are made for each column: "Tag support (masking)," segment mapping; and "Tag support (insertion),"
ldi, cmpi, and lea.
of a small constant that can be combined with immediates and displacement constants.
When the tag is in the most significant bits of the word, then combining
the constant for the tag operation with the offset requires an immediate with bit 
fields specified at both ends of the word--not an immediate format that is usually 
supported. 
Overflow detection during tagged arithmetic is simpler with the tags in the low 
bits because the value of the tagged integer lies in the upper part of the word, and 
an overflow of this value is detected by the overflow hardware already present for 
32-bit arithmetic. This advantage is not of the greatest importance since the time 
required for overflow detection even in the case of most significant bit tags with no 
hardware support is modest (2.6% from Table 7). 
The proper choice for tag placement appears to depend strongly on whether 
architectural support will be provided for tag operations. Obviously if a Prolog 
system is to be implemented on hardware that includes no support for tags, then 
there are compelling reasons to place the bits in the least significant bits. But once 
one is willing to commit hardware for tag support, the choice becomes more difficult. 
Tags in the most significant bits can be of uniform width, simplifying the branch 
on tag support. For tags in the least significant bits, the advantage of combining 
tag constants with immediates and offsets becomes less significant because formats 
for immediates can be changed to include bits from both the high and low ends 
of the word. Also, the amount of hardware required to support tagged-arithmetic 
overflow detection is minimal in either case. 
Returning to the comparison of our analysis with that of Korsloot and Mulder, we 
feel that the best way to compare architectural support, other than for tag masking 
and tag insertion, is to assume that our base architectures already include support 
for tag masking and tag insertion. Although this is not exactly what the results of 
Korsloot represent, their numbers should be very close since the incremental effect
of tag masking and tag insertion is only 1%. 5 Specifically, the base architecture that 
we assume consists of the VLSI-BAM base plus segment mapping and instructions 
needed to support tag masking and insertion. This configuration is labeled "KM 
base" in Table 9. 6 Using this base, we present results in Table 9 for the advantages 
of adding architectural features to support tag extraction, address modification, 
and choice point support. 
For tag extraction, Korsloot adds a branch on tag instruction. This is comparable 
to the btgeq and btgne instructions in the VLSI-BAM. Adding btg instructions to 
the KM base results in similar cycle count reduction, as observed by Korsloot and 
Mulder. 
Address modification consists of adding auto-increment and auto-decrement
addressing modes. This is the same as adding the VLSI-BAM push and pop
instructions (excluding the double-word versions). In this case, we observed less reduction
in cycle count, apparently because our instruction reorderer is able to combine 
multiple addi instructions arising from macro expanding push instructions (push 
is replaced with st and addi). Specifically, for boyer and chat_parser, the increase 
in dynamic addi count (resulting from disallowing auto address mode) is 3.3 times 
5 The benefit of support for tag masking, insertion, and extraction is just 1% more than the
benefit of support for only tag extraction.
6 We also include tagged arithmetic overflow support in the KM base because it was not included
in the Korsloot study.
TABLE 9. Comparison with Korsloot and Mulder.

                                         Cycle count ratio
                   KM      Tag support   Address        Choice point
Benchmark          base    (extraction)  modification   support        SRISC   VLSI-BAM
-----------------  -----   ------------  ------------   ------------   -----   --------
boyer              1.000     0.912         0.989          0.958        0.853    0.735
chat_parser        1.000     0.945         0.972          0.879        0.792    0.704
meta_qsort         1.000     0.929         0.983          0.888        0.791    0.701
nand               1.000     0.916         0.985          0.876        0.772    0.738
peep               1.000     0.924         0.986          0.904        0.813    0.759
prover             1.000     0.910         0.983          0.885        0.769    0.708
simple_analyzer    1.000     0.928         0.994          0.938        0.852    0.747
average            1.000     0.923         0.985          0.904        0.805    0.727
KM boyer           1.00      0.90          0.96           0.94         0.81     --
KM chat_parser     1.00      0.94          0.90           0.93         0.76     --

This table compares our architectural studies with the results of Korsloot and Mulder [17]. The last
two rows of data are from [17]. The top rows are data collected with our tools. The column labeled "KM
base" represents the VLSI-BAM general-purpose base plus segment mapping, ldi, cmpi, lea, and 28-bit
arithmetic. The following instruction additions to this base are made for each column: "Tag support,"
btgeq and btgne; "Address modification," push and pop; "Choice point support," ldd, std, and stdc;
"SRISC," btgeq, btgne, push, pop, ldd, std, stdc, pushd, and pushdc; and "VLSI-BAM," SRISC plus all
the remaining VLSI-BAM instructions.
less than the dynamic count of push and pop. This ratio roughly matches the ratio
of the reductions in cycle count: (1 - 0.96)/(1 - 0.989) = 3.6 for boyer.
Korsloot and Mulder provide choice point support by adding a shadow buffer 
that allows quick transfer between the buffer and the register file. The VLSI-BAM
design supports choice points by relying on compiler optimization and the use
of double-word loads and stores. The "Choice point support" column in Table 9 
shows that double-word loads and stores provide a slightly better performance 
improvement (3%) over a specialized choice point buffer. This is because double-word
memory operations can also be used for creating procedure environments and
accessing adjacent arguments of lists and structures. 
When the support for tags, address modification, and choice points are all com- 
bined (the "SRISC" column), the total benefit is a cycle count reduction by 14-23%. 
For boyer and chat_parser, we find a slightly smaller reduction (3-5%) than Korsloot
and Mulder. 
The additional VLSI-BAM instructions beyond "SRISC" (ldx, sti, stid, pusht, 
umin, umax, dref, uni, swb, and swt) provide an additional 3-10% performance 
advantage. Section 5.3 provides more detail on which instructions contribute the 
most to this improvement. 
Note that the total benefits for special-purpose Prolog support found here average
to 37.6% (1/0.727 = 1.376) compared to 64.2% (Table 7). This difference, as we
discussed earlier, is due to the effects of tag placement on the apparent architectural 
advantage of tag masking and tag insertion. 7 
Two primary conclusions can be drawn from the results of the comparison with 
Korsloot and Mulder. First, if there is no architectural support for tag operations, 
placing the tags in the least significant bits gives better performance. With archi- 
tectural support, however, there is no performance advantage to either tag position. 
The choice is primarily dictated by aesthetics or historical preference. Second, use
of double-word load and store instructions gives as much or more performance
improvement than specialized choice point buffers.
Footnote 7: 1.642/1.376 = 1.20 = 1.04 × 1.12 × 1.03, where the three factors are for tag masking, tag
insertion, and tag overflow detection.
5.2.2. Effect of Compiler Optimizations. Andrew Taylor has reported perfor- 
mance results for an optimizing Prolog compiler (PARMA) targeted for the MIPS 
processor [28]. His work raises the question of whether specialized hardware can be 
completely replaced with compiler optimization. The combination of PARMA and 
the MIPS R3000 is roughly equivalent in speed with the combination of Aquarius 
and the VLSI-BAM (after adjusting for differences in clock rates). This implies 
that the PARMA compiler is able to find more performance-enhancing optimiza- 
tions than the Aquarius compiler. But special-purpose architectural features could 
also speed up the PARMA/MIPS combination. Most likely, however, the perfor- 
mance advantages of special features would be reduced. 
In this section, we attempt to estimate the usefulness of our special architectural 
features given an improved compiler. A version of the PARMA compiler which 
generates VLSI-BAM code is not available, so to approximate the quality of code 
given by such a compiler, we modify the assembly code produced by the Aquarius 
compiler based on an analysis of the execution trace of each benchmark. This 
analysis determines which operations, at least for the specific inputs used, are not 
productive for achieving the solution. Because the analysis is based only on a 
limited set of the possible inputs, the resulting optimizations are "unsafe" as far as 
producing correct code is concerned, but the analysis allows us to approximate the 
code produced by an extremely good compiler. 
Analysis of the trace determines which operations are not taking one closer to 
the solution. For example, if a particular branch is never taken, then that branch 
can be removed. If the new value written to a register is always identical to the 
old value it replaces, then that register update is unnecessary (the same holds true 
for memory locations). A specific case of this is when a dereference instruction 
does not result in a new value being loaded into the register--such a dereference 
instruction would be removed by our hypothetical "optimizing compiler." Also, 
register and memory liveness can be computed from the trace, and if a memory 
location (or register) is not read between the time of two writes, then the first write 
is unnecessary. 
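The trace checks described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the instrumented simulator used for the measurements: the trace format (a list of `(pc, op, loc, value)` tuples) and the function name `analyze_trace` are hypothetical. An instruction is reported as redundant if none of its writes ever changed the destination, and as dead if every value it wrote was overwritten before being read.

```python
from collections import defaultdict

def analyze_trace(trace):
    """Classify writing instructions from an execution trace.

    trace: iterable of (pc, op, loc, value), with op either "read" or "write".
    Returns (redundant, dead), two sets of instruction addresses (pc values).
    """
    current = {}                  # loc -> value currently held
    pending = {}                  # loc -> pc of the latest write not yet read
    changed = defaultdict(bool)   # pc -> some write changed its destination
    used = defaultdict(bool)      # pc -> some written value was later read
    writers = set()

    for pc, op, loc, value in trace:
        if op == "read":
            if loc in pending:
                used[pending.pop(loc)] = True
        else:
            writers.add(pc)
            if loc not in current or current[loc] != value:
                changed[pc] = True
            current[loc] = value
            # An earlier unread write to loc is simply replaced here,
            # so that earlier write instance is never marked as used.
            pending[loc] = pc

    # Values still unread when the trace ends are conservatively kept.
    for pc in pending.values():
        used[pc] = True

    redundant = {pc for pc in writers if not changed[pc]}
    dead = {pc for pc in writers if not used[pc]}
    return redundant, dead
```

As the text notes, such results hold only for the inputs that produced the trace, so removing the reported instructions is "unsafe" in general.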
Unfortunately, our analysis does not approximate some important optimizations 
which, if implemented, could possibly have a large effect on the performance benefit 
of specialized architectural features. For example, we do not take into account the 
effects of global register allocation, determinism extraction, multiple specialization, 
and compile-time garbage collection [32]. Consequently, our results do not bound 
the improvements possible with compiler optimizations. 
We performed the following steps to create optimized code. First, the program 
was executed and a trace of procedure calls was obtained. This trace contains 
information on the types of each of the arguments. Using the data types of the 
arguments, mode declarations were added to the program to supplement the com- 
piler's static analysis. Then an instrumented instruction-level simulator was used to 
gather information on register (and memory) writes that did not change the value 
of the write's destination. Instructions that never place a new value in the desti- 
nation of the write were eliminated. Next, all branch instructions that use the fast 
tag logic (btgeq, btgne, swt, and swb) were analyzed to determine if one direction 
was always taken. If the branch was always taken, it was replaced with a jmp. If the
branch was never taken, it was removed. Also, register (and memory location) liveness
was determined. If an instruction always wrote a value that was never used, the
instruction was removed.
TABLE 10. Performance benefit of special architectural features assuming an
improved compiler.
Feature performance benefit (%)
Benchmark    Fast tag   Double-word   Segment   Multicycle    Tagged-     Tag overflow   All Prolog
             logic      memory port   mapping   conditional   immediate   detect         support
meta_qsort   4.0        20.9          8.3       8.4           11.2        0.8            57.1
nand         2.5        20.4          5.9       4.0           3.1         3.3            48.8
This table gives the percent performance benefit for each special feature of the VLSI-BAM processor
after optimization of the compiled code to simulate the effect of an improved compiler. The last column
lists the performance benefit of segment mapping and all instructions given in Tables 2 and 3. Cache
effects are not included.
Table 10 lists the performance benefit of the architectural features assuming an 
improved compiler. This table should be compared with Table 7. We see that 
the fast tag logic becomes much less important (it drops from 15% to about 3% 
performance benefit). This arises mostly from the near elimination of the deref- 
erence operation (see Tables 12 and 13). All of the other architectural features 
have roughly the same benefit as before. Thus, the primary architectural needs for 
Prolog, using the improved compiler, are high memory bandwidth (multiword loads 
and stores) and fast tag masking and tag insertion. For the VLSI-BAM, these tag 
operations are done using segment mapping and tagged-immediate formats. If the 
architecture is byte-addressed and the implementation places the tags in the least- 
significant bits, then these tag operations can be performed using displacement 
addressing. The need for multiword memory loads and stores, however, remains. 
Table 11 shows that specialized hardware support is still significant (31% for
nand and 34% for meta_qsort) with the improved compiler. In Section 5.2.1, we saw
that the KM base roughly corresponds to a general-purpose base with a Prolog
implementation that places tags in the lower bits. With the Aquarius compiler,
specialized hardware of the VLSI-BAM provides slightly more speed-up (36 and
43%, respectively). The reason that the speed-up due to specialized hardware
changes only slightly as compiler optimizations improve is that, in spite of the
extra optimizations, Prolog still requires significant bandwidth to and from memory.
Also, branch-on-tag support will always be useful due to the imperfect knowledge
at compile time (for example, we do not know the data types of some inputs until
run time).
TABLE 11. Architectural benefits versus compiler benefits.
Performance benefit (%)
Benchmark    KM base   Compiler optimizations
meta_qsort   33.8      25.9
nand         31.1      21.4
This table compares the performance benefits of architectural features and those due to additional
compiler optimization. The column labeled "KM base" is the performance difference between the KM
base instruction set (Table 9) and the VLSI-BAM instruction set, both using the "improved" compiler
optimizations. The "Compiler optimizations" column gives the performance difference of the "improved"
compiler and the Aquarius compiler with static mode analysis, both using the VLSI-BAM instruction
set. Cache effects are not included.
The performance benefit of the improved compiler over the Aquarius compiler is 
21 and 26% (Table 11). Using data from Taylor's thesis [28], PARMA's improvement 
over Aquarius could be as much as 40% for nand. The extra optimization capabilities
of PARMA over our hypothetical compiler point to the need for additional
benefit studies using a compiler as good as or better than PARMA.
5.3. Benefits of Individual Instructions 
Table 12 lists performance benefits of individual instructions or instruction groups.
Significant (greater than 1%) performance benefit is obtained from a majority of the
special-purpose instructions (dref, umin/umax, btgeq/ne, push/d/c, lea, and swt).
The multicycle pointer dereference instruction (dref) has an average execution time
of 1.6 cycles. Macro-expansion of dref into an explicit loop increases the average
dereference time to 2.2 cycles. Although the benefit of dref per dereference is only
0.6 cycle, the total performance benefit is significant because of its frequent use.
Some of the smaller benchmarks (not listed in the table), however, show no benefit
for dref due to the complete elimination of dereferencing by compiler optimization.
Unsigned maximum (umax) is used during environment and choice point creation.
Omission of umax causes the time to determine the top of stack to increase from one
to three cycles. Tagged-pointer creation (lea) is a frequent operation, and its omission
adds an extra cycle for tag insertion (using or). Elimination of auto-increment
addressing (push, pushd, pushdc) requires one extra cycle for each block allocation.
The three-way branch on tag (swt) can be replaced by two btgeq instructions,
adding an extra cycle to two of the branch directions. Elimination of the two-way
branch on tag (btgeq/ne) would require a two-instruction compare and branch.
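To make concrete what "macro-expansion of dref into an explicit loop" involves, the following sketch models a dereference loop in software. It assumes the usual WAM-style convention that an unbound variable is a variable-tagged word pointing to itself; the tag names, the `(tag, value)` word representation, and the function name are illustrative assumptions and do not reproduce the VLSI-BAM encoding.

```python
TVAR, TATM, TINT = "tvar", "tatm", "tint"   # hypothetical tag names

def deref(word, memory):
    """Follow variable bindings until reaching a non-variable or an unbound variable.

    word:   (tag, value) pair; for TVAR the value is an index into memory.
    memory: list of (tag, value) words.
    Returns the final word and the number of pointer links followed.
    """
    steps = 0
    tag, value = word
    while tag == TVAR:
        target = memory[value]
        if target == (tag, value):   # self-reference marks an unbound variable
            break
        tag, value = target
        steps += 1
    return (tag, value), steps

# A chain of two bound variables ending in an atom, plus one unbound variable.
memory = [(TVAR, 1), (TVAR, 2), (TATM, "foo"), (TVAR, 3)]
result, links = deref((TVAR, 0), memory)
```

Each loop iteration corresponds to a load, a tag test, and a branch, which is why folding the loop into one multicycle instruction saves a fraction of a cycle per dereference on average.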
Compared with our previously published results [15], the performance benefit of 
swt has dropped significantly. The difference is due to improvements in the compiler 
and instruction reorderer that allow btg instructions to replace swt in special cases. 
The remaining instructions have less than 1% average performance benefit. Be- 
cause the VLSI-PLM spends about 5% of its time trailing variable addresses, we 
included special support in the VLSI-BAM (pusht). However, due to the compiler's 
use of uninitialized variables, which do not have to be trailed, trailing time is re- 
duced to 1.4% in the VLSI-BAM. Omitting pusht causes a slow-down of 0.6%, 
which corresponds to trail time increasing from two to three cycles. Preliminary
analysis using macro-expanded WAM for the chat_parser benchmark indicated that
the benefit for pop would be 1.5%. Compiler optimization of trailing has reduced
this result. Similarly, compiler optimization reduces the number of general unifications,
minimizing the benefit of swb. Our initial studies also overestimated the
benefits of special support for unification of atoms (uni, sti, stid). Every time one
of the atom unification instructions is executed, one cycle is gained when compared
to not having the instruction. However, they simply do not have a high enough
dynamic execution frequency to make a significant impact on the performance.
Although pusht, swb, pop, uni, sti, and stid provide marginal performance benefit,
their implementation uses only features already required by other instructions.
TABLE 12. Performance benefit of individual instructions.
Instruction performance benefit (%)
Benchmark        dref  umin/umax  btgeq/btgne  push/pushd/c  lea   swt   pusht  swb   pop   sti/stid  uni
boyer            8.1   4.5        2.6          2.0           1.2   1.7   0.8    1.1   0.0   0.0       0.0
chat_parser      3.5   3.6        3.5          2.5           3.0   0.9   1.3    0.6   0.8   0.3       0.9
meta_qsort       5.3   5.4        5.7          3.2           3.2   0.5   0.8    0.5   0.2   0.2       0.1
nand             1.0   3.2        3.1          3.0           1.8   0.2   0.1    0.0   0.0   0.0       0.0
peep             1.9   3.2        2.6          2.0           2.0   1.1   0.3    0.7   0.1   0.1       0.0
prover           2.8   3.6        3.8          3.2           2.7   1.0   0.4    0.4   0.2   0.4       0.0
simple_analyzer  7.9   3.6        1.6          1.9           2.0   1.8   0.3    0.4   0.0   0.1       0.0
average          4.3   3.9        3.2          2.5           2.3   1.0   0.6    0.5   0.2   0.2       0.1
This table gives the percent performance benefit for each special instruction of the VLSI-BAM
processor. All benchmarks are compiled with automatic mode generation, and cache effects are not
included.
TABLE 13. Performance benefit of individual instructions assuming an
improved compiler.
Instruction performance benefit (%)
Benchmark        dref  umin/umax  btgeq/btgne  push/pushd/c  lea   swt   pusht  swb   pop   sti/stid  uni
meta_qsort       0.6   5.3        3.0          3.6           4.0   0.4   0.6    0.0   0.2   0.2       0.1
nand             0.3   3.7        2.2          3.4           2.2   0.1   0.1    0.0   0.0   0.0       0.0
This table gives the percent performance benefit for each special instruction of the VLSI-BAM
processor assuming improved compilation. Cache effects are not included.
An interesting conclusion about the number of directions needed in multiway 
branches can be made from these measurements. Multiway branches are imple- 
mented in the VLSI-BAM with the swt and swb instructions, which are both single- 
cycle three-way branches (Table 3). Swt is used for unification of compound terms, 
for which greater than a three-way branch is not needed (Table 4 and [31]). Swb is 
used for unification of terms whose types are unknown at compile time. It takes care 
of 70% of these cases (Table 5), which gives a 0.5% execution time improvement 
(Table 12). If some single-cycle branch took care of 100% of these cases, we cal- 
culate that the further improvement would be about 0.2%. Given the additional 
complexity that such a branch implies, we conclude that a multiway branch with 
more than three directions is not effective for Prolog. 
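The 0.2% figure can be reproduced with a back-of-the-envelope calculation, sketched below. It assumes that the benefit of a branch scales linearly with the fraction of cases it handles in a single cycle; that linearity is our simplifying assumption for illustration, and the function name is hypothetical.

```python
def extrapolated_benefit(measured_benefit_pct, covered_fraction):
    """Benefit if every case were covered, assuming benefit is proportional to coverage."""
    return measured_benefit_pct / covered_fraction

swb_benefit = 0.5      # percent execution time improvement (Table 12)
swb_coverage = 0.70    # fraction of cases handled by swb (Table 5)

full_benefit = extrapolated_benefit(swb_benefit, swb_coverage)
further_improvement = full_benefit - swb_benefit   # roughly 0.2 percent
```

Under this assumption a branch covering all cases would yield about 0.7% in total, so the additional gain over swb is about 0.2%.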
Table 13 gives the performance benefit of the special-purpose instructions assum- 
ing an improved compiler. As mentioned before, the benefit of the dref instruction 
is drastically reduced because most dereference instructions do no useful work and 
are eliminated. The benefit of the btgeq and btgne instructions is reduced approx- 
imately in half since a sizable number of these instructions always take the same 
branch direction. All of the remaining instructions remain at about the same level 
of benefit. Thus, a set of push instructions remains important, along with an un- 
signed maximum for stack pointer manipulation. 
5.4. Code Size Benefits 
Table 14 compares the static code sizes of the VLSI-BAM, the KCM [2], Aquarius 
Prolog (SPARC) [34], and SPUR [4] relative to the PLM, a microcoded
implementation of the WAM [9]. The VLSI-BAM and Aquarius SPARC code size
is calculated using prover, meta_qsort, simple_analyzer, and chat_parser. The KCM
code size is from [2]. The SPUR code size is from [4]. The sizes do not include
the symbol table or run-time library.
TABLE 14. Static code size ratios.
               KCM/PLM   VLSI-BAM/PLM   SPARC/PLM   SPUR/PLM
Bytes          3.0       3.1            6.5         14.1
Instructions   1.1       2.6            5.5         12.0
Macro-expansion of WAM code into SPUR instructions causes the large code size 
of the SPUR. Static code size for the VLSI-BAM is surprisingly small, only slightly 
larger than that of the KCM. Measurements of the code size resulting when just 
using the VLSI-BAM's general-purpose base instructions show that without the 
special architectural features for Prolog, the code size would be 2.1 times that of 
the VLSI-BAM. This is exactly the ratio observed between VLSI-BAM code and 
Aquarius Prolog SPARC code. Thus, the compactness of the VLSI-BAM code is 
due to the success of flow analysis in reducing code size (overcoming simple, but 
verbose macro-expansion) and the appropriateness of the VLSI-BAM instruction 
set for Prolog. 
Table 15 gives the reduction in code size provided by the addition of each in- 
struction (or instruction group). Past studies [4] have shown that code expansion 
of Prolog can be especially serious for simple instruction sets. One possible rule 
of thumb in cases where code size is important is that an instruction added for 
code size reduction should provide at least a 10% benefit. Using this criterion, only 
dref qualifies. One can also consider the combined benefit in terms of both perfor- 
mance and code size, and in this case, pusht can be added to the list of justified 
instructions.
The swt shows a slight negative code size benefit because its destination offset
field often overflows. For example, if the destination label_2 is too far away, then
    swt(r1, tvar=label_1, tstr=label_2).
    <delay slot instruction>
is replaced by
    swt(r1, tvar=label_1, tstr=skip).
    <delay slot instruction>
  skip:
    jmp(label_2).
    <another delay slot instruction>
But when swt is eliminated from the instruction set, it is replaced by two consecutive
btg instructions (see Figure 6). The btg instructions have a much greater offset
range, and so are not likely to overflow and need replacement. A swt instruction
that overflows for both destinations requires more code space than just two btg
instructions.
TABLE 15. Static code size benefit of individual instructions.
Instruction code size benefit (%)
Benchmark        dref  umin/umax  btgeq/btgne  push/pushd/c  lea   swt   pusht  swb   pop   sti/stid  uni
boyer            17.8  2.2        1.9          4.4           12.9  -0.1  3.8    0.9   0.0   0.1       1.9
chat_parser      28.6  3.8        4.5          3.5           4.6   -0.2  8.0    0.3   0.0   2.9       10.5
meta_qsort       20.9  3.8        2.7          4.5           13.5  0.0   2.6    0.5   0.0   0.3       1.3
nand             23.3  3.8        1.0          3.8           8.6   -0.1  0.7    0.0   0.0   1.1       0.0
peep             36.8  2.1        2.8          4.4           9.0   -0.4  5.8    1.6   0.0   0.3       2.6
prover           17.0  3.2        0.9          4.6           9.7   0.0   2.1    0.6   0.0   0.6       0.0
simple_analyzer  25.6  3.4        1.8          3.9           9.4   0.0   3.5    0.5   0.0   0.8       2.8
average          24.1  3.2        2.2          4.2           9.6   -0.1  3.8    0.6   0.0   0.9       2.7
This table gives the percent static code size benefit for each instruction of the VLSI-BAM processor.
All benchmarks are compiled with automatic mode generation, and the Prolog built-in procedures are
not included in the size.
5.5. Summarizing the Costs and Benefits
In this section, we have looked at special-purpose support for Prolog from several
perspectives. Here, we summarize our findings by giving our opinion on what hardware
support is best for Prolog, taking into account trends toward further compiler
improvement.
If we assume that the implementation places the tags in the lower bits of the 
word, then hardware support for tag masking and tag insertion is supplied by 
base plus displacement addressing and normal load immediate and add immediate 
instructions. 
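The following sketch illustrates why low-order tags need no dedicated masking hardware. It assumes a byte-addressed machine with 4-byte words and a 2-bit tag in the least significant bits; the tag value `TLST`, the helper names, and the dictionary standing in for memory are all hypothetical. Because the tag is a compile-time constant, it can be folded into the load displacement, and tag insertion is a single add immediate.

```python
WORD = 4        # bytes per word (assumed)
TAG_BITS = 2    # tag width in the low bits (assumed)
TLST = 0b01     # hypothetical tag value for list cells

def make_tagged(address, tag):
    """Tag insertion: one add immediate on a word-aligned address."""
    assert address % (1 << TAG_BITS) == 0, "address must be word aligned"
    return address + tag

def load(memory, base, displacement):
    """Model of a base plus displacement load."""
    return memory[base + displacement]

memory = {0x1000: "head", 0x1004: "tail"}
p = make_tagged(0x1000, TLST)

# The tag is removed by the displacement itself: (field offset - tag).
head = load(memory, p, 0 * WORD - TLST)
tail = load(memory, p, 1 * WORD - TLST)
```

No separate masking instruction is executed in either load; the cost of the tag is absorbed by the address calculation that the load performs anyway.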
The greatest need, given both current and future compilers, is adequate band- 
width to and from memory. This can be provided by multiple-word loads and 
stores. Fortunately, multiword memory loads and stores are now part of most high- 
performance microprocessor instruction sets. 
The most important specialized support for Prolog can be reduced to a fast 
(single-cycle) branch on tag (equality or nonequality comparison with an immedi- 
ate value). When the tags are placed in the low bits of the word, this usually forces 
the tags to be variable size. Unfortunately, this complicates a special branch on tag 
instruction. One solution is to support only the most common (or the two most 
common) tag placement and width. Less common tag formats must be checked us- 
ing a multiinstruction sequence of a mask operation followed by a normal compare 
and branch. 
The VLSI-BAM supports three-way branch-on-tag instructions, but the indications
are that they are currently at the margin of usefulness (swt provides only
a 1% performance advantage and swb is less), and future compilers will further
reduce this usefulness. Also against three-way branches is the need for additional
displacement adders and the restricted range of branch offsets.
The single most useful specialized instruction in the VLSI-BAM instruction set 
is dref. Not only does it provide a 4% performance benefit, but also a 24% re- 
duction in static code size. However, trace analysis shows that most dref instruc- 
tions could possibly be removed by a better compiler. Whether a dereference in- 
struction should be included in future Prolog instruction sets is an open question, 
and it depends on the effort of compilation that goes into the majority of code 
executed. To achieve a more interactive system, one may not spend much time 
on static analysis, and so a dereference instruction would be useful for reducing 
code size. 
Although the performance benefit is small, the cost of supporting a tag check 
on tagged arithmetic operations is so small (a few gates) that such support should 
be considered. It is not surprising, then, that the SPARC instruction set supports 
tagged add and subtract instructions. 
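A tag check on arithmetic amounts to very little logic, as the sketch below suggests. It models a 32-bit tagged add in the style of SPARC's tagged add, where the low two bits of each operand are treated as the tag and an integer has tag 00. This is not the VLSI-BAM format, which places tags in the high bits, and the function name and return convention are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF
SIGN32 = 0x80000000

def tagged_add(a, b):
    """Add two 32-bit tagged words; return (result, fault).

    fault is True if either operand has a nonzero tag in its low two bits,
    or if the addition overflows as a signed 32-bit operation.
    """
    tag_fault = ((a | b) & 0b11) != 0          # the "few gates": an OR of the tag bits
    result = (a + b) & MASK32
    # Signed overflow: operands share a sign that the result does not have.
    overflow = (~(a ^ b) & (a ^ result) & SIGN32) != 0
    return result, tag_fault or overflow

# Integers are stored shifted left by two bits, leaving tag 00.
ok_sum = tagged_add(3 << 2, 5 << 2)
bad_tag = tagged_add((3 << 2) | 1, 5 << 2)
too_big = tagged_add(0x7FFFFFFC, 1 << 2)
```

When the fault flag is clear, the result is already a correctly tagged integer, so the common case needs no separate type test before or after the add.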
There are two other instructions that improve Prolog performance that can 
be considered "general-purpose," but that are not present in many of the high- 
performance microprocessor instruction sets. Auto-increment addressing mode for 
stores (push instructions) speeds up stack operations and heap data creation. Management
of interleaved stacks can be done more efficiently with the unsigned maximum
operation. Both push and unsigned maximum maintain their performance
enhancement as compiler optimization improves. The utility of the unsigned maximum
instruction, however, remains to be verified for the case when the environment
and choice point stacks are not interleaved.
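As an illustration of the role of unsigned maximum, the sketch below assumes a WAM-style layout in which environments and choice points share a single stack growing toward higher addresses, so that the next free location lies above whichever frame is higher. The function names and the 32-bit word size are assumptions. The comparison must be unsigned because addresses are not signed quantities.

```python
def umax(a, b, bits=32):
    """Unsigned maximum of two machine words."""
    mask = (1 << bits) - 1
    a, b = a & mask, b & mask
    return a if a >= b else b

def as_signed(x, bits=32):
    """Reinterpret a machine word as a two's complement signed value."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x

def top_of_stack(env_top, choice_top):
    """Top of an interleaved environment and choice point stack."""
    return umax(env_top, choice_top)

low, high = 0x00000010, 0xFFFFFFF0
unsigned_choice = top_of_stack(low, high)             # picks the high address
signed_choice = max(as_signed(low), as_signed(high))  # would wrongly pick the low one
```

Without such an instruction, the same result takes a compare, a conditional branch, and a move, which is consistent with the increase from one to three cycles reported in Section 5.3.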
In summary, future high-performance instruction sets for Prolog should include 
the following additions to a general-purpose base: multiword memory loads and 
stores, a single-cycle branch on tag, tagged arithmetic support, push instructions, 
and (possibly) unsigned maximum. 
6. PERFORMANCE RESULTS 
Table 16 compares the performance of the VLSI-BAM processor to that of two 
other Prolog systems. A more complete comparison of the performance of various 
Prolog systems can be found in [34, 32]. The results for VLSI-BAM are simulated
assuming a 20 MHz clock and include overhead due to cache misses [6]. A
clock speed of 20 MHz is used because it is the speed at which several VLSI-BAM
chips successfully executed all the benchmark programs. The simulated system
has 128 Kbyte instruction and data caches. The caches are direct mapped and
use a write-back policy. They are run in warm start, that is, each benchmark is
run twice and the results of the first run are ignored. Cache effects are significant
only for the last six programs in Table 16. The cache overhead is greatest for
simple_analyzer and poly_10; for these programs the overhead ranges from 11 to 38%.
For meta_qsort and chat_parser, the overhead is less than 3%.
TABLE 16. Performance results.
Benchmark         KCM             Aquarius SPARC   Aquarius VLSI-BAM
log10             0.039 (1.75)    --               0.0223 (1.00)
ops8              0.059 (2.09)    --               0.0282 (1.00)
times10           0.082 (2.05)    --               0.0400 (1.00)
divide10          0.091 (2.02)    --               0.0450 (1.00)
nreverse          0.65 (3.22)     0.440 (2.18)     0.202 (1.00)
qsort             1.32 (4.49)     0.520 (1.77)     0.294 (1.00)
serialise         1.22 (1.73)     1.64 (2.33)      0.704 (1.00)
query             12.6 (2.20)     15.1 (2.64)      5.73 (1.00)
mu                --              2.86 (2.36)      1.21 (1.00)
prover            --              3.20 (2.32)      1.38 (1.00)
queens_8          --              3.20 (1.88)      1.70 (1.00)
meta_qsort        --              17.2 (2.44)      7.06 (1.00)
nand              --              57.0 (2.32)      24.6 (1.00)
simple_analyzer   --              85.5 (1.71)      50.1 (1.00)
poly_10           --              112 (2.11)       53.2 (1.00)
chat_parser       --              550 (2.75)       200 (1.00)
boyer             --              3100 (1.45)      2140 (1.00)
geometric mean    (2.32)          (2.14)           (1.00)
This table compares the performance of VLSI-BAM with that of two other Prolog implementations:
the KCM and Aquarius Prolog running on a SPARCstation 1+ (25 MHz). Each result is presented as a
time in milliseconds followed in parentheses by the ratio to the VLSI-BAM time. The KCM results [2]
are derived from actual measurements of a system running at 12 MHz. The VLSI-BAM results are
simulated assuming a 20 MHz clock and 128 Kbyte instruction and data caches [6]. Results involving the
Aquarius compiler (both SPARC and VLSI-BAM) use automatic mode generation. The programs used
in this table and elsewhere in this paper include the well-known Warren benchmarks (the first eight in
the table), of which query is modified to use integer division in place of the original floating point; mu,
which proves a theorem of Hofstadter's "mu-math"; prover, a simple theorem prover; queens_8, which
solves the eight queens problem using an incremental generate-and-test strategy; meta_qsort, a
meta-interpreter running Warren's qsort; nand, a logic synthesis program using branch-and-bound search;
simple_analyzer, a flow analyzer analyzing Warren's qsort; poly_10, which symbolically raises a
polynomial to the tenth power; chat_parser, which parses a set of English sentences; boyer, an extract from
a Boyer-Moore theorem prover; and peep, the VLSI-BAM peephole optimizer processing meta_qsort.
Further information about many of the benchmarks may be found in [12].
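The geometric means in the last row of Table 16 can be recomputed from the per-benchmark ratios. The sketch below does so for the two comparison systems; the list names are ours, and the ratios are copied from the parenthesized values in the table.

```python
import math

def geometric_mean(ratios):
    """Geometric mean computed through logarithms."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Ratios to the VLSI-BAM time, taken from Table 16.
kcm_ratios = [1.75, 2.09, 2.05, 2.02, 3.22, 4.49, 1.73, 2.20]
sparc_ratios = [2.18, 1.77, 2.33, 2.64, 2.36, 2.32, 1.88,
                2.44, 2.32, 1.71, 2.11, 2.75, 1.45]

kcm_mean = geometric_mean(kcm_ratios)       # close to the tabulated 2.32
sparc_mean = geometric_mean(sparc_ratios)   # close to the tabulated 2.14
```

Note that the two means are taken over different benchmark subsets (eight programs for the KCM, thirteen for the SPARC), so they are not directly comparable with each other.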
The KCM [2], one of the fastest WAM implementations, has a relatively large 
amount of specialized hardware to execute a WAM-like instruction set efficiently, 
whereas the VLSI-BAM processor uses modest hardware to support an optimizing 
compiler. We find that the speed advantage of the VLSI-BAM over the KCM is 
equal to or greater than the cycle time ratio. 
Although the same compiler is being used for both the SPARC and VLSI-BAM 
machines, the VLSI-BAM outperforms the SPARC for several reasons. First, there 
is the improvement due to specialized hardware support (this accounts for between 
30 and 40%), but the majority of the difference in performance is due to the SPARCstation
1+'s relatively slow load and store instructions. Because Prolog programs
heavily use memory loads and stores, a slow memory system will have a dramatic 
effect on the performance. 
A common measure of Prolog speed is logical inferences per second (LIPS). 
In general, this quantity is ambiguous; however, it is well defined for the naive 
reverse benchmark. The VLSI-BAM processor correctly executes naive reverse at 
27.5 MHz (see footnote 8), giving a measured performance of 3.37 million LIPS.
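As a rough consistency check, the LIPS figure can be related to the nreverse entry in Table 16. The sketch assumes that the benchmark is the conventional naive reverse of a 30-element list, which is customarily counted as 496 logical inferences; that count and the variable names are our assumptions, not figures stated in this paper.

```python
NREV30_INFERENCES = 496   # assumed: customary count for naive reverse of 30 elements

def lips(inferences, seconds):
    """Logical inferences per second."""
    return inferences / seconds

# Simulated VLSI-BAM time for nreverse at 20 MHz (Table 16): 0.202 ms.
lips_at_20mhz = lips(NREV30_INFERENCES, 0.202e-3)

# Scale linearly with clock rate to the 27.5 MHz at which the chip was run.
lips_at_27_5mhz = lips_at_20mhz * 27.5 / 20.0

# Average clock cycles per logical inference implied by the measured figure.
cycles_per_inference = 27.5e6 / 3.37e6
```

Under these assumptions the scaled simulation result is about 3.38 million LIPS, close to the measured 3.37 million, which corresponds to roughly eight cycles per logical inference.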
7. CONCLUSIONS
The primary goal of our research has been to determine a minimal set of extensions
to a general-purpose architecture necessary for achieving high-performance logic
programming. At the same time, however, performance of the general-purpose
architecture has not been compromised. When tags are placed in the most significant
end of the word, we have identified tagged-immediate support, segment
mapping, double-word memory bus, special logic for fast branch on tag, and multicycle
instruction support as important Prolog-specific features. This hardware
support gives a 64% performance benefit and costs 13% of the VLSI-BAM chip 
area. When tags are placed in the least significant end of the word, then double- 
word memory bus and fast branch on tag logic continue to be very important. In 
this case, specialized hardware support gives a 38% performance benefit. Even 
when we use an improved compiler, specialized instruction set support gives about 
a 32% performance benefit. 
Our special instructions for trailing and unification of atoms are of marginal per- 
formance benefit. We conclude that branches with three or more directions are not 
effective for Prolog, especially as compilers improve. Our measurements, however, 
justify the utility of multiword memory loads and stores, fast branch-on-tag in- 
structions, push instructions, and tagged arithmetic. Such instruction set support 
would not only improve Prolog performance, but would also be advantageous to
the implementation of other dynamically typed languages.
Footnote 8: The instruction sequences which limit the clock speed to 20 MHz do not occur in naive reverse.
We would like to thank the following people for their invaluable contributions, without which 
the VLSI-BAM would not have become a reality. Joan Pendleton designed the VLSI-BAM's mi- 
croarchitecture and VLSI circuit design and layout. Charlie Burns provided the VLSI-CAD tools 
used in the design of the VLSI-BAM processor. William Bush, Ralph Haygood, Joan Pendle- 
ton, and Tep Dobry made substantial contributions to the instruction set design of the VLSI- 
BAM. Ralph Haygood authored the Aquarius Prolog run-time library and VLSI-BAM assembler. 
Georges Smine and Ketan Bhat helped with the design and simulation of the VLSI-BAM cache 
board. 
We also acknowledge the numerous discussions with the other members of the Aquarius project, 
especially Tom Getzinger, Ashok Singhal, Hervé Touati, and Vason Srini. Special thanks to Jim 
Testa for encouraging us to start a new Prolog chip design. 
We wish to thank Zycad Corporation for the use of their N.2 hardware simulation tools that 
simplified the task of simulating the microarchitecture. Equipment and other support for the 
project was provided by SUN, DEC, and TRW. 
REFERENCES 
1. Beer, J., The Occur-Check Problem Revisited, J. Logic Programming 5(3):243-261 
(Sept. 1988). 
2. Benker, H., Beacco, J. M., Bescos, S., Dorochevsky, M., Jeffre, Th., Pohlmann, A.,
Noyé, J., Poterie, B., Sexton, A., Syre, J. C., Thibault, O., and Watzlawik, G., KCM:
A Knowledge Crunching Machine, in: Proc. 16th Annual Int. Symp. on Computer
Architecture, May 1989, pp. 186-194.
3. Bitar, P. and Despain, A. M., Multiprocessor Cache Synchronization, Issues, Innovations,
Evolution, in: Proc. 13th Annual Int. Symp. on Computer Architecture, June
1986, pp. 424-433.
4. Borriello, G., Cherenson, A. R., Danzig, P. B., and Nelson, M. N., RISCs vs. CISCs 
for Prolog: A Case Study, in: Proc. 2nd Symp. on Architectural Support for Pro- 
gramming Languages and Operating Systems (ASPLOS II), Oct. 1987, pp. 136- 
145. 
5. Carlson, R., The Bottom-Up Design of a Prolog Architecture, Technical Report 
UCB/CSD 89/536, University of California, Berkeley, May 1989. 
6. Carlton, M., Pendleton, J., Holmer, B. K., Sano, B., and Despain, A. M., Cache and 
Multiprocessor Support in the BAM Microprocessor, in: Proc. 4th Annual Symp. on 
Parallel Processing, Apr. 1990. 
7. Chang, A. and Mergen, M. F., 801 Storage: Architecture and Programming, ACM 
Trans. Computer Systems 6(1):28-50 (Feb. 1988). 
8. Debray, S. K. and Warren, D. S., Automatic Mode Inference for Prolog Programs, in: 
Proc. 1986 Symp. on Logic Programming, Sept. 1986, pp. 78-88. 
9. Dobry, T. P., A High Performance Architecture for Prolog, Kluwer Academic, 
1990. 
10. Dorochevsky, M., Noyé, J., and Thibault, O., Has Dedicated Hardware for Prolog
a Future?, in: Proc. Processing Declarative Knowledge (PDK'91), 1991, pp. 17-31.
11. Harsat, A. and Ginosar, R., CARMEL-2: A Second Generation VLSI Architecture for 
Flat Concurrent Prolog, in: Proc. Int. Conf. on Fifth Generation Computer Systems, 
Dec. 1988, pp. 962-969. 
12. Haygood, R., A Prolog Benchmark Suite for Aquarius, Technical Report UCB/CSD 
89/509, Computer Science Division, University of California, Berkeley, Apr. 
1989. 
13. Hennessy, J., Jouppi, N., Przybylski, S., Rowen, C., and Gross, T., Design of a High 
Performance VLSI Processor, in: Proc. 3rd Caltech Conf. on VLSI, 1983, pp. 33- 
54. 
14. Holmer, B. K. and Despain, A. M., Viewing Instruction Set Design as an Optimization
Problem, in: Proc. 24th Annual Workshop on Microprogramming and Microarchitecture
(MICRO-24), 1991, pp. 153-162.
15. Holmer, B. K., Sano, B., Carlton, M., Van Roy, P., Haygood, R., Bush, W. R., 
Despain, A. M., Pendleton, J. M., and Dobry, T., Fast Prolog with an Extended 
General Purpose Architecture, in: Proc. 17th Annual Int. Symp. on Computer Archi- 
tecture, 1990, pp. 282-291. 
16. Katevenis, M. G. H., Reduced Instruction Set Computer Architectures for VLSI, MIT 
Press, 1985. 
17. Korsloot, M. and Mulder, H., Sequential Architecture Models for Prolog: A Perfor- 
mance Comparison, New Generation Computing 9:201-209 (1991). 
18. Mills, J. W., LIBRA: A High-Performance Balanced Computer Architecture for Pro- 
log, Ph.D. thesis, Arizona State University, Dec. 1988. 
19. Mills, J. W., A High-Performance LOW RISC Machine for Logic Programming, J.
Logic Programming 6(1&2):179-212 (Jan./Mar. 1989). 
20. Nakashima, H. and Nakajima, K., Hardware Architecture of the Sequential Inference 
Machine: PSI-II, in: Proc. 1987 Symp. on Logic Programming, Aug. 1987, pp. 104-113. 
21. Pendleton, J., Kong, S., Brown, E., Dunlap, F., Marino, C., Ungar, D., Patterson, D., 
and Hodges, D., A 32-bit Microprocessor for Smalltalk, IEEE J. Solid State Circuits 
SC-21(5):741-749 (Oct. 1986). 
22. Radin, G., The 801 Minicomputer, in: Proc. Symp. on Architectural Support for 
Programming Languages and Operating Systems (ASPLOS), Mar. 1982, pp. 39-47. 
23. Sano, B., Performance vs. Cost of the BAM, Technical Report TR-89-01, Advanced 
Computer Architecture Laboratory, University of Southern California, Dec. 1989. 
24. Seo, K. and Yokota, T., Design and Fabrication of Pegasus Prolog Processor, in: Proc. 
VLSI 89, 1989. 
25. Srini, V. P., Tam, J. V., Nguyen, T. M., Patt, Y. N., Despain, A. M., Moll, M., and 
Ellsworth, D., A CMOS Chip for Prolog, in: Proc. Int. Conf. on Computer Design, 
Oct. 1987, pp. 605-610. 
26. Steenkiste, P. and Hennessy, J., Tags and Type Checking in LISP: Hardware and 
Software Approaches, in: Proc. 2nd Symp. on Architectural Support for Programming 
Languages and Operating Systems (ASPLOS II), Oct. 1987, pp. 50-59. 
27. Sterling, L. and Shapiro, E., The Art of Prolog, MIT Press, 1986. 
28. Taylor, A., High Performance Prolog Implementation, Ph.D. thesis, University of 
Sydney, June 1991. 
29. Touati, H. and Despain, A., An Empirical Study of the Warren Abstract Machine, in: 
Proc. 1987 Symp. on Logic Programming, Aug. 1987, pp. 114-124. 
30. Ungar, D. M., The Design and Evaluation of a High Performance Smalltalk System, 
MIT Press, 1987. 
31. Van Roy, P., An Intermediate Language to Support Prolog's Unification, in: E. Lusk 
and R. A. Overbeek (eds.), Proc. North American Conf. on Logic Programming, MIT 
Press, Oct. 1989, pp. 1148-1164. 
32. Van Roy, P., 1983-1993: The Wonder Years of Sequential Prolog Implementation, 
J. Logic Programming 19/20:385-441 (May/July 1994). 
33. Van Roy, P., Demoen, B., and Willems, Y. D., Improving the Execution Speed of 
Compiled Prolog with Modes, Clause Selection, and Determinism, in: TAPSOFT'87, 
Lecture Notes in Computer Science 250, Mar. 1987, pp. 111-125. 
34. Van Roy, P. and Despain, A. M., High-Performance Logic Programming with the 
Aquarius Prolog Compiler, Computer 25(1):54-68 (Jan. 1992). 
35. Van Roy, P., Can Logic Programming Execute as Fast as Imperative Programming?, 
Ph.D. thesis, University of California, Berkeley, Dec. 1990. Available as U. C. Berkeley 
Computer Science Division Technical Report UCB/CSD 90/600. 
36. Warren, D. H. D., An Abstract Prolog Instruction Set, Technical Report 309, SRI 
International, Oct. 1983. 
37. Yokota, M., Yamamoto, A., Taki, K., Nishikawa, H., and Uchida, S., The Design and 
Implementation of a Personal Sequential Inference Machine: PSI, New Generation 
Computing 125-144 (1983). 
