A high-performance low risc machine for logic programming  by Mills, J.W
J. LOGIC PROGRAMMING 1989:179-212
A HIGH-PERFORMANCE LOW RISC MACHINE 
FOR LOGIC PROGRAMMING* 
J. W. MILLS 
This paper describes a reduced-instruction-set computer (RISC) architecture 
for PROLOG and gives examples of Warren-machine (WAM) instructions, 
built-in functions, and unit clauses written using its instruction set. Two 
optimizations that allow dereferencing and trailing to be omitted frequently 
are applied to the RISC code, allowing it to execute 30% faster than 
unoptimized macro expansions of Warren-machine instructions. Using an 
instruction cache and a data cache, hand-optimized unit clauses are 
estimated to run at more than 1,700,000 logical inferences per second (LIPS), 
while a mixture of hand-optimized and macro-expanded RISC code should 
execute in the range of 200,000 to 700,000 LIPS. Hand-optimized RISC code 
is four times the size of corresponding WAM code; macro-expanded RISC 
code is seven times larger. 
1. INTRODUCTION 
David H. D. Warren’s abstract machine for PROLOG [39,40], the WAM, has 
influenced the design of more PROLOGs than any other abstract PROLOG 
machine proposed to date. Although McCabe’s Sigma machine [21] and D. L. 
Bowen’s NIP [2] were described in more detail than the WAM, the WAM captured the 
imagination of the PROLOG implementation community. The attraction to the 
WAM is strong: it is used in at least five compiled PROLOGs,1 provides a model for 
investigating parallel PROLOG and PROLOG theorem provers [26,10,16,13,14] 
and, slightly modified, is the PROLOG Language Machine (PLM) designed by 
Dobry, Despain, and Patt [8, 32, 38]. 
*An earlier version of this paper is available as Technical Report TR-86-008 from the Computer 
Science Department, Arizona State University, Tempe, AZ 85287. 
Address correspondence to J. W. Mills, Department of Computer Science, Indiana University, 
Bloomington, Indiana 85287. 
Received July 1987; accepted 10 October 1987. 
1Applied Logic Systems, Arity, Berkeley PLM, Quintus, and SUNY Stony Brook PROLOGs. 
THE JOURNAL OF LOGIC PROGRAMMING 
©Elsevier Science Publishing Co., Inc., 1989 
655 Avenue of the Americas, New York, NY 10010 0743-1066/89/$3.50 
The WAM has even been extended to a complex object-oriented architecture [7], 
but the opposite approach-reducing the WAM to a RISC architecture-was consid- 
ered unproductive in 1984 when I began to investigate such a RISC architecture. This 
belief was probably due to Tick and Warren’s decision not to use a RISC (such as the 
Berkeley RISC II [17]) to implement the WAM. A conventional RISC has less datapath 
parallelism than the WAM, and because translating PROLOG to a RISC instruction 
set requires many more conditional branches and subroutine calls than the WAM, the 
RISC code is larger [34]. As a result, the performance of the WAM was expected to be 
superior to that of a RISC. Evidence of this was obtained using Patterson’s SPUR 
(Symbolic Processing Using a RISC) [29]. Borriello, Cherenson, Danzig, and Nelson 
found that an unaided SPUR could attain 60% of the performance of a WAM [4]. If a 
specialized coprocessor was added, they estimated that the SPUR would be slightly 
faster than the PLM. However, because the SPUR was designed to execute LISP, not 
PROLOG [36], I believe that the performance suffered. 
It turns out that performance improvements greater than those obtained by the 
SPUR team are possible with the LOW RISC (an acronym for Logic Programming 
Windowed RISC). Faster execution of PROLOG programs is due to an increased 
number of datapaths inside the LOW RISC which allow tag and value operations to 
be performed in parallel, an instruction set evolved from the WAM, and hardware 
support for common operations such as tag check and branching. (Also desirable, 
but not present in this earliest version of the LOW RISC architecture, is hardware 
support for shallow backtracking, stack manipulation, trailing, single-cycle “partial” 
unification, and zero-to-four-cycle dereferencing.2) 
Additional improvements are possible with the simple architecture presented here 
by recognizing that the WAM is an abstract compiling paradigm as well as an 
abstract machine. Thus, RISC code may be added, omitted, moved, or otherwise 
optimized as long as the behavior of the abstract machine is not changed. Such 
optimized code no longer corresponds exactly to a sequence of WAM instructions, 
but rather to a sequence of pieces of WAM instructions. These pieces are the parts of 
a WAM instruction executed in the specific context of a program, such as the 
write-mode operations performed when a structure is bound to a logical variable. 
Because code optimizations can be made “inside” WAM instructions, in certain 
contexts trailing and dereferencing can be identified as superfluous and omitted 
from the RISC code. 
As the speed of a PROLOG implemented on a RISC depends on the code 
executed, four representative WAM instructions, a unification subroutine, four 
PROLOG goals, the inner loop of the Determinate Concatenate benchmark, and a 
unit clause of arity four are translated into LOW RISC instructions. These examples 
were chosen because they illustrate common functions performed by a WAM 
executing PROLOG [20,25]. The examples are compared with code for the WAM. 
Execution speed and program size are determined by comparing the cycle times and 
code size of LOW RISC routines with the cycle times and code size of PLM instruc- 
tions. Although a certain amount of caution must be used in interpreting these 
results, I estimate that a single LOW RISC processor could execute PROLOG 
2The third and final version of the LOW RISC, a 40-bit architecture, is described in the author’s 
doctoral dissertation, in preparation. 
procedures at speeds ranging from 200,000 to 1,700,000 LIPS (logical inferences per 
second).3 
2. TRANSLATING PROLOG TO LOW RISC CODE 
2.1. Overview of the LOW RISC Instruction Set 
Although the LOW RISC architecture is specified in Appendix A, the instruction set is 
briefly described here for convenience. There are only seven instructions, each of 
which executes in a single cycle. Instruction execution is internally pipelined as in 
the Berkeley RISC II, allowing instructions to overlap. After the pipeline fills, a LOW 
RISC instruction is completed every half cycle. 
Memory is referenced by load and store instructions. As in other RISCs, the result 
of a load is delayed. Tag status flags are set to indicate the type of the value loaded 
or the value stored. Value status flags remain unchanged by a load or a store. 
Value and tag manipulation are performed in parallel by add and subtract 
instructions. The add instruction also has characteristics of a logical instruction. 
Control bits allow it to zero the first operand’s value and select the destination tag 
from either the second operand or immediate data. If the second operand is 
immediate data, then the destination tag is selected from either the first operand or 
another immediate data field. Thus an add instruction can be used to “glue” a tag 
and a value together. The sub instruction subtracts the first operand’s value from the 
second, and the first operand’s tag from the second. All fields are interpreted as 
twos-complement integers. For both add and sub the tag status flags are set to 
indicate the type of the result. Value status flags are set to indicate the result of both 
arithmetic add and sub and the inclusive-or add operation. 
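The parallel tag and value paths of the add instruction can be pictured with a small model. The following Python sketch is illustrative only: the `Word` type, the tag names, and the `zero_first` and `tag_from` parameters are assumptions standing in for the control bits and immediate fields described above, not the actual encoding given in Appendix A.

```python
from typing import NamedTuple, Optional


class Word(NamedTuple):
    tag: str    # illustrative tag names such as "ref", "const", "list"
    value: int


def low_risc_add(a: Word, b: Word, zero_first: bool = False,
                 tag_from: Optional[str] = None) -> Word:
    """Toy model of add: the value sum and the tag selection are independent."""
    first = 0 if zero_first else a.value             # control bit zeroes operand 1
    value = first + b.value                          # value datapath
    tag = tag_from if tag_from is not None else b.tag  # tag from operand 2 or immediate
    return Word(tag, value)


# "Glue" a list tag onto the heap pointer value, as in store ..., list:H
heap_top = Word("ref", 100)
glued = low_risc_add(heap_top, Word("const", 0), tag_from="list")
```

Under these assumptions `glued` carries the value of the heap pointer and the immediate tag, which is the effect the text calls gluing a tag and a value together.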
Execution is controlled in three ways. First, an unconditional branch is made by 
using the if always instruction, by loading a value into the program counter, or by 
performing program-counter arithmetic. Second, conditional branches are per- 
formed by the if instruction based on the status of relevant tag and value status 
flags, or by the switch instruction based on any register’s type. The switch instruc- 
tion contains four offsets. The type selects one of the offsets to be added to the 
program counter, or continues execution at the next instruction. Finally, a coproces- 
sor interface instruction, hook, is provided. This instruction allows a general-pur- 
pose processor to be used for system predicates that cannot be programmed with 
the LOW RISC instruction set. 
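The behavior of switch can be sketched as a four-way dispatch with a fall-through case. In this Python sketch the ordering of the four types (variable, constant, list, structure) is inferred from the operand order used in the listings of this paper, and the function name and the unit step for fall-through are assumptions made for illustration.

```python
from typing import Sequence

# Assumed ordering of the four offsets, inferred from operand order in the listings.
TYPE_ORDER = ("var", "const", "list", "struc")


def switch_target(pc: int, tag: str, offsets: Sequence[int]) -> int:
    """Return the next program counter for a switch on a register's type."""
    if tag in TYPE_ORDER:
        return pc + offsets[TYPE_ORDER.index(tag)]  # type selects one of four offsets
    return pc + 1  # any other type (e.g. a bound reference) continues in sequence
```

A bound reference falling through to the next instruction is what lets the dereference loop sit directly after the switch in the listings.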
2.2. Macro-expanding WAM Instructions Using the LOW RISC 
RISC architectures increase the semantic gap that must be bridged by a PROLOG 
compiler. However, the WAM provides an intermediate representation of PROLOG 
that can be translated to LOW RISC code. If the LOW RISC instructions corresponding 
to a WAM instruction are grouped together, then they may be viewed as a macro 
3Since this paper was originally written, a simulator for the extended LOW RISC, described in [24], 
resulted in execution speeds for the PLM benchmark suite ranging from 300K to 700K LIPS. Code was 
generated by macro-expanding WAM instructions, and was not optimized [31]. 
expansion of that WAM instruction. This is an easy path to take when implementing 
PROLOG on a RISC. The SPUR team took this approach [4], which led to a slower 
implementation because dereferencing, trailing, and mode-removing optimizations 
were not applied. Four WAM instructions (bold print in the headings) are shown 
with their LOW RISC native code. 
2.2.1. call Proc, N. The call WAM instruction is used to execute PROLOG goals 
in the body of a clause. It saves the address of code for the remaining goals in the 
body of the clause (the continuation) in the CP register. If the goal and all its 
subordinate goals succeed, then program execution will resume at the address in the 
CP register. 
addr1:  add P, proc, P        ;faster than PC-relative load
        add P, addr2, CP      ;and takes less space
addr2:  next instruction
2.2.2. put_variable Xn, Ai. The put_variable WAM instruction creates an unbound 
variable in the heap, and puts references to this variable in a temporary variable and 
an argument register. 
addr1:  add   H, 0, Xn
        store H, 0, H
        add   Xn, 0, Ai
        add   H, 1, H
2.2.3. get_variable Xn, Ai. The get_variable WAM instruction is equivalent to the 
put_value WAM instruction. Both transfer the contents of one register to another. If 
the variable occurs after the first goal, a variation of this instruction saves the 
parameter in the environment of the WAM. In the LOW RISC, the variable is 
preserved in the calling procedure’s register window. 
addr1:  add Xn, 0, Ai
2.2.4. get_structure F, Ai. The get_structure WAM instruction may bind an 
incoming parameter which is a variable to a new copy of a structure skeleton, start 
matching an incoming structure to the encoded structure, or fail. Additional WAM 
instructions finish the structure copying or matching. If this were a true macro 
expansion, then the branches to read and write mode code would also include 
instructions to set or clear a mode flag, and would branch to the same entry point. 
In the branched-to code, each instruction would check the mode flag, slowing the 
execution of the program. 
addr1:  switch Ai, Lv, fail, fail, Lstruc   ;dereference Ai
        load   Ai, 0, T
        if     always, addr1
        add    T, 0, Ai
Lv:     sub    Ai, HB, T, SC                ;bind variable Ai to
        if     value ≥, Lv1                 ;structure ptr
        store  Ai, 0, struc:H
        store  TR, 0, Ai                    ;trail if necessary
        add    TR, 1, TR
Lv1:    add    H = 0, functor, T            ;store functor as word pair
        store  H, 0, T
        add    H = 0, arity, T
        store  H, 1, T
        write-mode code follows here
Lstruc: sub    T, functor, T, SC            ;match structure Ai to code
        load   Ai, 1, T
        if     value ≠, fail                ;fail if either indices or
        sub    T, arity, T, SC              ;arities are not equal
        if     value ≠, fail
        read-mode code follows here
2.3. Unification 
In the WAM, get_value and unify_value are two of the instructions requiring the 
availability of full unification as a micro subroutine. This micro subroutine is 
invariant: it cannot be specialized even when typing information for the operands is 
either provided by the programmer or deduced during compile time. 
Unification is coded on the LOW RISC with a sequence of switch instructions 
which select an action based on the types of the two arguments. The unification 
subroutine uses the windowed register file of the LOW RISC to handle nested lists and 
structures. Maintaining the unification stack in the register file improves perfor- 
mance. As an extreme (and unlikely) example, if 50 registers in the window file were 
unused when the unification subroutine was entered, a list with 11 nested levels or a 
structure with 9 nested levels could be unified without overflowing the windowed 
register file. Since the choice-point stack and the unification stack share the use of 
the register file, it is fortunate that such examples are uncommon. 
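The type-directed structure of unification (select a row by the type of the first argument, then a column by the type of the second) can be sketched at a high level. The Python below is a simplified model, not a transcription of the subroutine in Section 2.3.1: the `Cell` class and its tag names are assumptions, an unbound variable is represented by `None` rather than by a self-reference, variables are not ranked before binding, every binding is trailed unconditionally, and Python recursion stands in for the register-window stack.

```python
class Cell:
    """Tagged cell. tag is 'var', 'const', 'list' or 'struc' (illustrative names)."""

    def __init__(self, tag, value=None):
        self.tag = tag
        # var: bound Cell or None; const: atom; list: (car, cdr); struc: (functor, args)
        self.value = value


def deref(cell):
    """Follow bound variables until a non-variable or an unbound variable is reached."""
    while cell.tag == "var" and cell.value is not None:
        cell = cell.value
    return cell


def bind(var, target, trail):
    var.value = target
    trail.append(var)  # simplification: no trail check against HB


def unify(a, b, trail):
    a, b = deref(a), deref(b)
    if a is b:                       # same variable or same object
        return True
    if a.tag == "var":               # row 1: first argument is a variable
        bind(a, b, trail)
        return True
    if b.tag == "var":               # column 1 of rows 2 to 4
        bind(b, a, trail)
        return True
    if a.tag != b.tag:               # off-diagonal entries fail
        return False
    if a.tag == "const":
        return a.value == b.value
    if a.tag == "list":              # unify the cars, then the cdrs
        return (unify(a.value[0], b.value[0], trail)
                and unify(a.value[1], b.value[1], trail))
    (f1, args1), (f2, args2) = a.value, b.value
    if f1 != f2 or len(args1) != len(args2):
        return False
    # Structures are unified from right to left, as in the LOW RISC routine.
    return all(unify(x, y, trail)
               for x, y in zip(reversed(args1), reversed(args2)))
```

For example, unifying f(X, b) with f(a, b) under this sketch binds X to a and leaves one entry on the trail.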
2.3.1. LOW RISC Unification Subroutine 
Register  Usage                    Overlapped usage during recursion
                                   list1 & list2   structure1 & structure2
R0        return pointer
R1        return window
R2        (dref'd) object 1
R3        (dref'd) object 2
R4        cell at dref'd object 1  return ptr      index 1 / arity 1
R5        cell at dref'd object 2  old window      index 2 / arity 2 / return ptr
R6                                 car 1           return window
R7                                 car 2           element n1
R8                                                 element n2
R9        T
unify:   switch R2, var1, const1, list1, struc1   ;select row
         load   R2, 0, R4                         ;dereference variable 1
         if     always, unify
         add    R4, 0, R2
var1:    switch R3, rank, bind1, bind1, bind1     ;row 1-select column entry
         load   R3, 0, R5                         ;dereference variable 2
         if     always, var1
         add    R5, 0, R3
const1:  switch R3, bind2, match, fail, fail      ;row 2-select column entry
         load   R3, 0, R5                         ;dereference variable 2
         if     always, const1
         add    R5, 0, R3
list1:   switch R3, bind2, fail, ulists, fail     ;row 3-select column entry
         load   R3, 0, R5                         ;dereference variable 2
         if     always, list1
         add    R5, 0, R3
struc1:  switch R3, bind2, fail, fail, ustrucs    ;row 4-select column entry
         load   R3, 0, R5                         ;dereference variable 2
         if     always, struc1
         add    R5, 0, R3
match:   sub    R2, R3, T, SC                     ;match constant 1
         if     cell =, return                    ;and constant 2
fail:    if     always, fail routine              ;fail
rank:    sub    R2, R3, T, SC                     ;rank variables 1 and 2
         if     value >, bind2
         if     value =, return                   ;if same variable then return
bind1:   sub    R2, HB, T, SC                     ;bind variable 1 to object 2
         if     value ≥, return
         store  R2, 0, R3
         store  TR, 0, R2
         if     always, return
         add    TR, 1, TR
bind2:   sub    R3, HB, T, SC                     ;bind variable 2 to object 1
         if     value ≥, return
         store  R3, 0, R2
         store  TR, 0, R3
         if     always, return
         add    TR, 1, TR
ulists:  add    R4, 0, R6                         ;unify two lists cell by cell
         add    R5, 0, R7
         add    W, 0, R5
         add    P, nxtcell, R4                    ;unify the two cars
         add    P = 0, unify, P
         add    W, 4, W
nxtcell: load   R2, 1, R2                         ;load cdrs & continue
         if     always, unify
         load   R3, 1, R3
ustrucs: sub    R4, R5, T, SC                     ;if function indices not equal...
         load   R2, 1, R4                         ;(get arities now to avoid delay)
         load   R3, 1, R5
         if     cell ≠, fail routine              ;...then fail else...
         sub    R4, R5, T, SC                     ;if arities not equal
         if     value ≠, fail routine             ;then fail else...
         sub    R4, R5, T, SC                     ;if arities are zero
         if     value =, return                   ;then return else...
         load   R2, R4, R7                        ;load element of structure 1
loop:    load   R3, R4, R8                        ;and elements of structure 2
         add    W, 0, R6                          ;unify the two structures
         add    P, test, R5                       ;from right to left
         add    P = 0, unify, P                   ;("backwards")
         add    W, 5, W
test:    sub    R4, 1, R6                         ;if arity not decremented to zero
         if     value ≠, loop                     ;then unify next elements else...
         load   R2, R4, R7                        ;(only used if branch is taken)
return:  add    P = 0, R0, P                      ;return to caller
         add    W = 0, R1, W
This subroutine can be used whenever unification is required, but optimizations 
improve its use and performance. 
Loop unrolling is applied by open-coding the unification routine while omitting 
the recursive list and structure subroutines. These two cases are handled by 
calls to the unification subroutine, but only when both arguments are list 
pointers or structures with matching functors. Thus one pass through the 
unifier can be avoided for variable or atomic parameters. 
Strength reduction is used to reduce the number of cases that must be examined 
(and to replace these cases by immediate failure) when the types of either or 
both of the two arguments can be determined at compile time. The resulting 
routine is both shorter and faster. 
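Strength reduction can be illustrated for the case in which one argument is known at compile time to be a constant. The Python sketch below is an assumption-laden model rather than LOW RISC code: cells are plain dictionaries, the function name `get_constant` is borrowed from the WAM instruction of the same name, and bindings are trailed unconditionally. Only the variable and constant cases survive; the list and structure cases collapse to immediate failure, so no recursive call to the general unifier is needed.

```python
def deref(cell):
    """Follow bound variables; an unbound variable has ref None (assumed convention)."""
    while cell["tag"] == "var" and cell["ref"] is not None:
        cell = cell["ref"]
    return cell


def get_constant(cell, atom, trail):
    """Unify cell with a constant known at compile time."""
    cell = deref(cell)
    if cell["tag"] == "var":
        cell["ref"] = {"tag": "const", "value": atom}  # bind the variable
        trail.append(cell)                             # simplification: always trail
        return True
    if cell["tag"] == "const":
        return cell["value"] == atom                   # single comparison
    return False                                       # list or structure: fail at once
```

The general routine would have examined up to sixteen type combinations; here three tests suffice.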
If the unification routine is coded in line (open-coded), it may be further 
optimized by omitting dereferencing and trailing (see Section 3, Optimizing LOW 
RISC Code). Both open-coded and subroutined unification routines may be improved 
by detecting special cases, such as list pointers or structure pointers that are 
identical. In these cases, the unification routine can immediately terminate success- 
fully because the remaining elements of each list or structure are identical. 
2.4. Open-Coding PROLOG Goals 
As many as one-half of the goals in a PROLOG program may be system predicates 
[20]. The WAM instruction set is typically extended to allow system predicates to be 
executed on a general-purpose processor. The PLM uses an escape instruction [8] and 
Overbeek’s WAM emulator uses a call_foreign instruction [13] for this purpose. The 
LOW RISC takes a similar approach, using the hook instruction to control an 
attached processor. However, system predicates involving strength-reduced unifica- 
tion, term manipulation, and integer addition and subtraction can be expanded into 
in-line code sequences (open-coded) with the LOW RISC instruction set. Three 
PROLOG goal sequences and one built-in predicate (arg/3) are given as examples. 
2.4.1. X = Y. This goal is also equivalent to the unit clause p(X, X), which uses 
the get_value WAM instruction. It requires full unification unless a strength reduction 
can be identified (e.g., the goal is always called with both arguments either variables 
or constants). The goal calls the unification subroutine of Section 2.3.1. 
X = Y:  add W, 0, OldW            ;unify X and Y
        add P, next, Return
        add P = 0, unify, P
        add W, #regs, W
next:   next goal code starts here
2.4.2. nonvar(X). This PROLOG goal may be encoded by a switch_on_type WAM 
instruction [38]. Thus, the LOW RISC encoding also demonstrates that WAM instruc- 
tion. 
        switch X, fail, nv, nv, nv
nv:     first nonvariable processing instruction, if all side effects
        are transparent to fail
2.4.3. retract(count(X)), Y is X + 1, assert(count(Y)). This sequence is typically 
used to save and manipulate a parameter that must not be affected by backtracking. 
The Gl register points to a vector of memory words. A predicate is represented by 
an index into the vector. Each argument of the predicate is allocated one word. This 
encoding takes advantage of the nonlogical nature of assert and retract, replacing 
them with destructive assignment. 
load Gl, count, X 
nop 
add X, 1, Y 
store Gl, count, Y 
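The same replacement of retract and assert by destructive assignment can be written in a few lines of Python. This is only a sketch of the idea: the list `g1` stands in for the vector addressed by the G1 register, and the constant `COUNT` is a hypothetical index assigned to the predicate count/1.

```python
COUNT = 0  # hypothetical index of count/1 in the vector addressed by G1


def bump(g1, index=COUNT):
    """Model of retract(count(X)), Y is X + 1, assert(count(Y))."""
    x = g1[index]    # load  G1, count, X
    y = x + 1        # add   X, 1, Y
    g1[index] = y    # store G1, count, Y
    return y
```

Because the update is a plain store, it is not recorded on the trail and is therefore unaffected by backtracking, which is the behavior the original goal sequence relies on.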
2.4.4. arg(N,S,A). This system predicate is defined according to Clocksin and 
Mellish [6]. 
arg/3:  switch N, fail, ok, fail, fail        ;type-check N
        add    N, 0, Cnt                      ;use N as counter if S is a list
ok:     switch S, fail, fail, list, dref
        load   S, N, T1                       ;use N as an index if S is a struc
list:   sub    Cnt, 1, Cnt, SC
loop:   if     value ≤, dref
        sub    Cnt, 1, Cnt, SC
        if     always, loop
        load   S, 1, S                        ;get Nth list ptr for list S
dref:   switch A, bind, match, match, match   ;dereference & type A
        load   A, 0, T2
        if     always, dref
        add    T2, 0, A
bind:   sub    A, HB, T2, SC                  ;bind variable A to S(N)
        if     value <, next
        store  A, 0, T1
        store  TR, 1, A                       ;trail A if necessary
        if     always, next
        add    TR, 1, TR
match:  add    W, 0, OldW                     ;unify A and S(N)
        add    P, next, Return
        add    P = 0, unify, P
        add    W, #regs, W
next:   next goal code starts here
2.5. Determinate Concatenate 
The Determinate Concatenate benchmark has become a standard benchmark by 
which PROLOG implementations are compared: performance on this benchmark 
remains one of the first questions asked about a PROLOG implementation. The 
encoding of the critical inner loop of Determinate Concatenate is presented here, 
and estimated to run at 1,400,000 LIPS. The estimate is justified in Section 4.3.3, 
Effects of Optimization. 
2.5.1. Inner Loop for the WAM. The inner loop of the Determinate Concatenate 
corresponds to the following WAM instructions operating in the mode specified. 
During the time that the new list is being constructed, this fragment of the 
benchmark will be executed repeatedly. The fragment does not show the action 
taken upon reaching the end of the list. 
conc/3: switch_on_term C1a, C1, C2, fail   ;C2 always selected in list
C2:     get_list A1
        unify_variable X4                  ;read mode
        unify_variable A1                  ;read mode
        get_list A3
        unify_value X4                     ;write mode
        unify_variable A3                  ;write mode
        execute conc/3
2.5.2. Inner Loop for the LOW RISC. The LOW RISC code for the inner loop of the 
Determinate Concatenate is shown below. Branches to routines outside the inner 
loop are shown, but the routines themselves are not. These routines correspond to 
those portions of the WAM instructions’ microcode that are not executed in the inner 
loop. If no dereferencing or trailing is performed, then only the 14 instructions 
preceded with bullets (•) will be executed. 
/* Register usage:  R2  A1
                    R3  A2
                    R4  A3
                    R5  dereferenced A1
                    R6  dereferenced A3
                    R7  scratch
*/
• conc/3:  switch R2, C1a, C1, C2, fail          ;switch_on_term
•          load   R2, 0, R5                      ;dereference if A1 is bound
           if     always, dref                   ;else this load is equivalent
           add    R5, 0, R2                      ;to...
                                                 ;get_list A1
                                                 ;unify_variable X4 (read)
                                                 ;...if A1 is a variable
• C2:      load   R2, 1, R2                      ;unify_variable A1 (read)
• dref:    switch R4, write1, fail, read1, fail  ;get_list A3
•          load   R4, 0, R6
           if     always, dref
           add    R6, 0, R4
• write1:  sub    R4, HB, R7, SC
•          store  R4, 0, list:H
•          if     value ≥, notrail
•          store  H, 0, R5                       ;unify_value X4 (write)
           store  TR, 0, R4
           add    TR, 1, TR
• notrail: add    H, 1, H
•          store  H, 0, H                        ;unify_variable A3 (write)
•          add    H, 0, bound:R4
•          if     always, conc/3                 ;execute conc/3
•          add    H, 1, H
2.6. A Unit Clause 
Shallow backtracking through a database of unit clauses is a frequent operation: in 
a PROLOG metacompiler I instrumented to determine the frequency of procedure 
calls, 17% of all calls were due to shallow backtracking through a unit-clause 
database. A typical unit clause from this database is written in WAM and LOW RISC 
code so that it can be used later for estimating the performance difference between 
the two machines. 
2.6.1. Unit Clause for the WAM. This code shows the small semantic gap between 
PROLOG and the WAM. The clause address corresponds to the predicate, and there 
is one WAM instruction for each argument. The WAM code is very concise, but it 
cannot be varied if the context changes. 
di/4:   get_constant A1, ' '      ;di(' ',    /* blank */
        get_nil A2                ;   [],
        get_constant A3, 'scan'   ;   'scan',
        get_constant A4, 'scan'   ;   'scan').
2.6.2. Unit Clause for the LOW RISC. The first example is the unoptimized code 
for the unit clause. Instructions executed if the clause is called with 
di(' ', X, 'scan', OutputState) are shown with a bullet (•) in front of them. 23 clock 
cycles are needed if no dereferencing or trailing is done. In the next section we will 
examine a pair of optimizations that substantially reduce the number of instructions 
executed for this clause. 
  di/4:                                          ;simply an entry point
• blank:   switch A1, Lv1, Lc1, fail, fail       ;dereference A1
•          load   A1, 0, T
           if     always, blank
           add    T, 0, A1
  Lv1:     sub    A1, HB, T, SC                  ;trail check in heap only (see footnote 4)
           if     value ≥, nil                   ;if trail check fails do next arg
           store  A1, 0, const:' '               ;bind variable A1 to blank
           store  TR, 0, A1                      ;trail if trail check succeeded
           if     always, nil                    ;do next arg after trailing
           add    TR, 1, TR                      ;update trail pointer
• Lc1:     sub    A1, ' ', T, SC                 ;match constant A1 to blank
•          if     value ≠, fail                  ;fail if not blank
•          nop                                   ;fall through to next arg
• nil:     switch A2, Lv2, Lc2, fail, fail       ;dereference A2
•          load   A2, 0, T
           if     always, nil
           add    T, 0, A2
• Lv2:     sub    A2, HB, T, SC                  ;trail check in heap only
•          if     value ≥, scan1                 ;if trail check fails do next arg
•          store  A2, 0, const:[]                ;bind variable A2 to nil
           store  TR, 0, A2                      ;trail if trail check succeeded
           if     always, scan1                  ;do next arg after trailing
           add    TR, 1, TR                      ;update trail pointer
  Lc2:     sub    A2, [], T, SC                  ;match constant A2 to nil
           if     value ≠, fail                  ;fail if not nil
           nop                                   ;fall through to next arg
• scan1:   switch A3, Lv3, Lc3, fail, fail       ;dereference A3
•          load   A3, 0, T
           if     always, scan1
           add    T, 0, A3
  Lv3:     sub    A3, HB, T, SC                  ;trail check in heap only
           if     value ≥, scan2                 ;if trail check fails do next arg
           store  A3, 0, const:'scan'            ;bind variable A3 to 'scan'
           store  TR, 0, A3                      ;trail if trail check succeeded
           if     always, scan2                  ;do next arg after trailing
           add    TR, 1, TR                      ;update trail pointer
• Lc3:     sub    A3, 'scan', T, SC              ;match constant A3 to 'scan'
•          if     value ≠, fail                  ;fail if not 'scan'
•          nop                                   ;fall through to next arg
• scan2:   switch A4, Lv4, Lc4, fail, fail       ;dereference A4
•          load   A4, 0, T
           if     always, scan2
           add    T, 0, A4
• Lv4:     sub    A4, HB, T, SC                  ;trail check in heap only
•          if     value ≥, proceed               ;if trail check fails proceed
•          store  A4, 0, const:'scan'            ;bind variable A4 to 'scan'
           add    CP, 0, P                       ;proceed after trailing
           store  TR, 0, A4                      ;trail if trail check succeeded
           add    TR, 1, TR                      ;update trail pointer
• proceed: add    CP, 0, P                       ;proceed without trailing
•          nop                                   ;needs two nops
•          nop
  Lc4:     add    CP, 0, P                       ;assume we'll proceed
           sub    A4, 'scan', T, SC              ;match constant A4 to 'scan'
           if     value ≠, fail                  ;fail if not 'scan', overriding
                                                 ;the proceed. First
                                                 ;instruction of code
                                                 ;proceeded to should not
                                                 ;jump, as it is the if's nop
                                                 ;slot.
4A split stack model is used here, with no bindings allowed into the environment stack. All bindings 
are made into the heap. This model was used in the LOW RISC simulator [31]. 
3. OPTIMIZING LOW RISC CODE 
The LOW RISC instruction sequences shown so far are straightforward. Four opti- 
mizations have been mentioned: 
(1) loop unrolling (Section 2.3), 
(2) strength reduction of unification (Section 2.3), 
(3) omitting dereferencing and trailing (Section 2.3) and 
(4) strength reduction of retract-op-assert (Section 2.4.3). 
More optimizations are possible, such as filling the slot after a delayed branch 
with an instruction other than a nop [14]. Other optimizations which apply to WAM 
code will also improve LOW RISC code, such as the omission of choice points [36]. 
The third optimization, omitting dereferencing and trailing, is a good example of an 
optimization “inside” a WAM instruction, and will be discussed in the following 
sections. 
3.1. Omitting Dereferencing in All but the First Clause 
Given a series of clauses in a procedure, the WAM dereferences incoming parameters 
in every clause, even if the parameters were dereferenced in the first clause. As 
backtracking causes each clause to be tried, each failure costs a dereference check 
and zero or more executions of the dereference loop per parameter. However, 
repeated dereferencing can be avoided by creating a dereferenced choice point. 
3.1.1. The Dereferenced Choice Point. A choice point is necessary to avoid 
dereferencing, which restricts the optimization to those procedures with two or more 
clauses. The first step in optimizing such a procedure is to dereference each 
incoming parameter before storing it in the choice point. This is done by what 
would be the try or try_me_else WAM instruction, and occurs before unification 
begins in the head of a clause. Some parameters are dereferenced that normally are 
not. Parameters are dereferenced that are matched to the first use of a variable or 
are void in the first clause; thus the choice point will be consistent whether the 
clause succeeds, succeeds with only variable or void parameters, or fails. 
The overhead associated with the creation of the dereferenced choice point is less 
than it appears. If any clause in the procedure succeeds, then all incoming nonvoid 
parameters will be dereferenced at least once. If no clause succeeds, the overhead is 
still less if the number of clauses in the procedure is greater than the arity of the 
procedure. This is because failing through all the clauses will cause the first 
unindexed, nonvoid argument to be dereferenced more times than it would have 
taken to dereference all arguments once to build the dereferenced choice point. 
Thus, creating a dereferenced choice point may actually cost less than not doing so. 
In general, this optimization is cost-effective whenever the arity of the procedure is 
less than the number of clauses in the procedure selected by indexing. 
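The break-even argument can be made concrete with a rough count of dereference checks. The Python sketch below is a deliberately crude model under stated assumptions: it counts one check per argument when the dereferenced choice point is built, and it uses the lower bound from the text (at least the first unindexed, nonvoid argument is dereferenced once in every clause tried) for the unoptimized case. The function names are hypothetical.

```python
def checks_with_choice_point(arity: int) -> int:
    """Every incoming argument is dereferenced once, when the choice point is built."""
    return arity


def checks_without_choice_point(clauses_tried: int) -> int:
    """Lower bound: the first unindexed, nonvoid argument is dereferenced per clause."""
    return clauses_tried


def is_cost_effective(arity: int, clauses_selected: int) -> bool:
    """The rule of thumb stated in the text: arity less than clauses selected."""
    return checks_with_choice_point(arity) < checks_without_choice_point(clauses_selected)
```

For a procedure of arity four, failing through ten clauses costs at least ten checks without the optimization and four with it; with only three clauses the dereferenced choice point does not pay for itself under this model.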
3.1.2. Omitting Dereferencing. After the dereferenced choice point is created in 
the first clause, the parameters stored in it are not dereferenced in subsequent 
clauses. In these clauses dereferenced parameters are restored by failure. Thus, 
dereference checking and dereference loops can be omitted. This optimization 
becomes more useful as the number of clauses in a procedure increases. It is 
possible to generalize this optimization to include all parameters dereferenced in the 
head of a clause, even those within lists and structures. This extension is more 
complex, and requires the creation of an extension to the choice point similar to a 
WAM environment. In this paper the simpler case is presented. 
3.2. Replacing Trailing with Destructive Assignment 
Trailing and trail checking are time-consuming operations. The LOW RISC reduces 
some of the time needed for trailing because it does not use environments: register 
windows replace them, and binding is not allowed to a window register. The impact 
of trailing can be reduced even further. If trailing is viewed as part of a mechanism 
to assign values to variables during the execution of a PROLOG procedure, then it 
can be optimized to destructive assignment. This optimization is described only for 
incoming parameters. Again, although the optimization can be extended to parame- 
ters inside lists and structures, this requires analysis and reorganization of the 
clauses in the procedure. The simpler case is described in this paper.5 
3.2.1. Unbound Variables. An incoming parameter that is an unbound variable 
dereferences to itself. This convention was described by Warren [39], and requires 
the value of an unbound variable to be its memory address. This address and the 
unbound tag identifying the dereferenced variable are necessary to apply the 
optimization. Both are available in a dereferenced choice point, which is a require- 
ment for this optimization. 
If a parameter is an unbound variable, its memory address is placed in the choice 
point. This tagged cell will be used in future bindings, instead of either the register 
or the tagged cell in the choice point. For example, a variable bound to a constant 
during program execution would cause the tagged cell to be replaced by the 
constant, and the register to be set to “bound”, while the choice-point tagged cell 
would contain the original unbound variable. 
5A 1987 draft paper by Joachim Beer, “The Occurs-Check in PROLOG, or How to Run Faster with 
It than without It”, describes a similar optimization within lists and structures in detail, and further 
extends its significance by showing how the use of a context-sensitive unbound variable can provide an 
occurs check “for free”. 
3.2.2. Omitting Trailing. Given a dereferenced choice point, the first through the 
next-to-last clauses in the procedure are compiled with trail checking and trailing 
omitted for all incoming unbound variables not “nested” inside a list or structure. 
This is safe because: 
(1) All subsequent references to the bound but not trailed variable are made by 
procedures lower in the execution tree. Until failure propagates up to the 
clause that bound but did not trail the variable, the variable’s value is not 
changed. (Disjunction can be handled by treating the disjunct as two or more 
clauses subordinate to the clause containing the disjunct, creating a derefer- 
enced disjunction choice point, and then applying the trail optimization.) 
(2) Failure of the clause that bound the variable causes the parameter’s original 
dereferenced value to be restored from the dereferenced choice point. This 
value is used to set the variable in memory either to unbound, or to a new 
value contained in the next clause. 
(3) All bindings are reset during the trust instruction before the dereferenced 
choice point is discarded prior to executing the last clause in the procedure. 
The effect of these actions is to replace trailing with destructive assignment in the 
first through next-to-last clause for those parameters that are unbound variables. 
The dereferenced choice point contains the same information that normally would 
be available on the trail (i.e., which cells have been bound). Resetting the variable is 
not done in the failure routine. It is delayed until the head of the next clause, when 
a new value is assigned to the variable, or the variable is reset to unbound. Shallow 
backtracking may leave some variables bound if a clause has fully or partially 
succeeded, but the forced reset prior to discarding the dereferenced choice point 
assures that all bindings are undone before the procedure exits. 
The optimization is not applicable to the last clause. Trailing must be done 
because the dereferenced choice point will be discarded, leaving the parent proce- 
dure without the information needed to overwrite the bindings made by the last 
clause. 
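The mechanism of Sections 3.2.1 and 3.2.2 can be sketched as a small Python model. This is
only an illustrative sketch under simplifying assumptions: a single incoming argument, clauses
whose heads bind that argument to a constant, and hypothetical names (Cell, deref,
call_procedure) that are not part of the LOW RISC design. It shows the first through
next-to-last clauses binding by destructive assignment, with trailing done only in the last
clause.

```python
from dataclasses import dataclass

REF, UNBOUND, CONST = "ref", "unbound", "const"

@dataclass
class Cell:
    tag: str
    value: object

def deref(memory, addr):
    """Follow reference chains; an unbound variable holds its own address."""
    while memory[addr].tag == REF:
        addr = memory[addr].value
    return addr

def call_procedure(memory, arg_addr, clause_constants, body):
    """Try each clause in turn; each head binds the argument to a constant.

    Assumption: the incoming argument dereferences to an unbound variable.
    Returns (succeeded, trail), where trail lists the addresses trailed.
    """
    trail = []
    target = deref(memory, arg_addr)
    if memory[target].tag != UNBOUND:
        raise ValueError("sketch handles only an unbound incoming argument")
    choice_point = target  # dereferenced choice point: address of the variable
    last = len(clause_constants) - 1
    for i, constant in enumerate(clause_constants):
        if i < last:
            # Non-final clause: overwrite whatever an earlier clause left
            # behind (destructive assignment, no trail entry).
            memory[choice_point] = Cell(CONST, constant)
        else:
            # Last clause: reset the binding before the choice point is
            # discarded, then bind with ordinary trailing.
            memory[choice_point] = Cell(UNBOUND, choice_point)
            trail.append(choice_point)
            memory[choice_point] = Cell(CONST, constant)
        if body(constant):
            return True, trail
    # Every clause failed: undo the trailed binding for the parent procedure.
    for addr in trail:
        memory[addr] = Cell(UNBOUND, addr)
    return False, trail
```

In this model a success in an early clause leaves the trail empty, while a success or failure
in the last clause goes through the trail, which mirrors the restriction stated above for the
last clause.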
3.3. Applying Both Optimizations to a Unit Clause 
The unit clause shown in Section 2.6 is shown here after applying the dereference and
trailing optimizations; we assume that the clause is neither the first nor the last
clause in the procedure, so that both are applicable. The size of the code shrinks,
and the number of instructions executed drops to 18 under the same conditions as
we used to execute the clause the first time. Instructions executed if the clause is
called with di(' ', X, 'scan', Output_State) are shown with a bullet (•) in front of them.
di/4:
• blank:    switch  A1, Lv1, Lc1, fail, fail   ;dereference A1
• Lv1:      store   A1, 0, const:' '           ;bind variable A1 to blank
            if      always, nil                ;do next arg after trailing
• Lc1:      sub     A1, ' ', T, SC             ;match constant A1 to blank
•           if      value ≠, fail              ;fail if not blank
•           nop                                ;fall through to next arg
  nil:      switch  A2, Lv2, Lc2, fail, fail   ;dereference A2
  Lv2:      store   A2, 0, const:[ ]           ;bind variable A2 to nil
            if      always, scan1              ;do next arg after trailing
  Lc2:      sub     A2, [ ], T, SC             ;match constant A2 to nil
            if      value ≠, fail              ;fail if not nil
            nop                                ;fall through to next arg
  scan1:    switch  A3, Lv3, Lc3, fail, fail   ;dereference A3
  Lv3:      store   A3, 0, const:'scan'        ;bind variable A3 to 'scan'
            if      always, scan2              ;do next arg after trailing
  Lc3:      sub     A3, 'scan', T, SC          ;match constant A3 to 'scan'
            if      value ≠, fail              ;fail if not 'scan'
            nop                                ;fall through to next arg
  scan2:    switch  A4, Lv4, Lc4, fail, fail   ;dereference A4
  Lv4:      store   A4, 0, const:'scan'        ;bind variable A4 to 'scan'
  proceed:  add     CP, 0, P                   ;proceed without trailing
            nop                                ;needs two nops
            nop
  Lc4:      add     CP, 0, P                   ;assume we'll proceed
            sub     A4, 'scan', T, SC          ;match constant A4 to 'scan'
            if      value ≠, fail              ;fail if not 'scan', overriding
                                               ;the proceed. Code proceeded
                                               ;to should not make any
                                               ;jumps in the first two
                                               ;instructions, as the if's
                                               ;nop slot is filled from there.
3.3.1. Type Declarations Really Make Things Fast. If the programmer can make
type declarations, as is possible with BIM PROLOG (and to an extent with Quintus
and DEC-10 PROLOG using mode declarations), then the execution of this clause
can be sped up much more. The reason is that tag checking can be omitted, which,
coupled with the dereference and trailing optimizations, leaves very little code to
execute: only nine instructions taking nine clock cycles. The sequence of
optimizations that led to this code is not possible with the WAM, because the WAM
instructions are too complex. The microinstructions that make up a WAM instruction
can be executed “selectively” based solely on the state of the read/write mode bit.
If greater selectivity were allowed, we would end up with a variant of each WAM
instruction for each context, or an even larger number of mode bits.
di/4:
• Lc1:  sub    A1, ' ', T, SC        ;match constant A1 to blank
•       if     value ≠, fail         ;fail if not blank
• Lv2:  store  A2, 0, const:[ ]      ;bind variable A2 to nil
                                     ;(overwrite the binding
                                     ;in the next clause
                                     ;if we failed earlier in Lc1)
• Lc3:  sub    A3, 'scan', T, SC     ;match constant A3 to 'scan'
•       if     value ≠, fail         ;fail if not 'scan'
•       nop                          ;fall through to next arg
• Lv4:  add    CP, 0, P              ;start the proceed
•       store  A4, 0, const:'scan'   ;bind variable A4 to 'scan'
•       nop                          ;needs only one nop
4. ESTIMATING THE PERFORMANCE OF THE LOW RISC 
4.1. A Proposed LOW RISC Implementation 
Any estimate of the LOW RISC'S performance will be dependent on the implementa- 
tion chosen. The LOW RISC system and processor described here explain the 
assumptions used to estimate performance. Figure 1 shows a LOW RISC system with 
an instruction cache, a data cache, and an attached processor. 
FIGURE 1. A LOW RISC system. [Block diagram: the LOW RISC with its program memory and
data memory, the attached processor with its program memory and data memory, an
instruction cache, and a data cache, connected by the data I/O bus, the
address/instruction I/O bus, and the attached-processor busses.]
4.1.1. Instruction Cache. The instruction cache takes advantage of the locality of
reference exhibited by PROLOG procedures during head unification and tail 
recursion. The LOW RISC clause compiler produces in-line code for the head of a 
clause consisting of multiple blocks of three to ten instructions, all linked by 
forward references. An instruction cache that prefetched a four-word block would 
allow the LOW RISC to execute the head of a PROLOG clause with few misses.
Cache misses would occur when a goal was called, and at the termination of a 
clause. 
4.1.2. Data Cache. The data cache supports memory references into the heap,
the trail, and the local stack. (The choice-point stack is kept separately, and placed
in the window register file.) The data cache is also used to communicate to an
attached processor. The attached processor performs system functions such as
floating-point arithmetic that cannot be performed by the LOW RISC. Because both
the instruction cache and the data cache are small compared to the main memory,
they may be constructed of memory fast enough to allow the LOW RISC to run at full
speed as often as the cache miss rate allows.
FIGURE 2. A LOW RISC processor. [Block diagram: global tag file, windowed tag file,
global value file, windowed value file, current window, PC, and control signal
generation. Legend: PC bus; tag 1 and value 1 buses (read); tag 2 and value 2 buses
(read); tag and value buses (read/write); immediate bus (read); data I/O bus
(read/write); instruction I/O bus (read/write); instruction/address bus (write);
control signals.]
4.1.3. The LOW RISC Processor. The LOW RISC processor shown in Figure 2 is a 
Harvard bus machine [19]. The three pipelined arithmetic/logic units (ALUs) 
operate in parallel. Each ALU has a specific function: the tag ALU selects the 
destination tag and determines its type; the value ALU determines the destination 
value, calculates the address of operands, or performs register arithmetic with the 
program counter; the address ALU increments the program counter and calculates 
branch addresses from the immediate offsets in if and switch instructions. Only the 
value ALU performs both addition and subtraction. If a tagged cell is being 
constructed from two registers (add Rn = 0), then the value ALU simply selects the
second operand’s value. The output of the value ALU or the result of a load
overrides the address ALU output if the destination is the program counter.
FIGURE 3. Internal operation of the LOW RISC processor. [Timing diagram of four
overlapped instructions (instruction 1 through instruction 4) and the instruction
fetch across: tag 1 bus, tag 2 bus, tag ALU, tag bus, value 1 bus, value 2 bus, value
ALU, value bus, PC bus, PC immediate bus, address ALU/multiplexer,
instruction/address bus, instruction decode, data I/O bus, and instruction I/O bus.]
Execution is overlapped, allowing two instructions to be dispatched per major
clock cycle. The clock cycle is divided into four minor cycles internally. Every
instruction is decoded into all possible immediate fields, so this decoding does not
depend on the instruction type. After the instruction type is determined, the correct
immediate value or values are selected. They are then multiplexed onto either the
value 2 bus (add, sub, load, store) or the immediate bus (if, switch, hook).
4.1.4. LOW RISC Instruction Cycles. The internal operation of the LOW RISC is 
shown in Figure 3. The overlapping fetch-and-execute is similar to other pipelined 
processors [18]. If the major clock cycle (cp) of the LOW RISC is assumed to be 100 
nanoseconds (ns), then an instruction will be executed every 50 ns after the pipe 
fills, regardless of pipeline breaks due to branches. However, if the break cannot be 
filled with a useful instruction, then the instruction executed must be a nop to avoid 
incorrect operations. 
4.2. Code Size 
Code size is easier to estimate than execution speed, because its variation is due to 
unexpected uses of the instruction set rather than errors in the design of the 
architecture or limitations in manufacturing technology. The code size of the LOW 
RISC is compared with the code size of the WAM described in [39]. Table 1 estimates 
the relative code size of an ideal program. WAM instructions are compared with WAM 
instructions open-coded with the LOW RISC instruction set. All instructions listed are 
encoded in this paper in a programming example (see Table 1 footnotes). The 
dynamic frequency of execution for each instruction is taken from analysis of the 
PLM as reported in [9]. 
The LOW RISC code sequences follow Warren’s description of the operation of the 
WAM, and are virtually identical to the operation of the PLM [39]. Deviations are due 
to experience gained writing WAM emulators as reported in [23]. Where an instruc- 
tion’s encoding depends on a particular mode (read or write) or data type (bound or 
unbound), each fragment is coded as a separate block in an in-line sequence linked 
by jumps. 
The code size estimates in Table 1 are calculated by converting instruction 
lengths to bits, then using each instruction’s dynamic frequency to determine the 
number of bits contributed by the instruction to an ideal program. The frequency is 
a percentage, but because not all WAM instructions are listed, the frequencies do not 
add up to 100 percent. What is important is that each frequency represents the 
expected percentage of occurrence for a WAM instruction over a wide variety of 
programs. An ideal program has exactly that percentage of occurrence of WAM 
instructions. 
Macro-expanding the WAM instructions in this ideal program with the LOW RISC 
instruction set makes the ideal program nearly seven times larger. One approach to 
this loss of code compactness is to mix emulator code with open code, open-coding only
unit-clause databases and low-level primitives, e.g., append/3 and member/2.
TABLE 1. Comparison of code size.

                                                                  LOW RISC open-coded
                                      Berkeley PLM                 WAM instructions
                                                  Bits ×                       Bits ×
Instruction           Frequency   Bytes   Bits   Frequency   Words   Bits   Frequency
put_value (a)            10.7       2      16       171         1      32       342
unify_variable X (b)      8.8       2      16       141                         704 (c)
  read                                                          1      32
  write                                                         4     128
get_list (b)              7.3       2      16       117        13     416      3037
unify_value X (b)         4.9       2      16        78                         627 (c)
  read                                                          4     128
  write                                                         4     128
switch_on_term (d)        4.8       9      72       346         1      32       154
get_structure (a)         4.1       5      40       164        18     576      2362
execute (b)               4.0       1       8        32         1      32       128
get_variable X (a)        3.4       2      16        54         1      32       109
call (a)                  2.0       5      40        80         4     128       256
put_variable X (a)        1.8       2      16        29         4     128       230
get_value X (d)           1.4       2      16        22         4     128       179
Total                                              1234                        8128
Relative code size                                  1.0                         6.6

(a) Encoding found in Section 2.2.
(b) Either read or write encoding found in Section 2.5.
(c) This figure is the average of the read and write instruction lengths.
(d) Encoding found in Section 2.4.
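The arithmetic behind Table 1 can be checked with a short Python sketch. The row data are
copied from the table; the variable and function names are illustrative only, and rounding
each product to the nearest integer is an assumption that happens to reproduce the printed
column values.

```python
# (instruction, dynamic frequency in percent, PLM bits, LOW RISC bits)
# For unify_variable and unify_value the LOW RISC length is the average of
# the read and write encodings, as in footnote (c) of Table 1.
TABLE_1 = [
    ("put_value", 10.7, 16, 32),
    ("unify_variable", 8.8, 16, (32 + 128) / 2),
    ("get_list", 7.3, 16, 416),
    ("unify_value", 4.9, 16, (128 + 128) / 2),
    ("switch_on_term", 4.8, 72, 32),
    ("get_structure", 4.1, 40, 576),
    ("execute", 4.0, 8, 32),
    ("get_variable", 3.4, 16, 32),
    ("call", 2.0, 40, 128),
    ("put_variable", 1.8, 16, 128),
    ("get_value", 1.4, 16, 128),
]

def weighted_bits(rows, column):
    """Sum of bits x frequency, each product rounded as in the table."""
    return sum(round(row[column] * row[1]) for row in rows)

plm_total = weighted_bits(TABLE_1, 2)        # Berkeley PLM column
low_risc_total = weighted_bits(TABLE_1, 3)   # LOW RISC open-coded column
relative_size = round(low_risc_total / plm_total, 1)
```

Under these assumptions the totals come out to 1234 and 8128, and the ratio to 6.6, matching
the last two rows of the table.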
4.3. Execution Speed 
In this section estimated execution times of LOW RISC code are compared with 
execution times for the equivalent WAM instructions on a microcoded WAM engine, 
the Berkeley PLM [9]. The range of execution speeds of the LOW RISC in different 
situations is derived, and the performance of a PROLOG metacompiler on the LOW 
RISC system estimated. 
There are three reasons why I believe that the estimates presented in this section 
are sound. First, the LOW RISC instruction set is the basis for Applied Logic System’s 
PROLOG WAM emulator. A recent report indicates that this PROLOG can attain 
speeds in excess of 14,000 LIPS on iAPX-286 personal computers [4]. Next, the LOW 
RISC system proposed does not place unrealistic strains on the limits of technology 
by proposing the use of large amounts of very fast memory, or extremely complex 
and large VLSI designs: in fact, the system proposed could be built using bit-slice 
microprocessors in a fashion similar to the PLM prototype. Finally, other RISC 
processors have met their design goals or exceeded them [17,41] and demonstrated 
execution speeds faster than those of the complex instruction set computers (CISCs) 
they were compared with. 
4.3.1. Ideal Machine Time. Table 2 shows the ideal execution times for the PLM 
and macro-expanded WAM instructions on the LOW RISC. The instruction cycle time 
of the PLM (cp) is 100 ns [9]. The LOW RISC is assumed to have the same basic cycle 
time as the PLM, but its observed cycle time is only 50 ns because the LOW RISC's 
internal pipeline is always assumed to be full (realistic because of the use of delayed 
branches). Thus, the instruction fetch overlaps the execution of the previous 
instruction, causing the observed cycle times for emulator routines and open-code 
sequences to be half of the 100-ns basic cycle time.
The instruction cycle times of the PLM and LOW RISC macro expansions are
normalized using the instruction frequency [time (ns) × frequency]. The result is the
time that the ideal program would spend executing all occurrences of each
instruction. The ideal machine time is the total of all instructions’ normalized times.
The ideal machine time does not reflect constraints imposed by memory access times or,
for the PLM, prefetching, write buffering, and choice-point caching.

TABLE 2. Comparison of ideal execution speed.

                                                                  LOW RISC open-coded
                                      Berkeley PLM                 WAM instructions
                                          Time   Time (ns) ×            Time   Time (ns) ×
Instruction           Frequency   Cycles  (ns)   Frequency     Cycles   (ns)   Frequency
put_value (a)            10.7        2     200      2140          1       50       535
unify_variable X (b)      8.8                       3520 (c)                      1100 (c)
  read                               5     500                    1       50
  write                              3     300                    4      200
get_list (b)              7.3                       3650 (c)                      2008 (c)
  bound                              3     300                    4      200
  unbound                            7     700                    7      350
unify_value X (b)         4.9                       5640 (c)                      3308 (c)
  read                              20    2000                   23     1150
  write                              3     300                    4      200
switch_on_term (a)        4.8        5     500      2400          2      100       480
get_structure (a)         4.1                       2870 (c)                      1640 (c)
  bound                              6     600                    9      450
  unbound                            8     800                    7      350
execute (b)               4.0        1     100       400          2      100       400
get_variable X (a)        3.4        2     200       680          1       50       170
call (a)                  2.0        1     100       200          2      100       200
put_variable X (a)        1.8        4     400       720          4      200       360
get_value X (d)           1.4       21    2100      2940         23     1150      1610
Ideal machine time                                 27080                         13851
Relative execution speed                             1.0                           1.9

(a) Encoding found in Section 2.2.
(b) Either read or write encoding found in Section 2.5.
(c) This figure is the average of the read/write or bound/unbound execution times.
(d) Encoding found in Section 2.4.
4.3.2. Real Machine Time. A second time is introduced to allow for these 
constraints. This is the real machine time, and is calculated using different assump- 
tions for each category. For the PLM, all features of the design are included: 
prefetch, memory fetch, write buffer, and choice-point cache. The net effect is to 
slow the PLM by a factor of 0.75. 
The assumptions used to calculate the LOW RISC macro expansions’ execution 
speed are based on the cache behavior. This paper presents no evidence for cache 
performance, which is a topic for research, although Tick has studied memory 
referencing behavior [35, 36],6 and Ross and Ramamohanarao have examined pag- 
ing strategies for PROLOG [30]. Instead, the cache hit rate will be treated as an 
unknown percentage (denoted by α for both caches; the miss rate is 1 − α), and the
effects of three different hit rates calculated. Although it may seem naive to assign 
the same hit ratio to both the instruction and the data cache, the frequency of short 
jumps will lower the instruction hit rate, while the prevalence of shallow backtrack- 
ing (which causes the same data structures to be referenced frequently) will raise the 
data hit rate. Even without this assumption the two hit ratios are within 10% of each 
other. 
Examination of the LOW RISC code shows that between 25 and 39 percent of the 
instructions make memory data references. Assuming the higher figure, the proces- 
sor would be slowed on 39 percent of the instructions executed. If the data cache 
requires at least three nonoverlapped cycles before data become available, then a 
time-frequency factor of 300 × 39 × (1 − α) must be added to the ideal machine
time to account for data delays. 
The frequency figures in [9] indicate that 27 percent of the WAM instructions are 
likely to cause an instruction cache miss due to transfer of control if the numerous 
small branches in LOW RISC open code are included. The delayed-branch feature of 
the LOW RISC allows these instruction cache misses to be partially hidden in the 
pipeline break. During the pipeline break the processor continues to execute 
instructions in the delay slots from the instruction cache. Assume that the instruc- 
tion cache requires three nonoverlapped cycles before instructions from the 
branched-to routine become available. If the LOW RISC can execute two overlapped 
instructions during this time, then one of the three cache fill cycles will be hidden. 
The LOW RISC will be delayed by 200 ns while the cache is filled. The result is that a 
time-frequency factor of 200 × 27 × (1 − α) must be added to the ideal machine
time. The total delay for macro-expanded WAM instructions in the ideal program
can be reduced to the time-frequency factor of [(300 × 39) + (200 × 27)] × (1 − α),
which is equal to 17100 × (1 − α). Performance for cache hit rates of 70, 80, and 90
percent is shown in Table 3. In general, the results indicate that LOW RISC code is
about twice as fast as the PLM code. 
6Tick has accomplished a thorough study of the WAM architecture, including caching, which is 
described in [37]. 
TABLE 3. Comparison of real machine-time execution speeds.

                                                            LOW RISC open-coded
                                        Berkeley PLM         WAM instructions
Ideal machine time (from Table 2)          27080                 13851
“Real” machine time (α = 0.70)             36016                 18981
Relative execution speed                     1.0                   1.9
“Real” machine time (α = 0.80)             36016                 17271
Relative execution speed                     1.0                   2.0
“Real” machine time (α = 0.90)             36016                 15561
Relative execution speed                     1.0                   2.3
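The LOW RISC entries in Table 3 follow from the ideal machine time and the cache-delay factor
derived in Section 4.3.2. The Python sketch below restates that calculation; the constant and
function names are illustrative assumptions, and the numeric inputs are the figures quoted in
the text (39 percent data references at 300 ns, 27 percent control transfers at 200 ns).

```python
IDEAL_LOW_RISC_TIME = 13851   # ideal machine time from Table 2

DATA_DELAY_NS = 300           # three nonoverlapped cycles per data-cache miss
DATA_REFERENCE_PCT = 39       # instructions that reference memory data
INSTR_DELAY_NS = 200          # three fill cycles, one hidden by delay slots
CONTROL_TRANSFER_PCT = 27     # instructions likely to miss the instruction cache

def low_risc_real_time(hit_rate):
    """Ideal time plus the cache-miss penalty for a common hit rate."""
    penalty = (DATA_DELAY_NS * DATA_REFERENCE_PCT
               + INSTR_DELAY_NS * CONTROL_TRANSFER_PCT)   # 17100
    return round(IDEAL_LOW_RISC_TIME + penalty * (1 - hit_rate))

real_times = {alpha: low_risc_real_time(alpha) for alpha in (0.70, 0.80, 0.90)}
```

With these inputs the sketch yields 18981, 17271, and 15561 for hit rates of 0.70, 0.80, and
0.90, the values listed in the LOW RISC column of Table 3.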
4.3.3. Effects of Optimization. The results in Table 3 do not show the effects of
optimizing clauses for dereference or trailing, or using open-coded PROLOG 
procedures such as the Determinate Concatenate. The effects of these optimizations 
on the performance of a real program are now examined. The program, a PROLOG 
metacompiler, was instrumented during the design of the LOW RISC architecture. 
The metacompiler translates various grammars into PROLOG parsers. It uses a
deterministic finite automaton encoded as a unit-clause database. An example of a 
clause in this database (which we saw earlier in Section 2.6, A Unit Clause) has the 
form 
di(input_token, output_token, this_state, next_state).
No clauses in the database have variables except the last, which is a “trap” clause. 
The metacompiler was instrumented to profile the use of every clause. The results 
for this program showed that the program spent 25 percent of its execution time 
appending lists and 17 percent of its time shallow backtracking in the unit database. 
The remainder of the time was divided between system predicates and user goals. 
The examples of Sections 2.5, Determinate Concatenate, and 2.6, A Unit Clause, 
can be used to estimate the performance of the metacompiler. The Determinate 
Concatenate’s inner loop executes in 14 LOW RISC instructions. Since this is the 
heart of an append/3 in PROLOG, it is reasonable to approximate the execution 
time of append/3 with the execution time of the Determinate Concatenate’s inner 
loop. The di/4 procedure expects its arguments to dereference to either constants or 
unbound variables, allowing dereferencing, trailing, and type-declaration
optimizations to be applied. Applying these optimizations results in the LOW RISC code
shown in Section 3.3.1, which executes nine instructions. 
Table 4 shows the ideal execution speed for the Berkeley PLM and the optimized 
LOW RISC routine, and the difference in their execution speeds. The figures in Table 
4 do not account for the performance losses caused by “diluting” an optimized 
routine in an otherwise unoptimized program, then running it on a real machine. 
Table 5 shows the dilution effect applied to the metacompiler program. The “all 
other” routines are emulated, and assumed to run at 200,000 LIPS. On the LOW RISC 
they are encoded using a LOW RISC WAM emulator (78 percent), and are roughly four 
times the size of corresponding WAM code (one emulated instruction requires 32 bits 
on the LOW RISC, as opposed to 8 bits on the WAM). 
TABLE 4. Effects of optimization (ideal machine time)

                                    Berkeley PLM                        LOW RISC
                                     Execution   Speed                  Execution   Speed        Performance
                                       time      (10^3                    time      (10^3        difference,
Routine                     Cycles     (ns)      LIPS)        Cycles      (ns)      LIPS)        Δ speed
Determinate Concatenate       32       3200      425 (a)        14         700      1428           +3.4
(append/3)
di/4                          49       4900      204             9         450      2222          +10.8

(a) Benchmark uses cdr-coded lists.
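Most of the speed entries in Table 4 follow directly from the cycle counts and cycle times.
The Python sketch below, whose function name is illustrative, assumes one logical inference
per pass through the routine, a 100-ns PLM cycle, and an observed 50-ns LOW RISC cycle, and
truncates to thousands of LIPS as the table appears to do. The PLM figure for the Determinate
Concatenate comes from a cdr-coded benchmark and is not reproduced by this formula.

```python
def kilo_lips(cycles, ns_per_cycle):
    """Thousands of logical inferences per second for one inference per pass."""
    execution_time_ns = cycles * ns_per_cycle
    return 10**9 // execution_time_ns // 1000

low_risc_append = kilo_lips(14, 50)   # Determinate Concatenate inner loop
low_risc_di4 = kilo_lips(9, 50)       # optimized di/4 of Section 3.3.1
plm_di4 = kilo_lips(49, 100)          # di/4 on the Berkeley PLM
```

These evaluate to 1428, 2222, and 204 thousand LIPS respectively, in agreement with the table.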
TABLE 5. Effects of optimization (real machine time)

                       Berkeley PLM                            LOW RISC
             Execution-                Speed       Execution-                Speed      Performance
             time          Speed       increase    time          Speed       increase   difference,
Routine      decrease (%)  (10^3 LIPS) (%)         decrease (%)  (10^3 LIPS) (%)        Δ speed
append/3         25           319         78           25           1128        282
di/4             17           153         26           17           1755        298
All others       58           150         87           58            200        116
Total                                    191                                    696        +3.6
Of the optimized routines, only the di/4 procedure will contribute significantly 
to the program’s size; the append/3 is so small that its size is not significant. If the 
PLM get_constant instruction encodes constants in 16 bits, then the optimized LOW
RISC version of di/4 (Section 3.3.1) is only 3.6 times larger than the WAM version of 
di/4 (Section 2.6.1), i.e., 36 bytes for the LOW RISC compared to 10 bytes for the 
PLM. Thus the code size increases only (0.78 X 4) + (0.22 X 3.6), or 3.9 times over the 
PLM, instead of 6.6 times. 
Because real machine time is assumed, all LIPS rates are slowed from the ideal
figures shown in Table 4. The PLM is slowed by 25 percent overall. The LOW RISC is 
assumed to run with instruction and data-cache hit rates of 80 percent, thus slowing 
the append/3 routine and the optimized di/4 routines by 21 percent; the LOW RISC 
WAM emulator is not slowed, because its code is in read-only memory (see Figure 1). 
The execution speed increases 3.6 times over the PLM. 
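The dilution calculation can be restated as a short Python sketch. The names are illustrative;
the inputs are the profile shares, the ideal speeds of Table 4, the 21 percent slowdown, and
the 200,000 LIPS emulator speed quoted above. The PLM total of 191 is taken directly from
Table 5 rather than recomputed, and rounding each intermediate value to an integer is an
assumption that reproduces the printed LOW RISC column.

```python
CACHE_SLOWDOWN = 0.21   # applied to the open-coded routines at an 80% hit rate
PLM_TOTAL = 191         # total for the Berkeley PLM, from Table 5

# (share of execution time, real-machine speed in thousands of LIPS)
low_risc_profile = [
    (0.25, round(1428 * (1 - CACHE_SLOWDOWN))),   # append/3
    (0.17, round(2222 * (1 - CACHE_SLOWDOWN))),   # di/4
    (0.58, 200),                                  # emulated routines, not slowed
]

def diluted_speed(profile):
    """Sum of each routine's speed weighted by its share of execution time."""
    return sum(round(share * speed) for share, speed in profile)

low_risc_total = diluted_speed(low_risc_profile)
speedup = round(low_risc_total / PLM_TOTAL, 1)

# Blended code size: 78% emulated at 4x, 22% optimized open code at 3.6x.
relative_code_size = round(0.78 * 4 + 0.22 * 3.6, 1)
```

Under these assumptions the LOW RISC total is 696 thousand LIPS, the speedup over the PLM is
3.6, and the blended code size is 3.9 times that of the PLM.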
5.1. Advantages 
The LOW RISC should execute PROLOG quickly. Optimized PROLOG procedures 
are estimated to run in excess of 1,700,000 LIPS, and typical procedures at 200,000
to 700,000 LIPS.
The LOW RISC architecture offers a simple instruction set tailored to PROLOG. 
WAM instructions, PROLOG goals, and PROLOG procedures can be coded effi- 
ciently with LOW RISC instructions. Optimizations not possible with the WAM can be 
applied with the LOW RISC instruction set. 
The LOW RISC instruction set allows a PROLOG implementation to keep pace 
with evolving compiler technology. Enhancements can be added as improvements to 
the compiler rather than as hardware or microcode changes.7
5.2. Disadvantages 
LOW RISC code is larger than WAM code. Unoptimized open-coded WAM instructions 
are nearly seven times the size of a microcoded WAM’s instructions. Unoptimized
LOW RISC code contains many conditional branches and subroutine calls. Trace 
scheduling [12] may be of value to reduce branching, and to fill pipeline breaks 
where branches are unavoidable. 
The LOW RISC requires a more complex compiler than a Warren-Tick machine. 
To obtain the most benefit from the instruction set a number of optimizations must 
be performed, including code rearrangement to fill pipeline breaks, unification,
strength reduction, and dereference and trail optimization. 
5.3. Further Work 
Research into LOW RISC compiling techniques has led to the construction of a clause 
compiler that produces reasonable code, but the optimizations described in this 
paper require globally determined attributes. Continuing the investigation of 
PROLOG-to-Low RISC compilation will require us to produce a procedure com- 
piler. 
The next step is to build a LOW RISC prototype,8 and evaluate a complete series
of PROLOG benchmarks using it. Even though the estimates were carefully done, 
unknown factors will affect performance. Making the LOW RISC real is necessary to 
prove the validity of an architecture which I believe is significant for high-perfor- 
mance PROLOG implementations. 
APPENDIX. LOW RISC ARCHITECTURE 
The design and implementation of two WAM emulators influenced the LOW RISC 
architecture. The first emulator, designed in 1984 at Argonne National Laboratory, 
emphasized the use of segmented memory [23]. The second emulator, which is used 
in ALS PROLOG, was specified in terms of RISC-like primitive operations [22],
which were simplified into this seven-instruction architecture. 
The LOW RISC is a tagged, register-oriented Von Neumann machine. Usage 
distinguishes instructions from data in memory. It is possible to place data in 
memory and then execute them as a LOW RISC program, or treat a program as a data 
object. 
7Joachim Beer’s proposed extensions to the WAM instruction set that avoid occurs checking are a
good example.
8Michael Burroughs, Robert Wehrmeister, Dr. David Winkel and I are reducing the LOW RISC design 
to a circuit specification for programmable array logic. Work on the prototype is supported in part by an 
NSF grant to Indiana University under the CER program: A Conduit from Theory to Practice, DCR 
85-21497. 
Both registers and memory are tagged cells. A general-purpose register file, 
distinct from memory, provides fast access to a relatively small number of data 
objects. 
A.1. Tagged Cells
Tagged cells are 32 bits wide. Each cell is divided into a three-bit tag field and a 
29-bit value field. 
A.1.1. Instructions. A tagged cell pointed to by the LOW RISC instruction counter
will be interpreted as an instruction. Seven instructions are defined. Their operation 
is described in Section A.4, Instruction Set. 
A.1.2. Data Tags. Eight tag values are possible. Of the eight, only binary value
000 is meaningful in the architecture. An object with this tag value has the type 
“bound variable”, and is interpreted as a pointer reference to another tagged cell in 
the memory space. The interpretation of the classes denoted by the remaining tags is 
undefined in the architecture.
The binary tag values 001 and 100 each define a class composed of only a single 
type of object. 
The binary tag values 010 and 011 define two types of objects grouped into the 
class A with an implicit generic type. The binary tag values 101, 110, and 111 define 
three object types grouped into the class B with an implicit generic type. Cells 
tagged with any of these values may be categorized generically by class, specifically 
by object, or hierarchically by object within class. 
A.1.3. Data Values. The value field may be interpreted as either an unsigned or a
twos-complement integer. Unsigned integers range in value from 0 to 536,870,911.
Twos-complement integers range from −268,435,456 to +268,435,455.
If the binary tag value is 000, then the value field will be interpreted as an 
unsigned integer, and treated as a memory address. The architecture assigns no 
meaning to the value field for all other tags. 
A.2. Memory Organization and Addressability 
Memory is organized as a vector of tagged cells, with an address associated with 
each cell. Address values range from 0 to 2^29 − 1, yielding a 512-megaword address
space. The address range is limited by the width of the value field. 
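The cell layout of Section A.1 can be illustrated with a few lines of Python. This is a sketch
only: the function names are hypothetical, and placing the three-bit tag in the high-order bits
of the 32-bit cell is an assumption, since the bit positions of the two fields are not stated
here.

```python
TAG_BITS = 3
VALUE_BITS = 29
VALUE_MASK = (1 << VALUE_BITS) - 1   # largest unsigned value and address

def pack(tag, value):
    """Build a 32-bit cell; assumes the tag occupies the top three bits."""
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("tag must fit in three bits")
    return (tag << VALUE_BITS) | (value & VALUE_MASK)

def tag_of(cell):
    return cell >> VALUE_BITS

def unsigned_value(cell):
    return cell & VALUE_MASK

def signed_value(cell):
    """Interpret the 29-bit value field as a twos-complement integer."""
    value = cell & VALUE_MASK
    if value >> (VALUE_BITS - 1):     # sign bit of the value field is set
        return value - (1 << VALUE_BITS)
    return value
```

For example, VALUE_MASK equals 536,870,911, the upper end of both the unsigned range and the
address space, and signed_value recovers −268,435,456 and +268,435,455 as the extremes of the
twos-complement range.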
A.3. Register Organization 
A visible set of LOW RISC registers is shown in Figure 4. There are four groups of 
registers: seven control registers, five global registers, one status register, and 115 
general-purpose registers, of which only 19 are visible at any time. 
A.3.1. Control Registers. There are seven control registers. The control registers 
are hardwired with the tag values shown in Figure 4. Five of these registers 
(B, H, HB, P, and TR) correspond to the registers in the WAM. 
FIGURE 4. A visible set of registers (entire general-purpose register file not shown).
[Register diagram: control registers with their hardwired tags, B (000), H (001),
HB (010), C (011), TR (100), W (101), and P (110); the status register (111); and
registers R0 through R18.]
The C register, or cutpoint register, has been added. This allows cut to be 
implemented as described by Bowen et al. [3]. 
The W, or window pointer register, replaces the WAM environment register (E). 
The W register is used as a base register for a variable-sized register window. The 
operation of the windowed register file is described in [17]. An example of the use of 
the windowed register file is given in Section 2.3, Unification, in this paper. A 
variable-sized register window is important to a logic-programming design, since the 
number of arguments varies substantially between calls. Matsumoto’s static analysis 
of 15 PROLOG programs (including Chat-80) indicated that while 73 percent of all 
goals had arity less than four and that 94 percent of all goals had arity less than 
seven, the arity could increase beyond 10 [20]. 
No single register is dedicated to the Warren CP register. Instead, the first two 
registers in the current window are used as the CP (continuation pointer) and CW 
(continuation window pointer) registers (see Section A.3.4, General-Purpose Registers).
No S (structure pointer) register is provided in the LOW RISC. When necessary, 
the S register is allocated from the current window. 
A.3.2. Status Register. The LOW RISC status register reflects the type of the most
recent results for which SC was selected, and the relationship between the two tagged 
cells used to produce the result. Tag and value flags are separate. The status register 
is hardwired with the tag value 111. 
A.3.2.1. Tag Flags. There are 12 tag flags, separated into three groups:
(1) Eight object flags record the individual type of the result.
(2) Two class flags record the generic type of the object (membership in class A
or class B).
(3) Two arithmetic flags (higher-than, equal-to) record the relation of the two
operands’ tags processed by the ALU.
A.3.2.2. Value Flags. There are two value flags (higher-than, equal-to) used to 
record the relation of the two operands’ values processed by the ALU. 
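The flag groups above can be modeled in a few lines. The following Python sketch is only an illustration: the names StatusFlags and set_condition are hypothetical, the class A and class B tag sets are placeholders (this appendix does not list their members), and "higher-than" is assumed to mean that operand 1 is higher than operand 2.

```python
from dataclasses import dataclass

# Hypothetical class membership: this appendix does not list which tags
# fall in class A or class B, so these sets are placeholders only.
CLASS_A_TAGS = frozenset({0b010, 0b011})
CLASS_B_TAGS = frozenset({0b101, 0b110})


@dataclass(frozen=True)
class StatusFlags:
    object_flags: tuple  # eight flags, one per 3-bit tag value
    class_a: bool
    class_b: bool
    tag_higher: bool     # assumed: operand-1 tag higher than operand-2 tag
    tag_equal: bool
    value_higher: bool   # assumed: operand-1 value higher than operand-2 value
    value_equal: bool


def set_condition(result_tag, operand1, operand2):
    """Model an instruction executed with SC selected.

    operand1 and operand2 are (tag, value) pairs; result_tag is the
    3-bit tag of the result.
    """
    tag1, value1 = operand1
    tag2, value2 = operand2
    return StatusFlags(
        object_flags=tuple(result_tag == t for t in range(8)),
        class_a=result_tag in CLASS_A_TAGS,
        class_b=result_tag in CLASS_B_TAGS,
        tag_higher=tag1 > tag2,
        tag_equal=tag1 == tag2,
        value_higher=value1 > value2,
        value_equal=value1 == value2,
    )
```

The model holds 8 + 2 + 2 = 12 tag flags and two value flags, matching the counts given above.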
A.3.3. Global Registers. Five global registers are provided. They are always 
visible, and do not have hardwired tags. 
A.3.4. General-Purpose Registers. 115 general-purpose registers are provided. 
Only 19 are visible at one time. The 19 are selected by placing a value in the W 
register. Since the W register holds a complete 29-bit address, the register window is 
treated as 19 fast memory locations. Changing the W register will change the current 
window in the register file. Attempting to select a window whose registers are not in 
the register file causes registers to be saved to or restored from memory. The address 
in register W is automatically adjusted to point to the new window. 
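The windowing behavior can be illustrated with a small software model. This Python sketch is a deliberate simplification and not a description of the hardware: it assumes the register file mirrors one contiguous range of 115 memory cells, that register r of the current window corresponds to address W + r, and that a miss saves and reloads the whole file. The class name WindowedRegisterFile and its methods are hypothetical.

```python
FILE_SIZE = 115    # general-purpose registers in the file
WINDOW_SIZE = 19   # registers visible at one time


class WindowedRegisterFile:
    def __init__(self, base=0):
        self.memory = {}              # backing store: address -> cell
        self.base = base              # lowest address mirrored in the file
        self.cells = [0] * FILE_SIZE
        self.w = base                 # window pointer (a memory address)

    def set_window(self, w):
        """Select the window starting at address w, spilling on a miss."""
        inside = self.base <= w and w + WINDOW_SIZE <= self.base + FILE_SIZE
        if not inside:
            # Assumed policy: save every register, then reload from w.
            for i, cell in enumerate(self.cells):
                self.memory[self.base + i] = cell
            self.base = w
            self.cells = [self.memory.get(w + i, 0) for i in range(FILE_SIZE)]
        self.w = w

    def _index(self, r):
        if not 0 <= r < WINDOW_SIZE:
            raise IndexError("register is not visible in the current window")
        return self.w - self.base + r

    def read(self, r):
        return self.cells[self._index(r)]

    def write(self, r, value):
        self.cells[self._index(r)] = value
```

Because W is an ordinary address, two windows may overlap by any number of registers: in this model, register 10 of the window at address 100 is the same cell as register 0 of the window at address 110.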
A.4. Instruction Set 
The LOW RISC has seven instructions, although some instructions are “overloaded” 
to serve multiple purposes. The primitives necessary to implement a WAM can be
written with this instruction set, as can a number of PROLOG predicates. Some 
functions have been omitted (signed arithmetic, bitwise logical operations), but 
could be added either directly to the instruction set, or implemented on a coproces- 
sor and accessed using the hook instruction. Special purpose coprocessors 
(floating-point arithmetic, unification, relational-database machines) can be used to 
augment the LOW RISC instruction set with the hook instruction. 
A.4.1. Data Transfer and Arithmetic Instructions. This group of instructions 
transfers data between memory and the registers, and between registers. The add
and sub instructions perform tag manipulation as well as arithmetic. Only unsigned integer
addition and subtraction are supported. 
Load and store instructions are delayed. The result of a load or a store is not 
available to the instruction following the load or store. 
Load, store, add and sub have the same types and restrictions on their operands: 
(1) source1 must be a register. 
(2) “=0” is optional. If used, the value of source1 will be ignored, and zero
substituted. 
(3) source2 may be a register or an immediate data value. Immediate data may 
be encoded in source2 only. If tag is not specified, source2 is sign-extended as 
it is decoded from the instruction. The immediate value has a limited range. 
If it is interpreted as a twos-complement signed integer, it is in the range 
-8192 to 8191. If it is considered to be unsigned, it is in the range 0 to 16383. 
(4) tag is optional. If used, the upper three bits of source2 will override the 
destination tag. tag used with an immediate data value prevents source2 from 
being sign-extended. 
(5) dest must be a register. 
(6) SC, set condition, is optional. If present, it specifies that the result will set 
appropriate flags in the status register. 
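The two immediate ranges quoted in item (3) correspond to a 14-bit field. The Python sketch below shows the decoding under that assumption; the function name decode_immediate and its tag_override parameter are hypothetical, and the treatment of the upper three bits as a tag is not modeled.

```python
IMM_BITS = 14  # implied by the ranges -8192..8191 and 0..16383


def decode_immediate(field, tag_override=False):
    """Decode a raw 14-bit immediate field.

    Without the tag override the field is sign-extended as a
    twos-complement integer; with it, the field is left unsigned.
    """
    if not 0 <= field < (1 << IMM_BITS):
        raise ValueError("immediate does not fit in 14 bits")
    if tag_override or field < (1 << (IMM_BITS - 1)):
        return field
    return field - (1 << IMM_BITS)
```

For example, the bit pattern 0x3FFF decodes to -1 when sign-extended and to 16383 when the tag override suppresses sign extension.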
Table 6 shows how the “=0” and tag overrides affect the destination tag and value.
TABLE 6. Effects of overrides on destination tag and value.

                                          add, sub                 load, store
Operand 1   Operand 2   Operand 3    dest tag    dest value    dest tag    dest value
(source1)   (source2)   (dest)
r1          r2          r3           r1 op r2    r1 op r2      r3          r3
r1          r2          tag: r3      tag         r1 op r2      tag         r3
r1=0        r2          r3           r1          r2            r3          r3
r1=0        r2          tag: r3      tag         r2            tag         r3
r1          imm         r3           r1          r1 op imm     r3          r3
r1          imm         tag: r3      tag         r1 op imm     tag         r3
r1=0        imm         r3           r1          imm           r3          r3
r1=0        imm         tag: r3      tag         imm           tag         r3

A.4.1.1. load source1[=0], source2, [tag:]dest, [SC]. The value in the tagged cell
at the address calculated by summing source1 and source2 is placed in dest after any
overrides are applied. Arithmetic flags are not set by the address calculation, but
type flags are set by the value fetched.
A.4.1.2. store source1[=0], source2, [tag:]dest, [SC]. The value in dest is placed
in the tagged cell at the address calculated by summing source1 and source2 after
any overrides are applied. Arithmetic flags are not set by the address calculation,
but type flags are set by the value stored.
A.4.1.3. add source1[=0], source2, [tag:]dest, [SC]. The value in source2 is added
to source1, and the result, after applying overrides, is placed in dest. Arithmetic
flags are set by the addition; type flags are set by the type of the final result.
A.4.1.4. sub source1[=0], source2, [tag:]dest, [SC]. The value in source2 is
subtracted from source1, and the result, after applying overrides, is placed in dest.
Arithmetic flags are set by the subtraction; type flags are set by the type of the final
result.
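The interaction of the overrides with add can be sketched as follows. This Python model follows one reading of the add columns of Table 6 and rests on several assumptions that the text does not confirm: cells are modeled as (tag, value) pairs with a 3-bit tag and a 29-bit value, tag and value arithmetic wrap independently, and the overriding tag is passed in directly. The name add_cells is hypothetical, and sub is not modeled.

```python
TAG_MASK = (1 << 3) - 1      # assumed 3-bit tag
VALUE_MASK = (1 << 29) - 1   # assumed 29-bit value


def add_cells(r1, source2, zero_source1=False, tag_override=None):
    """Return the (tag, value) written to dest by an add.

    r1 is a (tag, value) cell. source2 is a (tag, value) cell for a
    register operand, or a plain int for an already decoded immediate.
    """
    tag1, value1 = r1
    if zero_source1:                 # the "=0" override: ignore r1's value
        value1 = 0
    immediate = isinstance(source2, int)
    tag2, value2 = (None, source2) if immediate else source2
    dest_value = (value1 + value2) & VALUE_MASK
    if tag_override is not None:     # "tag:" forces the destination tag
        dest_tag = tag_override & TAG_MASK
    elif immediate or zero_source1:  # Table 6 rows whose dest tag is r1
        dest_tag = tag1
    else:                            # register-register row: r1 op r2
        dest_tag = (tag1 + tag2) & TAG_MASK
    return dest_tag, dest_value
```

Under these assumptions, add with “=0” and an immediate behaves as a load-immediate into dest, and add with “=0” and a register operand behaves as a register-to-register move of the value.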
A.4.2. Control Instructions. Program execution is controlled by the if and switch 
instructions, and by add and load instructions using the program counter (P) as the 
destination operand. Jumps caused by if, switch, and add instructions are delayed 
one instruction cycle. Jumps due to load instructions are delayed an extra cycle 
while the target address is fetched from memory. 
A.4.2.1. if condition, offset24. The if instruction performs a conditional branch 
to an instruction in memory. The address of the instruction is specified by an offset 
relative to the program counter. The last result for which the SC (set condition) field 
was active will determine whether the branch is taken. The target address must be in 
the range -8,388,608 to +8,388,607. Unconditional branching is performed by
selecting the always condition. 
Branches may be taken on any of the following 43 conditions grouped into six 
major categories. Encodings are needed for only 31 tests. A condition for which 
there is no explicit test is either constructed from the switch instruction (indicated 
by italics), or, if shown in parentheses, is obtained by reversing the sense of the 
condition listed immediately to the left. 
(1) Object type of the last result:
      bound_variable          not_bound_variable
      tag_001                 not_tag_001
      tag_010                 not_tag_010
      tag_011                 not_tag_011
      tag_100                 not_tag_100
      tag_101                 not_tag_101
      tag_110                 not_tag_110
      tag_111                 not_tag_111
(2) Implicit class of the last result:
      class_A                 not_class_A
      class_B                 not_class_B
(3) Arithmetic relationship between the operand-1 tag and the operand-2 tag:
      tag1_higher             (tag2_lower_or_same)
      tag1_higher_or_same     (tag2_lower)
      tags_equal
      tags_not_equal
      tag1_lower_or_same      (tag2_higher)
      tag1_lower              (tag2_higher_or_same)
(4) Arithmetic relationship between the operand-1 value and the operand-2
value:
      value1_higher           (value2_lower_or_same)
      value1_higher_or_same   (value2_lower)
      values_equal
      values_not_equal
      value1_lower_or_same    (value2_higher)
      value1_lower            (value2_higher_or_same)
(5) Arithmetic relationship between operand 1 entire and operand 2 entire:
      cells_equal
      cells_not_equal
(6) Unconditional:
      always
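The six relational tests in categories (3) and (4) can all be derived from the two recorded flags, higher-than and equal-to. The Python sketch below illustrates the derivation; the function names are hypothetical, and treating "entire" cell equality as equality of both tag and value is an assumed reading rather than something the text states.

```python
def relation_conditions(higher, equal):
    """Derive six relational tests from the two recorded flags."""
    if higher and equal:
        raise ValueError("higher-than and equal-to cannot both be set")
    return {
        "higher": higher,
        "higher_or_same": higher or equal,
        "equal": equal,
        "not_equal": not equal,
        "lower_or_same": not higher,
        "lower": not (higher or equal),
    }


def compare(a, b):
    """Return the (higher-than, equal-to) flag pair for two unsigned fields."""
    return a > b, a == b


def cells_equal(tag_flags, value_flags):
    """Assumed reading: cells are equal when tags and values are both equal."""
    return tag_flags[1] and value_flags[1]
```

The same derivation applies unchanged to the tag flags and to the value flags, which is why only two flags of each kind need to be stored.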
A.4.2.2. switch register, 001offset4, 100offset6, Aoffset6, Boffset8. The switch
instruction selects an unsigned relative offset from the instruction word and adds
this offset to the program counter. If the tag value is 001, the four-bit offset is
selected; if 100, the first six-bit offset is selected; if the object is a member of class
A, the second six-bit offset is selected; and if a member of class B, the eight-bit
offset is selected. If the tag is a “bound variable” (binary 000), then the program
counter is incremented normally, and program execution continues at the next
instruction.
Since the relative offsets are of varying sizes, execution may continue within any 
one of five overlapping code areas following the switch. Unlike a case instruction, 
varying sized routines may be packed after the multiway branch without wasting 
space. 
The switch instruction is valuable, since it allows dereferencing to be combined 
with a switch-on-type instruction, can be used to encode short unconditional and 
some conditional branches, and even provides direct encoding of PROLOG predi- 
cates such as “var”, “not_var”, etc. 
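The selection performed by switch can be sketched as a small dispatch function. The Python model below makes several assumptions not fixed by the text: the offsets are added to the address of the switch itself, the program counter advances by one per instruction, and the class A and class B tag sets are placeholders. The names switch_target and OFFSET_BITS are hypothetical.

```python
# Field widths, in bits, of the four offsets packed into a switch instruction.
OFFSET_BITS = {"tag_001": 4, "tag_100": 6, "class_a": 6, "class_b": 8}

# Hypothetical class membership (not specified in this appendix).
CLASS_A_TAGS = frozenset({0b010, 0b011})
CLASS_B_TAGS = frozenset({0b101, 0b110, 0b111})


def switch_target(pc, tag, offsets):
    """Return the next program counter after a switch on the given tag."""
    for name, bits in OFFSET_BITS.items():
        if not 0 <= offsets[name] < (1 << bits):
            raise ValueError(f"{name} offset does not fit in {bits} bits")
    if tag == 0b000:                 # bound variable: fall through
        return pc + 1
    if tag == 0b001:
        return pc + offsets["tag_001"]
    if tag == 0b100:
        return pc + offsets["tag_100"]
    if tag in CLASS_A_TAGS:
        return pc + offsets["class_a"]
    if tag in CLASS_B_TAGS:
        return pc + offsets["class_b"]
    raise ValueError("tag belongs to no class")
```

The fall-through case and the four offsets give the five code areas mentioned above; because the offsets have different reaches (15, 63, 63, and 255 locations), shorter routines can be placed nearest the switch.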
A.4.2.3. load source1, source2, P. Loading the program counter allows an unconditional
branch to reach any location in the address space. Branch tables indexed by
a value or the result of a calculation can be implemented with this instruction. 
A.4.2.4. add source1, source2, P; sub source1, source2, P. These instructions
allow unconditional branching to any location in the address space. The “dest
value” column in Table 6 shows how four types of branch are possible:
(1) long relative if r1 = offset and r2 = P;
(2) short relative if r1 = P and imm = offset;
(3) based absolute if r1 = 0 and r2 = base, or r1 = base and imm = 0;
(4) based indexed if r1 = base and r2 = index, or r1 = index and imm = base.
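The four branch forms differ only in what is placed in the two source operands. The Python sketch below computes the new program counter for an add whose destination is P; it assumes a 29-bit address space with wraparound and ignores the one-cycle branch delay. The name branch_target is hypothetical.

```python
ADDRESS_MASK = (1 << 29) - 1   # assumed 29-bit address space


def branch_target(source1, source2, zero_source1=False):
    """New value of P after 'add source1[=0], source2, P'."""
    first = 0 if zero_source1 else source1
    return (first + source2) & ADDRESS_MASK


P = 0x20000
long_relative = branch_target(0x1000, P)                       # r1 = offset, r2 = P
short_relative = branch_target(P, -4)                          # r1 = P, imm = offset
based_absolute = branch_target(123, 0x3000, zero_source1=True) # r1 = 0, r2 = base
based_indexed = branch_target(0x3000, 5)                       # r1 = base, r2 = index
```

The short relative form is limited by the immediate range, while the long relative form can carry a full-width offset in a register.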
A.4.3. hook parameter15, address14. The hook instruction is used to interface the
LOW RISC to special-purpose coprocessors. The details of the interface will not be
covered here, but the general sequence of a hooked interaction is as follows:
(1) Upon encountering the hook instruction, the LOW RISC writes parameter15 to
the memory location specified by the sign-extended address14.
(2) A coprocessor, which must be able to access the memory location written to
by the LOW RISC, reads parameter15 and interprets it as a command (perhaps
with parameters) to execute a program in its private memory.
(3) After completing the program, the coprocessor places any results into shared
memory, and interrupts the LOW RISC.
(4) The LOW RISC is vectored to a handler for the interrupt. It processes the
interrupt, then returns to the interrupted task.
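The four steps can be traced with a small simulation. The Python sketch below is illustrative only, since the interface details are not given here: shared memory is modeled as a dictionary, the sign-extended address is assumed to wrap into a 29-bit address space, and the Coprocessor class, its service method, and the result address are hypothetical.

```python
ADDRESS_BITS = 29   # assumed width of a memory address


def sign_extend(field, bits):
    """Interpret a raw bit field as a twos-complement integer."""
    sign = 1 << (bits - 1)
    return (field ^ sign) - sign


def hook(memory, parameter15, address14):
    """Step 1: write parameter15 to the sign-extended address14."""
    if not 0 <= parameter15 < (1 << 15) or not 0 <= address14 < (1 << 14):
        raise ValueError("operand does not fit in its field")
    address = sign_extend(address14, 14) % (1 << ADDRESS_BITS)
    memory[address] = parameter15
    return address


class Coprocessor:
    def __init__(self, mailbox, programs):
        self.mailbox = mailbox      # shared location written by hook
        self.programs = programs    # private memory: command -> routine

    def service(self, memory, result_address):
        command = memory[self.mailbox]             # step 2: read the command
        memory[result_address] = self.programs[command]()  # step 3: store result
        return True                                # step 3: raise the interrupt


def handle_interrupt(memory, result_address):
    """Step 4: the handler picks up the result from shared memory."""
    return memory[result_address]
```

With a sign-extended 14-bit address, the mailbox locations reachable by hook lie at the very bottom and the very top of the address space in this model.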
Coprocessors are treated as channels, like the Intel 8089 I/O processor. The LOW
RISC and a coprocessor could also be more closely coupled, as is done with the Intel 
8086 and the Intel 8087 numeric data processor, but the instruction-decoding logic 
would become more complex. 
Motorola Inc., and Sam Daniel in particular, arranged financial support for my doctoral studies and 
encouraged the investigation of RISC architectures. 
Kenneth Bowen, Antony Faustini, David Patterson, William Wadge, and anonymous referees reviewed 
various versions of this paper and provided valuable suggestions; I incorporated most of them in this 
paper, and used those I didn’t as topics for continued research. 
Applied Logic Systems, Inc. supported the development of a WAM emulator, which allowed me to test an 
early version of the LOW RISC architecture. 
Ross Overbeek of Argonne National Laboratory continues to provide a focus for Warren-machine study. 
His introduction to logic-programming implementation was invaluable. Well, actually it was worth the
price of my Ph.D. But that’s pretty good nowadays. Thanks, Ross.
REFERENCES 
1. Beer, J., A Critique of Warren’s Abstract Instruction Set, GMD-FIRST/TU, Berlin,
1985.
2. Bowen, D. L., NIP: New Implementation of PROLOG, Dept. of Artificial Intelligence,
Univ. of Edinburgh, Edinburgh, Scotland, incomplete draft, 29 Nov. 1983.
3. Bowen, K. A. et al., The Design and Implementation of a High-Speed Incremental
Portable PROLOG Compiler, Technical Report CIS-85-4, School of Information and
Computer Science, Syracuse Univ., Syracuse, NY, 1985.
4. Borriello, G., Cherenson, A., Danzig, P., and Nelson, M., Special or General-Purpose
Hardware for PROLOG: A Comparison, Report No. UCB/CSD 87/314, Computer
Science Division (EECS), Univ. of California, Berkeley, CA, Oct. 1986.
5. Beuttner, K., private communication to J. Mills, Nov. 1986.
6. Clocksin, W., and Mellish, C., Programming in PROLOG, 3rd ed., Springer, Berlin, 1987.
7. Diel, H. et al., An Experimental Computer Architecture Supporting Expert Systems and
Logic Programming, IBM J. Res. Develop. 30(1):102-111 (Jan. 1986).
8. Dobry, T. et al., Design Decisions Influencing the Micro-architecture for a PROLOG
Machine, in: MICRO 17 Proceedings, Oct. 1984.
9. Dobry, T. et al., Performance Studies of a PROLOG Machine Architecture, presented at
12th International Symposium on Computer Architecture, June 1985.
10. Dobry, T. et al., Extending a PROLOG Machine for Parallel Execution, HICSS-19, Jan.
1986.
11. Fagin, B. et al., Compiling PROLOG into Microcode: A Case Study Using the NCR/32-000,
in: Proceedings 18th Annual Workshop on Microprogramming, IEEE Computer
Society Press, 1985.
12. Fisher, J., Trace Scheduling: A Technique for Global Microcode Compaction, IEEE
Trans. Comput. C-30(7):478-490 (July 1981).
13. Gabriel, J. et al., A Tutorial on the Warren Abstract Machine for Computational Logic,
ANL-84-84, Argonne National Lab., 1984.
14. Gross, T., and Hennessy, J., Optimizing Delayed Branches, Proceedings 15th Annual
Workshop on Microprogramming, SIGMICRO Newsletter 13:114-120 (Dec. 1982).
15. Gregory, S., Sequential Parlog Machine Specification (draft), Mar. 1985.
16. Hermenegildo, M., An Abstract Machine Based Execution Model for Computer
Architecture Design and Efficient Implementation of Logic Programs in Parallel, Ph.D. Thesis,
Dept. of Computer Sciences, Univ. of Texas at Austin, Aug. 1986.
17. Katevenis, M., Reduced Instruction Set Computer Architecture for VLSI, MIT Press,
Cambridge, MA, 1985.
18. Kogge, P., The Architecture of Pipelined Computers, McGraw-Hill, New York, 1981.
19. Kraft, G., and Toy, W., Mini/Microcomputer Hardware Design, Prentice-Hall,
Englewood Cliffs, NJ, 1979.
20. Matsumoto, H., A Static Analysis of PROLOG Programs, Programming Systems Group
Note (Preliminary), AI Applications Institute, Univ. of Edinburgh, Scotland, Nov. 1984.
21. McCabe, F., The Sigma Machine, (unpublished), 24-25 Oct. 1984.
22. Mills, J., A Description of the Operation of the Warren Abstract PROLOG Machine
Using a RISC-like Instruction Set, private communication to K. A. Bowen, 1985.
23. Mills, J., An Implementation of the Warren Abstract PROLOG Machine for Segmented
Memory Architectures, Technical Memo TM-44, Argonne National Lab., Apr. 1986.
24. Mills, J., Coming to Grips with a RISC: A Report of the Progress of the LOW RISC Design
Group, ACM SIGARCH Computer Architecture News, 15 Mar. 1987, pp. 53-62.
25. Onai, R., Shimizu, H., Masuda, K., and Aso, M., Analysis of Sequential PROLOG
Programs, J. Logic Programming 2:119-141 (1986).
26. Overbeek, R. et al., PROLOG on Multi-processors, Technical Report, Argonne National
Lab., Argonne, IL 60439, 1985.
27. Patterson, D. and C. Sequin, A VLSI RISC, IEEE Computer, Sept. 1982.
28. Patterson, D., Reduced Instruction Set Computers, Comm. ACM 28, No. 1 (Jan. 1985).
29. Patterson, D., A progress report on SPUR: February 1, 1987, ACM SIGARCH Computer
Architecture News, 15:15-21 (Mar. 1987).
30. Ross, M. L. and K. Ramamohanarao, Paging Strategy for PROLOG Based Dynamic
Virtual Memory, in: Proceedings 1986 Symposium on Logic Programming, IEEE
Computer Society Press, 1986.
31. Short, B., A Preliminary Evaluation of the LOW RISC, Computer Science Dept., Arizona
State Univ., May 1987.
32. Srini, V., VLSI-PLM Chip, Laboratory Note, Univ. of California, Berkeley, Oct. 1985.
33. Taylor, G., Hilfinger, P., Larus, J., Patterson, D., and Zorn, B., Evaluation of the SPUR
Lisp Architecture, in: Proceedings 13th Annual International Symposium on Computer
Architecture, Tokyo, Japan, 2-5 June 1986.
34. Tick, E. and D. H. D. Warren, Towards a Pipelined PROLOG Processor, in: 1984
International Symposium on Logic Programming, IEEE Computer Society Press, 1984.
35. Tick, E., PROLOG Memory-Referencing Behavior, Research Paper 85-281, Computer
Systems Lab., Stanford Univ., 1985.
36. Tick, E., Lisp and PROLOG Memory Performance, Research Paper 86-291, Computer
Systems Lab., Stanford Univ., 1986.
37. Tick, E., Studies in PROLOG Architectures, Technical Report CSL-TR-87-329,
Computer Systems Lab., Stanford Univ., 1987.
38. Van Roy, P., A PROLOG Compiler for the PLM, Master’s Report Plan II, Computer
Science Div., Univ. of California, Berkeley, Aug. 1984.
39. Warren, D. H. D., An Abstract PROLOG Instruction Set, Technical Note 309, SRI
International, Oct. 1983.
40. Warren, D. H. D., Applied Logic: Its Use and Implementation as a Programming Tool,
Technical Note 290, SRI International, June 1983.
41. Suzuki, N., Kubota, K., and Aoki, T., SWORD-32: A Bytecode Emulating
Microprocessor for Object-oriented Languages, in: Proceedings of the International Conference on
Fifth-Generation Computer Systems, ICOT, pp. 389-397, 1984.
