Multilog and data or-parallelism  by Smith, Donald A.
NORTH-HOLLAND
MULTILOG AND DATA OR-PARALLELISM
DONALD A. SMITH 
This paper describes the design, implementation, performance, and analysis
of MultiLog--a logic programming system whose distinguishing feature
is the presence of multiple, concurrent binding environments (multiple
substitutions) with a single thread of control. In MultiLog, for certain goals,
some subset of the solutions is collected and installed as the active set of
substitutions. Subsequent goals execute with unification performed
concurrently on the multiple substitutions, using a single thread of control. In
this way, multiple binding environments partially replace backtracking as
the operational embodiment of disjunction. The slogan "one control,
multiple environments" summarizes the "data or-parallelism" of MultiLog. In
this paper, we present an operational semantics (multi-SLD resolution) for
MultiLog; discuss MultiLog's design and implementation; display benchmark
results for prototype uniprocessor, MIMD, and SIMD implementations;
present performance models that explain observed speedups; and
consider various extensions and generalizations. Our aim is to evaluate the
viability of data or-parallelism as an alternative both to backtracking and
to control or-parallel search.
1. MULTILOG AND MULTI-SLD RESOLUTION
The implementation of disjunction in top-down logic programming languages like 
Prolog typically relies on backtracking and depth-first search, or on the provision 
of multiple threads of control: control or-parallelism. However, both backtracking 
and control or-parallelism have disadvantages: backtracking returns answers one at 
a time, often causing similar work to be repeated when a choice turns out to be the 
Address correspondence to Donald A. Smith, Department of Computer Science, University of
Waikato, Hamilton, New Zealand. E-mail: dsmith@cs.waikato.ac.nz.
Received March 1995; accepted April 1996. 
THE JOURNAL OF LOGIC PROGRAMMING 
© Elsevier Science Inc., 1996 0743-1066/96/$15.00
655 Avenue of the Americas, New York, NY 10010 PII S0743-1066(96)00067-2 
wrong one; and control or-parallelism (an approximation to breadth-first search) is
difficult to implement and expensive in resources.
An alternative implementation strategy called multi-SLD resolution has been 
described in several publications [41, 43, 46, 45]. (See also [30, 20, 52, 53].) The 
essential idea is to extend the SLD inference rule to permit multiple binding en- 
vironments (multiple substitutions) and to provide a mechanism for collecting the 
solutions to a goal and turning these solutions into a set of substitutions (a disjunc- 
tive constraint). We call the resulting logic programming system MultiLog. 
The canonical example that illustrates multi-SLD resolution and its consequent 
data or-parallelism is the query 
| ?- generate(X), test(X).
Standard Prolog enumerates the solutions to generate/1 one by one via backtracking,
and tests each solution separately with test/1. A control or-parallel
implementation, such as Aurora [35] or Muse [2], starts up multiple Prolog search engines
during execution of generate/1, when the selected atom matches with more than
one clause head. In contrast, an implementation based on multi-SLD resolution
begins a subcomputation to collect some subset of the solutions to generate/1. After
collecting an appropriate number of substitutions, it then performs the testing in
test/1 using a single thread of control in the context of the collected substitutions.
In other words, a multi-SLD interpreter enumerates the solutions to generate/1
subset by subset via backtracking; it creates from each subset a set of environments¹
which are tested en masse by test/1, using a single thread of control, with only
unifications performed concurrently (data parallelism). As a result, test/1 is executed
once per subset rather than once per solution, and less work is performed overall.
In addition, for many Prolog programs the same or similar computation (e.g., the
creation or traversal of a list) is performed for each invocation of test, but using
MultiLog's distinction between engine and multivariables, this shared computation
can be "factored out" and performed only once (Sections 3.2 and 3.2.1).
The syntax of MultiLog need not differ at all from that of Prolog. But to allow
the programmer to indicate to the compiler for which goals to collect solutions, we
have added the unary control operator (annotation) disj, whose argument is an
arbitrary goal. On encountering the goal disj G, MultiLog starts a subcomputation
to collect all or some subset of the solutions to G. After collecting an appropriate
number of solutions, MultiLog turns these solutions into a disjunctive constraint--a
set (or disjunction) of binding environments. The essential point is that subsequent
goals execute in the context of these multiple binding environments, with unification
and arithmetic performed "in parallel" on the multiple substitutions, but with one
thread of control (data or-parallelism).
Whereas the goal bagof(X,G,L) computes all solutions to G and collects the
bindings of X into the list L, the goal disj G computes solutions to G and makes
the solutions into a disjunctive set of binding environments internal to the Prolog
system. Clearly, the goal disj G will never terminate if G enters an infinite loop,
or if G succeeds infinitely often and we require disj to collect all solutions. Even
if G's search tree is finite, G may have too many solutions to handle efficiently
1 Our use of "environment" to refer to a representation of bindings should not be confused with
its meaning inside the WAM as a frame on the control stack. In this paper, we use "environment"
exclusively in the sense of a binding environment.
with multiple environments. Accordingly, we specify that MultiLog may revert to
backtracking after any number of solutions has been collected: the then-current
set of solutions is fed to the success continuation, and if control backtracks into
the disj goal, additional solutions are collected, ..., and so on. In this way,
backtracking and multiple environments can be combined in a single program and for
a single goal. Also, in this way, MultiLog can adjust the amount of space it uses:
data or-parallelism involves a time-space tradeoff, and it is important to balance
backtracking with the use of multiple environments.
Data or-parallelism is not a special case of control or-parallelism, except in the 
weak sense in which SIMD processing is a special case of MIMD processing. For 
readers having even a basic familiarity with the Warren Abstract Machine [1], 
the best way to see the distinction between control or-parallelism and data or- 
parallelism is to note that in a control or-parallel implementation, there are mul- 
tiple WAM instruction streams (multiple Prolog engines) running in parallel; in a 
data or-parallel implementation, there can be a single instruction stream: only the 
unifications need be done in parallel. 
These comments are not meant to rule out the possibility of combining control 
or-parallelism and data or-parallelism in a single system. A reasonable approach 
might be to collect the solutions to disj goals using control or-parallelism, and to
perform the testing using data or-parallelism. However, in our current, prototype
implementations, solution collection is performed sequentially using backtracking.²
Another piece of evidence attesting to a fundamental distinction between data 
or-parallelism and control or-parallelism is the fact that the former leads to (sometimes
spectacular) speedups over standard Prolog, even on a uniprocessor. (See
Section 5.) We explain the reason for these speedups in Section 6. A control
or-parallel Prolog would certainly not run faster than standard Prolog on a single
processor.³
The operational semantics of MultiLog is formalized in the notions of multi-SLD
resolution, multi-SLD derivation (Section 3), and multi-SLD tree [46, 47]. For
multi-SLD resolution, the input to a resolution step is a goal, a set of substitutions, 
and a program; the output is a set of substitutions and a resolvent. In [46, 47], 
multi-SLD resolution is shown to be a sound and complete inference rule, given a 
breadth-first interpreter. In this paper, we concentrate on the design and analysis 
of MultiLog. 
In the next section, we give an example of a MultiLog program that should 
clarify the difference between the data or-parallelism of MultiLog and the control 
or-parallelism of systems like Muse [2] and Aurora [35]. Next, in Section 2, we review 
related work. In Section 3, we give a formal operational semantics for multi-SLD 
resolution and introduce the important distinction between engine variables and 
2 In fact, there can be multiple disj goals in a derivation so that on entry to a disj goal, there
may already be multiple substitutions. In such a case, collection of solutions will be performed
using data or-parallelism.
3 Actually, one can imagine anomalous situations in which it might run faster. On a heavily
loaded single processor system, a multiprocess program might finish sooner than a single-process 
program due to its being scheduled for a greater proportion of CPU cycles. Similarly, if only one 
solution is needed, a control or-parallel Prolog may find it sooner. Such special cases would not 
contradict our point. MultiLog often does run faster (and, according to the analysis in Section 6, 
should run faster) than Prolog, even when all solutions are needed and even when scheduling is 
not an issue. 
multivariables, along with the related notion of templates. Then, in Section 4, we 
describe the Multi-WAM architecture for execution of MultiLog and sketch working 
implementations that run on the BBN Butterfly TC2000 and KSR1 (MIMD), on 
the MasPar MP1 (SIMD), and on uniprocessor computers. Benchmarks in Section 5 
show useful speedups and good absolute performance for a range of search problems. 
For most examples, uniprocessor Multi-WAM using multiple environments is faster
than the standard WAM using backtracking. For uniprocessor MultiLog, we have
obtained speedups over Prolog of 88 on naive generate-and-test programs; for
SIMD MultiLog, we have obtained speedups of more than 2000 relative to Prolog.
(One can contrive programs with arbitrarily large speedups.) Section 5.4 discusses
the applicability and limitations of data or-parallelism. Section 6 provides a
formal model that explains the speedups and space consumption and gives insight
into the nature of data or-parallelism. Section 7 summarizes our work and lists
areas for future research and improvement.
1.1. Breadth-First Graph Searching 
Consider the following program for finding cycle-free paths in a graph: 
    path(StartNode, Path) :- path_aux(StartNode, [StartNode], Path).
    path_aux(_, Path, Path).
    path_aux(StartNode, InPath, OutPath) :-
        disj edge(StartNode, N),
        not_member(N, InPath),
        path_aux(N, [N | InPath], OutPath).
The disj annotation is used before edge(StartNode,N) to treat its answer
bindings via multiple environments; but normal, backtracking disjunction is used 
for calls to the two clauses of path_aux/3. Let the query be path (1, P) with edge/2 
defined according to the graph of Figure 1 by the following unit clauses: 
    edge(1,2). edge(1,3). edge(1,4). edge(2,1).
    edge(3,2). edge(3,5). edge(4,3). edge(5,2).
The effect of the disj is to cause a breadth-first search in which, at the ith step
(i > 0), cycle-free paths of length i are bound to P.
    | ?- path(1,P).
    Yes P = [1] More? ;
    Yes P = [2,1] or P = [3,1] or P = [4,1] More? ;
    Yes P = [5,3,1] or P = [2,3,1] or P = [3,4,1] More? ;
    Yes P = [5,3,4,1] or P = [2,3,4,1] or P = [2,5,3,1] More? ;
    Yes P = [2,5,3,4,1] More? ;
    No
[Figure: directed graph on the nodes 1 to 5 whose arcs are the edge/2 facts listed above.]
FIGURE 1. Graph for path example.
Each time disj edge(StartNode,N) is about to be called, StartNode is bound
to some value in each environment; N is unbound. The collection of solutions is
performed by backtracking over the clauses for edge/2; for each clause, only some
(or none) of the environments will be consistent with the unification of the atom
with the head of the chosen clause. Successful environments, containing bindings for
N, are copied and accumulated in a list; environments which are inconsistent with
the unification are trailed in a choice point or deleted with their space recovered,
depending on whether there are any more clauses for edge/2. (Section 4.1 explains the
need for environment trailing.) After copying, backtracking occurs and N's binding
is undone in the active (input) environments, in preparation for renewed forward
execution. After all (or enough) clauses have been tried, the collected environments
are installed as the active environments. (Section 4 describes several optimizations
of this naive execution model.) The execution of the call not_member(N,InPath)
then occurs in multiple substitution environments so that the same unifications and
tests are performed on multiple data.
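The naive execution model just described can be mimicked in a few lines of Python. The sketch below is an illustration under simplifying assumptions, not the Multi-WAM: each environment is represented only by its binding for the path (newest node first), all solutions of each disj goal are collected rather than a subset, trailing and copying are ignored, and the function names disj_edge, not_member, and paths are ours. The two clauses of path_aux/3 are modelled by first yielding the current set of environments and then extending it.

```python
# Edge facts of the graph of Figure 1, in clause order.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 1), (3, 2), (3, 5), (4, 3), (5, 2)]

def disj_edge(envs):
    """Collect the solutions of edge(StartNode, N) over all active environments."""
    collected = []
    for src, dst in EDGES:          # backtrack over the clauses for edge/2
        for path in envs:           # the same head unification in every environment
            if path[0] == src:      # environment consistent with this clause head
                collected.append([dst] + path)
    return collected

def not_member(envs):
    """Keep the environments whose newest node does not already occur in the path."""
    return [p for p in envs if p[0] not in p[1:]]

def paths(start):
    """Yield one answer (a set of environments) per level of the search."""
    envs = [[start]]
    while envs:
        yield envs
        envs = not_member(disj_edge(envs))
```

Iterating over paths(1) yields five answers containing 1, 3, 3, 3, and 1 environments, the same sets of paths as in the session shown in Section 1.1 (the order of environments within an answer may differ).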
2. RELATED WORK 
We have already contrasted the data or-parallelism of MultiLog with the control 
or-parallelism of Aurora and Muse [2, 35, 56]. 
The work on finite domain logic programming [16, 25] utilizes disjunctive
constraints of the restricted form $X = c_1 \lor \cdots \lor X = c_n$ involving one variable
$X$, where the domain of the variable is a finite set of constants $c_1, \ldots, c_n$. In
contrast, MultiLog allows arbitrary disjunctions of equations involving different 
variables and nonconstant terms. Finite domain logic programming uses consis- 
tency techniques such as forward checking [25] to prune the search space in ways 
different from standard Prolog (or MultiLog). Theoretically, it would be possi- 
ble to use multiple environments alone (without constraints) to solve the sort of 
combinatorial problems for which finite domain logic programming is appropri- 
ate. But the result would likely be unsatisfactory since the problems would, in 
effect, be solved by enumeration. We think a promising research direction involves 
combining such CLP techniques with multiple environments, as in Firebird [51] 
(below). 
The cardinality operator of [26] generalizes the earlier work on finite domains
by expressing more general kinds of disjunctions. Its syntax is $\#(l, u, [c_1, \ldots, c_n])$,
with the meaning "the number of constraints in $c_1, \ldots, c_n$ that are true is between
$l$ and $u$," where $l$ and $u$ are integers. When $l \ge 1$, this implies disjunction. However,
the implementation of the cardinality operator is based on heuristic rewrite rules
and is not complete; the $c_i$ are restricted to constraints (we allow arbitrary goals
as arguments to disj); and the notion of multiple constraint environments is
absent. Indeed, the intention of the authors of [26] in introducing the $\#$ operator was
fundamentally different from our intention in introducing disj: $\#$ is meant to be a
primitive atop which other predicates can be built for solving various combinatorial
problems, whereas disj is meant to realize a partial breadth-first, data-parallel
search by replacing backtracking.
Firebird [51] is a committed-choice constraint language [38] whose execution
model involves two components: a front end inference engine (running the control
part of a committed choice logic interpreter) and a back end, data parallel
constraint solver. "In a non-deterministic derivation step, if there is any unbound
domain variable $X$ in the system with domain $\{a_1, \ldots, a_n\}$, Firebird will create $n$
or-parallel branches, each of which executes with an additional constraint $X = a_i$,
$1 \le i \le n$." Whereas MultiLog's environments are created by disj goals, Firebird's
environments are created by the constraint labeling operation. There is no
engine/multi distinction for variables (see our Section 3.2), so all logic variables
reside on the SIMD processors--a distinct disadvantage, according to our analysis
here. Firebird avoids environment copying by having idle processors track active
ones.
In the language LPS (Logic Programming with Sets) of [32], sets are available 
as a user data type, along with a membership predicate $\in$. Terms are of two sorts:
regular (atomic) terms and set terms. In [8], the "set grouping" syntax is available 
for binding variables to sets of solutions to goals. This is analogous to Prolog's 
setof. 
In the work on constructive negation [9, 10, 39, 49], arbitrary first-order formulas,
including disjunctions, are managed by the logic programming interpreter.
In a nutshell, the idea is logical rewriting, in which logical formulas are simplified
by the incremental unfolding of user atoms and by quantifier elimination [12, 36,
39, 42, 49]. Constructive negation subsumes both Prolog and MultiLog--the
difference being that Prolog nondeterministically reduces formulas to conjunctions of
equations, while MultiLog, under the direction of the disj control operator, allows
disjunctions of conjunctions of equations. But the unrestricted generality of using
arbitrary formulas (with nested quantifiers and negations) clouds the simplicity of
adding disjunctive constraints to logic programming. The present work presents
a simple generalization of standard logic programming operational semantics: the
generalization from a single environment to multiple environments.
S.-Å. Tärnlund and his students have introduced the Reform model of logic
programming [5, 37]. The basic idea is to unfold recursive program clauses so that
goals from multiple recursion levels appear as conjuncts in a single clause body. In
this way, the successive computations corresponding to different levels of recursion
can be performed using and-parallelism. In contrast, multi-SLD resolution is a type
of or-parallelism: substitutions from multiple solutions to earlier goals are processed
in parallel by later goals.
Arro et al. [3] describe an extension to Prolog in which there is an explicit syntax
for expressing various quantifiers or quantifier-like constructs, including and, sum,
and product. The authors extend the operational semantics to allow computation
of the quantified formula on a data parallel machine. The result is thus (in our
terminology) a form of data and-parallelism. Barklund and Millroth [7] consider
bounded quantification where the quantifier is existential. In this case, the result is
a form of data or-parallelism. However, the authors do not conceive of it in terms of
multi-SLD resolution: as a simple generalization of Prolog's operational semantics,
as a replacement for backtracking, and as a means for collecting solutions. Nor
have they attempted to implement this extension [6].
We recently learned of or-vectorization [30], a technique close to multi-SLD resolution,
for implementation on vector supercomputers. The authors report parallel
speedups of about 8 relative to scalar operation for N-queens problems. The
possibility of sequential speedups was not considered. Nor was the technique analyzed
or formalized to the extent we have done in our work.
The key idea of the Andorra approach [13, 23, 58] is to give preference during
resolution to the selection of deterministic atoms over nondeterministic atoms.
When deterministic atoms are not preferred, such goals get executed unnecessar- 
ily often: once for each solution to the preceding nondeterministic goals. When 
deterministic atoms are preferred, they get executed once, before execution of the 
corresponding nondeterministic goals. Furthermore, if the deterministic goal fails, 
then the nondeterministic goals are not executed at all. Like data or-parallelism, 
the Andorra approach was originally conceived of as a mechanism for use with 
parallel implementations, but it turned out, for many applications, to be better 
than the standard approach even for sequential implementations. 
Finally, independently of and approximately concurrently with our work, Jordi Tubella
and Antonio González [20, 52, 53] have developed the Multipath Execution Model
(MEM), which is quite similar to MultiLog. Much of the theory and analysis
developed here should apply to MEM as well.
3. OPERATIONAL SEMANTICS FOR MULTI-SLD RESOLUTION
In this section, we make precise the operational semantics of MultiLog by formalizing 
the notions of multi-SLD resolution and multiderivation. We introduce the
distinction between engine and multivariables, along with the related optimization
technique, templates. Finally, we point out a variation on multi-SLD resolution 
involving lazy evaluation of the disjunctive constraint component of the abstract 
machine state. 
We assume the standard definitions and notations of logic programming [34].
3.1. Multi-SLD Resolution and Multiderivation
It is convenient to adopt the language of the Constraint Logic Programming (CLP) 
paradigm [29, 28] for discussing the semantics of MultiLog. This is because an 
(idempotent) substitution is equivalent to a solved form set of equations, so that 
a set of substitutions can be regarded as a disjunction of solved form conjunctions 
of equations. (We use the notions interchangeably.) Using the framework of CLP 
allows us to describe multi-SLD in greater generality and abstractness. 
Prolog can be characterized as a CLP language whose domain of computation 
is the Herbrand Universe (the algebra of finite trees [36]), whose satisfaction com- 
plete axiomatization is Clark's Equality Theory (CET) [11], and whose allowed 
constraints are conjunctions of equations between terms in the Herbrand universe 
[31, 36, 40]. Herbrand's solved form algorithm (unification) is a sound and complete 
procedure for deciding the satisfiability of sets of equations in this domain [27, 33]. 
MultiLog generalizes Prolog in two ways: first, by using disjunctions of conjunc- 
tions of equations as allowed constraints, and second, by providing a mechanism 
for collecting the solutions to a goal and turning these solutions into a disjunctive 
constraint. In the CLP framework, any first-order formula in the underlying con- 
straint language is potentially allowed as a constraint; so in this respect, MultiLog's 
use of disjunctions is nothing new. But the second feature--the mechanism for
collecting solutions--is not part of the CLP framework, and requires independent
justification. 
In an abstract operational model of a CLP language, the state of a derivation 
is a pair $G \diamond C$, where $G$ (the goal component) is a list of user atoms, $C$ (the
constraint component) is an allowed constraint, and $\diamond$ is a synonym for $\land$. Starting
from an initial state whose goal component is the query and whose constraint 
component is the empty constraint true, the resolution rule tells how to progress
nondeterministically to the next state. In standard CLP languages, this rule is just 
resolution, and a sequence of states progressing according to the resolution rule is 
called a (unary) derivation. 
In order for the derivation to continue from a state $G \diamond C$, it is necessary for $C$ to
be solvable. That is, the existential closure of C must evaluate to true in the theory
of the constraint domain. Logically, when C is a disjunction, it is sufficient for just
one disjunct to be solvable. But in our development, we assume that unsolvable
disjuncts are deleted from the constraint component, so that at each step, the
constraint component consists of a disjunction of solvable formulas.⁴ To this end,
let S be a function which, applied to a constraint C, reduces C to a simplified form.
In MultiLog, S is Herbrand's solved form algorithm applied to each disjunct, with
false disjuncts deleted (subsumed disjuncts could also be removed). The result is
a DNF formula: a disjunction of solved form sets of equations.
Let $\Box$ be the empty goal. Assume some fixed computation rule for selecting
atoms to participate in resolution steps. Define a multiderivation from query $G_0$ and
program $P$ to be a (finite or infinite) sequence of states $S_0, S_1, \ldots$, where
$S_0 = (G_0 \diamond \mathit{true})$ and for $i > 0$, $S_{i-1}$ derives $S_i$, written $S_{i-1} \Rightarrow S_i$, by the multiresolution rule
of Figure 2. Let $\Rightarrow^*$ be the reflexive transitive closure of $\Rightarrow$. If a multiderivation
is finite and ends in a state with an empty goal component and with a non-false
constraint $C$, then we write $S_0 \Rightarrow^* (\Box \diamond C)$.
The multiresolution rule is divided into two cases, depending on whether the
selected atom is a normal user atom [e.g., p(f(X))], or a disj goal [e.g., disj
q(Y,g(X,Y))]. (In practice, the interpreter or compiler could decide which goals
should be labeled with disj operators. In this sense, the choice between the two
4 Section 3.3 discusses the consequences of relaxing this assumption.
1. (normal multi-resolution rule) Suppose the current state is
$A_1, \ldots, A_n \diamond (E_1 \lor \cdots \lor E_m)$; the selected atom, $A_i$, is a normal user atom;
$H \leftarrow B_1, \ldots, B_k$ is a new renaming of a clause in $P$; and
$$C = \bigvee_{1 \le j \le m} \bigl(E_j \land (A_i = H)\bigr)$$
with $S(C) \ne \mathit{false}$. Then
$$\frac{A_1, \ldots, A_n \diamond (E_1 \lor \cdots \lor E_m)}{A_1, \ldots, A_{i-1}, B_1, \ldots, B_k, A_{i+1}, \ldots, A_n \diamond S(C)}.$$
This rule is just the standard resolution rule for CLP languages explicitly
specialized to the case where the constraint component is a disjunction.
2. (disj multi-resolution rule) Suppose the current state is $A_1, \ldots, A_n \diamond C$;
the selected atom $A_i$ is a disjunctive goal (disj $A$);
$F \subseteq_{\mathrm{finite}} \{C^{sol} \mid (A \diamond C) \Rightarrow^* (\Box \diamond C^{sol})\}$; and $F \ne \emptyset$. Then
$$\frac{A_1, \ldots, A_n \diamond C}{A_1, \ldots, A_{i-1}, A_{i+1}, \ldots, A_n \diamond \bigvee_{C^{sol} \in F} C^{sol}}.$$
In words, a nonempty, finite subset $F$ of the solutions to $A \diamond C$ is obtained.
The new state has a goal component consisting of the remaining goals and
a constraint component equal to the disjunction of constraints in $F$.
FIGURE 2. Multi-SLD resolution.
multiresolution rules is arbitrary, and each time an atom is selected, a new decision 
could be made about which rule to use.) 
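The two rules of Figure 2 can be sketched as operations on a list of substitutions. The following Python fragment is a minimal illustration under stated assumptions, not the MultiLog implementation: variables are strings beginning with an upper-case letter, compound terms are tuples (functor, arg1, ...), each substitution is a dict, the clause is assumed to be renamed apart already, unification omits the occurs check, and the callable solve passed to disj_step is a hypothetical stand-in for the subcomputation that collects solutions.

```python
def is_var(t):
    """Variables are strings that start with an upper-case letter."""
    return isinstance(t, str) and t[:1].isupper()

def walk(t, s):
    """Dereference a variable through substitution s."""
    while is_var(t) and t in s:
        t = s[t]
    return t

def unify(a, b, s):
    """Return an extension of s unifying a and b, or None (no occurs check)."""
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return s
    if is_var(a):
        return {**s, a: b}
    if is_var(b):
        return {**s, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) \
            and len(a) == len(b) and a[0] == b[0]:
        for x, y in zip(a[1:], b[1:]):
            s = unify(x, y, s)
            if s is None:
                return None
        return s
    return None

def normal_step(goals, envs, i, head, body):
    """Rule 1: resolve goals[i] with head :- body in every disjunct."""
    survivors = []
    for env in envs:
        extended = unify(goals[i], head, env)
        if extended is not None:        # drop disjuncts that became false
            survivors.append(extended)
    if not survivors:                   # S(C) = false: rule not applicable
        return None
    return goals[:i] + body + goals[i + 1:], survivors

def disj_step(goals, envs, i, solve):
    """Rule 2: replace goals[i] by a nonempty finite set of its solutions."""
    collected = solve(goals[i], envs)
    if not collected:
        return None
    return goals[:i] + goals[i + 1:], collected
```

For instance, resolving the goal ("bits", "L") against the head ("bits", (".", "H1", "T1")) in the three environments {"L": "nil"}, {"L": (".", 1, "nil")}, and {} deletes the first environment and extends the other two, in line with the treatment of [H | T] heads discussed in Section 3.2.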
The nondeterminism in the choice of clauses and in the choice of solutions for 
disj goals is essential: breadth-first search (or some equivalent search strategy such
as iterative deepening) is, in general, necessary to guarantee completeness. For a 
goal disj A, breadth-first search requires concurrent consideration of (disjoint) 
subsets of solutions to A. 
By answer, we mean a constraint (disjunction of substitutions) returned by a suc- 
cessful multiderivation. By solution, we mean a substitution returned as a disjunct 
in some answer. 
In [46, 47], the additional notion of multi-SLD tree is defined, and multi-SLD 
resolution is shown to be a sound and complete inference rule, given a breadth-first 
interpreter. 
3.2. The Distinction Between Engine and Multivariables 
At each step of a multi-SLD derivation, the constraint component consists of a 
disjunction (set) of substitutions. A naive implementation of multi-SLD resolution
would store and manipulate these substitutions totally independently, as suggested
by the abstract definitions. Yet, in general, the various substitutions share much 
structure. 
Consider that a multi-SLD derivation consists of a sequence of normal
multiresolution steps interspersed with (occasional) disj multiresolution steps. Each
surviving disjunct after a normal step is consistent with the head unification
associated with the step. And each disjunct contains, where appropriate, equations
(bindings) resulting from the head unification associated with the step. For example,
if the head of the clause contained [H | T] in some argument position, then all
disjuncts with [] in that position failed the unification; and any disjuncts with a
variable in that position were extended with a binding of that variable to [H | T].
Consequently, at any step of a multiderivation, the various disjuncts of the 
constraint component share equations resulting from normal multiresolution steps 
appearing in the multiderivation up to that point. That is, each normal multiresolution
step involves a unification $A_i = H$, which (logically) appears as a conjunct
in each disjunct of each subsequent constraint component. So for any subsequent
constraint component $E_1 \lor \cdots \lor E_m$, it holds that for each $j$, $E_j \rightarrow (A_i = H)$.
The significance of this fact is that only bindings dependent on disj goals can
differ between substitutions appearing together in the constraint component.
Bindings of variables corresponding to normal multiresolution steps are shared. For
efficiency's sake, it is beneficial to store the bindings of such variables once,
independently of the bindings of those variables whose values vary among substitutions.
This fact is the basis for the distinction between "engine" (sequential) and "multi"
(parallel) variables, and the related distinction between the Prolog engine and the
unification workers (Section 4).⁵
A multi-SLD interpreter manages a disjunction (set) of substitutions at each step
of execution. Our design represents this disjunction in the form $\alpha \land \beta$, where $\alpha$ is
a conjunction (a substitution) representing the shared, common bindings, and $\beta$ is
a disjunction representing the bindings that differ among substitutions. Variables
5 Tim Hickey shares credit for conceiving of the distinction between engine and multivariables.
bound in $\alpha$ are the engine (sequential) variables; variables bound in $\beta$ are the multi
(parallel) variables [43, 47]. Unifications involving engine variables are faster than
unifications involving multivariables since the disjuncts in $\beta$ need not play a role.
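The representation $\alpha \land \beta$ can be illustrated by factoring a set of substitutions after the fact. The Python sketch below is only a syntactic approximation introduced here for illustration: MultiLog decides which variables are engine variables by whether their bindings depend on disj goals, whereas the hypothetical function factor simply moves into alpha every binding that is identical in all substitutions.

```python
def factor(envs):
    """Split a nonempty list of substitutions (dicts) into (alpha, beta).

    alpha holds the bindings common to every substitution; beta is the list
    of remaining, per-environment bindings.
    """
    first, rest = envs[0], envs[1:]
    alpha = {v: t for v, t in first.items()
             if all(v in e and e[v] == t for e in rest)}
    beta = [{v: t for v, t in e.items() if v not in alpha} for e in envs]
    return alpha, beta

# Four substitutions binding L to a two-element list whose elements vary.
envs = [{"L": ("A", "B"), "A": a, "B": b} for a in (0, 1) for b in (0, 1)]
alpha, beta = factor(envs)
```

Here alpha contains the single shared binding of L, stored once, while beta contains the four differing pairs of bindings for A and B.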
The unification of the selected atom and the head of the chosen clause in a
multiresolution step is performed in two stages, with input $\alpha \land \beta$ and output $\alpha' \land \beta'$ (or
failure). First, the unification $A_i\alpha = H\alpha$ is performed. If this unification fails, the
entire unification fails. If $A_i\alpha = H\alpha$ results in the unification of an engine variable
with a term not containing that engine variable, then the engine variable gets bound
to that term. (This is so even if the engine variable gets unified with a multivariable:
the former gets bound to the latter.) The resulting bindings appear in the
output $\alpha'$. If the unification $A_i\alpha = H\alpha$ requires the unification of a multivariable
with anything other than an engine variable, then that unification is performed in
each disjunct of $\beta$, with output in $\beta'$.
Consider, for example, the following program and query, which binds L to lists 
of binary digits. 
    bit(0).  bit(1).
    bits([]).
    bits([H | T]) :- disj bit(H), bits(T).
    | ?- bits(L).
    Yes L = []. More? y
    Yes L = [A], (A=0 or A=1). More? y
    Yes L = [A,B], (A=0,B=0 or A=0,B=1 or A=1,B=0 or A=1,B=1). More? y
The disj-independent variable L and each tail of L (the variable T in the body of the
second clause for bits/1) get bound either to [] or to a LIS cell. It is reasonable
to store the bindings of L and T once, globally, in $\alpha'$. In contrast, the value of H
varies among environments, and must be stored as the binding of a multivariable,
in $\beta'$. This representation is reflected in the format of the output above.
In many programs, it is common for the heads of lists to be multivariables, while 
the tails are [] or list cells which, if they were allocated as multivariables, would 
have the same value in each environment. To avoid such redundancy, the compiler 
or programmer arranges for the tail variable to be an engine variable, so that it 
gets bound in only one place, the Prolog engine. 
As another example, in the graph searching program of Section 1.1, the accumu- 
lated list of visited nodes (the second parameter to graph_aux) could be an engine 
variable so that, for instance, the third answer would be printed out like this: 
Yes P = [A,B,C], (A=5,B=3,C=1 or A=2,B=3,C=1 or A=3,B=4,C=1),
reflecting the fact that the "spine" of the list is constructed once, in the Prolog 
engine, and only the elements differ among environments. Standard Prolog, as well
as Aurora [35] and Muse [2], would construct the list once for each path in the
graph.
3.2.1. Templates. A useful optimization technique called templates further lever- 
ages the benefits of data or-parallelism when used in conjunction with the en- 
gine/multivariable distinction. 
Let $\{\theta_1, \ldots, \theta_m\}$ be the solutions collected for a disj goal, and let
$\{v_1, \ldots, v_n\}$ be the variables bound in $\{\theta_1, \ldots, \theta_m\}$. For each
$i$ ($1 \le i \le n$), consider the set of terms
$\{v_i\theta_1, \ldots, v_i\theta_m\}$, regarding each term as a finite tree. Then we say
that all solutions have the same shape iff for each variable $v_i$, the set of trees are
isomorphic modulo the leaves. That is, solutions differ only by binding different
constants to leaves of the trees, and all internal nodes are the same. More formally,
say that a substitution is a finite domain substitution if it is of the form
$[w_1 \leftarrow t_1, \ldots, w_m \leftarrow t_m]$, where $w_1, \ldots, w_m$ are variables
and $t_1, \ldots, t_m$ are either constants or variables. Let
$S = \{\theta_1, \ldots, \theta_n\}$ be the set of solutions to some goal $G$. All solutions
have the same shape iff there exists a substitution $\theta$ such that for
$i = 1, \ldots, n$, $\theta_i = \theta\eta_i$ for some finite domain substitution $\eta_i$.
The substitution $\theta$ is then called the shape of the solutions; $G\theta$ represents
the most specific instance of $G$ that is more general than each term in
$G\theta_1, \ldots, G\theta_n$.
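
The same-shape condition can be checked mechanically. The Python sketch below is an illustration only, under stated assumptions: compound terms are encoded as tuples `(functor, arg1, ..., argN)`, every other value is a leaf, and the helper names `shape`, `is_leaf`, and `cons_list` are hypothetical. It returns a common shape with a fresh variable at every leaf position where the terms disagree, or `None` when some internal node differs. Because it introduces a new variable at each differing position, it only approximates the most specific generalization.

```python
from itertools import count

def is_leaf(t):
    # Compound terms are tuples (functor, arg1, ..., argN); all else is a leaf.
    return not isinstance(t, tuple)

def shape(terms, fresh=None):
    """Common shape of a non-empty list of terms, or None if shapes differ."""
    fresh = fresh if fresh is not None else count()
    first = terms[0]
    if all(is_leaf(t) for t in terms):
        # Identical leaves are kept; differing leaves become a fresh variable.
        return first if all(t == first for t in terms) else f"_V{next(fresh)}"
    if any(is_leaf(t) for t in terms):
        return None                       # leaf against internal node
    if any(t[0] != first[0] or len(t) != len(first) for t in terms):
        return None                       # different functor or arity
    args = []
    for column in zip(*(t[1:] for t in terms)):
        sub = shape(list(column), fresh)
        if sub is None:
            return None
        args.append(sub)
    return (first[0], *args)

def cons_list(items):
    """Encode a Python list as nested ('.', Head, Tail) cells ending in '[]'."""
    term = "[]"
    for item in reversed(items):
        term = (".", item, term)
    return term
```

Two lists of equal length share a shape, while lists of different lengths do not, since a `[]` leaf then meets a list cell.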
When all solutions to a disj goal have the same shape, then each multivariable
gets bound to some finite tree such that only the leaves differ among environments. 
In that case, it is likely that the operations involved in performing subsequent 
unifications will be the same in all environments and can be easily carried out in 
a data parallel fashion. But if the generated environments have different shapes, 
then unification will likely involve different operations in different environments 
because variables will be bound to terms containing different function symbols 
and different arities in different environments. In an SIMD implementation for 
which one processor handles unifications for one environment, some processors will 
have to be idle during unification operations for function symbols or positions not 
appearing in the corresponding environments. Furthermore, it is likely that during 
the processing of subsequent goals, only some of the environments will be consistent 
with the head unification for certain clauses; the remaining environments will need 
to be stored in choice points, so that on backtracking, the environments can be 
reactivated for consideration by additional clauses.⁶
For many combinatorial search problems, all solutions to the generators have 
the same shape in the above sense. Yet, there is an even stronger condition that 
increases the efficiency of data parallelism. The point is that even if a variable 
is bound to a term of the same shape in each environment, the variable might be
bound to a different such tree in each environment (see example below). To increase 
the efficiency of data parallelism, it is desirable for such a variable to be bound 
globally, as an engine variable, to a term of the given shape, with multivariables at 
the leaves; only these multivariables should vary among the environments. 
So if $\theta$ is the shape of all solutions to a goal disj G, it is useful to prebind the
variables in G according to the bindings in 0. The term vO, where v is one of the 
variables bound by 0, is called a template. This optimization preserves correctness 
since, by assumption, all solutions have the given shape and new (multi) variables 
are used at the leaves. 
For example, suppose on entry to the call 
disj delete(L, Item, Remainder)
L is bound in the Prolog engine to a list of length n. Then, using the standard
definition for delete/3, after the call, Item will be bound in each environment to
an element of L, and Remainder will be bound in each environment to a (different)
list, of length n - 1, of the remaining elements. To avoid the redundant allocation of
multiple lists and to simplify multiunification, it is beneficial to prebind Remainder
to a template, in this case, a list of length n - 1:
multi_list(L, Remainder), disj delete(L, Item, Remainder).
⁶ The reader will probably not fully comprehend this point until s/he understands Section 4.
Here, we assume that multi_list binds its second argument to a list of multivari- 
ables of length one less than the length of its first argument. Templates can be 
applied to data structures other than lists in an analogous way, binding the engine 
variable to (approximately) the most specific generalization of the values in the 
multiple solutions. 
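
The effect of prebinding Remainder to a template can be sketched in Python. This is an illustration under assumptions, not the MultiLog implementation: Python lists stand in for Prolog lists, dictionaries stand in for binding environments, and the function names and the multivariable names `R0`, `R1`, ... are hypothetical.

```python
def multi_list(lst):
    """Template: a list of fresh multivariable names, one shorter than lst."""
    return [f"R{i}" for i in range(len(lst) - 1)]

def disj_delete(lst, template):
    """Collect one environment per solution of delete(lst, Item, Remainder).

    The spine of Remainder is the shared template; each environment only
    binds Item and the multivariables at the leaves of the template.
    """
    envs = []
    for i, item in enumerate(lst):
        remainder = lst[:i] + lst[i + 1:]
        env = {"Item": item}
        env.update(zip(template, remainder))   # bind only the leaves
        envs.append(env)
    return envs

template = multi_list(["a", "b", "c"])
envs = disj_delete(["a", "b", "c"], template)
```

Here the list structure of Remainder exists once, as `template`, and each of the three environments is a finite domain substitution over `Item`, `R0`, and `R1`.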
This optimization technique (due in large part to Tim Hickey) has proved crucial 
to the performance of MultiLog. Space is saved since the term representing the 
shape is created but once. Multiunification is simplified since each environment 
can be a finite domain substitution. In the common case where the multiunification 
processors are slower than the engine processor, time is saved as well. 
We mention, finally, that templates might be beneficial even for an SLD imple- 
mentation since, with templates, the multiple executions caused by backtracking 
would not all need to recreate the same-shaped term. However, in our tests, when 
delete/3 above is run using templates, execution is slower since, when the
template Remainder gets unified with a ground list, all of its component head variables
must be unified with elements of that list. 
3.3. Two-Dimensional Backtracking 
In our description of multi-SLD resolution above, we assumed that at each step 
of execution, the disjunction of substitutions comprising the constraint component 
is reduced to disjunctive normal form. We also mentioned that logically it would 
be sufficient for just one disjunction to be solvable. It turns out that there is an 
interesting operational semantics, which we call two dimensional backtracking, that 
corresponds to the latter option. 
In a multi-SLD interpreter incorporating two-dimensional backtracking, the con- 
straint component is solved lazily, on demand, and can be represented as a stream 
(a lazy list). At each step, it is known that at least one substitution, at the head 
of the stream, is compatible with the sequence of multiresolution steps up to the 
current point, but it is not yet known whether the remaining substitutions are com- 
patible. When the selected atom is unified with the head of a clause, unification 
occurs in the single substitution at the head of the stream. 
If unification succeeds in that substitution, then execution moves forward, with 
an updated stream of substitutions: the new head substitution contains the compo- 
sition of the old head substitution and the mgu of the selected atom and the head 
of the clause; the tail of the stream contains a procedure that, when invoked, will 
lazily test whether the unifications up to this point are consistent. 
If unification fails in that substitution, then instead of backtracking to a previous 
clause (control backtracking), the tail of the stream of substitutions is evaluated
to determine if the next substitution (if any) is consistent with the current and 
previous unifications. If so, execution continues forward. If not, the next tail of the 
stream is evaluated, and so on. Only if no substitutions remain in the stream will 
backtracking occur to a previous clause. 
Thus, with two-dimensional backtracking, backtracking occurs both in the con- 
trol (returning to an earlier clause) and in the data (evaluating the tail of the 
stream of substitutions). Two-dimensional backtracking can be generalized to use 
n streams instead of regular streams: the system eagerly evaluates unification in up 
to n substitutions at once. Also, the technique can be generalized so that solutions 
to a disj goal are collected lazily, on demand.
Two-dimensional backtracking may be valuable for controlling the memory usage 
of multi-SLD interpreters. 
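
The data dimension of two-dimensional backtracking can be sketched in Python with a lazily consumed iterator. The sketch rests on several assumptions that are not part of the description above: substitutions are dictionaries, a unification step is a function from a substitution to an extended substitution or `None`, and the names `LazyStream` and `equals` are hypothetical. Undoing steps on control backtracking is omitted.

```python
class LazyStream:
    """Constraint component as a stream: an evaluated head, an unevaluated tail."""

    def __init__(self, substitutions):
        self._rest = iter(substitutions)   # tail, evaluated only on demand
        self._steps = []                   # unifications performed so far
        self.head = next(self._rest, None)

    def unify(self, step):
        """Apply step to the head; on failure, evaluate the tail.

        Returns False when no substitution remains, i.e., when control
        backtracking to a previous clause is required.
        """
        self._steps.append(step)
        new_head = step(self.head) if self.head is not None else None
        while new_head is None:
            candidate = next(self._rest, None)
            if candidate is None:
                self.head = None
                return False
            new_head = self._replay(candidate)   # test all earlier steps too
        self.head = new_head
        return True

    def _replay(self, env):
        for step in self._steps:
            env = step(env)
            if env is None:
                return None
        return env

def equals(var, value):
    """Step that unifies var with a constant value."""
    def step(env):
        if var not in env:
            return {**env, var: value}
        return env if env[var] == value else None
    return step
```

A usage example: with the stream `[{"X": 1, "Y": 1}, {"X": 2, "Y": 1}, {"X": 2, "Y": 2}]`, the step `equals("Y", 1)` succeeds in the head, `equals("X", 2)` fails there and moves on to the second substitution, and a subsequent `equals("Y", 2)` exhausts the stream, so control backtracking would follow.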
4. THE MULTI-WAM ARCHITECTURE
This section describes a natural extension of the Warren Abstract Machine (WAM) 
[1] for execution of MultiLog. The architecture, which we call the Multi-WAM, is 
appropriate for both sequential and parallel implementation. And within parallel 
implementations, it is mappable to both MIMD and (preferably with restrictions) 
SIMD computers. The architecture has been implemented in C by extending a
WAM interpreter based on [1]. Prototype implementations are running on: 
• various uniprocessor computers, using a single processor both to manage 
multiple environments and to execute Multi-WAM instructions, 
• the BBN TC2000 Butterfly and KSR1 (MIMD computers), 
• the MasPar MP-1208 (a SIMD computer). 
For the reader unfamiliar with the details of the WAM, the overview of the Multi- 
WAM in Section 4.1 should be sufficient background for understanding Sections 5 
and 6. To completely understand the remainder of this section, the reader should 
be familiar with the WAM. We recommend [1] as an accessible tutorial. 
4.1. Overview of the Multi-WAM Architecture
The Multi-WAM architecture (Figure 3) contains two main components: 
• a Prolog engine, for executing control instructions and for performing unifi- 
cations involving engine variables; and 
• a set of unification workers, for servicing unification and environment ma- 
nipulation requests from the Prolog engine. 
Each unification worker is responsible for managing unifications involving
multivariables in some number of binding environments.
The key to understanding the Multi-WAM is to grasp how the engine/multi
distinction is realized. Engine variables exist only in the Prolog engine, and contain
bindings resulting from multiresolution steps outside of, and independent of, any 
disj goals. By design, engine variable bindings cannot vary among binding environ-
ments. In contrast, multivariables, which are analogous to the conditional variables 
of Aurora [35], are stored in the workers,⁷ in a binding vector and contain bindings
resulting from disj goals. Multivariable bindings typically contain different values
in different environments. When stored as values, multivariables are distinguished 
from other sorts of values (engine variables, constants, lists, and structure terms) by 
the tag bits: multivariables have the MULTI tag. Each multivariable corresponds 
to a single element of the binding vector, and when the multivariable gets bound, 
the binding is stored in that element. 
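
The role of the MULTI tag can be illustrated by a dereferencing routine. The Python sketch below is an assumption-laden illustration rather than the prototype's C code: the class `Cell`, the tag name `CONST`, and the function `deref` are hypothetical, and a binding vector is modeled as a Python list in which `None` marks an unbound element.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    tag: str        # "MULTI" for multivariables; "CONST" is an assumed tag name
    value: object   # a constant, or for MULTI an index into the binding vector

def deref(cell, binding_vector):
    """Follow MULTI references through one environment's binding vector."""
    while cell.tag == "MULTI":
        bound = binding_vector[cell.value]
        if bound is None:               # unbound multivariable
            return cell
        cell = bound
    return cell

x = Cell("MULTI", 0)
env1 = [Cell("CONST", 3)]               # x is bound to 3 in this environment
env2 = [Cell("CONST", 4)]               # x is bound to 4 in this one
```

The same tagged value `x` dereferences to a different constant in each environment, because the index it carries selects an element of whichever binding vector is consulted.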
When solutions to a disj goal are collected, each solution, represented by a
binding vector in a worker, is copied. 
⁷ In fact, as an optimization, multivariables are sometimes stored in the Prolog engine as well.
We explain why below, in Section 4.2. 
[Figure 3: the Prolog engine, executing Multi-WAM instructions, sends unification
requests and environment manipulation requests to workers #1 through #N, each of
which holds a number of environments.]
FIGURE 3. Multi-WAM architecture.
The engine/multi distinction has several benefits. First, it leads to a decreased
size of worker environments as compared to a design in which all variables were
multivariables; this allows for faster copying of environments during disj goals.⁸
Second, the engine/multi distinction speeds up unification since instructions not
involving multivariables can execute entirely in the Prolog engine. In particular,
all code that precedes the first disj can execute sequentially. Even after multiple
environments have been created, many variables are bound "globally" in the Prolog
engine, independently of these environments (see example in Section 3.2); such
variables can be allocated as engine variables. Third, work corresponding to engine
unifications is performed once per set of solutions collected by disj goals; in Prolog,
the corresponding work would be performed once per solution. In this sense,
engine variables can be used to "factor out" work that is common to the testing of
multiple solutions to a generator goal.
⁸ An alternative to copying environments is to let idle processors track active processors, as in
Firebird [51].
When execution of a MultiLog program begins, there is a single active worker
environment, and there are no allocated multivariables. As long as no disj goal
is executed, unifications will involve engine variables only, and the overall behavior
would be much the same as for standard Prolog.
When the first disj goal is entered, a subcomputation is begun to collect
solutions to that goal.
If the disj goal succeeds, the solution is saved: the binding vector representing
the solution is copied. (As a consequence, any variable that gets bound during
execution of a disj subcomputation must be a multivariable, so that the binding
can be saved when the binding vector gets copied. Either the compiler must
pre-allocate such a variable as a multivariable, or the Prolog engine must convert
any engine variable that gets bound inside a disj into a multivariable. We discuss
this point further in Section 4.8 below.) Then execution backtracks so that
additional solutions to the disj goal can be collected. During backtracking, bindings
are, in general, undone both in the Prolog engine and in the environments of the
active workers; but the saved binding vectors are unaffected by backtracking
at this stage. The disj goal can succeed an arbitrary number of times.
For each success, the binding vector is copied. When sufficient (or all) solutions
have been collected--here, "sufficient" depends on the amount of available memory
(in MIMD MultiLog) or on the number of processors (in SIMD MultiLog)--the
solutions need to be activated, so that multi-SLD resolution can continue with
subsequent goals that follow that first disj goal, in the context of the saved
solutions. Solutions are activated by adding the saved binding vectors to the list
of active worker environments (or by resetting a flag, in the SIMD implementation).
So, at the start of the first disj goal, there was a single active environment;
but after solutions have been collected, there are, in general, multiple active
environments.
Suppose the system collected 100 solutions to the disj goal, and is now working
on subsequent goals (e.g., a test procedure in a (fused) generate-and-test program).
If the Prolog engine now tries to unify a multivariable with anything other than an
unbound engine variable, the Prolog engine sends a unification request to each of
the workers; each worker in turn performs the unification in each of that worker's
environments. If the unification succeeds in a given environment--in which case we
also say that the environment succeeded on that unification--then the bindings (if
any) are stored in that environment. If the unification fails in an environment--we
say that the environment failed the unification--then that environment is
deactivated. If the unification fails in the Prolog engine or in all worker environments,
then backtracking must occur. If control backtracks to the disj goal, additional
solutions will be collected. If the disj goal has no solutions left at all, then it fails,
and control backtracks to some previous choice point, if any. If unification fails in
only some of the worker environments and does not fail in the Prolog engine, then
execution continues forward with the surviving environments.
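
The request/response pattern just described can be sketched in Python for the simplest case, unification of a multivariable with a constant. The sketch makes several assumptions: environments are dictionaries from multivariable names to constants, the names `Worker`, `unify_constant`, and `broadcast_unify` are hypothetical, and trailing and choice points are left out.

```python
class Worker:
    def __init__(self, environments):
        self.active = list(environments)   # binding vectors (dicts here)
        self.deactivated = []

    def unify_constant(self, mv, value):
        """Unify multivariable mv with a constant in every active environment.

        Returns the number of environments that survive the unification.
        """
        survivors = []
        for env in self.active:
            if mv not in env:
                env[mv] = value            # unbound: store the binding
                survivors.append(env)
            elif env[mv] == value:
                survivors.append(env)      # consistent: nothing to store
            else:
                self.deactivated.append(env)
        self.active = survivors
        return len(survivors)

def broadcast_unify(workers, mv, value):
    """Engine side: send the request to every worker.

    Returns False when the unification failed in all environments, in
    which case the engine would have to backtrack.
    """
    return sum(w.unify_constant(mv, value) for w in workers) > 0
```

Every worker receives the request even if an earlier worker already has survivors, since each environment must be tested independently.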
Suppose, now, that execution has continued forward past our initial disj goal
and has reached another disj goal. Because of failed unifications, there may now
be only, say, 75 environments active on entry to the second disj goal. For each
success of this second disj goal, the surviving environments ($\le 75$ of them) are
copied and set aside. When enough (or all) solutions have been collected, they are
all reactivated and execution again continues with subsequent goals.
It is important, then, to realize that execution of the Multi-WAM consists of two 
interleaved phases: 
1. execution outside of any disj goal (normal multiresolution steps). This
execution is much the same as standard Prolog execution, except that there are
multiple substitutions active;
2. execution inside a disj goal (disj multiresolution step), during which time
solutions are collected in a failure-driven loop.
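
The collection phase can be summarized by the following Python sketch. It is a simplification under stated assumptions: each success of the disj goal is modeled as a dictionary of new multivariable bindings that is compatible with every active environment, environments are dictionaries, and the name `run_disj` and its `limit` parameter are hypothetical.

```python
def run_disj(active_envs, solutions, limit=None):
    """Collect solutions to a disj goal and return the new active environments.

    solutions: an iterable of dicts, one per success of the goal, holding
               the multivariable bindings made by that success.
    limit:     stop collecting once at least this many environments have
               been saved (a stand-in for "sufficient" solutions).
    """
    saved = []
    for bindings in solutions:                 # one iteration per success
        for env in active_envs:
            saved.append({**env, **bindings})  # copy the binding vector
        if limit is not None and len(saved) >= limit:
            break                              # enough solutions for now
    return saved                               # activation of saved vectors
```

With 2 active environments and 3 successes, 6 environments become active afterwards; the saved copies are independent of the originals, so undoing bindings in the originals during the failure-driven loop would not disturb them.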
Multi-SLD incorporates a tradeoff between time (backtracking) and space (multiple 
environments); a program that performs an exponential amount of backtracking 
may, using multi-SLD, use an exponential amount of memory. For example, the
bits program of Section 3.2 is one such program. But a multi-SLD interpreter can
always revert to backtracking when free memory is in short supply.
One more point needs to be explained to complete this overview of the Multi- 
WAM. We said earlier that when an environment fails a unification, the environment 
is deactivated. It might seem that the memory blocks of such failed environments 
can be immediately deallocated and placed back on the free list. This is not al- 
ways so. To support backtracking to untried clauses for a previous goal, it is, in 
general, necessary to undo the deactivation of an environment that fails some uni- 
fication (just as it is necessary to undo the binding of variables on backtracking). 
More precisely, if there exists an extant choice point for an atomic goal that was
selected subsequently to the disj goal that resulted in the creation of the failed
environment, then on backtracking, that failed environment will need to be
reactivated. For example, for the path-finding program in Section 1.1, suppose that
disj edge(StartNode,N) has already been called one or more times in the
multiderivation so that there are multiple environments. Suppose further that
StartNode is bound to 3 in one environment and to 4 in another environment. Then,
if disj edge(StartNode,N) is called again, there are multiple clauses that may unify
with the selected atom, edge(StartNode,N). Selecting the unit clause edge(3,2) will
result in success for the first environment (where StartNode is bound to 3), but
not for the second environment (where StartNode is bound to 4). But since there
is also the later clause edge(4,3), the environment with StartNode bound to 4
needs to be retained in a suspended, trailed state for later reactivation.
In short, failed environments can be deallocated (and their space reused) only
when they are not protected by choice points. Otherwise, the environments must be
temporarily deactivated and associated with the protecting choice point, so that on
backtracking, they can be reactivated. The data structure holding such temporarily
deactivated environments is called the environment trail, and the act of saving
environments here is called environment trailing. For some programs, environment
trailing has a large impact on performance. The need to trail environments is
analogous to the need to trail the bindings of variables in the standard WAM (and
Multi-WAM) when the variables are protected by a choice point.
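
Environment trailing can be sketched in Python as follows. The sketch is illustrative only and assumes a logical clock for creation times; the classes `ChoicePoint` and `Env` and the functions `fail_environment` and `backtrack` are hypothetical names, and the undoing of variable bindings on backtracking is omitted.

```python
class ChoicePoint:
    def __init__(self, created_at):
        self.created_at = created_at   # logical time of creation
        self.env_trail = []            # environments deactivated since then

class Env:
    def __init__(self, bindings, created_at):
        self.bindings = bindings
        self.created_at = created_at

def fail_environment(env, active, choice_stack):
    """Deactivate env; trail it if a choice point protects it, else free it."""
    active.remove(env)
    top = choice_stack[-1] if choice_stack else None
    if top is not None and top.created_at > env.created_at:
        top.env_trail.append(env)      # protected: keep for reactivation
        return "trailed"
    return "freed"                     # unprotected: space can be reused

def backtrack(active, choice_stack):
    """Return to the newest choice point and reactivate its trailed environments."""
    top = choice_stack[-1]
    active.extend(top.env_trail)
    top.env_trail.clear()

# The edge/2 example: both environments predate the choice point for edge/2.
env3 = Env({"StartNode": 3}, created_at=1)
env4 = Env({"StartNode": 4}, created_at=1)
active = [env3, env4]
stack = [ChoicePoint(created_at=2)]
outcome = fail_environment(env4, active, stack)   # clause edge(3,2) is tried
```

When the clause edge(3,2) is tried, the environment with StartNode bound to 4 fails but is trailed rather than freed; calling `backtrack(active, stack)` before the clause edge(4,3) is tried makes it active again.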
This overview points up the following essential features of the Multi-WAM 
execution model: 
• the distinction between engine variables and multivariables, 
• the related distinction between the Prolog engine and the workers, 
• the use of a binding vector to represent bindings to multivariables, 
• the need to allocate multivariables for logic variables that get bound during
disj goals,
• the need to copy binding vectors when a disj goal succeeds,
• the need to activate saved binding vectors when a disj goal has collected
sufficient solutions,
• the fact that each worker manages unifications in one or more
environments,
• the need to perform unification in workers when a multivariable gets unified
with anything other than an engine variable,
• the need to trail environments when they are protected by choice points,
• the fact that multi-SLD involves a tradeoff between time (backtracking) and
space (multiple environments), and
• the fact that multi-SLD can resort to backtracking whenever memory use is
getting too high.
Many details that have been glossed over in this discussion will be clarified in the 
following sections. 
4.2. Prolog Engine Data Structures 
All implementations of the Multi-WAM--sequential, MIMD, and SIMD--share a 
common core of data structures and routines in the Prolog engine. Figure 4 shows 
these data structures: a combined choice/control stack, a heap, a trail, a code area, 
and a new data area: the binding vector. 
The reader may be wondering why the binding vector, which holds bindings of 
multivariables in the workers, need exist in the Prolog engine as well. Here is why. 
If it ever happens that a multivariable gets unified with a nonvariable term (that is,
a constant or a structure) while control is outside all disj goals, then, as always,
environments inconsistent with the unification must be deleted; but furthermore, 
since the binding applies globally to all environments, it is more efficient, for the 
sake of clause indexing [22] and to hasten subsequent unifications, to store the 
binding of that multivariable globally in the Prolog engine. Hence, the need for 
the binding vector in the Prolog engine. 
[Figure 4: the Prolog engine contains the stack (control frames and choice frames),
the trail, the heap (in shared memory), the code area, the binding vector, and the
registers H, E, P, CP, B, HB, TR, S, B0, mode, V, VB, disj_flag, WB, Disj_B, max_H,
and max_V. A control frame holds the previous E, the First_Multi-Y index, and
Y1...YN. A choice frame holds the count (m) of A registers, A1, ..., Am, and
E, CP, B, P, TR, H, V.]
FIGURE 4. Prolog engine data structures.
The binding vector has three sections: a small, fixed-size section at the start
for multi-X variables, a large central section for multiheap variables, and a stack
at the end for multi-Y variables. The multiheap section grows towards higher
indices, while the multi-Y section grows toward lower indices. In current prototype
implementations, their sizes are fixed by command-line parameters, as is the total
size of the variable vector. A more robust implementation would need to do stack
copying or some such more sophisticated memory management technique.
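
The three-section layout can be sketched in Python. This is an illustration under assumptions rather than the prototype's C code: the class name `BindingVector`, its constructor parameters, and the field `y_low` are hypothetical, while `V` follows the engine register of the same name. Deallocation on backtracking and last call optimization is omitted.

```python
class BindingVector:
    """Fixed-size vector: [ multi-X | multiheap (grows up) | multi-Y (grows down) ]."""

    def __init__(self, multi_x_size, multiheap_size, multi_y_size):
        self.heap_start = multi_x_size
        self.heap_limit = multi_x_size + multiheap_size
        self.size = self.heap_limit + multi_y_size
        self.cells = [None] * self.size
        self.V = self.heap_start   # next free multiheap index, grows upward
        self.y_low = self.size     # lowest multi-Y index in use, grows downward

    def alloc_multiheap(self):
        """Allocate one multiheap variable on demand and return its index."""
        if self.V >= self.heap_limit:
            raise MemoryError("multiheap section full")
        index = self.V
        self.V += 1
        return index

    def alloc_multi_y(self, count):
        """Allocate count multi-Y variables for a new control frame.

        Returns the lowest index of the new block, which the frame could
        record as its first multi-Y index.
        """
        if self.y_low - count < self.heap_limit:
            raise MemoryError("multi-Y section full")
        self.y_low -= count
        return self.y_low
```

Because the section sizes are fixed, either allocator can run out of space while the other section still has room, which is the limitation that a more robust memory management scheme would remove.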
Multi-X registers are used together with multi-Y variables for arithmetic, for 
parameter passing, and for avoiding allocation of multiheap variables in much the 
same way that X registers and Y variables are used in the WAM. A multi-X register 
or multi-Y variable is used only when the value stored in the corresponding engine 
X register or Y variable needs to vary among environments. (In such a case, the 
X register or Y variable gets bound in the Prolog engine to a value whose tag field 
is MULTI, and whose address field contains the index in the binding vector of the 
corresponding multi-X register or multi-Y variable.) As a consequence, the number 
of multi-X registers in use is at most equal to the number of X (A) registers in use 
since multi-X registers are stored in choice points in each environment, just as X 
registers are stored in choice points in the Prolog engine (and in the WAM). 
In contrast, a multi-Y variable is allocated for each Y variable in the control 
stack, whether or not the multi-Y variable actually gets used. This is different 
from the case with multi-X registers and multiheap variables, which are allocated 
on demand. The space for these unused multi-Y variables is thus wasted. In brief, 
the reason is that a Y variable can get bound to a multivalue at any time during its 
lifetime, and the variables have to be allocated in the order of creation, not in the 
order of access, so they can be easily deallocated during last call optimization and
backtracking.⁹ An unimplemented, complex, alternative design avoids the wasted
space by keeping track of allocated multi-Y variables in lists. Alternatively, compiler
analysis might determine that some Y variables will be bound independently of disj
goals, and so can safely be allocated without a corresponding multi-Y variable.
The number of multiheap and multi-Y variables in use increases dynamically dur- 
ing execution and decreases during backtracking. The engine register V is the index 
of the next free multiheap variable. For multi-Y variables, the index is obtained 
from a slot (First_Multi-Y_Index) in the control frame at E. VB is analogous to
HB, and is used to determine whether bindings of multiheap variables have to be
trailed. The disj_flag is true iff execution is inside a disjunctive goal. WB is the
(global) choice pointer for the worker choice stacks. Disj_B marks the value of B
when a disj goal is encountered. The variables max_H and max_V keep track of
the maximum values obtained by H and V during a disj goal; during backtracking
within a disj goal, H and V cannot be reset to values below max_H and max_V,
respectively; otherwise, needed data might be lost.
The data structures in the unification workers vary somewhat, depending on 
whether the implementation is for uniprocessor, MIMD, or SIMD computers. In 
the next section, we describe the worker data structures and their use for both 
MIMD and uniprocessor Multi-WAM, the only significant difference between these 
two implementations being the number of processors involved, along with the con- 
sequent need for load balancing. Section 4.6 explains SIMD Multi-WAM. 
⁹ We have noticed that the use of multiple environments instead of backtracking for a non-
deterministic goal tends to decrease the size of the local stack; this is because fewer frames are 
protected by choice points. 
[Figure 5: each MultiLog worker holds a worker choice stack (whose entries contain
Saved_WTR, a list of trailed environments, and multi-X registers), a worker trail, a
list of active environments, and a list of saved environments. Each environment has
a multi_S register and a choice_ptr_at_creation register. The worker registers are
WTR (worker trail pointer) and the worker choice pointer.]
FIGURE 5. Worker data structures (uniprocessor and MIMD).
4.3. MIMD and Uniprocessor Multi-WAM
Figure 5 shows the data structures in each worker of the MIMD and uniprocessor
implementations: a trail, a list of active environments, a list of saved
environments for disj goals,¹⁰ and a choice stack for managing backtrack requests from
the Prolog engine. Notice that there is one global trail per worker, not one trail
per environment. Each entry of the choice stack contains a worker trail pointer
Saved_WTR to be restored on backtracking, multi-X registers pushed by TRY or
TRY_ME_ELSE, and a pointer to a list of environments created before the choice
point, but deactivated by failures occurring since that choice point was created.¹¹
The stored environments will need to be reactivated if control returns through the
choice point. Only those multi-X registers which correspond to X registers that are
bound to multi-X registers need be pushed onto the worker choice stacks. There is
no heap in the workers: all list and structure cells exist on the Prolog engine's shared
heap. Of Prolog engine data structures, only the heap is stored in shared memory;
worker environments never refer to the stack since multivariables are bound only
to constants, to other multivariables, or to pointers into the heap.
When a unification of a multivariable with something other than an engine
variable occurs, the Prolog engine sends a unification request to each worker, which
iterates through its subpool of environments, performing the unification in each
environment. Unless a multivariable was allocated subsequently to the most recent
choice point, bindings to that multivariable have to be trailed in workers. If the
compiler could determine that all instances of a given multivariable needed to be
trailed, then the trailing could be done globally for all environments in the Prolog
engine. We have not implemented this optimization. The use of a strong mode
system, as in Mercury [48], would help obviate the need for trailing.
¹⁰ This design allows only a single disj goal to be collecting solutions at one time. (The design
still allows multiple disj goals in a single multiderivation.) To allow nested disj goals, one needs
a stack of lists of saved environments.
¹¹ The state variable of SIMD MultiLog (Section 4.6) provides a more elegant way of managing
environments.
After multiunification, successful environments are extended with bindings. 
Failed environments are deactivated, but the space is reclaimed only when the 
environment is not protected by a choice point. Register choice_ptr_at_creation
is used to determine whether an environment that fails a unification is protected 
or not. Environment register multi_S is analogous to the S register of the WAM, 
and is used by GET_ and UNIFY_ instructions to point to subarguments on the 
heap during multiunification. 
The design of uniprocessor MultiLog is identical to the MIMD design, except that
a single process(or) acts as both the Prolog engine and the sole unification worker; 
also, load balancing (next section), synchronization, remote memory access, and 
remote procedure call are not needed. 
The simplicity and small size of the data structures needed to manage multiple 
environments enable the use of many more binding environments than physical 
processors: in effect, MultiLog manages multiple "virtual" workers per physical 
processor. In contrast, in standard control or-parallel systems, each worker needs 
to maintain all of the data structures of a WAM engine, and so these systems 
generally have one worker per processor. 
4.4. Load Balancing
Since each worker in MIMD MultiLog maintains a pool of active environments, it is 
necessary to perform load balancing. When a disj goal returns an answer, a request
is sent to each worker to copy its active environments. At that time, a decision is 
made whether to copy locally (the preferred choice since local copying is faster and 
private memory and caching can be more easily used) or to another worker. 
Good load balancing is crucial for MIMD MultiLog, just as it is for control 
or-parallel systems like Aurora and Muse. In MIMD MultiLog, the task of load
balancing is to distribute environments evenly among workers, taking into 
account: 
• the possibility of idle workers if environments are not evenly distributed;
• the cost of remote copying;
• the cost of using shared versus unshared memory. On the BBN TC2000,
shared memory is not automatically cached, while private memory is cached.
To the extent that environments can be copied locally, the benefits of caching
will be realized. On the KSR1, with its All-Cache (TM) memory architecture,
it is still true that local memory accesses are faster.
Our current, naive load balancing procedure works as follows. Each time a disj
goal succeeds, so that binding vectors need to be copied, a global counter c is
incremented, and each worker w compares the number of environments allocated locally
to the number allocated in remote worker c + w (modulo the number of workers). If
the remote number is less than 1/k times the local number, for global parameter k, then
the answer is saved remotely; otherwise, the answer is saved locally. The
parameter k can be set on the command line. A too low value for k (e.g., k = 1) causes
too much remote copying; a too high value leads to unbalanced work loads among
workers. Performance is sensitive to the value chosen for k. For most examples, we 
used k = 2. 
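The procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the MultiLog implementation: the names `choose_target` and `balance_round` are hypothetical, environment counts are passed in as a plain list, and the guard for the case where worker c + w is the worker itself is an added assumption.

```python
def choose_target(w, c, env_counts, k=2):
    """Return the worker that should store worker w's newly saved answer.

    env_counts[i] is the number of environments allocated in worker i;
    c is the shared global counter (already incremented for this round).
    """
    n = len(env_counts)
    remote = (c + w) % n
    # Save remotely only if the remote load is below 1/k of the local load.
    if remote != w and env_counts[remote] * k < env_counts[w]:
        return remote
    return w  # otherwise copy locally, the cheaper and preferred case


def balance_round(c, env_counts, k=2):
    """Targets chosen by all workers in one round, for a shared counter c."""
    return [choose_target(w, c, env_counts, k) for w in range(len(env_counts))]
```

Because every worker uses the same c, the candidate remote targets (c + w) mod n are all distinct, which is why no two workers copy remotely to the same processor in a given round.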
Because the counter c is repeatedly incremented, each worker "cycles" through 
the other workers while looking for a remote target for possible copying; so no 
worker is likely to be chronically starved or oversupplied with work. Furthermore, 
since each worker has the same value for counter c, it never happens that multi- 
ple workers copy remotely to the same processor; this lessens memory contention. 
Since, on the average, environments should be equally likely to fail unifications in 
one worker as in another worker, remote copying should be performed rarely for 
runs with many disjuncts. 
This simple load balancing method leads to a good balance for most examples we 
have tried. Moreover, on larger examples, it copies locally about 95% of the time,
copying elsewhere mostly at the start of execution when there are few environments. 
In contrast, when we use a random load balancing procedure which copies environ- 
ments to randomly chosen workers, performance is significantly poorer. However, 
the method is still inadequate. For certain programs, work is poorly distributed and 
performance suffers. The problem, we think, is that load balancing is performed 
only when answers are saved. Suppose a disj early in the program generates many
environments, but later on, after the disjunctive goal has already finished, the
workload becomes unbalanced due to a disproportionate number of failures in a given
worker. The current load balancing procedure, described above, would not be able
to restore balance. It may be necessary to perform load balancing at times other
than when a disj goal succeeds, for example, after certain multiunifications.
4.5. An Alternative Representation for Substitutions 
In the design described above, each substitution is represented by a contiguous vec- 
tor of bindings. An alternative representation associates with each multivariable 
a vector of bindings--one for each substitution--so that all of the bindings for a
given multivariable are contiguous. (The difference is analogous to the difference
between storing a two-dimensional array in column-order versus row-order.) This
second representation, which is similar to the version vectors model of [24], should
result in better locality of reference during unification. For example, suppose that
during execution of the instruction GET_CONSTANT 1 A3, A3 dereferences to a
multivariable. Then the unification of this multivariable with the constant 1 must be
performed in each substitution. This will be faster if all bindings for this vari- 
able are contiguous. However, a disadvantage of the second representation is that 
environment copying will be more difficult since environments will no longer be 
contiguous in memory. We are implementing both designs and investigating their 
relative performance. 
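The two layouts can be contrasted with a small Python sketch. It is an illustration only: the function names are hypothetical, bindings are plain Python values with `None` standing for an unbound cell, and dereferencing of variable-to-variable bindings is left out.

```python
UNBOUND = None


def get_constant_rows(envs, var, const):
    """Substitution-major layout: envs[s][var]. Returns surviving indices."""
    alive = []
    for s, env in enumerate(envs):
        if env[var] is UNBOUND:
            env[var] = const          # bind the variable in this substitution
            alive.append(s)
        elif env[var] == const:
            alive.append(s)           # already bound to the same constant
    return alive


def get_constant_columns(columns, var, const):
    """Variable-major layout: columns[var][s]; one contiguous scan."""
    col = columns[var]
    alive = []
    for s, value in enumerate(col):
        if value is UNBOUND:
            col[s] = const
            alive.append(s)
        elif value == const:
            alive.append(s)
    return alive
```

Both functions compute the same set of surviving substitutions; the second touches a single contiguous vector, which is the locality argument made above, while copying one whole environment would require gathering one cell from every column.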
4.6. SIMD Multi-WAM 
Unification in multiple environments is not a data parallel operation in the narrow
sense since variables can, in general, be bound to different sizes and shapes of terms
in different environments. And while it is possible to do data parallel unification
when terms vary in size and shape, the implementation (e.g., memory management) 
becomes more complex. Furthermore, processors end up spending much of their 
time idling during unification operations for which their terms are of the wrong 
shape. Accordingly, SIMD MultiLog makes the following finite domain restriction: 
[FIGURE 6. Worker data structures (SIMD MultiLog). Each worker holds a worker trail, a worker choice stack, X registers, and one environment, plus the registers state (ACTIVE, INACTIVE, SAVED, TRAILED), choice_index_at_creation, trailed_choice_ptr, and WTR (worker trail pointer).]
variables in worker environments may get bound only to other variables^12 or to
constants, but not to lists or other "structure" terms. This restriction is reasonable,
given the distinction between engine variables and multivariables, since many pro- 
grams either already obey the restriction, when appropriate variables are made into 
engine variables, or can be converted, using templates (Section 3.2.1), to equivalent 
programs obeying the restriction. Currently, the conversion is done manually, and 
it is a research question as to how to automate this task. 
In SIMD MultiLog, there is one environment per worker and, in the current
implementation, one worker per SIMD processor.^13 Instead of using lists of active,
saved, or trailed environments, there is a state variable that determines the status
of a worker's environment. Figure 6 shows the data structures in each SIMD worker.
Figure 7 depicts environment management in the SIMD workers; in that figure, each
square represents a processor containing the data structures depicted in Figure 6.
The state variable is bound to one of four values: ACTIVE (for processors
participating in unifications), INACTIVE (for unused processors), SAVED (for processors
whose environments are saved as answers to a disjunctive goal), and TRAILED (for
environments trailed with a choice point). The trailed_choice_ptr points to the
choice point active at the time the TRAILED environment failed a unification;
when control backtracks to that choice point, the environment is made ACTIVE
12. Making the further restriction that multivariables can be bound only to constants, but not to
other variables, would yield an even simpler (although less expressive) system since dereferencing
in workers would then be unnecessary; cf. [48].
13. On a machine like the Thinking Machines CM2, the memory of each physical processor can
be divided into multiple sections for use by multiple "virtual" workers; the data parallel C
compiler supports such virtual workers automatically, and arranges for iteration over the component
sections. However, on our target computer, the MasPar MP1, there is no support for virtual
workers, and we did not attempt to simulate them ourselves.
[FIGURE 7. Environment management in SIMD MultiLog. There is one environment per processor, in one of the states Active, Inactive, Saved, or Trailed. Environments which fail a unification become Trailed if protected by a choice point; otherwise they become Inactive. During SAVE_AS_ANSWER, Active environments are copied to Inactive processors, where they become Saved. During ACTIVATE_SAVED_ANSWERS, Saved environments become Active. During backtracking, certain Trailed environments become Active and certain Active environments become Inactive.]
again. Copying is done by enumerating INACTIVE environments and using the 
MP1's global router to copy binding vectors to the appropriate processors; copying
locally via the XNET (mesh) links seems infeasible since unused processors will 
not, in general, be nearby. 
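The per-processor state transitions described above can be sketched in Python. This is a simplified illustration, not the MasPar code: the `Worker` class and both functions are hypothetical, choice points are represented by arbitrary comparable labels, and the transition of certain Active environments to Inactive during backtracking is not modeled.

```python
from enum import Enum


class State(Enum):
    ACTIVE = 1
    INACTIVE = 2
    SAVED = 3
    TRAILED = 4


class Worker:
    def __init__(self, state=State.INACTIVE):
        self.state = state
        self.trailed_choice_ptr = None


def fail_unification(worker, current_choice_point):
    """An ACTIVE environment fails: trail it if a choice point protects it."""
    if current_choice_point is not None:
        worker.state = State.TRAILED
        worker.trailed_choice_ptr = current_choice_point
    else:
        worker.state = State.INACTIVE


def backtrack_to(workers, choice_point):
    """Reactivate the environments that were trailed under choice_point."""
    for w in workers:
        if w.state is State.TRAILED and w.trailed_choice_ptr == choice_point:
            w.state = State.ACTIVE
            w.trailed_choice_ptr = None
```

A single state register per processor replaces the linked lists of active, saved, and trailed environments used in the MIMD design.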
With the implementation described above, processors which contain TRAILED 
environments can do no useful work until control backtracks to the appropriate 
choice point so that the TRAILED environments can become ACTIVE again. The 
result is that for many programs (those that require a lot of environment trailing), 
the vast majority of processors remain idle and processor utilization suffers. To 
partially overcome this problem, we have implemented an optimization in which 
the memory of each processor is divided into multiple sections; when an environ- 
ment gets trailed, the environment pointer is advanced to the next section (space 
permitting) so that the processor can continue to do useful work. On backtrack- 
ing, the pointer gets reset back to the earlier value (extra slots in worker choice 
stack frames hold the environment pointer and the state). This optimization has 
successfully increased processor utilization. However, it requires a less efficient, in- 
direct indexing method for binding vectors: different processors reference different 
addresses in memory during variable dereferencing; on the MasPar machines, such 
indirect addressing by itself leads to a 15-25% overall slowdown in execution com- 
pared to direct addressing (i.e., when the optimization is turned off). But overall, 
we have observed 10-35% speedups due to this optimization, thanks to the better 
processor utilization. 
In many ways, the code for managing environments in SIMD MultiLog is much 
simpler than in MIMD or uniprocessor MultiLog because there is no need to man- 
age linked lists of environments, no need to dynamically allocate memory for en- 
vironments, no need to perform load balancing, and no need to iterate through 
environments. SIMD MultiLog's design is closer to the abstract definitions of multi- 
SLD resolution. 
4.7. Multi-WAM Instruction Set
The Multi-WAM instruction set is a superset of the WAM's,^14 with just a
handful of new instructions: for allocating multivariables and for saving and
reactivating the answers to disj goals. The instructions PUT_MULTI_VARIABLE,
UNIFY_MULTI_VARIABLE, etc., are like their standard WAM counterparts,
except that newly allocated variables are given the "MULTI" tag in the Prolog engine
and have space allocated for them in the binding vectors. The Multi-WAM specifies
the following instructions for handling disj goals: SPLIT, for initializing a
disjunctive goal; SAVE_AS_ANSWER, for saving answers to disjunctive goals when the
goal succeeds; and ACTIVATE_SAVED_ANSWERS, for installing saved answers
as active environments. The goal disj delete(L,Item,Remainder) compiles as
ALLOCATE 
SPLIT L87,3 
CALL delete,0
SAVE_AS_ANSWER 
L87 : ACTIVATE_SAVED_ANSWERS 
DEALLOCATE 
SPLIT's operands are a label (L87) for backtracking in case the goal (delete) fails,
and an integer (3) indicating the number of A registers to save. SPLIT pushes a
choice point and initializes in each worker a data structure needed to hold saved
answers. If control reaches SAVE_AS_ANSWER, the active environments are copied
into this data structure and control backtracks, searching for more solutions. But H
(the heap pointer) is not reset: bindings in workers can refer to structures created
during the goal. (But in SIMD MultiLog, H can be reset, thanks to the finite domain
restriction.) When the call to the goal (delete/3) fails, control returns to the label
(L87), where ACTIVATE_SAVED_ANSWERS pops the choice point pushed by
SPLIT and activates the saved answers. The state of the Prolog engine is returned
to the state that existed before the disjunctive goal, except that H and V retain the
maximum values attained. In particular, all bindings in the engine binding vector
are undone back to the state at the start of the disj goal. The environments active
in the workers represent all solutions to the disjunctive goal (delete/3).
As an optimization, if the choice point pushed by SPLIT (in disj_B) is at the 
top of the stack when control reaches SAVE_AS_ANSWER (meaning that this is 
the last possible answer to delete), the active environments are left active and
control jumps into the ACTIVATE_SAVED_ANSWERS instruction. This avoids 
redundantly copying and then reactivating these environments. 
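The effect of the SPLIT, SAVE_AS_ANSWER, and ACTIVATE_SAVED_ANSWERS sequence can be modeled at a high level in Python. The sketch below is an assumption-laden simulation rather than an emulator: environments are dictionaries, a goal is a hypothetical generator that yields the surviving environments once per success under backtracking, and the last-answer optimization is not modeled.

```python
def disj(goal, active):
    """Collect all answers of goal and install them as the active set."""
    saved = []                        # SPLIT: store for saved answers
    for survivors in goal(active):    # one iteration per success of the goal
        # SAVE_AS_ANSWER: copy the active environments, then backtrack.
        saved.extend(dict(env) for env in survivors)
    return saved                      # ACTIVATE_SAVED_ANSWERS


def delete_goal(items):
    """Goal for delete(L, Item, Remainder) with L fixed to items."""
    def goal(active):
        for i, item in enumerate(items):
            rest = items[:i] + items[i + 1:]
            yield [dict(env, Item=item, Remainder=rest) for env in active]
    return goal
```

With n environments active on entry and m successes of the goal, the result holds up to m * n environments, which is the growth that motivates the discussion that follows.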
The decision whether to use multiple environments or backtracking for a goal 
could be made by SPLIT when the disjunctive goal is first entered. But this may 
result in too many environments: if there are n active environments at the start 
of the goal and the goal succeeds m times, there may be up to m * n active envi- 
ronments after the goal. A better option is the following, called dynamic reversion 
14. Indeed, the code generated by our MultiLog compiler runs under both our MultiLog emulator
and our WAM emulator.
to backtracking. If the amount of memory in use exceeds a predetermined rever- 
sion threshold (or if there are too few available INACTIVE processors in SIMD 
MultiLog), then SAVE_AS_ANSWER interrupts itself, activates the current saved 
answers, and invokes the success continuation, leaving SPLIT's choice point and 
the disj goal's choice points on the stack. If control backtracks to the disj goal,
then more solutions are collected and saved, and so on. In this way, backtracking 
and multiple environments can coexist for a single goal. 
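Dynamic reversion to backtracking can be illustrated with a Python generator. The sketch rests on several assumptions: the reversion threshold is expressed as a count of saved environments rather than memory in use, the helper `member_goal` is hypothetical, and each yielded batch stands for one activation of saved answers followed by a run of the success continuation.

```python
def member_goal(var, values):
    """Goal that succeeds once per value, binding var in every environment."""
    def goal(active):
        for v in values:
            yield [dict(env, **{var: v}) for env in active]
    return goal


def disj_with_reversion(goal, active, threshold):
    """Yield batches of environments; each batch is activated on its own."""
    saved = []
    for survivors in goal(active):
        saved.extend(survivors)
        if len(saved) >= threshold:   # threshold reached: stop collecting
            yield saved               # continuation runs on this batch
            saved = []                # backtracking later resumes collecting
    if saved:
        yield saved                   # final, possibly smaller, batch
```

Consuming the next batch corresponds to control backtracking into the disj goal, so within one goal the solutions are explored partly by multiple environments and partly by backtracking.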
The use of dynamic reversion to backtracking generally results in increased
speedups of 10 to 15% for SIMD MultiLog. However, the benchmark
results reported later in this paper involve executions without dynamic reversion to
backtracking. 
Many (Multi-)WAM instructions--e.g., the SET instructions, the PUT
instructions, CALL, PROCEED, ALLOCATE, DEALLOCATE, GET_LIST when it
enters read mode--execute entirely on the Prolog engine, as do unification and
arithmetic instructions not involving multivariables. But most of the unification and
arithmetic instructions are modified to take account of multiple binding
environments and to invoke multiunification in the workers in case a multivariable gets
bound to something other than an engine variable. Optionally, during clause
indexing, common bindings can be "lifted" into the Prolog engine; this means that
if the indexed multivariable is bound to the same term in each environment, then
the corresponding multivariable in the Prolog engine binding array gets bound to
the value.
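The lifting test amounts to checking whether one variable has the same binding in every environment. A minimal Python sketch, with a hypothetical function name and with environments modeled as dictionaries in which a missing key means unbound:

```python
def lift_common_binding(envs, var):
    """Return (True, value) if every environment binds var to the same value.

    Returns (False, None) when there are no environments, when var is
    unbound somewhere, or when the bindings differ.
    """
    if not envs:
        return (False, None)
    first = envs[0].get(var)
    if first is None:
        return (False, None)
    if all(env.get(var) == first for env in envs):
        return (True, first)
    return (False, None)
```

When the test succeeds, the value could be stored once in the engine binding array instead of being consulted in each worker.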
4.8. Compiling the Engine/Multivariable Distinction 
Compilation of MultiLog is very similar to compilation of Prolog. The only sig- 
nificant differences are the need to place SPLIT, SAVE_AS_ANSWER, and ACTI- 
VATE_SAVED_ANSWERS instructions, and the need to allocate multivariables. 
To maximize efficiency, it is better to decide at compile time whether variables
should be allocated as multi- or engine variables. One could, conservatively, let all
variables be multivariables (except in SIMD MultiLog, due to the finite domain
restriction), but this would be inefficient since potentially shared bindings and work
would be repeated. If, on the other hand, all variables are allocated as engine
variables, then the run-time system must dynamically convert to multivariables all
engine variables bound during a disj goal.
All variables created inside a disj goal should be multi, as they will likely be
bound differently in different environments; the compiler can see to it that these
variables are multi if it generates code for disj procedures using
PUT_MULTI_VARIABLE_X, etc. In addition, an engine variable created outside a disj goal
but passed as an argument to a disj goal can also get bound in that disj goal; in
such a case, the variable will need to be converted to a multivariable. The compiler
might use abstract interpretation and similar analysis techniques to allocate as
many variables as possible as engine variables. As it is undecidable in general
which variables will need to be multi, the compiler will either have to make a
conservative guess, or rely on run-time conversion to multivariables. Alternatively,
the user might declare which variables should be multivariables (the system could
reserve variable names starting with a certain prefix for multivariables).
Currently, programs are compiled by machine using a MultiLog-to-WAM 
compiler written by Tim Hickey. Then the code is edited by hand, and certain 
PUT_VARIABLE instructions are replaced with PUT_MULTI_VARIABLE, etc. 
One program (the Waltz line labeling benchmark) uses run-time conversion of
engine variables to multivariables. Some insight (akin to the insight needed for
:- parallel annotations in control or-parallel systems) is needed to properly place
the disj operators in source programs.
4.9. Cut, Input/Output, and Other Impure Features 
Multi-SLD resolution can be extended to handle nonlogical primitives, such as ! 
(cut), write, read, assert, var, arg, and bagof. We consider the operational
semantics and implementation of cuts and input/output in this section. We refer 
the reader to [47] for information about other primitives. 
4.9.1. Cut. For control or-parallel logic programming systems, the operational 
semantics of the cut operator ( ! ) is problematical. The basic reason is that the cut 
is a control operator, and in these systems, there are multiple loci of control. When 
a cut is executed, there are several reasonable alternatives for what should be done. 
In one alternative--called cavalier cut--the first clause to encounter a cut should be
committed to, and execution of all other clauses associated with the same or more
recent choice points should be aborted. But depending on which clause executes
faster, different runs of the same program can result in different choices being
removed and different answers being returned. In another alternative--called strict
cut--a cut should be effective only when it is leftmost in the search tree; otherwise,
execution of the cut should be delayed. Under this semantics, the program will 
return the same answers as a sequential Prolog interpreter, at the cost of possible 
loss of parallelism. 
Since there is only one control in MultiLog, the operational semantics of cut is 
unproblematical in this respect. Whenever a cut is selected by a MultiLog inter- 
preter, all choice points created since the call that invoked the current clause should 
be removed from the choice stack. But there is still a problem with the semantics of 
cut--a problem analogous to the problem for control or-parallel systems. Consider
the following program and query:
p(a). p(b).
| ?- disj p(X), !.
Depending on how many solutions to p/1 are obtained for the disj goal, different
answers will be returned. If only the first solution to p(X) is obtained, then, when
the cut is reached, the choice point for p/1 will be removed so that the only answer
returned will be X=a. But if both solutions to p/1 are obtained, then the cut will
have no effect and the answer returned will be X=a or X=b.
To get the same behavior as Prolog, the MultiLog system could enforce the 
following rather extreme rule ("strict cut"): 
When a disj goal is in the scope of a cut, then use backtracking
(rather than multiple environments) to solve the argument goal.
The analog to the cavalier cut--and the method used in our current
implementations--is simply to commit to whatever solutions to the disj goal have been
obtained at the time the cut is reached. Alternatively, the programmer might be
allowed to parameterize disj and ! to specify the precise handling of multiple
environments; this, however, may be burdensome.
There is a subtlety in the implementation of cuts in MultiLog, which we again 
illustrate with an example: 
p(a). p(b). q(a) :- !, fail. q(b).
| ?- p(X),q(X).
yes X=b More? ; 
no 
In Prolog, the query succeeds, despite the cut, since the scope of the cut extends 
only so far as the call to q/1. But if we try the query
| ?- disj p(X),q(X).
then there is trouble. Assuming that both solutions to p(X) are collected, when the
first clause for q/1 is entered, there are two active environments, with X bound to
a and b, respectively. Only the first environment survives the head unification; the 
second environment is trailed in the environment trail associated with the choice 
point. When the cut ( ! ) is reached, execution should commit to the first clause. But 
it is not correct to simply delete the choice point. For then, what should happen to 
the trailed environment? Suppose the trailed environment is deactivated and added 
to the free list of environments. Then when the fail is reached, execution will 
backtrack to the top level and report "no": the solution with X bound to b is lost. 
The essential point is that the cut should apply only to the active environments 
(with X=a) and not to the trailed environments (with X=b). When there are environ- 
ments trailed in a cut choice point, the choice point and trailed environments should 
not be deleted; rather, the choice point should be marked "CUT" so that any en- 
vironments that subsequently fail some unification will not be trailed in the choice 
point. If execution subsequently backtracks to the choice point, then the trailed en- 
vironments will be reactivated. This will have the effect of "cutting off" the choices 
only for those environments that were active at the time of the cut, as desired.
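The difference between deleting the choice point and marking it "CUT" can be traced with a small Python simulation of the query disj p(X),q(X). The code is a sketch built only for this example: the `ChoicePoint` class, `unify_const`, and `run_q` are hypothetical names, and the clauses q(a) :- !, fail. and q(b). are hard-coded.

```python
class ChoicePoint:
    def __init__(self):
        self.trailed = []   # environments that failed under this choice point
        self.cut = False    # True once a cut has marked the point "CUT"


def unify_const(active, var, const, cp):
    """Keep environments binding var to const; trail the others under cp."""
    survivors = [e for e in active if e[var] == const]
    if cp is not None and not cp.cut:
        cp.trailed.extend(e for e in active if e[var] != const)
    return survivors


def run_q(active, naive_cut=False):
    """Answers to q(X) for the clauses  q(a) :- !, fail.  and  q(b)."""
    cp = ChoicePoint()                          # choice point for q/1
    active = unify_const(active, "X", "a", cp)  # head of the first clause
    if naive_cut or not cp.trailed:
        cp = None                               # cut deletes the choice point
    else:
        cp.cut = True                           # cut only marks it "CUT"
    active = []                                 # fail: nothing is trailed now
    if cp is None:
        return active                           # no choice point left: "no"
    active = cp.trailed                         # backtrack: reactivate trailed
    return unify_const(active, "X", "b", None)  # head of the second clause
```

With both environments active on entry, the marked choice point preserves the answer X=b, whereas naive deletion loses it and the query reports "no".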
In conclusion, cuts can be used in MultiLog, but the interaction with multiple 
environments leads to semantic and implementation complexities analogous to the 
complexities with cut in control or-parallel systems. 
4.9.2. Input and Output. Consider the goal disj p(X),write(X). Although X
may be bound after execution of disj p(X), the bindings will be stored in
multivariables rather than the engine variable for X,^15 and so write(X) will just print
out a variable such as _34.
How then can one print out the answers to a query? That is, how can one get 
the data out of the multiple environments back into the Prolog engine so that they 
can be printed out or otherwise manipulated? 
We propose an instruction sequentialize/0 that creates a choice point and
loads the environments one by one into the Prolog engine under control of a
failure-driven loop. In this way, disj p(X),sequentialize,write(X) writes out the
answers returned by p(X) one by one. So disj G, sequentialize is roughly
equivalent to G.
Note that sequentialize/0 is effective for the whole success continuation in the
sense that once sequentialize/0 is selected in a multiderivation step, the entire
15. This is assuming the standard implementation of disj. But if the "lifting" optimization
mentioned in Section 4.7 is performed, then X may be bound.
resolvent begins execution in the context of a single environment. It may be
desirable, in contrast, to have an operator that allows sequentialization of environments
only for the duration of a single goal. We suggest the operator sequentialize/1
which, like sequentialize/0, loads the active environments one by one into the
Prolog engine under control of a failure-driven loop; but unlike sequentialize/0,
sequentialize/1 is effective only for the duration of its argument goal. All
environments active at the start of sequentialize/1 remain active after execution of
sequentialize/1. We can define sequentialize/1 as follows.
sequentialize(G) :- sequentialize, call(G), fail.
sequentialize(_).
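The behavior of the two operators can be mimicked in Python, with a generator standing in for the failure-driven loop. This is an analogy under stated assumptions, not MultiLog code: environments are dictionaries, and `sequentialize_goal` is a hypothetical name for the one-argument form.

```python
def sequentialize(active):
    """Yield each active environment on its own, one per loop iteration."""
    for env in active:
        yield [env]


def sequentialize_goal(active, goal):
    """Run goal once per environment, then leave all environments active."""
    for single in sequentialize(active):
        goal(single[0])   # call(G) in a single-environment context, then fail
    return active         # second clause: the active set is unchanged
```

For example, passing a goal that prints the binding of X writes the answers one by one, as disj p(X),sequentialize,write(X) is intended to do.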
As for the input predicate read/1, it should simply multiunify its argument with
the data read in from the input stream. For formatted output using format, we
suggest the specification of formatting characters for accessing and printing out (in
rows and columns) bindings in worker environments.
4.10. Representing Disjunction: Alternatives to DNF
It is instructive to consider different representations for disjunctions. Standard 
Prolog maintains only a single substitution at a time, and implicitly represents 
solutions in disjunctive normal form: by a temporal sequence (i.e., disjunction) of 
substitutions. For the program 
r(1). 
p(a). p(b). 
| ?- r(E),p(X),p(Y),p(Z).
Prolog returns the eight solutions 
E=1,X=a,Y=a,Z=a ;
E=1,X=a,Y=a,Z=b ;
...
E=1,X=b,Y=b,Z=b ;
no 
In general, backtracking returns in succession approximately c^n substitutions, where
c is the average number of (consistent) alternatives per variable and n is the number
of variables; each substitution is of size O(n). Without the engine/multi distinction,
multi-SLD would represent its multiple environments in DNF. We can write this as
E=1,X=a,Y=a,Z=a or E=1,X=a,Y=a,Z=b or ... or E=1,X=b,Y=b,Z=b
where the c^n substitutions are stored concurrently. For both methods, time*space
is O(n * c^n). With the engine/multi distinction, multi-SLD's representation would be
E=1,(X=a,Y=a,Z=a or X=a,Y=a,Z=b or ... or X=b,Y=b,Z=b)
where the common binding E = 1 is represented but once; in general, space is still
exponential: O(m * c^n), with m <= n.
With an alternative scheme, called environment trees [44, 47], the representa- 
tion is 
E=1, (X=a,(Y=a,(Z=a or Z=b) or Y=b,(Z=a or Z=b)) or
X=b,(Y=a,(Z=a or Z=b) or Y=b,(Z=a or Z=b)))
This formula is best visualized (and is stored internally) as a tree (Figure 8). Each 
leaf represents the substitution consisting of the bindings along the path to the 
[FIGURE 8. Environment tree.]
root. During unification, the tree is traversed top-down. Failures (or trivial successes
like X = a on the left subtree) at a node apply to all subtrees below; new bindings
are stored at leaves. As the tree is built, disjuncts are copied at the leaves (so Y = a
∨ Y = b appears twice in the tree and Z = a ∨ Z = b appears four times). The space
required is still exponential (\sum_{i=0}^{n} c^i < c^{n+1}), but we have saved a logarithmic
factor. On the other hand, unification is now a more complex operation. We have
implemented this alternative method, which resembles the Argonne model [56], in Scheme
prototypes of MultiLog. Analysis [44] indicates that its execution time should,
theoretically, be competitive with MultiLog's copying method on both uniprocessor
and multiprocessor computers; constants, however, are likely to be higher.
Yet another representation is called hierarchical environment trees. The idea is
to simplify the DNF formula by "lifting" common disjunctions of (conjunctions of)
bindings as far toward the root of the tree as is possible. For our example above,
the representation is simply
E = 1 ∧ (X = a ∨ X = b) ∧ (Y = a ∨ Y = b) ∧ (Z = a ∨ Z = b).
(In general, the representation would not be so compressed.) Whereas the
engine/multi distinction allows the lifting of a conjunction of bindings to the root of
the tree, hierarchical environment trees allow the lifting of arbitrary disjunctions
(e.g., Z = a ∨ Z = b) of conjunctions of bindings towards the root--as long as the
disjunction holds for every leaf of the subtree and each disjunct appears in some leaf.
In general, each node of a hierarchical environment tree is a conjunction E ∧ D ∧ S,
where E is a common conjunction of bindings, D is a common disjunction of
conjunctions of bindings, and S is a disjunctive formula representing the subtrees below
that node. Space is at best O(c * n) (when all disjunctions are independent^16); at
worst, space usage is the same as for environment trees. An open question, and a
topic of current research, is whether unification with such hierarchical environment
trees can be made efficient for certain applications. Already, constraint languages
represent often complex formulas involving disjunction and negation. Here, we are
suggesting that even the equality constraints resulting from unifications should be
maintained in something other than DNF. Need logic programming stick to DNF?
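The space bounds quoted in this section can be compared numerically. The Python sketch below simply evaluates the three expressions from the text (n * c^n for DNF, the sum of c^i for i from 0 to n for environment trees, and c * n for the best case of hierarchical environment trees); the function names are hypothetical, and n counts only the variables with alternatives (X, Y, and Z in the running example).

```python
def dnf_size(c, n):
    """c^n substitutions, each holding n bindings."""
    return n * c ** n


def tree_size(c, n):
    """Nodes of a complete environment tree: sum of c^i for i = 0..n."""
    return sum(c ** i for i in range(n + 1))


def hierarchical_best_size(c, n):
    """Best case: n independent disjunctions of c bindings each."""
    return c * n
```

For the example with c = 2 and n = 3, these give 24 bindings for DNF, 15 nodes for the environment tree, and 6 bindings for the fully lifted hierarchical form.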
16. The need to efficiently represent disjunctions in natural language processing systems (where
lexical and structural ambiguity lead to numerous alternative readings) has motivated the
technique of distributed disjunction [17]. The idea is to avoid distributing (to DNF) such "independent"
disjunctions.
5. BENCHMARKS 
This section presents benchmark results and analyses for MultiLog on a uniproces- 
sor, on an MIMD machine (KSR1), and on an SIMD machine (MasPar MP1). We 
summarize and discuss the significance of these results in Sections 5.4 and 6. 
5.1. Uniprocessor MultiLog 
Figure 9 shows benchmark results for MultiLog on a uniprocessor. The column la- 
beled "SICS" shows run times in seconds for SICStus Prolog version 2.1 #6 using 
compiled (C-emulated) code; the column labeled "WAM" shows the time in sec- 
onds using our sequential WAM. Both times are for finding (but not collecting) 
all solutions via backtracking. SICS is, on average, about twice the speed of our 
WAM emulator. The column labeled "Multi-WAM" shows the time on uniprocessor 
MultiLog using multiple environments. The WAM and the Multi-WAM are quite 
similar; in fact, the Multi-WAM can execute WAM code, as it was constructed 
by extending the WAM emulator with parallel unification and with provision for 
collection of multiple environments. The column labeled "Speedup" shows time 
for the "WAM" divided by time for the "Multi-WAM." All examples were run on 
an HP P/A 750 workstation running HP-UX A.08.07 at 66 MHz with 32 Mbytes
of memory. Most of the code was compiled by a program; but we compiled the
engine/multi distinction by hand by deciding which variables to allocate as
multivariables. Many of the programs are standard benchmarks from [50]. Waltz is a line
labeling problem. Costas is a combinatorial search problem [14, 19] suggested by
André Vellino [54].
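As a check on how the "Speedup" column is derived, the ratio of WAM time to Multi-WAM time can be recomputed for a few rows of Figure 9. The timings below are taken from that figure; the dictionary layout and the function name are illustrative only.

```python
# (WAM seconds, Multi-WAM seconds) for three rows of Figure 9
rows = {
    "sat": (22.9, 31.8),
    "bits": (11.8, 4.3),
    "20 bits-pal-n": (939.7, 10.6),
}


def speedup(wam, multi_wam):
    """Speedup as defined in the text: WAM time over Multi-WAM time."""
    return wam / multi_wam


speedups = {name: round(speedup(w, m), 1) for name, (w, m) in rows.items()}
```

A value below 1, as for sat, means that the multiple-environment execution was slower than backtracking.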
Two programs that stand out in Figure 9 are bits-pal-n--because it got such
good speedups on a uniprocessor--and sat--because its speedup was actually a
slowdown.
bits-pal-n finds lists of binary numbers which are palindromes, utilizing the
predicate bits from Section 3.2:
bits_pal_n(N,L):- length(L,N),bits(L),palindrome(L). 
palindrome(L):- nrev(L,L). 
nrev([],[]). nrev([H|T],R) :- nrev(T,TR),append(TR,[H],R).
Program            SICS     WAM   Multi-WAM   Speedup
sat                10.6    22.9      31.8        0.7
11 bratko          10.6    15.4      15.7        0.9
knight              3.3     3.9       3.6        1.1
11 queens          26.5    24.1      17.1        1.4
tri                 7.9    22.3      15.9        1.4
path               11.5    14.6       9.9        1.5
waltz               1.4     3.1       1.9        1.6
WIM                43.6    48.5      26.5        1.8
bits                6.9    11.8       4.3        2.7
cube                5.0     9.9       3.2        3.1
8 costas-fused      3.9     7.4       1.1        6.7
8 costas            9.8    16.0       2.3        7.0
20 bits-pal        64.9   112.4       9.1       12.4
20 bits-pal-n     576.1   939.7      10.6       88.7
FIGURE 9. Performance of uniprocessor MultiLog versus Prolog (times in seconds).
One can describe bits-pal-n as a naive "generate and large test" program. In
MultiLog, the elements of the list L are allocated as multivariables, and the unifications
involved in the call to palindrome are performed using data or-parallelism due to
the presence of the disj operator in bits.
Because of MultiLog's data or-parallelism, work performed in the Prolog engine is
shared by multiple solutions to the generator. Now, for bits-pal-n, there is a large
amount of Prolog engine work due to the append operations in the naive reverse
routine. This work consists almost entirely of list construction and unification in
the Prolog engine. For each multivariable in L, nrev(L,L) causes a multiunification
that accesses the workers only once. But O(n^2) calls to append/3 are needed to
reverse a list of length n. Each call causes a unification in the Prolog engine
and/or the creation of a cons cell on the heap. Since in Prolog this engine work
is performed once for each solution to the disj goal, MultiLog got a large data
or-parallel speedup. (See also the discussion of bits-pal-n in Section 5.3.)
sat is a satisfiability checker for propositional formulas. Part of the reason for
sat's poor speedup is that, unlike bits-pal-n, it requires frequent trailing of
environments.^17 During satisfiability checking, each logical connective invokes a
search through a truth table; various environments succeed for various clauses in the
table so that much environment trailing occurs. Also, the frequent calls to the truth
table resulted in frequent multiunification in the workers, resulting in a low
proportion of engine work. As might be expected, environment copying times affected
performance, and another reason for sat's poor performance was its frequent copying
of large environments during late-occurring disj goals. In contrast, for bits-pal-n,
the copying occurred early in execution when the environments were small.
It is important to eliminate unnecessary nondeterminism in MultiLog because
of the need to push and pop choice points and because of the need to trail and 
untrail both bindings and environments. Techniques for minimizing memory use 
in workers are especially important due to the high cost of copying. For several 
benchmarks, performance improved drastically when we added cuts and rewrote 
code to eliminate unnecessary nondeterminism. (These comments apply to MIMD 
and SIMD MultiLog as well.) 
5.2. MIMD MultiLog 
Figure 10 shows speedups relative to the standard sequential WAM on the KSR1.
(The numbers for p = 1 show speedup on one processor using the parallel imple-
mentation of MultiLog relative to a sequential WAM implementation on the same
computer. In contrast, the numbers in Figure 9 were for a sequential implementa-
tion of MultiLog versus Prolog. Figure 11 shows speedups relative to one processor
on the KSR1, using the Multi-WAM parallel implementation for all measurements.)
Program efficiencies ranged from about 25 to 50%. For bits-pal-n, most of
the speedup is already present at p = 1, reflecting the fact that uniprocessor
MultiLog beats Prolog for this problem.
Compared to the speedups for the mature systems Aurora and Muse, which were
nearly linear (with slope near 1) for a wider range of benchmarks, these results are
uneven. Yet the results are acceptable, and we believe that there is wide scope for
improvements and tuning in things like load balancing, memory management, use
of garbage collection, clever compilation, and optimizations (e.g., binding lists [57])
aimed at reducing copying overhead. Relatively little effort (a few months' work)
went into implementing and tuning the MIMD implementation.
17 The path-searching program of Section 1.1 is another example of a program which
requires a lot of trailing of environments since each clause of edge/2 is likely to
succeed for only a few of the active environments.
 p   20 bits-pal-n    [?]    [?]    [?]    [?]    [?]    [?]
 1        58.2        0.5    2.8    1.5    1.2    0.7    1.4
 2       100.4        0.8    3.7    1.8    1.5    1.2    1.9
 4       106.8        1.0    6.1    2.3    2.3    1.6    2.3
 6       187.4        1.6    7.3    2.0    3.5    1.9    2.8
 8       194.5        1.7    8.2    3.0    3.8    2.4    3.1
10       329.8        2.2   10.0    3.5    4.9    2.5    3.0
12       339.5        2.7   13.4    3.2    6.5    2.8    3.3
FIGURE 10. Speedups relative to WAM on KSR1.
5.3. SIMD MultiLog 
Figure 12 shows performance relative to the sequential WAM for MultiLog on the 
MasPar MP1 with 8192 processors, using a DEC 5000 (MIPS) workstation as the 
front end (and for the WAM benchmarks) running Ultrix V4.2A. 
For all examples but one, SIMD MultiLog ran faster than Prolog using our WAM 
emulator. The reader should remember, however, that even uniprocessor MultiLog 
is faster than Prolog for most problems. Figure 13 compares the uniprocessor 
speedups to the SIMD speedups. The third column, the ratio of the second column 
to the first column, shows the speedup due to the use of parallel hardware: the 
amount of extra speedup obtained by use of the SIMD processors. Except for the 
last three problems, the extra speedup is small (under 3). And for the last three 
problems, the extra speedup was under 30. 
The spectacular speedups of bits-pal and bits-pal-n are explained as follows.
The call bits_pal(24,L) invokes a naive generate and test procedure (shown in
Section 5.1) that enumerates all 2^24 bit strings of length 24 and tests each string
for "palindromicity." Because disj is used before the call bit(H) and because there
are 8192 (2^13) processors on the MP1, the first 13 calls to bit utilize multiple
environments instead of backtracking. Thus, the first 13 elements of L get bound to
multivariables in the Prolog engine and to all 2^13 binary lists of length 13 in the
2^13 environments. The rest of the elements of L are bound by backtracking. As a
result, each call to palindrome/1 acts on 8192 different environments instead of on
one environment, and the test for "palindromicity" is carried out in data parallel.
If we increase the number of instructions executed by each call to palindrome/1,
then we can increase the speedup. When naive reverse is used instead of linear
reverse, then speedup increases from about 300 (bits-pal) to over 2000 (bits-pal-n).
 p   20 bits-pal-n    [?]    [?]    [?]    [?]    [?]    [?]
 1         1.0        1.0    1.0    1.0    1.0    1.0    1.0
 2         1.7        1.6    1.3    1.2    1.2    1.7    1.4
 4         1.8        2.0    2.2    1.5    1.9    2.3    1.6
 6         3.2        3.2    2.6    1.3    2.9    2.7    2.0
 8         3.3        3.4    2.9    2.0    3.2    3.4    2.2
10         5.7        4.4    3.6    2.3    4.1    3.6    2.1
12         5.8        5.4    4.8    2.1    5.4    4.0    2.4
FIGURE 11. Speedups relative to one parallel processor on KSR1.
Name            Solutions    WAM time   MultiLog time   Speedup
path               58567        56.0        91.5           0.6
sat                 1790        59.2        40.5           1.5
tri                  283        62.6        32.2           1.9
WIM                  444       142.5        72.6           2.0
12 queens          14200       425.1       213.7           2.0
knight              3280        12.0         5.8           2.1
waltz               8640         8.3         3.4           2.4
11 bratko           2680        47.3        18.7           2.5
cube                  64        27.3         8.9           3.1
11 queens           2680        74.6        21.1           3.5
8 costas             444        46.0         4.8           9.6
10 costas           2160      5450.1       226.8          24.0
20 bits             2^20        31.1         0.4          77.8
20 bits-pal         1024       383.7         1.3         295.2
24 bits-pal-n       4096     66668.1        32.2        2064.0
FIGURE 12. Performance of SIMD MultiLog on the MP1 relative to WAM.
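The division of labor between multiple environments and backtracking in this example can be sketched numerically. The Python fragment below is only an illustration under a simplifying assumption: each call to bit doubles the number of environments, and reversion to backtracking happens as soon as a further doubling would exceed the number of processors. The function name split_bits is hypothetical.

```python
def split_bits(list_length, processors):
    """Split a bit-string generator between environments and backtracking.

    Returns (elements bound via multiple environments,
             number of environments,
             number of backtracking passes over the remaining elements).
    Assumes each binary choice doubles the environment count until the
    processor limit is reached (a simplification of the reversion rule).
    """
    env_bits = 0
    while env_bits < list_length and 2 ** (env_bits + 1) <= processors:
        env_bits += 1
    environments = 2 ** env_bits
    backtracking_passes = 2 ** (list_length - env_bits)
    return env_bits, environments, backtracking_passes


if __name__ == "__main__":
    bits_in_envs, envs, passes = split_bits(24, 8192)
    # Every one of the 2^24 candidate strings is still examined exactly once.
    assert envs * passes == 2 ** 24
    print(bits_in_envs, envs, passes)
```

Under this assumption, the test is invoked once per backtracking pass rather than once per candidate string, with each invocation acting on all environments at once.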
Of the three implementations benchmarked, the SIMD results are the least satis-
factory, mostly because of high synchronization overheads and because of difficulties
with processor and memory utilization. For the path program, average processor
utilization (the average number of active environments) was a miserable 55 on 8192
processors, and over a quarter of the time was spent copying environments. (In
contrast, bits-pal-n's utilization was over 7600, and an insignificant amount of
time was spent copying.) For 11 queens, the metric was 543.7; for knight, it was
512.2; for bratko, it was 558.3.
Name             Uniprocessor MultiLog   SIMD MultiLog   SIMD/Uniprocessor
                 Speedup                 Speedup
path                     1.5                   0.6               0.4
tri                      1.4                   1.9               1.4
sat                      0.7                   1.5               2.1
WIM                      1.8                   2.0               1.1
knight                   1.1                   2.1               1.9
waltz                    1.6                   2.4               1.5
11 bratko                0.9                   2.5               2.8
cube                     3.1                   3.1               1.0
11 queens                1.4                   3.5               2.5
8 costas                 7.0                   9.6               1.4
20 bits                  2.7                  77.8              28.8
20 bits-pal             12.4                 295.2              23.8
20 bits-pal-n           88.7                2064.0              23.3
FIGURE 13. Speedups of SIMD MultiLog relative to uniprocessor MultiLog.
Name             SIMD MultiLog   Average Environment
                 Speedup         Lifetime
path                  0.6               13.8
sat                   1.5               14.0
knight                2.1               29.8
waltz                 2.4               12.8
11 bratko             2.5               37.9
cube                  3.1               69.2
11 queens             3.5               75.7
8 costas              9.6              264.7
20 bits              77.8             9561.9
20 bits-pal-n      2064.0             1197
FIGURE 14. SIMD MultiLog speedup versus environment lifetime.
The main explanations, we think, for the low utilization are the following.
1. The short average effective life span of most environments. That is, most en-
vironments are deactivated relatively soon after creation (or after reactivation
during untrailing). The average number of Multi-WAM instructions from the
time of environment activation to the time of removal or trailing is shown
in Figure 14. As can be seen, the average lifetime closely correlates with the
speedup.
2. The presence, for many programs, of large numbers of trailed binding envi-
ronments, which occupy processors and make them unavailable for doing real
work. For each of queens, bratko, and knight, on average, there were about
four or five times as many trailed environments as active environments at
each step of execution. In contrast, for bits-pal-n, there were, on average,
about ten times as many active environments as trailed environments.
3. The fact that these benchmarks used naive reversion to backtracking, in which
the decision whether to use backtracking or multiple solutions to a goal is
made when the goal is first entered (Section 4.7). As a result, the reversion
threshold (Section 4.7) must be set low; otherwise, the system would attempt
to allocate more than 8192 processors. This phenomenon leads to poor pro-
cessor utilization, except for some of the last examples, where it is near the
maximum possible (8192).^18
(After the above benchmark results were obtained, we implemented, on
a 4096 processor MasPar machine, dynamic reversion to backtracking and
the related technique (mentioned at the end of Section 4.6) of dividing the
memory of each processor into multiple sections so that processors containing
TRAILED environments can be used to hold ACTIVE environments saved
for subsequent disj goals. These changes did increase the average number
of active environments considerably. For example, 11 Queens' average went
from 543.7 (on an MP1 with 8192 processors) to 904.4 (on an MP2 with 4096
processors); Knight's average went from 512.2 to 518.3; Cube's average went
from 274.8 to 516.6. Although performance with these optimizations is signif-
icantly (on average, about 20%) better than performance without them, due
to better processor utilization, the smaller total number of processors resulted
in speedups that are little better than those presented here. Since the analy-
ses in Section 6 utilize the older results, we have decided to stick with them
rather than rework all the results for a new machine.)
18 The benchmarks for uniprocessor and MIMD MultiLog also utilized naive reversion to
backtracking. However, the availability of hardware paging meant that it was less
critical that the number of active environments be kept low. Also, the small number
of available processors meant that it was efficient to have relatively few active
environments and to revert to backtracking quite early.
4. The high overheads of SIMD processing due to synchronization costs for send- 
ing instructions to the processors and accumulating the results. According to 
our (possibly unreliable) measurements from profiling tools, SIMD MultiLog 
spent, on average, over 75% of its time in overhead routines. 
5. The slowness of the SIMD processors: about 1/50th the speed of the front 
end workstation, according to our benchmarks. 
We have voluminous data that help explain the results. Rather than boring the 
reader and extending an already long paper, we summarize our findings in the next 
section and in Section 6. 
5.4. Applicability and Limitations of Data Or-Parallelism
We believe that the benchmark results, along with the analyses of the next section, 
demonstrate that MultiLog's computational model should be a viable choice for 
many combinatorial search problems. We certainly think that it is important that
the logic programming community be made aware of this surprising and relatively 
unknown alternative to backtracking and control or-parallel search. 
Yet at this early stage, it is still unclear just how practical the model will be, as 
there are still weaknesses and unresolved questions. The important meta-questions
are: Which weaknesses are inherent in the model? Which are artifacts of poor
design choices? And which are artifacts of poor implementation?
Inherently, data or-parallel search is most appropriate for a restricted class of 
search programs: regular combinatorial search problems in which all solutions to 
the generators have the same shape (as defined in Section 3.2.1). Such programs 
can take best advantage of the engine/multi distinction and the consequent time and
space savings that are described in Section 6. Also, such programs lead to the least 
amount of environment trailing. In our experience, many massive combinatorial 
search problems are already in a form amenable to data or-parallel search, or can 
easily be reformulated so as to be amenable using templates (Section 3.2.1) or 
more substantial rewriting. Still, to get the benchmark results above, we had to 
do quite a bit of tweaking,^19 and it is unclear at this point to what extent these
transformations can be automated. 
Even if many programs cannot readily be converted to a form appropriate for 
data or-parallelism, the technique may still be useful as an important, special- 
case tool. We think a crucial question is how well data or-parallelism will work 
with constraint processing (e.g., forward checking) since many combinatorial search 
programs are best solved using constraints in combination with search. (See [51] 
for work related to this question.) 
MultiLog offers the hope of exploiting massive data parallelism (SIMD comput- 
ers), something that no control or-parallel system can be expected to do. Since 
many thousands of processors are already available on commercial SIMD comput- 
ers, the potential speedup (even if efficiency is low) can be expected to be higher 
than for MIMD computers. It is critical to the success of SIMD MultiLog that 
the average number of ACTIVE environments be kept high; for this to happen, it 
is necessary that many environments survive many unifications, and that not too 
much environment trailing occurs. Unfortunately, it seems that in many generate
and test programs, few environments survive the "test" parts of the program, en-
vironments' lifetimes are short, and multiple clauses result in much environment
trailing.
19 Perhaps most published speedups are the result of a comparable amount of tweaking.
As for the design and implementation of MultiLog, we feel there is considerable
room for improvement: in memory management, methods for (avoiding) copying
environments, use of garbage collection, processor utilization, representations for
environments, and compilation. (For just one example, several programs spent over
25% of execution time in the multivariable dereferencing routine. Clever compila-
tion might eliminate the need for general dereferencing.) Less than one man-year of
work went into the design and implementation of MultiLog. The conclusion collects
together some specific areas for possible improvement.
In summary, with the current technology for SIMD computers, it appears that
data or-parallelism on SIMD computers has not been shown to be cost effective. 
Speedups have been demonstrated, but the only cost-effective speedups (e.g., >25) 
occurred for contrived examples. We think the low speedups are partly due to 
the slowness of the SIMD processors relative to the front-end RISC workstation, 
partly due to synchronization overheads, and partly due to the above-mentioned 
problems with processor utilization. Only the last problem may be inherent in
the execution model of data or-parallelism. 
However, the good and rather surprising news is that MultiLog seems to be 
competitive with or faster than Prolog on a uniprocessor. Again, control or-parallel 
Prologs have no hope of achieving this. Finally, the speedups for MIMD MultiLog 
are respectable; with proper load balancing and tuning, we see no reason why 
significantly better speedups cannot be achieved because MIMD processing does 
not have the same performance and severe synchronization problems that the SIMD 
processing entails. 
In the next section, we analyze why multi-SLD resolution and data or-parallelism 
often beat standard SLD resolution, even on uniprocessors. 
6. PERFORMANCE MODELS FOR MULTILOG 
In this section, we give linear performance models that explain why Multi-SLD 
resolution is faster than SLD resolution for many combinatorial search problems, 
even on a uniprocessor. The models explain both the time and space usage of 
MultiLog as compared to Prolog, give insight into the nature of data or-parallelism,
and provide a basis for predicting the effects of various optimizations. We present
data from actual uniprocessor and SIMD executions to support the validity of the
models. Lastly, we relate the models' predictions to the supposed limits on parallel 
speedup suggested by Amdahl's Law. 
Surprisingly, the benchmark results reported in Section 5 suggest that MultiLog's 
execution model is competitive with or superior to Prolog's, even for a sequential 
implementation of MultiLog. But, in fact, that MultiLog beats Prolog is no surprise
once one understands the performance model we present in this section. 
The plan of this section is as follows. In Section 6.2, we present a model for the 
performance of uniprocessor MultiLog; the model expresses a law of diminishing 
returns on the expected speedup. In Section 6.4, we present a model for the perfor- 
mance of SIMD MultiLog; the model places no theoretical limits on speedup. In Sec- 
tion 6.5, we present a simpler analysis that depends on a convenient simplification. 
In Section 6.6, we present data comparing actual and predicted running times to 
support the validity of the models. Finally, in Section 6.7, we conclude by relating 
our results to Amdahl's Law. 
6.1. The Metric M 
At each step of the execution of a MultiLog program, there is some number of 
active environments. Metric M is the average over all steps of the number of active 
environments, and hence is a measure of the potential data or-parallelism of a 
program execution. Additionally, as we shall see, M is a measure of how much 
redundant engine work (control and unification of engine variables) is avoided by 
the use of multi-SLD resolution. Thus, M is intimately related to both sequential 
and parallel speedup. 
We have observed (by running instrumented versions of MultiLog and Prolog)
that if one multiplies the number of Multi-WAM instructions I executed by Multi-
Log for a given program by the metric M, then one obtains a value M * I which is
quite close to the number of WAM instructions $I^P$ executed by the corresponding
Prolog program.^20 Why should I * M approximately equal $I^P$?
Assume that there are I Multi-WAM instructions executed in the MultiLog pro- 
gram, and let M(i) be the number of binding environments active during instruction 
i. (Note that each Multi-WAM instruction comprises one or more true machine in- 
structions; in the case of unification, the Multi-WAM instruction may involve a 
large number of multiunification operations, but these are all counted as a single 
Multi-WAM instruction.) Then in the corresponding Prolog program, instruction i 
will be executed approximately M(i) times since each environment resulting from 
execution of a disj goal will be returned separately via backtracking in Prolog.
Hence, $I^P \approx \sum_{i=1}^{I} M(i)$. But since M is the average number of environments,
$M \approx \sum_{i=1}^{I} M(i) / I$. So, $M * I \approx I^P$.
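The averaging argument can be checked on a toy trace. The Python sketch below is illustrative only: the per-instruction environment counts in `trace` are invented, and the function names are hypothetical rather than part of any MultiLog tooling.

```python
import math


def metric(active_per_instruction):
    """Average number of active environments over all Multi-WAM instructions."""
    return sum(active_per_instruction) / len(active_per_instruction)


def estimated_prolog_instructions(active_per_instruction):
    """Instruction i is assumed to run once per active environment in Prolog."""
    return sum(active_per_instruction)


if __name__ == "__main__":
    # Hypothetical values of M(i) for a run of I = 8 Multi-WAM instructions.
    trace = [1, 1, 4, 4, 4, 2, 2, 1]
    instructions = len(trace)
    m = metric(trace)
    # M * I reproduces the estimated Prolog instruction count by construction.
    assert math.isclose(m * instructions, estimated_prolog_instructions(trace))
    print(instructions, m, estimated_prolog_instructions(trace))
```

The equality is exact here only because the toy trace ignores instructions that exist in one abstract machine but not the other.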
6.2. A Model for the Uniprocessor Performance 
In a uniprocessor implementation of MultiLog, if one multiplies the number of 
seconds spent in the Prolog engine by the metric M, one should, we claim, obtain
a value close to the number of seconds saved by MultiLog's data or-parallelism as
compared to Prolog. This is because the metric is a measure of the average number 
of active environments, but in Prolog, each active environment that resulted in 
the metric would be processed sequentially by backtracking. Now, in uniprocessor 
MultiLog, the work performed in the unification workers (except for copying) has to 
be performed in Prolog as well as in MultiLog; but the work performed in the Prolog 
Engine is shared for multiple environments, and results in speedup in proportion 
to the number of active environments. 
6.2.1. An Example Application of the Performance Model. Profiling of unipro-
cessor MultiLog indicates that about 30% of bits-pal-n's execution time was spent
in the Prolog engine code. Since the metric for MultiLog bits-pal-n was 483.6 and
the running time was 10.6 seconds, this means that as much as 0.3 * 10.6 * 483.6 =
1537.8 seconds should have been saved by using MultiLog. In fact, 929.1 seconds
were saved, out of 939.7. Similarly, for bits, the metric was 128.0, 2% of execution
time was spent in the Prolog engine, and the running time was 4.3 seconds; so as
much as 0.02 * 4.3 * 128.0 = 11.0 seconds should have been saved by using MultiLog.
In fact, 7.5 seconds were saved, out of 11.8. Our analysis is not too far off.
20 $I * M \neq I^P$ in part because the Multi-WAM code contains a few instructions not
executed by the WAM: SPLIT, SAVE_AS_ANSWER, etc.
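The estimate used in this example is a single product, which the following Python sketch makes explicit. The function names are hypothetical; the inputs are the engine fractions, running times, and metrics quoted above, and the observed savings are the differences between the Prolog and MultiLog running times.

```python
def predicted_seconds_saved(engine_fraction, multilog_seconds, metric):
    """Upper estimate of seconds saved: engine time multiplied by the metric."""
    return engine_fraction * multilog_seconds * metric


def observed_seconds_saved(prolog_seconds, multilog_seconds):
    """Seconds actually saved relative to the Prolog running time."""
    return prolog_seconds - multilog_seconds


if __name__ == "__main__":
    # (name, engine fraction, MultiLog seconds, metric M, Prolog seconds)
    cases = [
        ("bits-pal-n", 0.30, 10.6, 483.6, 939.7),
        ("bits", 0.02, 4.3, 128.0, 11.8),
    ]
    for name, fraction, t_multilog, m, t_prolog in cases:
        predicted = predicted_seconds_saved(fraction, t_multilog, m)
        observed = observed_seconds_saved(t_prolog, t_multilog)
        print(f"{name}: predicted up to {predicted:.1f} s, observed {observed:.1f} s")
```

In both cases the prediction is an upper estimate ("as much as"), so an observed saving below it is consistent with the model.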
For these two examples, less than 1% of execution time is spent copying environ- 
ments. But for programs that do spend a substantial amount of time copying, the 
predicted speedup must be lowered in proportion to the time spent copying since 
MultiLog, but not Prolog, does copying.
6.2.2. Formalization of the Model. Let us formalize the reasoning exemplified 
above. Our analysis is not meant to be useful for predicting speedups of MultiLog
as compared to Prolog. That is, it cannot be used to estimate the running time of a 
program in MultiLog if you know the running time in Prolog. Rather, the purpose 
is to explain why MultiLog gets the speedups it does, and to provide a basis for 
predicting the effects of various optimizations. In fact, our analysis can be used to 
predict the approximate running time of Prolog, given the running time and various
statistics for MultiLog, as we shall see. 
The time T for uniprocessor MultiLog execution consists of the sum of two terms,
$T_E$ and $T_W$, defined as follows. The time $T_E$ is the time spent doing control instruc-
tions and engine-variable unifications. In a parallel implementation of MultiLog,
$T_E$ would correspond to work done by the Prolog engine. In contrast, $T_W$ is the
time spent doing multiunifications, environment copying, and a small amount of
control work: workers push and pop choice points containing multi-X registers (see
Figure 5). In a parallel implementation, $T_W$ would correspond to work done by the
unification workers (the slave processors).
Let $T_W^C$ be the time spent copying environments in workers, and let $T_W^R$ be the
time spent doing "real" work in the workers: unification and control work that Prolog
would have to do as well. Suppose the metric (the average number of environments
active) is M. Then, for each unit of work done in MultiLog's Prolog engine, the same
work is done approximately M times on average in Prolog. And for each unit of
work done in the workers, except for copying, the same work is done approximately
once in Prolog. This is because we are now considering uniprocessor MultiLog
in which multiunifications are performed sequentially. So the time $T^P$ for Prolog
execution of the same program is expected to be
$$T^P = T_E * M + T_W^R.$$
The expected speedup S (on a uniprocessor) should be
$$S = \frac{T^P}{T} = \frac{T_E * M + T_W^R}{T_E + T_W^R + T_W^C}.$$
If we let $T_E'$ be $T_E/T$ and $T_W^{R'}$ be $T_W^R/T$, then we can write S as
$S = T_E' * M + T_W^{R'}$,
which has the same form as the formula for $T^P$. As expected, the formula for S
implies that if we can increase the ratio of $T_E$ to T (say, by allocating more of the
variables as engine variables), then speedup will be higher since $T_E'$ is multiplied
by M. Similarly, if we lower copying costs $T_W^C$, then $T_W^{R'}$ will be higher and T
smaller. Finally, if we increase the metric M, then, too, speedup will increase,
all other things being equal, since there will be more sharing of engine work $T_E$.
What the formula does not take into account is the fact that if M is too high,
then too many environments will be allocated, and excessive paging will result in
worse performance. Moreover, as shown in the next section, $T_E$, $T_W^R$, and M are
not independent: in particular, if M increases, then $T_E$ (and $T_E'$) will decrease. As
a result, for uniprocessor MultiLog, the benefits resulting from increasing M are
limited by the proportion of work performed in the unification workers.
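The uniprocessor model can be written down directly. The Python sketch below is a minimal illustration with hypothetical function names and invented input values; it treats the engine time, worker time, and metric as independent inputs, which, as just noted, they are not in a real execution.

```python
def prolog_time(t_engine, t_worker_real, metric):
    """Predicted Prolog time: engine work repeated M times, worker work once."""
    return t_engine * metric + t_worker_real


def uniprocessor_speedup(t_engine, t_worker_real, t_worker_copy, metric):
    """Predicted speedup: predicted Prolog time over total MultiLog time."""
    multilog_time = t_engine + t_worker_real + t_worker_copy
    return prolog_time(t_engine, t_worker_real, metric) / multilog_time


if __name__ == "__main__":
    # Invented timings (arbitrary units): engine 3, real worker work 6, copying 1.
    base = uniprocessor_speedup(3.0, 6.0, 1.0, metric=10)
    no_copying = uniprocessor_speedup(3.0, 6.0, 0.0, metric=10)
    higher_metric = uniprocessor_speedup(3.0, 6.0, 1.0, metric=20)
    # Removing copying or raising the metric both raise the predicted speedup.
    assert no_copying > base and higher_metric > base
    print(base, no_copying, higher_metric)
```

Holding the times fixed while varying the metric is a simplification made only to show the direction of each effect.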
6.2.3. The Law of Diminishing Returns. The execution of a given MultiLog
query results in certain values for $T_E$, M, $T_W^R$, and so on. If one increases the
metric M by adding a disj operator before some goal, then $T_E$ will likely de-
crease and $T_W^R$ will increase because fewer engine instructions will be executed
(since backtracking will be replaced by data or-parallelism) and more multiuni-
fication and control operations will occur in workers. For example, consider the
query
| ?- generate(X), test(X).
and suppose that generate/1 consists of 1000 facts and test/1 performs 500 in-
structions per call. If there are no disj operators in the clauses for generate/1 and
test/1, then execution of the query will result in a metric M = 1, with $T_E$ = T
and $T_W^R$ = 0. The total number of instructions executed will be about 500,000.
Assuming one unit of time per instruction, $T_E$ = T = 500,000.
If one adds a disj operator before the call to generate/1, then $T_E$ will decrease
to a value less than T because fewer instructions will be performed in the Prolog
engine, and $T_W^R$ will increase to a value greater than 0 because multiunification and
control operations will now be done in the workers. Suppose generate/1 returns
100 solutions at a time per answer and backtracks 10 times to generate all 1000
solutions. Then the total number of multiresolution steps will be approximately
1000 + 500 * 10 = 6000 = $T_E$, and the metric M will be 500,000/6000 = 83.3. $T_W^R$
will be as much as 100 * 10 * 500 = 500,000, depending on how many of test/1's
instructions involve multivariables created by generate/1. For example, if half
of them involve such multivariables, then $T_W^R$ = 100 * 10 * 250 = 250,000 so that
T = $T_E$ + $T_W^R$ + $T_W^C$ = 6000 + 250,000 + $T_W^C$. Assuming, for the sake of simplicity,
that copying time is negligible, T = 256,000 and S = 500,000/256,000 = 1.95.
If one now increases the reversion threshold,^21 so that 200 solutions are re-
turned per answer, and 5 answers are returned, then the total number of mul-
tiresolution steps will be approximately 1000 + 500 * 5 = 3500 = $T_E$, and the
metric M will be 500,000/3500 = 142.9. Again, assuming that copying time
is negligible, $T_W^R$ = 250,000, T = $T_E$ + $T_W^R$ = 3500 + 250,000 = 253,500 and
S = 500,000/253,500 = 1.97. Notice that T has decreased from 256,000 to 253,500,
an insignificant improvement. The reason is that already, when T = 256,000, virtu-
ally all execution time is spent in the workers ($T_W^R$), so there is little benefit to be
had by further decreasing $T_E$.
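The arithmetic of this example can be replayed for any number of solutions per answer. The Python sketch below follows the steps of the text under the same simplifications (one time unit per instruction, half of the test instructions touching multivariables, negligible copying); the function name worked_example and its parameter names are hypothetical.

```python
def worked_example(solutions_per_answer, total_solutions=1000,
                   test_instructions=500, multivariable_fraction=0.5):
    """Replay the generate/test example; returns (T_E, M, T, S)."""
    answers = total_solutions // solutions_per_answer
    # Prolog runs the test once per solution: 1000 * 500 instructions.
    prolog_instructions = total_solutions * test_instructions
    # Engine work: one step per fact, plus one pass over the test per answer.
    t_engine = total_solutions + test_instructions * answers
    metric = prolog_instructions / t_engine
    # Real worker work: test instructions that involve multivariables.
    t_worker_real = (answers * solutions_per_answer
                     * test_instructions * multivariable_fraction)
    total = t_engine + t_worker_real  # copying time assumed negligible
    speedup = prolog_instructions / total
    return t_engine, metric, total, speedup


if __name__ == "__main__":
    for per_answer in (100, 200, 500, 1000):
        t_engine, metric, total, speedup = worked_example(per_answer)
        print(per_answer, t_engine, round(metric, 1), total, round(speedup, 2))
```

Collecting ever more solutions per answer raises the metric sharply while the speedup stays just below 2, because the worker time of 250,000 units is a floor on T.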
This example illustrates that increasing the metric will have little benefit when
most of the work is already being done in the unification workers. There is a "law
of diminishing returns" that limits the speedup of uniprocessor MultiLog according
to the proportion of time spent in the unification worker code. If 1/3 of the time is
spent doing multiunification and worker control operations, then speedups greater
than 3 are impossible. This law helps explain why our best speedups were ob-
tained with low reversion thresholds. However, in this form, the law applies only to
uniprocessor MultiLog. In SIMD MultiLog, work done in the unification workers is
sped up in proportion to the metric, and it is beneficial to increase the metric even
when $T_W^R/T_E$ is high, at least until M = p (the number of processors). Section 6.5.2
(Figure 15) uses a formal model of speedup different from the one in this section
to explain the law of diminishing returns.
21 The reversion threshold is a run-time parameter governing when to revert to
backtracking during the collection of solutions to a disj goal. The higher the
reversion threshold, the more solutions are collected.
[Plot: horizontal axis M, from 0 to 100; vertical axis from 0 to 5.]
FIGURE 15. The law of diminishing returns on increasing the metric.
6.3. MultiLog's Space Usage
It is difficult to relate the space usage of a MultiLog program to the space usage 
of the corresponding Prolog program since MultiLog's space usage depends on so 
many factors: 
• the reversion threshold (i.e., whether backtracking or multiple environments 
are used to solve a nondeterministic goal), 
• the proportion of environments that fail unifications,
• the proportion of environments that need to be trailed with choice points,
• the number of extant choice points, 
• the proportion of variables that are multivariables. 
In the worst case, all variables are multivariables, all nondeterministic goals use 
multiple environments, and all environments succeed on all unifications (or need 
to be trailed). Even if there are few solutions to the query or even if average 
space consumption is low, the maximum number of active environments is bounded
only by the amount of backtracking in the Prolog program: a Prolog program 
that runs in constant space but does an exponential amount of backtracking can 
run in exponential space if all solutions to the generators are collected during a 
disj goal. We point out again, though, that MultiLog can revert to space-efficient
backtracking whenever space consumption becomes a problem. 
For SIMD MultiLog, the presence of trailed environments has been a major prob- 
lem, as we saw in Section 5.3, and because of trailed environments, the space usage 
cannot directly be modeled as a function of the metric M. However, for uniprocessor 
and MIMD MultiLog, trailed environments are less troublesome, thanks to pag- 
ing: the active set of pages (those pages actually in use over some short interval 
of time) depends on the active environments only. So by ignoring the need for 
trailed environments, we can model the "active" space usage as follows. First, 
note that the (binding) trail space used in each worker is no larger than the size 
of the binding vector (and generally much smaller). Second, if we make the rea- 
sonable assumption that there are, on average, few choice frames in the choice 
stack, then we can ignore the space used by the (worker) choice stacks. Fi- 
nally, if we let r be the proportion of variables that are multivariables, we can 
bound the average "active" space usage by the expression O(M * r). So, again, 
the analysis points to the importance of allocating as few variables as possible as 
multivariables. 
6.4. A Model for the SIMD Performance 
There are three differences between the formal model for SIMD speedups and the 
formal model for uniprocessor speedups of Section 6.2. First, in SIMD MultiLog, 
multiunification and control work done in the SIMD processors is sped up by an 
amount proportional to the metric due to the parallelism of the unification workers; 
this is different from uniprocessor MultiLog, for which only work done in the Prolog 
engine is sped up (via sharing). Second, the SIMD processors are a factor of D 
slower than the front-end workstation (on the MasPar MP1, $D \approx 50$). Third, as we
mentioned in Section 5.3, there is substantial overhead from synchronization and 
communication between the SIMD processors and the Prolog engine, overhead not 
present in the uniprocessor case.^22
The total execution time T of a MultiLog program, then, consists of TE (time 
spent in the Prolog engine doing real work: control and engine unifications), TC 
(time spent saving (copying) answers), T R (time spent doing "real" work in the 
workers: multiunification and control operations), and To (time spent in overhead 
routines--both in the engine and the workers). 
Starting with the same assumptions and conventions as in Section 6.2, let D
be the ratio of the speed of the front-end workstation to the speed of the SIMD
processors.23 Then, assuming that there are at least M processors, the time $T^P$ for
Prolog execution of a program is expected to be
$$T^P = T_E \cdot M + \frac{T_W^R \cdot M}{D}.$$
As in the uniprocessor case, $T_E$ is multiplied by M because engine work is performed
in Prolog once for each environment active in MultiLog. $T_W^R$, the time spent doing
real work in the workers, is multiplied by M to reflect the speedup due to parallelism
and is divided by D to reflect the slowness of the SIMD processors. $T_C$, the time
spent copying, does not appear in the formula because in Prolog, copying need
not be performed at all. Similarly, $T_O$, time spent in overhead routines, need not
appear in the formula.
22 In uniprocessor MultiLog, there is overhead from factors such as accessing the variable vectors
and iterating through and managing environments, but it represents a much smaller fraction of
execution time than in SIMD MultiLog.
23 A more refined analysis might divide D into ratios for various classes of instructions.
The speedup S then is
$$S = \frac{T^P}{T} = \frac{T_E \cdot M + T_W^R \cdot M / D}{T_E + T_W^R + T_O + T_C}.$$
If we let $T_E'$ be the ratio $T_E/T$ and $T_W^{R\prime}$ be the ratio $T_W^R/T$, we can write this as
$$S = T_E' \cdot M + \frac{T_W^{R\prime} \cdot M}{D},$$
which has the same form as the formula for $T^P$ above. Note that for a given M
and D, as the ratio of real engine work $T_E$ to real SIMD work $T_W^R$ increases, so
does the speedup because of the slowdown D. And, as expected, S will increase
as well if we can increase the metric M, or decrease D, $T_O$, or $T_C$, all other
quantities being equal. Of these quantities, it seems that M is probably the easiest
one to alter: by implementing dynamic reversion to backtracking (Section 4.7)
and better memory and processor management (Section 5.3). $T_O$ could be lowered
either by buying a more efficient SIMD computer or by clever compilation to reduce
communication and synchronization costs (e.g., avoiding SIMD dereferencing). $T_C$
might be decreased by the binding array optimization of [57]; with this technique,
processors avoid copying the entire binding environment during task switches by
maintaining an array of bindings, in order of binding time, and then when switching
to a new task, unbinding back up to the least common ancestor and rebinding back
down to the new execution point from the new processor. It is likely that D can
be lowered only by buying faster SIMD processors.
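The dependence of S on the profiled time fractions can be explored numerically. The following is a minimal sketch, not part of the MultiLog system: the function name and the sample fractions are hypothetical, chosen only to illustrate that shifting measured time from the workers to the engine raises the predicted speedup when D is large.

```python
def simd_speedup_from_profile(engine_frac, worker_real_frac, m, d):
    """Section 6.4 model: S = T_E' * M + T_W^R' * M / D.

    engine_frac      -- T_E / T, fraction of MultiLog time doing real engine work
    worker_real_frac -- T_W^R / T, fraction doing real work in the workers
    m                -- the metric M (environments active on average)
    d                -- slowdown D of a SIMD processor relative to the front end
    The remaining fraction of T (overhead and copying) earns no credit.
    """
    if engine_frac < 0 or worker_real_frac < 0:
        raise ValueError("fractions must be non-negative")
    if engine_frac + worker_real_frac > 1.0:
        raise ValueError("fractions must sum to at most 1")
    if m < 1 or d <= 0:
        raise ValueError("m must be >= 1 and d must be positive")
    return engine_frac * m + worker_real_frac * m / d


# Hypothetical profiles (illustrative values only, not measurements):
# both spend 20% of the time on real work, split differently.
worker_heavy = simd_speedup_from_profile(0.15, 0.05, m=1000, d=50)
engine_heavy = simd_speedup_from_profile(0.18, 0.02, m=1000, d=50)
print(worker_heavy, engine_heavy)
```

With D = 50, a unit of worker time is worth only 1/50 of a unit of engine time in the numerator, which is why the engine-heavy profile yields the larger predicted speedup.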
6.5. A Simpler Analysis 
The analyses of Sections 6.2 and 6.4 have the disadvantage that they do not permit 
us to formally predict the running time of a MultiLog program from the running 
time of a Prolog program. The basic reason is that the variables $T_E$ and $T_W^R$
depend on the MultiLog program's execution and on M, and hence do not provide 
an independent basis of comparison. 
To overcome this shortcoming and to obtain a simpler, more elegant analysis, 
we make the following assumption: 
The execution time $T^P$ of a Prolog program can be expressed as the
sum $T^P = T^e + T^m$ of an engine part $T^e$ (representing engine control
and unification operations) and a multipart $T^m$ (representing multiunification
and control operations). If the program is annotated with disj
operators and executed under MultiLog, then all work that resulted in
the engine part $T^e$ will become engine work in the MultiLog run, and
all work that resulted in the worker part $T^m$ will become multiwork in
the MultiLog run. The equations $T^e = T_E \cdot M$ and $T^m = T_W$ always
hold (ignoring constants representing overheads), where $T_E$ ($T_W$) is the
total time spent in the engine (workers).
This assumption is clearly not true: as more disj goals are added to a program,
additional variables become multivariables bound in the workers and unbound in 
the Prolog engine, so that proportionately more unification work is done in the 
unification workers. Yet, the predictions we make with the assumption will be an 
upper bound on the actual performance. 
Moreover, if, instead of adding disj goals to a MultiLog program, one merely
increased the threshold for reverting to backtracking (so that more solutions are
returned per answer and fewer answers are returned), then the assumption
would hold: the same amount of multiwork would be performed since no new 
multivariables would be allocated. If more environments are allocated, then correspondingly
fewer backtracks would occur. Similarly, multiplying $T_E$ by M would
yield $T^e$.
6.5.1. The Model for Uniprocessor MultiLog. Let C be the slowdown that occurs
in the Prolog engine due to engine overheads such as accessing the variable vectors
and paging. C ≈ 1.6 for uniprocessor MultiLog. Let B be the slowdown that occurs
in the unification workers due to overheads: accessing the variable vectors, copying
environments, and iterating through environments. (Note that we count the cost of
copying environments as part of the overhead of using multivariables rather than
as a separate term $T_C$.) B ≈ 1.6 for uniprocessor MultiLog. Then we can express
the run time $T_{uni}$ of uniprocessor MultiLog as a function of $T^e$, $T^m$, M, C, and B
as follows:
$$T_{uni} = \frac{T^e \cdot C}{M} + T^m \cdot B.$$
In uniprocessor MultiLog, the engine work $T^e$ is sped up by a factor of M, but the
multiwork $T^m$ is not. (The formula ignores the thrashing that will occur if M is
too large.)
Letting $T^{e'} = T^e/T^P$ and $T^{m'} = T^m/T^P$, we can express uniprocessor MultiLog
speedup as
$$S_{uni} = \frac{T^P}{T_{uni}} = \frac{T^e + T^m}{T^e \cdot C/M + T^m \cdot B} = \frac{1}{T^{e'} \cdot C/M + T^{m'} \cdot B}.$$
Note that $T^{e'} = 1 - T^{m'}$. This implies that if $T^{m'} \ge 1/2$ and B > 2, then the denominator
will be greater than 1, so the speedup will be less than 1, i.e., it will be a slowdown.
The only way for uniprocessor MultiLog to get any speedup at all is by sharing
engine work. To get high speedups, it is desirable that $T^{e'} \gg T^{m'}$. If C = B = 2,
$T^{e'} = 0.9$, and $T^{m'} = 0.1$, then speedup is $1/(0.9 \cdot 2/M + 0.1 \cdot 2) = 1/(1.8/M + 0.2) < 5$.
Figure 15 shows the graph of M and S for these values of the constants. When
M = 1, speedup is 1/2 (a slowdown) due to the overheads. The figure also illustrates
the "law of diminishing returns" mentioned in Section 6.2.3; note that the curve
asymptotically approaches S = 5. Beyond M ≈ 25, where the "elbow" of the graph
occurs, there is little extra speedup resulting from increasing M.
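The shape of this curve is easy to reproduce. The sketch below simply evaluates the speedup formula of this subsection for the constants used in the text (C = B = 2, engine fraction 0.9); the function and parameter names are ours, introduced only for illustration.

```python
def uni_speedup(m, engine_frac=0.9, c=2.0, b=2.0):
    """Uniprocessor model: S_uni = 1 / (T^e' * C / M + T^m' * B).

    engine_frac is T^e'; the multi fraction T^m' is 1 - engine_frac.
    Defaults are the illustrative constants from the text, not measurements.
    """
    if m < 1:
        raise ValueError("the metric M must be at least 1")
    multi_frac = 1.0 - engine_frac
    return 1.0 / (engine_frac * c / m + multi_frac * b)


# Tabulate the curve: a slowdown at M = 1, then diminishing returns
# as S approaches its asymptote 1 / (T^m' * B) = 5.
for m in (1, 5, 25, 100, 1000):
    print(f"M = {m:5d}  S = {uni_speedup(m):.2f}")
```

The asymptote depends only on the multi fraction and B, so under this model reducing either of them is the only way to raise the ceiling once M is large.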
6.5.2. Relation to Amdahl's Law. The analysis for uniprocessor speedups is analogous
to Amdahl's Law, which places limits on the potential speedup of a parallel
program according to the proportion of inherently sequential code. If, in a given
program, 1/3 of the code must be run sequentially, then even with a million processors,
speedups greater than 3 will be unobtainable. Similarly, if in a given MultiLog
program, 1/3 of the work must be performed in the unification workers, then for
uniprocessor MultiLog, speedups greater than 3 will be unobtainable.
For Amdahl's Law, the quantity $\alpha$ (Amdahl's fraction) expresses the fraction of
work which is inherently sequential, and Amdahl's Law [15] states that the execution
time $T_p$ on p processors is related to the execution time $T_1$ on one processor by
$T_p = \alpha \cdot T_1 + (1 - \alpha) \cdot T_1 / p$, so that speedup is $S_p = T_1/T_p = p/(1 + (p - 1) \cdot \alpha)$.
Since the limit, as p approaches infinity, of $S_p$ is $\lim_{p \to \infty} 1/(1/p + \alpha - \alpha/p) = 1/\alpha$,
Amdahl's Law is often referred to as an argument for the unfeasibility of
parallel processing. Like $\alpha$, the work $T^m$ performed in the unification workers limits
the potential speedup because, unlike engine work $T^e$, which is sped up by a
factor of M due to sharing, no sharing occurs for $T^m$. Section 6.7 continues this
discussion of the relation between Amdahl's Law and data or-parallelism.
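The analogy can be checked numerically. The sketch below is illustrative only (the function names are ours): it evaluates Amdahl's formula next to the uniprocessor model of Section 6.5.1 with the overhead constants set to the idealized values C = B = 1, in which case the two expressions coincide when $\alpha$ is identified with $T^{m'}$ and p with M.

```python
def amdahl_speedup(alpha, p):
    """Amdahl's Law: S_p = p / (1 + (p - 1) * alpha)."""
    return p / (1.0 + (p - 1) * alpha)


def uni_multilog_speedup(multi_frac, m, c=1.0, b=1.0):
    """Section 6.5.1 model: S_uni = 1 / (T^e' * C / M + T^m' * B).

    c = b = 1 is an idealization (no overheads), assumed here only so
    that the formula can be compared directly with Amdahl's Law.
    """
    engine_frac = 1.0 - multi_frac
    return 1.0 / (engine_frac * c / m + multi_frac * b)


alpha = 1.0 / 3.0  # one third sequential code / one third worker work
for n in (10, 1000, 10**6):
    print(n, amdahl_speedup(alpha, n), uni_multilog_speedup(alpha, n))
```

In both columns the values stay below 3 = 1/$\alpha$ however large n becomes, matching the bound stated in the text.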
6.5.3. The Model for SIMD MultiLog. Let C be the slowdown that occurs in the
Prolog engine due to overheads such as accessing the variable vectors and paging.
Let B be the slowdown that occurs in the unification workers due to overheads such
as communication, synchronization, accessing the variable vectors, and copying
environments. Let D be the slowdown that occurs in SIMD MultiLog due to the
slowness of the SIMD processors. In SIMD MultiLog, both engine work and multi
work are sped up by a factor of M. Consequently,
$$T_{SIMD} = \frac{T^e \cdot C}{M} + \frac{T^m \cdot B \cdot D}{M},$$
$$S_{SIMD} = \frac{T^P}{T_{SIMD}} = \frac{T^e + T^m}{T^e \cdot C/M + T^m \cdot B \cdot D/M} = \frac{M}{T^{e'} \cdot C + T^{m'} \cdot B \cdot D}.$$
These formulas ignore the fact that M can be no greater than the number of 
processors p. But if virtual environments are used, then M can be as great as v * p, 
where v is the virtual/physical ratio, while the speedup in the SIMD workers is 
limited to p. With SIMD MultiLog, speedup increases linearly with M until M 
reaches p. Thereafter, there is an asymptote, as in the uniprocessor case. In the 
ideal world where B = C = D = 1, $S_{SIMD} = M$.
In our current, prototype implementation of SIMD MultiLog on the MP1, on
average C ≈ 1.6, D ≈ 50, $T^{e'}$ ≈ 0.85, M ≈ 2326.1, and B ≈ 11. (Profiling indicates
that over 75% of the execution time is spent in overhead routines, including copying!
Less than 8% of the time in the workers is spent doing "real" work.) Under these
assumptions, the expected SIMD speedup is
$$S_{SIMD} = \frac{M}{T^{e'} \cdot C + T^{m'} \cdot B \cdot D} = \frac{2326.1}{0.85 \cdot 1.6 + 0.15 \cdot 11 \cdot 50} \approx 27.7.$$
Since speedup is proportional to M when M < p, and since for most of our
examples M << p, probably the easiest way to speed up SIMD MultiLog programs
is to increase M, either by dynamic reversion to backtracking (so that the system
can collect more solutions before reverting to backtracking) or by virtual processing.
This will result in higher processor utilization, which was 28% on average and for 
many problems was under 10%. In any case, the formulas provide a basis for 
predicting the effects of various optimizations. 
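As an illustration of such a prediction, the sketch below evaluates the SIMD speedup formula of this subsection with the average constants reported above and then with one constant changed. The function name is ours, and the "what if" value of B is a hypothetical chosen for illustration, not a measured result. The cap M ≤ p discussed above is not modeled.

```python
def simd_speedup(m, engine_frac, c, b, d):
    """SIMD model: S_SIMD = M / (T^e' * C + T^m' * B * D).

    engine_frac is T^e'; the multi fraction T^m' is 1 - engine_frac.
    The formula assumes M does not exceed the number of processors.
    """
    multi_frac = 1.0 - engine_frac
    return m / (engine_frac * c + multi_frac * b * d)


# Average constants reported for the MP1 prototype in the text.
baseline = simd_speedup(m=2326.1, engine_frac=0.85, c=1.6, b=11.0, d=50.0)
print(round(baseline, 1))  # 27.7, matching the value derived in the text

# Hypothetical optimization: halving the worker overhead factor B.
halved_b = simd_speedup(m=2326.1, engine_frac=0.85, c=1.6, b=5.5, d=50.0)
print(halved_b > baseline)
```

Because the term $T^{m'} \cdot B \cdot D$ dominates the denominator for these constants, the model suggests that reductions in B or D pay off almost proportionally, whereas reductions in C have little effect.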
6.6. Testing the Models 
To test our performance models, we instrumented the Multi-WAM interpreters and 
used profiling tools to measure M, $T_E$, $T_C$, etc. We then calculated $T^P$ for a
dozen or so example program executions, based on the formulas of Sections 6.2 and 
6.4. The examples tested were mostly standard fused or unfused generate-and-test 
programs, including n-queens, knight's tour, Waltz line labeling, Instant Insanity, 
and a satisfiability tester. Detailed results can be found in [47]. Here, we present a 
summary. 
For uniprocessor MultiLog, the average metric M was 100.2. Average speedup 
over standard Prolog (WAM) was 9.4. The average ratio of predicted and actual
$T^P$ was 1.65, reflecting, we think, constant time overheads. The sample correlation
coefficient between actual and predicted run time is r = 0.9998. When we plot
the predicted time divided by 1.65 and the actual time on the same graph, we get
Figure 16, which visually shows a close correspondence between scaled predicted run
time and actual run time. In this graph, the horizontal axis lists 13 programs, while
the vertical axis shows predicted and actual run times; lines connecting predicted
(actual) run times are for ease of viewing. Data for the last two programs were
scaled so all points fit on the same graph.
FIGURE 16. Comparison of scaled predicted and actual run times (uniprocessor). Plotted series: "data-actual-uni" and "data-predicted-uni".
For SIMD MultiLog on the MP1, the average metric was 2326.1. Average actual 
speedups were 294.3. (This figure was skewed to a value higher than the predicted 
average of 27.7 from Section 6.5.3 by the presence of a few programs with speedups 
over 1000.) If we multiply actual uniprocessor speedup 9.4 by the ratio of SIMD M 
to uniprocessor M, we get 9.4 * (2326.1/100.2) = 218.2, a figure close to the actual 
SIMD speedup. Figure 17 plots, for 14 program executions, the unscaled actual 
and predicted Prolog run times on the same graph. (The last program's data point 
has been truncated so that the remaining points would be discernible.) As can be
seen, predicted and actual times are quite close, given the range of possible values.
The correlation coefficient is 0.99992, a further suggestion of the validity of the
model.
FIGURE 17. Comparison of unscaled predicted and actual run times (SIMD). Axes: Programs (horizontal), Runtime (vertical).
6.7. Data Or-Parallelism and Amdahl's Law
We conclude this section by continuing the discussion begun in Section 6.5.2 on the 
relation between Amdahl's Law and data or-parallelism. 
It has already been observed [21] that the negative judgment against massive 
parallelism suggested by Amdahl's Law is overly pessimistic. One problem with 
Amdahl's Law is that it assumes that n (the size of the problem) is independent 
of p (the number of processors). But, when more processors are available, one can 
often run larger problems--or programs which expose more inherent parallelism. In 
MultiLog, the truth of this observation is manifested by the fact that MultiLog can 
adjust its degree of data-parallelism by choosing between backtracking and multiple 
environments for the solution of disj goals. To generate more parallelism, MultiLog
can collect more solutions to a disj goal; when there are too many environments
active, MultiLog can revert to backtracking.
As our performance models indicate, MultiLog gets better speedup compared to 
Prolog when a large amount of the execution is performed in the Prolog engine (the 
inherently sequential part of the code). The speedup results from the sharing of 
control instructions and engine unifications that occurs when multiple solutions to a 
generator are tested in parallel by subsequent goals. The fact that MultiLog obtains 
higher speedups the higher the percentage of sequential code may seem to present 
another challenge to Amdahl's Law. Yet, the speedup due to work performed in
the Prolog engine is speedup that occurs for uniprocessor MultiLog as well as for 
parallel MultiLog. In fact, as the benchmarks and analyses indicate, most of the 
speedup of SIMD MultiLog results from speedup in the Prolog engine, and not 
from speedup in the SIMD processors. The data or-parallelism of MultiLog is in 
this sense independent of parallel computing. Rather, it is a matter of an improved 
algorithm--an algorithm that is parallel in conception, but that often beats the 
natural sequential algorithm on sequential hardware.
There is a limit to the speedup obtainable by uniprocessor MultiLog for a given 
size problem; the law of diminishing returns of Section 6.5.2 resembles Amdahl's 
Law by setting an upper bound on possible speedups--a bound that depends on 
the percent of work to be performed in the unification workers. However, for SIMD 
MultiLog, the analysis puts no upper bound on the possible speedup since multiunification
work is being performed in parallel; for SIMD MultiLog, one can increase
the speedup arbitrarily by increasing the metric M, as long as hardware resources 
and parallelism in the problem domain remain available. 
7. CONCLUSION
We have demonstrated the viability of data or-parallelism as an efficient alternative 
to backtracking for the management of combinatorial search in logic programming. 
The essential idea is to collect solutions (substitutions) to certain nondeterministic
goals and to execute subsequent goals in the context of these substitutions using 
a single thread of control. In contrast to standard control or-parallel systems like 
Aurora and Muse, which use multiple threads of control to explore an SLD tree in 
parallel, MultiLog utilizes parallel unification in multiple substitutions with a single 
thread of control (data or-parallelism). The generalization from a single substitution
to multiple substitutions is straightforward, and leads to significant forms of
sharing: the engine/multivariable distinction and templates are powerful optimizations
that save much redundant work by factoring out common computations so
they can be performed once per set of substitutions. The Multi-WAM is a natural 
and minimal generalization of the standard WAM, requiring only a handful of new 
instructions and data structures in the Prolog engine, as well as a few new data 
structures in the unification workers. Prototype implementations exist for a vari- 
ety of computer architectures: uniprocessor, MIMD, and SIMD. Benchmark results 
are promising, especially--and somewhat surprisingly--for uniprocessor MultiLog. 
Importantly, unlike control or-parallelism, data or-parallelism can exploit massive 
(SIMD) parallelism, and is competitive with or faster than standard Prolog, even 
on a uniprocessor. Finally, our performance models explain the speedups and space 
consumption, and give insight into the nature of data or-parallelism. 
Nonetheless, there remains a lot of theoretical and practical work to be done 
before the true potential and limitations of data or-parallelism are realized. We 
collect here some areas for future research. 
• Combination of data or-parallelism and constraints (cf. [51]) 
• Combination of control and data or-parallelism 
• Application of the binding array technique from the SRI model [57] to lessen 
environment copying costs 
• Application of Firebird's technique of having idle processors track active ones 
[51] to lessen environment copying costs 
• Investigation of the alternative representation for bindings of Section 4.5 in 
which each multivariable is associated with a vector of bindings, one per 
substitution 
• A better model for MultiLog's space usage (Section 6.3) 
• Further investigation of dynamic reversion to backtracking (Section 4.7) 
• Cleverer memory management (e.g., virtual processors) in SIMD MultiLog 
to increase processor utilization 
• Garbage collection, both in the Prolog engine and in workers 
• Avoidance of variable trailing in workers (global trailing in Prolog engine) 
• Use of modes and types (cf. [48]) 
• Avoidance of dereferencing in workers 
• Better load balancing (for MIMD MultiLog) 
• Further investigation of two-dimensional backtracking (Section 3.3) 
• Automatic compilation of engine/multivariable distinction
• Automatic incorporation of templates (Section 3.2.1) 
• Compiling the finite domain restriction for SIMD MultiLog 
• A way to avoid preallocating a multi-Y variable for each Y variable 
• Clause indexing for multivariables 
• "Generator procedures" as an alternative to disj goals for creating different
bindings in multiple environments 
• Alternatives to DNF: (hierarchical) environment trees (Section 4.10) 
• Application of data or-parallelism to handling ambiguity [17] in natural lan- 
guage processing. 
I thank Tim Hickey for valuable discussions and ideas. In particular, he helped devise the en- 
gine/multi distinction, conceived of the notion of templates, and suggested the simpler analysis of 
Section 6.5. I thank the Advanced Computing Research Facility of Argonne National Laboratory 
for providing access to their BBN TC2000, and MasPar Computer Corporation for providing access 
to an MP1. Finally, I thank the anonymous referees for their numerous constructive comments. 
REFERENCES 
1. Aït-Kaci, H., The Warren Abstract Machine: A Tutorial Reconstruction, MIT Press,
1992. 
2. Ali, K. and Karlsson, R., The Muse Or-Parallel Prolog Model and Its Performance, 
in: S. Debray and M. Hermenegildo (eds.), Proc. 1990 North American Conf. on Logic
Programming, ALP, MIT Press, Austin, TX, 1990, pp. 757-776. 
3. Arro, H., Barklund, J., and Bevemyr, J., Parallel Bounded Quantifiers: Preliminary 
Results, in: Proc. JICSLP'92 Post-Conf. Joint Workshop on Distributed and Parallel 
Logic Programming Systems, 1992. 
4. Association for Computing Machinery, Proc. Int. Conf. on Fifth Generation Computer 
Systems, ICOT, ACM, Japan, 1992. 
5. Barklund, J., Parallel Unification, Ph.D. thesis, Uppsala University, 1990. 
6. Barklund, J., email message, Dec. 1, 1992. 
7. Barklund, J. and Millroth, H., Providing Iteration and Concurrency in Logic Pro- 
grams Through Bounded Quantifications, in: Proc. Int. Conf. on Fifth Generation 
Computer Systems [4], pp. 817-824. 
8. Beeri, C., Naqvi, S., Ramakrishnan, R., Shmueli, O., and Tsur, S., Sets and Negation
in a Logic Database Language (LDL1), in: Principles of Database Systems, ACM
Press, 1987.
9. Chan, D., Constructive Negation Based on the Completed Database, in: R. A.
Kowalski and K. A. Bowen, (eds.), Proc. 5th Int. Conf. and Symp. on Logic Pro- 
gramming, MIT Press, Cambridge, MA/London, 1988. 
10. Chan, D., An Extension of Constructive Negation and Its Application in Coroutining, 
in: E. L. Lusk and R. A. Overbeek (eds.), Logic Programming, Proc. North American 
Conf., MIT Press, Cleveland, OH, 1989. 
11. Clark, K. L., Negation as Failure, in: Logic and Databases, Plenum, 1978. 
12. Comon, H. and Lescanne, P., Equational Problems and Disunification, J. Symbolic 
Computation 7 (1989). 
13. Costa, V. S., Warren, D. H. D., and Yang, R., The Andorra-I Engine: A Parallel 
Implementation of the Basic Andorra Model, in: Furukawa [18], pp. 825-839. 
14. Costas, J. P., Medium Constraints on Sonar Design and Performance, in: EASCON 
Conv. Rec., 1975, pp. 68A-68L.
15. DeCegama, A. L., The Technology of Parallel Processing: Parallel Processing Architectures
and VLSI Hardware, Vol. 1, Prentice-Hall, 1989.
16. Dincbas, M., Van Hentenryck, P., Simonis, H., and Aggoun, A., The Constraint Logic 
Programming Language CHIP, in: Proc. Int. Conf. on Fifth Generation Computer 
Systems, ACM, North-Holland, Tokyo, Japan, 1984. 
17. Dörre, J. and Eisele, A., Feature Logic with Disjunctive Unification, in: Proc. COLING'90,
Helsinki, Finland, 1990.
18. Furukawa, K. (ed.), Proc. 8th Int. Conf. on Logic Programming, MIT Press, Paris, 
France, 1991. 
19. Golomb, S. W., The T4 and G4 Constructions for Costas Arrays, IEEE Trans. Information
Theory 38(4):1404-1406 (1992).
20. González, A. and Tubella, J., The Multipath Parallel Execution Model for Prolog, in:
Proc. 1st Int. Conf. on Parallel Symbolic Computation, PASCO'94, World Scientific
Pub., 1994.
21. Gustafson, J. L., Reevaluating Amdahl's Law, Commun. ACM 532-533 (1988). 
22. Hans, W., A Complete Indexing Scheme for WAM-Based Abstract Machines, in: 
M. Bruynooghe and M. Wirsing (eds.), Programming Language Implementation and
Logic Programming, LNCS 631, Springer-Verlag, Leuven, Belgium, 1992, pp. 232-244. 
23. Haridi, S. and Janson, S., Kernel Andorra Prolog and Its Computation Model, in: D. 
H. D. Warren and P. Szeredi (eds.), Proc. 7th Int. Conf. on Logic Programming, MIT 
Press, Cambridge, MA/London, 1990, pp. 31-46. 
24. Hausman, B., Ciepielewski, A., and Haridi, S., OR-Parallel Prolog Made Efficient on 
Shared Memory Multiprocessors, in: Proc. 1987 Symp. on Logic Programming, IEEE, 
Washington, DC, Aug.-Sept. 1987, pp. 69-79. 
25. Van Hentenryck, P., Constraint Satisfaction in Logic Programming, MIT Press, 1989. 
26. Van Hentenryck, P. and Deville, Y., The Cardinality Operator: A New Logical Con- 
nective for Constraint Logic Programming, in: Furukawa [18]. 
27. Herbrand, J., Recherches sur la Théorie de la Démonstration, 1930.
28. Hohfeld, M. and Smolka, G., Definite Relations over Constraint Languages, Technical 
Report, IBM Deutschland GmbH, 1988. 
29. Jaffar, J. and Lassez, J.-L., Constraint Logic Programming, in: Proc. 14th Symp. on
Principles of Programming Languages, ACM Press, 1987. 
30. Kanada, Y., Kojima, K., and Sugaya, M., Vectorization Techniques for Prolog, in: 
Proc. Int. Conf. on Supercomputing, 1988. 
31. Kunen, K., Answer Sets and Negation as Failure, in: J.-L. Lassez (ed.), Proc. 4th Int.
Conf. on Logic Programming, MIT Press Series in Logic Programming, MIT Press, 
Cambridge, MA/London, 1987. 
32. Kuper, G., Logic Programming with Sets, in: Principles of Database Systems, ACM 
Press, 1987. 
33. Lassez, J.-L., Maher, M., and Marriott, K., Unification Revisited, in: Foundations of 
Deductive Databases and Logic Programming, M. Kaufmann, 1988. 
34. Lloyd, J. W., Foundations of Logic Programming, 2nd ed., Springer-Verlag, 1987. 
35. Lusk, E., Butler, R., Disz, T., Olson, R., Overbeek, R. A., Stevens, R., Warren, D. H. 
D., Calderwood, A., Szeredi, P., Haridi, S., Brand, P., Carlsson, M., Ciepielewski, A., 
and Hausman, B., The Aurora Or-Parallel Prolog System, New Generation Computing 
7:243-271 (1990). 
36. Maher, M. J., Complete Axiomatizations of the Algebra of Finite, Rational, and Infinite
Trees, in: Proc. Symp. on Logic in Computer Science, IEEE Computer Society
Press, 1988.
37. Millroth, H., Reforming Compilation of Logic Programs, Ph.D. thesis, Uppsala University,
1990.
38. Saraswat, V. A., Concurrent Constraint Programming, MIT, 1990. 
39. Sato, T. and Motoyoshi, F., A Complete Top-Down Interpreter for First Order Programs,
in: V. Saraswat and K. Ueda (eds.), Logic Programming, Proc. 1991 Int.
Symp., MIT Press, San Diego, CA, 1991, pp. 35-53.
40. Smith, D. A., Constraint Operations for CLP(FT), in: Furukawa [18]. 
41. Smith, D. A., Multilog: Data Or-Parallel Logic Programming, in: JICSLP'92 Workshop
on Parallel Implementations of Logic Programming Systems, 1992.
42. Smith, D. A., Simpler Quantifier Elimination for Equality Formulas, Technical Report 
CS-92-167, Brandeis University, 1992. 
43. Smith, D. A., MultiLog: Data Or-Parallel Logic Programming, in: Warren [55]. 
44. Smith, D. A., Analysis of Environment Representation Schemes for MultiLog, Tech- 
nical Report, University of Waikato, 1994. 
45. Smith, D. A., Why Multi-SLD Beats SLD (Even on a Uniprocessor), in: Programming 
Language Implementation and Logic Programming, Springer-Verlag, 1994, pp. 40-56.
46. Smith, D. A. and Hickey, T., Multi-SLD Resolution, in: Logic Programming and Au- 
tomated Reasoning, Springer-Verlag, 1994, pp. 260-274. 
47. Smith, D. A., MultiLog: Data Or-Parallel Logic Programming, Ph.D. thesis, Brandeis 
University, 1993. 
48. Somogyi, Z., Henderson, F. J., and Conway, T. C., Code Generation for Mercury, in: 
J. Lloyd (ed.), Logic Programming, Proc. 1995 Int. Symp., MIT Press, Portland, OR, 
1995. 
49. Stuckey, P., Constructive Negation for Constraint Logic Programming, in: Proc. 
Symp. on Logic in Computer Science, IEEE Computer Society Press, 1991. 
50. Tick, E., Parallel Logic Programming, MIT Press, 1991. 
51. Tong, B.-M. and Leung, H.-F., Concurrent Constraint Logic Programming on Mas- 
sively Parallel SIMD Computers, in: Int. Logic Programming Symp., 1993. 
52. Tubella, J. and González, A., MEM: A New Execution Model for Prolog, Microprocessing
and Microprogramming 39:83-86 (1993).
53. Tubella, J. and González, A., A Partial Breadth-First Execution Model for Prolog, in:
Proc. 6th Int. Conf. on Tools with Artificial Intelligence, TAI'94, 1994, pp. 129-137. 
54. Vellino, A., Costas Arrays, Technical Report, Computing Research Laboratory, Bell- 
Northern Research, 1990. 
55. Warren, D. S. (ed.), Proc. 10th Int. Conf. on Logic Programming, MIT Press,
Budapest, Hungary, 1993. 
56. Warren, D. H. D., Or-Parallel Execution Models of Prolog, in: TAPSOFT, 1987. 
57. Warren, D. H. D., The SRI Model for Or-Parallel Execution of Prolog: Abstract 
Design and Implementation, in: Proc. 1987 Symp. on Logic Programming, IEEE, 
Washington, DC, Aug.-Sept. 1987, pp. 92-102. 
58. Yang, R., Beaumont, T., Dutra, I., Costa, V., Santos, and Warren, D. H. D., Perfor- 
mance of the Compiler-Based Andorra-I System, in: Warren [55], pp. 150-166. 
