Abstract: SICStus Prolog is a sequential Prolog implementation built around a version of the Warren Abstract Machine (WAM). For several years, SICStus has supported WAM-to-native code compilation for Sun workstations. This old scheme is neither as portable nor as open to experiments as would be desirable. With the support of my colleagues in the SICStus group, I have developed a new scheme that is more of both and performs slightly better too. Portability enhancement is achieved by introducing a new abstract machine, a fragment of an idealized general-purpose architecture, between the WAM and native code targets. This allows most of the complexity of native code compilation to be in phases containing only well-encapsulated target dependencies. Moreover, it allows the run-time library for native code to be substantially target-independent. For RISC targets, a derivative of the new abstract machine allows quasi-target-independent instruction scheduling to be performed. This paper explains the SICStus approach to native code compilation, presents the new scheme in detail, and evaluates it with respect to portability and performance.
Native Code Compilation
Of the general approaches open to a programming language implementor, the simplest is interpretation. Interpreters are typically fast to write but slow to run. Moreover, for languages like Prolog, they have bad memory behavior, if written naively, due to heavy recursion. SICStus uses interpretation for dynamic predicates, whose clause sequences may be incrementally altered. However, in SICStus, as in programming language implementations generally, the desire for efficiency in time and space alike dictates some form of compilation as the normal usage.
The simplest form of compilation is to an abstract machine, designed with respect to the language to be implemented and free of many complications typical of concrete machines (pipelining, interrupts, etc.). The abstract machine must be realized in terms of some concrete machine; emulation is doing this without further compilation. An implementor has great freedom of choice about the abstract machine: its level of abstraction, the techniques used in compiling to it, and the techniques used in emulating it. Loosely speaking, if the level of abstraction is low enough, a vast assortment of compilation techniques is available for generating efficient code; and if it is high enough, the abstract machine is amenable to emulation techniques that make the implementation portable to many concrete machines.
SICStus partakes of both advantages. The core of SICStus is an emulator for a version of the Warren Abstract Machine (WAM) [Car91]. Over the last decade, the WAM has become paradigmatic for implementors of Prolog and other logic languages [Van94]. It serves performance by specializing unification (get_*, put_*, unify_*), reducing nondeterminism (switch_*), and using a judicious memory strategy (registers and stack-allocated memory); and it serves portability in that its instruction set is coarse-grained, and its architecture is otherwise simple. Of course, it is possible to go further in either direction, especially toward performance; ideas abound in the literature (e.g., [Van90]). However, for the most part, SICStus does not go further. The SICStus compiler [Car90], pCwam, operates predicate-by-predicate. There is no global analysis. Clause selection is by first-argument indexing only. Despite this naivete, performance has proved high enough for many purposes. The emulator, written in C, has proved portable to many platforms. SICStus has an active user community of several hundred people throughout the world, many of whom depend on emulated code.
Despite the advantages of emulation, it is natural to look beyond it. For high performance, it seems obvious that one should compile to a concrete machine; this is native code compilation.3 However, there are a number of issues to consider; I consider them here with respect to SICStus.
First, how much higher performance should one expect? SICStus is a mature implementation, the product of many person-years of skilled labor, much of it directed at tuning pCwam and the emulator. This means that emulated code is not so easy to improve upon as one might suppose.4 Moreover, real applications spend time in built-in predicates that in SICStus are written in C and in garbage collection, likewise in C. Native code compilation need not reduce this; attention to Amdahl's Law is appropriate. A rough analysis of these factors suggests that, for SICStus, well-crafted native code compilation should yield a speed-up between one and something like three for most programs, but one should not expect a greater speed-up for almost any program, unless maybe one alters the SICStus approach to compilation radically (e.g., by adding global analysis).
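The Amdahl's Law argument can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the sample figures (70% of run time in compiled code, a fourfold local speed-up) are assumptions chosen for the example, not measurements from this paper.

```python
def overall_speedup(native_fraction: float, native_speedup: float) -> float:
    """Amdahl's Law: only `native_fraction` of the run time is accelerated.

    The remainder (built-in predicates and garbage collection written in C)
    runs at its old speed.
    """
    if not 0.0 <= native_fraction <= 1.0:
        raise ValueError("native_fraction must lie in [0, 1]")
    if native_speedup <= 0.0:
        raise ValueError("native_speedup must be positive")
    return 1.0 / ((1.0 - native_fraction) + native_fraction / native_speedup)


# Hypothetical figures, chosen only to illustrate the bound.
print(round(overall_speedup(0.7, 4.0), 2))
```

Even an unboundedly fast compiled portion cannot push the overall factor past 1 / (1 - native_fraction), which is why time spent in C code caps the benefit.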
Second, what about portability? The last phase of native code compilation cannot be target-independent, and there is tension between keeping the earlier phases target-independent and generating efficient code. Indeed, efficiency is likely to engender target dependencies throughout. Portability loss is reduced if target dependencies in the earlier phases are well-encapsulated. Another way of reducing portability loss is compiling to C and using a C compiler as a back-end. There has been a lot of interest in this approach recently (e.g., [Chi94, Hau93, Gud92]). For reasons given below, it is not appropriate for SICStus, but it may be appropriate for many logic language implementations. SICStus users are mostly academic and industrial researchers. The prevalence of workstations from Sun, DEC, etc. among these people means that SICStus native code compilation is valuable, even if available for these platforms only. However, to allow for new trends in the workstation market and to ease the maintenance burden, it is desirable for SICStus native code compilation to be as portable as possible, consistent with other requirements.
3 Of course, there are degrees of concreteness. E.g., for a microcoded processor with a writeable control store, the standard instruction set architecture is not the bottom level for programming. I shall ignore such distinctions, an imprecision that seems unlikely to mislead.
4 E.g., there is a program, not written as a benchmark, for which SICStus emulated code runs about twice as fast as native code in Aquarius Prolog, an implementation using far more sophisticated compilation techniques [Van90]. The reason is a register allocation flaw, which makes Aquarius do badly on a critical predicate. This is unusual (Aquarius is competitive with SICStus native code for most programs), but it suggests the point: SICStus emulated code is good enough that native code compilation must be well-crafted to improve upon it.
The other requirements pertain to implementation effort and functionality. As usual in engineering, these are somewhat at odds. On one hand, high performance is the basic goal. On the other hand, the SICStus group is small, and its members have other responsibilities. Moreover, SICStus is not a research prototype: though we do not guarantee it, our attitude is that anything we add to it should be robust. This led us to the conclusion that our native code compilation should use as much as possible of the existing implementation, at least to begin with. In particular, pCwam has been retained as the first phase, with minor alterations for native code relative to emulated code. This decision implies that it is natural for native code compilation to operate predicate-by-predicate, unlike native-code-via-C schemes, which must compile whole programs or modules at a time to attain high performance.
This incrementality is one aspect of a general commitment, namely, that native code should behave just like emulated code, except that it may take somewhat more time to generate, it may take somewhat more space to store, and hopefully, it will run faster. "Somewhat" is vague, of course, and both program- and target-dependent; we are content with factors of something like three or less. Besides incrementality, this commitment entails transparent interoperability between native code and emulated code. Broadly speaking, this means that native code compilation is convenient for development as well as "production" usage. The value of this flexibility is a matter of opinion. However, several users have expressed appreciation for it. Also, it has been helpful for debugging native code compilation, since it is easy to compile any portion of a program to native code and the rest to emulated code.
As mentioned in section 1, for several years, SICStus has supported native code compilation for Sun workstations. This old scheme fulfills the foregoing desiderata acceptably, except that it is not as portable as desired. For one thing, the WAM-to-native code phases for the two targets, the 68k5 and SPARC, are disjoint and permeated with target dependencies. For another, native code continually refers to a run-time library, the kernel, written in assembly language (about 2000 lines of it). The necessity of providing all this for a new target is a major barrier. Therefore, I have built a new scheme, in two stages. In the first stage, I rebuilt the WAM-to-native code phases so that most of the complexity is in phases containing only well-encapsulated target dependencies. In the second stage, I rebuilt the kernel in the same fashion. Having finished the new scheme for the 68k and SPARC, I extended it to the MIPS6, mainly for use on DECstations. The basic goal of my work has been portability enhancement. However, I have sought to improve performance too. Moreover, with the new scheme, it should be easier to perform experiments toward further performance improvement.
The New Scheme
The new scheme is built around a new abstract machine, the SICStus Abstract Machine (SAM). The WAM is first compiled to the SAM, using essentially the same compiler, regardless of the native code target, and the SAM is then compiled to the desired target. Moreover, the kernel is written largely in SAM code, and this too is compiled to native code. For RISC7 targets, commonality is practical even further down the machine hierarchy. Between the SAM and such targets, the new scheme places a "RISCified" SAM, the RISS. Quasi-target-independent instruction scheduling is performed on RISS code. Figures 1 and 2 show the Prolog and kernel compilation paths. Apart from the standard utility m4 and the standard assembler as (at the end of the kernel compilation path), each phase is written in Prolog and constitutes a self-contained module. WAM, SAM, RISS, and target instructions are represented by terms in a straightforward way, and each interior phase consumes a list of instructions from the preceding phase and produces one for the following phase. The result of a Prolog compilation may be loaded immediately or written to a file for subsequent loading. The result of a kernel compilation is an object file, which is linked (along with object files for the emulator, the garbage collector, etc.) into the SICStus executable. There are target dependencies from the WAM level down. However, in wam_sam, sam_riss, and the kernel, they are well-encapsulated.
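The phase structure described above, where each phase consumes one instruction list and produces another, can be sketched as function composition. The following is a minimal illustration in Python rather than Prolog; the two toy phases and the instruction tuples are hypothetical stand-ins, far simpler than the real wam_sam and sam_riss.

```python
from functools import reduce
from typing import Callable, List, Tuple

Instr = Tuple                      # an instruction as a tuple, e.g. ("mov", "x(1)", "x(0)")
Phase = Callable[[List[Instr]], List[Instr]]


def compose(*phases: Phase) -> Phase:
    """Chain phases so that each consumes the list produced by the previous one."""
    def run(code: List[Instr]) -> List[Instr]:
        return reduce(lambda acc, phase: phase(acc), phases, code)
    return run


def toy_wam_sam(code: List[Instr]) -> List[Instr]:
    """Hypothetical macro-expansion of two WAM-like instructions."""
    out = []
    for ins in code:
        if ins[0] == "put_value":          # register-to-register move
            out.append(("mov", ins[1], ins[2]))
        elif ins[0] == "execute":          # tail call becomes a jump
            out.append(("jmp", ins[1]))
        else:
            out.append(ins)
    return out


def toy_sam_riss(code: List[Instr]) -> List[Instr]:
    """Hypothetical RISC-ification: add an unfilled delay slot after each jump."""
    out = []
    for ins in code:
        out.append(ins)
        if ins[0] == "jmp":
            out.append(("nop",))
    return out


prolog_path = compose(toy_wam_sam, toy_sam_riss)
print(prolog_path([("put_value", "x(1)", "x(0)"), ("execute", "foo/1")]))
```

Because every phase has the same list-to-list shape, a target-specific back-end can be appended or swapped without touching the earlier phases.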
The following subsections discuss the SAM, wam_sam, the kernel, the RISS, and sam_riss in detail. A further subsection discusses riss_sparc as an example of final compilation to a target. Since the 68k is no longer in fashion, I shall say little about sam_68k. Also, since they are mundane, I shall say little about *_bin, the assembly phases of Prolog compilation, or *_as, the target-level syntax-transformation phases of kernel compilation.
The SAM
The SAM instruction set architecture is presented in Figure 3. Concisely, it is a fragment of an idealized general-purpose architecture. It is load/store with three-operand arithmetic instructions and self-contained conditional branches. Loads, stores, and control transfers are undelayed. The major idealizations are that any number of registers are available, and any datum may be an immediate.
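To make these properties concrete, here is a toy interpreter for a handful of SAM-like instructions. It is a sketch under stated assumptions, not the SAM itself: the operand orders follow the forms quoted in this paper (ld(R,S,D), st(S1,R,S2), br(C,R,S,L)), but the tuple encoding, the use of plain integers as immediates and addresses, and the register names are all hypothetical.

```python
def run_sam(program, regs, memory):
    """Interpret a tiny, hypothetical subset of SAM-like instructions."""
    labels = {ins[1]: i for i, ins in enumerate(program) if ins[0] == "label"}
    compare = {"e": lambda a, b: a == b, "ne": lambda a, b: a != b,
               "l": lambda a, b: a < b, "le": lambda a, b: a <= b,
               "g": lambda a, b: a > b, "ge": lambda a, b: a >= b}

    def value(operand):
        # Idealization: any operand may be an immediate (int) or a register (str).
        return operand if isinstance(operand, int) else regs[operand]

    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "mov":                       # mov(S, D)
            regs[args[1]] = value(args[0])
        elif op in ("add", "sub"):            # op(S1, S2, D): three operands
            a, b = value(args[0]), value(args[1])
            regs[args[2]] = a + b if op == "add" else a - b
        elif op == "ld":                      # ld(R, S, D): D := memory[R + S]
            regs[args[2]] = memory[value(args[0]) + value(args[1])]
        elif op == "st":                      # st(S1, R, S2): memory[R + S2] := S1
            memory[value(args[1]) + value(args[2])] = value(args[0])
        elif op == "br":                      # br(C, R, S, L): compare and branch
            if compare[args[0]](value(args[1]), value(args[2])):
                pc = labels[args[3]]
        elif op == "jmp":
            pc = labels[args[0]]
        elif op != "label":
            raise ValueError(f"unknown instruction: {op}")
    return regs


# Sum four memory cells and store the total in the fifth.
program = [
    ("mov", 0, "acc"), ("mov", 0, "i"),
    ("label", "loop"),
    ("br", "ge", "i", 4, "done"),             # no separate compare instruction
    ("ld", "base", "i", "t"),
    ("add", "acc", "t", "acc"),
    ("add", "i", 1, "i"),
    ("jmp", "loop"),
    ("label", "done"),
    ("st", "acc", "base", 4),
]
memory = [1, 2, 3, 4, 0]
run_sam(program, {"base": 0}, memory)
print(memory[4])
```

Note that the branch carries its own comparison, so no condition codes need to survive between instructions.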
This design was shaped by the WAM and a set of targets: primarily, the SPARC and the 68k, to which SICStus was committed, and secondarily, prospective RISC targets, especially the MIPS. The SAM was to be something to which the WAM could be compiled satisfactorily and which in turn could be compiled satisfactorily to the targets, with most of the complexity in the former step. In detail, this meant the following. Generally, native code compilation realizes WAM instructions partly instance-by-instance and partly in the kernel, i.e., partly in-line and partly in subroutines. This was true in the old scheme, and the rationale for it remains valid in the new scheme. The SAM was to be such that all the in-line parts could be in SAM code. The SAM was also to be used for the kernel, but it did not need to be complete for the kernel. Anything SAM code could not do could be done using inclusions of assembly language. As long as most of the kernel were in SAM code, a few assembly inclusions would not impair portability greatly. Omission from the SAM of capabilities needed only in the kernel would simplify the post-SAM compilation phases (sam_riss, sam_68k, etc.).
[Figure 3: The SAM Instruction Set Architecture. Only the following notes to the instruction table survive.]
• st stores S1 to the memory location whose address is R plus S2.
• Synthetic forms: ld(R1,R2) ...
• Should sensitivity be desired, it is suggested that the instruction set be extended with addt and subt that trap on signed integer overflow.
• The behavior of sll, srl, and sra for a negative shift count or a shift count greater than the datum size is undefined.
• Op denotes one of {add, sub, and, or, ...
• C denotes one of {e, ne, l, le, g, ge, lu, leu, gu, geu}, which signify the usual signed and unsigned integer comparisons.
• br(C,R,S,L) branches to L when R C S; the order is the usual one.
• jmpl puts the address of the instruction after itself in link.
• In the synthetic forms, C denotes one of {e, ne, l, le, g, ge}.
• Cz denotes the concatenation of C and z.
wam_sam uses all but a few of the features in Figure 3, and these are sufficient for it; assembly inclusions are unnecessary.8 The kernel uses the remaining features. However, the SAM is not complete for the kernel. E.g., the kernel contains one instance of integer multiplication and two of integer division. Neither operation is needed in-line. The simplicity gained by omitting them from the SAM is significant, since different targets implement them in quite different ways.
Compiling the SAM to any target involves bounding the register set, and it usually involves bounding immediate sizes. Both tasks require that one or more temporary registers be set aside for the post-SAM compilation phases. E.g., if the target is load/store, and the destination of a SAM instruction is a SAM register that is mapped to a target memory location, then the corresponding target instruction must write to a temporary register, which is in turn stored to the memory location. For RISC targets, the other major issue is instruction scheduling. The SAM is otherwise enough like the SPARC and MIPS that it compiles to them easily.
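The register-bounding step can be sketched as a rewrite over three-operand instructions. The code below is a simplified illustration, not the actual post-SAM phase: the function name, the temporaries t0 and t1, the frame register fp, and the slot numbering are all hypothetical, and immediates are passed through untouched.

```python
def bound_registers(code, in_register, temps=("t0", "t1"), frame="fp"):
    """Rewrite (op, src1, src2, dst) so that only target registers are named.

    `in_register` maps an abstract register to a target register; any other
    register is given a memory slot addressed relative to `frame`.
    """
    slots = {}                                   # spilled register -> slot index

    def slot(reg):
        return slots.setdefault(reg, len(slots))

    out = []
    for op, src1, src2, dst in code:
        sources = []
        for temp, src in zip(temps, (src1, src2)):
            if isinstance(src, int):             # immediate: left alone here
                sources.append(src)
            elif src in in_register:
                sources.append(in_register[src])
            else:                                # load the spilled source first
                out.append(("ld", frame, slot(src), temp))
                sources.append(temp)
        if dst in in_register:
            out.append((op, sources[0], sources[1], in_register[dst]))
        else:                                    # write a temporary, then store it
            out.append((op, sources[0], sources[1], temps[0]))
            out.append(("st", temps[0], frame, slot(dst)))
    return out


print(bound_registers([("add", "x(0)", "x(9)", "x(9)")], {"x(0)": "r1"}))
```

One instruction whose operands live in memory thus expands to a load, the operation on temporaries, and a store, which is why the temporaries must be reserved in advance.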
Compiling the SAM to the 68k is not so easy. Though the 68k is obsolete, it is instructive to describe how the new scheme treats it, as a target unlike the SAM in ways that the SPARC and MIPS are not. The 68k has two-operand arithmetic instructions, two register classes, and assorted memory addressing modes, many usable in arithmetic instructions. Problems with the first two characteristics are largely avoided by judicious assignments of SAM registers to 68k registers and otherwise solved, for the most part, by setting aside one address register for sam_68k. The third characteristic makes it possible to avoid using a temporary register in the 68k for many SAM instructions that require one in other targets. However, the postincrement and predecrement modes, which are valuable, e.g., for building terms on the global stack, are incongruous with the SAM as presented so far. To exploit them, the SAM itself is conditionally extended with postincrement and/or predecrement loads and stores. Conceptually, the SAM is broadened from a single abstract machine to an abstract machine schema, instances of which are specified by parameters abstracted from targets. This notion recurs in the RISS.
Based on knowledge of the old scheme, experiments in compiling WAM to SAM and SAM to targets, and discussions with colleagues, I formulated an initial design for the SAM.9 This design evolved over the course of the project, as it became clearer to me what was necessary and comfortable. E.g., the initial design had SPARC-like condition codes. However, I found that I had to reset them immediately prior to nearly every branch, i.e., I was hardly ever able to do multiple branches on a single compare or test. Moreover, they were somewhat awkward to implement in full generality for targets other than the SPARC. Therefore, I scrapped them in favor of self-contained branches.
wam_sam
wam_sam is actually two modules, one for predicate code (indexing instructions) and one for clause code (everything else). wam_sam_predicate is a one-pass macro-expansion. wam_sam_clause is likewise, plus peephole optimizations to get rid of superfluous jumps. The WAM x registers map to SAM registers, along with the usual WAM pointer registers (h, s, etc.). Also, wam_sam introduces u registers to serve as temporaries. Three suffice for wam_sam; a fourth is used in the kernel. These are managed "by hand" per WAM instruction.
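One kind of superfluous jump that macro-expansion tends to leave behind is a jump to the label that immediately follows it. The sketch below shows a peephole pass for that single case; it is an assumption about what such an optimization might look like, with a hypothetical tuple encoding, and does not reproduce the optimizations wam_sam_clause actually performs.

```python
def drop_superfluous_jumps(code):
    """Remove each jmp whose target label is the very next instruction."""
    out = []
    for i, ins in enumerate(code):
        nxt = code[i + 1] if i + 1 < len(code) else None
        if ins[0] == "jmp" and nxt == ("label", ins[1]):
            continue                  # control falls through to the label anyway
        out.append(ins)
    return out


code = [
    ("jmp", "L1"),                    # superfluous: L1 follows immediately
    ("label", "L1"),
    ("mov", "x(1)", "u(0)"),
    ("jmp", "L2"),                    # needed: skips the next instruction
    ("mov", "x(2)", "u(0)"),
    ("label", "L2"),
]
print(drop_superfluous_jumps(code))
```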
The most interesting aspect of wam_sam is how it compiles unification of a variable with a compound term. Compound term unification in the context of the WAM has attracted a lot of attention (e.g., [Mei90, Umr90, Mar91]). See the literature for general analyses; I shall merely present the strategy. pCwam generates get_* and unify_* largely as in the original WAM, and wam_sam expands these to SAM instructions realizing a version of what has become known as the two-stream algorithm [Van94]. There are two code streams, one for read mode and one for write mode, with branches between them. The compound term, regarded as a tree with a node per compound subterm, is traversed, unifying with the term that was represented by the variable at compile time. As long as this is nonvariable, execution proceeds in the read-mode stream. When a variable is encountered where a compound subterm is to be, execution branches to the write-mode stream, which builds the compound subterm. Afterward, execution returns to the read-mode stream. Each subterm corresponds to exactly one sequence of instructions in each stream, so code size is linear in term size. Write mode is propagated to subterms. The order of traversal is the traditional depth-first; subterms are not reordered. Given this order, an economical branch apparatus is generated.
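The read/write distinction can be illustrated with an interpretive sketch. This is not the generated two-stream code: it is a pair of Python functions that mimic the two modes, under the assumptions that compound terms are tuples (functor first), that the compile-time term contains no variables, and that there is no trail (so bindings made before a failure are not undone). All names are hypothetical.

```python
class Var:
    """A run-time logic variable; `ref` is None while it is unbound."""
    def __init__(self):
        self.ref = None


def deref(term):
    """Follow variable bindings to the term they finally denote."""
    while isinstance(term, Var) and term.ref is not None:
        term = term.ref
    return term


def write_mode(pattern):
    """Write stream: build the term; every subterm is also built in write mode."""
    if isinstance(pattern, tuple):
        functor, *args = pattern
        return (functor, *[write_mode(a) for a in args])
    return pattern                          # a constant


def read_mode(pattern, term):
    """Read stream: match `pattern` against a term that already exists."""
    term = deref(term)
    if isinstance(term, Var):               # unbound: switch to the write stream,
        term.ref = write_mode(pattern)      # then resume reading at the caller
        return True
    if isinstance(pattern, tuple):
        if not (isinstance(term, tuple) and term[0] == pattern[0]
                and len(term) == len(pattern)):
            return False
        return all(read_mode(p, t) for p, t in zip(pattern[1:], term[1:]))
    return pattern == term


x = Var()
ok = read_mode(("f", ("g", 1), ("h", 2)), ("f", ("g", 1), x))
print(ok, deref(x))
```

In the example, the first argument is checked in read mode, while the unbound second argument triggers a single excursion into write mode that builds h(2) without any further tests.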
The strategy is best explained by an example; Figure 4 presents one, which also appears in [Mar91] . WAM and SAM instructions are represented essentially as in wam_sam. Refinements associated with indexing are omitted, and other items irrelevant to the example are omitted from the SAM code. The SAM code is for a target other than the 68k; for the 68k, postincrement loads on s and stores on h are generated. I shall not explain the example exhaustively; the highlights are as follows:
• unify_temp_variable / get_temp_structure: These propagate write mode to compound subterms. unify_temp_variable(X) is identical to unify_variable(X), except that it tells wam_sam that the argument is a compound subterm; there will be one use of X. get_temp_structure(F,X) is identical to get_structure(F,X), except that it tells wam_sam that the structure is a subterm and X was defined by unify_temp_variable(X); this is the one use of X. In read mode, the extra information makes no difference, but in write mode, it enables wam_sam to avoid first creating X unbound, then dereferencing, trail-checking, and binding it. This is the only change the strategy imposes on [...]
• [...] WRETURN_RRETURN_OFFSET is a target-dependent parameter. A write-mode return is to the point immediately after the call; a read-mode return is to that point plus WRETURN_RRETURN_OFFSET. Guided by fix_offset pseudo-ops, the assemblers (*_bin) insert no-ops if necessary to make this work.10 E.g., the write-mode return from the NATIVE_GET_STRUCTURE call is local(A), and the read-mode return is local(B); the fix_offset pseudo-op separates these by WRETURN_RRETURN_OFFSET.
This strategy resembles that of [Mar91]. However, it was invented by Kent Boortz for SICStus. The algorithm wam_sam uses to generate the SAM code may be formalized in terms of a push-down automaton. I shall not present this here; interested readers should contact me.
In Figure 4, mov(x(2),u(0)) under local(B) is superfluous, since ld(s,word(1),x(2)) preceding it could as well be ld(s,word(1),u(0)). An optimization pass could deal with this, of course. However, early in this project, I tried something more ambitious. x(2) is introduced here by pCwam and u(0) by wam_sam. If instead of assigning it to x(2), pCwam would let the argument of the first unify_temp_variable be a virtual register, then it could be assigned to u(0) by a register allocation at the SAM level. I implemented such a register allocation, using a graph coloring algorithm. However, though it certainly dealt with cases like the one considered here, the benefit was disappointing for most programs. Typical speed-ups were less than two percent. Moreover, compilation time typically increased significantly. The texture of the SAM code is one reason for the lackluster results. Most computations must survive kernel calls; cases like the one considered here are relatively uncommon. Most kernel calls may define the u registers. Consequently, most computations ended up in x registers, where pCwam would have put them anyway. It follows that an implementation that did more in-line would probably benefit more from such a register allocation.
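The paper does not say which graph coloring algorithm was used. As a stand-in, the sketch below shows the simplest greedy variant: virtual registers that are live at the same time interfere, and each is given the first physical register not already taken by a neighbor. The function name, the interference graph, and the register names are hypothetical.

```python
def color_registers(interference, registers):
    """Greedy graph coloring over an interference graph.

    `interference` maps each virtual register to the set of virtual registers
    live at the same time. A result of None means no register was free.
    """
    assignment = {}
    # Color the most constrained virtual registers first.
    order = sorted(interference, key=lambda v: len(interference[v]), reverse=True)
    for v in order:
        taken = {assignment[n] for n in interference[v] if n in assignment}
        free = [r for r in registers if r not in taken]
        assignment[v] = free[0] if free else None
    return assignment


graph = {
    "v1": {"v2", "v3"},      # v1 overlaps both others
    "v2": {"v1"},
    "v3": {"v1"},            # v2 and v3 never overlap, so they may share
}
print(color_registers(graph, ["u(0)", "u(1)"]))
```

If a kernel call defines every u register, any virtual register live across that call interferes with all of them, so none is free; this matches the observation that most computations ended up in x registers.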
Two software engineering techniques used in wam_sam may be of interest. First, target dependencies, e.g., whether to generate postincrement loads on s, are handled using the m4 macro preprocessor. m4 is run on a configuration file that defines macros appropriate to the target, e.g., POSTINC_S for the 68k, followed by the source code of wam_sam, yielding true Prolog. Thus wam_sam wastes no time determining dependencies when it runs. Other global parameters, e.g., WRETURN_RRETURN_OFFSET, are neatly handled in this fashion too. Second, concurrent accumulation of multiple code streams, e.g., for compound term unifications, is handled using an extended definite clause grammar syntax. The syntax and a version of term_expansion/2 that interprets it were developed by Peter Van Roy [Van89]. These techniques are used in wam_sam and throughout the new scheme.
The Kernel
Kernel routines belong to four classes. Unification routines encompass not only general unification but also the meat of get_constant, get_list, and get_structure. Nondeterminism routines include failure and try, retry, and trust routines that manage choicepoints. Built-in routines support common built-in predicates for which pCwam generates a builtin_* or function_* instruction instead of a standard call; most of these routines pertain to arithmetic expressions, but there are also routines for term comparison, arg/3, and functor/3. Finally, linkage routines handle transfers between native code and emulated code, garbage collection, etc.
The primary purpose of the kernel is to moderate code size; this issue is significant, because SICStus is supposed to run real applications. A secondary purpose is to simplify the SAM and hence the post-SAM compilation phases. Calling a kernel routine costs something, so these purposes are in tension with the goal of high performance. Deciding what to put in-line and what to put in the kernel is a matter of judgment, preferably informed by experiments. The goals of this project include making experiments easier by making it easier to evaluate the consequences for all targets of a change in the division between the kernel and code compiled from Prolog. E.g., arithmetic operations are always handled in the kernel at present; it might be interesting to try handling in-line the common case of dereferenced small integer operands.
Within the kernel, common and simple cases are handled in SAM code, with assistance from assembly inclusions as needed. Other cases are passed to routines written in C; the same routines handle these cases for emulated code. Error reporting is deferred to these routines too. E.g., many built-in routines pertaining to arithmetic expressions check whether their arguments dereference to small integers. If so, the appropriate operation is performed immediately. If not, a C routine is called to handle large integer or floating point arguments or report an error.
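The division of labor between a fast path and a shared general routine can be sketched as follows. Everything here is an assumption made for illustration: the small-integer range, the function names, and the rule that a result outside the small range also falls back to the general routine are not taken from SICStus.

```python
SMALL_MIN, SMALL_MAX = -(2 ** 25), 2 ** 25 - 1    # hypothetical small-integer range


def is_small(x):
    """True for integers representable in the (hypothetical) tagged format."""
    return isinstance(x, int) and not isinstance(x, bool) and SMALL_MIN <= x <= SMALL_MAX


def general_add(a, b):
    """Stand-in for the shared C routine: large integers, floats, and errors."""
    numeric = (int, float)
    if isinstance(a, bool) or isinstance(b, bool) or \
            not (isinstance(a, numeric) and isinstance(b, numeric)):
        raise TypeError("arithmetic on a non-number")
    return a + b


def kernel_add(a, b):
    """Fast path for small integers; everything else is deferred."""
    if is_small(a) and is_small(b):
        result = a + b
        if is_small(result):                      # assumed overflow guard
            return result
    return general_add(a, b)


print(kernel_add(2, 3), kernel_add(2, 0.5))
```

Keeping error reporting in the general routine means the fast path contains only a type test and the operation itself.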
The SAM-to-target phases of the kernel compilation path are the same as those of the Prolog compilation path. This is good for portability. Of course, minor differences could be parameterized, but this happens to be unnecessary for any target so far.
[Figure 4. Only the following fragment of the SAM code survives.]
label(local(A)).   % [enter write mode]
add(h,word(7),s).
jmp(local(H)).
label(local(B)).
% get_temp_structure(h/1,x(1))
mov(functor(h/1),u(1)).
mov(x(1),u(0)).
jmpl(kernel(NATIVE_GET_SUBSTRUCTURE)).
fix_offset(local(F),local(G),word(WRETURN_RRETURN_OFFSET)).
[...] to some extent a relic of the old scheme, where the kernel was written directly in assembly language. At the cost of complicating *_bin, the kernel could be assembled like code compiled from Prolog. Assembly would occur at installation time and loading at run time; much of SICStus is already processed this way. I have not pursued this possibility, but it might be worthwhile. It would avoid the quirks of some standard assemblers.11
The RISS
Like the SAM, the RISS is an abstract machine schema. The parameters specifying an instance are meant to make the RISS correspond closely to the RISC target. Generally, the RISS differs from the SAM as follows:
• The number of registers is specified. A proper subset of SAM registers map to RISS registers, and the rest to memory locations.
• Immediates are size-restricted. The restrictions may be instruction-dependent. SAM immediates map to RISS immediates if they are small enough. If not, it is necessary to construct them in RISS registers, or maybe to map them to RISS registers, if they are heavily used.12 Construction may be done in typical RISC fashion. The word is split into high and low parts, the low part being small enough for any instruction. The RISS has an instruction sethi(I,R) that assigns immediate I to the high part of register R.13
• Control transfers have delay slots. The instruction in the delay slot, known as the delay instruction, of a jmp, jmpl, or jmpr is always executed. By default, this applies to br too; however, the RISS may be extended with annulling branches. When this is done, branches have the form br(A,C,R,S,L), where A is one of {n, y}. If A is n, the delay instruction is always executed; but if A is y, the delay instruction is not executed when the branch is not taken. jmpl puts the address of the instruction after its delay instruction in link. The set of instructions that may be delay instructions is a parameter of the RISS.
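The high/low construction of an oversized immediate is simple bit arithmetic. The sketch below assumes a 32-bit word and takes the width of the low part as a parameter; a low part of 10 bits corresponds to SPARC sethi, and 16 bits to MIPS lui. The function names and the tuple encoding of the two instructions are hypothetical.

```python
def split_immediate(value, low_bits):
    """Split a 32-bit word into (high part, low part)."""
    word = value & 0xFFFFFFFF
    return word >> low_bits, word & ((1 << low_bits) - 1)


def construct_immediate(value, reg, low_bits):
    """Emit sethi for the high part, then or in the low part if it is nonzero."""
    high, low = split_immediate(value, low_bits)
    code = [("sethi", high, reg)]
    if low:
        code.append(("or", reg, low, reg))
    return code


def simulate(code, low_bits):
    """Check the construction by executing the two toy instructions."""
    regs = {}
    for ins in code:
        if ins[0] == "sethi":
            regs[ins[2]] = ins[1] << low_bits     # low part is cleared
        else:                                     # ("or", src, imm, dst)
            regs[ins[3]] = regs[ins[1]] | ins[2]
    return regs


code = construct_immediate(0x12345678, "r", 10)
print(code, hex(simulate(code, 10)["r"]))
```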
For some targets, there may be other differences. E.g., for the SPARC and MIPS, the source of a st must be a RISS register, since these architectures do not have a store with an immediate source.
The purpose of this design is to make the mapping of RISS instructions to target instructions as one-to-one as possible; hence RISS registers should map to target registers, RISS high and low part splitting should be as in the target, and the RISS should have annulling branches if and only if the target has them. The mapping is not perfectly one-to-one, due to things like branches that different targets realize in different ways, usually involving multiple instructions. However, it is close enough to make it reasonable to perform instruction scheduling at the RISS level. This is the most complex aspect of sam_riss; if it were done at the target level, it would be the most complex aspect of the target-level phases.
sam_riss
sam_riss has two passes. The first bounds the register set and immediate sizes, and the second performs instruction scheduling, which consists of introducing a delay slot per control transfer and trying to fill it with something useful. Other kinds of instruction scheduling might be worthwhile, e.g., separating a load to a register from a following instruction using the register; at present, however, sam_riss merely fills delay slots. There are two obvious places to look for something useful: first, prior to the control transfer in the basic block it terminates; second, in the basic block commencing with the target of the control transfer. sam_riss tries both, trying the second if the first fails to find something useful, the control transfer is a jmp, jmpl, or annulling br, and the target is local. The procedure is prolix to explain but intuitively straightforward; two aspects of it merit further discussion. First, a RISS instruction is suitable for a delay slot if and only if it is context-independent and compiles to exactly one target instruction. E.g., jmpl is excluded because it is context-dependent, and br is likely to be excluded because it compiles to a pair of target instructions. sam_riss is guided by a predicate defined per target that succeeds if and only if its single argument is a RISS instruction suitable for a delay slot. Second, copying an instruction from a basic block to the delay slot of a control transfer to that basic block leads to multiple copies of the instruction. It may be desirable in the post-RISS phases for all copies to be linked together somehow. It is easy for sam_riss to supply a link in the form of an unbound variable attached to all copies of a given instruction. Section 3.6 explains how riss_sparc uses this "tag".
11 E.g., MIPS as fails to provide the primitive div instruction of the basic machine; the synthetic instruction provided in its place performs a 32-bit overflow check, which is superfluous in the kernel.
12 E.g., for the SPARC and MIPS, sam_riss maps the structure tag mask to a RISS register.
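The first of the two search places, the instruction just before the control transfer, can be sketched as below. This is a deliberately reduced illustration: it considers only the immediately preceding instruction, ignores the second search place and annulling branches, does not model link, and uses a hypothetical tuple encoding in which the destination register is the last operand. The `suitable` argument plays the role of the per-target predicate.

```python
CONTROL = {"jmp", "jmpl", "br"}


def defined(ins):
    """Register written by a toy instruction (destination last), else None."""
    return ins[-1] if ins[0] in ("mov", "add", "sub", "ld") else None


def used_by_transfer(ins):
    """Registers read by a control transfer; only br(C, R, S, L) reads any."""
    if ins[0] != "br":
        return set()
    return {x for x in ins[2:4] if isinstance(x, str)}


def fill_delay_slots(code, suitable):
    """Give each control transfer a delay slot, filled from just above if safe."""
    out = []
    prev_movable = False          # is out[-1] an ordinary instruction of this block?
    for ins in code:
        if ins[0] in CONTROL:
            prev = out[-1] if prev_movable else None
            if (prev is not None and suitable(prev)
                    and defined(prev) not in used_by_transfer(ins)):
                out.pop()
                out += [ins, prev]            # prev now sits in the delay slot
            else:
                out += [ins, ("nop",)]        # nothing useful found
            prev_movable = False
        else:
            out.append(ins)
            prev_movable = ins[0] != "label"
    return out


code = [
    ("add", "a", 1, "a"),
    ("ld", "s", 0, "b"),
    ("br", "e", "a", 0, "L"),     # reads a, so the ld (which writes b) may move
    ("label", "L"),
    ("jmp", "M"),                 # nothing precedes it in its block
]
print(fill_delay_slots(code, suitable=lambda ins: True))
```

The instruction moved into the slot must not change anything the branch tests, which is why the add that defines a would stay put if it were the only candidate.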
3.6 riss_sparc
riss_sparc is an instructive example of what final compilation to a target may involve. It has two passes, a macro-expansion and a peephole optimization. By design, the RISS code it compiles is much like SPARC code to begin with, but a couple of issues remain to be resolved, both pertaining to control transfers.
One issue is compiling RISS jmp to a nonlocal label.14 The available single-instruction control transfer to an unrestricted label is SPARC call. This implicitly defines o(7) and is the natural choice for RISS jmpl, mapping link to o(7). It would be nice to use it for RISS jmp to a nonlocal label too, in order to use just one instruction. Of course, this is safe only if link is dead at the jmp. In SICStus usage, this requirement happens to be satisfied most of the time: jmpl is used to call the kernel, so link is usually live only in the kernel. riss_sparc takes advantage of this, assuming link is dead at jmp to a nonlocal label, except in certain contexts, which it recognizes. This is satisfactory but nongeneric, i.e., it depends on SICStus usage.
The other issue is using the SPARC condition codes. riss_sparc macro-expansion compiles RISS br to SPARC cmp or tst followed by bicc (branch on integer condition code). In an instance of tst / bicc, the tst is usually superfluous if the preceding instruction is an arithmetic that defines the operand of the tst, since most arithmetics can set the condition codes as a side-effect. The main task of riss_sparc peephole optimization is to take advantage of this. It is simple, apart from one complication: what if the arithmetic is the first instruction in a basic block and has been copied to one or more delay slots by sam_riss? Then each copy must set the condition codes. The tags supplied by sam_riss offer a means of arranging this. riss_sparc uses the tag of an arithmetic to specify whether the arithmetic should set the condition codes. Fixing the tag in one copy fixes it in all copies, since all copies share the tag. Of course, the problem could be solved by having all arithmetics set the condition codes, but this seems clumsy. The solution adopted is elegant.
4 Evaluation
I shall focus on the SPARC and MIPS in preference to the antiquated 68k. Results for the 68k are qualitatively consistent with results for the SPARC.
Portability
The basic goal of the new scheme relative to the old is portability enhancement. Portability is resistant to quantification, but some measures suggest that the new scheme achieves significant enhancement. Table 1 presents the sizes of the post-WAM compilation phases, in lines of source code, excluding comments and blanks. According to this admittedly crude measure, most of the complexity is in wam_sam, as desired. Table 2 presents the amounts of SAM and assembly code in the kernel, again in lines of source code, excluding comments and blanks.15 Since almost all target dependencies in the kernel are in the form of assembly inclusions, this measure implies a substantial degree of target-independence for the kernel. In both respects, the new scheme improves upon the old.
14 RISS jmp to a local label compiles to SPARC ba (branch always).
15 The amount of MIPS assembly code is inflated by quirks of MIPS as.
A less objective but maybe more meaningful indicator is my experience with extending the new scheme to the MIPS. I spent a bit over a week roughing in working versions of riss_mips, mips_bin, and mips_as and dealing with target dependencies in wam_sam, the kernel, and sam_riss, as well as other parts of SICStus like the loader. Then I spent a bit under a week on debugging and polishing specific to the MIPS. Thus the total was about two weeks. I believe it would have taken considerably longer to extend the old scheme. However, three factors must be taken into account. One is that, as the designer of the new scheme, I know it well; this surely helped me in extending it. Another is that I designed it with the MIPS in mind; it might not accommodate even another RISC target as easily. Finally, no optimizations specific to the MIPS are performed; I believe there are some that would be worthwhile.
To mitigate the knowledge factor, I kept a log of my MIPS experience, and I have added to it general explanations and advice, so that it amounts to a rudimentary "porting manual". Undoubtedly, the best test of portability would be for someone other than myself to extend the new scheme to a new target. Unfortunately, this has not been possible yet, but the SICStus group hopes it may be soon.
Performance
Benchmarking the new scheme versus the old for a target they share, the SPARC, reveals how well the new scheme measures up. For the new target, the MIPS, benchmarking native versus emulated code reveals how well the new scheme realizes the potential of native code. Not only performance but also compilation time and space are of interest. The results appear in Tables 3 and 4.
Table 3 compares the old scheme to the new for the SPARC. The platform is a Sun 10/30 with 32 megabytes of memory. In all cases, compile time is longer for the new scheme, typically about half again as long. This is reasonable, since the new scheme has more passes.18 As would be expected, the lengthening is approximately linear in the compile size. In all but one case, compile size is somewhat larger for the new scheme. I believe this is largely due to changes in the division between the kernel and code compiled from Prolog. On the whole, the new scheme puts somewhat more in-line. In all cases, run time is slightly shorter for the new scheme. My impression is that slightly faster first argument indexing and dispatching in the kernel account for most of this. It seems that enhancing portability has not diminished performance.
Table 4 compares emulated to native code for the MIPS. The platform is a DECstation 5000/240 with 48 megabytes of memory. Emulated code is measured in the standard version of SICStus and native code in my version. In the standard version, the compilation phases, along with many built-in predicates, are in emulated code, but in my version, they are in native code; this is a "hidden" benefit of native code. In Table 4, compile time is about twice as long and compile size about twice as large for native code, well within the factors of three mentioned in section 2. Run time is shorter for native code by factors of between 1.9 (quicksort) and 2.8 (query), a satisfactory reduction. However, I have made this comparison for the SPARC too. Apart from query, which compiles atypically well for the MIPS, code expansions are 5-10% less and speed-ups 5-20% more for the SPARC. This suggests a need for MIPS-oriented code improvements. More sophisticated instruction scheduling might be especially helpful. Measurements show that sam_riss fills more branch delay slots for the SPARC, which gives the RISS annulling branches, than for the MIPS, which does not. Giving the RISS load delay slots would also be desirable for the MIPS.
Table 5 corresponds to Table 3 and Table 6 to Table 4, with the difference that garbage collection and other memory overflow handling is nontrivial; it is included in run time but also tabulated as memory time. Space does not permit a detailed discussion of these statistics. I shall limit myself to noting that, for these more-or-less real applications19, native code compilation is worthwhile, but generally not as much so as for the Warren benchmarks, not surprisingly. The speed-ups for MIPS native versus emulated code, excluding memory time, are 112% for bamspec, 52% for freplan, and 32% for mixtus; the speed-up for bamspec is the greatest I have seen for a substantial program.
Future Work
There are two obvious directions for future work: more targets and more optimizations. The new scheme should be extended to several more targets. Candidates include the newly-introduced PowerPC, the HP PA-RISC, and the DEC Alpha. The SICStus group hopes to undertake at least one of these during the next few months. Beyond improving instruction scheduling, there are two categories encompassing many potential improvements. First in importance are changes in memory strategy. SICStus has a number of frequently-referenced control words that at present are independent global variables, e.g., the global stack limit word, which is the trigger for garbage collection. A reference to one of these costs two instructions in the RISS or a RISC target: sethi followed by ld or st. These should at least be put in a table, where they could be referenced using base-displacement loads and stores. A suitable table already exists in SICStus. Maybe some of these or other items in the table, e.g., the WAM hb register, should be mapped to registers for some native code targets. The other category of improvements is changes in the distribution of labor between the kernel and code generated by wam_sam. There is scope for this, though care is necessary to keep code size under control.
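The instruction-count difference between the two ways of referencing a control word can be illustrated with a small sketch that emits SPARC-style loads. It is not SICStus code: the function names, the symbol name, the table base register `%g5`, and the offsets are hypothetical; the displacement range reflects SPARC's 13-bit signed immediate.

```python
def load_global(symbol, dest):
    """Load an independent global variable: sethi builds the high bits of
    the address, then ld adds the low bits. Two instructions."""
    return [f"sethi %hi({symbol}), {dest}",
            f"ld [{dest} + %lo({symbol})], {dest}"]


def load_from_table(offset, dest, base="%g5"):
    """Load a table entry through a base register assumed to hold the
    table address (hypothetical choice of %g5). One instruction."""
    if not -4096 <= offset <= 4095:
        # A SPARC load displacement must fit in 13 signed bits.
        raise ValueError("offset does not fit a 13-bit signed displacement")
    return [f"ld [{base} + {offset}], {dest}"]
```

Under these assumptions, every reference moved from an independent global into the table saves one instruction, provided a register can be dedicated to the table base.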
The new scheme makes it easier than before to experiment with these and other possibilities. Some of them seem likely to yield significant benefits. The next release of SICStus, containing the new scheme for native code compilation, should improve upon the performance reported in this paper.
