MARS: A RISC-based Architecture for LISP
Hung-Chang Lee, Feipei Lai, Jenn-Yuan Tsai, Tai-Ming Parng, and Yu-Fang Li 
Department of Electrical Engineering
National Taiwan University
Taipei, Taiwan, R.O.C.
Tel: 886-2-3635251 ext. 241
Abstract 
A RISC-based chip-set architecture for Lisp is presented
in this paper. The architecture contains an instruction fetch unit
(IFU) and three processing units: an integer processing unit
(IPU), a floating-point processing unit (FPU), and a list
processing unit (LPU). The IFU feeds instructions to the
processing units and provides a branch-handling mechanism to
reduce the branch penalty; the IPU is optimized for integer
operations, string manipulation, operand address calculation,
and the coordination tasks needed to construct a multiprocessor
architecture; the FPU handles floating-point data types and
conforms to IEEE Standard 754; and the LPU handles the
Lisp runtime environment, dynamic type checking, and fast list
access. In this architecture, the critical path of complex
register file access followed by an ALU operation is split
between the LPU and the IPU, and a list can be traversed quickly
with the non-delayed car and cdr instructions of the LPU. In
addition, a new branch control mechanism, called branch
peephole, lets the architecture achieve almost-zero-delay
branches and super-zero-delay jumps. Performance simulation
shows that this architecture would be about 4.1 times faster than
SPUR and about 2.2 times faster than MIPS-X.
I. Introduction 
Lisp, owing to its extensibility and flexibility, has gained
popularity in recent years. Nevertheless, the Lisp programming
language has some features that are difficult to implement
efficiently on conventional computers. These features include
frequent function calls, slow list traversal, the scoping of
special variables, polymorphic operations, and automatic
garbage cell recovery [1,2,6].
Lisp machines, according to Pleszkun's [1] classification,
can be divided into three classes. The first class consists of
unspecialized stack-based microcoded Lisp processors (e.g.,
Symbolics 3600 [2], Lambda [3]). The second consists of
multiprocessor architectures in which each processor serves a
specialized function (e.g., Fairchild FAIM-1 [4]). The third
consists of multiprocessor systems composed of pools of
identical processing elements, aiming for high performance
through concurrent evaluation of different parts of a Lisp
program on separate processors (e.g., EM-3 [5]). A more recent
class of Lisp machine is the RISC-like architecture, either with
some enhancements to support Lisp, such as SPUR [6], or relying
on the compiler to reduce hardware complexity, such as
MIPS-X [7-9].
A limited instruction set suitable for Lisp execution is
presented, and a RISC-based architecture is designed around this
instruction set. The architectural model of MARS has three
parts. The first is the Lisp environment administrator and list
traversal unit. The second is the general computational unit,
and the final one is the instruction feed and control transfer
unit. Each part has its own chip to carry out its task. These
synchronized chips take advantage of instruction format
parallelism and execute in parallel whenever possible.
The next section gives an overview of the system. This is
followed by a description of the micro-architecture and
instruction pipeline of MARS in Sections III and IV. Section V
follows with a performance evaluation using simulation
tools. The conclusion and status are stated in the last section.
II. System Overview
MARS [19] is a VLSI processor board for Lisp
processing. Each board, shown in figure 1, contains the CPU
chips, i.e., the IFU (Instruction Fetch Unit) and the IPU
(Integer Processing Unit), as well as the special chips, the FPU
(Floating-point Processing Unit) and the LPU (List Processing
Unit). Each processor board has separate instruction, address,
and data buses.
Figure 1. MARS board-level block diagram
The IFU, built on a single chip together with a 32-kilobyte
instruction cache, is the buffering and controlling mechanism
between the instruction cache and the datapath chips (IPU,
FPU, LPU). It is designed to interleave instruction fetch and
execution and to coordinate execution among the IPU, FPU,
and LPU. Its block diagram is given in figure 2; it contains
a remote PC (Program Counter) chain, a displacement
adder, a return address stack (RAS) that stores the PC for
call/return instruction pairs, and dual instruction buffers that
hold the sequential and branch instruction streams.
The IPU, shown in figure 3, retains the integer datapath and
part of the control of a common RISC CPU [10], performing
integer arithmetic, shift, and logical operations, and address
calculation for the data operands of all datapath chips. It
contains a flat 32-word register file, a 2-level internal
forwarding latch, and a shifter. The FPU conforms to IEEE
Standard 754. It has separate, pipelined Add/Sub and Mul/Div
units to provide spatial and temporal parallelism, a hardwired
control unit that directly supports format conversion in
hardware, a synchronous interface protocol for tight coupling
with the other chips, and a 32-word 64-bit register file.
Figure 2. The IFU block diagram
Figure 3. The IPU block diagram
The LPU, shown in figure 4, provides hardware primitives
for list processing, such as car, cdr, cons, rplaca (replace car),
and rplacd (replace cdr). It features a large windowed register
file to expedite procedure calls, a tag manipulation datapath, and
a control register for shallow binding.
The CCMMU (Cache Controller and Memory Management
Unit) is responsible for the operation of the local data cache on
each processor board, address translation, and the data
coherency protocol among processors. The local data cache will
be as large as 128 KB, and data will be heuristically prefetched
when a pointer or list is encountered. A lock-up-free cache design
and the cache coherency protocol, Phoenix, are proposed in
[IT]. 
Figure 4. The LPU block diagram
At first, we intended to adopt an integrated LPU architecture
like that of the Symbolics 3600. However, owing to the unique
configuration of MARS and some other considerations (MARS can
run conventional languages, for instance C, in a different way)
[19], an on-chip instruction cache inside the LPU would not
provide any speed advantage, since all the chips wait for
instructions coming from the instruction cache and then decode
their own instructions. Besides, some extra problems incurred by
the design of an on-chip cache have been discussed in depth in
[24-25]. Therefore, we decided to separate the instruction fetch
unit from the conventional LPU, and built the IFU to accommodate
the remote PC unit and the logic necessary for buffering and
controlling instruction access. This decision did cause some
problems, and thereby influenced our designs for the instruction
set and microarchitecture, but it gave us valuable experience in
dealing with such an arrangement, and the result still meets our
initial requirements.
III. Micro-Architecture
The instruction set of the LPU [23], shown in Table 1, was
carefully designed to speed up the execution of Lisp programs
and to reduce the time wasted on traffic between the LPU and
the IPU/FPU. To reach this goal, some instructions are
executed in parallel by the LPU, IPU, and/or FPU.
Parallel execution of the IPU and LPU happens when the IPU
executes an ALU operation and the LPU checks the tags of the
source registers, generating an exception if the data type of an
operand is neither fix number nor character.
Another form of parallel execution happens when the LPU loads
or moves data used by the IPU or FPU into a frame register. In
that case the data is written into the corresponding register of
the IPU or FPU as well as that of the LPU. With this parallel
execution, no extra instructions are needed to transfer data from
the LPU to the IPU or FPU. The LPU instructions with this kind
of parallel execution are car, pop, load, and mov.
The implemented list primitives are car, cdr,
cons, rplaca, and rplacd. The car and cdr instructions are similar
to the load instruction except that the two least significant bits
of the address are masked with 00 or 01, respectively. In the LPU
pipeline there is no delay for load instructions, so a list can
be traced quickly with cdr and car instructions. The
rplaca and rplacd instructions are similar to the store instruction
except that the address bits are masked in the same way as in car
and cdr. The cons instruction performs the action of rplaca using
global register R7 (the free cons cell pointer) as the address and
moves R7 to the destination register. A complete cons primitive
in Lisp can be done by a cons instruction followed by an rplacd
instruction to replace the cdr of this cons cell and a car
instruction to update the free cons cell pointer. All of the above
instructions are executed with parallel type checking. If the data
type of the source address register is not cons or nil, an
exception results.
A.2 Stack operation
A stack operation can be done in one instruction cycle
by the push and pop instructions of the LPU. The stack pointer
register (SP) in the LPU is automatically decreased by 2 (the
size of a double word) before a push instruction executes and
increased by 2 after a pop instruction executes. The push and pop
instructions are mainly used when binding or unbinding
special variables and when saving or restoring frame windows.
The pop instruction, like car and cdr, is a non-delayed load
instruction that can supply data to the next LPU instruction
without delay. The content of the stack pointer register can be
read or written by the rd-sp and wr-sp instructions.
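The pre-decrement and post-increment behaviour of SP can be illustrated with a small software model. This is only a sketch: the class name `LpuStack`, the dictionary used as memory, and the word-addressed view are assumptions made for illustration, not part of the MARS design.

```python
class LpuStack:
    """Toy model of the LPU stack (hypothetical; memory is word-addressed)."""

    DOUBLE_WORD = 2  # one tagged word occupies two 32-bit words

    def __init__(self, top):
        self.sp = top   # stack pointer register (SP)
        self.mem = {}   # sparse memory: word address -> tagged word

    def push(self, tagged_word):
        # SP is decreased by 2 before the store takes place.
        self.sp -= self.DOUBLE_WORD
        self.mem[self.sp] = tagged_word

    def pop(self):
        # The value is read first, then SP is increased by 2.
        value = self.mem[self.sp]
        self.sp += self.DOUBLE_WORD
        return value


stack = LpuStack(top=100)
stack.push(("fixnum", 3))
stack.push(("fixnum", 5))
print(stack.sp)     # SP has moved down by two double words
print(stack.pop())  # the most recently pushed value comes back first
```

The stack grows toward lower addresses in this model, which is what decrementing before a push implies.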
Data movement instructions include the load instruction,
which loads data from memory; the mov instruction, which moves
data from one register to another; the loadf and storef
instructions, which generate addresses for the FPU; and the f-to-l
instruction, which transfers data from the FPU to the LPU. Note
that the load and mov instructions are executed in parallel by the
IPU, FPU, and LPU when the destination is a frame register, so
the transfer of data from the LPU to the IPU or FPU can be
done by the mov instruction.
The tag value of a register can be loaded with an
immediate tag value packed in the instruction or moved from
another register. The data field of the destination register can
come from the destination register itself or from another
register, specified as the second source register in the
instruction.
In the MARS system, the comparison of two operands and
the branch are executed in one instruction cycle with zero or one
delay. The compare & branch instruction of the LPU compares two
operands under one of eight conditions encoded in a 3-bit
condition code and sends the comparison result to the other
processing units.
For function call and return instructions, jumping to the
target address and saving the program counter are done by
the IFU, while the LPU updates the pointer to the current frame
window and checks whether the control register file
overflows or underflows. If an overflow or underflow
happens, the LPU raises an exception.
Special instructions
Some special LPU instructions can be executed only in
kernel mode. The rd-lpsw and wr-lpsw instructions transfer data
between registers and the LPU processing status word, which
contains the current window pointer, the saved window number,
and some system status. The lpu-wake and lpu-sleep instructions
set the LPU to be active or inactive. When the LPU is inactive,
the MARS system acts as a general-purpose computer without
hardware support for Lisp.
B. Tagged word and data representation/storage
MARS represents a Lisp pointer or immediate datum with a
38-bit tagged word consisting of a 6-bit tag and a 32-bit pointer
or data field. The data types represented by the tag are shown in
Table 2.
There are three immediate data types: character, fix
number, and short floating-point number. The first two are used
by the IPU and the last one is used by the FPU. We decided to
represent the short floating-point number as an immediate type
because the IPU, FPU, and LPU all have a user view of a
32-word control register file. The register files are monitored
by the LPU as a frame window and saved in the LPU's frame
window when a function call executes. The 32-bit short
floating-point number has the same width as the data field of
an LPU register; therefore, it can be used as immediate
data and stored in the LPU's frame window.
The type of a datum, which is one of an immediate number,
an indirect number, or a list, is recognized by type-checking
hardware either in parallel with the operation on the datum
itself (usually under the assumption of integer type) or
by the comparison of an LPU compare & branch instruction. The
other sixteen tag values are defined by the compiler to identify
pointed-to objects such as strings, vectors, and functions.
Poor data density
The MARS architecture was designed with emphasis on speed
and simplicity rather than on data or hardware density. The
board-level data bus is 64 bits wide because one data fetch
per cycle is intended for double-precision floating point. The
memory implementation in particular has poor Lisp data density,
since a 64-bit system is used instead of a brand new 38-bit one.
Several approaches exist for handling the 38-bit data type:
(1) build the whole system with 38-bit words,
(2) allow unaligned cache access, or
(3) place 38-bit words in aligned 64-bit words.
We adopt the last one for three reasons. First,
many off-the-shelf subsystems use 32-bit words as an integral
unit. Second, unaligned cache accesses would increase the
cache cycle time and therefore decrease system
performance. Third, since MARS also runs conventional
languages (e.g., C and Fortran), a memory system with 32-bit
words as its integral unit is preferred.
MARS stores the 38-bit tagged word in memory as two
32-bit words aligned on a double-word boundary (data in the even
word and tag in the odd word), with 26 bits unused. The data bus
connecting the four processing units and the cache memory is
64 bits wide: 32 bits are needed by the IFU and IPU, 64 bits by
the FPU, and 38 bits by the LPU. This means that one or two
32-bit words can be loaded or stored in one memory cycle.
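The storage format can be made concrete with a short sketch. It assumes, for illustration only, that the 6-bit tag sits in the low-order bits of the odd word; the text states which word holds the tag but not where the bits lie within it. The function names are hypothetical.

```python
TAG_BITS = 6
DATA_BITS = 32
WORD_BITS = 32

# Two 32-bit words hold one 38-bit tagged word, so 64 - 38 = 26 bits stay unused.
UNUSED_BITS = 2 * WORD_BITS - (TAG_BITS + DATA_BITS)


def pack_tagged_word(tag, data):
    """Return (even_word, odd_word): data in the even word, tag in the odd word."""
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("tag must fit in 6 bits")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data must fit in 32 bits")
    # Assumption: the tag occupies the low 6 bits of the odd word.
    return data, tag


def unpack_tagged_word(even_word, odd_word):
    """Recover (tag, data) from the double word."""
    tag = odd_word & ((1 << TAG_BITS) - 1)
    data = even_word & ((1 << DATA_BITS) - 1)
    return tag, data


even, odd = pack_tagged_word(tag=0b000101, data=0xDEADBEEF)
print(UNUSED_BITS)
print(unpack_tagged_word(even, odd))
```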
A list cons cell, consisting of the car's tagged word and the
cdr's tagged word, is represented by four contiguous 32-bit words
aligned on a quad-word boundary in memory. The first double
word contains the car's tagged word and the second contains the
cdr's tagged word. We can access the car's tagged word of a cons
cell by setting the four least significant bits of the cell's
address to 0000, and the cdr's tagged word by setting these four
bits to 1000.
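The address manipulation described for a cons cell amounts to simple bit masking. The sketch below assumes byte addressing, so that a quad-word cell spans 16 bytes; the function names are hypothetical.

```python
LOW_FOUR_BITS = 0xF  # a quad-word aligned cell spans 16 byte addresses


def car_address(cell_address):
    """Force the four least significant bits to 0000."""
    return (cell_address & ~LOW_FOUR_BITS) | 0b0000


def cdr_address(cell_address):
    """Force the four least significant bits to 1000."""
    return (cell_address & ~LOW_FOUR_BITS) | 0b1000


pointer = 0x1234  # any address that falls inside the cell
print(hex(car_address(pointer)))
print(hex(cdr_address(pointer)))
```

Because the low bits are overwritten rather than added to, both accesses work from any address inside the cell without a separate alignment step.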
C. Register organization
The register organization in the LPU plays the role of
runtime environment administrator. In Lisp, arguments,
local variables, and special variables are accessed frequently.
These variables are allocated in the register file and maintained
by a fast scheme described later. There are two kinds of register
files in the LPU: the control register file and the
binding register file. The control register file, organized as an
overlapping frame window structure, is used to keep the
activation records of callers and callees. The 32-word register
files of the IPU and LPU are mapped to one of the frame windows,
so the user can view the control register file as a 32-word
register frame window whose data may be integers (in the IPU),
floating-point numbers (in the FPU), or pointers (in the LPU). The
binding register file provides the dynamic scope of
special variables and keeps their binding values. We use
a shallow binding scheme to bind and restore special
variables.
Control register file
The control register file is organized as multiple,
overlapping, fixed-size frame windows (shown in figure 5),
similar to the multi-window register file of RISC I & II [10].
However, the control register file differs from that of RISC II
in that it has to monitor the 32-word register files of the
IPU/FPU and keep their contents in the corresponding frame
windows across function calls. We designed a mechanism,
described later, to map the registers of the IPU/FPU into the
control register file of the LPU across function call/return.
There are 8 frame windows and a total of 136 registers in the
control register file, but only 32 registers (one frame window)
can be seen by the user at a time. The user's view of the 32-word
registers is further partitioned into four 8-word register
sub-frames: global, input, output, and local. This partition
differs from that of RISC II, which has 10 registers for
global, 6 for input, 6 for output, and 10 for local, because
Lisp uses more arguments and fewer local variables than C or
Pascal, and because this arrangement makes the mapping of the
IPU/FPU's registers to the control register file easier. The
global frame, shared by all windows, holds environment
variables such as the return value and the pointer to
the top of heap memory. The input frame holds the input
arguments from the parent function (caller), while the output
frame holds the arguments sent to the child function (callee).
The output frame of the caller overlaps the input frame of the
callee. When a function is called, the window viewed by the user
switches from caller to callee, and the output frame of the
caller becomes the input frame of the callee.
The local register frame, which does not overlap with any other
window, is used to store local lexical variables or temporary
values.
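The total of 136 registers follows from the overlap: each of the 8 windows contributes 16 registers of its own (one 8-word frame shared with a neighbour plus its local frame), and the 8 global registers are shared by all windows, giving 8 + 8 x 16 = 136. The sketch below checks this count with a hypothetical physical layout; the linear numbering and the wraparound from the last window to the first are assumptions for illustration, not details given in the text.

```python
WINDOWS = 8
FRAME = 8  # registers per sub-frame


def physical_index(window, frame, reg):
    """Map (window, sub-frame, register) to a physical register number.

    Assumed layout: globals first, then an (input, local) pair of frames
    for each window. The output frame of window w is the input frame of
    window (w + 1) mod WINDOWS.
    """
    if not 0 <= reg < FRAME:
        raise ValueError("register offset must be in 0..7")
    if frame == "global":
        return reg
    if frame == "out":
        window, frame = (window + 1) % WINDOWS, "in"
    offset = {"in": 0, "local": FRAME}[frame]
    return FRAME + 2 * FRAME * window + offset + reg


def total_registers():
    """Count distinct physical registers reachable from all windows."""
    frames = ("global", "in", "local", "out")
    seen = {
        physical_index(w, f, r)
        for w in range(WINDOWS)
        for f in frames
        for r in range(FRAME)
    }
    return len(seen)


print(total_registers())
print(physical_index(0, "out", 3) == physical_index(1, "in", 3))
```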
Register mapping between IPU and LPU
In the MARS system, only the LPU has a frame-window
register structure; there are merely 32 registers in each of the
IPU and FPU. How do we keep the IPU's and FPU's register data
when executing a function call? We solve this problem with the
following mechanisms. First, the 32-word register file of the
IPU/FPU is partitioned into four 8-word register groups, which
are mapped into the four sub-frames of the LPU's current frame
window. Figure 5 shows the mapping of the IPU/FPU's 32-word
register file into the LPU's frame window. The A group registers
(R0 - R7) and C group registers (R16 - R23) of the IPU/FPU are
always mapped into the global frame and local frame of the LPU's
current frame window. In contrast, the B group registers
(R8 - R15) and D group registers (R24 - R31) of the IPU/FPU are
mapped into the input frame and output frame, or vice versa,
according to whether the current window number of the LPU is
even or odd. Assuming that the window number of the current
function is 0, the B group registers are mapped into the input
frame and the D group registers are mapped into the output frame.
After a child function is called, the frame window number is
increased by one, and the D group registers of the IPU/FPU are
now mapped into the input frame of window 1, which is the output
frame of window 0 (see figure 6). This means that we do not have
to save the 8 registers corresponding to the output frame of the
current window when executing a function call. Likewise, the A
group registers, mapped into the global frame common to all frame
windows, do not have to be saved. So only the remaining two
8-word register groups, mapped into the local frame and input
frame, need to be saved into or restored from the corresponding
frames of the LPU's current window when executing a function
call or return.
The translation from an IPU/FPU register number to the
sub-frames of the LPU's current frame window can be
implemented easily by the circuit shown in figure 7. The turn
signal sent from the LPU is reset to 0 when the number of the
LPU's current frame window is even and set to 1 when the frame
window number is odd. When turn is 0, the translation is the
identity; when turn is 1, the translation maps the register
numbers of the B group into the output frame and those of the
D group into the input frame.
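The translation only ever changes bit 4 of the 5-bit register number, and only for the B and D groups, whose bit 3 is 1. One way to express this, offered here as an assumption rather than as the actual circuit of figure 7, is to XOR bit 4 with (turn AND bit 3):

```python
def translate(reg, turn):
    """Translate an IPU/FPU register number (0..31) for the LPU frame window.

    Hypothetical formulation: bit 4 is flipped when turn is 1 and bit 3 is 1,
    which swaps group B (R8-R15) with group D (R24-R31) and leaves groups
    A (R0-R7) and C (R16-R23) untouched.
    """
    if not 0 <= reg < 32 or turn not in (0, 1):
        raise ValueError("reg must be in 0..31 and turn must be 0 or 1")
    r4 = (reg >> 4) & 1
    r3 = (reg >> 3) & 1
    r4 ^= turn & r3
    return (r4 << 4) | (r3 << 3) | (reg & 0b111)


for turn in (0, 1):
    print(turn, [translate(r, turn) for r in (0, 8, 16, 24)])
```

With turn at 0 the mapping is the identity; with turn at 1, R8 and R24 exchange places while R0 and R16 stay fixed.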
Figure 5. Frame-window structure of the control register file and
mapping of the corresponding register groups in the IPU/FPU
Apart from the above mapping scheme, which reduces the
overhead of saving and restoring IPU/FPU register data, we
save the data into the LPU's current frame window in parallel
with the execution of the IPU's instructions. The LPU monitors
all the instructions executed by the IPU. When the IPU executes
an operation and writes the result back to its register file, it
also puts this result on the data bus during the memory cycle.
Meanwhile, the LPU receives the data and writes it into the
corresponding register of the current frame window. With this
mechanism, we need not save any IPU register data into the LPU
when executing a function call. When the called function returns,
we only have to restore from the LPU those IPU registers that
will be used before the next function call or before the end of
the current function. This overhead is about two or three
instruction cycles per function call on average.
With the above mechanisms, the register data of the IPU
can be kept in the LPU's control register file with little
overhead when executing a function call or return. The multiple,
overlapping frame window structure of the control register file
in the LPU updates the runtime environment very quickly. Because
the LPU does not execute ALU operations, it can spend more time
accessing the complex register file. The IPU, on the other hand,
must spend time executing ALU operations, but it has a
simple 32-word register file and can access its registers faster.
Figure 6. Mapping of frame windows before and
after executing function call and return
           R(4) R(3)  =>  R(4) R(3)   Group
TURN=0      0    0         0    0      A
            0    1         0    1      B
            1    0         1    0      C
            1    1         1    1      D
TURN=1      0    0         0    0      A
            0    1         1    1      D
            1    0         1    0      C
            1    1         0    1      B
Figure 7. Translation of IPU/FPU register number
There are two kinds of variables in Lisp: lexical (or
static) variables and special (or dynamic) variables.
Lexical variables are known at compile time and are allocated
in a stack frame. In contrast, the bindings of special
variables are known only at run time. Two popular ways of
binding are deep binding and shallow binding. Deep binding
implementations store the binding values of special variables on
a stack. When a variable is looked up, the stack must be searched
until the value is found. In shallow binding, each variable is
assigned a global value cell that stores the current binding
value, while old values are pushed onto a stack. In this scheme,
variable lookup is very quick. For this reason, most uniprocessor
Lisp systems use shallow binding.
Binding register file for shallow binding
The 32-word binding register file, which has no
counterpart register file in the IPU/FPU, is used to store
special variables in Lisp. Each special variable corresponds to
one register allocated at load time. We use the binding registers
in a shallow binding scheme to handle special variables. When a
special variable is bound to a new value, the old value in the
corresponding register is pushed onto the restore stack;
when this special variable is unbound, the old value is
popped from the restore stack and restored to the corresponding
register. An example of binding and unbinding special
variables is shown in figure 8. The binding registers, which
have no corresponding register file in the IPU/FPU, can
only be used by LPU instructions. If their values are to be used
by IPU or FPU instructions, they must first be moved to the
global frame registers. By using the binding registers, we can
speed up access to special variables.
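The bind and unbind steps can be mimicked in software as follows. This is a minimal sketch: the class name, the list standing in for the restore stack, and the choice to push the register number together with the old value are assumptions made so that the example is self-contained.

```python
class BindingRegisters:
    """Toy model of shallow binding with a binding register file."""

    def __init__(self, size=32):
        self.regs = [None] * size   # one register per special variable
        self.restore_stack = []     # holds (register number, old value)

    def bind(self, reg, new_value):
        # Save the old value, then overwrite the register.
        self.restore_stack.append((reg, self.regs[reg]))
        self.regs[reg] = new_value

    def unbind(self):
        # Pop the most recent binding and restore the old value.
        reg, old_value = self.restore_stack.pop()
        self.regs[reg] = old_value

    def lookup(self, reg):
        # Lookup is a single register read, with no stack search.
        return self.regs[reg]


X = 0  # hypothetical register number assigned to special variable x
b = BindingRegisters()
b.bind(X, 10)        # outer binding
b.bind(X, 3)         # inner binding, as in (let ((x 3)) ...)
print(b.lookup(X))   # the inner value is visible
b.unbind()
print(b.lookup(X))   # the outer value is visible again
```

The point of the scheme shows in `lookup`: the current value is always in the register, so reading a special variable never walks the stack.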
(let ((x 3)
      (y 5)
      (z 7))
  (foo x y z))
(Panels: binding registers, SP, and memory before the let binding and
after (foo x y z); binding registers after the let binding)
Figure 8. Binding and unbinding of special variables
Binding register file used in multiprocessor environment
MARS is a multiprocessor project. Parallel processing for
building a multiprocessor environment is currently under
intensive research. For multiprocessor Lisps, a shallow-binding
implementation poses a problem: when multiple processes try to
change a special-variable binding, how does MARS Lisp resolve
the conflicting requirements of the different processes without
serializing the processes involved in the access conflict?
MARS Lisp uses the special variable declaration primitive
(e.g., defvar) to define a special variable.
Instead of assigning a global value cell to store the current
binding value, MARS Lisp uses the binding register file to
keep this information. Different processes can get the
updated value directly from the register file instead of from the
global shared memory. Therefore, serializing processes that
conflict over the same special variable is not necessary
(shown in figure 9). When the number of special variables
exceeds the size of the binding register file, the extra
special variables are kept in the restore stack and copied to the
next stack when another function call executes.
Figure 9. Binding register used in MARS Lisp
IV. Pipeline and Branch Strategy
MARS uses a front-end instruction fetch stage followed by
two independent four-stage pipelines, shown in
figure 10, attempting to issue and complete one instruction every
cycle. The instruction fetch stage, performed by the IFU, fetches
the following instruction after a non-compare instruction, and
two instructions (plus the control transfer target address) after
a control transfer instruction. The two four-stage pipelines are
independent but executed synchronously by the IPU and LPU,
respectively, to meet the execution requirements of these two
chips. For example, one duty of the IPU is to determine the
branch condition, so the register values that determine the
condition should be fetched as early as possible. The LPU, on the
other hand, does not have this responsibility, so its register
fetch stage is delayed as long as possible to wait for the result
of an external data reference.
Figure 10. Three kinds of pipeline stage in MARS
Through this kind of pipeline stage arrangement, MARS
can support almost-zero-delay branches, super-zero-delay jumps,
and non-delayed list access.
Branch instructions have a considerable effect on the
performance of pipelined machines [11-13]. Conventional
architectures employ additional hardware to deal with this
problem. They detect the presence of a branch instruction and
put off prefetching until the branch condition has been
evaluated, or use branch prediction techniques to reduce
the number of control flow breaks [14,15]. Recently, branch
folding was proposed in the design of CRISP [16]; it folds a
non-branch instruction and the following branch instruction
together to obtain a zero-delay branch. The MARS IFU combines
delayed branch, multiple prefetch, and early resolution
to reduce the branch penalty. The IFU executes the first two
stages of our 6-stage pipeline and issues instructions to
the datapath chips (IPU, FPU, and LPU). Some intelligence
exists in the IFU for issuing the instruction flow. The
partial decode unit of the IFU, operating at the PD (partial
decode) stage, can detect an incoming jump instruction,
calculate its target address, and access the target instruction
ahead of time. The IFU can absorb the jump instruction and send
out the jump target address simultaneously. This mechanism
yields a super-zero-delay jump instruction. Moreover, when a
conditional jump instruction is recognized in the PD stage, the
IFU extracts the offset field of the instruction, adds this value
to the PC (program counter), and then fetches the branch target
for the datapath chips. If the compare is a fast compare (fcb),
we can resolve it at the very beginning of the ALU stage,
that is, settle the branch before the IF stage of the next
instruction. Therefore, we obtain a zero-delay branch. In some
cases, however, a full compare is necessary; the delayed and
squashable compare and branch instructions (dcb and scb) are
provided to reduce the penalty of draining the pipeline.
Experience has shown that only 10% of the slots are filled with
no-operation instructions [17]. With the combination of fast and
full compare and branch schemes (fcb, dcb, and scb), an
almost-zero-delay branch effect can be obtained.
B. Non-delayed list access
Most Lisp programs access lists frequently [18].
A list structure is usually constructed from two parts, the head
of the list (car) and the tail of the list (cdr). Each of the two
parts contains a tag field, which identifies the data type, and a
data field. When a car or cdr instruction is issued, the tag
field check and the data field access are carried out
simultaneously. Under the LPU's delayed RF mechanism, a register
can be fetched through a shortcut and incurs no internal
interlock with the following instruction, as illustrated in
figure 11. In detail, the non-delayed load works as follows: the
tag comparison stage (cmp) is executed at the falling edge of
phase 2, and the result of the memory access stage (MA) of the
previous instruction is also ready at this time, so internal
forwarding from the MA stage to the cmp stage can be done.
C. Non-delayed procedure call
Procedure calls occur frequently in Lisp programs. There
are several problems associated with such frequent procedure
calls. First, local variables must be saved on each function
call and restored from memory on every return. Second,
arguments must be passed back and forth through memory.
Frame windows are provided in the LPU to keep all
these variables in registers and reduce the cost of external
memory references. The input frame in each procedure's frame
window overlaps with the output frame of the procedure that
calls it, and the output frame overlaps with the input frame of
the procedure that it calls. Local variables are kept in the local
frame of the procedure's window, and global ones in the
global frame, which is visible to and shared by all frame
windows. This frame-window scheme makes procedure calls
virtually free and significantly speeds up their operation.
However, the gain does not come without a price. Increasing
the number of registers means that a considerable amount of chip
area is used, register access time becomes longer, and
process switching overhead increases. Nevertheless, these
costs are offset by a dedicated environment maintenance unit,
that is, the LPU, and by the independent execution pipelines
of the IPU and LPU.
Figure 11. Non-delayed load execution (1) and
one-delayed-slot list compare and
branch (2)
V. Performance Evaluation
Six Gabriel benchmarks, a set of programs that test the
speed of Lisp systems in various aspects, have been run
to compare the performance of MARS with other
architectures; the results are shown in Table 3. The first column shows the
results for MARS, excluding the effect of cache misses.
Columns 2, 3, 4, and 5 give the results for the other three
architectures: the results for the VAX-11/780 are from Gabriel's
book [20]; SPUR's results are from Patterson's paper [21];
and MIPS-X's results are from Steenkiste's paper [22] and are
split into two columns, one without optimization and the
other with optimization. The last four columns give the ratios of
the execution times of the VAX-11/780, SPUR, and MIPS-X to the
execution time of MARS. We have adjusted the results for
SPUR to a 100-nanosecond cycle time instead of the original
150 nanoseconds, following a newer SPUR report [28].
SPUR has a longer cycle time than MARS for several
reasons: SPUR decodes a larger instruction set on one chip, and
its tagging hardware is combined with the ALU operation. MARS
executes Lisp programs about 35.4 times as fast as the
VAX-11/780, about 4.1 times as fast as SPUR, and about
2.2 times as fast as MIPS-X. However, in all these cases, the
performance difference varies significantly across the
benchmarks. The best case is stak, which binds its special
variables in the binding register file, where they can be
referenced directly. On the other hand,
iterative-div2 does not run as well as the other
benchmarks because it has a very deep call depth:
window overflow and underflow occur on most of the function
calls and returns, and the overhead of saving and restoring frame
windows actually slows down execution.
                  Time in milliseconds                       Ratios
            MARS    VAX   SPUR  MIPS-X  MIPS-Xo     VAX/   SPUR/  MIPS-X/  MIPS-Xo/
                                                    MARS   MARS   MARS     MARS
tak           37    830     80      72       72     22.4    2.1    1.9
stak          70   7100    710     602      592    101.4   10.1    8.6      8.4
takl         325   5270    552     482      448     16.6    1.7    1.5
div-iter      55   3800    ...     307      157     69.1    ---    5.6      2.9
div-rec      340   3750   1950     284      196     11.0    5.8    0.8
deriv        110   6580    667     604      381     78.0    6.0    5.5      3.5
Geometric mean                                      35.4    4.1    2.9      2.2
Table 3. Execution times in milliseconds for the Gabriel benchmarks
Note: MIPS-Xo means MIPS-X Lisp execution with optimization.
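
The ratio columns and the geometric-mean row can be reproduced from the
execution times. The Python sketch below does this for the MARS and
unoptimized MIPS-X columns only, using the times as read from Table 3;
the benchmark labels and the helper names speedup_ratios and
geometric_mean are introduced here for illustration.

```python
import math

# Execution times in milliseconds, as read from Table 3.
mars_ms = {"tak": 37, "stak": 70, "takl": 325,
           "div-iter": 55, "div-rec": 340, "deriv": 110}
mipsx_ms = {"tak": 72, "stak": 602, "takl": 482,
            "div-iter": 307, "div-rec": 284, "deriv": 604}


def speedup_ratios(other, base):
    """Ratio of the other machine's time to the base machine's time."""
    return {name: other[name] / base[name] for name in base}


def geometric_mean(values):
    """Geometric mean, computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


ratios = speedup_ratios(mipsx_ms, mars_ms)
for name, ratio in ratios.items():
    print(f"{name:9s} {ratio:5.2f}")
print("geometric mean", round(geometric_mean(ratios.values()), 1))
```

The geometric mean of these six ratios comes out at about 2.9, which
agrees with the MIPS-X/MARS entry in the last row of the table. A ratio
below 1, as for div-rec, means MIPS-X was faster on that benchmark.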
It is interesting to find the reasons for the performance
difference between MARS and MIPS-X. Both MARS and
MIPS-X are RISC processors with the same cycle time (i.e.
50 ns), but they differ in that MARS has a Lisp environment
administrator (i.e. the LPU). The LPU has hardware support for
tag handling, type checking on lists, and binding registers, and uses
frame windows to reduce the cost of register saving and
restoring, and so forth. The MARS hardware for tag handling
would eliminate about 25 percent of the cycles on MIPS-X,
the binding registers would save about 50 percent of the cycles
for loads and stores on MIPS-X, and other features (e.g.
super-zero-delay jump, almost-zero-delay branch, fast list access, etc.)
account for the remaining 45 percent. The frame windows do
not work well for the Gabriel benchmarks, and their average
effect is small. The reason is that some benchmarks use only
a few arguments and local variables per frame window and have
a call sequence that goes straight forwards and backwards, so the
overhead of saving and restoring frame windows on overflows
and underflows, which is 16 register-to-memory transfers
instead of just several for MIPS-X, actually slows down some
programs. Nor do the non-delayed car and cdr instructions
help much in these benchmarks when compared with MIPS-X,
since the delay slot can always be filled. MARS will perform
better on more realistic programs, or when no
optimizing compiler is involved, because the frame windows and
non-delayed car/cdr instructions will then be more effective.
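
The cost of window overflow and underflow for a deep, monotonic call
sequence can be illustrated with a small model. The Python sketch below
is a simplification under stated assumptions: the figure of 16
register-to-memory transfers per overflow or underflow is from the text,
while the number of on-chip windows (4), the call traces, and the name
window_transfers are hypothetical.

```python
SPILL_COST = 16  # register-to-memory transfers per overflow/underflow (from the text)


def window_transfers(call_trace, num_windows):
    """Count register-to-memory transfers caused by window traps.

    call_trace is a sequence of +1 (call) and -1 (return).
    num_windows is the number of frame windows held on chip (assumed).
    """
    depth = 0          # current call depth
    resident_low = 0   # shallowest depth whose window is still on chip
    transfers = 0
    for step in call_trace:
        depth += step
        if step > 0 and depth - resident_low >= num_windows:
            # Overflow: spill the oldest resident window to memory.
            resident_low += 1
            transfers += SPILL_COST
        elif step < 0 and depth < resident_low:
            # Underflow: the caller's window must be restored from memory.
            resident_low -= 1
            transfers += SPILL_COST
    return transfers


deep = [+1] * 100 + [-1] * 100   # recurse 100 levels, then unwind
shallow = [+1, -1] * 100         # 100 calls that return immediately
print(window_transfers(deep, 4), window_transfers(shallow, 4))
```

In this model the deep trace traps on nearly every call and return,
while the shallow trace never leaves the resident windows. This matches
the qualitative point above: frame windows help when the call depth
oscillates within a few levels and hurt when it runs far in one
direction.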
VI. Conclusion
A design of a chip set for Lisp execution is proposed in
this paper. By separating the IFU from the datapaths and by our
deliberate pipeline arrangement, we can not only coordinate
execution among the IPU, FPU, and LPU but also
drastically reduce the delay slots due to control transfers, leaving the
compiler more chances to fill the delayed load slots and thus
accomplishing our goal of single-cycle instruction execution.
Moreover, we can absorb the jump instructions
within the IFU and directly issue the target to the datapath chips to
achieve what we call the super-zero-delay jump. An
independent and separate LPU, playing the role of a Lisp
runtime environment administrator, can accelerate Lisp
programs for the following reasons. First, long complex
register file access can be handled within the LPU, without
an accompanying long ALU stage. Second, two independent
pipeline executions of the IPU and LPU can separate the critical
path of long register fetch plus integer processing into two
independent parts. Furthermore, the LPU can put off the
register fetch until the external memory access is ready, and
thus no delay slot is needed when referring to the data cache.
Third, because instruction decoding in the IPU and LPU is local,
some frequently used Lisp primitives can be hardwired without
increasing the complexity and access time of instruction
decode. Fourth, by excluding arithmetic calculation from
the LPU, the LPU can offer more silicon area to
accommodate more registers, and thus cut down the need for
external memory accesses and increase system performance. Fifth,
hardwired primitives need fewer machine cycles than the same
primitives implemented with the underlying machine instructions.
Status 
The implementation of the LPU is in progress. We have described
the LPU at the register-transfer level with the M modeling
language. The layouts of the custom chips will be finished later
this year.
REFERENCES 
[1] A. R. Pleszkun and M. J. Thazhuthaveetil, "The
Architecture of Lisp Machines," IEEE Computer, Vol. 20,
No. 7, Mar. 1987, pp. 35-44.
[2] D. A. Moon, "Architecture of the Symbolics 3600," Proc.
Twelfth Symposium on Computer Architecture, Boston,
June 1985.
[3] LMI, The Lambda System: Technical Summary,
LISP Machines Inc., 1983.
[4] A. L. Davis and S. V. Robison, "The FAIM-1 Symbolic
Multiprocessing System," Spring 1985 Compcon Digest
of Papers, 1985, pp. 370-375.
[5] Y. Yamaguchi, K. Toda, and T. Yuba, "A Performance
Evaluation of a Lisp-based Data-Driven Machine (EM-3),"
Proc. 10th Int'l Symp. Computer Architecture, June
1983, pp. 363-369.
[6] M. Hill, et al., "Design Decisions in SPUR," IEEE
Computer, Nov. 1986, pp. 8-24.
[7] M. Horowitz, et al., "MIPS-X: A 20-MIPS Peak, 32-bit
Microprocessor with On-Chip Cache," IEEE Journal of
Solid-State Circuits, Vol. SC-22, No. 5, Oct. 1987,
pp. 790-799.
[8] P. Chow and M. Horowitz, "Architectural Tradeoffs in the
Design of MIPS-X," Proc. 13th Symposium on Computer
Architecture, Jun. 1986, pp. 300-308.
[9] P. Steenkiste and J. Hennessy, "Tags and Type Checking
in Lisp: Hardware and Software Approaches," Proc.
Second Int'l Conf. Architectural Support for
Programming Languages and Operating Systems,
ACM/IEEE, Oct. 1987, pp. 50-59.
[10] M. Katevenis, Reduced Instruction Set Computer
Architectures for VLSI, Ph.D. dissertation, Computer
Science Division (EECS) UCB/CSD, University of
California, Berkeley, Oct. 1983.
[11] S. McFarling and J. Hennessy, "Reducing the Cost of
Branches," Proc. 13th Symposium on Computer
Architecture, Jun. 1986, pp. 396-403.
[12] J. E. Smith, "A Study of Branch Prediction Strategies,"
Proc. 8th Symposium on Computer Architecture, May
1981, pp. 135-148.
[13] J. K. F. Lee and A. J. Smith, "Branch Prediction
Strategies and Branch Target Buffer Design," IEEE
Computer, Vol. 17, No. 1, Jan. 1984, pp. 6-22.
[14] J. Hennessy, et al., "Hardware/Software Tradeoffs for
Increased Performance," Proc. SIGARCH/SIGPLAN
Symposium on Architectural Support for Programming
Languages and Operating Systems, ACM, Palo Alto,
Mar. 1982, pp. 2-11.
[15] D. J. Lilja, "Reducing the Branch Penalty in Pipelined
Processors," IEEE Computer, Vol. 21, No. 7, Jul. 1988,
pp. 47-55.
[16] D. R. Ditzel and H. R. McLellan, "Branch folding in
the CRISP microprocessor: Reducing branch delay to
zero," in Proc. 14th Annual Symp. Computer
Architecture, 1987, pp. 2-9.
[17] H.-C. Lee and C.-E. Wu, "Lock-up free cache design and
the phoenix protocol," NTU-EE-CS memo No. 329-4,
Computer Science Division (EECS), National Taiwan
University, Taiwan, R.O.C., Jan. 1989.
[18] D. W. Clark, "Measurements of Dynamic List Structure
Use in Lisp," IEEE Trans. Software Engineering, Vol.
SE-5, No. 8, Jan. 1979.
[19] G.-S. Jang, F. Lai, H.-C. Lee, Y. C. Maa, T. M.
Parng, and J.-Y. Tsai, "MARS-Multiprocessor Architecture
Reconciling Symbolic with Numerical Processing,"
International Symposium on VLSI Technology, Systems,
and Applications, 1989, pp. 365-370.
[20] R. P. Gabriel, Performance and Evaluation of Lisp
Systems, The MIT Press, Cambridge, Mass., 1985.
[21] D. Patterson, "A Progress Report on SPUR," Computer
Architecture News, ACM, Mar. 1987, pp. 15-21.
[22] P. Steenkiste and J. Hennessy, "Lisp on a Reduced
Instruction Set Processor: Characterization and
Optimization," IEEE Computer, Vol. 21, No. 7, June
1988, pp. 34-45.
[23] J.-Y. Tsai, The Design of List Processing Unit (LPU) for
the MARS System, M.S. thesis, Computer Science
Division, Department of Electrical Engineering (EECS),
National Taiwan University, Taiwan, R.O.C., July 1989.
[24] J. Cho, et al., "The Memory Architecture and the Cache
and Memory Management Unit for the Fairchild CLIPPER
Processor," Tech. Rep. UCB/CSD 86/289, Computer
Science Division (EECS), University of California,
Berkeley, CA, April 1986.
[25] A. J. Smith, "Cache Memories," Computing Surveys,
ACM, Vol. 14, No. 3, Sep. 1982, pp. 473-530.
[26] G.-S. Jang, "The Floating Point Unit for MARS: Design
and Specification," NTU-EE-CS memo No. 329-3, 1988.
[27] K.-C. Chen, H.-C. Lee, F. Lai, and Z.-W. Liao,
"Concurrent MARS Lisp: Language Feature and
Implementation," NTU-EE-CS memo No. 329-5,
Computer Science Division, Department of Electrical
Engineering (EECS), National Taiwan University, R.O.C.,
Mar. 1989.
[28] D. Lee, et al., "A VLSI Chip Set for a Multiprocessor
Workstation, Part I: A RISC Microprocessor with
Coprocessor Interface and Support for Symbolic
Processing," Tech. Report No. UCB/CSD 89/500,
Computer Science Division (EECS), University of
California, Berkeley, CA, April 1989.
Tag value              Data type
                       Immediate:
000xxxB                  imm. character
001xxxB                  imm. fixnum number
010xxxB                  imm. short floating point (32 bits)
                       Indirect number:
01100xB                  big number
01101xB                  ratio
01110xB                  long floating point (64 bits)
01111xB                  complex
                       Other:
110000B - 111111B        defined by compiler
Table 2. Data type and corresponding tag value
[Table 1: listing of the LPU instructions; the table body is not
recoverable.]
Table 1. The instruction set for LPU
