Set associative, 10 ns R/W access time, ECL bipolar RAM. #2: 64 sets, 1 way. #3: 256 sets, 4 ways, 32 B write buffer, non store through. #4: The D-cache is equipped with 8-word (32 B) parallel match logic for speeding up searching.
Fig. 1 Configuration of the FLATS Machine
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
Basic Data Format
These data types are checked by hardware.
A "BIG NUM" or "BIG FLOAT" argument in an arithmetic operation causes a trap to the extended arithmetic routines.
Most data types except CAT and AMT are similar to those of other Lisps.
CAT, AMT and H-type data are associative (hashed) data types and are explained later.
Address (Pointer) Space
Word addressing is employed, except for bit vectors with bit addressing capability.
The virtual address space is divided into two sub-spaces, the I-space and the D-space, each with a capacity of 2^24 words.
The I-space (I for Instruction) is used for storing compiled code, and pointers into this space are tagged as function pointers (cf. Fig. 2). All other data types are stored in the D-space (D for Data).
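The split into two 2^24-word spaces selected by a pointer tag can be illustrated with a small sketch. The 32-bit word follows from the figure note that 8 words occupy 32 B, but the 8-bit tag field, its position and the tag values used here are assumptions made for illustration only; the actual encoding is the one given in Fig. 2.

```python
ADDRESS_BITS = 24
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1

# Hypothetical tag values, not taken from the paper.
TAG_FUNCTION = 0x01  # pointer into the I-space
TAG_LIST = 0x02      # one of the data types stored in the D-space


def make_pointer(tag, address):
    """Pack an assumed 8-bit tag and a 24-bit word address into one word."""
    if not 0 <= tag <= 0xFF:
        raise ValueError("tag must fit in 8 bits")
    if not 0 <= address <= ADDRESS_MASK:
        raise ValueError("address outside the 2**24-word space")
    return (tag << ADDRESS_BITS) | address


def tag_of(word):
    return word >> ADDRESS_BITS


def address_of(word):
    return word & ADDRESS_MASK


def space_of(word):
    """Function pointers refer to the I-space, everything else to the D-space."""
    return "I" if tag_of(word) == TAG_FUNCTION else "D"
```

With this layout, the same 24-bit address can name a word in either space, and the tag alone decides which one is meant.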
High Speed Registers
The 128 global registers (G-reg.) and 127 local stack frame registers (F-reg.) are provided, and the "V-cache" (Fig. 1) is used to realize these registers. Three identical copies of each register are provided in order to realize 3 parallel read ports. Use of both F- and G-registers speeds up the execution of some programs (cf. recursive APPEND in appendix 1).
The meaning of the 3 bytes op, R1 and R3 is the same as in R3 (4.2). "j" stands for a conditional short jump to a relative address j, -128 ≤ j ≤ +127. Typical operations of this type are:
100 ns if the invisible pointer of cdr coding is not involved. Makes a short jump in 100 ns if car or cdr of R1 cannot be taken.
Always 100 ns. Short jump or non-jump to "j" on the truth of (EQ R1 R3).
r1 := rplaca[r1, "j", r3], r1 := rplacd[r1, "j", r3]: 100 ns if the invisible pointer of cdr coding is not involved. Short jump to "j" in 100 ns on bad argument(s).
Always 100 ns. Short jump or non-jump to "j" on atomic R1.
Short jump or non-jump to "j" on numerical equality of R1 and R3. 100 ns if R1 and R3 are short integers.
BIG-NUM argument(s) cause a trap to the BIG-NUM routines, and non-number argument(s) to an error handler.
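The behaviour of a conditional short jump on (EQ R1 R3) can be sketched as follows. The one-byte signed offset comes from the stated range of j; the assumptions that every instruction occupies one word and that the offset is taken relative to the following instruction are illustrative choices, and the function names are hypothetical.

```python
def decode_offset(byte):
    """Interpret an unsigned byte (0..255) as a signed offset (-128..+127)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("offset must fit in one byte")
    return byte - 256 if byte >= 128 else byte


def eq_jump(r1, r3, pc, j_byte, jump_if_equal=True):
    """Return the next program counter for a jump on the truth of (EQ R1 R3).

    Assumes one-word instructions and an offset relative to pc + 1.
    EQ is modelled by Python object identity.
    """
    taken = (r1 is r3) == jump_if_equal
    next_pc = pc + 1
    return next_pc + decode_offset(j_byte) if taken else next_pc
```

Because the jump target is always within 128 words of the instruction, both possible successors are known as soon as the instruction is fetched.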
GOTOs
The "GOTO J" instruction has a one word format: (1 byte op code) + (a 24 bit I-space address).
The time for GOTO is made practically zero by parallelism as described later.
On the other hand, the instruction for "computed GOTO on an integer R1 to one of n = R3 places" has a special n+1 word format and takes 250 ns to execute.
CALL, RETURN - C-stack Instructions
A hardware stack, called the C-stack (C for Control), different from the local stack frame (cf. 4.1), is provided for stacking a return address and an incremental value, DELTA-CFP, of the CFP (Current Stack Pointer, cf. 4.1).
The "CALL" instruction is always followed by a "GOTO J" instruction.
The first byte of the "CALL" instruction is the op. code, the second byte is the immediate value of DELTA-CFP, and the last 2 bytes have no significance.
In the "RETURN" instruction only the first op. code byte is significant.
"CALL" increments the CFP by DELTA-CFP, pushes a linkage word, the return address and DELTA-CFP, onto the C-stack, and then goes to J.
"RETURN" pops the linkage word from the C-stack, restores the old CFP by subtracting DELTA-CFP from the CFP, and returns. The times for "CALL" and "RETURN" are also made practically zero by built-in parallelism.
The Architecture for Basic Lisp Operations
Cdr Coding and RCONS
Besides implementing cdr coding [3] by hardware as in other Lisp machines, RCONS (Reverse CONS) is also hardware supported.
The RCONS instruction (RCONS, R1, R2, R3) can be defined operationally as a statement:
Recursion can be removed from these functions by using RCONS, which constructs a list from head to tail while CONS constructs a list from tail to head. In the cdr coding system, however, the use of RPLACD would generate a non-linear structure occupying 2 words per list cell in excess of a linear structure.
RCONS is hardware implemented so as to construct a compact linear list structure from the right of the free list area while CONS does the same from the left [10]. A programming example with RCONS is given in appendix 1 (cf. APPEND (Iterative)).
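The difference between building a list from tail to head and from head to tail can be illustrated with APPEND. This is only a sketch of the construction order: the `Cell` class and function names are hypothetical, and the tail update below is an ordinary destructive store, so it does not model the cdr-coded memory layout that the hardware RCONS produces.

```python
class Cell:
    """A plain two-field list cell; None stands for NIL."""
    __slots__ = ("car", "cdr")

    def __init__(self, car, cdr=None):
        self.car = car
        self.cdr = cdr


def from_list(items):
    result = None
    for item in reversed(items):
        result = Cell(item, result)
    return result


def to_list(cell):
    items = []
    while cell is not None:
        items.append(cell.car)
        cell = cell.cdr
    return items


def append_recursive(x, y):
    """CONS style: each cell is created only after the rest of the list exists."""
    if x is None:
        return y
    return Cell(x.car, append_recursive(x.cdr, y))


def append_iterative(x, y):
    """Head-to-tail style: cells are created in list order, so no recursion is needed."""
    if x is None:
        return y
    head = tail = Cell(x.car)
    x = x.cdr
    while x is not None:
        tail.cdr = Cell(x.car)  # extend the result at its tail
        tail = tail.cdr
        x = x.cdr
    tail.cdr = y
    return head
```

Both functions return the same list; the iterative one needs no control stack depth proportional to the length of x.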
Pipeline and Advanced Control
Three pipeline stages, I, V and D, are employed: "I" for "Instruction" fetching and prefetching, "V" for reading and writing the "Values" of high speed registers (G and F reg., cf. 4.1), and "D" for instruction execution with memory accesses through the "D-cache".
Besides these 3 pipelined stage units, the C-unit, provided for controlling the C-stack, runs concurrently.
The C-unit makes use of the D-cache on a cycle steal basis.
The I-cache is separated from the D-cache to improve the performance of instruction prefetching.
Up to 6 instructions can be prefetched within the I-stage unit.
The time needed for branching by short jumps (4.3) is made practically zero by prefetching the instructions on both the branching and non-branching sides in parallel with the evaluation of the branching conditional predicate.
GOTO, CALL and RETURN instructions are executed in parallel with the execution of other instructions by means of the I-stage unit and the C-unit. Thereby, the time needed to execute these instructions is also made practically zero.
Since conditional branching, GOTO, CALL and RETURN instructions occupy about 50% of the compiled code in typical Lisp programs, the speeding up of these instructions by parallelism is considered very effective. Some examples are given in appendix 1, wherein the examples of ASSOCQ and APPEND show that the speeding up of CALL and RETURN is almost as effective as recursion elimination.
A new pipeline recurrence relation formulated by Shimizu was used in the design of the pipeline logic [9]. A logic simulator system, DDL*, written by Shimizu [9] has been used throughout the design of the FLATS. The DDL* system had to be written in Fortran (about 14,000 lines) because all Lisp systems accessible to our group were considered too slow. The world would have been different if FLATS were available!
Vectors
The operational specification of vector instructions is the same as that of MKVECT, GETV and PUTV in the Utah Standard Lisp [5].
A vector is internally represented by a "vector descriptor" which consists of a pair of pointers (L, U) occupying two words (8 B format data).
L and U give the lower and upper bounds of the memory space allocated for the vector.
The instruction (MKVECT, R1, -, R3) places a pointer (tagged as a vector) to a new vector descriptor (L, U) in R3, where U = L + R1, provided that R1 is an integer representing the size of the vector.
Vector range violation is always checked by hardware in vector access instructions, GETV and PUTV.
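The descriptor scheme and its range check can be sketched as follows. The `VectorStore` class and its bump allocator are hypothetical. Treating both bounds as inclusive, so that indices 0 through R1 are valid, is an assumption that follows the Standard Lisp definition of MKVECT.

```python
class VectorStore:
    """Toy word-addressed memory with vector descriptors (L, U)."""

    def __init__(self, size):
        self.words = [None] * size
        self.free = 0  # next unallocated word (hypothetical allocator state)

    def mkvect(self, r1):
        """Allocate a vector and return its descriptor (L, U), where U = L + r1."""
        if not isinstance(r1, int) or r1 < 0:
            raise TypeError("size must be a non-negative integer")
        lower = self.free
        upper = lower + r1
        if upper >= len(self.words):
            raise MemoryError("out of vector space")
        self.free = upper + 1
        return (lower, upper)

    def _address(self, descriptor, index):
        """Translate an index to a word address, checking the range on every access."""
        lower, upper = descriptor
        address = lower + index
        if not lower <= address <= upper:
            raise IndexError("vector range violation")
        return address

    def getv(self, descriptor, index):
        return self.words[self._address(descriptor, index)]

    def putv(self, descriptor, index, value):
        self.words[self._address(descriptor, index)] = value
```

Since the descriptor carries both bounds, the check needs only two comparisons against values already at hand, which is what makes it cheap to perform on every access.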
Bit Vector for Garbage Collection
A bit pattern handling hardware [6] is implemented for speeding up the marking of active cells, pointer adjustments and relocation in compactifying garbage collection.
Bit vectors (32 bit word) with bit addressing hardware are used for this purpose.
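A bit vector packed into 32-bit words, used as a mark table, can be sketched as follows. Only the marking phase is shown; the heap representation (a list of car/cdr index pairs) and all names are hypothetical, and the pointer adjustment and relocation steps of the compacting collector are not modelled.

```python
WORD_BITS = 32


class BitVector:
    """Bits packed into 32-bit words and addressed by bit index."""

    def __init__(self, n_bits):
        self.n_bits = n_bits
        self.words = [0] * ((n_bits + WORD_BITS - 1) // WORD_BITS)

    def _check(self, i):
        if not 0 <= i < self.n_bits:
            raise IndexError("bit address out of range")

    def set(self, i):
        self._check(i)
        self.words[i // WORD_BITS] |= 1 << (i % WORD_BITS)

    def test(self, i):
        self._check(i)
        return (self.words[i // WORD_BITS] >> (i % WORD_BITS)) & 1 == 1

    def count(self):
        return sum(bin(word).count("1") for word in self.words)


def mark(heap, roots):
    """Mark every cell reachable from roots; heap[i] is a (car, cdr) pair
    whose fields are cell indices or None."""
    marks = BitVector(len(heap))
    stack = list(roots)
    while stack:
        address = stack.pop()
        if address is None or marks.test(address):
            continue
        marks.set(address)
        stack.extend(heap[address])
    return marks
```

One mark bit per cell keeps the table at 1/32 of the heap size in words, and counting the set bits below a cell gives its position after compaction.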
P-list vs. CAT, AMT
P-list (Property-list)
P-list is an important programming concept introduced in Lisp 1.5 [1].
However, it often causes global name clash problems because a P-list is usually associated with a global name (atom). This problem can be resolved by using a "gensym" mechanism as shown in 6.2. A P-list is usually implemented literally as a "list structure", which results in a rather slow O(n) operation time when n items are placed on the P-list.
AMT and CAT
Two data types, AMT (Associative Membership Table) and CAT (Content Addressed Table), which may be regarded as nameless P-lists, are provided.
Operationally, each AMT or CAT instruction corresponds, line by line, to a P-list operation as in:
The values of x and y are 1 and T, respectively, in each program.
The speed up is realized in AMT and CAT instructions by skipping the gensym mechanism and by using hardware supported hash retrieval so as to realize O(1) operation times.
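The contrast between a literal P-list and a hashed table can be sketched as follows. The class and method names are hypothetical, and reading CAT as a key-to-value table and AMT as a membership table is an assumption drawn from the table names, not from the instruction definitions. Python's built-in `dict` and `set` stand in for the hardware hashing.

```python
class PList:
    """Property list kept literally as a list of [indicator, value] pairs."""

    def __init__(self):
        self.pairs = []

    def put(self, indicator, value):
        for pair in self.pairs:          # O(n) scan for an existing indicator
            if pair[0] == indicator:
                pair[1] = value
                return
        self.pairs.append([indicator, value])

    def get(self, indicator):
        for key, value in self.pairs:    # O(n) scan
            if key == indicator:
                return value
        return None


class CAT:
    """Assumed reading: a nameless table from keys to values, hashed."""

    def __init__(self):
        self.table = {}

    def put(self, key, value):
        self.table[key] = value          # expected O(1)

    def get(self, key):
        return self.table.get(key)       # expected O(1)


class AMT:
    """Assumed reading: a nameless membership table, hashed."""

    def __init__(self):
        self.members = set()

    def add(self, key):
        self.members.add(key)

    def member(self, key):
        return key in self.members
```

Because a CAT or AMT is an object in its own right rather than a property of a global atom, two programs using the same indicator names do not interfere with each other.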
Hardware Hashing and H-Type Data
In the D-cache (cf. Fig. 1), 8 words are compared in parallel to speed up the searching by a hashing hardware [7]. Besides speeding up AMT and CAT operations (6.2), hashing is employed to construct uniquely represented data types, called the H-type data.
McCarthy [2] once described (HCONS X Y), which is like (CONS X Y) except that only one copy of the consed object is made, by searching through the storage to check whether the same structure has been made before.
Searching is to be made by hashing for the sake of speed.
HCONS is hardware implemented in our machine. Equality checking of two tree structures, say, a and b, can be made in O(1) time by the pointer comparing primitive eq[a; b] when they are constructed by HCONS.
McCarthy remarked that the problem of speeding up the equality checking of large mathematical expressions would be resolved by using an HCONS scheme.
However, this is not sufficient.
The expression A + B + C may be expressed in many different lists (ordered n-tuples) (A, B, C), (B, A, C), ... owing to the commutative nature of addition.
Unique representation of sets (unordered n-tuples) would resolve this problem [8], since the equivalence of sets is defined as: {A, B, C} = {B, A, C} = ....
Hashing hardware for uniquely defining sets is also implemented in our machine.
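The idea of HCONS and of uniquely represented sets can be sketched in software with hash tables. The names `hcons`, `hlist` and `hset` and the use of Python's `dict` and `frozenset` are illustrative assumptions; they show the uniqueness property only, not the machine's hashing hardware. Atoms are modelled by hashable Python values such as strings.

```python
class HCell:
    """A cons cell that is only ever created through hcons."""
    __slots__ = ("car", "cdr")

    def __init__(self, car, cdr):
        self.car = car
        self.cdr = cdr


_cells = {}  # (car, cdr) -> the unique cell with those fields
_sets = {}   # frozenset -> the unique object representing that set


def hcons(car, cdr):
    """Return the one cell with these fields, creating it only if it is new.

    Sub-cells hash and compare by identity, which is sound because they are
    themselves unique; atoms compare by value.
    """
    key = (car, cdr)
    cell = _cells.get(key)
    if cell is None:
        cell = HCell(car, cdr)
        _cells[key] = cell
    return cell


def hlist(*items):
    """Build an ordered, uniquely represented list from its items."""
    result = None
    for item in reversed(items):
        result = hcons(item, result)
    return result


def hset(*elements):
    """Return the unique object representing the unordered set of elements."""
    key = frozenset(elements)
    return _sets.setdefault(key, key)
```

With these definitions, `hlist("A", "B", "C") is hlist("A", "B", "C")` holds, so structural equality reduces to a pointer comparison, while `hlist("A", "B", "C")` and `hlist("B", "A", "C")` remain distinct objects. `hset("A", "B", "C") is hset("B", "A", "C")` holds, which is the property needed for commutative expressions.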
Starting from <ATOM> which is a uniquely define
Equality checking of any two H-type data can be made in 100 ns by the EQJ or EQNJ instruction (cf. 4.3).
Since H-type data are unique like any literal atoms, they can be used as indicators and flags in P-lists, AMTs and CATs.
Thus, the H-type data operations are believed to provide a powerful associative computation scheme.
T. Ida and E. Goto, "Performance of a Parallel Hashing Hardware with Key Deletion," Proc. IFIP Congress 77, North-Holland,
where the second argument y is simply copied, i.e., y is passed from the outer "append" to the inner "append" without any change. Such "argument copying" can be removed by storing the value of y in a G (global) register (actually GR127 in APPEND). A Lisp compiler for automatically removing such "argument copying" is now being designed. For recursion elimination, RCONS is used in APPEND, and tail recursion removal is done in ASSOCQ. No good iterative method is known for EQUAL.
