Numerically intensive calculations are not well supported by Prolog, yet there are important applications that require tightly coupled symbolic and numeric calculations. The Aquarius Numeric Processor (ANP) is an extended numeric Instruction Set Architecture (ISA) based on the Berkeley Programmed Logic Machine (PLM) to support integrated symbolic and numeric calculations. This extension expands the existing numeric data type to include 32- and 64-bit integers and single and double precision floating-point numbers conforming to the IEEE Standard P754. A new class of data structure, numeric arrays, is added to represent matrices and arrays found in most scientific programming languages. Powerful numeric instructions are included to manipulate the new data types. Dynamic Operand Coercion (DOC) provides dynamic type checking and coercion of operands. The ANP and PLM together provide for the efficient execution of symbolic and numeric operations written in AI languages such as Prolog and Lisp. Simulated performance results indicate the system will achieve about 10 MFLOPS on the Prolog version of some Whetstone and Linpack benchmarks and close to 20 MFLOPS on some matrix operations (all in double precision).
Introduction
Contemporary Prolog execution systems provide excellent support for symbolic calculations, but are generally quite weak in their support of numeric and linear algebra calculations. Yet some of the most interesting and challenging applications of logic programming require high performance execution of tightly coupled symbolic and numeric calculations. Examples include linear algebra, digital signal processing, computer-aided design, engineering and manufacturing, design automation [3], robotics, Constraint Logic Programming [8, 9, 11], geometric modeling and reasoning with probabilistic evidence.
In our Aquarius project [4], one of the main applications is design automation [3], and it requires extensive numeric calculations as well as symbolic manipulations. We are investigating additional built-in predicates and macros for the Prolog language to better support numeric operations. The predicates have a semantic interpretation in a kernel subset of Prolog, but can be efficiently and directly compiled into powerful machine instructions. At execution time, most of the machine instructions are executed by a symbolic processor, the PLM [7]. When the special numeric instructions are fetched by a pre-fetch unit, they are ignored by the symbolic processor and are acted upon by the Aquarius Numeric Processor (ANP) [18].
The ANP is a high performance vector numeric processor especially designed to support numeric operations that occur in the context of logic programming. Figure 1 shows a block diagram of this integrated ANP/PLM architecture. The ANP co-processor of figure 1 is currently under construction using TTL and ECL parts and will be inserted into our current experimental system in the near future.
Programming Model
Because the Aquarius Numeric Processor (ANP) is a co-processor to the Programmed Logic Machine (PLM), it inherits the data types and programming model from the PLM [7]. It adds new data types to the programming model including, in both scalar and vector forms, integer, single and double precision floating point numbers in IEEE standard (754) form [2]. An extended numeric register set and a large repertoire of integer and floating point operations are provided for these new data types.
The PLM programming model
The programming model of the PLM includes an execution model based on a modified and extended Warren Abstract Machine [6, 16]. The execution model of the PLM consists of four distinct memory areas and a number of registers associated with them. The Code Space contains compiled Prolog procedures and clauses. The Data Space defines the dynamic state of the PLM and is divided into three areas, each organized as a LIFO memory (footnote 2). The Stack contains control information defining the environments and choice points created during the execution of PLM instructions. The Heap is a general area for the storage of data. Finally, the Trail keeps track of variable bindings which must be unbound during backtracking. In addition to the state registers described above, there are eight data registers, A0-A7, in the PLM for parameter passing and for the storage of temporary and frequently used data items.
Following is a summary of registers that make up the machine state of the PLM. [...] Each data word carries one of four primary tags, list, structure, variable and constant, which are distinguished by bit<31:30>. These are shown in figure 2. Bit<29> is a cdr bit which is used for compact list representation, and bit<28> is a garbage collection bit. This bit is reserved for data marking during garbage collection. Bit<27:26> of a constant data type further differentiates between a 26-bit small integer (00), an other-numeric header (01), an atom (10) and nil (11). This tagging information allows efficient manipulation of data by applying different strategies to operate on each class of data. Although data tagging makes execution more efficient, it decreases the amount of information that can be stored within each data word.
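The tag layout described above can be illustrated with a short decoding sketch. The bit positions and the constant sub-type codes are taken from the text; the mapping of the two primary-tag bits to list, structure, variable and constant is an assumption (figure 2 defines the real assignment), as is the treatment of the low 28 bits of non-constant words as an address.

```python
# Assumed encoding of the primary tag in bit<31:30>. The text names the four
# types but not their bit patterns, so this mapping is hypothetical.
PRIMARY_TAGS = {0b00: "variable", 0b01: "constant", 0b10: "list", 0b11: "structure"}

# Sub-type of a constant in bit<27:26>, as listed in the text.
CONSTANT_SUBTYPES = {
    0b00: "small integer",
    0b01: "other-numeric header",
    0b10: "atom",
    0b11: "nil",
}


def decode_word(word):
    """Split a 32-bit PLM data word into its tag fields and payload."""
    word &= 0xFFFFFFFF
    fields = {
        "primary": PRIMARY_TAGS[(word >> 30) & 0b11],
        "cdr": (word >> 29) & 1,  # compact list representation bit
        "gc": (word >> 28) & 1,   # garbage collection mark bit
    }
    if fields["primary"] == "constant":
        fields["subtype"] = CONSTANT_SUBTYPES[(word >> 26) & 0b11]
        fields["value"] = word & 0x03FFFFFF  # low 26 bits, left uninterpreted
    else:
        fields["value"] = word & 0x0FFFFFFF  # low 28 bits (assumed address)
    return fields
```

The sketch also makes the closing remark of the paragraph concrete: a small integer has only 26 payload bits left once the tag bits are accounted for.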
Several new data types are added to the ANP for numeric computations. The fundamental numeric data types are 32- and 64-bit integers and single and double precision floating-point numbers. Arrays based on these fundamental data types can be constructed in single and multi-dimensional forms. Integers and floating point numbers for computation in the ANP conform to the IEEE Standard P754 [2].
Structure Numeric Representation
The IEEE Standard for Binary Floating-Point Arithmetic specifies numeric operands to be a multiple of a 32-bit word, except for the recommended extended format, which is 80 bits long. To maintain compatibility with this standard as well as the PLM execution model, an additional 32-bit word is needed to store data type (tagging) information.
The Structure Numeric Representation utilizes a place holder as an indirect pointer to access the numeric operands. The place holder is modified with a structure primary tag, garbage collection and cdr bits, and a 28-bit address pointer pointing to the numeric data structure. Figure 3 shows this numeric data structure which consists of a place holder, a numeric header and the numeric operand it is representing. The first entry of the numeric operand is a numeric header which has the same top 6-bit tags of the place holder, a 4-bit numeric tag and a 16-bit vector [...] (S) of the operand. Tag space is also provided for additional numeric types such as 'Bignum', decimal, and complex numbers. Since the PLM data path cannot directly operate on 32-bit numeric operands, the entire numeric structure addressed by an indirect pointer will not be transferred into the PLM register set, but into the ANP instead.
Footnote 2: LIFO is 'Last In, First Out', or stack-managed memory.
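The indirection through a place holder can be sketched as follows. This is only an illustration: the field positions inside the numeric header, the tag values, the use of a length of 1 for a scalar, and the helper names `make_header`, `store_double` and `load_double` are all assumptions; only the 28-bit address, the 4-bit numeric tag and the 16-bit width of the vector field come from the text.

```python
import struct

ADDR_MASK = 0x0FFFFFFF       # 28-bit address field of the place holder
STRUCTURE_TAG = 0b11         # assumed primary-tag value for 'structure'
NUMERIC_TAG_DOUBLE = 0b0010  # hypothetical 4-bit numeric tag for a double
HEADER_TOP_TAGS = 0b010001   # hypothetical 6-bit tag field of the header


def make_header(top_tags, numeric_tag, length):
    """Pack a numeric header. Assumed layout: bit<31:26> tags,
    bit<25:22> numeric tag, bit<15:0> vector length."""
    return ((top_tags & 0x3F) << 26) | ((numeric_tag & 0xF) << 22) | (length & 0xFFFF)


def store_double(heap, value):
    """Append header plus two operand words to the heap; return the place holder."""
    addr = len(heap)
    hi, lo = struct.unpack(">II", struct.pack(">d", value))
    heap.extend([make_header(HEADER_TOP_TAGS, NUMERIC_TAG_DOUBLE, 1), hi, lo])
    return (STRUCTURE_TAG << 30) | (addr & ADDR_MASK)


def load_double(heap, place_holder):
    """Follow the place holder, check the header, and rebuild the double."""
    addr = place_holder & ADDR_MASK
    if (heap[addr] >> 22) & 0xF != NUMERIC_TAG_DOUBLE:
        raise ValueError("header does not describe a double")
    return struct.unpack(">d", struct.pack(">II", heap[addr + 1], heap[addr + 2]))[0]
```

The point of the sketch is that the operand words themselves stay untagged, full-width IEEE data; all type information lives in the extra header word.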
Dynamic Operand Coercion
Many numeric operations generally appear in the instruction set of scientific processors. Often a subset of equivalent scalar opcodes appears in vectorized forms as well. Normally the programmer (or the compiler) chooses the correct opcodes for the data types used in each program. This is fine for some applications that have static data types. For others, any change in the input data type often requires a recompilation effort. For general programs, code for testing the input data types used must be added to accommodate the dynamic nature of the input. There are two undesired side-effects in this method: 1) the extra code increases the size of the program, thus increasing the demand on a generally critical system resource, input/output to memory storage; 2) the added test and branch opcodes decrease the efficiency of the processor's (pre-)fetching mechanism.
The second side-effect is greatly magnified in a vector processing system in which the functional units are pipelined. Thus, the Dynamic Operand Coercion scheme (DOC) is supported in the ANP.
Dynamic Operand Coercion (DOC) is a mechanism built into the ANP architecture to support dynamic data type checking and operand coercion. The programmer can describe the numeric operations that are required to accomplish a goal without any consideration of the input data types involved. The ANP will do dynamic type checking and coerce the operands if necessary. The implementation is such that there is no overhead when no coercion is done (i.e. if the types are identical).
Footnote: In effect, the vector length is just the arity of the structure.
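The behaviour described for DOC can be modelled in software as a type check followed by an optional conversion. The sketch below is not the ANP's coercion logic: the promotion order in `RANK` and the function names are assumptions made for illustration, since the text does not spell out the coercion rules.

```python
import operator

# Assumed promotion order, narrowest to widest (hypothetical).
RANK = {"int32": 0, "int64": 1, "float32": 2, "float64": 3}


def coerce(value, dst_type):
    """Convert a Python number to the representation class of dst_type."""
    return float(value) if dst_type.startswith("float") else int(value)


def doc_apply(op, a, a_type, b, b_type):
    """Apply op to two typed operands, coercing only on a type mismatch.

    Returns (result, result_type, number_of_coercions).
    """
    if a_type == b_type:
        return op(a, b), a_type, 0  # fast path: identical types, no coercion
    target = a_type if RANK[a_type] > RANK[b_type] else b_type
    coercions = 0
    if a_type != target:
        a, coercions = coerce(a, target), coercions + 1
    if b_type != target:
        b, coercions = coerce(b, target), coercions + 1
    return op(a, b), target, coercions
```

For example, `doc_apply(operator.add, 1, "int32", 2.5, "float64")` converts only the integer operand, while two operands of the same type take the first branch and are never converted, mirroring the no-overhead case noted above.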
Extended Numeric Register Set
The architecture of the ANP extends the PLM programming model with a number of data and state registers. There are eight general purpose data registers, [...] ABSolute and NEGate instructions can be applied to all numeric data types in scalar and vector forms.
The purpose of the ANP is to supplement the PLM symbolic processor with high performance numeric operations while maintaining upward compatibility with the existing PLM's Instruction Set Architecture. This is accomplished with an extension of numeric data types and instructions, as described in the previous sections, and an architecture that efficiently supports these new extensions. The ANP functions as a co-processor to the PLM. The programmer perceives the ANP/PLM execution model as if all numeric instructions are executed in the PLM. In systems where an ANP is not present, numeric operations are emulated in software via traps to the host processor.
The Execution Pipeline
One of the bottlenecks in the ANP/PLM system lies in the memory bandwidth of the PMB. Although the PMB has separate Opcode and MAR address busses, the Memdat bus is multiplexed between code and data. Furthermore, the memory access cycle time on the PMB is one Pclk cycle, which is equal to two Fclk cycles (50ns each).
As in most high performance vector computer designs [13], pipelining is employed to increase the throughput of the system by overlapping execution of ANP instructions. This is accomplished with a two-word deep instruction buffer and a dynamic execution pipeline. The ANP execution pipeline can issue a register-to-register instruction every two Fclk cycles for most scalar operations. For vector and other scalar operations the execution pipeline is dynamically stretched to accommodate multi-cycle instructions. Maximum performance is obtained for scalar operations while no penalty is incurred on others. The pipeline stages after Decode are repeated for every element of a vector operation.
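The timing figures quoted for the PMB and the execution pipeline imply a simple upper bound on the scalar issue rate. The short calculation below uses only the 50ns Fclk period and the two-Fclk issue interval from the text; it says nothing about vector operations or memory stalls, and the function name is illustrative.

```python
FCLK_NS = 50                     # Fclk period given in the text
PCLK_NS = 2 * FCLK_NS            # one Pclk cycle equals two Fclk cycles
ISSUE_INTERVAL_NS = 2 * FCLK_NS  # one scalar register-to-register issue


def millions_per_second(period_ns):
    """Convert an event period in nanoseconds to millions of events per second."""
    return 1000.0 / period_ns


peak_scalar_issue = millions_per_second(ISSUE_INTERVAL_NS)
peak_memory_cycles = millions_per_second(PCLK_NS)
```

Under these figures the scalar register-to-register issue rate and the PMB memory cycle rate both come out at 10 million per second, which is why the shared Memdat bus matters as a bottleneck for operand traffic.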
While most scalar register files are latches, vector register files are often implemented by Random Access Memory (RAM) with additional logic to control multi-port access, arbitration and timing. Adding the Write stage in the vector pipeline allows less complicated timing circuitry for access control and a slower RAM may be utilized in an implementation.
Pipeline Interlock
Two common problems in a pipeline design are structural and data dependent hazards [10]. Structural hazards occur when two pieces of data attempt to use the same stage of a pipeline simultaneously. This hazard may occur if care is not taken during the design and scheduling of the pipeline. An example in the ANP design is the [...] [10] and the micro-code flow-chart method described in [15].
Data dependent hazards occur when the state of the pipeline depends on other stages. These happen when two separate instructions attempt to access the same data (e.g. register, memory location) when their executions are overlapped in the pipeline. There are three such hazards:
Read-After-Write (RAW) - The data read by a (logically) preceding instruction is modified by a following instruction before the access is completed. The data read by the earlier instruction is no longer the initial content but the new value written by the second instruction.
Write-After-Read (WAR) - The read by a following instruction occurs before the data is modified by a (logically) preceding instruction. The data read is stale.
Write-After-Write (WAW) - The write by a following instruction occurs before the write by a (logically) preceding instruction. The data finally stored is the stale data from the earlier instruction.
The ANP pipeline design is free from data dependent hazards by 1) maintaining the sequentiality of instruction flow, 2) proper scheduling of instructions through the pipeline [10, 14, 17], and 3) employing a pipeline interlocking technique [12]. The RAW and WAW hazards cannot occur in the ANP execution pipeline because of serial execution of instructions. Read and Write (Wx for vector instructions) accesses are scheduled so that they are aligned during the first half of an Fclk cycle (as shown in figure 6). A Pipeline manager within the OCU detects and resolves any potential WAR hazards by forwarding the new data from the current Store stage directly to the input of the EU via a feedback path.
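The forwarding path described above can be sketched as an operand read that first checks the instruction currently in the Store stage. The sketch uses the hazard naming of this paper, in which WAR denotes a following instruction reading a value before a preceding instruction has written it. The register names and the `read_operand` helper are hypothetical.

```python
def read_operand(reg, regfile, store_stage):
    """Return the operand for the EU, forwarding from the Store stage if needed.

    regfile     -- dict mapping register name to its committed value
    store_stage -- (dest_reg, value) of the instruction now in Store, or None
    """
    if store_stage is not None and store_stage[0] == reg:
        # The preceding instruction has not written reg back yet: take the
        # new value from the feedback path instead of the stale register.
        return store_stage[1]
    return regfile[reg]
```

A read of a register that is not the pending destination falls through to the register file unchanged, so the check adds no cost to independent instructions in this model.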
Bus Interface Unit
The Bus Interface Unit (BIU) is responsible for all communications between ANP, PLM and the memory system. The primary function of the BIU is to enforce the Co-processor Interface Protocol (CIP) for orderly transfer of instructions and data between ANP, PLM and the memory system. The co-processor interface consists of physical connections that enable the transfers of information between units attached to it, and a controlling mechanism that implements the CIP.
System parameters, such as interrupt enables and mode controls, can be written to a Control Register (CR) for initialization and debugging of the ANP hardware. Condition codes and flags from execution of ANP instructions are written into a Status Register and can be accessed by the PLM or the memory system from the BIU.
The PLM was designed as a single processor system. The interface between the PLM and the memory system has no support for additional (co-)processors. Thus, an interface protocol is needed for the addition of the ANP.
The Co-processor Interface Protocol (CIP) is a bus protocol for initialization, usage of the PMB (for transfer of instruction and data), and exception handling between the ANP, PLM and the memory system. Specialized micro-routines can be loaded for the processing of specific applications (e.g. DSP, CAD).
Transfers of Instruction and Data
The execution model of the ANP primarily consists of a sequence of load, execute and store operations.
When an ANP opcode is detected on the Opcode Bus, the BIU enters one of three modes:
Load - The ANP waits for the transfer of operand(s) from memory into its internal Fx/Bx registers via the PMB data bus. Address calculation and initiation of the memory transfer are done in the PLM.
Execute - A register-to-register operation in the ANP. Data coercion may be done to complete the operation. Fbusy* is asserted during the execute cycle when the IBuffer is full. Fexcept* may be asserted on the PMB when an external exception is detected.
Store - The ANP transfers operands from its internal Fx/Bx registers to memory along with data type (numeric header) information. The data structure is built on the top of the Heap. Address calculation and initiation of memory transfer are done in the PLM along with the updating of state registers (e.g. H, TR).
When a numeric opcode appears on the Opcode Bus (on the PMB), the ANP as well as the PLM receive the opcode. The ANP transfers the opcode into the instruction buffer (IBuffer) during the Fetch stage of the pipeline and decodes it during the Decode stage. The PLM calculates the address and fetches or stores operands needed to complete the operation. The ANP is synchronized to the PLM to receive/supply the operands from/to the memory system respectively. The sharing of the PLM address, data and control busses simplifies the protocol to incorporate the ANP in a PLM system. The co-processor architecture allows the ANP to access all of the resources available to the PLM. This includes instruction (pre-)fetching, (cache) memory access, memory management and interface to the host system. Furthermore, the PLM's debugging environment can be easily enhanced to accommodate the ANP. This allows fast prototyping and verification of the design in both hardware and software.
Address generation for the ANP is done in the PLM for two reasons:
1) A shadow register set in the ANP is not needed.
2) The overheads of maintaining and synchronizing two separate processor states are eliminated.
A shadow register set would increase the complexity and cost of the hardware. The overheads for the synchronization and maintenance of two separate sets of data and state registers are large. Any change in one of the register sets necessitates a broadcast to the other. For example, when a PLM instruction modifies an Ax register or updates the heap or trail pointer (e.g. laying a choice point or instantiating an unbound variable), the ANP needs to be notified and updated. When a numeric data structure is built or modified on the heap, the PLM needs to be updated. All of these extra transfers would further burden a critical system resource, the PMB. Moreover, additional hand-shaking protocols and signals for synchronization would be required. All of these factors would translate into a slower and more expensive system. The dual execution modes of the ANP, both as a parallel processor that shares the same Opcode Bus with the PLM and as a slave co-processor to the PLM, allow high performance execution of integrated symbolic and numeric programs while maintaining a tightly-coupled execution model in the system.
Footnote: Numeric exceptions will not fail the operation immediately but forward the exception to the exception handler in the host.
Exception Handling
The ANP is viewed as an extension of the PLM data path and register set. The fact that the ANP is physically separate from the PLM is transparent to the programmer. Thus, the exception processing is coordinated by the PLM in a manner that is consistent across all exception types, whether detected during the execution of an instruction native to the PLM or of an ANP instruction.
The processing of an exception detected by the execution of an ANP instruction involves the two basic steps:
Internal exception - The ANP detects an exception that can be resolved within its data path (e.g. by operand coercion) and continues on with the execution. Examples are invalid operands and invalid operations.
External exception - The ANP detects an exception and internal exception processing cannot resolve the exception. An external exception is sent to the PLM, which in turn forwards it to the host for further exception processing.
The ANP internal exceptions are caused by invalid operands or invalid operations. These are handled by the Dynamic Operand Coercion mechanism in the Operand Coercion Unit. The ANP external exceptions are any errors other than internal exceptions. These types of exceptions, which include those described in the IEEE Standard 754, cannot be resolved within the ANP and an exception interrupt is generated to the host processor for further handling.
Operand Coercion Unit
The Operand Coercion Unit (OCU) implements the Dynamic Operand Coercion (DOC) mechanism, which provides partial instruction decoding, operand type checking and coercion, and vector operation management for the Execution Unit (EU). Since data typing information is not specified in Prolog code, an ANP instruction with mismatched operand type/size must be detected during run-time by means of an internal exception to the MCU, or by sending a failure or (external) exception signal to the BIU.
The OCU takes as inputs an 8-bit ANP Fopcode (most significant byte of the Darg latch from the BIU) and the Hx registers from the two input operands. It then provides the MCU with the entry point of the micro-routine to perform the instruction or call a data coercion subroutine to process an internal exception. When no (internal exception) coercion is needed, the OCU generates the required opcode, feu_op, to the EU. Otherwise, an 8-bit conversion opcode, conv_op, is sent to the EU for operand coercion. Upon completion of coercion, the feu_op is sent to the EU for the execution of the instruction. Both feu_op and conv_op are opcodes of the BIT chip set [1].
The second function of the OCU is to maintain the vector count for the instruction currently executing in the EU. If there is a mismatch between scalar/vector and instruction/data, a failure signal is sent to the BIU. There are two counters, each tracking the elements in the Load and Store stages of the vector pipeline. Vector operands with length up to 65536 are supported in the ANP. For operations that involve two vector operands, the same calculation is applied to each corresponding element of the two vectors starting from element zero.
As high precision numeric calculations are needed for scientific data processing, a set of constants is frequently used by these applications. This set of high precision numeric constants, often used in trigonometric, statistical and other numeric calculations, is difficult to derive or to reload each time the constants are used. In the SU, a set of 128 64-bit pre-defined (or user-defined) constants can be accessed by most numeric operations. All constants are identified with the most significant bit set (equal to one) in the operand byte of an opcode (i.e. register 128 to 255).
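The way constants share the operand byte with ordinary registers can be shown with a small decoder. Only the rule that the most significant bit selects a constant, giving register numbers 128 to 255, comes from the text; treating the low seven bits as the index into the 128-entry constant set, and the name `decode_operand`, are assumptions.

```python
def decode_operand(operand_byte):
    """Classify an 8-bit operand field as a register or a constant reference."""
    if not 0 <= operand_byte <= 0xFF:
        raise ValueError("operand field is one byte")
    if operand_byte & 0x80:
        # MSB set: one of the 128 constants (assumed index = low 7 bits).
        return ("constant", operand_byte & 0x7F)
    return ("register", operand_byte)
```

Under this reading, operand bytes 128 through 255 map onto constant slots 0 through 127, so an instruction can name a constant exactly as it names a register, without an extra load.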
The ANP register set is shown in figure 4 .
The Storage Unit is made up of a set of multi-port register files and several 32/64-bit busses connecting them to each other and to other functional units. Maximum throughput is maintained in the Execution Unit by combining the SU and [...]
Store - The content of the Wx register is written in the Bx register file.
Micro Control Unit
The heart of the ANP is a Micro Control Unit (MCU) which consists of a micro-program sequencer, a 96-bit horizontal writable control store, and other circuitry to handle exception processing and initialization of micro-code. In normal instruction sequencing, one of four address sources is selected to generate the next micro-program address. These address sources include the modified next address seed (BRANCH) as described above, a subroutine entry address (SUBR), a subroutine return address (SRA) and the decoded opcode (OP) from the OCU. Selection among these four sources is made in the next address selection logic (NASL) under the control of a 5-bit field in the micro-word.
The concept of the subroutine call in high level programming languages allows the programmer to write compact and modular code. The SUBR/SRA pair implements a micro-subroutine call by pushing the return address (modified next address seed) in the SRA, a one-word subroutine stack, and making a jump to the subroutine entry address provided by the SUBR. Besides entry points for all the micro-subroutines, the SUBR also contains entry points of common routines such as initialization and exception processing. A mechanism, under the control of the BIU, is provided in the micro-sequencer to support system related functions such as transferring the processor states to the host for debugging. A control return register (CRR) is placed at the output of the NASL branching logic to support system level subroutine calls similar to the SRA. An 11-bit boot counter (BC) is selected as the next micro-program address during system initialization for transferring micro-code from the PMB into the writable control store.
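The next-address selection and the one-word subroutine stack can be modelled in a few lines. This is a behavioural sketch only: the string selectors stand in for the 5-bit micro-word field, whose encoding is not given, and the class and method names are hypothetical. The CRR and boot counter paths are left out.

```python
class MicroSequencer:
    """Behavioural model of the four-source next address selection (NASL)."""

    def __init__(self):
        self.sra = None  # one-word subroutine stack holding a return address

    def next_address(self, select, branch=None, subr=None, op=None):
        if select == "BRANCH":
            return branch            # modified next address seed
        if select == "SUBR":
            self.sra = branch        # push the return address into the SRA
            return subr              # jump to the subroutine entry address
        if select == "SRA":
            if self.sra is None:
                raise RuntimeError("return without a pending call")
            ret, self.sra = self.sra, None
            return ret               # resume after the call site
        if select == "OP":
            return op                # entry point decoded by the OCU
        raise ValueError("unknown address source: %r" % (select,))
```

Because the SRA holds a single word, a second SUBR selection before a return overwrites the saved address in this model, which is consistent with micro-subroutines that do not nest.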
Performance Measurements
Evaluation of the ANP is done in two steps. First, a register transfer level simulator provides a means for the evaluation of the microarchitecture of the ANP. Second, a hardware implementation will be constructed and tested with calculations that are too large to be simulated. Preliminary performance measurements were obtained from simulation of the design using a set of benchmark programs written in Prolog.
Measurement Results
A set of Prolog programs, including modules translated from (double precision) Whetstone benchmark modules, is used to verify the correctness and measure the performance of the ANP/PLM system. 
