This paper describes architectural considerations which led to the design of a fast programmable processor made from ECL bit-slices. The processor will be used as an on-line data filtering engine for high energy physics experiments. Unlike prior designs of such engines, the processor supports both user (horizontal) microcode and emulation of the PDP-11 fixed point instruction set (without memory management and multiple interrupt levels). In addition to an overview of the techniques used to achieve an execution speed of roughly three times that of the PDP-11/70 CPU, strengths and weaknesses of bit-slices are discussed, as are the use of a Signetics meta assembler and the ISPS architecture simulation system.
Introduction
The purpose of this paper is to present the design of a MICroprogrammable filtering "Engine" (MICE) built from ECL bit-slices. The engine will run high energy physics data-reduction algorithms at substantially higher speeds than can be achieved with standard minis, yet preserve the latter's full programmability.
Motivation for on-line data reduction in high energy physics
Research in high energy physics is directed towards investigation of ever rarer phenomena at ever higher energies. This has been made possible by improvements in particle accelerator technology, the development of new forms of electronic particle detectors and the introduction of computers for data acquisition and control in experiments.
As a result of the pursuit of rare particle interactions, the typical experiment now collects vast quantities of numeric data, often more than 90% of which is useless "background", i.e. non-interesting particle interactions. Traditionally the experimental data is recorded on magnetic tape and the "good" events (particle interactions) are filtered from the often overwhelming background by resource consuming off-line data analysis on the largest available number crunching facilities.
In order to get a feeling for the problem, consider a typical experiment which generates as many as 10^6 to 10^7 consecutive particle interactions per second. A fast hardwired first level "trigger" reduces this rate to about 1000 per second by looking for relatively simple time and space correlations characteristic of the type of interaction desired, typically with a few nanoseconds time resolution. When the trigger logic signals the detection of an event, of the order of 1000 16-bit data words should be read out from the detectors into the Data Acquisition Computer (DAC). A typical time for a block transfer of 1000 words from the detectors to the Data Acquisition Computer's memory, including software overhead, is 4 msec. During the read-out the trigger is disabled ("dead") and all potential triggers occurring during this period are therefore lost. Figure 1 shows, for particle interactions occurring randomly in time, the fraction of potential triggers which can be collected by the DAC as a function of the "dead-time" imposed by the read-out of the event. For our typical experiment only 20% of the potentially triggerable interactions can be recorded; hence the physicist is unable to take full advantage of the expensive resources at his disposal. In addition perhaps less than 1% of the events actually collected are of the "good" type, and several hundred hours of CDC 7600 equivalent CPU time will be required to filter them out.
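The 20% figure can be reproduced with a short calculation. The sketch below assumes the standard non-paralyzable dead-time model for randomly arriving triggers, in which the recorded fraction is 1/(1 + R*tau) for a trigger rate R and a dead-time tau per recorded event. Both the model and the function name `recorded_fraction` are assumptions made for illustration; the text itself only presents the curve of Figure 1.

```python
def recorded_fraction(trigger_rate_hz: float, dead_time_s: float) -> float:
    """Fraction of randomly arriving triggers that can be recorded.

    Assumes a non-paralyzable dead time: after each recorded trigger the
    system is blind for dead_time_s, and triggers arriving meanwhile are lost.
    """
    return 1.0 / (1.0 + trigger_rate_hz * dead_time_s)


# Numbers from the text: 1000 triggers per second, 4 ms read-out per event.
fraction = recorded_fraction(1000.0, 4e-3)
print(f"recorded fraction: {fraction:.0%}")
```

Under this assumed model, shortening the average dead-time per trigger is the only way to raise the recorded fraction at a fixed trigger rate, which is the motivation for a fast second level filter.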
Clearly, if we introduce a second level trigger which can reject background events in a time much shorter than the event read-out time, and thereby reduce deadtime for rejected events (the majority), we would gain in two ways:
1) The expensive accelerator and detector resources would be more efficiently used, because fewer good trigger events are lost due to deadtime for read-out of unwanted events.
2) The amount of expensive off-line number crunching would be reduced, because of the greatly improved fraction of good events in the recorded sample.
Assume that a second level trigger with a decision time of 100 µs accepts 10% of the events treated. The average deadtime per event would then be 100 µs + 0.1 x 4 ms = 0.5 ms. In this case we would be able to record 67% of the available good events at the previous trigger rate of 1000 per second, while reducing by a factor (0.2 x 1000)/(0.67 x 1000 x 0.1) = 200/67 the volume of data to be analysed off-line. Therefore we have a (67:20) x (200:67) = 10:1 "improvement" ratio.
CH1494-4/80/0000-0278 $00.75 © 1980 IEEE
Rationale for a microprogrammable emulation engine
The key idea of the on-line filtering concept is to give the greatest throughput of "good" events to the Data Acquisition Computer by applying progressively more restrictive but more time-consuming event selection procedures. Typical is the two level filter, where the first level is provided by the simple, fast hardwired "primary" trigger. The second level is provided by slower hardwired or programmable logic performing more sophisticated filtering algorithms on the already reduced event stream.
The ideal on-line filter would be both extremely fast for manipulation of low precision (16-bit) integer data and flexible to allow easy adaptation to the changing requirements of a physics experiment.
In the study of physical phenomena, the important parameter is the number of events collected; in principle the rate at which they are collected is immaterial. However, the faster it is, the less time experimental resources are tied up, and therefore the less expensive the experiment. On the other hand, it is in the nature of high energy experimentation that trigger algorithms evolve constantly for ever greater selectivity during the many months of data collection and experimentation, so that flexibility and ease of change are often of greater importance than speed. If speed criteria dominate, the filter would be hard-wired [2], whilst if flexibility and ease of implementation are more important, software on standard minis [3] or specially designed processors can be used [4].
The MICE is meant to provide a good compromise between the speed of hardwired logic and the flexibility of (mini) software. Unlike prior designs of microprogrammable filters [4], our engine has been designed explicitly both to interface easily to an experiment and to be an efficient host for microprogrammed emulation of a standard mini. Thus the user can write low level microcode for time critical algorithms and program non time-critical tasks at a conveniently high level, making use of standard software development tools with which he is probably already familiar. We chose to emulate the PDP-11 architecture because of the symmetry of the instruction set, its memory mapped I/O and its wide acceptance within the high energy physics community. Our PDP-11 fixed point instruction set emulator runs at about three times the speed of Digital Equipment Corporation's PDP-11/70 CPU operating out of cache memory with 100% cache hits.
We found that the speed ratio between hardwired algorithms and software on a PDP-11/70 is typically 50:1. MICE will allow the following speed/flexibility tradeoffs. When running on the high level but relatively slow emulated PDP-11, algorithms will run about three times faster than on a PDP-11/70. When coded in low level but faster and still flexible microcode, they should run 5 to 10 times faster than on a PDP-11/70 (that is, 5 to 10 times slower than in hardware). When running on special purpose hardwired units controlled by MICE, their speed should come even closer to that of strictly hardwired processors. MICE supports only those architectural features which are necessary for on-line filtering; hence there is no floating point hardware, no memory management and only a simple one level interrupt facility. These simplifications reduced the complexity of the host design, and allowed us to concentrate design effort on an efficient overlap scheme for target instruction fetch, decode and execute, and some other special features for fast execution of both target code and microcode.
Integration of the Filtering Engine into the Experiment
The engine is not intended to replace the data acquisition and control computer of the experiment, but it should prefilter the data stream from the experiment. The engine is therefore controlled by the Data Acquisition Computer (DAC). There are many ways in which the data acquisition system can be structured in order to use an engine (or multiple engines) to prefilter the data stream. However, all such systems require the fastest possible rate of data transfer between the experiment and the filter engine. Figure 2 shows a typical system organisation using a single engine operating on a single data stream. The engine performs the filter algorithm on a subset of the event data made available to it via a fast readout system [5] connected via a DMA controller to the engine's memory. If an event is to be rejected, the engine clears the system by issuing the REJECT signal and awaits the next trigger interrupt. If an event is to be accepted, the engine interrupts the DAC, which then reads out the complete event from the detector.
Clearly the filter engine will increase the throughput of "good" events only if the time it requires to read out and process the subset of data is shorter than the time required to read out the complete event via the slow readout. Other configurations are possible where multiple engines work in parallel on multiple data streams in order to increase throughput [1].
Interfaces
Figure 3 is a block diagram of the engine and its interfaces. The engine is interfaced to the DAC via the CAMAC-interface module, which uses the sub-address and function codes of a single CAMAC station [6]. Some of the CAMAC-interface commands available to the DAC are:
1) Loading and dumping of programs and data to/from writeable control store (WCS) or target memory (TM).
2) Commands which control the engine's clock, e.g. hold the clock, suspend the CPU, run the clock n cycles and then hold it (for hardware debugging).
3) Commands which emulate the PDP-11 console, i.e. halt, load program counter, start, continue.
4) A command which causes diagnostic microcode in MICE to dump the internal registers of the CPU into a register in the CAMAC-interface at the end of every micro-cycle, from where they can be transferred to the DAC.
5) Commands which handle interrupts to or from the engine, e.g. load MICE interrupt register, read LAM* register, load LAM register, test LAMs, etc.
The DAC loads or dumps WCS or TM by block transfers over the "external" bus (E-bus). The E-bus is also used by the DMA-interface for fast block transfers of data from the experiment to the TM.
The CPU normally accesses TM via the fast ECL "processor" bus (P-bus). The P-bus is interfaced to the U-bus, a slower asynchronous TTL bus which is a subset of the DEC "UNIBUS". The U-bus, used for input/output to slower devices, is a logical extension of the P-bus and operates by holding off the clock until the U-bus cycle is complete. Hence all devices on the P and U-buses can be accessed in a single micro-cycle irrespective of their speed. For design simplicity, neither the P-bus nor the U-bus supports vectored interrupts or bus mastership arbitration. Hence the U-bus can only support simple, non-DMA UNIBUS devices, such as the teletype or papertape reader, where interrupts are replaced by software status tests.
Overview of Design Objectives and Tradeoffs
The key design objectives were:
1) Simplicity of design and short implementation time.
2) High execution speed.
3) Simplicity for the programmer.
4) Low cost.
Design and implementation time was considered the most critical factor, even more so than raw execution speed. Low design and manufacturing cost to allow replication was given lowest priority. Thus simplicity of design was emphasized, leading us to select bit slices as the highest level building blocks. Though slower than discrete ECL logic, we felt the use of the full functionality of the Motorola 10800 ECL bit slice family would give us a modular, flexible and relatively simple design at still acceptable speed. Prior experience with this family [7] for design of simple processors demonstrated the functional adequacy and tractability of the M10800 family.
Simplicity and speed dominating cost led us to impose no artificial constraints on micro-instruction word width: in our classically horizontal host we do not "pattern" or "program" a micro-instruction field for multiple uses as a function of mode bits, nor do we generally encode fields that control the bit slices. Only the ALU slice is steered by external decoding logic (ROM). This direct control also makes the design more understandable for the microprogrammer in that the host is very regular, with private groups of fields controlling each slice/functional unit. As explained in the section "Findings", however, the microprogrammer need not be concerned with a 49 field, 112 bit micro-instruction.
*) LAM: CAMAC interrupt, see Ref. 6.
Simplicity for the eventual programmer also led us to design the host to make it easily accessible from the target level but also to keep the need for having to microprogram to a minimum. For example, we implement a "jump to micro-address" instruction, and also a REPEAT n instruction which repeats the next target instruction n times. Typically the repeated target instruction will be a 2 operand arithmetic instruction with at least one operand in auto-increment mode. Repeating it provides, at the target level, a simple fast vector instruction and eliminates the overhead of loop control (i.e. decrement counter, test for zero, and branch back). For example, a read/modify/write arithmetic vector operation on target memory can be performed in three micro-cycles per operand (micro-cycle time is 105 ns). Note that the host supports repeated execution of micro-instructions and micro-subroutines, which allows vector operations to be implemented at the microcode level at even faster rates, e.g. search of a target memory vector in one micro-cycle per element.
Finally, we realized from the start that a significant investment in software could probably help decrease the total hardware design and implementation time and in any case would especially be useful for subsequent use of the system. Thus we traded off hardware design time for finding and adapting to our purposes a good, general purpose microassembler, and for writing a complete simulator of the host and its interfaces. For the latter purpose we used Carnegie Mellon University's ISPS, a register transfer machine simulator (see section "Findings"). Probably half the total manhours on the project were in fact spent on customising and using the microassembler and ISPS tools; we are confident the investment was more than worth it.
Implementation Strategies and Host Architecture
The goals of simplicity and short implementation time made it mandatory to use, where possible, standard techniques for achieving speed rather than developing innovative but complex solutions. This resulted in a fairly standard use of all the slices (see Fig. 4). We have the ALU (MC10800), the Register File (MC10806), the Micro-sequence Controller (MC10801) and the Target Memory-interface (MC10803). A four phase clock with a cycle time of 105 ns is provided by the Programmable Clock circuit (MC10802). Additional logic is used to enhance flexibility and/or speed; see section "Slices are not always sufficient for their intended purpose".
In order to meet the design goals we tried to:
1) Keep the 40 ns target memory as busy as possible prefetching instructions and operands.
2) Do as much pipelining as possible without needing the complexity of residual control (having the state of the machine during one micro-instruction depend on prior instruction states).
3) Have all micro-instructions take the same time by computing the address of the next micro-instruction and fetching it always during the current cycle, even for conditional branches.
We tried to balance the architecture so that each functional unit (and memory) could be kept busy nearly every micro-instruction and that most paths/computations would take roughly the same amount of time. The cost for this logical and physical simplicity is a somewhat (25%) longer cycle for the simplest arithmetic instruction, which is considered acceptable.
There is no arbitration logic for resolving memory access conflicts between the P-bus and the E-bus (Fig. 3); instead the E-bus may access WCS or TM by suspending the CPU, i.e. holding the CPU clock and switching the memory ports to the E-bus. This solution allowed us to keep memory accesses from the P-bus (the majority in our applications) very fast (~40 ns). It was the same design goal which led us to limit the electrical load on the P-bus by connecting I/O devices to the U-bus.
To achieve balance and homogeneity we tried to reduce data and control flow to common patterns. For example, the way in which Source and Destination operands are fetched is the same and independent of the opcode of the target instruction. Effectively, execution microcode deals with prepared (prefetched) operands from a "standard place", without knowledge of (i.e. dependency on) the addressing modes used. To this end we prefetch operands into the Register File, into which ALU results are also stored (and possibly written out to Target Memory on the next cycle). A target register-to-register operation takes place in one micro-cycle (Read/Modify/Write of the Register File); because of the decoupling of operand fetching and target instruction execution, the same micro-instruction can also be used to execute a target memory to register instruction.
The micro-instruction pipeline scheme
The micro-instruction pipeline scheme is shown in Fig. 5. During the current micro-cycle we always calculate the address NA(i+1) of the next micro-instruction (in the Micro Sequencer or Fork logic as described below) and fetch it into the micro-instruction pipeline register (µF(i+1), STR PR), ready for execution in the next cycle. The Micro Sequencer is therefore strobed at the midpoint of the cycle (STR M2). We chose this scheme because of its simplicity and the resulting ease of writing and understanding the microcode.
To be able to calculate the next micro-address and fetch the corresponding micro-instruction in one cycle, it was necessary to add external table lookup logic (i.e. mapping ROMs) in order to (conditionally) "fork" on:
1) The Source and Destination operand addressing modes.
2) The target instruction op code.
3) Interrupts and internal hardware status signals.
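The first two fork conditions can be pictured as a table lookup on the fields of the prefetched instruction word. The sketch below is illustrative only: the field layout is the standard PDP-11 double-operand format, but the selection rule (fetch the source operand first, then the destination operand, then execute, with register mode 0 needing no fetch) and the labels `SRC_FETCH`, `DST_FETCH` and `EXECUTE` are assumptions standing in for the contents of the mapping ROMs, which the text does not list.

```python
def decode_double_operand(word: int) -> dict:
    """Split a 16-bit PDP-11 double-operand instruction into its fields."""
    return {
        "opcode": (word >> 12) & 0o17,   # bits 15-12
        "src_mode": (word >> 9) & 0o7,   # bits 11-9
        "src_reg": (word >> 6) & 0o7,    # bits 8-6
        "dst_mode": (word >> 3) & 0o7,   # bits 5-3
        "dst_reg": word & 0o7,           # bits 2-0
    }


def fork_target(word: int) -> tuple:
    """Pick the first micro-sequence for a freshly decoded instruction.

    Assumed rule: mode 0 (register) operands are already in the register
    file, so only non-zero modes need an operand fetch micro-sequence.
    """
    fields = decode_double_operand(word)
    if fields["src_mode"] != 0:
        return ("SRC_FETCH", fields["src_mode"])
    if fields["dst_mode"] != 0:
        return ("DST_FETCH", fields["dst_mode"])
    return ("EXECUTE", fields["opcode"])


# ADD X(R0),(R1) assembles to octal 066011: source mode 6, destination mode 1.
print(fork_target(0o066011))
```

In hardware the same decision is taken by ROM lookup within the cycle rather than by sequential tests, which is what lets the fork complete without extra micro-cycles.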
During the micro-cycle in which the fork occurs the mapped micro-address is also loaded into the sequencer's micro-address register (CR0) so that it is available for the next micro-address calculation in the following cycle. The fork operation is enabled during the last micro-cycle of every target instruction execution in order to decode the prefetched next target instruction. The resulting fork is either to a source operand fetch sequence, or to a destination operand fetch sequence, or to the target instruction execution sequence. Internal status signals can effectively interrupt the normal microprogram sequence after any micro-cycle via hardware which forces a fork independently of the microprogram control (examples are bus timeout, odd address errors, CAMAC interface control lines for activating diagnostic microcode, and PDP-11 console emulating control signals such as HALT, START and CONTINUE).
Target Instruction Pipeline Scheme
The target memory pipeline scheme prefetches next instructions, operands or data for operand address calculation according to one of two standard schemes. The scheme used depends on whether a result is written back to Target Memory (TM).
[Diagrams of the two target pipeline schemes are not recoverable here; one surviving label reads "Execute I(j), no TM write".]
This degree of overlapping is supported by the three instruction pipeline registers T, IR and A, and by the use of the Memory Interface ALU for effective address calculations in parallel with the main ALU operation. Most normal PDP-11 one or two operand instructions take only a single execution micro-cycle, after any required source and destination fetch cycles. Thus the target memory is used nearly every cycle for reading or writing. Only if the T register is already full or if an extra cycle is needed to compute the next target fetch address (auto-decrement, certain cases of successful branches) is the target memory not active.
(Note that this simple pipelining scheme makes certain operations which we consider "unstructured" in our environment either illegal (e.g. operating on PC) or legal but erroneously executed (calculating an operand in instruction I(j) for storing as part of I(j+1) or I(j+2)).)
Special features
As explained in the section "Overview of Design Objectives and Tradeoffs", we introduce new target instructions in order to facilitate the task of speeding up the time-critical parts of programs written in a higher level language (e.g. ASSEMBLER or PL-11 [8]).
Thus we have introduced the target instruction "jump to micro-code" (JMC), which provides a general mechanism to migrate functions from Assembly language code to user microcode.
Another special feature offers the possibility of vectorising single instruction loops from the target level by preceding the instruction to be vectorised with the new instruction "repeat next target instruction" (RPT). The RPT instruction sets the MICE into target instruction repeat mode, where the incrementing of the target PC is inhibited and the repeat counter is decremented at each instruction decoding fork until it becomes zero, whereupon the MICE leaves the target instruction repeat mode. Repeated execution of an instruction using autoincrement/decrement addressing modes to step through arrays in memory removes the overhead of loop control, e.g.:
AGAIN:  ADD  R0,(R1)+
        SOB  R2,AGAIN
can be replaced by:
        RPT  COUNT
        ADD  R0,(R1)+
The first requires six micro-cycles per element of the array, whilst the second requires only three micro-cycles per element. Addressing modes which alter the PC are illegal in repeat mode and are trapped in the instruction decoding logic.
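The saving can be illustrated with a small behavioural model. This is a sketch under stated assumptions, not the MICE microcode: memory is modelled as a Python list indexed by word rather than by byte address, the three-cycle cost of the ADD is taken from the text, and the three-cycle cost of SOB is inferred from the quoted six cycles per loop pass. The function names are hypothetical.

```python
ADD_CYCLES = 3  # read/modify/write with autoincrement, as quoted in the text
SOB_CYCLES = 3  # inferred: six cycles per loop pass minus three for the ADD


def add_autoinc(mem: list, r0: int, r1: int) -> int:
    """Model ADD R0,(R1)+ on word-indexed memory; return the updated R1."""
    mem[r1] = (mem[r1] + r0) & 0xFFFF  # 16-bit wrap-around
    return r1 + 1


def run_sob_loop(mem: list, r0: int, r1: int, count: int) -> int:
    """AGAIN: ADD R0,(R1)+ / SOB R2,AGAIN. Returns micro-cycles used (count >= 1)."""
    cycles, r2 = 0, count
    while True:
        r1 = add_autoinc(mem, r0, r1)
        cycles += ADD_CYCLES
        r2 -= 1                 # SOB: subtract one, branch back if not zero
        cycles += SOB_CYCLES
        if r2 == 0:
            return cycles


def run_rpt(mem: list, r0: int, r1: int, count: int) -> int:
    """RPT COUNT / ADD R0,(R1)+. The PC is held; only the repeat counter changes."""
    cycles, repeat = 0, count
    while repeat > 0:
        r1 = add_autoinc(mem, r0, r1)
        cycles += ADD_CYCLES
        repeat -= 1             # decremented at each instruction decoding fork
    return cycles


a, b = [1, 2, 3, 4], [1, 2, 3, 4]
print(run_sob_loop(a, 10, 0, 4), run_rpt(b, 10, 0, 4), a == b)
```

Both versions leave memory in the same state; the model omits the one-off cost of the RPT instruction itself, which the text does not state.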
The Micro-sequencer provides powerful facilities for repeating a single micro-instruction or micro-subroutine. These features can be used to efficiently implement microcoded loops. The single micro-instruction loop can be conditionally exited by wire-ANDing the appropriate status signals to bits in the next micro-instruction address to force a conditional multi-way jump. This standard technique is available in the MICE hardware, where 16 status signals can be selected under micro-program control.
As an example of the potential speeds achievable by microprogramming in MICE, consider the following assembly language program which scans an array in memory for an element smaller than the value in R1 (a process typically encountered in high energy physics filtering algorithms):
        MOV  COUNT,R2
LOOP:   CMP  (R0)+,R1
        BLT  OUT
        SOB  R2,LOOP
OUT:    NOP
This is executed on the PDP-11/70 in a time of 1.65 microseconds per loop pass. On MICE the same algorithm has been executed in a single micro-instruction loop using the repeat micro-instruction mode, the micro-branch address modification technique and the target memory instruction pipeline to overlap:
1) The conditional micro-branch on compare result of element j.
2) The compare of element j+1 (in ALU).
3) The fetching of element j+2 (into T register).
4) The autoincrement of R0 (in Memory Interface ALU).
Thus the search is performed at a speed of 105 ns per array element, a factor 16 times faster than in assembler on the PDP-11/70.
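The overlap can be tabulated to show why one element completes per micro-cycle in steady state. The sketch below is a simplified three-stage model (fetch, compare, branch); the two fill cycles at the start and the function name `pipeline_schedule` are assumptions for illustration, while the timings are the ones quoted in the text.

```python
MICE_CYCLE_NS = 105        # micro-cycle time quoted in the text
PDP11_70_LOOP_NS = 1650    # per loop pass on the PDP-11/70, from the text


def stage(index: int, n_elements: int):
    """Return the element index a stage works on, or None if the stage is idle."""
    return index if 0 <= index < n_elements else None


def pipeline_schedule(n_elements: int) -> list:
    """List (cycle, fetched, compared, branched-on) element indices per cycle."""
    rows = []
    for cycle in range(n_elements + 2):  # assumed: two cycles to fill the pipe
        rows.append((
            cycle,
            stage(cycle, n_elements),      # element fetched into the T register
            stage(cycle - 1, n_elements),  # element compared in the ALU
            stage(cycle - 2, n_elements),  # element whose result steers the branch
        ))
    return rows


for row in pipeline_schedule(4):
    print(row)

speedup = PDP11_70_LOOP_NS / MICE_CYCLE_NS
print(f"speed ratio: {speedup:.1f}")
```

In any steady-state row the branch stage handles element j while the compare and fetch stages handle j+1 and j+2, matching the list above; the computed ratio of about 15.7 is the figure the text rounds to 16.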
Another special feature, which proved extremely useful during debugging of the hardware, was the built-in diagnostic microcode. When the "diagnose" request signal is set by the DAC, the MICE microprogram is interrupted. The "diagnose" micro-interrupt service microcode saves the interrupted value of CR0 for its return address and then proceeds to dump all internal registers to a standard register in the CAMAC-interface, from where they are retrieved by the DAC. The MICE clock is held off until the DAC collects the data from the CAMAC-interface register. On return from the "diagnose" service micro-sequence the next micro-instruction of the interrupted microprogram is executed before the diagnose request causes the next micro-interrupt. In this way we were able to step through the microprogram from the DAC console, getting a dump of the MICE registers after every micro-cycle. This feature required minimal extra hardware and could equally well be used to gather statistics on algorithms running in the MICE (at both target and host machine levels).
It is also foreseen to connect special purpose units to the extended buffered internal CPU buses. These hardwired units would replace time consuming microcode required for certain functions, e.g. multiplication, internal product, etc.
An example of CPU operation
As a simple example of target instruction pipelining and emulation (Fig. 7), assume that during t(i) we decode I(j+1) = ADD X(R0),(R1); this is an add with source mode 6 (Register Indexed), destination mode 1 (Register deferred). Three discrete phases will be required: Source operand fetch, Destination operand fetch, and Execution. At the end of t(i), the decoding of I(j+1) will have enabled the Source operand fetch output of the decoding ROM, which resulted in the fetching of the first micro-instruction in the mode 6 source operand fetch micro-sequence (two micro-instructions). Simultaneously we prefetched the word at j+2, which is the index X in this case, and incremented the Program Counter (Prepare Fetch I(j+3)). During t(i+1) this first micro-instruction causes the effective address to be calculated from index X and source register R0 in the Memory Interface unit. At the same time the word at j+3 is prefetched (either destination operand or next instruction) to "pay back" the word previously fetched and "borrowed" as (source) operand rather than next instruction.
In t(i+2) the second Source fetch micro-instruction causes the operand to be fetched into a standard register in the Register File (RFs), and PC is again incremented to prepare the next fetch. Since this is the last source operand micro-instruction, it also enables the destination fetch micro-sequence output from the decoding ROM so that simultaneously the first (and only) Destination micro-instruction is accessed. At t(i+3) it fetches the Destination operand (into RFd) using R1 as an address (see section "Slices are not always sufficient for their intended purpose"), and enables the execution sequence output of the decoding ROM to cause the ADD micro-instruction to be fetched. This is executed in t(i+4) in parallel with the fetch of I(j+4), and at t(i+5) the result is written back in TM while PC is incremented for the next fetch, and I(j+3) is decoded by enabling the decoding ROM. Thus five micro-cycles of 105 ns are required to complete the instruction. Note that when all memory reads are from the cache, the PDP-11/70 requires 2.1 microseconds to execute the same instruction (4 times slower than our machine).
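The cycle budget of this walk-through can be tallied directly. The sketch below simply lists the five micro-cycles t(i+1) to t(i+5) as described above and applies the quoted 105 ns cycle time and 2.1 microsecond PDP-11/70 figure; the phase labels and variable names are illustrative, not taken from the microcode.

```python
CYCLE_NS = 105             # micro-cycle time quoted in the text
PDP11_70_ADD_NS = 2100     # PDP-11/70 time for the same instruction, from the text

# One entry per micro-cycle of ADD X(R0),(R1), following the walk-through.
PHASES = [
    "source fetch 1 (mode 6): effective address from X and R0",
    "source fetch 2 (mode 6): operand read into RFs",
    "destination fetch (mode 1): operand read into RFd via R1",
    "execute ADD",
    "write result back to target memory, decode next instruction",
]

total_cycles = len(PHASES)
mice_ns = total_cycles * CYCLE_NS
ratio = PDP11_70_ADD_NS / mice_ns

for offset, phase in enumerate(PHASES, start=1):
    print(f"t(i+{offset}): {phase}")
print(f"{total_cycles} cycles = {mice_ns} ns, speed ratio {ratio:.1f}")
```

Five cycles of 105 ns give 525 ns, and 2100 ns / 525 ns is exactly the factor of 4 stated in the text.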
Findings
Most of our observations probably will not surprise those experienced in design with bit slices, but we have not seen them clearly stated in the trade and formal literature.
Slices are not always sufficient for their intended purpose
The 10800 family of slices is a rich family, but the individual slices have both too much (i.e. underutilized) functionality and too little orthogonality/regularity/symmetry. We used less than half of the functions, data paths and registers of the sequencer, ALU and memory interface slices. At the same time, they lack symmetry: the ability to transfer source X to destination Y does not imply the existence of the inverse transfer, although it makes sense and was needed. This forced us to use external data paths (via multiplexors and gates) and registers to supply the symmetry. While we could not fully use the available functionality, we paid for it nonetheless in terms of chip complexity (and therefore speed).
It certainly would help if the recent trend to design symmetric, orthogonal CPU's (e.g. DEC's VAX) would also appear in the design of future slices; where technology limitations make that impossible, better documentation on how and why the subset was chosen and how to work with it would be useful.
Another reason we found it necessary to add external SSI and MSI "glue" was to bypass slice features when they would take too much time. Two main examples of this both concern the bypassing of internal address registers: one normally used to address the control store (CR0 in the micro-sequencer), the other normally used to address the target memory (MAR in the Memory Interface). As mentioned above, CR0 is bypassed with the decoding ROM for target instruction decoding (forking), because testing target address mode bits with the sequencer and then loading CR0 would take too many cycles. Similarly MAR is bypassed, for example, for the common operand address modes register deferred, auto-increment and auto-increment deferred. To use MAR we would have to fetch the target operand address from a register in the Register File and then strobe it in MAR at the end of the cycle. Instead we load the external address register (XADR) with the operand address from the Register File 30 ns into the cycle, immediately address the Target Memory, and bring the operand through the Memory Interface to the Register File to store it at the end of the same cycle. During this same cycle auto-increment of the target address may take place in the ALU.
Because of the external logic there are many ways to accomplish a given function: multiple independent data paths, registers and even ALU's. The resultant complexity and potential for writing dangerous (unpredictable) microcode must be masked from the ordinary microprogrammer by supplying him with high level macros which present him with a more structured, more vertical (encoded) architecture; see below.
Hardware design means software tools
Software tools and development are a (the?) major portion of a bit slice hardware design. Because the host must be microprogrammed, one obviously needs a powerful, convenient micro-assembler. We also used a simulation system to check out the host design.
A micro-assembler is commonly customized to the host architecture by declaring the layout of the micro-instruction (fields, subfields, default values of those subfields) to a meta assembler, and then using its macro facilities to declare "instructions" which set the fields belonging to each of the functional units in the host architecture. We found the Signetics Meta Assembler, written in FORTRAN, entirely usable, but spent several man-months writing a large set of (nested) macros (called microps by Signetics) to control assignments to groups of fields at a level well above that of controlling 49 independent fields in our horizontal instruction. These microps do extensive error checking (using IF THEN ELSE parameter checking) to make sure field values are within range and are mutually consistent within the definition of the microprogrammer's host, which is essentially a higher level, suitably restricted version of the actual hardware. Considerable effort went into error diagnostics, an area in which the Signetics assembler was somewhat weak; for example, it provides no easy way to correlate field values in separate microps for the same micro-instruction. More general macro facilities (e.g. parameter type checking) would also have been useful.
A typical set of microps, to route two operands (RS and RD) from the register-file (RF) to the ALU, add them and restore the result in RD (see Fig. 4 for the data paths), is as follows:
    RF(RS,R)OB  RF(RD,~)IB  LATCH IB  ADD(A,L)IB;
This construction hides the details of gate, multiplexor, and function code settings for the two slices and their glue, as well as the time-multiplexed use of the I-bus, and prevents inconsistent or conflicting use of the buses.
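The kind of checking that such microps perform can be illustrated with a minimal Python sketch. It is not Signetics Meta Assembler syntax and does not reproduce the MICE micro-instruction: the five fields, their widths, the ALU function codes and the helper names (MicroInstruction, rf_read, alu) are all assumptions made up for the example. It shows two ideas from the text: each field value is checked against its declared width, and two microps that try to set the same field of one micro-instruction to different values are reported as a conflict.

```python
# Hypothetical micro-instruction layout: field name -> (bit width, default).
# These five fields are invented for illustration; the real word has 49.
FIELDS = {
    "OB_REG": (4, 0),  # register driven onto the O-bus
    "OB_EN": (1, 0),   # 1 = register file drives the O-bus
    "IB_REG": (4, 0),  # register driven onto the I-bus
    "IB_EN": (1, 0),   # 1 = register file drives the I-bus
    "ALU_FN": (3, 0),  # ALU function code
}
ALU_CODES = {"PASS": 0, "ADD": 1, "SUB": 2}  # assumed function codes


class MicroInstruction:
    """Collects field assignments for one micro-instruction."""

    def __init__(self):
        self.values = {}  # field -> (value, name of the microp that set it)

    def set_field(self, field, value, microp):
        width, _default = FIELDS[field]
        # Range check: the value must fit in the declared field width.
        if not 0 <= value < (1 << width):
            raise ValueError(f"{microp}: {field}={value} does not fit in {width} bits")
        # Consistency check: a second microp may not change a field already set.
        if field in self.values and self.values[field][0] != value:
            old_value, old_microp = self.values[field]
            raise ValueError(
                f"{microp}: {field}={value} conflicts with {old_microp} ({field}={old_value})"
            )
        self.values[field] = (value, microp)

    def encode(self):
        # Pack fields in declaration order, first field most significant;
        # fields never set take their default value.
        word = 0
        for field, (width, default) in FIELDS.items():
            value = self.values.get(field, (default, None))[0]
            word = (word << width) | value
        return word


def rf_read(mi, reg, bus):
    """Hypothetical microp: drive register `reg` onto bus "OB" or "IB"."""
    if bus not in ("OB", "IB"):
        raise ValueError(f"rf_read: unknown bus {bus!r}")
    name = f"rf_read({reg},{bus})"
    mi.set_field(f"{bus}_REG", reg, name)
    mi.set_field(f"{bus}_EN", 1, name)


def alu(mi, function):
    """Hypothetical microp: select an ALU function by name."""
    if function not in ALU_CODES:
        raise ValueError(f"alu: unknown function {function!r}")
    mi.set_field("ALU_FN", ALU_CODES[function], f"alu({function})")


# Usage: two register reads and an add in one micro-instruction.
mi = MicroInstruction()
rf_read(mi, 3, "OB")
rf_read(mi, 5, "IB")
alu(mi, "ADD")
print(hex(mi.encode()))
```

A further rf_read(mi, 6, "IB") on the same micro-instruction would raise an error naming both microps, which is the kind of cross-microp diagnostic the text reports as hard to obtain in the Signetics assembler.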
The simulation of the host was the other major software project and, although entirely optional, had a major impact. We decided to simulate the host for a variety of reasons, which are becoming increasingly accepted in the hardware design community:
1) To verify the correctness of the host design, by tracing its execution of micro-instructions.
2) To be able to write and debug microcode (e.g. the target emulator) before the hardware was ready, and indeed, even afterwards; debugging on any time sharing terminal with good debugging and diagnostic features is easier than doing it on a minimally available piece of hardware.
3) To check the hardware against a non-varying standard "definition" so as to be able to "certify" it.
4) To allow a quick assessment of the impact of changes (fixes, enhancements, etc.).
5) To provide "living", i.e. dynamic, interactive project documentation which is easier to understand than static, passive diagrams, code, or text, and is always up-to-date.
6) To allow measurement, evaluation, and identification of bottlenecks.
Rather than implementing a simulator in a general purpose language (e.g. FORTRAN, PASCAL, SIMULA) we picked Carnegie Mellon University's ISPS [9] because it is specifically designed for architecture simulation, and has many useful, well debugged features and tools. It consists of a description language in which data transfers and data operations are specified, a simulator for the resultant register transfer machine, and an interactive command language for tracing, setting breakpoints, loading and interrogating register and memory values, etc. Multiple concurrent processes are supported to mirror hardware parallelism, and timing and measurement facilities are also available. Once we learned to use ISPS properly, the CPU description and simulation was completed in three man-weeks, the interface and I/O logic in another man-month.
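To make the facilities listed above concrete, the following Python sketch models a toy register transfer machine with tracing, breakpoints and a cycle counter. It is not ISPS notation and not a description of the MICE host: the Machine class, the two operations (MOV, ADD), the eight-register file and the 16-bit word width are assumptions chosen only to keep the example small.

```python
class Machine:
    """Toy register-transfer machine with tracing and breakpoints (hypothetical)."""

    WORD_MASK = 0xFFFF  # assumed 16-bit data paths

    def __init__(self, program):
        self.program = program    # list of (op, rd, rs) tuples
        self.regs = [0] * 8       # small register file
        self.pc = 0               # index of the next instruction to execute
        self.breakpoints = set()  # addresses at which run() stops
        self.trace = []           # (pc, instruction, register snapshot) per step
        self.cycles = 0           # number of instructions executed so far

    def step(self):
        """Execute one instruction and record it in the trace."""
        op, rd, rs = self.program[self.pc]
        if op == "MOV":
            self.regs[rd] = self.regs[rs]
        elif op == "ADD":
            self.regs[rd] = (self.regs[rd] + self.regs[rs]) & self.WORD_MASK
        else:
            raise ValueError(f"unknown operation {op!r} at address {self.pc}")
        self.trace.append((self.pc, (op, rd, rs), tuple(self.regs)))
        self.pc += 1
        self.cycles += 1

    def run(self):
        """Run until a breakpoint or the end of the program.

        A breakpoint at the starting address is ignored for the first step,
        so calling run() again resumes from a breakpoint.
        """
        first = True
        while self.pc < len(self.program):
            if not first and self.pc in self.breakpoints:
                return "breakpoint"
            self.step()
            first = False
        return "halted"


# Usage: load registers, set a breakpoint, run, inspect, resume.
machine = Machine([("ADD", 1, 0), ("ADD", 1, 0), ("MOV", 2, 1)])
machine.regs[0], machine.regs[1] = 5, 7
machine.breakpoints.add(2)
print(machine.run(), machine.regs[:3])
print(machine.run(), machine.regs[:3], machine.cycles)
```

The recorded trace doubles as a fixed reference against which a hardware run could be compared, and the cycle counter stands in for the timing and measurement facilities; concurrent processes, which ISPS supports, are left out of this sketch.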
Regrettably, ISPS was not installed on our PDP-10 until after most of the logical design was completed; if it had been available earlier, much tedious hand simulation of the microcode and the architecture could have been avoided. As it is, our learning investment in ISPS has been more than justified in terms of finding many (mostly small) design errors before the construction of the hardware and allowing easy testing of additions to the basic system. It currently runs exclusively on the PDP-10, unfortunately, but makes only modest demands on the computer since the simulation is dominated by man/machine interaction at the console.
Conclusion
Working with bit slices involves sophisticated, integrated hardware, firmware, and software design, using high level, complex hardware building blocks. It cannot be done the same way designers work with fixed instruction set micro-processors, which requires primarily ordinary assembly language programming knowledge and far less hardware experience. The design task is accomplished faster and more easily than with lower level (~I) building blocks and yields a fast, flexible result. A significant investment in software tools for a proper design methodology made it possible to design, build and test our filtering engine in about one year, using a host architecture simulator on a PDP-10, a micro-assembler/PROM loader on a 370/168, a wire-wrap program on a CDC 7600, a PDP-11/40 to drive the hardware, and two networks to tie all these software ingredients together and make them easily accessible from a single terminal.
Notes: Source and destination operand fetching is not shown.
