Abstract-The microprogrammed filter engine (MICE) is a fast, microprogrammable processor built with ECL bit slices (Motorola ECL 10800 series) intended primarily to be used as an on-line data filtering engine for high energy physics experiments. In this note we describe the use of a hardware description language used to model and simulate the hardware during its development. We treat the problem of describing a pipelined, horizontal (112 bits wide) host machine, implemented using bit slices with considerable potential for parallelism. Several levels of modeling are conceptually applicable to a problem of this nature and the note describes the thorough process followed before we decided on a particular style of description and simulation.
I. INTRODUCTION
The microprogrammed filter engine (MICE) [1] is a fast, microprogrammable processor built with ECL bit slices (Motorola ECL 10800 series) intended primarily to be used as an on-line data filtering engine for high energy physics experiments. The processor supports both user microcode and emulation of a subset of the DEC PDP-11 architecture (no floating point, memory management, or multiple interrupt levels).
MICE was designed and implemented at CERN, Geneva, Switzerland during the latter portion of 1978 and the first half of 1979. The design parameters, intended application, and architecture of MICE are described in [1]. In this note we concern ourselves with the use of a hardware description language used to model and simulate the hardware during its development. We treat the problem of describing a pipelined, horizontal (112 bits wide) host machine, implemented using bit slices with considerable potential for parallelism.
II. SIMULATION AS A DESIGN TOOL
As indicated in [1], there were six basic reasons for using simulation as a design tool, as follows:
1) to "verify" the correctness of the host design, by tracing the execution of microinstructions,
2) to be able to write and debug microcode (e.g., the target emulator) before the hardware was ready, and indeed even afterwards; debugging on a time-sharing terminal with good debugging and diagnostic features is easier than doing it on a minimally available piece of hardware,
3) to check the hardware against a nonvarying standard "definition" so as to be able to "certify" it,
4) to allow a quick assessment of the impact of changes (fixes, enhancements, etc.),
5) to have the machine description provide "living," i.e., dynamic, interactive project documentation which is easier to understand than static, passive diagrams, code, or text, and is always up-to-date,
6) to allow measurement, evaluation, and identification of bottlenecks.
0018-9340/81/0700-0513$00.75 © 1981 IEEE
Rather than implementing a simulator in a general purpose language, the designers adopted an existing, well debugged, and documented architecture description/simulation facility based on the ISPS notation [2], [3].
The notation and its use in a variety of applications are described elsewhere [4] . The objective of this note is to demonstrate the advantages of high-level software tools and the thought processes that permitted the successful development of a MICE model in a very short period of time. In particular, we want to illustrate the novel use of an instruction set description language in a project involving a horizontal microprogrammable processor.
This note is not intended to be a tutorial or a survey on hardware description languages. A large number of these languages have been proposed and some have even been implemented. However, many tend to be used in fixed areas of application, or specialize in a particular level of design, or are built around a specific model of timing and synchronization, or are even specialized for a particular component technology. This variability restricts most comparisons to a rather small set of dimensions. Nevertheless, research on hardware description languages is very active, and readers interested in tutorials or comparisons of the most popular languages should consult [5]-[7].
III. AN
MICE is intended to provide a flexible second level detection mechanism by performing more complex algorithms than are used in the very fast hardwired primary detector. It can reject "uninteresting" events selected by the primary detection mechanism in a shorter time than would be required for a full read-out to the DAC. An added advantage is that the data collected contains a higher percentage of "good" events, thus reducing the amount of off-line data processing required to filter out a given number of good events.
Fig. 1 depicts a typical system configuration. MICE performs a filter algorithm on a subset of the event data made available to it by a fast read-out system connected (via DMA) to its target memory. If an event is to be rejected, the engine issues the REJECT signal, which clears the system, and awaits the next trigger interrupt. If an event is to be accepted, the engine interrupts the DAC, which then reads the complete event from the detectors. In MICE a user can write low-level microcode for time-critical algorithms and, using standard software development tools, he can program noncritical tasks at a higher level, using the fixed-point PDP-
IV. THE ISPS NOTATION
ISPS describes the interface (i.e., external structure) and the behavior of hardware units (called entities in the language). The interface describes the number and types of carriers used to store and transmit information between the units. The behavioral aspects of the unit are described by procedures which specify the sequence of control and data operations in the machine.
In the simplest case, a unit is simply a carrier (a bus, a register, a memory, etc.), completely specified by its bit and word dimensions, as shown in Fig. 3.
The examples in the figure are taken directly from the specification of the writable control store (WCS) in MICE. The control memory consists of 1K words, each 128 bits wide (only 112 bits are currently used). Microinstructions are loaded into the pipeline register (PR) for decoding and execution.
Different fields of a microinstruction control different data paths and functional units in the machine. For convenience of description and debugging, these fields tend to be grouped according to their function. Fields must be declared by specifying a name and a structure (e.g., ALUMX1IOF(9:0)), together with the corresponding portion of the carrier (e.g., PR(111:102)) of which they are a part. Notice that the bit "names" used on the left- and right-hand sides of a field definition are completely independent of each other. Thus, bit "9" of ALUMX1IOF is mapped onto bit "111" of PR, bit "8" onto bit "110," etc. Fields can be mapped over other fields, as in the definition of ALUFF(5:0). For 1-bit fields (or carriers), there is no need to specify a bit name inside the "(" and ")" brackets, as in the declaration of DGIOOBF.
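The mapping of a field onto a portion of a carrier can be illustrated with a short Python sketch. This is not ISPS: the helper `field`, the variable names, and the sample bit pattern are hypothetical, and only the 10-bit range 111:102 of PR is taken from the text.

```python
PR_WIDTH = 128  # control store word width given in the text


def field(carrier: int, high: int, low: int) -> int:
    """Return bits high..low (inclusive) of carrier as an integer.

    Bit 0 of the result corresponds to bit `low` of the carrier, so the
    bit names of the field are independent of those of the carrier.
    """
    width = high - low + 1
    return (carrier >> low) & ((1 << width) - 1)


# Hypothetical pipeline register contents: only PR bit 111 is set.
pr = 1 << 111
assert pr < (1 << PR_WIDTH)

alu_field = field(pr, 111, 102)   # the 10-bit field mapped onto PR(111:102)
bit9 = (alu_field >> 9) & 1       # field bit 9 is PR bit 111
bit8 = (alu_field >> 8) & 1       # field bit 8 is PR bit 110
```

With this bit pattern, bit 9 of the field is set and bit 8 is clear, matching the correspondence between field bits and PR bits described above.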
Hardware behavior can be modeled by procedures containing data and control operations. Fig. 4 displays an abridged copy of the behavioral description of the microsequencer control (MC) unit. Procedure declarations are similar to those of a high-level programming language, with a procedure name followed by a (possibly empty) list of parameters and the body enclosed in a BEGIN-END block. The MC procedure computes microinstruction addresses depending on fields of the current microinstruction, external branch conditions, and other signals. The ICF field of a microinstruction is used as an "operation code," which is decoded and its value used to select one of a number of alternative register transfer sequences. This selection mechanism is implemented in ISPS with the DECODE operation. If the value of ICF is 0, the microinstruction address register, CR0, is pushed onto a small stack internal to the MC unit, and a new value (computed by procedure NA) is loaded. If the value of ICF is 15 (hexadecimal F), a return address is popped from the stack into CR0 while another register (CR1) is incremented by the value computed by procedure CINi. To further draw the analogy between the role of the ICF field and the operation code in a vertical machine, each of the alternatives is labeled with both the value of the ICF field (0, 1, ..., "F) and an instruction mnemonic, as in "0\jsr", where 0 is the operation code and JSR stands for jump to subroutine. The "\" operator is used to introduce aliases for constants (as in the example) or identifiers, as in "pr\pipeline.register."
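The decoding of ICF described above can be sketched in Python as a dispatch on the field value. This is a rough analogy rather than a transcription of Fig. 4: the class `MicroSequencer`, its method names, and the two callables standing in for the address and increment procedures are all hypothetical, and only the two ICF values mentioned in the text are modeled.

```python
class MicroSequencer:
    """Toy model of the MC unit: CR0, CR1 and a small internal stack."""

    def __init__(self, next_address, increment):
        self.cr0 = 0
        self.cr1 = 0
        self._stack = []             # local to the unit, not visible outside
        self._na = next_address      # stands in for procedure NA
        self._inc = increment        # stands in for the increment procedure

    def _jsr(self):
        # Order matters here: the old CR0 must be saved before it is replaced.
        self._stack.append(self.cr0)
        self.cr0 = self._na()

    def _rtn(self):
        # The two transfers touch different registers, so their relative
        # order does not affect the result.
        self.cr0 = self._stack.pop()
        self.cr1 += self._inc()

    def step(self, icf: int):
        # Analogue of DECODE: the ICF value selects one transfer sequence.
        actions = {0x0: self._jsr, 0xF: self._rtn}
        if icf not in actions:
            raise NotImplementedError(f"ICF value {icf:#x} not modeled")
        actions[icf]()


mc = MicroSequencer(next_address=lambda: 0x40, increment=lambda: 1)
mc.cr0 = 0x10
mc.step(0x0)            # JSR: push 0x10, load 0x40
after_jsr = mc.cr0
mc.step(0xF)            # RTN: pop 0x10 back into CR0, bump CR1
```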
The two operations illustrated in the figure (JSR and RTN) differ in one important aspect. Although both consist of two steps, the steps are separated by different delimiters: "NEXT" is used to indicate a sequential operation, while ";" is used to indicate a concurrent operation. Concurrency in ISPS is defined as "process" concurrency and no assumptions are made about the synchronization of the operations. Thus, conflicting use of source and destination carriers, as in "A = 5; B = A", can yield unpredictable results (no assumption of the form "compute all expressions into temporary variables before performing the transfers" can be made).
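The hazard in "A = 5; B = A" can be made concrete with a small Python sketch that executes the two transfers in both possible orders. The function `run` and the initial register values are assumptions made for illustration; ISPS itself does not prescribe either order.

```python
def run(order):
    """Execute the two transfers in the given order and return (A, B)."""
    regs = {"A": 0, "B": 0}
    transfers = {
        "A=5": lambda r: r.update(A=5),
        "B=A": lambda r: r.update(B=r["A"]),
    }
    for name in order:
        transfers[name](regs)
    return regs["A"], regs["B"]


a_first = run(["A=5", "B=A"])   # B sees the new value of A
b_first = run(["B=A", "A=5"])   # B sees the old value of A
```

The two orders leave different values in B, which is why a description that relies on either one is unsafe.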
In the general case, a unit consists of an interface (carrier) and a procedure which describes its behavior. The procedural part may contain not only data and control operations, but also the declaration of local units of arbitrary complexity. Local units are not accessible to external units, allowing the encapsulation of portions of the design in a well-structured manner. This is illustrated by the PUSH and POP procedures used in the description of MC (Fig. 5).
V. THE STRUCTURE OF MICE
As described in [1], there were some problems with the functionality of the bit slices which resulted in the addition of external SSI and MSI components. These were required to provide data paths which were not present in the standard slices, and also to provide functions that were too expensive (i.e., slow) to obtain using the functions already available in the slices.
For example, to calculate the next microaddress and then fetch the corresponding microinstruction in one cycle, it was necessary to add external table look-up logic (i.e., mapping ROMs) in order to (conditionally) branch on the following: 1) the source and destination operand addressing modes, 2) the target instruction operation code, 3) interrupts and internal hardware status signals. During the microcycle in which the branch occurs, the mapped microaddress is also loaded into the sequencer's microaddress register (CR0) so that it is available for the next microaddress calculation in the following cycle.
Similarly, an external target memory address register was added to allow bypassing of the memory interface's MAR and its associated strobing to give us a single cycle address-and-fetch for the more common or simple PDP-11 addressing modes. Also, a number of external gates were added to provide symmetry of data flow not available from the data paths of the slices. The net effect was an increase in the complexity of the host architecture. In addition, at least in principle, unpredictable microcode can be written, and it was a policy decision to discourage this practice by providing the users with high-level macros to implement a more structured, virtual architecture, yet without sacrificing speed by hiding architectural features of the horizontal host machine.
The system designers could then take advantage of the features of the host machine by programming at the individual field level, while normal users could program a higher level virtual host machine which still takes advantage of the machine's inherent parallelism. For example, register to register arithmetic and logical operations involving the register file, the ALU, and external multiplexers and gates controlled by a dozen or so fields, were encapsulated in simple macros specifying source and destination operands and an operation code.
VI. MODELING
One of the most important decisions to be made whenever simulation tools are to be used is the selection of a level of modeling. Several such levels were conceptually feasible and in this section we describe the thought process followed before we decided on a particular style of description and simulation.
At one extreme, the modeling of the system as a PDP-11 instruction set processor was clearly inadequate since the objective was not to test the target instruction set but the host machine design. A PDP-11 instruction set description would have hidden all of the important features and potential problems in the host machine.
At the other extreme, the combination of standard slices, together with the avoidance of pathological microinstructions, eliminated the need for expensive, detailed simulation at the gate level.
Modeling the higher level virtual host seen by most users suffered from the same problems that led to rejecting the PDP-11 instruction set level as a viable alternative. This level hides most of the timing and concurrency details that characterize the host machine. Thus, while it is acceptable and even desirable for a user to be unaware of these details, the designers needed a level of description closer to the actual hardware.
Modeling the host at the component level, describing and simulating the bit slices and the extra logic and data paths was closer to what was needed. However, even at this level, much unnecessary detail and expense could be incurred. The level of description finally adopted could be characterized as "virtual slice" simulation. Thus, while we described and simulated the operation of the individual slices, not all details were included, only those portions of the slices that are used in the host. Thus, for instance, the BCD arithmetic capabilities of the slices are not used in the host and are not included in the description.
Using the "virtual slice" approach does not, of course, eliminate all the potential sources of difficulty in the description and simulation of the host machine. We will address two of these problems in the
The solution we implemented was not to try to approximate the actual (micro) timing of each of the slices, but rather to implement only the crudest level of timing, that of strobing the registers internal to each slice as well as the external ones at the right phase of the five phase clock. This implies that the state of the simulated machine (i.e., the status of both clocked and unclocked components) changes only at discrete time intervals, the five phases of the clock (see Fig. 6). These timings are based on worst case propagation delays and therefore the simulation does not mimic the hardware faithfully, a point that should not concern the microprogrammer. The manner in which changes in sources are propagated to their destinations is described below.
Conflicts are resolved by introducing explicit master and slave copies of each of those registers which can act both as source and destination during a microcycle. Old values are copied from the slave copies, while new values are strobed into the master copies. At the end of the microcycle all master values are transferred to their slaves. In this way, the intrinsic parallelism of the machine is simulated at a level which is both manageable and gives a realistic picture of what the microprogrammer can expect to see in each register at each clock edge.
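The master/slave scheme can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the MICE description: the class `Register` and the function `microcycle` are hypothetical names, and the five clock phases are collapsed into a single end-of-cycle update.

```python
class Register:
    """Register with a master copy (written) and a slave copy (read)."""

    def __init__(self, value=0):
        self.master = value
        self.slave = value

    def read(self):
        return self.slave        # old value, stable for the whole microcycle

    def strobe(self, value):
        self.master = value      # new value, invisible until end of cycle

    def end_of_cycle(self):
        self.slave = self.master


def microcycle(registers, transfers):
    """Apply all transfers of one microcycle, then update the slaves.

    `transfers` is a list of (destination, function) pairs; each function
    computes the new value from the slave copies only.
    """
    for dest, compute in transfers:
        registers[dest].strobe(compute(registers))
    for reg in registers.values():
        reg.end_of_cycle()


regs = {"A": Register(1), "B": Register(2)}
# Swap A and B in one microcycle; the result is independent of firing order.
microcycle(regs, [
    ("A", lambda r: r["B"].read()),
    ("B", lambda r: r["A"].read()),
])
swapped = (regs["A"].read(), regs["B"].read())
```

Because every read goes to a slave copy, listing the two transfers in the opposite order would produce the same swapped values.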
Note that the firing order problem would appear even if we used ISPS to model the parallel activity with parallel processes, say one process for each "virtual slice," or even one process per major activity such as register file read and register file write. This is because such processes would have to be synchronized using the ISPS "DELAY-(time)" primitive so that a consuming process gets an up-to-date value from its producing process, via the interface carrier. This again implies determining mutually consistent amounts of delay, which is equivalent to defining a canonical order of firing.
We have, however, used the parallel processor facility of ISPS to model independent activities going on in our machine, e.g., MICE CPU, and the two DMA processes (see Fig. 7). In addition, a number of parallel processes can be activated in order to
Now, whenever any of the sources change, the procedure must be invoked to load its carrier with the correct value: source.n = expression NEXT bus( ). Whenever the current value in the bus is to be used as input to some component, the bus carrier can be used in an expression, for instance x = y + bus. In this style of description, activation of procedures describing combinational logic occurs as a consequence of a change in some input carrier. There are two problems, however: 1) there is an implied retention property being attached to the bus carrier, and 2) there is a spreading of information (invocation of procedures) throughout the description because each of the entities for which the bus is a source must also be invoked in turn and so on, to propagate the original change.
An alternative mechanism is to transfer the activation of the bus procedure from the site where an input carrier changes, to the site where the bus is to be read. In this way, when an entity is accessed, we then ask which sources might have created it as their destination, and consequently what its up to date value should be. In other words, we do not propagate a change as it happens, but only as it is needed to supply the latest value to a carrier affected by the change. Thus, source changes can be described as before: source.n = expression !notice, no bus activation. Whenever the bus is to be used as input to some component, its value must be "computed" explicitly: x = y + bus( ). In this style no retention properties are implied (the value is computed whenever needed, using the latest values of its sources), the description is shorter, and moreover, the knowledge about the identity of the bus has been restricted to those places that use the bus as an input. In effect, this method handles propagations by going backwards from destination to source whenever needed rather than spreading from a source to all its destinations every time the source changes. While at first this recomputation every time the bus is needed seems very inefficient, it does not happen more than a few times for each slice per microcycle and it is the best way to guarantee that the bus contains "fresh" results, without timing conflicts. Furthermore, we avoid needlessly propagating changes to entities which are not in turn used as sources.
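The demand-driven style, in which the bus is a procedure evaluated at the point of use, might look as follows in Python. The names `Source` and `make_bus`, the 16-bit width, and the rule that disabled sources contribute all ones to an AND of the inputs are assumptions made for this sketch.

```python
WORD_MASK = 0xFFFF  # assumed 16-bit data path


class Source:
    def __init__(self, value=0, enabled=False):
        self.value = value
        self.enabled = enabled


def make_bus(sources):
    """Return a procedure that computes the bus value when it is called."""
    def bus():
        # Assumed combining rule: AND the enabled sources together;
        # a disabled source leaves the result unchanged (all ones).
        result = WORD_MASK
        for s in sources:
            if s.enabled:
                result &= s.value
        return result
    return bus


s1, s2 = Source(), Source()
bus = make_bus([s1, s2])

s1.value, s1.enabled = 0x0F, True   # source change: no bus activation
y = 1
x = y + bus()                        # value computed at the point of use

s1.value = 0x10                      # another change, again not propagated
x_later = y + bus()                  # a fresh value is computed on demand
```

Nothing is stored in the bus between the two reads, so no retention property is implied and no change needs to be pushed to the consumers.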
The same considerations were applied to other components without retention properties (e.g., output of multiplexers). All components without retention properties were modeled as procedures to be invoked whenever their output lines were to be used in an expression. Thus, one views such logic not as passive passthrough of its inputs but as an active entity, whose procedure computes a function of its inputs. To illustrate this principle, the example in Fig. 8 describes one of the multiplexers driving the OBus (MX5) (Fig. 2).
The example indicates that the rule implies a regression from the desired output values back towards its sources, through potentially many levels of logic (e.g., OB, MX5, CONSTANT, CONSTANT.ROM). The regression stops whenever a source with retention properties (e.g., OB) has reached steady state at the moment that its carrier is to be used.
The latter is a direct consequence of the level of description. In a gate level simulation, events are continuously repeated until signals reach steady state. In a register transfer level simulation, signals are computed once unless explicitly described otherwise. Although the latter is clearly more efficient, errors in the design could be easily overlooked. To detect some of these errors, two features of ISPS were used. The first one was the use of a predeclared carrier in the language. This carrier is dubbed "UNDEFINED" and can be used as any user declared carrier. When it is used as a source in a register transfer statement, the destination carrier is marked as having an undefined or illegal value which is readily detected by the simulator. In the MICE description we assume that carriers (in particular, combinational logic carriers such as buses and multiplexers) reach steady-state at their right time, determined by the actual physical timings we expected. They are UNDEFINED before this time. Any attempt to access one of these carriers before the right time is therefore detected by the simulator and the proper diagnostic message is issued. Fig. 9 illustrates the use of UNDEFINED. The RD.CM carrier is assumed to be undefined during the first three phases of the clock (clk = 1, 2, 3). It reaches steady state during phase 4, at which point one of several potential sources is loaded into the carrier. The procedure depicts the behavior of the CAMAC read interface. An attempt to transfer (i.e., read) a value from the CAMAC interface into one of the CPU carriers before clock phase 4, is trapped by the simulator as an error (this is, of course, a worst case assumption since different sources become available at different times).
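The use of UNDEFINED to trap premature reads can be approximated in Python with a sentinel value. This sketch is not the Fig. 9 description: the function names, the source table, and the assumption that the carrier stays defined after phase 4 are hypothetical.

```python
class UndefinedAccess(Exception):
    """Raised when a carrier is read before it has settled."""


UNDEFINED = object()   # sentinel standing in for the predeclared carrier


def rd_cm(clk, sources, select):
    """Model of a read carrier that settles only in clock phase 4.

    `sources` and `select` are hypothetical stand-ins for the potential
    sources and for whatever chooses among them.
    """
    if clk in (1, 2, 3):
        return UNDEFINED
    return sources[select]


def read(value, name):
    """Use a carrier as a source, trapping unsettled values."""
    if value is UNDEFINED:
        raise UndefinedAccess(f"{name} read while UNDEFINED")
    return value


sources = {"data": 0x1234, "status": 0x0001}
ok = read(rd_cm(4, sources, "data"), "RD.CM")

try:
    read(rd_cm(2, sources, "data"), "RD.CM")
    trapped = False
except UndefinedAccess:
    trapped = True
```

The read in phase 4 succeeds, while the read in phase 2 raises the error, mirroring the worst case assumption described above.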
The second feature of the language that was used in the description was the detection of "recursive" calls on an already executing procedure. This allows the detection of potential race conditions. Specifically, the enabling of some chains of data paths could lead to a situation in which a bus is being driven by a signal from a slice which is in turn driven directly from another bus, which in turn is receiving a signal from the first bus. This loop takes the form of a series of nested calls on the procedures describing the behavior of the combinational data paths along the loop. Eventually, when the loop is closed, an attempt is made to call the procedure that started the chain and this is again detected by the simulator.
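Detection of a closed combinational loop through nested procedure calls can be sketched in Python with a re-entrancy guard. The decorator `combinational`, the two bus procedures, the enable flags, and the constant values are hypothetical; the ISPS simulator provides this check itself, whereas here it is written out by hand.

```python
import functools


class CombinationalLoop(Exception):
    """Raised when a combinational procedure is re-entered while active."""


def combinational(proc):
    """Decorator that traps a nested call on an already executing procedure."""
    active = False

    @functools.wraps(proc)
    def wrapper(*args):
        nonlocal active
        if active:
            raise CombinationalLoop(f"loop closed at {proc.__name__}")
        active = True
        try:
            return proc(*args)
        finally:
            active = False
    return wrapper


# Hypothetical enable flags; setting both closes a loop between the buses.
enable = {"obus_to_ibus": True, "ibus_to_obus": True}


@combinational
def ibus():
    return obus() if enable["obus_to_ibus"] else 0x00FF


@combinational
def obus():
    return ibus() if enable["ibus_to_obus"] else 0x0F0F


try:
    ibus()
    loop_detected = False
except CombinationalLoop:
    loop_detected = True

enable["ibus_to_obus"] = False   # break the loop
value = ibus()                    # now resolves through obus to a constant
```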
An example of this situation could arise when, for instance, the register file (RF) is to be loaded from the input bus (IBUS) through the A6 port (Fig. 2). To compute the value in IBUS, all its inputs are evaluated and AND-ed together. One of these inputs comes from the memory interface (MI) slice, port I3. This port could be driven, through the slice internal data paths, by the O3 port, which in turn requires the evaluation of the output bus (OBUS) signal. This signal is the conjunction of several possible sources, and one of these (G3) could be driven in turn by the input bus, whose value we were trying to compute in the first place!
VII. CONCLUSIONS
The MICE description models the CPU, buses, CAMAC interface, and DMA interface in 5000 lines of ISPS (including extensive comments). The CPU description was written in one man-month, the remainder in another two man-months by MICE designers without prior experience with ISPS.
When the description is compiled and linked with the simulator run time system, the program occupies 400 kbytes of memory, 110 kbytes of which is the simulator run time system (this portion does not depend on the particular ISPS description being simulated, while the description specific portion grows with the complexity of the description and the size of the memories and register files).
Simulation of a single microcycle requires 400 ms of PDP-10 (KL-1080) processor time. The load on the machine is, however, slight, on the order of a few CPU min/h, since man-machine interactions (commands, responses, and execution traces) dominate the total elapsed time.
In a system of the complexity of the MICE host, there are many ways to get in trouble due to conflicting use of data paths. We were not interested in being able to simulate all microcode sequences one could write, only "friendly" microcode, trapping anything that looked suspicious as an error.
The responsibility for detecting errors has been divided between the microassembler and the simulator. The former catches errors resulting from simultaneously enabling potentially conflicting data transfers. The latter catches errors resulting from accessing data before it has settled (still UNDEFINED) or from accessing the wrong data (recursive calls).
The actual implementation encountered no logical errors thanks to the thorough simulation; the only problems were ECL technology related. As the hardware is extended, the simulator is in constant use to verify the correctness of enhancements and alterations. Also, the DEC PDP-11 diagnostics are run after each change to give an additional check. It is fair to say that the project would not have been completed as rapidly and as cleanly without our powerful software tools.
The digital control technique of microprogrammed logic has obtained widespread acceptance in large-scale computers, minicomputers, microprocessors (e.g., 8086, 68000), and controllers. Recent trends in computer architecture have been to migrate functions traditionally implemented in software to beneath the hardware/software boundary, thus increasing the size and complexity of microprogram logic. As a result, the testing and debugging of microprograms in today's systems is a sizable undertaking. The term "firmware engineering" was coined to suggest that, like the widespread recognition of the problems of software engineering, much more attention is needed on tools and techniques for microprogram development [1].
In early microprogrammed systems, microprogram development was largely serialized with hardware development by postponing microprogram testing and debugging until a hardware prototype or engineering model was available. The relative simplicity of the microprograms in such systems allowed this to be done, and also allowed one to forgo the development of special tools.
The debugging of a microprogram on its actual hardware base is a costly and time-consuming process for several reasons. One is that
Manuscript received December 2, 1980;
The increased sophistication of the architectures of many microprogrammed systems, and the introduction of bit-slice LSI components which are oriented to microprogrammed designs, have led to the recognition of a market for tools to aid the debugging of microprograms in a live hardware environment. These tools, developed largely by manufacturers of bit-slice parts, employ hardware and software logic to monitor a prototype system during debugging, providing the user with such facilities as the ability to display the system state on a CRT, trace events (e.g., the flow through the microprogram), and suspend processing upon the occurrence of specified events [2]. These tools also contain a fast RAM which can be configured to emulate the user's control storage.
II. SOFTWARE SIMULATION
Another approach to microprogram debugging and testing is software simulation. In the sense used herein, a software simulator of a microprogrammed system is a program written to mimic, from the microprogram's point of view, the underlying hardware data flow. That is, the simulator interprets the microprogram, providing exactly the same state changes and effects (except for elapsed time) that would occur if the microprogram existed in its actual hardware environment.
Software simulators for microprogrammed systems are not a new idea; for instance, they were used in IBM's CAS (control automation system) during the development of the S/360 processors [3]. However, many microprogram simulators are unsophisticated and cumbersome tools, the literature on such simulators is extremely scarce [4]-[14], and much of the literature that does exist discusses simulators produced for instructional, rather than developmental, purposes.
The motivations for developing and using a simulator for microprogram testing and debugging are listed below.
Parallel Microprogram and Hardware Development: The obvious motivation for a simulator is serving as a vehicle for testing and debugging of the microprogram before the hardware base is available. One benefit is a reduction in the total elapsed development time of the system. Another is a reduction in the number of variables in question when the microprogram meets its new hardware base.
Easier Debugging: As will be discussed later, simulators usually contain special debugging facilities for the microprogrammer, facilities that are normally not available when debugging a microprogram in a live hardware environment. Hence, the cost and time of locating and correcting microprogram errors is significantly less when using a simulator.
Special Error Checks: It is normally uneconomical to place logic in the hardware design to detect microprogramming errors; the hardware base is usually designed to operate under the assumption that the microprogram is correct. However, special checks for potential errors can be incorporated in the simulator. Such checks are dependent upon the particular system design; examples are: 1) simultaneous gating of two or more sources onto a bus, 2) timing errors, 3) branches to "empty" control-storage words, and 4) failure to observe memory speeds (e.g., doing a read and attempting to use the value before it arrives).
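Two of the checks listed above, simultaneous gating onto a bus and branches to empty control-storage words, can be sketched in Python. The function names, the dictionary representation of the control store, and the source names are hypothetical and are not taken from any particular simulator.

```python
class MicroprogramError(Exception):
    """Raised when the simulator detects a suspect microprogram action."""


def check_bus(gates):
    """Trap simultaneous gating of two or more sources onto one bus.

    `gates` maps a hypothetical source name to True when the current
    microinstruction gates that source onto the bus.
    """
    driving = sorted(name for name, on in gates.items() if on)
    if len(driving) > 1:
        raise MicroprogramError(f"bus conflict: {', '.join(driving)}")
    return driving[0] if driving else None


def fetch(control_store, address):
    """Trap a branch to an "empty" control-storage word."""
    word = control_store.get(address)
    if word is None:
        raise MicroprogramError(f"branch to empty word at {address:#x}")
    return word


driver = check_bus({"ALU": True, "MDR": False})

try:
    check_bus({"ALU": True, "MDR": True})
    conflict = False
except MicroprogramError:
    conflict = True

store = {0x000: 0xABCD}      # only one word has been loaded
try:
    fetch(store, 0x3FF)
    empty_branch = False
except MicroprogramError:
    empty_branch = True
```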
0018-9340/81/0700-0519$00.75 © 1981 IEEE
