This paper describes a method for the automatic generation of the internal structure of digital processors from a specification of the required behaviour. The latter is specified by a high-level, PASCAL-like program. The internal structure is described in terms of memories, arithmetic/logic function boxes, multiplexers and their interconnections. In order to reduce the complexity of the design process, it is partitioned into a sequence of individual steps. These steps include a flexible expression decomposition, a statement scheduling phase, a new module selection method and optimizations of interconnections and instruction word length.
Design Specification
The manual design of digital systems is a time-consuming and error-prone process. While the complexity of digital systems is still increasing, there is a demand for shorter design times. Cutting design times can only be achieved by the use of new design techniques. Therefore, there is a growing interest in methods for the synthesis of digital processors from a description of the required behaviour.
There are several levels of abstraction on which the required behaviour can be specified. In order to open a large design space, the behaviour should be specified on a level as high as possible.
A common form of design specifications for digital processors is the description of an instruction set, which is to be implemented.
However, there are cases where such a specification would restrict the design space more than necessary. As an example, consider the design of a processor for some dedicated application, like the design of a simulation engine. The only behaviour of such a processor that can be monitored from the outside world is the behaviour of the programs running on it. It does not matter which instruction set the machine is executing. Therefore, programs can be used to specify the desired behaviour of digital processors. This approach was first proposed by G. Zimmermann [1]. A design specification describing the desired behaviour by programs is called a specification on the algorithmic level. Starting on the algorithmic level allows the design procedure to tailor the instruction set to the particular application. The instruction set is generated during the design and there is no need to specify it from the beginning. The instruction set may be constructed such that using the potential parallelism of VLSI technology is simplified.
Frequently, however, designers will still have to design machines with a given instruction set. Fortunately, this turns out to be a special case of an algorithmic specification. Instruction set semantics are usually described by an interpreter, for example in ISPS. Interpreters, however, are special programs and therefore can be used as an input to a synthesis system using algorithmic specifications.
This special case frequently leads to confusion because two different instruction sets are involved: the instruction set interpreted by the interpreter and the instruction set of the machine to be designed. The latter is implemented in hardware and is frequently called the microinstruction set. In this paper, we will refer to instructions implemented in hardware simply as "instructions" since the second level of instructions may be missing.
Because of the advantages of algorithmic specifications, we designed a synthesis system starting on that level. This system uses MIMOLA (machine independent microprogramming language) [2] as its design language. MIMOLA has been changed recently to include most of PASCAL's features. Hence, the algorithmic level in our sense includes e.g. recursive procedure calls and references to arrays with an arbitrary number of dimensions. The design system itself is called MIMOLA Software System (MSS).
Usually there will be a large variety of machines being able to perform the desired behaviour. Therefore, it is necessary to restrict the design space in a number of ways. The following are the most important restrictions accepted by our design system:
1. Limitation of the number of instruction bits for immediate data and addresses. E.g., it is possible to specify that at most 24 bits per instruction are allowed.
2. Specification of a set of available arithmetic/logic units (ALUs). This restriction is obviously required if the design has to be implemented using discrete devices like TTL circuits. It is also required if the design system is to be extended into a standard cell silicon compiler.
3. Specification of available data memories. Our previous experience indicates that there are only few choices for data memories. Therefore it is possible to completely specify all data memories to be used in the design, repeat the design process for possible other choices and then compare the resulting designs.
One important parameter of random access memories is their number of ports, that is, their maximum number of simultaneous accesses to different memory locations. Small memories are usually implemented as 2- or 3-port memories. Larger memories frequently are single-port memories, but pseudo multiport memories can be built using memory banks and crossbar switches.
In addition to the specification of the desired behaviour and the set of design constraints, our design system needs some additional information on how to link behavioural and structural domains. This information is concerned with the implementation of high-level programming elements (like procedure calls) in hardware. Examples are given below.
The following is a sample of a complete design specification in MIMOLA. Due to space constraints, this sample is much shorter than typical design specifications. The syntax used in this example reflects version 4.0 of the design system.

Sequential execution is necessary for the first form, and an implementation requires at least two instructions. In contrast, both assignments in the second form can be done in parallel. Therefore, it can be implemented by a single instruction if hardware resources are sufficient. It is hard to anticipate which implementation will be the fastest under the constraints of limited hardware resources. Therefore, the design decision is delayed by generating up to three different versions of control flow implementations in a component called MSSI. One of these versions is selected after the number of required instruction steps has been computed for each version.

MSSF, MSSR and MSSI are three of the so-called front-end tools. The execution of these tools precedes the execution of the synthesis algorithm (see Fig. 1). Other front-end tools are MSSS (a simulator capable of simulating RT-behaviour), MSSO (an optimizer for RT-programs) and MSSP (a component detecting possible parallelism).
The synthesis subsystem
2.3.1 Statement decomposition
The synthesis system uses instruction bits in order to generate (address and data) constants. Design constraints may include a maximum for the number of immediate bits per instruction. Hence, complex statements containing many constants must be decomposed into a sequence of simpler statements not violating these design constraints. Required temporary variables must be introduced. For the present version of the MIMOLA system it is also assumed that there is no reassignment of hardware resources during the execution of a generated instruction. As a consequence, e.g. the number of memory references per instruction cannot exceed the number of memory ports. Therefore, statements containing many memory references must also be decomposed into simpler statements.
Finding an optimal decomposition is known to be NP-complete. Traditional compiler techniques like [4] are optimal only for special cases. One of the frequent simplifications is ignoring the existence of common subexpressions. Our previous experience however indicates that taking advantage of common subexpressions is required for acceptable designs. Optimal algorithms, which do consider common subexpressions (e.g. [5] ), do not handle general expressions.
We therefore developed a heuristic method. The virtue of this method is that it is very flexible with respect to different design constraints and that it takes advantage of common subexpressions.
Let t be an arbitrary expression or assignment. Define treetoobig (t) such that treetoobig(t) is true if t cannot be evaluated in a single cycle and false otherwise. The precise definition of treetoobig includes the number of available memory ports, the upper limit for the total instruction length and predictions of the cost to implement arithmetic operations present in t. For example, treetoobig is true, if the number of memory references in t exceeds the number of available memory ports.
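The two checks named explicitly above (memory ports and instruction length) can be illustrated by a minimal sketch. It assumes a hypothetical tuple encoding of expressions and a single memory; the names Limits, count_mem_refs and count_immediate_bits are illustrative and not part of the MSS, and the cost predictions for arithmetic operations are omitted.

```python
from dataclasses import dataclass

# Hypothetical expression encoding (not MIMOLA syntax):
#   ("mem", name, address_expr)  -> memory reference
#   ("const", value, bits)       -> constant needing `bits` immediate bits
#   (op, operand, ...)           -> arithmetic/logic operation

@dataclass
class Limits:
    memory_ports: int     # maximum simultaneous memory accesses (assumed single memory)
    immediate_bits: int   # maximum immediate bits per instruction

def count_mem_refs(t):
    """Count memory references in t, including those inside address expressions."""
    if t[0] == "mem":
        return 1 + count_mem_refs(t[2])
    if t[0] == "const":
        return 0
    return sum(count_mem_refs(sub) for sub in t[1:])

def count_immediate_bits(t):
    """Sum the instruction bits needed for all constants in t."""
    if t[0] == "const":
        return t[2]
    if t[0] == "mem":
        return count_immediate_bits(t[2])
    return sum(count_immediate_bits(sub) for sub in t[1:])

def treetoobig(t, limits):
    """True if t cannot be evaluated in one cycle under this simplified model."""
    return (count_mem_refs(t) > limits.memory_ports
            or count_immediate_bits(t) > limits.immediate_bits)

# M[3] + M[4]: two memory references, two 8-bit address constants
expr = ("+", ("mem", "M", ("const", 3, 8)), ("mem", "M", ("const", 4, 8)))
print(treetoobig(expr, Limits(memory_ports=2, immediate_bits=24)))  # False
print(treetoobig(expr, Limits(memory_ports=1, immediate_bits=24)))  # True
```

With a single-port memory the expression has to be decomposed, whereas a two-port memory and 24 immediate bits allow single-cycle evaluation in this model.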
Let t again denote an arbitrary expression or assignment.
Define mostcomplex(t) to mean a subexpression a of t, where a is, by a heuristic criterion, the most complex subexpression of t which can be assigned to a temporary variable without violating design constraints. In the MSS, mostcomplex selects a subexpression of t according to the following priority list:
1. maximum number of memory references,
2. maximum number of references to memories not used to store intermediate results,
3. common subexpressions,
4. boolean subexpressions,
5. left to right.

E.g. if two subexpressions of t contain exactly the same number of memory references and one of them represents a common subexpression, it will be selected by mostcomplex.

Fig. 2 is a flow tree representing this statement. Numbers in parentheses indicate the sequence in which decompose traverses the tree. Reading from and writing to memories is simply denoted by the name of the memory.
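The priority list used by mostcomplex can be read as a lexicographic comparison. The following sketch assumes that the candidate subexpressions have already been filtered to those assignable to a temporary variable without violating design constraints; the Candidate record and its field names are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    text: str            # printable form of the subexpression (illustrative only)
    mem_refs: int        # criterion 1: number of memory references
    non_temp_refs: int   # criterion 2: references to memories not holding intermediates
    is_common: bool      # criterion 3: common subexpression
    is_boolean: bool     # criterion 4: boolean subexpression
    position: int        # criterion 5: left-to-right index within the statement

def mostcomplex(candidates):
    """Pick the candidate that wins the priority list, compared lexicographically."""
    return max(
        candidates,
        key=lambda c: (c.mem_refs, c.non_temp_refs, c.is_common, c.is_boolean, -c.position),
    )

a = Candidate("M[i] + M[j]", 2, 2, False, False, 0)
b = Candidate("M[k] * M[l]", 2, 2, True, False, 1)
print(mostcomplex([a, b]).text)  # M[k] * M[l]
```

Both candidates tie on the first two criteria, so the common subexpression is chosen, as in the example given in the text.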
decompose will deposit the following sequence of statements in the stack:
Frequently, no distinction is made between dd and ad (e.g. in [6]). As Vegdahl [7] points out, this prevents some blocks of code from being moved as a whole. Therefore the distinction between the two relations is important. The definitions of dd and ad in [8] are applicable only to sequential blocks, because they make use of the order in which statements are written. MIMOLA allows the user to specify parallel blocks like PARBEGIN a:=b; b:=a END.
The two assignments are expected to interchange the contents of a and b and the order in which they appear in the program is redundant. The above definition can be applied to sequential as well as to parallel blocks. A more precise form is contained in [9] .
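The difference between sequential and parallel execution of these two assignments can be illustrated in a few lines. This is only an illustration of the intended semantics in Python, not MIMOLA code; the function names are hypothetical.

```python
def run_sequential(a, b):
    """a:=b; b:=a executed one after the other."""
    a = b   # a is overwritten first
    b = a   # b receives the already overwritten value of a
    return a, b

def run_parallel(a, b):
    """PARBEGIN a:=b; b:=a END: all right-hand sides are read before any write."""
    new_a, new_b = b, a
    return new_a, new_b

print(run_sequential(1, 2))  # (2, 2)
print(run_parallel(1, 2))    # (2, 1)
```

Only the parallel form interchanges the contents, and its result does not depend on the order in which the two assignments are written.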
Using dd and ad, the set of allowable schedules can be defined. Let MI(si) be the instruction allocated to si. Let MI(si) > MI(sj) denote that the execution of sj precedes that of si and let MI(si) ≥ MI(sj) denote that either MI(si) > MI(sj) or MI(si) = MI(sj). Then, the following conditions must hold:
As long as these conditions are met, many different scheduling algorithms may be used. In the MSS we modified the pairwise comparison algorithm [10] such that it no longer relies on a strict order of statements.
As the name indicates, the pairwise comparison algorithm compares statements pairwise for data dependence and resource constraint violations. This comparison is limited to statements contained in the same block. Hence, the complexity of the pairwise comparison algorithm grows quadratically with respect to the size of blocks and linearly with respect to the number of blocks. This complexity is equal to that of the statement decomposition phase, because decompose requires that common subexpressions within a block are detected. Detecting common subexpressions also requires a pairwise comparison of expressions.
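The quadratic pairwise comparison within a block can be sketched as follows. This is a simplified illustration under stated assumptions, not the modified algorithm of the MSS: unlike the MSS version, it relies on a given statement order; it assumes that a statement reading a value written by an earlier statement must be placed in a strictly later instruction, that a statement overwriting a value read by an earlier statement may share its instruction, and that the only resource constraint is the number of memory ports. The Stmt record and all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Stmt:
    name: str
    reads: set          # variables read by the statement
    writes: set         # variables written by the statement
    mem_refs: int = 0   # memory references needed (assumed <= number of ports)

def schedule(stmts, memory_ports):
    """Assign each statement to the earliest instruction allowed by the
    assumed dependence rules and the memory port limit."""
    slot = {}   # statement name -> instruction index
    load = []   # memory references already used per instruction
    for i, si in enumerate(stmts):
        earliest = 0
        for sj in stmts[:i]:   # pairwise comparison with preceding statements
            if sj.writes & si.reads or sj.writes & si.writes:
                # si needs (or overwrites) a result of sj: strictly later
                earliest = max(earliest, slot[sj.name] + 1)
            elif sj.reads & si.writes:
                # si overwrites an operand of sj: same instruction is allowed
                earliest = max(earliest, slot[sj.name])
        k = earliest
        while k < len(load) and load[k] + si.mem_refs > memory_ports:
            k += 1             # resource conflict: try the next instruction
        if k == len(load):
            load.append(0)     # open a new instruction
        load[k] += si.mem_refs
        slot[si.name] = k
    return slot

block = [
    Stmt("s1", {"x", "y"}, {"t1"}, mem_refs=2),
    Stmt("s2", {"t1"}, {"z"}, mem_refs=1),
    Stmt("s3", {"u"}, {"x"}, mem_refs=2),
]
print(schedule(block, memory_ports=2))  # {'s1': 0, 's2': 1, 's3': 2}
print(schedule(block, memory_ports=4))  # {'s1': 0, 's2': 1, 's3': 0}
```

With four ports, s3 can share an instruction with s1 although it overwrites x, because s1 only reads x. The nested loops make the quadratic growth with block size visible.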
At the end of the scheduling phase, the behaviour of the RT-program has been decomposed into the behaviour of each of the instructions. The number of instructions for every version generated by MSSI therefore is known and the shortest instruction sequence can be selected.
2.3.3 Register assignment
After all statements have been assigned to one of the instructions, locations are assigned to temporary variables. Since optimizations at this step are limited to straight-line sequences of instructions, it is almost trivial:
In the scheduling phase, the sequence of statements is frequently changed. Hence, too many locations would be required if the allocation were already done in the statement decomposition phase.
Module selection
The previous design steps did not synthesize an RT-structure. They just transformed the program such that the selection of hardware resources is simplified. The next design step now is the first of those which actually build up an RT-structure.
As a result of the scheduling phase, arithmetic and logic operations in each of the instructions are known. We now use this knowledge in order to generate arithmetic/logic units (ALUs).
the IP-problem can be solved in less than 100 ms on a 1 Mips machine.
If a program contains an operation which cannot be performed by any of the available module types, the MSS creates a new type being able to perform just the required operation. A warning is generated whenever this occurs.
At the end of this design step, all major hardware components have been selected. However, behavioural level operations have not yet been bound to specific hardware modules.
Generating interconnect
Allocating hardware modules to behavioural level operations implies the existence of physical paths from source modules to sink modules. The problem is to find assignments of modules to operations such that the cost for interconnect is minimal. Unfortunately we are unable to predict the effect of such an assignment in terms of wiring area. We therefore use a simplified design objective: minimize the total number of paths! The optimization problem is formulated as follows: For each operation to be performed by one of the instructions, there is a set of matching hardware resources. E.g. for each arithmetic operation, there is a set of functional modules which are able to perform this operation, and for each constant (0-ary operation), there is a set of instruction fields of the required length. Now, for each operation find a resource from this set such that no resource is assigned to more than one operation per instruction and such that the minimal number of paths between resources is required.
In the present implementation of the MSS, a branch-and-bound algorithm is used to solve this assignment problem. Unfortunately the complexity of this algorithm makes it impossible to generate globally optimal assignments. Therefore, it is necessary to solve the assignment problem for a few instructions at a time, starting with the most complex instructions.
In case the above algorithm computes a solution requiring more than a single path to an input, multiplexers are generated by the MSS in a straightforward manner.
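For very small instances, the assignment problem formulated above can be illustrated by exhaustive enumeration. The sketch below is not the algorithm used in the MSS; it assumes that every operand comes from a fixed source (e.g. a register) and that each operand position of a module is a separate input. The dictionary layout and the names paths_for and assign_modules are hypothetical.

```python
from itertools import product

def paths_for(sources, module):
    """One path per operand: from its source to the matching module input."""
    return {(src, f"{module}.in{i}") for i, src in enumerate(sources)}

def assign_modules(instructions):
    """Try every assignment of candidate modules to operations and keep the
    first one found that needs the fewest distinct paths."""
    ops = [(k, op) for k, instr in enumerate(instructions) for op in instr]
    best_choice, best_paths = None, None
    for choice in product(*(op["candidates"] for _, op in ops)):
        used = {(k, m) for (k, _), m in zip(ops, choice)}
        if len(used) < len(ops):
            continue  # a module would serve two operations in one instruction
        paths = set()
        for (_, op), m in zip(ops, choice):
            paths |= paths_for(op["sources"], m)
        if best_paths is None or len(paths) < len(best_paths):
            best_choice, best_paths = choice, paths
    return best_choice, best_paths

instructions = [
    [  # instruction 0: two operations executed in parallel
        {"sources": ["R1", "R2"], "candidates": ["ALU1", "ALU2"]},
        {"sources": ["R3", "R4"], "candidates": ["ALU1", "ALU2"]},
    ],
    [  # instruction 1: same operands as the first operation above
        {"sources": ["R1", "R2"], "candidates": ["ALU1", "ALU2"]},
    ],
]
choice, paths = assign_modules(instructions)
print(choice, len(paths))  # ('ALU1', 'ALU2', 'ALU1') 4
```

Binding the operation of instruction 1 to the same ALU as the first operation of instruction 0 reuses two existing paths; binding it to the other ALU would require six paths in total. The enumeration grows exponentially with the number of operations, which is consistent with restricting the search to a few instructions at a time.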
Generating control
In the MSS we assume that the hardware is controlled by instructions with a format similar to horizontal microinstructions. More specifically, we assume that the direct encoding method [12] is used to control RT-modules: for each module with a control input, there is a corresponding instruction field f, which may be used to select one of the module's operations.
Since not all the modules are used simultaneously, some of them may share instruction bits.
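The idea of sharing instruction bits can be sketched with a simple greedy grouping. This is an illustrative heuristic, not the optimization used in the MSS. It assumes that two control fields may overlap whenever their modules are never used in the same instruction, and that an unused module tolerates arbitrary values on its control input; the function name and data layout are hypothetical.

```python
def share_fields(widths, usage):
    """Group modules whose control fields may occupy the same instruction bits.

    widths: module name -> number of control bits
    usage:  list of sets, the modules used by each instruction
    Returns the groups and the total number of control bits needed.
    """
    def conflict(a, b):
        # two modules conflict if some instruction uses both of them
        return any(a in used and b in used for used in usage)

    groups = []
    # widest fields first, so narrow fields can be packed under wide ones
    for m in sorted(widths, key=lambda name: (-widths[name], name)):
        for g in groups:
            if not any(conflict(m, other) for other in g):
                g.append(m)
                break
        else:
            groups.append([m])
    total = sum(max(widths[m] for m in g) for g in groups)
    return groups, total

widths = {"ALU1": 4, "ALU2": 4, "SHIFT": 2, "MUX": 1}
usage = [{"ALU1", "MUX"}, {"ALU2", "MUX"}, {"SHIFT", "ALU1"}]
groups, total = share_fields(widths, usage)
print(groups, total)  # [['ALU1', 'ALU2'], ['SHIFT', 'MUX']] 6
```

Without sharing, the four fields would need 11 bits; in this example the grouping reduces the control part of the instruction word to 6 bits.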
Optimization techniques similar to ours have been used by Takagi [13] and in the CMU-DA system [14]. The basic idea in both cases is modelling the problem as a clique partitioning problem. The heuristics used for solving the clique partitioning problem and the scheduling problem are very similar.
Generating completely bound programs
Using the results of sections 2.3.5 and 2.3.6, so-called completely bound programs can be generated. Completely bound programs explicitly specify all used hardware resources and all used instruction bits [15]. Completely bound programs may be processed by the back-end tools MSSM, MSSS, MSSB, MSSE and MSSG (cf. Fig. 1).
Sample output
The following is a partial description of the RT-structure generated for our sample input. Note that the instruction format and the interconnections between modules are now described. One copy of each of the ALU types B7483 and B74xy has been selected.
The interconnections with multiplexer MaU1 can be seen in Fig. 4, which is a graphical representation of the resulting structure. Address inputs to SR and control inputs except to MaU1 have been omitted.
Fig. 4 Synthesized RT-structure

First results
One of the earlier examples used for testing the MSS was the mergesort algorithm as described by Wirth [16]. The performance of the RT-structures created by the MSS was compared with the performance of an IBM/370-type machine. The results are shown in Table 1.

Table 1: Performance and code of MSS designs

The speed of the MSS designs is not a result of tailoring a machine to exactly one program. The synthesis algorithm does not, for example, hardwire constants (except zero). The resulting RT-structure is mainly influenced by some of the program's properties, like addressing modes, the arithmetic/logic operators used and the amount of bit-level addressing. In a case study we analysed the effect of adding the mergesort algorithm to a behavioural specification consisting of the gomory-I algorithm. Except for a single-bit wire, the resulting structure was the same.
The MSS designs in Table 1 require at most 22% more program code than the SIEMENS. This is a remarkable result, because the direct encoding scheme is believed not to lead to compact code.
Flexibility has always been a design goal for the MSS. This allows us to study the effect of different design alternatives. Fig. 5 for example, is

MSS1 has been used to design a processor for a design specification consisting of the kernel of an operating system and some typical sections of PASCAL programs. The MSS1 was not particularly successful in minimizing the number of data paths and therefore these have been reduced manually by about 50%. The resulting structure and the structure generated by the current MSS had the same number of paths.
Other tools in the MSS
Retargetable code generation
To a certain extent, generated hardware structures may be modified by changing the resource constraints. Design iterations initially should be done by using this method, that is, by executing the synthesis procedures for different design constraints. However, after a certain number of iterations, ideas about the hardware structure become more and more precise. As a result, the designer usually knows the structure he would like. Hopefully, it is similar to a structure generated by the synthesis system. But, in order to take advantage of the ideas of human designers, it is necessary to allow manual modifications to automatically generated structures (cf. Fig. 6).
It is easy to document these changes because MIMOLA (in contrast to other languages) can be used for the description of the synthesis result as well as for the behavioural specification. The problem with manual changes is that they may result in an incorrect design. "Incorrect" means that the modified structure is not capable of executing the programs specifying the desired behaviour.
Therefore we have to provide the user with tools enabling him to check correctness. We do so by including a so-called retargetable code generator in the MSS [18]. This code generator tries to generate code for the machine described in the hardware description section of its input. This machine is called the target. If the code generator is able to generate code for a manually modified structure, the design is still correct. If it fails to do so, it is either incorrect or the code generator does not know enough "tricks". "Tricks", which are frequently used by human microprogrammers, represent a certain amount of knowledge. With MIMOLA it is possible to convey this knowledge to the MSS, because MIMOLA allows the user to define valid program transformations. These transformations are used by the code generator like in a tiny expert system.

This transformation is required for programming the AMD2900 series of bitslice chips. Hardwired zeroes exist at the input of the AMD2900 ALU. In order to pass these to the output of the ALU, the "AND"-function must be used. The suffix "CONDITIONALLY" indicates that this rule has to be applied only if its application results in good code. The code generator has been used for more than 20 different targets.
Test generation
Because of the increasing problem of testing VLSI chips, we also designed a component MSST, which automatically generates self-test programs (diagnostics) from a given structural hardware description [19]. These programs are intended to be executed by the real hardware.
Recently, an additional tool has been completed at the University of Aachen, Germany. This tool computes testability measures for a given hardware. The output of this tool as well as the output of MSST is intended to be used in design iterations in order to improve the testability of an RT-structure.
Back-end tools
The synthesis subsystem, the code generator and the test generator are the three main components of the MSS. All three components generate completely bound programs. These programs as well as the description of the RT-structure can be processed by a set of components called back end tools.
Probably the most important back end tool is MSSM. MSSM generates listings of the resulting RT-structures and completely bound programs. The language used for these listings again is MIMOLA. This is possible, because MIMOLA is able to describe RT-structures and completely bound programs. Therefore, the output of MSSM can be used as input to MSSF. This feature is necessary in order to support design iterations without having to learn two or more different languages.
Another important tool is the simulator. The simulator is able to simulate the structure of the processor. It is based upon an analysis of the interconnections in the processor and therefore is able to detect unwanted side effects.
The simulator allows the user to monitor the execution of the program on the target hardware. Note, that the user does not have to load the instruction memory manually because binary code is generated by the MSS.
The simulator can be used to validate a design independently of the synthesis and code generation tools. This is especially valuable in order to detect errors made by these tools. A third back end tool is MSSB. MSSB generates an instruction memory map in human-readable form. The purpose of MSSE is to evaluate the target structure. With the help of the simulator it computes the expected execution time of the program and generates utilization statistics for the RT-modules. In order to help visualize the result of a design process we are currently implementing MSSG. MSSG will automatically convert textual hardware descriptions into schematics. MSSG is based upon an extension of an algorithm published in 1985 [20].
The back end tools as well as all other tools are implemented in standard PASCAL. The MSS has been installed on SIEMENS, VAX, Eclipse, Sun and Apollo computers.
Conclusion
Synthesis methods for the design of digital hardware are capable of producing correct designs in a short turn-around time. This paper deals with the synthesis of RT-structures from an algorithmic design specification. By carefully partitioning the design process into a sequence of subprocesses we have tried to reduce the complexity and to keep interactions between the subprocesses as small as possible. The partitioning was done such that design decisions are delayed as long as possible. The complexity of the resulting subprocesses allows synthesizing hardware for large design specifications. Probably the most remarkable achievement is the algorithm for selecting modules from a set of predesigned module types. This algorithm is both globally optimizing and very fast.
References
[1] G. Zimmermann 
