Abstract-Microcode compaction is the conversion of sequential microcode into efficient parallel (horizontal) microcode. Local compaction techniques are those whose domain is basic blocks of code, while global methods attack code with a general flow control. Compilation of high-level microcode languages into efficient horizontal microcode and good hand coding probably both require effective global compaction techniques.
I. INTRODUCTION
THIS paper presents trace scheduling, a solution to the "global microcode optimization problem." This is the problem of converting vertical (sequential) microcode written for a horizontally microcoded machine into efficient horizontal (parallel) microcode, and as such, is properly referred to as "compaction" rather than "optimization." In the absence of a general solution to this problem, the production of efficient horizontal microprograms has been a task undertaken only by those willing to painstakingly learn the most unintuitive and complex hardware details. Even with detailed hardware knowledge, the production of more than a few hundred lines of code is a major undertaking. Successful compilers into efficient horizontal microcode are unlikely to be possible without a solution to the compaction problem.
Local compaction is restricted to basic blocks of microcode. A basic block is a sequence of instructions having no jumps into the code except at the first instruction and no jumps out except at the end. A basic block of microcode has often been described in the literature as "straight-line microcode." Previous research [1]-[3] has strongly indicated that within basic blocks of microcode, compaction is practical and efficient. In Section II we briefly summarize local compaction and present many of the definitions used in the remainder of the paper.
Manuscript received July 15, 1980; revised November 15, 1980.
Since blocks tend to be extremely short in microcode, global methods are necessary for a practical solution to the problem. To globally compact microcode it is not sufficient to compact each basic block separately. There are many opportunities to move operations from block to block and the improvement obtained is significant. Earlier methods have compacted blocks separately and searched for opportunities for interblock operation movement. However, the motivating point of this paper is the argument that these methods will not suffice. Specifically:
Compacting a block without regard to the needs and capacities of neighboring blocks leads to too many arbitrary choices. Many of these choices have to be undone (during an expensive search) before more desirable motions may be made.
As an alternative, we offer trace scheduling. Trace scheduling compacts large sections of code containing many basic blocks, obtaining an overview of the program. Unless certain operations are scheduled early, delays are likely to percolate through the program. Such critical operations, no matter what their source block, are recognized as such and are given scheduling priority over less critical operations from the outset. Trace scheduling works on entire microprograms, regardless of their control flow, and appears to produce compactions strikingly similar to those laboriously produced by hand. Trace scheduling is presented in Section III. In Section IV suggestions are given to extend trace scheduling to more realistic models of microcode than that used in the exposition.
II. LOCAL COMPACTION
It is not the purpose of this paper to present local compaction in detail. Thorough surveys may be found in [1]-[3]. However, trace scheduling uses local compaction as one of its steps (and we have developed an approach that we prefer to the earlier methods), so we briefly summarize an attack on that problem.
A. A Straightforward Model of Microcode
In this paper we use a relatively straightforward model of microcode. Trace scheduling in no way requires a simplified model, but the exposition is much clearer with a model that represents only the core of the problem. Reference [1] contains a more thorough model and references to others.
0018-9340/81/0700-0478$00.75 © 1981 IEEE
FISHER: TRACE SCHEDULING
During compaction we will deal with two fundamental objects: microoperations (MOP's) and groups of MOP's (called "bundles" in [1] [4].]
Definition 7: Given a P, with a data-precedence DAG defined on it, we define a compaction or a schedule as any partitioning of P into a sequence of disjoint and exhaustive subsets of P, $S = (S_1, S_2, \ldots, S_u)$, with the following two properties.
* For each $k$, $1 \le k \le u$, resource-compatible($S_k$) = true. [That is, each element of S could represent some legal microinstruction.]
* If $m_i \ll m_j$, $m_i$ is in $S_k$, and $m_j$ is in $S_h$, then $k < h$. [That is, the schedule preserves data-precedence.]
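The two properties of Definition 7 can be checked mechanically. The following is a minimal sketch, not the paper's own formulation: it assumes that each MOP's resource vector can be modeled as a set of resource names and that two MOP's conflict exactly when those sets overlap. The names `resource_compatible`, `is_valid_schedule`, and the example MOP's are hypothetical.

```python
from itertools import combinations


def resource_compatible(bundle, resources):
    # Assumed conflict model: two MOP's conflict when their resource sets overlap.
    return all(resources[a].isdisjoint(resources[b])
               for a, b in combinations(sorted(bundle), 2))


def is_valid_schedule(schedule, mops, precedes, resources):
    """Check the two properties of Definition 7.

    schedule  -- list of bundles (sets of MOP names), S_1 .. S_u
    mops      -- the set P
    precedes  -- set of pairs (mi, mj) meaning mi << mj
    resources -- hypothetical map from MOP name to the set of resources it uses
    """
    placed = [m for bundle in schedule for m in bundle]
    # The bundles must be disjoint and together cover all of P.
    if len(placed) != len(set(placed)) or set(placed) != set(mops):
        return False
    # Property 1: every bundle could be a legal microinstruction.
    if not all(resource_compatible(b, resources) for b in schedule):
        return False
    # Property 2: mi << mj puts mi in a strictly earlier bundle than mj.
    cycle = {m: k for k, bundle in enumerate(schedule) for m in bundle}
    return all(cycle[a] < cycle[b] for a, b in precedes)


# Hypothetical three-MOP example: m1 and m3 both need the ALU, and m1 << m3.
mops = {"m1", "m2", "m3"}
resources = {"m1": {"alu"}, "m2": {"bus"}, "m3": {"alu"}}
precedes = {("m1", "m3")}

print(is_valid_schedule([{"m1", "m2"}, {"m3"}], mops, precedes, resources))
print(is_valid_schedule([{"m1", "m3"}, {"m2"}], mops, precedes, resources))
```

The first call describes a legal two-cycle schedule; the second places two ALU users in one bundle and so fails the resource test.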
[The elements of S are the "bundles" mentioned earlier. Note that this definition implies another restriction of this model. All MI's are assumed to take one microcycle to operate. This restriction is considered in Section IV.]
It is suggested that readers not already familiar with local compaction examine Fig. 1. While it was written to illustrate global compaction, Fig. 1(a) contains five basic blocks along with their resource vectors and readregs and writeregs sets. Fig. 1(d) shows each block's DAG, and a schedule formed for each. Fig. 5(a) contains many of the MOP's from Fig. 1 scheduled together, and is a much more informative illustration of a schedule. The reader will have to take the DAG in Fig. 5(a) on faith, however, until Section III.
Compaction via List Scheduling
Our approach to the basic block problem is to map the simplified model directly into discrete processor scheduling theory (see, for example, [5] or [6]). In brief, discrete processor scheduling is an attempt to assign tasks (our MOP's) to time cycles (our bundles) in such a way that the data-precedence relation is not violated and no more than the number of processors available at each cycle are used. The processors may be regarded as one of many resource constraints; we can then
[Fig. 1(a). MOP's of the example, listed by block (B1, B2, B3, ...). Columns: MOP name, registers written, registers read, followers (default falls through to next MOP on list), resource vector.]
[8], [9]. In particular, blocks in microcode tend to be short, and the compactions obtained are full of "holes," that is, MI's with room for many more MOP's than the scheduler was able to place there. If blocks are compacted separately, most MI's will leave many resources unused.
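The list scheduling approach to a single basic block, described above as assigning MOP's to cycles without violating data-precedence or resource limits, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's algorithm: resources are modeled as sets of names that conflict on overlap, the priority is a "highest levels first" rule (longest path to the end of the DAG), and `list_schedule` and the sample MOP's are hypothetical.

```python
def list_schedule(mops, succs, resources):
    """Greedy list scheduling of one basic block.

    mops      -- MOP names in source order
    succs     -- map from MOP name to the set of its DAG successors
    resources -- assumed model: map from MOP name to the set of resources used
    """
    order = {m: i for i, m in enumerate(mops)}
    preds_left = {m: 0 for m in mops}
    for m in mops:
        for s in succs.get(m, ()):
            preds_left[s] += 1

    level = {}

    def height(m):
        # Length of the longest DAG path starting at m.
        if m not in level:
            level[m] = 1 + max((height(s) for s in succs.get(m, ())), default=0)
        return level[m]

    schedule = []
    ready = {m for m in mops if preds_left[m] == 0}
    remaining = len(mops)
    while remaining:
        if not ready:
            raise ValueError("precedence graph is not acyclic")
        bundle, used = [], set()
        # Highest level first; ties broken by source order.
        for m in sorted(ready, key=lambda m: (-height(m), order[m])):
            if used.isdisjoint(resources[m]):
                bundle.append(m)
                used |= resources[m]
        ready.difference_update(bundle)
        remaining -= len(bundle)
        # Successors become data ready only for later cycles.
        for m in bundle:
            for s in succs.get(m, ()):
                preds_left[s] -= 1
                if preds_left[s] == 0:
                    ready.add(s)
        schedule.append(bundle)
    return schedule


# Hypothetical block: m1 -> m3 -> m4 and m2 -> m4; m1, m2, m4 share the ALU.
mops = ["m1", "m2", "m3", "m4"]
succs = {"m1": {"m3"}, "m2": {"m4"}, "m3": {"m4"}}
resources = {"m1": {"alu"}, "m2": {"alu"}, "m3": {"bus"}, "m4": {"alu"}}
schedule = list_schedule(mops, succs, resources)
print(schedule)
```

Because m1 heads the longest path it is placed first; m2 then shares the second cycle with m3, giving three cycles for four MOP's.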
A. The Menu Method
Most hand coders write horizontal microcode "on the fly," moving operations from one block to another when such motions appear to improve compaction. A menu of code motion rules that the hand coder might implicitly use is found in Fig. 3. Since this menu resembles some of the code motions done by optimizing compilers, the menu is written using the terminology of flow graphs. For more careful definitions see [10] or [11], but informally we say the following.
Previous suggestions for global compaction have explicitly automated the menu method [12], [13]. This involves essentially the following steps.
1) Only loop-free code is considered (although we will soon consider a previous suggestion for code containing loops).
2) Each basic block is compacted separately.
3) Some ordering of the basic blocks is formed. This may be as simple as listing pairs of basic blocks with the property that if either is ever executed in a pass through the code, so is the other [12], or it may be a walk through the flow graph [13].
4) The blocks are examined in the order formed in 3), and legal motions from the current block to previously examined ones are considered. A motion is made if it appears to save a cycle.
Limitations of the Earlier Methods
The "automated menu" method appears to suffer from the following shortcomings.
* Each time a MOP is moved, it opens up more possible motions. Thus, the automated menu method implies a massive and expensive tree search with many possibilities at each step.
* Evaluating each move means recompacting up to three blocks, an expensive operation which would be repeated quite often.
* To find a sequence of very profitable moves, one often has to go through an initial sequence of moves which are either not profitable, or, worse still, actually make the code longer. Locating such a sequence involves abandoning attempts to prune this expensive search tree.
We summarize the shortcomings of the automated menu method as follows:
Too many arbitrary decisions have already been made once the blocks are individually compacted. The decisions have been shortsighted, and have not considered the needs of neighboring blocks. The movement may have been away from, rather than towards, the compaction we ultimately want, and much of it must be undone before we can start to find significant savings. An equally strong objection to such a search is the ease with which a better compaction may be found using trace scheduling, the method we will present shortly.
C. An Example
Fig. 1(a)-(d) is used to illustrate the automated menu method and to point out its major shortcoming. The example was chosen to exaggerate the effectiveness of trace scheduling in the hope that it will clarify the ways in which it has greater power than the automated menu method. The example is not meant to represent a typical situation, but rather the sort that occurs frequently enough to call for this more powerful solution to the problem. Fig. 1 is not written in the code of any actual machine. Instead, only the essential (for our purposes) features of each instruction are shown. Even so, the example is quite complex. Fig. 1(a) shows the MOP's written in terms of the registers written and read and the resource vector for each MOP.
The compaction would be obtained using list scheduling and almost any heuristic (and probably any previously suggested compaction method).
Given the blocks as compacted in Fig. 1(d), an automated menu algorithm could then choose to use rules R1-R6 to move MOP's between blocks. Fig. 4 shows the possible application of some of these rules, and using them we see that some of the blocks may be shortened. If, for the moment, we suppose that the code usually passes through the path B1-B2-B3, we can see that the length of that path may be reduced from the initial 13 cycles (with each block compacted separately) to 11 cycles, an important savings. Nonetheless, we shall see that this is an example of an unsatisfactory compaction obtained using the automated menu method.
D. Trace Scheduling: Compacting Several Blocks Simultaneously
The shortcomings of the automated menu method are effectively dealt with using the technique we call trace scheduling. Trace scheduling operates on traces instead of basic blocks. A trace is a loop-free sequence of instructions which might be executed contiguously for some choice of data. More formally, we define the following.
We define a trace as any sequence of distinct MI's $(m_1, m_2, \ldots, m_t)$ such that for each $j$, $1 \le j \le t - 1$, $m_{j+1}$ is in followers($m_j$). [Thus, a trace is a path through the code which could (presumably) be taken by some setting of the data. Note that if no MI in T is a conditional jump, T might be a basic block, or part of one. In general, T may contain many blocks.]
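The definition of a trace translates directly into a small predicate. The sketch below assumes the followers relation is available as a dictionary from each MI to the collection of MI's that may execute next; `is_trace` and the sample flow graph are hypothetical names used only for illustration.

```python
def is_trace(seq, followers):
    """True if seq is a sequence of distinct MI's in which every MI
    after the first is a follower of the MI just before it."""
    if len(set(seq)) != len(seq):
        return False          # MI's on a trace must be distinct (loop-free)
    return all(seq[j + 1] in followers[seq[j]] for j in range(len(seq) - 1))


# Hypothetical flow graph: m2 is a conditional jump to m3 or m5.
followers = {
    "m1": {"m2"},
    "m2": {"m3", "m5"},
    "m3": {"m4"},
    "m4": set(),
    "m5": {"m4"},
}

print(is_trace(["m1", "m2", "m3", "m4"], followers))
print(is_trace(["m1", "m3"], followers))
```

The first sequence spans the conditional jump at m2 and is a trace; the second skips m2 and is not.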
To allow us to consider P to be a portion of a larger program, we have dummy MI's which are on the boundaries of P. We call these dummy MI's entrances and exits, and they are used to interface P with the rest of the program. Before compaction we set the compacted value of both entrances and exits to true, so they never appear on a trace. After compaction is completed, we replace all jumps to exits by jumps to the code location, outside P, represented by the exit. Similarly, we change all jumps from code outside P to entrances by having the jump be to the followers of the entrance.
Fig. 4. Examples of the savings available in Fig. 1(d) via the menu method.
Building Data-Precedence Graphs on Traces
The automated menu method hunts for specific cases of interblock motions and examines each only after compacting basic blocks. Trace scheduling takes the opposite tack. Here, the scheduler is given in advance the full set of MOP's it has to work with and is allowed to produce whatever schedule is most effective. No explicit attention is given to the source code block structure during compaction. Sometimes, though, it is not permissible for an operation to move from one block to another. That new information and ordinary data-precedence determine edges for a DAG. Given this DAG, it is possible to explain the main technique of trace scheduling as follows.
The DAG built for a trace already contains all of the necessary restrictions on interblock motion, and only those restrictions. A scheduler may compact the trace without any knowledge of where the original block boundaries were. The scheduler's sole aim will be to produce as short a schedule as possible for the trace, making implicit interblock motions wherever necessary to accomplish this goal. This may be done at the expense of extra space, and may sometimes lengthen other traces. Thus, the process is applied primarily to the traces most likely to be executed. More formally, we build the DAG as follows.
Definition 11: Given a trace $T = (m_1, m_2, \ldots, m_t)$, there is a function condreadregs: the set of conditional jumps in T $\to$ subsets of registers. If $i < t$, register r is in condreadregs($m_i$) if r is live at one or more of the elements of followers($m_i$) $- \{m_{i+1}\}$. [That is, at one of the followers besides the one which immediately follows on the trace. Algorithms for live register analysis are a standard topic in compiler research [10], [11]. We assume that updated live register information is available whenever it is required.] For the last element on T, we define condreadregs($m_t$) as all registers live at any follower of $m_t$.
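Definition 11 can be illustrated with a short sketch. It assumes that live register information is already available as a map `live_in` from each MI to the set of registers live on entry to it, and that the conditional jumps are given as a set; both are hypothetical inputs, as are the function name `cond_read_regs` and the sample data.

```python
def cond_read_regs(trace, followers, live_in, cond_jumps):
    """Registers that must be treated as read by each conditional jump on
    the trace, plus the last MI of the trace (per Definition 11)."""
    result = {}
    last = len(trace) - 1
    for i, m in enumerate(trace):
        if m not in cond_jumps and i != last:
            continue
        off_trace = set(followers[m])
        if i < last:
            # Ignore the follower that immediately follows on the trace.
            off_trace.discard(trace[i + 1])
        result[m] = set().union(*(live_in[f] for f in off_trace))
    return result


# Hypothetical data: j1 branches off the trace to x1; m2 ends the trace.
trace = ["m1", "j1", "m2"]
followers = {"m1": ["j1"], "j1": ["m2", "x1"], "m2": ["x2"]}
live_in = {"m2": {"R1"}, "x1": {"R5", "R7"}, "x2": {"R8"}}
crr = cond_read_regs(trace, followers, live_in, cond_jumps={"j1"})
print(crr)
```

Here R5 and R7 are attached to the jump j1 because they are live in the off-trace branch, while R1, live only along the trace itself, is not.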
Definition 12: Given a trace $T = (m_1, m_2, \ldots, m_t)$, we define the successors function to build a directed acyclic graph (DAG) called the trace data-precedence graph. [Or, just the data-precedence graph or DAG if the context is obvious.] This is calculated exactly as if T were a basic block, using the sets readregs(m) and writeregs(m), except for the DAG edges from conditional jumps. If m is a conditional jump, then all the registers in condreadregs(m) are treated as if they were in the set readregs(m) for purposes of building successors(m). [This is to prevent values which may be referenced off the trace from being overwritten by an instruction which moves from below m to above m during compaction. Again, the DAG is defined more carefully when we extend it slightly in the next section.]
Scheduling Traces
In brief, trace scheduling proceeds as follows. To schedule P, we repeatedly pick the "most likely" trace from among the uncompacted MOP's, build the trace DAG, and compact it. After each trace is compacted, the implicit use of rules from the menu forces the duplication of some MOP's into locations off the trace, and that duplication is done. When no MOP's remain, compaction has been completed. To help pick the trace most likely to be executed, we need to approximate the expected number of times each MOP would be executed for a typical collection of data.
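The effect of Definition 12 on the DAG can be seen in a small sketch. The basic-block precedence rule used here is an assumption, since the section that defines it is not reproduced in this text: an edge is drawn from an earlier MOP to a later one whenever the later MOP reads a register the earlier one writes, writes a register the earlier one reads, or writes a register the earlier one also writes. This conservative rule may draw edges that a more careful definition would omit. The names `trace_dag` and the sample MOP's are hypothetical.

```python
def trace_dag(trace, readregs, writeregs, condreadregs):
    """Successor sets for the trace data-precedence graph.

    condreadregs maps conditional jumps to the extra registers that are
    treated as if the jump read them.
    """
    reads = {m: readregs[m] | condreadregs.get(m, set()) for m in trace}
    succs = {m: set() for m in trace}
    for i, a in enumerate(trace):
        for b in trace[i + 1:]:
            if (writeregs[a] & reads[b]            # b reads what a wrote
                    or reads[a] & writeregs[b]     # b overwrites what a reads
                    or writeregs[a] & writeregs[b]):   # both write
                succs[a].add(b)
    return succs


# Hypothetical trace: j1 is a conditional jump; R2 is live off the trace.
trace = ["m1", "j1", "m2"]
readregs = {"m1": {"R0"}, "j1": {"R1"}, "m2": {"R3"}}
writeregs = {"m1": {"R1"}, "j1": set(), "m2": {"R2"}}

with_cond = trace_dag(trace, readregs, writeregs, {"j1": {"R2"}})
without_cond = trace_dag(trace, readregs, writeregs, {})
print(with_cond)
print(without_cond)
```

With condreadregs(j1) = {R2}, the edge from j1 to m2 keeps m2 from moving above the jump and overwriting R2 before the off-trace branch can read it. Without that set, m2 would appear free to move.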
Definition 13: We are given a function expect: P $\to$ nonnegative reals. [Expect(m) is the number of executions of m we would expect for some typical mix of data. It is only necessary that these numbers indicate which of any pair of blocks would be more frequently executed. Since some traces may be shortened at the expense of others, this information is necessary for good global compaction. For similar reasons, the same information is commonly used by the hand coder. An approximation to expect may be calculated by running the uncompacted code on a suitable mix of data, or may be passed down by the programmer.]
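One plausible way to use expect when picking the "most likely" trace is sketched below. The selection rule is an assumption made for illustration and is not taken from this text: start from the uncompacted MI with the largest expect value, then grow the trace forward and backward by repeatedly taking the uncompacted neighbor with the largest expect value. The names `pick_trace`, `followers`, and the sample values are hypothetical.

```python
def pick_trace(followers, expect, compacted):
    """Grow a trace around the most frequently executed uncompacted MI."""
    candidates = [m for m in followers if m not in compacted]
    if not candidates:
        return []

    # Invert followers to find the MI's that can lead to each MI.
    leaders = {m: [] for m in followers}
    for m, fs in followers.items():
        for f in fs:
            leaders[f].append(m)

    seed = max(candidates, key=expect.get)
    trace, on_trace = [seed], {seed}

    def free(ms):
        return [m for m in ms if m not in compacted and m not in on_trace]

    # Extend forward through the likeliest uncompacted follower.
    while (nxt := free(followers[trace[-1]])):
        best = max(nxt, key=expect.get)
        trace.append(best)
        on_trace.add(best)
    # Extend backward through the likeliest uncompacted leader.
    while (prv := free(leaders[trace[0]])):
        best = max(prv, key=expect.get)
        trace.insert(0, best)
        on_trace.add(best)
    return trace


# Hypothetical flow graph: m1 usually falls through to m2, rarely jumps to m4.
followers = {"m1": ["m2", "m4"], "m2": ["m3"], "m3": [], "m4": ["m3"]}
expect = {"m1": 1.0, "m2": 0.9, "m3": 1.0, "m4": 0.1}

first = pick_trace(followers, expect, compacted=set())
second = pick_trace(followers, expect, compacted=set(first))
print(first, second)
```

Entrances and exits, being marked compacted in advance, would never be selected by such a rule, which matches the treatment of those dummy MI's described earlier.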
Given the above definitions, we can now formally state an algorithm for trace scheduling loop-free code.
Algorithm
Fig. 5(a) shows the DAG for that trace, and a schedule that would be formed using the highest levels first priority function. The dotted lines in the DAG are edges that would arise between MOP's which originated in different blocks, but are treated no differently from ordinary edges. Note that, in this example, the scheduler was able to find a significantly shorter compaction, namely 7 cycles versus the 11 which might be expected from the automated menu method. This is due to the overview the scheduling algorithm had of the whole trace. The necessity to execute M6 early, in order to do the critical path of code, is obvious when looking at the DAG. An automated menu compactor would be unlikely to see a gain in moving M6 into the first block, since there would be no cycle in which to place it.
The notes for Fig. 5(a) show the MOP's which would have to be copied into new blocks during the bookkeeping phase. Fig. 5(b) shows the new flow graph after the copying. Fig. 5(c)
Definition 14: A loop is a set of MI's in P which correspond to some "back edge" (that is, an edge to an earlier block) in the flow graph. For a careful definition and discussion, see [11].
Reducible Flow Graphs
For convenience, we think of the whole set P as a loop. We assume that all of the loops contained in P form a sequence $L_1, L_2, \ldots, L_p$, such that
a) each $L_i$ is a loop in P,
b) $L_p = P$,
c) if $L_i$ and $L_j$ have any elements in common, and $i < j$, then $L_i$ is a subset of $L_j$.
That is, we say that any two loops are either disjoint or nested, and that the sequence $L_1, L_2, \ldots, L_p$ is topologically sorted on the "include" relation.
The last requirement above is that P have a reducible flow graph. (For more information about reducible flow graphs, see [11].) Insisting that P have a reducible flow graph is not a problem for us for two reasons. One, programs formed using so-called "structured" control of flow, and not unrestrained GOTO's, are guaranteed to have this property. This is not a compelling argument, since it must be granted that one is apt to find wildly unstructured microcode (the nature of the micro machine level tends to encourage such practices). However, code generated by a compiler is unlikely to be so unstructured. The second reason is stronger, however, and that is that an irreducible program may easily be converted into a reducible one with the use of some simple techniques [11]. The automatic conversion produces a slightly longer program, but we have seen that a small amount of extra space is a price we are willing to pay. All known methods which guarantee that the conversion generates the least extra space rely on the solution of some NP-complete problem, but such a minimum is not important to us.
We will, then, assume that the flow graphs we are working with are reducible, and that the set of loops in P is partially ordered under inclusion. Fig. 6(a) is a sample reducible flow graph containing 12 basic blocks, B1-B12. We identify five sets of blocks as loops, L1-L5, and the table in Fig. 6(b) identifies their constituent blocks. The topological sort listed has the property we desire.
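The "disjoint or nested" condition and the required ordering of loops are easy to check once each loop is represented as a set of blocks. The sketch below is illustrative only: the loop names and block numbers are hypothetical and do not reproduce the table of Fig. 6(b). Sorting by size is sufficient here because a loop properly contained in another always has fewer blocks.

```python
def nested_or_disjoint(loops):
    """True if every pair of loops is either disjoint or nested."""
    sets = list(loops.values())
    for i, a in enumerate(sets):
        for b in sets[i + 1:]:
            if a & b and not (a <= b or b <= a):
                return False
    return True


def order_loops(loops):
    """Order loop names so that every loop comes after the loops it contains."""
    return sorted(loops, key=lambda name: len(loops[name]))


# Hypothetical loops over blocks 1..6; "P" stands for the whole program.
loops = {
    "P": {1, 2, 3, 4, 5, 6},
    "inner": {2, 3},
    "other": {5},
    "mid": {2, 3, 4},
}

print(nested_or_disjoint(loops))
print(order_loops(loops))
```

In the resulting order the whole program P comes last, and each inner loop precedes every loop that contains it, which is the property the simple loop method relies on.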
There are two approaches we may take in extending trace scheduling to loops; the first is quite straightforward.
A Simple Way to Incorporate Loops Into Trace Scheduling
The following is a method which strongly resembles a suggestion for handling loops made by Wood [14]. Compact the loops one at a time in the order L1, L2, ..., Lp. Whenever a loop Li is ready to be compacted, all of the loops Lj contained within it have j < i, and have already been compacted. Thus, any MI contained in such an Lj will be marked as compacted [see Fig. 6(c)].
We can see that trace scheduling may be applied to Li directly with no consideration given to the fact that it is a loop, using the algorithms given above. No trace will ever encounter an MI from any Lj, j < i, since they are all marked compacted. Thus, the traces selected from Li may be treated as if they arose from loop-free code. There are still "back edges" in Li (that is what made it a loop), but they are treated as jumps to exits, as are jumps to MI's outside of Li.
When this procedure is completed, the last loop compacted will have been P. Each MOP will have been compacted in the Li in which it is most immediately contained, and we will have applied trace scheduling to all of P. Even with the addition of trace scheduling, however, this does not provide enough power for many applications.
A More Powerful Loop Method
The above method fails to take advantage of the following potentially significant sources of compaction.
* Some operations which precede a loop could be moved
Fig. 6(c) shows what the flow graph would look like for two of the sample loops. Next, trace scheduling begins as in the nonloop case. Eventually, at least one loop representative shows up on one of the traces. Then it will be included in the DAG built for that trace. Normal data precedence will force some operations to have to precede or follow the loop, while others have no such restrictions. All of this information is encoded on the DAG as edges to and from the representative.
Once the DAG is built, scheduling proceeds normally until some loop representative lrj is data ready. Then lrj is considered for inclusion in each new cycle C according to its priority (just as any operation would be). It is placed in C only if lrj is resource compatible (in the new sense) with the operations already in C. Eventually, some lrj will be scheduled in a cycle C (if only because it becomes the data ready task of highest priority). Further data ready operations are placed in C if doing so does not violate our new definition of resource compatibility.
After scheduling has been completed, lrj is replaced by the entire loop body Lj with any newly absorbed operations included. Bookkeeping proceeds essentially as before. The techniques just presented permit MOP's to move above, below, and into loops, and will even permit loops to swap positions under the right circumstances. In no sense are arbitrary boundaries set up by the program control flow, and the blocks are rearranged to suit a good compaction.
This method is presented in more detail in [2] .
IV. ENHANCEMENTS AND EXTENSIONS OF TRACE SCHEDULING
In this section we extend trace scheduling in two ways: we consider improvements to the algorithm which may be desirable in some environments, and we consider how trace scheduling may be extended to more general models of microcode.
A. Enhancements
The following techniques, especially space saving, may be critical in some environments. In general, these enhancements are useful if some resource is in such short supply that unusual tradeoffs are advantageous. Unfortunately, most of these are inelegant and rather ad hoc, and detract from the simplicity of trace scheduling.
Space Saving
While trace scheduling is very careful about finding short schedules, it is generally inconsiderate about generating extra MI's during its bookkeeping phase. Upon examination, the space generated falls into the following two classes:
1) space required to generate a shorter schedule,
2) space used because the scheduler will make arbitrary decisions when compacting; sometimes these decisions will generate more space than is necessary to get a schedule of a given length.
In most microcode environments we are willing to accept some extra program space of type 1, and in fact, the size of the shorter schedule implies that some or all of the "extra space" has been absorbed. If micromemory is scarce, however, it may be necessary to try to eliminate the second kind of space and desirable to eliminate some of the first. Some of the space saving may be integrated into the compaction process. In particular, extra DAG edges may be generated to avoid some of the duplication in advance; this will be done at the expense of some scheduling flexibility. Each of the following ways of doing that is parameterized and may be fitted to the relevant time-space tradeoffs.
If the expected probability of a block's being reached is below some threshold, and a short schedule is therefore not critical, we draw the following edges.
1) If the block ends in a conditional jump, we draw an edge to the jump from each MOP which is above the jump on the trace and writes a register live in the branch. This prevents copies due to the ambitious use of rule R3 on blocks which are not commonly executed.
2) If the start of the block is a point at which a rejoin to the trace is made, we draw edges to each MOP free at the top of the block from each MOP, which is in an earlier block on the trace and has no successors from earlier blocks. This keeps the rejoining point "clean" and allows a rejoin without copying.
3) Since the already formed schedule for a loop may be long, we may be quite anxious to avoid duplicating it. Edges drawn to the loop representative from all MOP's which are above any rejoining spot on the trace being compacted will prevent copies caused by incomplete uses of rule R1. Edges drawn from the loop MOP to all conditional jumps below the loop will prevent copies due to incomplete uses of rule R3.
In any environment, space critical or not, it is strongly recommended that the above be carried out for some threshold point. Otherwise, the code might, under some circumstances, become completely unwound with growth exponential in the number of conditional jumps and rejoins.
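The first of the edge-drawing rules above can be sketched as a pass over the trace DAG before scheduling. This is a rough illustration under several assumptions: `reach_prob` gives the expected probability of reaching the block that ends in each conditional jump, `live_in_branch` gives the registers live in the off-trace branch of that jump, and the DAG is a map from each MOP to its successor set. All of these names, and the threshold values, are hypothetical.

```python
def add_jump_edges(trace, succs, writeregs, reach_prob, live_in_branch, threshold):
    """Return a copy of succs with extra edges that pin writers above rarely
    reached conditional jumps, trading scheduling freedom for less copying."""
    new_succs = {m: set(s) for m, s in succs.items()}
    for i, jump in enumerate(trace):
        if jump not in live_in_branch:
            continue                      # not a conditional jump
        if reach_prob[jump] >= threshold:
            continue                      # short schedule matters here
        for above in trace[:i]:
            # A MOP that writes a register live in the branch would have to
            # be copied into the branch if it sank below the jump.
            if writeregs[above] & live_in_branch[jump]:
                new_succs[above].add(jump)
    return new_succs


# Hypothetical trace: m2 writes R2, which is live in the branch taken at j1.
trace = ["m1", "m2", "j1", "m3"]
succs = {m: set() for m in trace}
writeregs = {"m1": {"R1"}, "m2": {"R2"}, "j1": set(), "m3": {"R3"}}
reach_prob = {"j1": 0.05}
live_in_branch = {"j1": {"R2"}}

pinned = add_jump_edges(trace, succs, writeregs, reach_prob, live_in_branch, 0.1)
free = add_jump_edges(trace, succs, writeregs, reach_prob, live_in_branch, 0.01)
print(pinned["m2"], free["m2"])
```

Raising the threshold adds more such edges and saves more space; lowering it leaves the scheduler freer to shorten the trace.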
For blocks in which we do not do the above, much of the arbitrarily wasted space may be recoverable by an inelegant "hunt-and-peck" method. In general, we may examine the already formed schedule and identify conditional jumps which are above MOP's, which will thus have to be copied into the branch, and MOP's which were below a joining point but are now above a legal rejoin. Since list scheduling tends to push MOP's up to the top of a schedule, holes might exist for these MOP's below where they were placed. We examine all possible moves into such holes and pick those with the greatest profit. Making such an improvement may set off a string of others; the saving process stops when no more profitable moves remain. This is explained in more detail in [2] .
Task Lifting
Before compacting a trace which branches off an already compacted trace, it may be possible to take MOP's which are free at the top of the new trace and move them into holes in the schedule of the already compacted trace, using motion rule R6. If this is done, the MOP's successors may become free at the top and movable. Reference [2] contains careful methods of doing this. This is simply the automated menu approach which we have tried to avoid, used only at the interface of two of the traces.
Application of the Other Menu Rules
Trace scheduling allows the scheduler to choose code motion from among the rules R1, R3, R5, and R6 without any special reference to them. We can also fit rules R2 and R4 into this scheme, although they occur under special circumstances and are not as likely to be as profitable. Rule R2 has the effect of permitting rejoins to occur higher than the bookkeeping rules imply. Specifically, we can allow rejoins to occur above MOP's which were earlier than the old rejoin, but are duplicated in the rejoining trace. This is legal if the copy in the rejoining trace is free at the bottom of the trace. When we do rejoin above such a MOP we remove the copy from the rejoining trace. This may cause some of its predecessors to be free at the bottom, possibly allowing still higher rejoins.
Having used a simplified model to explain trace scheduling, we now discuss extensions which will allow its use in many microcoding environments. We note, though, that no tractable model is likely to fit all machines. Given a complex enough micromachine, some idioms will need their own extension of the methods similar to what is done with the extensions in this section. It can be hoped that at least some of the idiomatic nature of microcode will lessen as a result of lowered hardware costs. For idioms which one is forced to deal with, however, very many can be handled by some special case behavior in forming the DAG (which will not affect the compaction methods) combined with the grouping of some seemingly independent MOP's into single multicycle MOP's, which can be handled using techniques explained below.
In any event, we now present the extensions by first explaining why each is desirable, and then showing how to fit the extension into the methods proposed here.
Less Strict Edges
Many models that have been used in microprogramming research have, despite their complexity, had a serious deficiency in the DAG used to control data dependency. In most machines, master-slave flip-flops permit the valid reading of a register up to the time that register writes occur, and a write to a register following a read of that register may be done in the same cycle as the read, but no earlier. Thus, a different kind of precedence relation is often called for, one that allows the execution of a MOP no earlier than, but possibly in the same cycle as its predecessor. Since an edge is a pictorial representation of a "less than" relation, it makes sense to consider this new kind of edge to be "less than or equal to," and we suggest that these edges be referred to as such. In pictures of DAG's we suggest that an equal sign be placed next to any such edge to distinguish it from ordinary edges. In writing we use the symbol <<. (An alternative way to handle this is via "polyphase MOP's" (see below), but these edges seem too common to require the inefficiencies that polyphase MOP's would require in this situation.)
As ... there is a $k$ such that $i < k < j$ and $r \in$ writeregs($m_k$).
3) If writeregs($m_i$) $\cap$ writeregs($m_j$) $\ne \emptyset$, then $m_i \ll m_j$ unless for each register $r \in$ writeregs($m_i$) $\cap$ writeregs($m_j$) there is a $k$ such that $i < k < j$ and $r \in$ writeregs($m_k$).
In what are called polyphase systems, the MOP's may be further regarded as having submicrocycles. This has the advantage that while two MOP's may both use the same resource, typically a bus, they may use it during different submicrocycles, and could thus be scheduled in the same cycle. There are two equally valid ways of handling this; using either of the methods presented here is quite straightforward. One approach would be to have the resource vector be quite complicated, with the conflict relation representing the actual (polyphase) conflict. The second would be to consider each phase of a cycle to be a separate cycle. Thus, any instruction which acted over more than one phase would be considered a long MOP. The fact that a MOP was only schedulable during certain phases would be handled via extra resource bits or via dummy MOP's done during the earlier phases, but with the same data precedence as the MOP we are interested in.
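The difference between ordinary edges and the "less than or equal to" edges discussed earlier in this subsection shows up in the test a schedule must pass. A minimal sketch, assuming each MOP has already been assigned a cycle number and that the two kinds of edges are kept in separate sets; `respects_edges` and the sample MOP's are hypothetical names.

```python
def respects_edges(cycle, strict_edges, weak_edges):
    """Check a cycle assignment against both kinds of precedence edge.

    strict_edges -- pairs (a, b): b must be in a later cycle than a
    weak_edges   -- pairs (a, b): b may share a's cycle but not precede it
    """
    return (all(cycle[a] < cycle[b] for a, b in strict_edges)
            and all(cycle[a] <= cycle[b] for a, b in weak_edges))


# Hypothetical pair: "rd" reads R1 and "wr" then writes R1.
same_cycle = {"rd": 0, "wr": 0}
write_first = {"rd": 1, "wr": 0}

print(respects_edges(same_cycle, set(), {("rd", "wr")}))    # weak edge
print(respects_edges(same_cycle, {("rd", "wr")}, set()))    # ordinary edge
print(respects_edges(write_first, set(), {("rd", "wr")}))
```

Treating the read-then-write pair as a weak edge lets both MOP's share a cycle, which an ordinary edge would forbid; neither kind of edge lets the write move ahead of the read.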
Many machines handle long MOP's by pipelining one-cycle constituent parts and buffering the temporary state each cycle. Although done to allow pipelined results to be produced one per cycle, this has the added advantage of being straightforward for a scheduler to handle by the methods presented here.
Compatible Resource Usages
The resource vector as presented in Section II is not adequate when one considers hardware which has mode settings. For example, an arithmetic-logic unit might be operable in any of $2^k$ modes, depending on the value of k mode bits. Two MOP's which require the ALU to operate in the same mode might not conflict, yet they both use the ALU, and would conflict with other MOP's using the ALU in other modes. A similar situation occurs when a multiplexer selects data onto a data path; two MOP's might select the same data, and we would say that they have compatible use of the resource. The possibility of compatible usage makes efficient determination of whether a MOP conflicts with already placed MOP's more difficult. Except for efficiency considerations, however, it is a simple matter to state that some resource fields are compatible if they are the same, and incompatible otherwise.
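The rule that some resource fields are compatible if they are the same can be written down directly. The sketch below assumes each MOP's resource usage is a map from a field name to the setting it requires, and that the set of fields allowing shared use is known; `compatible`, `shareable`, and the field names are hypothetical.

```python
def compatible(fields_a, fields_b, shareable):
    """True if two MOP's may be placed in the same microinstruction.

    fields_a, fields_b -- map from resource field to the setting required
    shareable          -- fields that several MOP's may use at once,
                          provided they all ask for the same setting
    """
    for field in fields_a.keys() & fields_b.keys():
        if field not in shareable:
            return False                  # exclusive resource used twice
        if fields_a[field] != fields_b[field]:
            return False                  # same resource, different modes
    return True


shareable = {"alu_mode", "mux_select"}
add1 = {"alu_mode": "add"}
add2 = {"alu_mode": "add"}
xor1 = {"alu_mode": "xor"}
bus1 = {"bus": "R1"}
bus2 = {"bus": "R1"}

print(compatible(add1, add2, shareable))
print(compatible(add1, xor1, shareable))
print(compatible(bus1, bus2, shareable))
```

A per-field comparison like this is simple to state but, as the text notes, may be costly when it sits in the innermost loop of a scheduler.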
The difficulty of determining resource conflict can be serious, since many false attempts at placing MOP's are made in the course of scheduling; resource conflict determination is the innermost loop of a scheduler. Even in microassemblers, where no trial and error placement occurs and MOP's are only placed in one cycle apiece, checking resource conflict is often a computational bottleneck due to field extraction and checking. (I have heard of three assemblers with a severe efficiency problem in the resource legality checks, making them quite aggravating for users. Two of them were produced by major manufacturers.) In [2]
2) Machines are being produced with the potential for very many parallel operations on the instruction level, either microinstruction or machine language instruction. This has been especially popular for attached processors used for cost effective scientific computing.
In both cases, compaction is difficult and is probably the bottleneck in code production.
One popular attached processor, the Floating Point Systems AP-120B and its successors, has a floating multiplier, floating adder, a dedicated address calculation ALU, and several varieties of memories and registers, all producing one result per cycle. All of those have separate control fields in the microprogram and most can be run simultaneously.
The trend towards highly parallel instructions will surely continue to be viable from a hardware cost point of view; the limiting factor will be software development ability. Without effective means of automatic packing, it seems likely that the production of any code beyond a few critical lines will be a major undertaking.
