Abstract
Self-Timed Motivation
A self-timed paradigm offers several potential advantages beyond the savings in design effort which result from eliminating the global clock distribution circuits. Because of their request/acknowledge communication protocol, self-timed circuits separate timing from functionality, which leads to an increase in composability. Systems may be constructed by connecting components and subsystems based only on their functionality without having to consider their timing requirements. Incremental improvements in speed or functionality are possible by replacing individual subsystems with newer designs without changing or retiming the system as a whole. The systems are robust since subsystems continue to operate over a wide range of process variations, voltage differences, or temperature changes. Because self-timed systems signal completion as soon as they are able, self-timed pipelined systems tend to display average case behavior, as opposed to the worst-case behavior typical of traditional synchronous systems. Additionally, self-timed systems do not incur the power overhead of distributing a free running clock across the entire system, and since the circuit elements make signal transitions only when actually doing work or communicating, large systems can show greatly decreased power dissipation in some technologies, especially during quiescence.
In fairness, there are some potential disadvantages as well. Self-timed circuits often exhibit an increase in circuit size, an increase in the number of wires connecting parts of a system, possible performance penalties due to the larger circuits, and a marked difference in design and test procedures from those used in standard synchronous circuits. However, the potential advantages of self-timed circuits are analogous to those evinced by object-oriented programming languages, where the disadvantages of increased code size and minor performance degradation are outweighed by the advantages of using encapsulated software objects without individually tailoring each instantiation.
There are fundamental differences in the structure of asynchronous and synchronous processors, and the problems of each design require innovative solutions. It is certainly possible to implement a conventional microprocessor design using self-timed circuits. The advantages inherent in the self-timed circuit elements are available to any architecture which uses them. However, when the asynchronous philosophy is incorporated at every stage of the design, the microarchitecture is more closely linked to the basic structures of the self-timed circuits themselves, and the resulting design is much simpler. The Fred architecture is an example of such a design approach. The self-timed design philosophy directly results in a powerful and flexible architecture which exhibits significant savings in design effort and circuit complexity.
The Fred Architecture
Fred is a self-timed, decoupled, concurrent, pipelined computer architecture. It dynamically reorders instructions to issue out of order, and allows out-of-order instruction completion. It handles exceptions and interrupts. Several features of the Fred architecture are directly related to its self-timed design, such as the decoupled branch mechanism and exception model. Early versions of the Fred architecture have been discussed elsewhere [9, 10, 11].
A prototype of Fred has been implemented as a detailed VHDL model to investigate the performance and behavior of the Fred architecture under varying conditions (Fred's performance is obviously measured in "Fhlintstones"). Figure 1 shows the overall organization. Each box in the figure is a self-timed process communicating via dedicated data paths rather than buses. Each of these data paths (shown as wires in Figure 1) may be pipelined to any desired depth without affecting the results of the computation. Because Fred uses self-timed micropipelines in which pipeline stages communicate locally only with neighboring stages in order to pass data, there is no extra control circuitry involved in adding additional pipeline stages. Because buses are not used, their corresponding resource contention is avoided.
The VHDL model chooses particular implementations for each of the main pieces of Fred. For example, the Dispatch Unit is organized so as to dynamically schedule instruction issue, and to allow out-of-order completion. This is of particular interest in a self-timed processor where the multiple functional units might take varying amounts of time to compute their results, thus leading naturally to out-of-order instruction dispatch and/or completion. An individual functional unit might even take different amounts of time to compute a result, depending on the input data. The VHDL prototype is fully operational in all aspects, including dynamic scheduling, out-of-order instruction completion, and a functionally precise exception model. The timing and configuration parameters can be adjusted for each component of the design.
Decoupling
In order to achieve a reasonable performance level, Fred utilizes both pipelining and concurrency. Self-timed circuits (especially micropipelines) are ideally suited to the pipeline aspect, since there is no centralized control logic needed to govern the movement of data through the FIFOs, and as soon as an instruction is issued, no further control is required. This greatly simplifies the processor control logic, since there is no need to explicitly control the progression of each datum through each stage of the pipelines. The concurrency aspect is also simplified. No control logic is required to implement multiple pipelines other than arbitration at the fork and join points, which is done locally with self-timed control elements. 
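The claim that data moves through the FIFOs using only local control can be illustrated with a small simulation. The following is a minimal sketch, not the Fred implementation: it assumes single-slot stages and a discrete-step model of what is really an asynchronous handshake, and the name run_micropipeline is hypothetical. Each stage passes its datum forward only when its neighbor is empty, so no central controller is consulted, and the output is the same for any pipeline depth.

```python
from typing import List, Optional


def run_micropipeline(items: List[int], depth: int) -> List[int]:
    """Push items through `depth` single-slot stages (depth >= 1).

    Assumption: one loop iteration stands in for one round of local
    request/acknowledge handshakes; real micropipelines have no global step.
    """
    stages: List[Optional[int]] = [None] * depth
    pending = list(items)
    out: List[int] = []
    while pending or any(s is not None for s in stages):
        # The consumer acknowledges the last stage, which frees it.
        if stages[-1] is not None:
            out.append(stages[-1])
            stages[-1] = None
        # Each stage hands its datum to the next stage if that stage is empty.
        for i in range(depth - 1, 0, -1):
            if stages[i] is None and stages[i - 1] is not None:
                stages[i] = stages[i - 1]
                stages[i - 1] = None
        # The producer places a new datum in the first stage when it is empty.
        if pending and stages[0] is None:
            stages[0] = pending.pop(0)
    return out


shallow = run_micropipeline([10, 20, 30, 40], depth=1)
deep = run_micropipeline([10, 20, 30, 40], depth=6)
```

In this sketch the depth changes only how long a datum spends in flight; the values and their order at the output are unchanged.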
Branch Delay Slots
Flow control instructions are significantly affected by the degree of decoupling in Fred. The instructions for both absolute and relative branches compute a 32-bit value which will replace the program counter if the branch is taken, but the branch is not taken immediately. Instead, the branch target value is computed by the Branch unit and passed through the Branch Queue back to the Dispatch unit, along with a condition bit indicating whether the branch should be taken or not. These data are consumed by the Dispatch unit when a subsequent doit instruction is encountered, and the branch is either taken or not taken at that time. To prevent extra instruction fetches, the doit instruction can be encoded implicitly by a single bit available in the opcode of other instructions, indicated by a ".d" appended to the instruction mnemonic.
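The interaction between the branch instruction, the Branch Queue, and doit can be sketched as follows. This is an illustrative model only, under several assumptions: instructions are issued strictly in order (Fred reorders them), the branch condition is supplied as a precomputed flag rather than evaluated by a Branch unit, and the tuple encoding and the function name dispatch are hypothetical. A mnemonic ending in ".d" stands for the implicit doit bit.

```python
from collections import deque
from typing import Deque, List, Tuple

# (mnemonic, branch target, branch condition); the last two fields are
# meaningful only for branch instructions in this hypothetical encoding.
Instr = Tuple[str, int, bool]


def dispatch(program: List[Instr], max_steps: int = 50) -> List[int]:
    """Return the sequence of program counters that were dispatched."""
    pc = 0
    branch_queue: Deque[Tuple[int, bool]] = deque()
    trace: List[int] = []
    while 0 <= pc < len(program) and len(trace) < max_steps:
        mnemonic, target, taken = program[pc]
        trace.append(pc)
        next_pc = pc + 1
        if mnemonic.startswith("br"):
            # The branch only queues its target and condition bit;
            # the program counter is not redirected yet.
            branch_queue.append((target, taken))
        if mnemonic == "doit" or mnemonic.endswith(".d"):
            # The oldest queued branch result is consumed here. Real hardware
            # would wait for data; this sketch raises IndexError if none exists.
            queued_target, queued_taken = branch_queue.popleft()
            if queued_taken:
                next_pc = queued_target
        pc = next_pc
    return trace


program = [
    ("br", 5, True),     # 0: compute target 5, condition true
    ("add", 0, False),   # 1: still executes, between branch and doit
    ("sub.d", 0, False), # 2: carries the implicit doit bit
    ("mul", 0, False),   # 3: skipped when the branch is taken
    ("mul", 0, False),   # 4: skipped when the branch is taken
    ("or", 0, False),    # 5: branch target
]
taken_trace = dispatch(program)
```

Moving the doit bit to a later instruction lengthens the stretch of code that runs between the branch computation and the redirect, with no change to the branch instruction itself.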
The distance between the branch instruction and the corresponding doit is completely variable, whereas in most synchronous machines the number of instructions to place in the delay slot is encoded in the original branch instruction. Although the concept of delay slots is still valid, the self-timed nature of Fred renders it of less importance. Instructions are dispatched as soon as possible, not in accordance with some arbitrary time signal such as a clock.
Regardless, fixing the number of delay slots to use for each branch instruction would seriously interfere with Fred's ability to dynamically reorder instructions. The doit instruction can be issued out of order, just like any other instruction. This would not be possible with a hard-coded delay slot count.
Much has been written in favor of decoupling branches from other instructions [6, 13, 8, 4, 3, 5]. Trace data gathered from the Fred simulation lends support to this view, especially when the test conditions are considered. The benchmark code used in testing was compiled for a synchronous, non-decoupled architecture (the Motorola 88100 processor) by a relatively non-optimizing compiler (GNU v. 2.4.5), and then subjected to only the simplest peephole optimization after compilation. Despite these handicaps, the average number of delay slots between instructions was around 1.5, as seen in Figure 2. This argues strongly in favor of the branch decoupling technique.
Multiple Branch Targets
Another difference from other decoupled architectures lies in Fred's ability to place more than one branch target into the Branch Queue. A possible use of this ability is to mimic loop unrolling, but without any code expansion. This is not true loop unrolling, since the registers are not recolored, but it is still an intriguing concept. Complete utilization of this feature will most likely require the development of a better compiler.
Prefetching
Any number of instructions (including zero) may be placed between the branch target computation and the doit instruction. From the programmer's view, these instructions do not have to be common to both branches, nor must they be undone if the branch goes in an unexpected way. The only requirement for these instructions is that they not be needed to determine the direction of the branch. The branch instruction can be placed in the current block as soon as it is possible to compute the direction. The doit instruction should come only when the branch must be taken, allowing maximum time for instruction prefetching, as shown in Figure 3 . Because the doit is consumed entirely within the Dispatch Unit, it will take effect as soon as the branch target data is available, allowing instructions past the branch point to be loaded into the IW before the prior instructions have completed (or even issued). This lets the IW act as an instruction prefetch buffer, but it is always correct, never speculative. Figure 4 shows an example, based on the code in Figure 3B .
There are two ways in which the branch decoupling allows effective prefetching. First, the doit instruction does not have to be consumed in program order, but instead can be executed as soon as the branch data reaches the head of the Branch Queue. This reduces some of the branch latency, because instructions from the new stream can be requested as soon as possible. However, in order to fetch instructions from the new stream the doit instruction must be fetched by the Dispatch Unit, and if there are several instructions between the branch and the doit, all of those instructions must be fetched by the Dispatch Unit first. Second, the direction of the branch is known as soon as it is computed by the Branch Unit. By passing this information off-chip to an intelligent cache controller, it is possible to preload the appropriate instructions. Although designing and testing an external cache system for the Fred architecture is beyond the current scope of our research, some speculation is useful. In addition to the standard normal cache implementation, suppose there is also a preload cache, containing a small number of preload cache lines, reserved specifically for loading nonsequential instructions. The cache controller would fill the normal cache one line at a time, allowing effective cache operation as long as the instruction stream consists of sequential instructions. When the Branch Unit computes a new target address it would immediately pass it off-chip to the cache controller. The cache controller can then use the new target address to fill the preload cache lines with the instructions from this target. When the doit is consumed, Fred's normal instruction fetch process would request instructions from the new target. The cache controller would then copy the normal cache lines from the preload cache lines, instead of loading them from memory.
There are several facets to this mechanism:
• Taken branches are more interesting, since non-taken branches simply continue the sequential instruction stream. However, both types could benefit from the preload cache.
• The preload cache would have to be loaded in parallel with the normal cache, since the normal cache might also need to load lines before the doit is consumed.
• The preload cache can simply give up in the event of memory faults, and let the normal cache handle those in sequence.
• If the target address is already in the normal cache, there is no need to use the preload cache at all.
• If the preload cache is invalid (either it has detected a fault, or is not being used for non-taken branches), the normal cache just loads the new target from memory. If the preload cache has not finished loading when the normal cache needs it, the normal cache should wait rather than re-issuing the same load requests.
• The preload cache probably only needs to contain one or two lines, since the normal cache will take over as soon as the first preload line is used.
• The preload cache does not need any replacement strategy, since only one branch/doit pair will appear in the Dispatch Unit at a given time (Fred does not issue speculative branches). Once the doit is consumed the preload cache is copied into the normal cache and invalidated.
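The speculated preload cache can be sketched in a few lines. This is a hypothetical model, not a tested design: the class name PreloadingCache, the line size, the list-based memory, and the out-of-range check standing in for a memory fault are all assumptions, and the copy into the normal cache happens on the first fetch from the new target rather than at the instant the doit is consumed. Parallel loading and waiting for an unfinished preload are not modeled.

```python
from typing import Dict, List


class PreloadingCache:
    LINE = 4  # words per cache line (arbitrary choice for the sketch)

    def __init__(self, memory: List[int], preload_lines: int = 2) -> None:
        self.memory = memory
        self.normal: Dict[int, List[int]] = {}
        self.preload: Dict[int, List[int]] = {}
        self.preload_lines = preload_lines
        self.demand_loads = 0  # lines loaded only when the fetch needed them

    def _read_line(self, line: int) -> List[int]:
        base = line * self.LINE
        return self.memory[base:base + self.LINE]

    def branch_target(self, addr: int) -> None:
        """Called when the Branch Unit passes a taken-branch target off-chip."""
        self.preload.clear()  # only one branch/doit pair is outstanding
        first = addr // self.LINE
        if first in self.normal:
            return  # target already cached; the preload cache is not needed
        for line in range(first, first + self.preload_lines):
            if line * self.LINE >= len(self.memory):
                break  # stand-in for a memory fault: simply give up
            self.preload[line] = self._read_line(line)

    def fetch(self, addr: int) -> int:
        """Normal instruction fetch, as issued after the doit is consumed."""
        line = addr // self.LINE
        if line not in self.normal:
            if line in self.preload:
                # Copy the preloaded lines into the normal cache and invalidate.
                self.normal.update(self.preload)
                self.preload.clear()
            else:
                self.normal[line] = self._read_line(line)
                self.demand_loads += 1
        return self.normal[line][addr % self.LINE]


memory = list(range(100, 132))  # 32 words of stand-in instruction data
cache = PreloadingCache(memory)
first_word = cache.fetch(0)     # sequential fetch: one demand load
cache.branch_target(20)         # Branch Unit announces the target early
target_word = cache.fetch(20)   # served from the preload lines
```

In this model the fetch from the branch target causes no demand load, which is the latency saving the text speculates about.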
Exceptions
The major innovation of the Fred architecture, which makes possible many additional features, is the functionally precise exception model [10]. A precise exception model allows the programmer to view the processor state as though the exception occurred at a point exactly between two instructions, such that all instructions before that point have completed while all those after have not yet started. Fred's functionally precise model instead simply presents a snapshot of the current instruction stream, in which some instructions have faulted, some have not yet issued, and any nominally sequential instructions which are not present have completed successfully out of order.
Instruction Window Size
Fred uses an Instruction Window (IW) to dispatch instructions. Instructions are fetched into this buffer, and issued from it as dependencies are satisfied. The IW is used not only to search many instructions for possible issue, but to track the status of currently executing instructions. Instructions are only removed from the IW when they have completed successfully. With a precise exception model, even though instructions can issue and complete out-of-order, they still must retire from the IW in sequential order so that the precise exception model may be maintained. To enable that to happen, every instruction would have to report its completion, even if it could never fault. This imposes an artificial constraint on the behavior and size of the IW, since the size of the IW directly limits the maximum distance between out-of-order instructions. This constraint is imposed solely to handle exceptions, which typically happen at widely spaced intervals. The IW would have to be significantly larger, both to hold the increased number of instructions which must be tracked, and to allow sufficient history for out-of-order completion to be possible. For example, the PA-8000 uses a precise exception model and has a 56-entry Instruction Reorder Buffer, which serves the purpose of an Instruction Window [7] with out-of-order completion and in-order retirement.
In contrast, Fred's functionally precise exception model allows instructions to retire in any order, in many cases as soon as the instructions issue. The Instruction Window must track all issued instructions which might fault only until they have completed successfully. Instructions which will never fault can be removed from the IW immediately after dispatch. Once an instruction has completed, it can be removed from the IW. There is no need to retire the instructions in any particular order, and except for dependency chains, the order of any two instructions is of no importance for either dispatch or completion.
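The effect of the two retirement rules on IW occupancy can be compared with a toy calculation. This is a sketch under strong assumptions, not a reproduction of the simulation results: one instruction issues per time step, latencies are given up front, dependencies are ignored, and the function name peak_occupancy is hypothetical. Under in-order retirement an instruction leaves the IW only when it and all earlier instructions have completed; under the functionally precise rule a non-faulting instruction leaves at issue and a possibly faulting one leaves at completion.

```python
from typing import List, Tuple


def peak_occupancy(instrs: List[Tuple[int, bool]], in_order: bool) -> int:
    """instrs[i] = (latency, can_fault); instruction i issues at time i."""
    leave_times: List[int] = []
    latest_done = 0
    for issue, (latency, can_fault) in enumerate(instrs):
        done = issue + latency
        if in_order:
            # Must wait for every earlier instruction to complete as well.
            latest_done = max(latest_done, done)
            leave = latest_done
        else:
            # Non-faulting instructions are dropped right after dispatch.
            leave = done if can_fault else issue
        leave_times.append(leave)
    horizon = max(leave_times, default=0) + 1
    peak = 0
    for t in range(horizon):
        live = sum(1 for issue, leave in enumerate(leave_times)
                   if issue <= t < leave)
        peak = max(peak, live)
    return peak


# A slow, possibly faulting load followed by short arithmetic instructions.
stream = [(6, True), (1, False), (1, False), (1, False), (2, True), (1, False)]
slots_in_order = peak_occupancy(stream, in_order=True)
slots_fred = peak_occupancy(stream, in_order=False)
```

For this stream the in-order rule keeps every instruction in the window behind the slow load, while the functionally precise rule holds only the two instructions that might fault.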
This means that the primary factor governing the size of the IW is that it should be large enough to issue instructions efficiently. Simulation results indicate that for the current implementation, an IW greater than four slots is sufficient, and that larger sizes have no effect on performance, as shown in Figure 5 . This small IW size is a direct consequence of the exception model.
The register file is also much simpler under Fred's exception model. With in-order instruction retirement, a history or reorder buffer must be used to ensure that register values are also retired (or made permanent) in sequential order, and to restore the original in-order values if an exception is taken. For Fred, no such buffer is needed.
Fast Completion
In a self-timed system, every data transaction requires a request/acknowledge handshake, which uses power and takes some time. Additionally, reporting instruction completion implies a possible contention for access to the IW by several functional units, which must be arbitrated, requiring additional time and power. Eliminating the requirement that every instruction report its completion allows much of this time and power penalty to be avoided, thereby improving performance. As just mentioned, a precise exception model would require every instruction to report its completion. Figure 6 shows the effect of fast completion.
Instruction Reissue
With a precise exception model, all out-of-order instructions are discarded when the exception occurs. Assuming that the exception is recoverable, the program must resume at the exception point, thus re-issuing those out-of-order instructions. The time and power used in executing these out-of-order instructions the first time is completely wasted. This is not necessarily a major concern. In contrast, Fred's exception model never re-issues any completed instruction. Because of this, the result of every completed instruction is always valid. This means that any instruction currently executing when an exception occurs does not have to be aborted. In fact, any executing instruction which can never fault will not even be recorded in the IW, so the exception handling routine may actually begin before the instruction finishes. If the exception handler needs to access the destination register value associated with that instruction, the scoreboard will ensure that the value becomes valid before it is used. In other words, the Register File does not require any special attention to ensure its correctness during exception handling. The normal scoreboarding mechanisms are sufficient, and exception processing can begin immediately, without waiting for the Register File to become quiescent.
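The role of the scoreboard during exception handling can be shown with a minimal model. This sketch assumes one pending bit per register and a polling read; the class and method names are hypothetical, and in hardware the reader would stall rather than poll. An instruction still in flight when the handler starts has its destination marked pending, so the handler cannot observe a stale value.

```python
from typing import List, Optional


class Scoreboard:
    def __init__(self, nregs: int = 32) -> None:
        self.value: List[int] = [0] * nregs
        self.pending: List[bool] = [False] * nregs

    def issue(self, dest: int) -> None:
        """Mark the destination register as awaiting a result."""
        self.pending[dest] = True

    def write_back(self, dest: int, value: int) -> None:
        """The functional unit delivers its result; the register becomes valid."""
        self.value[dest] = value
        self.pending[dest] = False

    def try_read(self, src: int) -> Optional[int]:
        """Return the value, or None if the reader would have to wait."""
        return None if self.pending[src] else self.value[src]


sb = Scoreboard()
sb.issue(5)                      # long-running instruction targets r5
early_read = sb.try_read(5)      # handler starts and reads r5 too soon
sb.write_back(5, 42)             # the instruction completes normally
late_read = sb.try_read(5)       # the same read now succeeds
```

No register state is rolled back in this model; the handler is simply ordered after the pending write by the same mechanism used for ordinary dependencies.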
Multiple Faults
Under certain circumstances there may be more than one faulty instruction in the IW when exceptions are detected. For example, consider this segment of code:
These instructions do not have any mutual dependencies, so they may all be executing at once. However, if the address for the load instruction would cause a page fault and the arithmetic instructions would overflow, the IW could indicate that all three instructions have faulted when 
Circuit Complexity
Because the functionally precise exception model does not reverse any state of the processor, the complexity is much less than otherwise might be expected. The Instruction Window can be much smaller, since it does not need to worry about the order of instruction completion, and does not even need to track non-faulting instructions. Likewise, there is no need to track or reverse the contents of any registers, so any history or reorder buffers are not required. Because of this, the hardware needed to implement Fred's exception model may be much less complex than that for a precise model. In a physical implementation, this means that less time is needed to design the exception circuitry, and that more silicon area is available to be used for other purposes. Of course, fast design time and reduced complexity (due to the lack of clock circuitry) are two of the often-quoted advantages of asynchronous circuits in general [1].
The drawback to this approach is that a precise exception model is required for speculative execution. Fred cannot issue instructions speculatively, but it would be a relatively simple matter to implement speculative prefetching. However, as illustrated in Figure 2, the decoupled branch mechanism is quite likely to provide enough prefetching time to eliminate the need for speculative execution. Other researchers have indicated that speculative execution is not always beneficial [2].
Applicability to Synchronous Systems
The elastic nature of a micropipeline FIFO allows Fred's decoupled units to run at data-dependent speeds, producing or consuming data as fast as possible for the given program and data. Obviously, this behavior is inherent in a self-timed organization, and cannot be easily attained in a clocked system. Except for that, the major difference between synchronous and asynchronous architectures lies in how the IW is used to handle exceptions. The self-timed version reduces the size of the IW, removes the need for register history and state reversal, and thereby eliminates all the circuitry normally used for those functions. While the functionally precise exception model was developed for Fred because traditional precise models will not work well (it is difficult to reverse the FIFO queues), this technique is not limited solely to self-timed systems.
The functionally precise exception model can be applied directly to synchronous processor design at the expense of eliminating speculative execution. Some form of decoupled branch mechanism would most likely be needed to maintain the original performance level normally seen when speculative execution is present. It is conjectured that given a compiler targeted specifically for this sort of architecture, and by utilizing prefetching caches as suggested in Section 3.3, a decoupled branch mechanism and a functionally precise exception model may well become the methods of choice in future processors, whether synchronous or asynchronous.
Conclusions
The Fred architecture was developed using an asynchronous philosophy incorporated at every stage of the design, resulting in a microarchitecture which is more closely linked to the basic structures of the self-timed circuits themselves. The decoupled branch mechanism is a natural result of the fact that every unit of the processor communicates via FIFO queues. It is quite difficult to reverse the operation of these queues, so the functionally precise exception model was developed. This provides additional unforeseen benefits in reducing circuit complexity and increasing performance. Taking advantage of the inherent qualities of the self-timed building blocks produces a processor design that is quite surprising in its simplicity and elegance.
