Abstract
1: Introduction
The previous four decades have been dominated by synchronous processors with global clocks. However, recently there has been renewed interest in asynchronous designs. This resurgence of interest in asynchronous techniques is driven by advancing technology which is increasingly exposing the inherent limitations of synchronous techniques.
High performance synchronous processors are devoting increasing amounts of silicon area, design effort and power to the global clock. For example, the master clock on the DEC Alpha AXP 21064 draws a peak switching current of no less than 43A [8]. Routing the global clock is also non-trivial, requiring careful simulation to ensure that clock skew is within manageable bounds [11]. Furthermore, the problems associated with global clocks are magnified by both higher clock frequencies and increasing levels of integration.
All operations in a synchronous processor occur in concert with the global clock. The clock period must therefore cater for the longest latency operation, even if it occurs only rarely. All other operations, possibly requiring only a fraction of a clock cycle, are constrained to complete in a single cycle. The performance potential of synchronous processors is therefore artificially limited by forcing all operations to complete in an integral number of clock cycles. The clock cycle must also allow for worst case gate delays at worst case temperatures. Effectively the clock of a synchronous processor must always cater for a worst case scenario, with worst case clock skew and worst case gate delays at worst case temperatures.
Asynchronous processors employ no global clock. Instead synchronisation is achieved through the interaction of adjacent elements. Completion of one operation therefore initiates logically dependent operations. Asynchronous operation has several advantages. First, since there is no global clock to consider, designs can be partitioned into completely independent modules. Secondly, clock circuitry is not required, reducing silicon area. Thirdly, power consumption is lowered since there is no global clock consuming power and since idle elements are quiescent. In contrast, registers in a synchronous processor tend to be clocked in every cycle. Finally, asynchronous processors have a greater performance potential because operations are not constrained by the clock cycle. For example, addition can have a latency proportional to the carry chain length.
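The remark about addition can be made concrete with a small sketch. It assumes a simple ripple-carry model in which the completion time of a self-timed adder grows with the longest run of bit positions through which a carry propagates; the function name `longest_carry_chain` and the default 32-bit width are illustrative choices, not part of Hades.

```python
def longest_carry_chain(a: int, b: int, width: int = 32) -> int:
    """Return the longest run of consecutive bit positions that produce a carry-out.

    In a simple ripple-carry model this run length is a proxy for the
    data-dependent latency of a self-timed addition (an assumption of this sketch).
    """
    carry = 0
    longest = 0
    current = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        # Carry-out of bit i: generated (x and y) or propagated (carry-in and x xor y).
        carry = (x & y) | (carry & (x ^ y))
        if carry:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


if __name__ == "__main__":
    print(longest_carry_chain(0b0101, 0b1010))   # no carries at all
    print(longest_carry_chain(0xFFFFFFFF, 1))    # carry ripples through every bit
```

Under this model most operand pairs finish well before the worst case, which a synchronous clock period would nevertheless have to accommodate.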
The Hades (Hatfield Asynchronous DESign) project is an investigation into asynchronous processor design. The differences between synchronous and asynchronous techniques make it undesirable to produce an asynchronous processor by simply mapping an existing synchronous design onto an asynchronous control structure. Designing asynchronous processors involves new problems which in turn require novel solutions. No aspect of design should therefore be sacrosanct in the move from synchronous to asynchronous techniques.
0-8186-7098-3/95 $04.00 © 1995 IEEE
The Hades project aims to develop and assess schemes that increase the performance of asynchronous processors. A baseline Hades processor has been designed using the formal specification language CSP [4]. The concurrency inherent in CSP allows an asynchronous processor to be modelled as a hierarchy of concurrent processes communicating asynchronously. Hardware simulations of Hades in VHDL (VHSIC Hardware Description Language) [5] are at an early stage. VHDL is used as a simulation environment because of the concurrency features offered. A relationship is being cultivated between CSP and VHDL to facilitate rapid construction of VHDL simulators.
Hades builds on two other computer architecture projects at the University of Hertfordshire. The main progenitors of Hades are HARP (Hatfield Advanced RISC Project) [13], a VLIW (Very Long Instruction Word) architecture, and HSP (Hatfield Superscalar Processor) [15], a superscalar architecture. A common feature of both the HARP and HSP projects is the importance of instruction scheduling. In both cases the full potential of the architecture is only realised through the use of static code reordering at compile time. Static instruction scheduling is also an integral part of the Hades project.
In this paper a baseline version of Hades is presented to illustrate specific aspects of asynchronous processor design including an explicitly declared delayed branch mechanism and a decoupled operand forwarding mechanism. Where necessary the baseline model is expanded to show how other instantiations of Hades can be used to explore the asynchronous design space. Section 2 briefly reviews related work. Section 3 provides an overview of Hades. Section 4 provides an overview of the Hades implementation. Section 5 describes the explicitly declared delayed branching mechanism. Sections 6 and 7 give further details of the Instruction Decode Stage and Register Files. Section 8 describes the Hades decoupled forwarding mechanism. Finally, Section 9 offers some concluding remarks.
2: Review of Related Work
This section places Hades in context by briefly describing other work in the area. Research into asynchronous processor design was initially stimulated by a seminal paper on micropipelines by Sutherland [17] .
The University of Manchester AMULET project investigates asynchronous logic, methodologies and processors. A micropipelined processor, AMULET1 [3], has been fabricated and another, AMULET2, is under development using the lessons of AMULET1. The project aims to exploit the low power potential of asynchronous processors.
Alain Martin's group has designed and fabricated several asynchronous processors. The aim is to develop a methodology for asynchronous design using a basic processor architecture as a case study [7] . Recent work has involved a GaAs processor implementation [18] .
The Counterflow Pipeline Processor Architecture (CFPP) [12] is a radically different asynchronous pipeline structure where instructions and data flow through the processor pipeline in opposite directions. The aim is to develop processor organisations that allow efficient asynchronous realisations.
In spite of a steadily increasing amount of research in the area, there is still a widely held belief that asynchronous processors are inherently slower than their synchronous counterparts, partly because it is held that operand forwarding cannot be efficiently implemented in asynchronous designs. The aim of the Hades project is to develop processor organisations, including efficient operand forwarding mechanisms, that will allow the full performance potential of asynchronous processors to be realised.
3: An Overview of the Architecture
Hades has evolved from ideas developed with synchronous processors. However, the change in focus from synchronous to asynchronous techniques has necessitated alteration of several key areas of processor design. In particular a branch mechanism and an operand forwarding mechanism have been introduced to solve problems created by the move from synchronous to asynchronous techniques.
Hades has a RISC instruction set with a small number of simple instructions. The instruction set is compatible with the HSP instruction set [15] , allowing software developed for HSP to be reused. In particular we plan to use the HSP instruction scheduling software, which reorders code for parallel execution, on the Hades project.
Hades has two register files, one integer and one boolean. The boolean register file stores conditions that are generated by integer comparisons. All conditional branches then test a boolean condition. Comparison instructions are restricted to generating a single boolean value, thus ensuring that both comparison operations and branch resolution can be completed as quickly as possible [14]. Unconditional jumps correspond to a branch-on-false using boolean register B0, which is always false. Branch to subroutine, move register to PC and trap instructions are also provided.
4: Implementation Overview
Underlying Hades implementations is the concept of an asynchronous FIFO (First-In-First-Out) buffer [17]. Control in an asynchronous FIFO is localised and operation is elastic; throughput is determined directly by the availability of data. The elasticity in an asynchronous FIFO allows different input and output data rates to be sustained for short periods. An asynchronous FIFO therefore provides an organisation with localised control that can continue useful operations in the face of temporary interruption to the supply of instructions and data through cache misses or while branch instructions are being resolved.
All Hades implementations use the following four pipeline stages:
Separate instruction and data caches are adopted to reduce memory access bottlenecks. The basic pipeline can be instantiated as either a single-instruction-issue or multiple-instruction-issue processor. All instantiations of the architecture use in-order instruction issue but allow out-of-order instruction completion. In-order instruction issue allows significant simplification of the instruction decode stages. Furthermore, since compile-time instruction scheduling is envisaged, the performance benefits of out-of-order instruction issue are significantly reduced.
The organisation of the baseline Hades processor is shown in Figure 1 . The Instruction Fetch (IF) Stage has the task of supplying instructions to the pipeline. A multiple-instruction-issue version of Hades can issue instructions in concurrent groups increasing the demand for instructions in proportion to the issue rate. To prevent a bottleneck the instruction bandwidth of the instruction cache must be maximised. There are two ways of achieving this: First, instructions can be fetched in groups, masking the latency of each individual cache access. Secondly, instruction fetching can be pipelined to increase throughput. Both alternatives will be modelled and the advantages and disadvantages of each assessed by simulation.
The task of the Instruction Decode Stage (ID) is twofold: First, the ID stage produces executable instructions for the Execution Stage. This includes providing control information, initiating register file accesses, allocating execution units and sign extending immediates and address offsets contained in the instruction. Secondly, the ID stage initiates accesses for the boolean values and return addresses required by conditional branches and other instructions which alter the flow of control. Initiating all register reads in the ID stage avoids arbitration for access to the register files yet allows branches to be resolved early in the pipeline.
The 
5: Explicitly Declared Delayed Branching
The Hades explicitly declared delayed branch mechanism aims to provide an effective method for resolving branches in an asynchronous processor.
Explicitly declared delayed branches are a generalisation of the standard RISC delayed branch mechanism whereby one or more instructions following each branch are always executed before the branch is taken. In Hades the number of instructions after a branch instruction which are executed regardless of the outcome of the branch is not fixed by the architecture but is encoded directly in the branch instruction. A count, called the delay count, is associated with each branch indicating the size of the delay region. A zero delay count is initially associated with each branch during compilation. Then during scheduling, the instruction scheduler attempts to move instructions into the branch delay region to provide the pipeline with useful work while the branch is being resolved. Finally, the scheduler adjusts the delay count to reflect the number of instructions after the branch which must now be executed at run time before the branch can be taken.
Explicitly declared delayed branching is implemented as follows. Instructions from the instruction cache are returned to the Instruction Decode Unit which determines if the instruction is a branch. Branches trigger control flow resolution, while non-branches trigger a sequential fetch. Control flow resolution involves issuing the branch instruction to the Address Generation Unit and initiating a read for the branch condition from the Boolean Register File. The Address Generation Unit reacts to a branch instruction by first fetching the number of sequential instructions indicated by the delay count from the instruction cache. When the final instruction fetch from the delay region has been initiated, the Address Generation Unit uses the result of the branch instruction to determine the address of the next instruction fetch.
The implementation of an explicitly declared delayed branch instruction is illustrated in Figure 2 using the instruction "bt b1,offset(#2)" as an example; all elements of state marked 'd' in Figure 2 are not pertinent to the example and can be ignored. In practice the timing would be less precise than is implied by the figures; only the logical sequence of operations is guaranteed to occur. In the above example the branch delay was two. In the case of a branch delay count of zero, the fetch of the instruction immediately following the branch will already have been initiated before the branch is decoded. This instruction must therefore be squashed if the branch is taken.
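The architectural effect of the delay count can be sketched as follows. This is a minimal model under stated assumptions: instructions are word-addressed, the branch target is given as an absolute address, and only the logical sequence of executed instructions is shown (fetches that are later squashed, as in the zero-delay case, are not modelled). The function name `executed_addresses` and its parameters are hypothetical.

```python
from typing import List


def executed_addresses(branch_pc: int, delay_count: int, taken: bool,
                       target: int, after: int = 2) -> List[int]:
    """Addresses executed from the branch onwards, in logical order.

    The delay region (delay_count instructions after the branch) is always
    executed; only then does the branch outcome select the next address.
    """
    delay_region = [branch_pc + 1 + i for i in range(delay_count)]
    fall_through = branch_pc + 1 + delay_count
    next_pc = target if taken else fall_through
    return [branch_pc] + delay_region + [next_pc + i for i in range(after)]


if __name__ == "__main__":
    # A taken branch at address 100 with a delay count of two, as in "bt b1,offset(#2)".
    print(executed_addresses(100, 2, taken=True, target=200))
    # The same branch when the condition is false.
    print(executed_addresses(100, 2, taken=False, target=200))
```

With a delay count of two, the instructions at addresses 101 and 102 are executed whether or not the branch is taken, which is the work the scheduler uses to cover branch resolution.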
The aim of the delayed branch mechanism is to hide the latency of branch execution by executing instructions immediately following the branch while the branch itself is being resolved. The success of all delayed branch mechanisms therefore depends on the instruction scheduler being able to fill the delay region of a branch instruction with sufficient useful instructions to allow operation to continue in the presence of an unresolved branch. During the last few years considerable experience has been gained in instruction scheduling for multiple-instruction-issue processors, both at the University of Hertfordshire [16] and elsewhere [9], [6]. This work suggests that filling a small number of branch delay slots is no longer a significant problem.
To allow comparisons with later designs, the baseline Hades model implements a fixed branch delay of one. Future extensions will not only allow branch instructions with variable delay counts but will also allow branches to be scheduled within the branch delay slots of other branch instructions.
6: Instruction Decode
The operation of the ID stage can be decomposed into four sub-operations:
Resource Allocation: Since multiple functional units are provided, all instructions must be allocated to functional units. Where multiple instances of the same functional unit are provided, instructions are allocated dynamically to functional units on a rotating basis.
Register File Access: Two types of access are required, one to read the source registers and a second to reserve or lock a destination register. The operation of the register file is described in Section 7.
Operand Forwarding: Decoupled operand forwarding is initiated during the ID stage. This mechanism is explained in detail in Section 8.
Instruction Issue: Instructions are issued to functional units together with any literal operands embedded in the instruction.
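Allocation on a rotating basis can be illustrated with a short sketch. The class name `RotatingAllocator` and the unit names are hypothetical, and the sketch assumes that rotation is maintained separately for each kind of functional unit.

```python
from itertools import cycle
from typing import Dict, Iterator, List


class RotatingAllocator:
    """Allocate instructions to functional units of a given kind in rotation."""

    def __init__(self, units_by_kind: Dict[str, List[str]]) -> None:
        # One endless round-robin iterator per kind of functional unit.
        self._next: Dict[str, Iterator[str]] = {
            kind: cycle(units) for kind, units in units_by_kind.items()
        }

    def allocate(self, kind: str) -> str:
        """Return the unit that receives the next instruction of this kind."""
        return next(self._next[kind])


if __name__ == "__main__":
    allocator = RotatingAllocator({"alu": ["ALU1", "ALU2"], "mem": ["MEM1"]})
    print([allocator.allocate("alu") for _ in range(3)])
    print(allocator.allocate("mem"))
```

Because the allocation is fixed at decode time, it also determines in advance which functional unit will hold the result of each instruction.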
Instructions are issued in-order, simplifying implementation at the expense of greater reliance on compile-time instruction scheduling. True data dependencies (RAWs) and output dependencies (WAWs) are resolved by the register locking mechanism described in Section 7. As a result, although the decoupled operand forwarding described in Section 8 increases performance, it is not essential for correct operation.
7: Asynchronous Register Files
The Hades register file organisation has been designed to allow read and write accesses to proceed concurrently and independently whenever possible. To ensure correct operation a register locking mechanism similar to [10] and [7] is employed.
During the ID stage the source and destination register fields of an instruction are sent to the register file, and register reads are initiated to obtain the source operands.
To ensure correct operation a lock bit is associated with each register. During the read access, a lock operation is also performed by setting the lock bit associated with the destination register. The locked register is only subsequently unlocked when the result of the instruction is returned to the destination register. The lock associated with each register mediates both read and lock accesses to the register file, stalling any subsequent instructions with a locked source or destination register. Register locking resolves RAW hazards by stalling subsequent read accesses until the data is available and WAW hazards by stalling an instruction whose destination register is locked. Register read accesses in Hades therefore proceed as follows:
Accept instruction register fields.
Stall until all source and destination registers are unlocked.
Access source operands from register file.
Output source operands to EX.
The final lock operation can commence immediately as long as neither of the source registers is the same as the destination register. If this is not the case, the destination register lock operation must be stalled until all read accesses to the same register have been initiated. Register writes which return instruction results to the register file proceed completely independently of the read operations.
When a write is complete, the lock bit of the destination register is also cleared.
8: Operand Forwarding in Hades
It is usual to structure a pipeline so that once a result has been produced it travels through one or more stages before it becomes available to a subsequent instruction from the register file. When an immediately following instruction wishes to use the result, the pipeline may therefore have to be stalled until the data becomes available. Although this situation appears undesirable, the alternative of reducing the number of pipeline stages is also likely to reduce performance.
Instead a mechanism called register bypassing or operand forwarding is usually implemented to route data directly from the output of the functional unit where it is produced to the input of the functional unit which requires the new data. Such is the frequency of operand reuse, particularly in a multiple-instruction-issue processor [2], that the absence of some form of bypassing is likely to seriously degrade the performance of any synchronous or asynchronous processor.
8.1: Traditional Register Bypassing Mechanisms
In a synchronous system two independent activities can be synchronised by associating each activity with a clock edge or other specific time in the clock cycle. Such synchronisation is implicit since there is no explicit communication between the two activities. Implicit synchronisation allows operand forwarding to be implemented with minimal overhead.
Data can therefore be communicated between non-adjacent stages using the implicit synchronisation provided by the clock edge. As a result data generated in the execution stage of one instruction can be forwarded directly to the beginning of the execution stage of a following instruction requiring the data (see dotted lines in Figure 3). Comparisons between the source and destination fields of the two instructions are used to control the additional data paths required.
An asynchronous system contains no global clock and therefore no implicit synchronisation. Instead, synchronisation must involve explicit communication between elements. Synchronisation of adjacent stages in an asynchronous pipeline is essential to maintain the flow of data. However, more comprehensive synchronisation of non-adjacent stages is undesirable and can lead to a lockstep operation of pipeline stages and reduced performance. Unfortunately, the implementation of traditional bypassing schemes in an asynchronous environment tends to lead to an undesirable level of high-level synchronisation. However, as noted earlier, without an effective form of operand forwarding the performance of an asynchronous processor is likely to be compromised.
8.2: Decoupled Operand Forwarding
Hades provides an unconventional decoupled operand forwarding mechanism. The central idea is to avoid high-level synchronisation by separating the forwarding from other pipeline operations. The aims of the mechanism are fourfold:
To forward results directly between functional units, rather than indirectly through the register file.
To remove the register file access from the critical execution path whenever a chain of dependent instructions is being executed.
To eliminate high-level synchronisation from the pipeline, allowing operations to continue asynchronously.
To prevent a locked register from stalling not only the current instruction attempting to access the lock but also subsequent, possibly independent instructions.
Section 7 details a register locking mechanism that resolves RAW hazards by stalling instruction issue until the required operand is available from the register file. Decoupled operand forwarding eliminates most of these stalls by providing a mechanism for obtaining operands directly from the functional units that produced them. The objective is to increase performance. Correct operation is still ensured by the register locking mechanism.
Decoupled operand forwarding can be viewed as a form of distributed register caching in which the most recently generated results are retained locally within the functional units. Each functional unit includes a forwarding register that holds a copy of the last result produced by the unit. These results are then forwarded to subsequent instructions under the explicit control of the decode stage.
Instructions are allocated to functional units during the ID stage. This in turn identifies the forwarding register that will contain the data produced by the instruction. A tag is maintained by the ID stage for each forwarding register which uniquely identifies the current contents for the purposes of operand forwarding. When an instruction is issued the appropriate forwarding tag is updated and an overwrite signal is sent to the corresponding forwarding register, invalidating the existing data and allowing the register to be updated with a new result.
During the ID stage each instruction compares its source register fields with the forwarding tags. If no match occurs the source operand is obtained from the register file. If a match occurs decoupled operand forwarding is initiated by requesting the matched forwarding register to send data directly to the instruction's execution unit. The forwarding register will only output valid contents. If the register has received an overwrite signal, it will wait until the next result has been loaded from the functional unit before forwarding data. In this case the forwarding register transmits the operand directly to the execution units. If the functional unit is both forwarding data and executing the instruction, the forwarding register will receive both a forward and overwrite signal. Note that WAW hazards must still be detected using the register file lock bits. An instruction must still not be issued to a functional unit if the destination lock bit is set.
As an example of decoupled operand forwarding consider the following code fragment:
    add r1, r2, r3    /* r1 := r2 + r3 */
    sub r4, r3, r1    /* r4 := r3 - r1 */
Here the subtraction uses the result of the addition. The execution of these instructions is described with reference to Figure 4, a simplified version of Figure 1.
... and the IDU receives instruction 2 (sub r4,r3,r1). This time r1 from instruction 2 matches FR1. A request is therefore sent to the forwarding register of ALU1 asking it to send data directly to ALU2. As a further result, a read access is initiated for r3 but not for r1. A lock register access is also required for r4. Finally, an overwrite signal is sent to the ALU2 forwarding register and the state of the forwarding tags is updated.
The ALU1 forwarding register cannot respond immediately since it is awaiting the addition result from ALU1. However, as soon as this result is received it will be passed straight through the forwarding register to ALU2.
Figure 4(f): Instruction 1 completes execution and the result is loaded into the forwarding register. This register is effectively transparent to the data, which is passed immediately to ALU2, allowing the subtraction to start. The ALU1 result is also written to the register file concurrently. Note, however, that the most recent value of r1 remains in the forwarding register and is therefore available for subsequent reuse.
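The decode-stage bookkeeping in this example can be sketched in a few lines. This is an illustrative model only, not the Hades implementation: the class `DecodeStage` and its methods are hypothetical, timing is ignored, and the sketch assumes that when a register is given a new producer any older forwarding tag naming the same register is cleared so that each tag remains unique.

```python
from typing import Dict, Optional, Sequence, Set, Tuple

Route = Tuple[str, Optional[str]]  # ("forward", unit) or ("regfile", None)


class DecodeStage:
    """Forwarding tags and register lock bits as seen by the ID stage."""

    def __init__(self, units: Sequence[str]) -> None:
        # One tag per forwarding register: the register whose value it holds or will hold.
        self.tags: Dict[str, Optional[str]] = {unit: None for unit in units}
        self.locked: Set[str] = set()  # registers whose lock bit is set

    def issue(self, unit: str, dest: str,
              sources: Sequence[str]) -> Optional[Dict[str, Route]]:
        """Return the route for each source operand, or None if issue must stall."""
        if dest in self.locked:
            return None  # WAW hazard: destination lock bit is set
        routes: Dict[str, Route] = {}
        for src in sources:
            holder = next((u for u, tag in self.tags.items() if tag == src), None)
            if holder is not None:
                routes[src] = ("forward", holder)  # tag match: bypass the register file
            elif src in self.locked:
                return None  # RAW hazard with no forwarding register to supply the value
            else:
                routes[src] = ("regfile", None)
        # Assumption of this sketch: drop any stale tag that names the same register.
        for u, tag in list(self.tags.items()):
            if tag == dest:
                self.tags[u] = None
        self.tags[unit] = dest  # corresponds to sending the overwrite signal
        self.locked.add(dest)   # lock the destination register
        return routes

    def writeback(self, reg: str) -> None:
        """A result has been written to the register file: clear the lock bit."""
        self.locked.discard(reg)


if __name__ == "__main__":
    ids = DecodeStage(["ALU1", "ALU2"])
    print(ids.issue("ALU1", "r1", ["r2", "r3"]))  # add r1, r2, r3
    print(ids.issue("ALU2", "r4", ["r3", "r1"]))  # sub r4, r3, r1
```

In the second call r1 is still locked, yet the subtraction is issued because the tag of the ALU1 forwarding register matches; only r3 is read from the register file.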
Decoupled operand forwarding saves the last result produced by each functional unit for possible forwarding. There are three possibilities for the state of the processor when decoupled result forwarding is initiated.
The first possibility is that the operand required is available from both the register file and a forwarding register. In this case the operand will be obtained directly from the forwarding register. Since the forwarding operation is faster than a register access, performance will be improved as long as a register access is not required for a second register operand.
The second possibility is that the data is present in a forwarding register but not in the register file. Accessing the register file will therefore cause a stall, affecting not only the current instruction but subsequent, possibly independent instructions. In this case, decoupled operand forwarding provides the operand directly from the forwarding register. This increases performance on two fronts: First the operand is provided more quickly. Secondly, the issue of following instructions is not stalled.
The third possibility is that the operand is still being produced and is therefore available from neither the register file nor a forwarding register. This is the traditional operand forwarding case. Again decoupled operand forwarding boosts performance: First, the operand will be forwarded as soon as it is produced. Secondly, the issue of following instructions will not be unnecessarily delayed. It is therefore possible for subsequent instructions to overtake instructions waiting for their operands to be forwarded.
9: Conclusions and Future Work
Hades is a test bed for developing and assessing alternative asynchronous processor organisations. In this paper we have presented our initial ideas for Hades. In particular, we have outlined a proposed delayed branch mechanism and a decoupled operand forwarding mechanism.
The central feature of decoupled operand forwarding is the complete separation of bypassing from the register file writeback operation. This mechanism was developed to provide an efficient bypassing mechanism for asynchronous processors which could be easily extended to multiple-instruction-issue designs.
However, as synchronous superscalar designs continue to require ever more complex register file organisations with an ever-increasing number of read and write ports, the separation of bypassing from write back could prove to be a useful technique for avoiding register file bottlenecks in synchronous as well as asynchronous designs.
The basic Hades model presented will undergo further refinement before being compared with more powerful multiple-instruction-issue Hades models. Our aim is to develop an asynchronous processor organisation which will compete effectively with synchronous superscalar designs. Since future superscalar designs are likely to use aggressive compile-time instruction scheduling, an effective asynchronous design must also be amenable to instruction scheduling, in spite of the inevitable uncertainties associated with the timing of operations within an asynchronous processor. We will continue to use CSP and VHDL to support our work, both to clarify our ideas and to support our hardware simulations. We also hope that CSP will allow us to demonstrate that all our Hades models are logically equivalent to a simple sequential Hades interpreter.
