Abstract
Introduction
A host of asynchronous pipeline controllers have been proposed over the years [1, 2, 3, 8, 11, 13]. Most of them use either a 2-phase or a 4-phase signalling protocol.
In the 2-phase, or transition signalling, protocol, events (requests and acknowledgements) are identified by a transition of the control signals, either low-to-high or high-to-low, and the levels of the control signals have no significance. Hence, as shown in Figure 1(a), a whole request-acknowledge cycle is completed when both signals have made the same transition from one state to the other. MOUSETRAP [1], a simple and robust linear pipeline controller, is based on this protocol and has been shown to operate at ultra-high speeds. However, when the transition signalling protocol is used, translations from 2-phase to 4-phase are usually required at points where level-sensitive control is necessary.
In the 4-phase protocol as shown in Fig , a given cycle has two phases: the working phase and the resetting phase. The working phase runs from the rising edge of request to the rising edge of acknowledgement; in it a request is handled and its completion is notified. The return-to-zero of both the request and acknowledgement signals constitutes the resetting phase. Different sequencings of these 4-phase signalling transitions lead to different controllers covering a range of cost and performance options, as shown in [3].
The pipeline controller presented in this paper employs the Early Acknowledgement protocol introduced in [7]. This protocol is an improvement over the simple 4-phase protocol that hides the resetting phase of the signalling. In this protocol, the acknowledgement is indicated by the falling edge of the acknowledgement signal, whereas in the 4-phase protocol it is indicated by a rising edge. As shown in Figure 1(c), the acknowledgement signal goes high as soon as the request signal goes high, thereby allowing the request signal to be reset on an early acknowledgement. The actual acknowledgement, indicated by the falling edge of the acknowledgement signal, delimits the end of the current transaction and resets the acknowledgement signal for the next request-acknowledge cycle. Hence, this protocol eliminates the resetting phase inherent in the 4-phase protocol and yet retains its simplicity.
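The difference between the three protocols can be made concrete with a small sketch. The event lists below are our reading of the prose description of one request-acknowledge cycle; the names HANDSHAKES, COMPLETION_EDGE and transitions_after_completion are illustrative and do not come from the cited controllers.

```python
# Event orderings for one request-acknowledge cycle, as read from the prose.
# "+" is a rising edge, "-" a falling edge, "~" a transition of either polarity.
HANDSHAKES = {
    "2-phase": ["Req~", "Ack~"],
    "4-phase": ["Req+", "Ack+", "Req-", "Ack-"],
    "early-ack": ["Req+", "Ack+", "Req-", "Ack-"],
}

# The edge that tells the sender its request has actually been served.
COMPLETION_EDGE = {
    "2-phase": "Ack~",
    "4-phase": "Ack+",    # rising edge ends the working phase
    "early-ack": "Ack-",  # falling edge is the actual acknowledgement
}


def transitions_after_completion(protocol):
    """Count the transitions left in the cycle once completion is signalled."""
    events = HANDSHAKES[protocol]
    return len(events) - 1 - events.index(COMPLETION_EDGE[protocol])


for name in HANDSHAKES:
    print(name, transitions_after_completion(name))
```

In this model the 4-phase cycle still has two resetting transitions to perform after completion is signalled, while the Early Acknowledgement cycle has none, which is the sense in which the resetting phase is hidden.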
In this paper, we present a new asynchronous pipeline controller based on the Early Acknowledgement protocol. In a pipeline with logic processing, the proposed controller operates with minimal overhead and its performance is comparable to that of a 2-phase pipeline controller. For the non-linear Conditional Branch operation, the controller based on the Early Acknowledgement protocol is observed to perform slightly better than the 2-phase controller.
The rest of the paper is organized into five sections as follows. Section 2 introduces our controller, describes its operation in detail, and analyses its timing constraints and performance. Section 3 gives details of the 2-phase and 4-phase controllers that we compare with the Early Acknowledgement controller. Section 4 describes the operation and implementation of the non-linear Conditional Branch controllers that we also developed for comparison. Section 5 presents the implementation of the pipeline controllers and the simulation results that we obtained. Conclusions and future work are given in Section 6.
Pipeline controller for Early Acknowledgement Protocol
The pipeline controller we present in this paper is an improvement of the controller that we proposed earlier in [4]. The new controller reduces the overhead of the previously proposed controller at the cost of some reasonable timing constraints, and we show that those constraints can be satisfied in the pipelined operation of the controller. In this section, we first briefly describe how the Early Acknowledgement protocol is employed in a pipelined controller and then describe the operation of the pipeline controller in detail. Finally, we derive the timing constraints that the controller must satisfy for proper operation and analyze its performance.
Pipeline Operation of Early Acknowledgement Protocol
First, we define a few naming conventions that we use for the Early Acknowledgement controller, and for the 2-phase and 4-phase controllers, throughout the rest of this paper. A general diagram of a pipeline using the bundled data scheme with logic processing in-between stages is shown in Figure 2. In the interface of the controller, Rin N is the request from the input stage (N-1) and Ain N is the corresponding acknowledgement signal for the input side. Similarly, Rout N and Aout N are the request and acknowledgement to and from the output stage (N+1). The local clock signal of the stage generated by the controller is clk N. The logic processing delay (t logic) between stages of the pipeline is accounted for by the worst-case matched delay (t MD) inserted in the request line between stages. For the 2-phase protocol the delay can be symmetric, such as a string of buffers, whereas for the 4-phase protocol (and hence for the Early Acknowledgement protocol as well) the delays are asymmetric, with a quicker resetting time. We distinguish the two delays: t MD↑ represents the delay for the rising transition and t MD↓ the delay for the falling transition.
Figure 2. A General Pipeline with Logic Processing
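As an illustration of the asymmetric matched delay, the following minimal sketch delays rising and falling edges of a request by different amounts. The function name asymmetric_delay, the (time, level) edge representation and the numeric values are assumptions made for the example; the sketch also assumes pulses wide enough that the delayed edges keep their order.

```python
def asymmetric_delay(edges, t_rise, t_fall):
    """Delay each (time, level) edge of a binary signal.

    Rising edges (level 1) are delayed by t_rise (the role of t MD↑),
    falling edges (level 0) by t_fall (the role of t MD↓).
    """
    return [(t + (t_rise if level == 1 else t_fall), level) for t, level in edges]


# Hypothetical request: rises at t=0 and falls at t=10 (arbitrary time units).
request = [(0, 1), (10, 0)]

# Slow rising edge covers the logic delay; the reset edge passes quickly.
print(asymmetric_delay(request, t_rise=7, t_fall=1))

# A symmetric delay, as usable for 2-phase signalling, is the special case
# t_rise == t_fall.
print(asymmetric_delay(request, t_rise=7, t_fall=7))
```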
In pipelines using the bundled data scheme, the data bundling constraints and the semantics of the signalling protocol determine the proper sequencing of data and control signal transitions. As explained in the previous section, the Early Acknowledgement protocol uses the falling edge of the acknowledgement signal to indicate the completion of the working phase. Hence, in pipeline operation we use the falling edge of the acknowledgement signal Ain to capture the data into a given stage. This leads us to derive the clock signal clk of the Early Acknowledgement controller from Ain. Figure 3 shows the operational waveforms of the Early Acknowledgement controller when operating the pipeline of Figure 2. First, the rising transition on request Rin N and the corresponding early acknowledgement (rising edge of Ain N) are 'negotiated' between stages N and N-1 beforehand. Since the data D on data N will be captured by stage N at the falling edge of Ain N (clk N is the inverse of Ain N, so this is the rising edge of clk N), the data is expected to be ready on the falling edge of Rin N. The captured data on data N becomes valid at data N+1 after the processing in-between the two stages. The next stage of the pipeline (stage N+1) is notified of this by a falling edge of Rin N+1, which is the delayed request Rout N from stage N.
The distinctive advantage of the Early Acknowledgement protocol comes to light when the pipeline has logic processing in-between stages, where the requests for the successive stages must be delayed. The overhead of the controller can then be entirely hidden in the required matched delay (t MD), provided that the processing delay is greater than the controller overhead. In such a scenario, which is often the case in complex pipelines with adders and multipliers in the logic processing units, the performance is comparable to that of the 2-phase signalling protocol, as obtained in our formal analysis and observed in our experiments. A quantitative analysis can be found in Section 2.3.1, which derives expressions for the timing constraints and performance of the controller.
Controller Operation
Figure 4 depicts the controller that we propose for the Early Acknowledgement protocol. The controller consists of two AND gates, a C-element, an inverter and an asymmetric delay (t RD) for the self-resetting of the rst signal. The clk signal, which clocks the pipeline stage, is derived from Ain and allows the stage to be clocked at the falling edge of the acknowledgement, as mentioned earlier. Figure 5 shows the operation of the controller, which conforms to the pipelined operation of Figure 3. Initially, all the control signals are low except for the clk signal. When the input stage raises the request Rin, the controller can immediately acknowledge the request (hence conforming to the Early Acknowledgement protocol) by raising Ain, which in effect also lowers the clk signal. At first, this is possible because there are no pending requests to the output stage (through Rout), which would otherwise have blocked the assertion of Ain by gating the request at the A1 AND gate. As the early acknowledgement is provided by raising Ain, rst, the input to the A2 AND gate from the asymmetric delay (for which t RD↑ is almost 0), is also raised.
The input stage lowers the request in response to the early acknowledgement, once the data is expected to be ready, which triggers the following events:
• clk is raised, latching the new valid data from the input stage into the current stage register,
• Ain is lowered, acknowledging the capture of data and also resetting the acknowledgement line, and
• complete is raised, generating the rising edge of the output request Rout.
Once Rout is driven high, the C-element keeps it high after the complete signal is lowered by the self-resetting circuit of the controller. This also constitutes the local timing constraint to be satisfied by t RD↓ of the self-resetting delay: the complete signal must be held high long enough to produce Rout high before it is reset. Since the controller has fully completed the handshake cycle at the input side, the input stage is free to make a new request on Rin. However, as described earlier, while the output request Rout is pending high it blocks the early acknowledgement back to the input side. Upon receiving a high on Aout, which is in fact an early acknowledgement from the output stage, the request Rout is lowered, which also indicates the validity of the data for the output stage.
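The holding behaviour of the C-element described above can be sketched with a standard Muller C-element model: the output follows the inputs when they agree and holds its value otherwise. Feeding the element with complete and the inverse of Aout is our reading of the text, not a wiring taken from Figure 4, and the names CElement and rout_trace are illustrative.

```python
class CElement:
    """Muller C-element: output changes only when both inputs agree."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out


def rout_trace(steps):
    """Return Rout after each (complete, Aout) step.

    Assumption: the C-element inputs are `complete` and the inverse of Aout.
    """
    c = CElement()
    return [c.update(complete, 1 - aout) for complete, aout in steps]


steps = [
    (0, 0),  # idle
    (1, 0),  # complete pulses high: Rout is raised
    (0, 0),  # complete self-resets: Rout is held high
    (0, 1),  # early acknowledgement on Aout: Rout is lowered
    (0, 0),  # Aout returns low: Rout stays low until the next complete
]
print(rout_trace(steps))
```

The third step is the one that matters for the t RD↓ constraint: Rout only stays high after complete falls if complete was high long enough for the C-element to switch in the second step.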
These simplifications over the Early Acknowledgement controller in [4] come at the expense of some timing constraints, which are discussed in the next section.
Timing Constraints and Performance
This section analyses the operation of our controller closely, deriving the timing constraints to be met for proper operation and the performance of the controller.
Timing Constraints
First, we turn to the timing constraints required for the desired operation as described in Section 2.2. For constraints to be satisfied by the environment, we assume the controller to be in a middle stage of a pipeline, with similar controllers in the previous and next stages that operate at a speed equal to or slower than that of our controller. Assuming a speed equal to that of our linear controller enables us to quantify the environment delays; considering slower environments yields the worst-case scenarios for the constraints. This analysis shows that the constraints can be satisfied within the pipeline itself, owing to the inherent delays within the controllers.
Constraint 1.
The first constraint concerns the proper generation of the complete signal. For complete to go high on the falling edge of Rin, the output rst of the self-resetting delay must have settled high by the time Rin goes low. The time taken for rst to go high after Rin is raised (t Rin↑→rst↑) is given by:
where t RD↑ is expected to be the smallest delay of the asymmetric delay element. A common implementation of an asymmetric delay element with the smallest possible t RD↑ amounts to the rise time of a single OR gate. Hence,
The earliest time at which Rin can go low is measured from the moment the rising edge of Ain is received by the previous stage on its Aout until its request Rout is lowered.
where t MD↓ is also assumed to be the smallest possible, which is equal to t AND↓. Thus, it can be clearly seen that the timing constraint t Rin↑→Rin↓ > t Rin↑→rst↑ is satisfied.
Constraint 2.
The next timing constraint is to be satisfied by the self-resetting delay. The complete signal should not be self-reset before Rout high is produced. The reference point for delay measurements is fixed on Rout↓. The time between Rout going low and going high again can be derived by considering the delays from the input and output sides of the controller.
The max operator selects the larger of the delays: either the delays on the input side for the complete signal (first operand) or the delays on the output side for the Aout signal (second operand) drive the C-element to generate Rout high. Similarly, the time between the events Rout↓ and complete↓ can be given as:
Here the min operator selects the smaller of the delays on the two paths that drive the A2 AND gate low, resetting the complete signal. In the above two equations, the delays incurred from the environment at the input and output sides are considered to be either equal to or larger than the delays incurred by a linear controller. Hence, the three environment delays in the above two equations can be expressed as follows.
In order to generate complete ↓ and Rout ↑ properly, the above equations (4) and (5) should satisfy the constraint:
This requires analyzing four scenarios, one for each possible combination of the max and min operands. The notations t 1 and t 2 denote the cases in which the first and the second operand of the max and min operators are considered, respectively. Hence,
Validating the first pair of equations, (10) and (12), against the above constraint yields the minimum delay for t RD↓ as follows.
Interestingly, in this case t RD↓ can be almost one gate delay according to the above constraint, which makes room for using a simple symmetric buffer as the t RD delay element. The constraint can be checked for another possible scenario of the controller using equations (11) and (12), which yields a minimum delay for t RD↓ in the case where the second term of the max operator is larger and the first term of the min operator is smaller.
Substituting the right-hand side of equation (6) into the above as the smallest possible delay of t Rout↓→Rin↓, a conservative constraint for t RD↓ can be obtained as follows.
The above condition for t RD↓ states that a slower environment on the output side can be compensated for by adjusting t RD accordingly.
In the third case, equations (10) and (13) can be used to validate the constraint, which yields the following condition.
From equation (8) it can be deduced that the above constraint can be satisfied. Even with the worst-case minimum delay, where t MD = 0 in equation (8), and with the realistic assumption that t AND↑ ≈ t AND↓ in equation (17), the following inequalities hold, hence satisfying the constraint.
The final combination, equations (11) and (13), gives rise to the following inequality.
Again, substituting equations (6) and (8) with minimum delays on the left-hand side, a conservative constraint for the output environment can be obtained as follows.
This is not a strong constraint for our controller: in the preferred application of this controller, where there are processing elements within the pipeline and the matched delays t MD↑ are therefore required to be sufficiently large, the above condition is easily satisfied. Hence, the second constraint of the controller can be met in all four scenarios, with the first two cases defining the minimum delay for t RD and the last two cases requiring satisfiable constraints on the input and output sides respectively. Note also that a typical linear pipeline with our controller in every stage corresponds to the very first case. In that scenario, the minimum delays of equations (6), (7) and (8) hold and consequently the preconditions for equation (14) are satisfied. As indicated earlier, t RD can then be a simple buffer delay.
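The four-scenario analysis of Constraint 2 can be illustrated with a small sketch. It assumes, following the prose, that the constraint requires complete↓ to occur after Rout↑, that is, the min over the two reset paths must exceed the max over the two set paths. The path delays are hypothetical numbers rather than the expressions of the numbered equations, and the function names are illustrative.

```python
from itertools import product


def scenario_checks(rout_rise_paths, complete_fall_paths):
    """Check every (max operand, min operand) pairing, numbered from 1."""
    checks = {}
    pairs = product(enumerate(rout_rise_paths, 1), enumerate(complete_fall_paths, 1))
    for (i, rise), (j, fall) in pairs:
        checks[(i, j)] = fall > rise  # complete must fall after Rout has risen
    return checks


def constraint_holds(rout_rise_paths, complete_fall_paths):
    """Combined form: min over the reset paths exceeds max over the set paths."""
    return min(complete_fall_paths) > max(rout_rise_paths)


# Hypothetical path delays in gate-delay units.
print(scenario_checks((3, 5), (6, 8)), constraint_holds((3, 5), (6, 8)))

# A slow output side (second max operand) breaks one pairing; lengthening the
# reset delay, which raises the first min operand, would restore it.
print(scenario_checks((3, 7), (6, 8)), constraint_holds((3, 7), (6, 8)))
```

The combined inequality holds exactly when all four pairings hold, which is why each scenario can be examined on its own.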
Performance
Now, we derive equations for two important performance factors of the pipeline: forward latency and cycle time. More importantly, we show which components of the latter can be hidden in a pipeline with logic processing, where the Early Acknowledgement protocol has a competitive edge. We assume that the controller is in the middle stage of a pipeline with the same controllers in the previous and next stages. In contrast to the constraint analysis, we assume in the performance analysis that the controllers operate at maximal speed. An upper bound on performance is thus obtained at optimal operating conditions.
Figure 6 depicts the Signal Transition Graph (STG) of our controller in its desired operation, when it meets the constraints specified above. Thick arrows indicate the signal transitions generated by the environment of the controller (previous and next stage controllers operating at maximum speed), whereas regular arrows indicate transitions made by the controller. Transitions are annotated with the gate delays associated with them. For the delays from the environment, the delays incurred by the controllers of the previous and next stages are used. Dashed arrows are for the clock signals of the controller stage and the following stage (clk N and clk N+1), which are not directly in the control path of the main control logic but are useful in measuring the cycle time in terms of the logic processing delay (t logic). For clarity, not all the transition arcs for these two clock signals are shown.
Cycle time is defined as the interval between two successive data items passing through a pipeline stage when the pipeline is operating at maximum speed. We can measure the gate delays between two successive clk rising edges for this purpose or equivalently the delay between two successive falling edges of Rin.
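As a small illustration of this definition, cycle times can be read off a list of timestamps of successive clk rising edges (or, equivalently, falling edges of Rin). The helper name cycle_times and the timestamps are hypothetical.

```python
def cycle_times(edge_times):
    """Intervals between successive edges of the same kind on one signal."""
    return [later - earlier for earlier, later in zip(edge_times, edge_times[1:])]


# Hypothetical clk rising-edge timestamps in ns for one stage.
clk_rising = [0, 12, 24, 37]
print(cycle_times(clk_rising))
```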
Figure 6. STG for Early Acknowledgement Controller
First, we identify the critical cycle of the controller using the STG branch and merge points. The path [...] is more critical than Rin− → Ain− → Rin+ → Ain+, as the delays in the former are larger. Similarly, the path Rout− → Ain+ → Rin− → Rout+ is more critical than the path Rout− → Aout− → Rout+. Hence, the critical cycle of the controller lies on the path marked in red. Strictly, the STG must be unfolded to the previous and next stages as well to show formally that this path, with the delays shown in the STG, is indeed the critical path defining the cycle time of the controller. We omit the details; the same conclusion is reached by assuming that the environment consists of similar controllers operating at maximal speed. The cycle time can be obtained as a function of the gate delays and the required matched delay (t MD) as follows.
In order to obtain the cycle time and forward latency in terms of the logic processing delay (t logic), we need to express the required matched delay t MD in terms of t logic. When the data is latched at clk N+, the next-stage clock event clk N+1+ must occur only after a delay of t flop + t logic. We can relate t logic to t MD by measuring the same delay along two paths to the event clk N+1+.
• Path on control cycle:
• Path on data cycle:
To ensure the correct operation of the pipeline, T 1 ≥ T 2 must hold. Thus, from the above two equations we can derive an expression for the minimum value of t MD↑ as follows:
Thus, if
holds, we can find the cycle time in terms of t logic by substituting the right-hand side of (24) for t MD↑ in equation (21), which yields the following.
This is quite an impressive cycle time, even compared with that of the 2-phase controller, whose key points are highlighted in Section 3.1. In the case where the logic processing time is smaller and inequality (25) does not hold, the cycle time follows directly from equation (21) with t MD = 0, which is:
Forward latency is the time taken by a data item to emerge from an initially empty pipeline. The transitions on the forward latency path, starting from Rin− of the STG, are shown in blue in Figure 6. When inequality (25) holds, a similar argument gives the forward latency as follows.
When the logic processing delay is small and inequality (25) does not hold, the critical path for forward latency lies on the path:
which is:
Again, we can derive the forward latency on this path with t MD = 0, which is:
In a complex pipeline with adders and multipliers in the logic processing units, condition (25) often holds. The cycle time and forward latency of our controller are then given by equations (26) and (28) respectively.
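The two regimes discussed above can be sketched numerically. The model below is a simplification and not the paper's numbered equations: it lumps all controller gate delays on the control path to clk N+1+ into one hypothetical term t_ctrl_path, so that T 1 = t_ctrl_path + t MD↑ and T 2 = t flop + t logic. All numeric values are made up for the example.

```python
def min_matched_delay(t_logic, t_flop, t_ctrl_path):
    """Smallest t MD↑ with T1 >= T2, never negative."""
    return max(0.0, t_flop + t_logic - t_ctrl_path)


def overhead_hidden(t_logic, t_flop, t_ctrl_path):
    """True when the data path is at least as slow as the bare control path,
    so the controller delay is absorbed by the matched delay."""
    return t_flop + t_logic >= t_ctrl_path


# Hypothetical delays in ns.
t_flop, t_ctrl_path = 0.5, 2.0

# With logic processing, a positive matched delay is required.
print(min_matched_delay(7.0, t_flop, t_ctrl_path), overhead_hidden(7.0, t_flop, t_ctrl_path))

# Without logic processing, no matched delay is needed and the controller
# overhead is exposed.
print(min_matched_delay(0.0, t_flop, t_ctrl_path), overhead_hidden(0.0, t_flop, t_ctrl_path))
```

Under this simplified model, the regime in which the matched delay is positive plays the role of condition (25), and the regime in which it is clamped to zero corresponds to evaluating the cycle time with t MD = 0.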
Comparison to 2-phase and 4-phase Pipeline Controllers
In order to demonstrate the advantage of Early Acknowledgement protocol based controller, we have compared its performance to a 2-phase and a 4-phase pipeline controller. The following sections describe the controllers used for this comparison and their key features.
2-phase Controller: MOUSETRAP
For the 2-phase, or transition signalling, protocol, the MOUSETRAP controller is selected for its simplicity and robustness. As shown in Figure 7, the controller consists of a simple transparent latch and an XNOR gate. The same type of latches (instead of D flip-flops) and the latch enable signal enable are also used for the data path. In [...]
where t latch is the data latching delay of the transparent latch of the controller. Comparing this with the cycle time of the Early Acknowledgement controller in equation (26), we conclude that the Early Acknowledgement controller is quite competitive with the 2-phase controller. As for the forward latency, the Early Acknowledgement controller (equation (28)) has a larger latency than the 2-phase controller.
4-phase Controller
We have used the 4-phase controller proposed in [12] for this comparison. The controller is shown in Figure 8.
Figure 8. 4-phase Controller
We derived the cycle time and latency for this 4-phase controller using a mechanism similar to that employed for the Early Acknowledgement controller.
The STG for obtaining the above cycle time and latency is not shown due to space limitations. Quite evidently the cycle time of the 4-phase controller has controller overhead which lies on the critical cycle and it cannot be hidden by the matched delay, unlike Early Acknowledgement controller.
The details of the experiments and the simulation results that confirm our claims are given in Section 5. Mainly, we have compared these three controllers with each other in two cases: pipelines without logic processing (t logic = 0 and hence t MD = 0) and pipelines with logic processing. The merits of the Early Acknowledgement controller can be observed in the latter case, where t logic satisfies the condition that we derived in (25).
Conditional Branch Controller
We have used a Conditional Branch (CB) non-linear pipeline operation to demonstrate the simplicity of the Early Acknowledgement protocol (which is essentially a 4-phase protocol) in composing complex pipeline constructs. First, we give the abstract operation of the Conditional Branch without reference to any particular signalling protocol, followed by the implementation of the CB controller for each signalling protocol.
In contrast to the Fork operation, the Conditional Branch operation diverts the data to only one branch, depending on the selection signal to the controller. The interface of a two-way Conditional Branch controller is shown in Figure 9. The Conditional Branch controller communicates with the in- [...] The controller blocks further requests from the input stage while waiting for an acknowledgement from the next stage, because the data still needed by the output stage could otherwise be overwritten. When the corresponding branch path acknowledges the request using either Aout1 or Aout2, the cycle completes and the Conditional Branch stage is ready to process the next request on Rin.
Early Acknowledgement CB Controller
The Conditional Branch controller for the Early Acknowledgement protocol is a simple extension of its linear controller. The controller can be composed of a linear con- [...] The select signal is latched using Ain as the latch enable. This ensures that select is sampled on the positive edge of Ain and is stable in select l when the request is made on the negative edge of Rin, which in turn makes the latch opaque by asserting Ain low. A function generator gen, which produces select from the data as shown in Figure 10, is explicitly considered in order to analyze the constraints imposed by such an application. The latched select l diverts the request req from the linear controller to either the Rout1 or the Rout2 conditional path through the demultiplexer. Since only one request is acknowledged, on either Aout1 or Aout2, the acknowledgements from the conditional branches can simply be ORed to produce the Aout of the linear controller.
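The request routing and acknowledgement merging described above can be sketched at the logic level. The mapping of a low select to the first branch is an assumption made for the example, and the function names route_request and merge_acks are illustrative.

```python
def route_request(req, select_l):
    """1-to-2 demultiplexer: steer req to (Rout1, Rout2) using the latched select.

    Assumption: select_l == 0 chooses branch 1, select_l == 1 chooses branch 2.
    """
    return (req, 0) if select_l == 0 else (0, req)


def merge_acks(aout1, aout2):
    """Only one branch is ever requested, so an OR recovers the linear Aout."""
    return aout1 | aout2


print(route_request(1, 0))  # request goes to branch 1
print(route_request(1, 1))  # request goes to branch 2
print(route_request(0, 1))  # a reset request resets the selected branch
print(merge_acks(0, 1))
```

Because the request returns to zero in every cycle, the unselected output simply stays low, which is what makes a plain demultiplexer sufficient here.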
4-phase CB Controller
The Conditional Branch controller for the 4-phase protocol is similar in construction to that of the Early Acknowledgement protocol. The construction of the controller is the same as in Figure 10, except that a 4-phase linear controller is used in place of the Early Acknowledgement linear controller. The operation described in Section 4.1 is valid for the 4-phase Conditional Branch controller as well.
2-phase CB Controller
The Conditional Branch controller for the transition signalling protocol is not as straightforward as for the Early Acknowledgement or 4-phase protocols. Since there is no resetting of the request or acknowledgement signal, we cannot use a demultiplexer to route the request according to the sampled select signal. Figure 11 shows the Conditional Branch controller for the 2-phase protocol, based on [14]. Note that D flip-flops are used, in contrast to the transparent latches used [...] Initially, all control signals are in the same state and the complete signal is high, which indicates that the operations on the output side of the controller are complete. The select signal can be either high or low, depending on the data or other control information that governs the branching operation. When a request is made with a transition on Rin, the difference between the states of Rin and Ain generates the clk signal, which is gated by complete. Since complete is initially high, the clk signal is raised, latching the control and data signals. Once Rin is latched, the same transition occurs on Ain, which acknowledges the request to the input side. The select signal and its complement are latched into the s1 and s2 flip-flops in such a way that one of the registers is flipped depending on the select signal, generating the correct branching request. For example, if the select signal is low at the request, s1 is flipped, generating a transition on Rout1 and making a request on the first branch.
Either request event causes the complete signal to go low, indicating that the latched data is being passed to the output stage, which effectively blocks new requests from the input side. At the acknowledgement from the corresponding branch, each pair of request and acknowledgement signals returns to the same state, raising the complete signal and re-enabling requests from the input side.
In comparison with the minimal-overhead linear controller (MOUSETRAP), the s1 and s2 toggle flops that generate the requests and the completion detection mechanism of the controller incur considerable overhead, adversely affecting its performance. A formal analysis comparing the performance of the Conditional Branch controllers is presented in the journal version of this paper.
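The behaviour described above can be sketched as an untimed, event-level model. The class TwoPhaseCB and its method names are illustrative; the model follows the prose only and does not reproduce the gate-level circuit of Figure 11 or its delays.

```python
class TwoPhaseCB:
    """Event-level sketch of a 2-phase Conditional Branch controller."""

    def __init__(self):
        self.ain = 0                  # latched copy of Rin, acknowledges the input
        self.rout1 = self.rout2 = 0   # outputs of the s1 and s2 toggle flops
        self.aout1 = self.aout2 = 0   # acknowledgements from the two branches

    @property
    def complete(self):
        # High when every request/acknowledge pair is in the same state.
        return self.rout1 == self.aout1 and self.rout2 == self.aout2

    def request(self, rin, select):
        """Apply level `rin`; return True if the stage was clocked."""
        clk = (rin != self.ain) and self.complete
        if clk:
            self.ain = rin            # same transition appears on Ain
            if select == 0:
                self.rout1 ^= 1       # low select: transition on Rout1
            else:
                self.rout2 ^= 1       # high select: transition on Rout2
        return clk

    def acknowledge(self, branch):
        """The chosen branch answers with a matching transition."""
        if branch == 1:
            self.aout1 = self.rout1
        else:
            self.aout2 = self.rout2


cb = TwoPhaseCB()
print(cb.request(1, select=0), cb.rout1, cb.complete)  # clocked, branch 1 requested
print(cb.request(0, select=1))                         # blocked while pending
cb.acknowledge(1)
print(cb.request(0, select=1), cb.rout2)               # accepted after the ack
```

A blocked request is not lost in the real circuit: Rin stays different from Ain, so clk fires as soon as complete returns high. In the sketch this corresponds to calling request again after acknowledge.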
Implementation and Results
In this section we describe in detail the test cases that we built to evaluate the performance of each controller, together with the preliminary simulation results.
Implementation
As a proof of concept, we have evaluated the performance of each of the controllers on a Xilinx Virtex-4 FPGA. We made every effort to minimize the uncertain path delays in FPGA routing. All control and data path circuits of the designs are placed identically in each case using the RLOC placement constraints of the Xilinx ISE tool. Synthesis options, both general and Xilinx-specific, are tuned to suit asynchronous design synthesis. For example, the use of global and regional clock buffers is disabled. Thus, we believe that the results we obtained are comparable with each other, with minimal uncertainty in the measurements.
For the linear controllers we created simple 8-bit, 4-stage FIFOs, each operated by one type of controller. For the Conditional Branch controllers we built 8-bit Y-shaped pipelines with 4 stages, where 2 stages are in the stem of the pipeline and 2 stages are branched out. The Conditional Branch controller is placed in the second stage of the pipeline. All pipelines were constructed to be 8-bit.
The environment of the pipelines consisted of input-generating shift registers and output-capturing registers (two registers in the case of the Conditional Branch), which operated with minimum overhead so as to maximize the performance of the controller under test.
The performance of the controllers was evaluated in two cases:
• pipelines operating without any processing
• pipelines operating with processing between stages
In the first case, there is minimum delay between stages and no logic processing in-between, which evaluates the maximum performance of the controllers for high-speed pipelines. Since there is no logic processing (t logic = 0), no matched delays were inserted between stages either (t MD = 0).
In the second case, we test the performance of the pipeline controllers in the general scenario of pipelines operating with processing in-between stages. To emulate the processing elements, we used simple buffers to delay the data path. The introduced logic delay was between 6.9 ns and 7.2 ns (varying with the exact routing of the data path) for each stage of the pipeline. This delay also satisfies condition (25) that we derived for the Early Acknowledgement controller. The matched delays for the controllers were tuned starting from a higher delay down to the lowest possible delay at which proper operation of the pipeline is guaranteed.
From the first column of results, it can be observed that the 2-phase controllers outperform the 4-phase and Early Acknowledgement controllers in linear and Conditional Branch operations when there is no processing in-between pipeline stages (t logic = 0). Their performance advantage is evident in these cases, where a minimum overhead in the controller is desirable. Since t logic = 0, condition (25) for the Early Acknowledgement controller does not hold and the overhead of the controller is exposed on the critical cycle, which explains its larger cycle time.
According to the second column of the results table, when logic processing is present between pipeline stages the Early Acknowledgement controllers perform better, as their overhead is hidden in the required delay between stages. Since condition (25) holds in this case, the performance of the Early Acknowledgement controller is comparable to that of the 2-phase controller in linear operation. Our Conditional Branch controller for the Early Acknowledgement protocol performs slightly better than the 2-phase controller. The ability of the Early Acknowledgement protocol to hide the control overhead results in this performance gain over the 2-phase Conditional Branch controller. Even though the gain is relatively low, the simplicity of the Early Acknowledgement protocol as a 4-phase protocol makes it more appealing for this non-linear asynchronous pipeline application.
Conclusions and Future Work
We have proposed a new pipeline controller for the Early Acknowledgement protocol. Most of the timing constraints required for the expected behavior of the controller can be easily satisfied by the pipeline delays themselves. When the pipeline has logic processing, the controller operates with minimal overhead by hiding its overhead in the required matched delay. In such a case, we obtained a cycle time comparable to that of the 2-phase controller MOUSETRAP, both analytically and experimentally.
Furthermore, by comparing the Conditional Branch controllers for each protocol, we highlighted the advantages of the Early Acknowledgement protocol, which also inherits the simplicity of the 4-phase protocol.
In future work, we would like to evaluate and confirm the performance of the controllers on an ASIC, for example in a 65 nm technology. Experimental results in such a setting are necessary to strengthen our claims about the advantages of the Early Acknowledgement protocol.
