Performance modeling and analysis of asynchronous pipelines for designers by Lu, Shih-Lien
AN ABSTRACT OF THE THESIS OF
Chih-Ming Chang for the degree of Doctor of Philosophy in
Electrical and Computer Engineering presented on January 24, 1997.




Better performance has been one of the main motivations behind the recent resurgence
of interest in asynchronous circuits (whether or not this claim always holds).
We are particularly interested in the performance of pipelines since they are used
extensively in current digital systems. There exists an algorithm that can find the
exact upper and lower bounds on the separation time of events in a certain class of
process graphs. However, some transformations and complex mathematical analyses, such
as graph decomposition for infinite unfolded process graphs, must be employed in order
to reach exact bounds. This algorithm may be a good candidate for CAD tool development
and circuit synthesis, but it tends to block designers from visualizing what
factors really affect the performance of asynchronous circuits.
In this thesis, a simple approach is adopted to approximate the performance
bounds. Since our method is symbolic rather than numerical, it allows designers to
analyze circuit performance while also providing design guidelines and approaches.
Our approach has two steps. First, several basic modules are chosen, including FIFO,
Fork, Join, Toggle/XOR, Arbiter/Call and Select/XOR. The individual output loop delay,
equivalent input delay and equivalent output delay are derived based on the Equal
loop-delay theorem. The result is a set of difference equations. The performance
approximation can be obtained with simple mathematical operations on the difference
equations, given the bounds of stage-delays. That is, the performance bounds of output
loop delay, equivalent input delay and equivalent output delay can be expressed in
terms of the bounds of stage-delays. Second, for a larger system consisting of those
basic modules, its performance bounds can be derived directly from the bounds of
output loop delay, equivalent input delay and equivalent output delay of those basic
modules, which have already been obtained. This approach allows a fast and easy
calculation of performance bounds, avoiding the need to rederive the difference
equations for the whole system. Both modular design and performance approximation
are possible with our approach.
©Copyright by Chih-Ming Chang
January 24, 1997






in partial fulfillment of
the requirements for the
degree of
Doctor of Philosophy
Completed January 24, 1997
Commencement June, 1997
Doctor of Philosophy thesis of Chih-Ming Chang presented on January 24, 1997
APPROVED:
Major Professor, representing Electrical and Computer Engineering
Head of Department of Electrical and Computer Engineering
Dean of Graduate School
I understand that my thesis will become part of the permanent collection of Oregon State
University Libraries. My signature below authorizes release of my thesis to any reader
upon request.





I am grateful to my advisor, Professor Shih-Lien Lu, and the rest of the committee
members for their guidance. This research would never have been completed without
Professor Lu's continuous support in both finances and spirit. He provided technical
consultation and possible direction toward the answers whenever I encountered
problems. His patience and encouragement gave me full confidence in doing the project.
This research would not be complete without obtaining the exact performance
bounds. I am grateful to Professor Henrik Hulgaard in the Department of Information
Technology, Technical University of Denmark. He provided me with the simulation tool
and taught me how to use it. His quick responses to my questions enabled me to
complete this research in a timely way.
My family's continuous support for my 25 years of student life is highly
appreciated. Their understanding and consideration relieved me from all other
pressures while pursuing this advanced degree.




TABLE OF CONTENTS
1. INTRODUCTION
1.1 Motivation
1.2 Problem Statement
1.3 Thesis Contribution (Goal) and Organization
1.4 Summary
2. LITERATURE REVIEW
2.1 The Operation Of A C-element
2.2 Timed Petri Net Modeling
2.2.1 Approach By Ramamoorthy And Ho
2.2.2 Approach By Wuu And Vrudhula
2.3 Dependency Graph Modeling
2.4 Event Rule Modeling
2.5 Other Works
2.6 Summary
3. MICROPIPELINES AND C-ELEMENT MODELING
3.1 The Two-Phase Bundled Data Convention
3.2 Linear Micropipelines
3.3 Delay Modeling of C-elements
3.4 Event Logic Modules
3.5 Summary
4. THE PERFORMANCE OF A LINEAR MICROPIPELINE WITH FIXED STAGE-DELAY
4.1 Definitions And Theorems
4.2 Transient Delay And Steady State Analysis
4.3 Summary
5. THE PERFORMANCE OF A LINEAR MICROPIPELINE WITH VARIABLE STAGE-DELAY
5.1 Upper And Lower Performance Bounds
5.1.1 Several Representations Of Logic Delays D? 4(j + 2) And D42(j + 2)
5.1.2 Several Approximations To Dr(j + 2)
5.1.3 Several Approximations To T./1(j + 1)
5.1.4 Several Approximations To Di4(j + 2)
5.1.5 Several Approximations To YIN + 2)
5.1.6 Several Approximations To Dr(j + 1)
5.2 Design Procedure And Guidelines
5.3 Numerical Example
5.3.1 Original Design
5.3.2 Design Modification
5.4 Summary
6. PERFORMANCE ISSUES FOR SYNCHRONOUS VS. ASYNCHRONOUS PIPELINES
6.1 Performance Constraints
6.2 Output Loop Delay < Worst Stage-Delay
6.3 Average Output Loop Delay > Worst Stage-Delay
6.4 Effect Of Delay Sequence (Pattern)
6.5 Effect Of Number Of Stages
6.6 Summary
7. PERFORMANCE EVALUATION OF TWO-DIMENSIONAL ASYNCHRONOUS PIPELINES
7.1 Performance Of Asynchronous Pipelines With Fork
7.2 Performance of Asynchronous Pipelines with Join
7.3 Performance Of Asynchronous Pipelines With Toggle/Xor Pair
7.4 Performance Of Asynchronous Pipelines With Arbiter/Call Pair
7.5 Performance Of Asynchronous Pipelines With Select/Xor Pair
7.6 Summary
8. PERFORMANCE ANALYSIS AND DESIGN EXAMPLES OF TWO-DIMENSIONAL PIPELINES
8.1 A Performance Analysis Example: Open-Loop System
8.2 A Performance Analysis Example: Closed-Loop System
8.3 A Design Example
8.4 Summary
9. CONCLUSIONS AND FUTURE WORKS
9.1 Conclusions
9.2 Future Works
BIBLIOGRAPHY
APPENDICES
APPENDIX A: PETRI NET MODELS AND PROCESS NAMES FOR BASIC MODULES
APPENDIX B: A CTSE CODE FOR A SYSTEM SHOWN IN FIGURE 8.1
LIST OF FIGURES
2.1: The operation of a C-element.
2.2: Circuits and their corresponding timed Petri net model.
2.3: A circuit and its corresponding Dependency Graph.
2.4: A circuit and its corresponding Event-Rule system.
3.1: Two-phase bundled data convention.
3.2: A linear micropipeline.
3.3: The operation of a C-element (modified).
3.4: Modeling of physical and logic delays of a C-element.
3.5: Depiction of some event logic modules.
4.1: An m-stage linear micropipeline.
4.2: Definitions of equivalent input/output delays and loop delays.
4.3: Pictorial representation of a firing sequence.
4.4: A data dependency graph for a linear micropipeline.
4.5: An m-stage linear micropipeline and its equivalents.
4.6: Logic and equivalent input/output delay values for the stages
following max delay stage n when they become stable.
4.7: Definitions for local maximum delays and maximum regions.
4.8: Transient delay and steady state analyses.
4.9: Numerical example for a four-stage linear micropipeline.
4.10: Logic and equivalent input/output delay values for the stages
prior to stage n when steady state is reached for all stages.
5.1: Relative relationship among different approximations of D42(j + 2).
5.2: Comparison of our approximations with the exact bounds.
5.3: The relationship between individual and overall
output loop delay bounds.
5.4: Relative relationship among different approximations of D2k4(j + 2).
5.5: Partitioning a combinational logic block.
5.6: The equations corresponding to stage i being split into n stages.
6.1: One stage of a micropipeline.
6.2: Simulation results with Din=9, D2=4 and Dout=11.
6.3: Simulation result with Din=30.8, D2=4 and Dout=15.
6.4: Simulation result with Din=12.8, D2=4 and Dout=11 for different
cases of forward and backward delays sequence (pattern).
6.5: Simulation result with Din=12.3 and Dout=11.5.
7.1: A two-dimensional micropipeline: Fork.
7.2: A Fork pipeline and its Petri net model.
7.3: A two-dimensional micropipeline: Join.
7.4: A Join pipeline and its Petri net model.
7.5: A two-dimensional micropipeline: Toggle/XOR.
7.6: A Toggle/XOR pipeline and its Petri net model.
7.7: A two-dimensional micropipeline: Arbiter/Call.
7.8: An Arbiter/Call pipeline and its Petri net model.
7.9: A two-dimensional micropipeline: Select/XOR.
7.10: A Select/XOR pipeline and its Petri net model.
8.1: A two-dimensional pipeline system: open loop.
8.2: The flow chart for obtaining maximum output loop delays of Figure 8.1.
8.3: A two-dimensional pipeline system: closed loop.
8.4: The dependency flow chart of Figure 8.3.
8.5: Two loops: one data-token loop and one space-token loop.
8.6: Equivalent circuits of Figure 8.3 corresponding to different cases.
8.7: A general system with Join module and its approximation equivalent.
8.8: Design options for helping reach specifications.
LIST OF TABLES
5.1: Comparison of our approximations with the exact bounds
using a three-stage linear pipeline as an example.
7.1: Comparison of our approximations with the exact bounds
using a general Fork pipeline as an example.
7.2: Comparison of our approximations with the exact bounds
using a general Join pipeline as an example.
7.3: Comparison of our approximations with the exact bounds
using a general Toggle/XOR pipeline as an example.
7.4: Comparison of our approximations with the other approximations
using a general Arbiter/Call pipeline as an example.
7.5: The result of our approximation to a general
Select/XOR pipeline.
8.1: Comparison of our approximations with the exact bounds
using Figure 8.1 (open-loop system) as an example.
8.2: Comparison of our approximations with the exact bounds
using Figure 8.3 (closed-loop system) as an example.
LIST OF APPENDIX FIGURES
A.1: Process of a C-element.
A.2: Process of a delay path without initial token.
A.3: Process of a delay path with initial token.
A.4: Process of a Fork.
A.5: Process of a Join.
A.6: Process of a Toggle.
PERFORMANCE MODELING AND ANALYSIS OF
ASYNCHRONOUS PIPELINES FOR DESIGNERS
1. INTRODUCTION
We all live in a clock world. In daily life our living habits (when to wake
up, to start working, to have a meeting, to call it a day, to go to bed) are more or
less regulated by the clock. The same type of phenomenon occurs in the world of digital
circuits, although the precision of clock regulation is different. We can wake up a
little bit earlier or later in our daily lives, but in digital circuits constructed
using clocked logic, the data must be stable before the clock arrives. Can we imagine
what our lives would be like without clocks to regulate our living tempo? Would our
lives or the whole society become more efficient? It is not the intention of this
thesis to answer these particular questions; however, we are interested in a related
question: what will happen if digital circuits are designed without clocks? In this
chapter, we discuss the advantages and disadvantages of leaving the clock out of
digital circuits. Our major interest is how the lack of a clock affects the performance
of a digital circuit. The goal of this thesis and the approach taken to fulfill this
goal are also presented in this chapter.
1.1 Motivation
Clocked-logic design is now the most common discipline used in the design of
digital systems. The main reason this design discipline is so prevalent in industry
and academia is that the knowledge required to design efficient and effective circuits
with it is simple and easy for most design engineers to acquire. A digital system
comprises two parts: combinational logic circuits and registers. Boolean algebra is
the only basic tool necessary to design a combinational logic circuit. Moreover, the
key discipline in designing a correct and functional circuit (e.g., a state machine)
is to stabilize the data before they are latched into registers by the clock. The
tools and discipline themselves have not changed, but the discipline is becoming
harder to observe. As die sizes grow and transistor feature sizes continue to shrink,
wiring delays play an increasingly important role compared with gate delays in a
high-speed digital system (that is, one using a higher-frequency clock). At the layout
level, the wiring delays can no longer be ignored and can skew the timing of the whole
system. The wiring delays may not affect data processing much, since most data are
communicated locally. This is, however, not the case for the clock signal. The
clock is usually generated in one place and then distributed globally to the whole
chip. To make the whole system function correctly, clock skew (phase delay due to long
wiring) must be controlled so that the design discipline is not violated. Because the
clock is distributed globally, controlling clock skew is not a trivial matter. Beyond
the problem of distributing the clock appropriately, the clock itself may take up a
lot of the die area. For example, more than a quarter of the silicon for the Alpha
chip from Digital Equipment Corporation (DEC) [1] is devoted to clock logic. The
potential problems resulting from the clock have prompted researchers to reconsider
the feasibility of eliminating the clock from the design of a digital system.
The approach to designing a clocked system is usually called synchronous design
methodology, because in this kind of system the data being processed must always be
synchronized with the clock. The design methodology that abandons the use of the
clock in designing a digital system is called asynchronous (self-timed) design
methodology [2]. The design discipline in asynchronous design methodology is
to set up a communication protocol (handshaking) between communicating parties. The
mechanism that implements this protocol serves as the synchronization function,
which ensures that the data have been latched correctly, and it operates locally in
asynchronous digital systems.
Asynchronous circuits are said to have several advantages over synchronous
circuits [1,3,4].
(1) Clock skew free: Clock skew is not only hard to control but also degrades
overall performance compared with an ideal system without clock skew [5]. Since
asynchronous circuits use a local communication scheme (handshaking) between
successive stages and feature no global clock, clock skew is avoided naturally.
(2) Noise reduction: In a conventional clock-driven system, all the active
transistors in the system switch at the same time when the global clock arrives. This
leads to steep source current variation and results in high-voltage inductive noise on
the power line. In a self-timed system, handshaking is distributed over time,
dramatically reducing the number of transistors switching simultaneously. This reduces
power noise.
(3) Zero standby power: If implemented in Complementary Metal-Oxide Semiconductor
(CMOS) technology, a self-timed system theoretically consumes no power when it is
idle (not undertaking any handshaking); in reality, a small amount of leakage current
exists. In contrast, a clock-driven system consumes power even when no tasks
are being executed, since the clock continues to operate. If an asynchronous
microprocessor is running under full (maximum) load (speed), its power consumption
will be comparable to that of a synchronous microprocessor. In reality, however, a
microprocessor rarely runs at maximum speed in general applications, leading to less
total power consumption. This characteristic is especially useful for portable devices.
(4) Low heat generation: Consuming more power implies generating more heat.
In Very Large-Scale Integrated (VLSI) circuit design, as density increases, and more
specifically as feature sizes shrink to the 0.5 micron level and below, tens
of millions to hundreds of millions of transistors could be contained in a single chip.
With this kind of density, heat dissipation is a serious problem. Although static
CMOS is employed to save power, a clocked microprocessor still intrinsically wastes
power and therefore dissipates "extra" heat. As mentioned before, asynchronous
microprocessors consume less power; hence, less heat is generated.
(5) Modularity and composability: Using handshaking as the means of
synchronization within asynchronous systems enables the hierarchical design of
self-timed circuits. A larger module can be constructed easily and efficiently by
appropriately interconnecting smaller modules using a self-timed signaling protocol.
This larger module will function correctly as long as the smaller modules are
correctly designed. No additional attention needs to be paid to timing. In
conventional clocked circuit design, composing two modules directly is, in general,
disallowed since this might destroy the synchronization due to additional and
unexpected delays.
(6) Performance improvement: In conventional clocked circuit design (e.g., a
clocked pipeline structure), the clock period must be greater than the worst
stage-delay so that no data are overwritten. That is, the overall performance is
bounded by the worst-case delay. In self-timed circuit design, the result of the
current stage can be transmitted to the next stage as soon as the next stage is ready
to accept data. From the overall viewpoint, only the average delay is seen instead of
the worst-case delay. This increases the throughput. As technology improves, thanks to
modularity and composability, older modules can be replaced by newer, faster versions
without redesigning the whole system. This improves the performance of the
whole system and hence extends the product's market life at the least cost.
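As a rough illustration of item (6), the following Python sketch compares the time
needed to push a finite set of data items through a clocked pipeline and through a
self-timed one. It is only a sketch under stated assumptions and is not the
micropipeline model developed in this thesis: each stage is assumed to hold at most
one item, handshaking overhead is ignored, and the function names and the delay table
are hypothetical.

```python
def sync_finish_time(delays):
    """Clocked pipeline: delays[i][j] is the delay of stage i on item j.

    Assumption: the clock period equals the worst stage-delay, so every
    stage takes one full period per item.
    """
    stages, items = len(delays), len(delays[0])
    period = max(max(row) for row in delays)
    return (items + stages - 1) * period


def async_finish_time(delays):
    """Self-timed pipeline under an idealized blocking model (assumption).

    Stage i starts item j once stage i-1 has finished item j and stage i
    has handed item j-1 to stage i+1. Handshake overhead is neglected.
    """
    stages, items = len(delays), len(delays[0])
    start = [[0.0] * items for _ in range(stages)]
    finish = [[0.0] * items for _ in range(stages)]
    for j in range(items):
        for i in range(stages):
            # Data for item j becomes available when the previous stage finishes.
            ready = finish[i - 1][j] if i > 0 else 0.0
            if j == 0:
                free = 0.0
            elif i + 1 < stages:
                free = start[i + 1][j - 1]   # next stage has accepted item j-1
            else:
                free = finish[i][j - 1]      # last stage has delivered item j-1
            start[i][j] = max(ready, free)
            finish[i][j] = start[i][j] + delays[i][j]
    return finish[-1][-1]


# Hypothetical two-stage pipeline processing three items.
example = [[2, 4, 2], [4, 2, 4]]
sync_time = sync_finish_time(example)
async_time = async_finish_time(example)
```

Under these idealized assumptions the self-timed pipeline never finishes later than
the clocked one, and the two coincide when all stage-delays are equal. Handshake
overhead, which is neglected here, can change this picture.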
Although asynchronous circuits have some potential benefits over synchronous
circuits, as stated above, their design has some problems. One problem is that
asynchronous circuits tend to need more hardware than their synchronous counterparts.
This is especially true when completion detection techniques [6,7,8] are used to
signal data validity to the recipient. The major problem for most design engineers may
be their unfamiliarity with asynchronous design methodologies. The methodologies are
varied [4] and harder to master than the synchronous design methodology. This problem
may become less critical when the methodology is simplified and widely taught.
1.2 Problem Statement
Although several potential advantages are claimed for asynchronous circuits, as
stated in the previous section, only some of them are obvious; others need
further formal verification. For example, since asynchronous circuits feature no global
clock, clock-related problems occurring in synchronous circuits are avoided naturally.
That is, clock skew and clock-induced noise, power consumption and heat generation
do not arise in asynchronous circuits. Moreover, the properties of modularity and
composability are obvious for certain classes of asynchronous design methodologies,
such as micropipelines [9], trace theory [10,11] and the communicating-processes
compilation technique [12]. The advantage of lower power consumption is theoretically
true, since only the active portion of an asynchronous circuit consumes power.
Some models have been used to successfully estimate power consumption in asynchronous
circuits [13,14]. However, neither complete analysis nor extensive experiments
have shown that asynchronous circuits consume less power when hardware overhead is
taken into account. For example, the hardware almost doubles if the dual-rail encoding
method of completion detection is employed to indicate data validity [6,7]. Regarding
performance, it is commonly believed that the performance (average throughput) of an
asynchronous pipeline implemented with variable stage delay is better than that of its
counterpart implemented using synchronous methodology. No work has formally proved
that this statement is always true under all circumstances and applicable to a
general circuit.
In general, there are two metrics for measuring the performance of an asynchronous
pipeline. One is the throughput bounds (upper and lower) and the other is the average
throughput. Throughput bounds delimit the worst and best throughput a circuit can
deliver in an application, while average throughput defines the average performance.
As will be seen later in this thesis, the performance (average throughput) of
asynchronous pipelines depends on stage-delay patterns, which might be totally random.
Since investigating performance from a probabilistic point of view is beyond the scope
of this thesis, we will not propose any design guidelines for implementing circuits
with optimized average throughput; instead, we will discuss some interesting circuit
behavior that results from different stage-delay patterns. Note that since the method
adopted in this thesis is deterministic, the average throughput discussed here is
limited to a finite input data set. That is, the conclusions regarding average
throughput drawn in this thesis cannot be generalized to the case of an infinite input
data set. In other words, it is not the intention of this thesis to prove or disprove
that the performance (average throughput with an infinite input data set) of an
asynchronous pipeline implemented with variable stage delay is better than that of its
counterpart implemented using synchronous methodology. Finding the throughput bounds
is more interesting to us and will be addressed in more detail in this thesis. Also,
it is well known that the performance of synchronous pipelines is limited by the
maximum physical (propagation) delay of the combinational logic in each stage. What,
then, determines the performance of asynchronous pipelines? In this thesis, we are
particularly interested in solving some of the problems regarding the performance of
asynchronous systems. Regardless of whether the potential advantages of asynchronous
circuits are real, many research and development projects on asynchronous
microprocessors have been completed or are being undertaken [15,16,17,18]. These works
also familiarize designers with asynchronous design methodologies.
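To make the two metrics concrete, the sketch below computes them from the times at
which a pipeline delivers its outputs for a finite data set. The function name, the
dictionary keys and the sample timestamps are hypothetical; throughput is taken here
as the reciprocal of the separation time between consecutive outputs, which is an
assumption of this sketch.

```python
def throughput_metrics(output_times):
    """Summarize a finite, strictly increasing list of output event times."""
    if len(output_times) < 2:
        raise ValueError("need at least two output events")
    # Separation time between consecutive output data.
    gaps = [later - earlier
            for earlier, later in zip(output_times, output_times[1:])]
    return {
        "lower_bound": 1.0 / max(gaps),   # worst observed throughput
        "upper_bound": 1.0 / min(gaps),   # best observed throughput
        "average": len(gaps) / (output_times[-1] - output_times[0]),
    }


# Hypothetical output times for four data items.
metrics = throughput_metrics([0.0, 4.0, 6.0, 12.0])
```

These values are the extremes observed for one particular stage-delay pattern. The
bounds sought in this thesis, by contrast, must hold for every pattern allowed by the
stage-delay bounds.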
1.3 Thesis Contribution (Goal) and Organization
Better performance has been one of the main motivations behind the recent resurgence
of interest in asynchronous circuits (whether or not it is always achieved).
We are particularly interested in the performance of pipelines since they are used
extensively in digital systems. Why is performance information for asynchronous
circuits so important? It can be described from two different points of view. In the
analytical phase, performance analysis can verify whether a circuit meets specified
timing constraints, such as the time between consecutive output data. This information
plays an important role in determining the size of asynchronous First In First Out
buffers (FIFOs) used as an interface, and the performance (and hence the probability
of synchronization failure) of the communication bandwidth between asynchronous and
synchronous systems, or between synchronous (clocked) systems driven by different
clocks (or the same clock but with clock skew) [19,20]. In the design phase, timing
analysis results allow us to derive or observe design rules (guidelines). These rules
help the design satisfy performance specifications. In place of a trial-and-error
approach, these guidelines may also help designers reach optimized performance in a
shorter design cycle.
An algorithm developed for finding the exact upper and lower bounds on the
separation time of events in a certain class of process graphs [21,22] might be a good
candidate for Computer-Aided Design (CAD) tool development and circuit synthesis.
However, some transformations and complex mathematical analyses, such as graph
decomposition for infinite unfolded process graphs, must be employed to
reach exact bounds. These transformations and mathematical analyses tend to block
designers from visualizing what factors really affect the performance of asynchronous
circuits. This is especially true for most practicing engineers, who are familiar with
synchronous designs and lack experience in asynchronous design. We believe that one
of the reasons why most engineers are reluctant to engage in asynchronous design is
that it is difficult to analyze the performance of asynchronous circuits, let alone
design a high-performance one. By first investigating the performance of a simple
linear micropipeline, we hope to help engineers understand the performance of
asynchronous circuits and acquire insight into what influences it.
In summary, the goals of this thesis are:
(1) To provide a simple method for finding the (approximate) throughput bounds
of a general micropipeline, given stage-delay bounds;
(2) To acquire the average throughput rate, given stage-delay patterns (sequences);
(3) To compare performance-related issues of asynchronous designs with those
of synchronous designs;
(4) To provide a design guideline and several approaches to meet performance
bounds specifications.
The approach to meeting these goals is reflected in the organization of the thesis,
which is as follows. The motivation, problem and goal are discussed
in this introductory chapter. Chapter 2 reviews previous work on the performance
of asynchronous pipelines. Chapter 3 introduces micropipelines and the delay modeling
of C-elements. Based upon this C-element delay modeling, Chapter 4 investigates the
performance of linear micropipelines with fixed stage-delay. Although performance
results for this type of pipeline have been reported elsewhere [23,24,25], they are
still covered in this chapter to demonstrate that our approach leads to the same
results. With slight modification, some theorems developed for fixed stage-delay
micropipelines are applied to micropipelines with variable stage-delays in Chapter 5.
Using the knowledge and results from previous chapters, some interesting phenomena are
depicted and compared with synchronous designs through several examples; Chapter 6
summarizes this comparison. Chapter 7 follows, describing the performance calculation
for two-dimensional micropipelines based on the concepts and results of Chapter 5.
Chapter 8 provides several examples comparing exact throughput bounds with the
approximate bounds obtained using our approach. Finally, we conclude in the last
chapter.
1.4 Summary
Asynchronous circuits are claimed to have some advantages over synchronous
circuits. Better performance is one of these potential advantages. This
thesis explores some performance-related problems for asynchronous circuits.
The focus is on avoiding overly complex mathematics and algorithms that may obscure
the designer's insight into the factors influencing performance, and on providing
a simple method for obtaining approximate performance. This thesis is capable of:
(1) providing a simple method for finding the (approximate) throughput bounds
of a general micropipeline, given stage-delay bounds,
(2) acquiring average throughput rates, given stage-delay patterns (sequences),
(3) comparing the performance-related issues of asynchronous designs with those
of synchronous designs,
(4) providing a design guideline and several approaches to meet performance
bounds specifications.
2. LITERATURE REVIEW
As mentioned in the previous chapter, the focus of this thesis is on issues related
to the performance of asynchronous pipelines. This chapter presents a review of related
literature. The main purpose of this review is to present the current state of research
on this topic by other researchers. Only their models are introduced. Their algorithms
for obtaining performance results will not be discussed. Interested readers may consult
the references cited in this thesis for details. The model and approach presented in
this thesis are believed to be much simpler than those of the other works, at the cost
of some accuracy.
2.1 The Operation Of A C-element
As will be mentioned in the next chapter, Sutherland [9] has proposed many basic
logic modules for the design of event-control mechanisms in micropipelines. The
C-element is one of these building modules. In practice, the C-element [26] is
frequently used in asynchronous designs. Figure 2.1 shows two different symbols for a
two-input C-element. The upper left symbol is often used in lower level
representations, while the upper right symbol is used in higher level representations.
Use of the lower level representation implies an interest in the absolute logic level
at the terminals. Higher level representations, on the other hand, are used when
distinguishing the absolute logic level at the terminals is not required. The
operation of a C-element in terms of absolute logic level at the terminals is as
follows. The output signal of a C-element becomes high only when both input signals
are high; it becomes low only when both input signals are low. Otherwise, the output
signal remains in its previous state. The waveforms in Figure 2.1 summarize this
operation.

Figure 2.1: The operation of a C-element.

Although In1 becomes High at point A, Out will not change until In2 also becomes High
at point B. The time during which In1 waits for In2 (from point A to point B) is
called a waiting delay. In reality, the logic level at the Out terminal does not
respond immediately when both inputs have changed at point B. Instead, Out does not
change until point D, since the signal needs time to propagate through the C-element.
This time is called a propagation delay. Similar arguments apply to the waveform
changes at points E, F and G. In summary, waiting and propagation delays
fully describe the waveform relationship between the input and output terminals of a
C-element. In general, the waiting delay is nonzero; it is zero only when In1 and In2
change at the same time.
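The level-based behavior just described can be captured in a few lines of Python.
This is a minimal behavioral sketch only: the class name and the zero initial output
are assumptions, and both waiting and propagation delays are ignored.

```python
class CElement:
    """Two-input C-element described by absolute logic levels (0 or 1)."""

    def __init__(self, initial_output=0):
        # Assumption: the output starts low unless stated otherwise.
        self.out = initial_output

    def update(self, in1, in2):
        # The output follows the inputs only when they agree;
        # otherwise it keeps its previous state.
        if in1 == in2:
            self.out = in1
        return self.out


c = CElement()
# In1 rises first, then In2; later In1 falls first, then In2.
trace = [c.update(1, 0), c.update(1, 1), c.update(0, 1), c.update(0, 0)]
```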
At the higher level of representation, the C-element uses tokens instead of absolute
logic values to indicate its operation. The solid dots in Figure 2.1 represent the
tokens. That is, a higher level token stands for both low-to-high and high-to-low
signal transitions. With the token representation, the firing rule for a C-element is
as follows. A C-element cannot fire until both input tokens arrive. Once it fires,
both input tokens are consumed and an output token is generated. The waiting delay is
therefore the time that one input token must wait for the other input token to arrive.
Once both input tokens have arrived, the time until the output token is generated is
the propagation delay. A more formal model of C-elements is discussed in the next
chapter.
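The token view lends itself to a simple timing calculation. In the sketch below,
which assumes a single fixed propagation delay and uses a hypothetical function name,
the output token is generated at the later of the two input arrival times plus the
propagation delay.

```python
def c_element_fire(t_in1, t_in2, propagation_delay):
    """Return (waiting_delay, output_time) for one firing of a C-element.

    t_in1 and t_in2 are the arrival times of the two input tokens.
    """
    # The earlier token waits until the later one arrives.
    waiting_delay = abs(t_in1 - t_in2)
    # The element fires once both tokens are present.
    fire_time = max(t_in1, t_in2)
    return waiting_delay, fire_time + propagation_delay


# Hypothetical arrival times 3 and 5 with a propagation delay of 2.
waiting, t_out = c_element_fire(3, 5, 2)
```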
2.2 Timed Petri Net Modeling
Most performance analyses of asynchronous circuits use a timed Petri net [25]
approach as a base to develop into different forms. Figure 2.2 shows circuits and their
corresponding Petri net modeling. A Petri net contains two types of nodes: place and
transition. Places in a Petri net are drawn in circles and represent conditions; transitions
are drawn in bars and represent events. A token, symbolized by a solid dot, at a place
indicates the holding of the condition (state) of the place. The firing rules of Petri nets




Figure 2.2: Circuits and their corresponding timed Petri net model.
(a) Approach by Ramamoorthy and Ho.
(b) Approach by Wuu and Vrudhula.
(1) A transition is enabled if and only if each of its input places has a token;
(2) A transition can fire only if it is enabled;
(3) Once a transition fires, a token is consumed from each of its input places and
a token is generated at each of its output places.
We assume that each place is allowed to hold at most one token. Therefore,
a transition will be enabled but unable to fire if each of its input places has a token
and one of its output places also has a token.
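The firing rules and the one-token-per-place assumption can be illustrated with a minimal Python sketch. The marking is assumed to be a dictionary from place names to token counts, and the function names (`enabled`, `can_fire`, `fire`) are hypothetical, chosen only for this illustration.

```python
def enabled(marking, inputs):
    """Rule (1): every input place holds a token."""
    return all(marking[p] >= 1 for p in inputs)


def can_fire(marking, inputs, outputs):
    """Rule (2) plus the capacity assumption: output places must be empty."""
    # A place that is both input and output is emptied by the firing itself.
    free = all(marking[p] == 0 for p in outputs if p not in inputs)
    return enabled(marking, inputs) and free


def fire(marking, inputs, outputs):
    """Rule (3): consume one token per input place, produce one per output place."""
    if not can_fire(marking, inputs, outputs):
        raise ValueError("transition cannot fire in this marking")
    new_marking = dict(marking)
    for p in inputs:
        new_marking[p] -= 1
    for p in outputs:
        new_marking[p] += 1
    return new_marking
```

A transition whose output place is already occupied is reported as enabled by `enabled` but rejected by `can_fire`, which is the blocked situation described above.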
2.2.1 Approach By Ramamoorthy And Ho
The Petri net model extended to include the notion of time is called the timed
Petri net. Ramamoorthy and Ho [25] attach an execution time to each transition.
Accordingly, if the execution time for a transition is t, then once the transition
initiates its execution, it takes t units of time to complete it. This kind of time modeling
seems ambiguous since, from the circuit point of view, an event (signal) transits
instantaneously and does not take units of time to execute. Only a processing element
needs time for execution (generally speaking, this refers to propagation delay). Most
likely, the execution time of a transition, as defined in this approach, is the
propagation delay of the processing element whose output terminal carries the event
(signal) corresponding to that transition. For example, for the transition Ai in
Figure 2.2(a), its execution time DAi refers to the propagation delay of the C-element.
No waiting delay is included in DAi. The question marks "?" in the same figure mean
an unknown execution time, which depends on the other circuits it connects with. Once the
timed Petri net modeling the corresponding circuit is constructed, theorems developed
in this approach are applied to find the maximum performance (minimum cycle
time). Also note that this approach only allows constant transition execution times.
2.2.2 Approach By Wuu And Vrudhula
Basically, the approach by Wuu and Vrudhula [24] follows the work by Ramamoorthy
and Ho [25]. Their way of attaching time to the Petri net, however,
is different and is much closer to circuit operation. They define
the delay from a place (state) to a transition (event) and label this delay on the arc linking
the corresponding place and transition. This delay represents the minimum time
interval from when the condition is satisfied to when the transition is activated. The
minimum time in their definition refers to the propagation delay. For example, in
Figure 2.2(b), DRA = Dm and equals the propagation delay of the C-element since the
minimum time is the time when the waiting delay is zero.
2.3 Dependency Graph Modeling
The work proposed by Williams [27] uses a directed graph, which is a simplifica-
tion of the more general timed Petri net, to model a circuit and, further, to find its
performance. The nodes of the directed graph correspond to specific rising or falling
transitions of circuit components, and the arcs (edges) depict the dependencies of each
transition on the output of other components. The resulting directed graph is called a
Dependency Graph (DG). The delay of each transition in a DG is represented by a value
attached to the corresponding node. Figure 2.3 is an example of circuit modeling using
DG. Note that a bubble (or circle) attached to the input or output terminal of a C-element
indicates that the logic level of the corresponding terminal is inverted. A Folded
Dependency Graph (FDG) can be used to model a special ring architecture which uses the same
function blocks and circuit configuration for all stages. FDG modeling is possible because
the nodes can represent the same transition in all stages and each arc is annotated
with an integer weight giving the offset in stage indices to which the transition dependency
refers. The work by Williams focuses on the performance of four-phase (since DG/FDG
defines rising or falling transitions on nodes) ring architecture. His contribution
is in obtaining the relationship between the performance and the number of tokens in
a ring. The timing information attached to each node in DG/FDG is constant.
Figure 2.3: A circuit and its corresponding Dependency Graph.
2.4 Event Rule Modeling
Another approach to circuit modeling is the Event-Rule (ER) model (system),
which was introduced by Burns [23]. An ER system is defined as follows.
E: a set of events (the event specified here actually refers to the transition), and
R: a set of rules defining timed constraints between the events defined in E.
Each element $r \in R$ is written $c \xrightarrow{\alpha} d$,
where $c \in E$ is the source of r,
$d \in E$ is the target of r, and
$\alpha \in [0, +\infty)$ is the delay of r.
Figure 2.4 shows a circuit and its corresponding ER system. The delay definition is not
clear. For example, the two delays within the dotted box in Figure 2.4 have the same
amount rzl. In general, this would be true only when the delay defined in an ER system
is the minimum delay that causes a target event to occur once a source event happens.
Otherwise, these two delays would not be the same (e.g., if the waiting delay is taken
into account). Based upon linear programming, the performance can be obtained for the
circuit using ER modeling. Although Burns [23] proposed an approach for variable
stage-delays by unfolding graphs, no explicit algorithm was given for finding
the performance of infinite unfolded graphs. Recent publications by Hulgaard et al.
[21,22] have extended previous results to include the performance of infinite unfolded
graphs. Since the exact upper and lower bounds on the separation time of events in a
certain class of process graphs can be found using their approach [21,22], their algorithm
might be a good candidate for the application of CAD tool development and circuit
synthesis. However, some transformations and complex mathematical analyses, such as
graph decomposition for infinite unfolded process graphs, must be employed to reach
exact bounds. These transformations and mathematical analyses tend to block designers
from visualizing what factors really affect the performance of asynchronous circuits.
Figure 2.4: A circuit and its corresponding Event-Rule system.
2.5 Other Works
One work which explored asynchronous circuits with both fixed and random
delays was proposed by Greenstreet [19]. This work is primarily a discussion of the
throughput and utilization of two-phase, circulating pipelines (ring architecture). Thiele
[28] developed an approach that can be applied to obtain the performance of a more
general self-timed architecture. However, this approach also involves complex
mathematics.
2.6 Summary
Current literature on the performance analysis of asynchronous circuits has been
reviewed in this chapter. Some of the authors discussed performance
for special architectures. For example, two discussed pipelines with ring architecture:
Williams for a four-phase signaling scheme [27] and Greenstreet for a two-phase scheme [19].
Wuu and Vrudhula discussed linear architecture [24]. Others described methods which
are applicable to more general architectures [22,25,28]. Ramamoorthy and Ho only allow
constant delays in a circuit [25], and Hulgaard et al. [22] and Thiele [28] involve
complex mathematics, although these last two may be good candidates for the application
of CAD tool development and circuit synthesis. The goal of this thesis is to propose
an easier model and approach that can handle the performance (throughput bounds) of
the general architecture of asynchronous pipelines (micropipelines) with variable
stage-delays.
3. MICROPIPELINES AND C-ELEMENT MODELING
The micropipeline is one of the many asynchronous design methodologies [4].
In this thesis, we are particularly interested in the micropipeline because its modularity
and composability facilitate hierarchical design, which is preferred in the design
of complex systems by a team. Besides, it is simple and easy to understand. The
C-element is a key module in a micropipeline, coordinating the operation (synchronization)
of adjacent stages. In this chapter a brief review of the design and operation of
a micropipeline is presented. An introduction to C-element modeling is also included,
laying the ground for the performance evaluation of micropipelines discussed
in the following chapters.
3.1 The Two-Phase Bundled Data Convention
Just as in synchronous systems, the signals in micropipelines are of two
types: data signals and control signals. The way of designing a data path, the place
through which data signals flow, is the same for both synchronous systems and
micropipelines. The way of dealing with control signals, however, is totally different between
these two design approaches. Therefore, in the following text we will focus only on
a discussion of control signals. In a conventional synchronous design philosophy, an
absolute logic level (high or low) is assigned to each signal and different logic values have
different meanings for each signal. In micropipelines, the signal transition is of more
interest than the absolute logic level. Moreover, both low-to-high and high-to-low signal
transitions have the same meaning to circuit operation. That is, both rising and falling edges
of a signal may trigger circuit operations. Since the distinction between low-to-high and
high-to-low signal transitions is not necessary, the more general term event is used in
place of signal transition.
All of the module communications within a micropipeline follow the same basic
connection, as demonstrated in Figure 3.1(a). In a sender and receiver environment, a
handshaking protocol consisting of two control wires and many data wires is used. One
of the control wires is called request and the other acknowledge. The signals on
both request and acknowledge wires must be designed to follow the transition-signaling
rule. For the signals on data wires, absolute logic meanings are still preserved such that
conventional combinational logic designs can still be used in micropipelines. To
ensure correct circuit operation in Figure 3.1(a), the two-phase bundled data convention
must be strictly observed. This convention requires that a complete operation (called a
cycle) between adjacent stages have two phases: in the sender phase, a request signal is sent
to start the operation, and in the receiver phase, an acknowledge signal is sent to close the
operation. "Bundled data" indicates that data and request wires must be treated as a bundle:
data must be stable before the request signal arrives at the receiver end. The complete
handshaking between sender and receiver in a cycle is described as follows. The sender
generates valid data on the data wires and then issues a request event to signal the receiver
that the data are available. The receiver takes the data whenever it is ready and then
produces an acknowledge event to tell the sender that the data have been received. At this
point, the sender can put new data on the data wires and the whole procedure repeats. Note
that data cannot be changed before the acknowledge event is issued. The relative signal
timing diagram is illustrated in Figure 3.1(b).
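The handshaking cycle described above can be sketched as a small sequential Python model. It is a simplification for illustration only: the function name `two_phase_transfer` is hypothetical, the request and acknowledge wires are modeled as levels that toggle on every event, and the sender and receiver are interleaved in one loop rather than running concurrently.

```python
def two_phase_transfer(items):
    """Transfer items one by one using transition signaling on req and ack."""
    req = 0            # level of the request wire
    ack = 0            # level of the acknowledge wire
    data_wires = None  # value currently held on the bundled data wires
    received = []
    trace = []         # sequence of (wire, new level) events
    for item in items:
        # The sender may change the data only after the previous acknowledge.
        if req != ack:
            raise RuntimeError("previous transfer not yet acknowledged")
        data_wires = item           # data become valid first (bundling constraint)
        req ^= 1                    # request event: either edge has the same meaning
        trace.append(("req", req))
        received.append(data_wires) # receiver takes the data when ready
        ack ^= 1                    # acknowledge event closes the cycle
        trace.append(("ack", ack))
    return received, trace
```

Transferring three items produces the event sequence req rising, ack rising, req falling, ack falling, req rising, ack rising, which shows that each cycle uses exactly one transition on each control wire.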
3.2 Linear Micropipelines
Due to the modularity and composability of micropipelines, the basic control circuit
can be constructed easily by stringing the stages together
directly. The resulting pipeline is called a linear micropipeline. From the higher-level
sender/receiver point of view, each stage plays both roles (sender and receiver) in micro-












Figure 3.1: Twophase bundled data convention.












































relative to stage one when stage one tries to send data to stage two; after stage two
obtains data from stage one, it will pass data (which may be modified) to stage three
as a sender. All data and control signals transferred between consecutive stages must
follow the two-phase bundled data convention.
The mechanism for fulfilling the two-phase bundled data convention requirement
in a micropipeline is very straightforward. It requires only strings of bubbled C-elements
(a C-element with a bubble on one of its input terminals, indicating that the signal
is inverted) with appropriate interconnection, as illustrated in Figure 3.2(b). Observe that
only an odd number of bubbles (one in the basic circuit) is allowed in every loop around
which events flow. The purpose of the bubble in each loop is to form an oscillation such
that data fed into the input end can be bubbled through to the output end automatically
without other control circuits. Although the absolute state of a control signal does not
matter, its state relative to the other related signals (e.g., the two input terminals of a
bubbled C-element) does matter. Therefore, the state (output) of the current bubbled
C-element can be described in terms of the states of the predecessor and successor bubbled
C-elements:
IF the states of predecessor and successor are different
THEN copy predecessor's state
ELSE remain in the present state
The control circuit is stable only when one of the following three conditions is
satisfied:
(1) All of the bubbled Muller C-elements are in the same state. This corresponds
to an empty pipeline. New data can be fed into this circuit.
(2) Alternate stages are in opposite states. This corresponds to a full pipeline.
No more data are allowed to enter this pipeline until some data have been consumed
(removed).
(3) Stages near the input end have the same state, and stages near the output end
have alternate states. This corresponds to a partly filled pipeline. New data can only
fill empty stages.
The stage states (output states of bubbled C-elements) that are not in one of these
state conditions cause the circuit to be unstable. The circuit will not stop (idle) in an
unstable state; it will continue to propagate through automatically until one of the above
conditions is reached. Once a stable condition is reached, the circuit becomes idle
(theoretically, consuming no power) until a new datum enters or a datum moves out to
destroy the stability.
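The IF/THEN/ELSE rule and the notion of a stable state can be explored with a short Python sketch. The treatment of the two pipeline ends is an assumption made here: `pred_env` stands for the level of the request wire driven by the input environment and `succ_env` for the level of the acknowledge wire driven by the output environment. The names `next_state` and `settle` are hypothetical.

```python
def next_state(pred, cur, succ):
    """Copy the predecessor if predecessor and successor differ, else hold."""
    return pred if pred != succ else cur


def settle(states, pred_env, succ_env, max_sweeps=100):
    """Apply the rule repeatedly, stage by stage, until no stage changes."""
    states = list(states)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(states)):
            pred = states[i - 1] if i > 0 else pred_env
            succ = states[i + 1] if i < len(states) - 1 else succ_env
            new = next_state(pred, states[i], succ)
            if new != states[i]:
                states[i] = new
                changed = True
        if not changed:
            return states  # a stable (idle) state has been reached
    raise RuntimeError("no stable state reached within max_sweeps")
```

Starting from three stages in state 0, a single request event from the input environment (`pred_env = 1`) with no acknowledge from the output environment ripples through all stages before the circuit becomes idle again, while a state with alternating stages is left unchanged under the boundary levels chosen in the test below.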
Each delay element in Figure 3.2(b) ensures that the request signal and data are
bundled in such a way that valid data will always arrive at the next stage prior to the arrival
of the corresponding request signal. Therefore, the value of each delay element must
be equal to or greater than the worst-case delay of the combinational logic circuit within
that stage. The original data storage element used in micropipelines is a two-control-wire
(capture and pass) register. The capture signal latches present data and the pass
signal allows new data to enter the storage element. In Figure 3.2(b), a Double-Edge-Triggered
Flip-Flop (DETFF) [29] is employed instead because it needs a single control wire (easier
to understand than two-wire operation) and works as a regular D flip-flop except that
it is triggered at both rising and falling edges of a control signal. If all internal
processing logic is removed, the whole circuit shown in Figure 3.2(b) works like a FIFO.
3.3 Delay Modeling of C-elements
As stated in the previous section, a C-element coordinates two adjacent stages
in a micropipeline, such that handshaking between these two stages is performed
appropriately. Since a C-element is the key module in a micropipeline, we will restate
its operation and depict it in Figure 3.3 (a slight modification of Figure 2.1). The output
























Figure 3.3: The operation of a C-element (modified).
A higher-level token (solid black circle) on each arc represents
a lower-level transition for each signal.
becomes low only when both input signals become low. Otherwise, the output signal
of a C-element remains in its previous state. Since micropipelines use the two-phase
transition signaling methodology (i.e., low-to-high and high-to-low transitions have the same
meaning to circuits) and one of the C-element inputs used in micropipelines is inverted,
an absolute signal level representation (high or low) is avoided to prevent confusion.
This is especially true when we are only interested in the performance issue. As a result,
a token is introduced to depict a signal transition (both low-to-high and high-to-low).
Both token and event refer to the signal transition, although token is most widely used
when the performance issue is discussed. With the token representation, the C-element
firing rule can be restated as follows. A C-element cannot be enabled (fired) until both
input tokens arrive. Once it is enabled, input tokens are consumed and an output token
is generated.
In order to model the flow of tokens in a micropipeline, the physical (or propagation)
delay and logic (or waiting) delay shown in Figure 3.3 should be reflected in
the C-element model given in Figure 3.4. Physical delay in Figure 3.3 is defined as the
duration from the time input tokens are consumed to the time output tokens are generated.
Physical delay in our C-element model is denoted as d in Figure 3.4(a). This delay
can be further merged with the (combinational logic) stage-delay of the micropipeline
in both directions of the token path; the results are named forward delay (Fi) and
backward delay (Bi), as shown at the right-hand side of Figure 3.4(a). The total
stage-delay Di for stage i in a micropipeline is defined as Di = Fi + Bi.
Logic delay in Figure 3.3 is defined as the time that one input token must
wait for the other input token in order for the C-element to fire. Figure 3.4(b) shows
our C-element logic delay model. The arcs for the C-element in Figure 3.4(b) correspond
to the arcs for the C-element in Figure 3.4(a) with the same arc labeling R°,
i=1;'_and In order to simplify the logic delay representation, we will map these arcs





Figure 3.4: Modeling of physical and logic delays of a C-element.
(a) Physical delay modeling.
(b) Logic delay modeling.
Figure 3.4(b). The token on arc k_.1 is called a data token since it is the token issued
from the previous stage to flag the valid data. On the other hand, the token on arc idt('
is named a space token because it is generated from the next stage to acknowledge space
availability (indicating the previous datum has been received by the next stage). Note
that the C-element in Figure 3.4(b) has four terminals but in Figure 3.3 it has only three
terminals. In terms of functionality, these two C-element representations are the same
because, physically, the terminal R° and '4;11 share the same wire. The C-element
representation depicted in Figure 3.4(b) is adopted for our performance derivation in the
following chapters.
In Figure 3.4(b), logic delay $D_i^{24}$ represents the amount of time that a data token
waits for a space token to appear and logic delay $D_i^{42}$ is the time period that a space token
waits for a data token to arrive. It is important to know that logic delay is a function
of time (tokens) in general. To maintain notation consistency throughout a whole pipeline,
we define that the (j+1)th data token (dark color) is always matched with the jth
space token (dark color). After a C-element fires, the two output tokens (gray color)
are both numbered j+1. The output token produced at the upper right arc is called a data
token; the output token generated at the lower left arc is named a space token. That
is, a data token always travels forward on the upper circuit path to indicate that intermediate
data (result) is moving toward the output end to produce a new output result and a
space token always travels backward on the lower circuit path to indicate that an empty
space (stage) is moving toward the input end to accept new input data. Although different
tokens at different locations (arcs) have different names, there is no difference among
these tokens in terms of circuit operations. In general, logic delay should be represented
as $D_i^{24}(\mathit{index1}, \mathit{index2})$ and $D_i^{42}(\mathit{index1}, \mathit{index2})$, where index1 refers to the
appearing sequence of its input data token and index2 refers to the appearing sequence of its input
space token. According to the logic delay definition and firing rule for the C-element,
$D_i^{24}(j+1, j) \cdot D_i^{42}(j+1, j) = 0$ for all j. For example, let In1 and In2 in
Figure 3.3 represent the data token and space token of Figure 3.4(b), respectively. Since
the first data token (rising edge) arrives earlier than the first space token by X (logic
delay), $D_i^{42}(j+1, j) = 0$ and $D_i^{24}(j+1, j)$ equals X. Since the data token
representing the trailing edge of In1 arrives later than the space token representing the trailing
edge of In2 by Y (logic delay), $D_i^{24}(j+1, j) = 0$ and $D_i^{42}(j+1, j)$ equals Y. If
both tokens arrive simultaneously, then $D_i^{24}(j+1, j) = D_i^{42}(j+1, j) = 0$. For more
compact representation, the second token index of a logic delay representation is omitted
in this thesis. The micropipelines under investigation are assumed initially empty; therefore,
$D_i^{24}(1, 0) = D_i^{24}(1) = 0$ and $D_i^{42}(1, 0) = D_i^{42}(1)$ is undefined for all i. Note
that all the delays, including logic delays, mentioned in this thesis are greater than or
equal to zero.
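The two logic delays of a C-element follow directly from the arrival times of the matched data and space tokens. The sketch below is illustrative only; the function name `logic_delays` and its return names `data_wait` and `space_wait` are hypothetical labels for "time a data token waits for a space token" and "time a space token waits for a data token".

```python
def logic_delays(t_data, t_space):
    """Return (data_wait, space_wait) for one matched pair of tokens."""
    # The data token waits only if it arrives before the space token.
    data_wait = max(0, t_space - t_data)
    # The space token waits only if it arrives before the data token.
    space_wait = max(0, t_data - t_space)
    return data_wait, space_wait
```

Because at most one of the two tokens can be the earlier one, at least one of the returned delays is always zero, so their product is zero, and both are zero when the tokens arrive simultaneously.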
3.4 Event Logic Modules
We have previously introduced the basic control circuit, the linear micropipeline. For
diverse applications of micropipelines, more event logic modules are required as basic
elements from which to construct more complicated designs of event logic circuits.
Event OR, event AND, Toggle, Select, Call and Arbiter, as shown in Figure 3.5, are the
basic modules discussed by Sutherland [9].
(1) Event OR: If any input of this module changes its state, the corresponding
output also changes its state. In terms of events, the arrival of any input event generates
an output event. This works exactly the same as a conventional XOR gate, so the XOR symbol
is adopted for the event OR module.
























Figure 3.5: Depiction of some event logic modules.
(3) Toggle: Toggle generates output events alternately on its two output terminals,
starting with the dotted terminal. It responds to events at its input after the initial
master clear signal, which is not shown in the figure.
(4) Select: Select steers an input event to the corresponding output according to
the Boolean value at the other input terminal. Note that the Boolean value must arrive
before the input event.
(5) Call: This hardware Call module serves the role of memorizing the subroutine
return address. It remembers which input event, R1 or R2, has called the procedure
(the event at R is generated). After the completion of procedure execution (the event
at D is generated), it returns a matching done event on D1 or D2. The Call works
properly only when the current call has been completed before the next call occurs.
(6) Arbiter: An arbiter is used to guarantee mutually exclusive access to shared
or protected resources. It grants requests (token on R1 or R2) by generating a token
on the G1 or G2 terminal. The service is performed using the first-come first-served
rule. When one of the input events is granted the service, the other event (if any) will be
delayed automatically by this module until the current one is done.
We will discuss the performance of micropipelines constructed from these modules
after the performance of linear micropipelines is studied in the following chapters.
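A behavioral sketch of three of these modules (event OR, Toggle and Call) is given below in Python. It models behavior only, not circuit structure, and every name in it is hypothetical: the Toggle reports output 0 for the dotted terminal and 1 for the other terminal, and the Call returns the terminal names as strings.

```python
def event_or(level_a, level_b):
    """Event OR on wire levels: any input transition toggles the output (XOR)."""
    return level_a ^ level_b


class Toggle:
    """Steers successive input events to two outputs alternately."""

    def __init__(self):
        self.next_output = 0  # 0 = dotted terminal, 1 = the other terminal

    def event(self):
        fired = self.next_output
        self.next_output ^= 1
        return fired


class Call:
    """Remembers which requester (1 or 2) started the shared procedure."""

    def __init__(self):
        self.caller = None

    def request(self, caller):
        if self.caller is not None:
            raise RuntimeError("previous call has not completed")
        self.caller = caller
        return "R"  # event forwarded to the shared procedure

    def done(self):
        caller, self.caller = self.caller, None
        return f"D{caller}"  # matching done event for the original requester
```

The `RuntimeError` in `Call.request` mirrors the restriction stated above that a call must complete before the next call occurs.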
3.5 Summary
Micropipelines feature transition signaling schemes. That is, both the low-to-high
and high-to-low control-signal transitions (events) have the same meaning to circuit
operation. To ensure correct functioning of a circuit, micropipelines must follow
the two-phase bundled data convention. Since micropipelines deal with events in designing
the control path, some basic but important event logic modules are provided for
convenience. Once each module is designed appropriately and the bundled data convention
is strictly observed, direct interconnection is allowed between modules to form a
larger and more complex module or system. The delay modeling of a C-element is also
included in this chapter for the performance calculation of linear micropipelines, which
will be discussed in the following chapters. The performance of a more general
micropipeline can be obtained based upon the performance approach for linear micropipelines.
4. THE PERFORMANCE OF A LINEAR MICROPIPELINE
WITH FIXED STAGE-DELAY
In this chapter, we are interested in finding out how a linear micropipeline performs
with fixed stage-delay. Although performance results for this type of pipeline
have been reported by others [23,24,25], the topic is still covered in this thesis to demonstrate
that our approach can lead to the same results. Besides, some properties, definitions
and theorems developed in this chapter are applicable to the performance analysis
of a more general micropipeline discussed in later chapters. The approach adopted is
based mainly on the delay modeling of C-elements, as described in the previous chapter.
4.1 Definitions And Theorems
Since two logic delays are defined for each C-element and each logic delay belongs
to a different "loop," the modeling of linear micropipelines can be derived
straightforwardly. Based upon the C-element delay model, the performance analysis of
micropipelines with fixed stage-delay is presented here. Fixed stage-delay implies that both
the forward delay (Fi) and backward delay (Bi) for a stage i are independent of time
(token), i.e., they are constant for all j, where j is a token sequence index. Hence, the total
stage-delay, Di (= Fi + Bi), for a stage i is also a constant. Forward delay, Fi (excluding
C-element physical delays d in [9]), is constant because the "delay" element of the ith
stage in Figure 3.4(a) is chosen as the maximum combinational logic delay, such that
the bundled data convention is satisfied. The backward delay (Bi) is included in our
discussion for completeness. In our C-element model, Bi also contains physical delay
d. Before we conclude with the results, some definitions are established first. From
these definitions, we conclude with a few results as theorems and corollaries. The
micropipeline we are interested in is an m-stage pipeline, as shown in Figure 4.1.
Figure 4.1: An m-stage linear micropipeline.
Definition 1: A loop i is defined as a circle around which a token will travel within a
stage i. The ith loop delay is the time required for a token to travel around loop i.
For each loop, we can define four loop delays depending on the starting point
(corresponding to the points 1, 2, 3 and 4 in Figure 4.2) of the loop. Among these loop
delays, the two starting with points 1 and 2 are the same for fixed stage-delay micropipelines.
Similarly, loop delays starting with points 3 and 4 have the same value. Since,
as will be seen later, we are interested only in loop delays with loops starting with points
1 and 3, these two loop delays are described as follows for stage i,
$1 \le i \le m$.
$T_i^1(j+1) = F_i + D_{i+1}^{24}(j+1) + B_i + D_i^{42}(j+2)$
$= D_i + D_{i+1}^{24}(j+1) + D_i^{42}(j+2)$
$T_i^3(j+2) = B_i + D_i^{42}(j+2) + F_i + D_{i+1}^{24}(j+2)$
$= D_i + D_i^{42}(j+2) + D_{i+1}^{24}(j+2), \quad j \ge 0$
Definition 2: We define the output loop delay (or cycle time) of an m-stage linear
micropipeline as the time for a token to travel around loop m+1, as shown in Figure 4.1. It
is denoted as $T_{m+1}^1(j+1)$, and $T_{m+1}^1(j+1) = D_{out} + D_{m+1}^{42}(j+2)$.
Definition 3: (Environmental) input delay, $D_{in}$, is defined as the time required for the
environment to prepare and inject new data into a micropipeline. (Environmental) output
delay, $D_{out}$, is defined as the time required for the environment to collect and process
the resulting data from a micropipeline. These delays are indicated pictorially in
Figure 4.1. In this chapter, $D_{in}$ and $D_{out}$ are assumed to be constants.
Definition 4: The m-stage pipeline performance, throughput P(j+1), is defined as the
reciprocal of the output loop delay $T_{m+1}^1(j+1)$. That is,
$P(j+1) = 1/T_{m+1}^1(j+1) = 1/(D_{out} + D_{m+1}^{42}(j+2))$.
Figure 4.2: Definitions of equivalent input/output delays and loop delays.
Theorem 1: (Equal loop-delay theorem.) Looking at C-element $C_{i+1}$ in Figure 4.2, we
find that $T_i^3(j+2) = T_{i+1}^1(j+1)$ for all j. This theorem can be applied to all
C-elements in a micropipeline.
Let us assume the C-element $C_{i+1}$ has just fired. According to the firing rule
of a C-element, the time until it fires again equals the time that each token takes to travel
around its own loop. The equation modeling the firing rule for $C_{i+1}$ can be expressed
as follows.
$B_i + D_i^{42}(j+2) + F_i + D_{i+1}^{24}(j+2)$
$= F_{i+1} + D_{i+2}^{24}(j+1) + B_{i+1} + D_{i+1}^{42}(j+2)$, or equivalently,
$D_i + D_i^{42}(j+2) + D_{i+1}^{24}(j+2) = D_{i+1} + D_{i+2}^{24}(j+1) + D_{i+1}^{42}(j+2)$
Several examples for different j are shown in Figure 4.3. The highlighting (bold)
in Figure 4.3 indicates the previous equation in pictorial form when j = k. Since
$D_{i+1}^{24}(j+2)$ and $D_{i+1}^{42}(j+2)$ are "mutually exclusive" (if one is nonzero, the other
must be zero), in general, we may treat these two delays as one unknown
variable. Unfortunately, we still cannot solve the above equation unless both
$D_{i+2}^{24}(j+1)$ and $D_i^{42}(j+2)$ are known. This leads to the data dependency graph for the
entire micropipeline, as shown in Figure 4.4. With this data dependency graph, we can
easily trace and calculate each logic delay in a linear micropipeline. The following
analytical approach gives us more insight into linear micropipelines.
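The dependency just described can also be traced numerically. The following Python sketch is not the analytical method of this thesis; it is a simple simulation under stated assumptions: the pipeline is initially empty, the first datum is ready after one input delay, each C-element fires as soon as both its data token and its space token have arrived, and all delays are fixed. The names `firing_times` and `cycle_time` and the parameter layout are hypothetical.

```python
def firing_times(d_in, forward, backward, d_out, tokens):
    """Return times[j][k]: the time of the (j+1)th firing of C-element k+1.

    forward[i] and backward[i] are the forward and backward delays of the stage
    between C-element i+1 and C-element i+2; d_in and d_out are the
    environmental input and output delays.
    """
    n = len(forward) + 1  # number of C-elements
    times = [[0] * n for _ in range(tokens)]
    for j in range(tokens):
        for k in range(n):
            # Arrival of the data token (from the environment or the previous element).
            if k == 0:
                data = (times[j - 1][0] if j > 0 else 0) + d_in
            else:
                data = times[j][k - 1] + forward[k - 1]
            # Arrival of the space token (from the environment or the next element).
            if j == 0:
                space = 0  # initially empty: every space token is already present
            elif k == n - 1:
                space = times[j - 1][k] + d_out
            else:
                space = times[j - 1][k + 1] + backward[k]
            times[j][k] = max(data, space)  # firing rule: wait for both tokens
    return times


def cycle_time(times):
    """Separation between the last two firings of the output-side C-element."""
    return times[-1][-1] - times[-2][-1]
```

With, for example, an input delay of 1, forward delays [2, 1], backward delays [1, 1] and an output delay of 1, the stage-delays are 3 and 2, and the simulated output cycle time settles at 3, the largest loop delay in this particular example.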
Definition 5: The data token stagetravelling time L2 2(j + 1,j + 1) is defined as the
data token travelling from stageto stage i, as shown in Figure 4.3, and is equal to
D24(j + 1)+ F,, j0. Again for the sake of simplicity, LP(j + 1, j + 1) is short
handed to LF(j + /).
The time for a micropipeline to complete




its first result is called latency,
has a latency of L = Lf2(1)
i=
Definition 7: (Refer to Figure 4.2) Equivalent input delay, $D_i^{in}(j+2)$, is defined as the
delay which is seen by looking into the right cross section of stage i (input delay cross
section i). Equivalent output delay, $D_i^{out}(j+1)$, is defined as the delay that is detected
by looking into the left cross section of stage i (output delay cross section i). Their
values are
$D_i^{in}(j+2) = D_i^{42}(j+2) + D_i$, and
$D_i^{out}(j+1) = D_{i+1}^{24}(j+1) + D_i$.
Once the equivalent input delay is defined, all stages to the left of an input delay
cross section can be replaced by a stage with its delay value equal to an equivalent input
delay. For the same reason, all the stages to the right of an output delay cross section
can be replaced by a stage with its delay value equal to an equivalent output delay. As
a result, Figure 4.5(a), (b) and (c) are equivalent.
Theorem 2: Given a mstage linear micropipeline, an equivalent input delay for each
stage i is bounded by Di < Di" (j + 2)Max(D in, D1.., Di) and a logic delay Dr(j + 2)
isbounded by 0D4k2(j + 2) s Max(0,Di:+ 2)Dd, where1 < i < ln,
1 m + 1, + 2) =Din and Dm+1=Dout. This is true for all j0.
<Proof> Consider C1 in Figure 4.5(a). According to Theorem 1,
$D_{in} + D^{24}_1(j+2) = D_1 + D^{24}_2(j+1) + D^{42}_1(j+2)$
If $D_{in} \ge D_1 + D^{24}_2(j+1)$, then
$D^{24}_1(j+2) = 0$ and
$D^{42}_1(j+2) = D_{in} - D_1 - D^{24}_2(j+1)$ (4.1)

Figure 4.5: An m-stage linear micropipeline and its equivalents.
(a) A linear micropipeline.
(b) First equivalent form.
(c) Second equivalent form.
$D^{in}_1(j+2) = D_1 + D^{42}_1(j+2) = D_{in} - D^{24}_2(j+1)$ (4.2)
If $D_{in} \le D_1 + D^{24}_2(j+1)$, then
$D^{42}_1(j+2) = 0$ (4.3)
$D^{in}_1(j+2) = D_1 + D^{42}_1(j+2) = D_1$ (4.4)
From Eq. (4.1) and (4.3),
$D^{42}_1(j+2) = 0$, or (4.5)
$D^{42}_1(j+2) = D_{in} - D_1 - D^{24}_2(j+1) \le D_{in} - D_1$ (4.6)
From Eq. (4.2) and (4.4),
$D^{in}_1(j+2) = D_{in} - D^{24}_2(j+1) \le D_{in}$, or (4.7)
$D^{in}_1(j+2) = D_1$ (4.8)
Consider C2 in Figure 4.5(a). According to Theorem 1,
$D^{in}_1(j+2) + D^{24}_2(j+2) = D_2 + D^{24}_3(j+1) + D^{42}_2(j+2)$
If $D^{in}_1(j+2) \ge D_2 + D^{24}_3(j+1)$, then
$D^{24}_2(j+2) = 0$, and
$D^{42}_2(j+2) = D^{in}_1(j+2) - D_2 - D^{24}_3(j+1)$ (4.9)
$D^{in}_2(j+2) = D_2 + D^{42}_2(j+2) = D^{in}_1(j+2) - D^{24}_3(j+1)$ (4.10)
If $D^{in}_1(j+2) \le D_2 + D^{24}_3(j+1)$, then
$D^{42}_2(j+2) = 0$ (4.11)
$D^{in}_2(j+2) = D_2 + D^{42}_2(j+2) = D_2$ (4.12)
From Eq. (4.9) and (4.11),
$D^{42}_2(j+2) = 0$, or (4.13)
$D^{42}_2(j+2) = D^{in}_1(j+2) - D_2 - D^{24}_3(j+1) \le D^{in}_1(j+2) - D_2$ (4.14)
From Eq. (4.7), (4.8), (4.10) and (4.12),
$D^{in}_2(j+2) = D_{in} - D^{24}_2(j+1) - D^{24}_3(j+1) \le D_{in}$, or (4.15)
$D^{in}_2(j+2) = D_1 - D^{24}_3(j+1) \le D_1$, or (4.16)
$D^{in}_2(j+2) = D_2$ (4.17)
The conclusion can be reached by applying an approach similar to that shown above to all C-elements in a micropipeline.
Corollary 2.1: If $\mathrm{Max}(D_{in}, D_1, \ldots, D_i) = D_n$, where $1 \le i \le m$ and $n \le i$, then $D^{42}_n(j+2) = 0$ for all j.
<Proof> From Theorem 2, we know
$D^{in}_n(j+2) \le \mathrm{Max}(D_{in}, D_1, \ldots, D_n) = D_n$ (4.18)
Also, from definition
$D^{in}_n(j+2) = D_n + D^{42}_n(j+2)$ (4.19)
From (4.18) and (4.19), we conclude $D^{42}_n(j+2) = 0$ for all j.
Theorem 3: Given an m-stage linear micropipeline, the equivalent output delay for each stage is bounded by $D_i \le D^{out}_i(j+1) \le \mathrm{Max}(D_{out}, D_m, \ldots, D_i)$ and the logic delay $D^{24}_k(j+2)$ is bounded by $0 \le D^{24}_k(j+2) \le \mathrm{Max}(0, D^{out}_k(j+1) - D_{k-1})$, where $1 \le i \le m$, $1 \le k \le m+1$, $D^{out}_{m+1}(j+1) = D_{out}$ and $D_0 = D_{in}$. This is true for all $j \ge 0$.
Corollary 3.1: If $\mathrm{Max}(D_i, \ldots, D_m, D_{out}) = D_n$, where $1 \le i \le m$ and $n \ge i$, then $D^{24}_{n+1}(j+1) = 0$ for all j.
Theorem 3 and Corollary 3.1 can be reached using approaches similar to those
used in Theorem 2 and Corollary 2.1.
Theorem 4: If $D^{in}_{i-1}(j+2) \ge D_i$ and $D^{in}_{i-1}(j+2) \ge D^{out}_{i+1}(j+1)$ for all j, then $D^{24}_i(j+1) = 0$, $D^{24}_{i+1}(j+1) = 0$ and $D^{42}_i(j+2) = D^{in}_{i-1}(j+2) - D_i$.
<Proof> In Figure 4.6(a), consider C-element $C_i$,
$D^{in}_{i-1}(j+2) + D^{24}_i(j+2) = D_i + D^{24}_{i+1}(j+1) + D^{42}_i(j+2)$ (4.20)
Also, consider C-element $C_{i+1}$,
$D_i + D^{42}_i(j+2) + D^{24}_{i+1}(j+2) = D^{out}_{i+1}(j+1) + D^{42}_{i+1}(j+2)$ (4.21)
Figure 4.6: Logic and equivalent input/output delay values for the stages following max delay stage n when they become stable.
(a) A general three-stage configuration.
(b) Logic and equivalent input/output delay values for the stages following stage n.
If we add Eq.(4.20) and Eq.(4.21), a new equation is formed as follows.
$D^{in}_{i-1}(j+2) + D^{24}_i(j+2) + D^{24}_{i+1}(j+2) = D^{24}_{i+1}(j+1) + D^{out}_{i+1}(j+1) + D^{42}_{i+1}(j+2)$ (4.22)
Because $D^{in}_{i-1}(j+2) \ge D^{out}_{i+1}(j+1)$ for all j, from Eq.(4.22),
j = 0: $D^{24}_{i+1}(1) = 0 \Rightarrow D^{24}_{i+1}(2) = 0$ to balance Eq.(4.22) on both sides.
j = 1: $D^{24}_{i+1}(2) = 0 \Rightarrow D^{24}_{i+1}(3) = 0$ to balance Eq.(4.22) on both sides.
If we assume j = k holds, that is
j = k: $D^{24}_{i+1}(k+1) = 0 \Rightarrow D^{24}_{i+1}(k+2) = 0$ to balance Eq.(4.22) on both sides.
When j = k+1, because $D^{24}_{i+1}(k+2) = 0$,
$D^{24}_{i+1}(k+3) = 0$ to balance Eq.(4.22) on both sides.
By induction, $D^{24}_{i+1}(j+1) = 0$ for all j.
From Eq.(4.20), because $D^{24}_{i+1}(j+1) = 0$ and $D^{in}_{i-1}(j+2) \ge D_i$ for all j, $D^{24}_i(j+1) = 0$ and $D^{42}_i(j+2) = D^{in}_{i-1}(j+2) - D_i$.
Corollary 4.1: If $\mathrm{Max}(D_{in}, D_1, \ldots, D_m, D_{out}) = D_n$, then $D^{42}_k(j+2) = D_n - D_k$ and $D^{24}_k(j+1) = 0$, where $n+1 \le k \le m+1$ and $j \ge 0$. We define $D_{m+1} = D_{out}$.
<Proof> From Corollary 2.1, $D^{in}_n(j+2) = D_n$ and from Theorem 3, $D^{out}_{n+2}(j+1)$ is bounded by $\mathrm{Max}(D_{out}, D_m, \ldots, D_{n+2})$, which is less than $D_n$; therefore, by applying Theorem 4 to all the stages, the above claim can be justified.
Corollary 4.2: If $\mathrm{Max}(D_{in}, D_1, \ldots, D_m, D_{out}) = D_n$, then $D^{in}_n(j+2) = D^{in}_{n+1}(j+2) = \ldots = D^{in}_m(j+2) = D_n$ for all $j \ge 0$.
Corollary 4.3: An m-stage linear micropipeline has throughput P(j+1) equal to the reciprocal of the maximum total stage-delay. That is, $P(j+1) = 1/\mathrm{Max}(D_{in}, D_1, \ldots, D_m, D_{out})$.
<Proof> If $\mathrm{Max}(D_{in}, D_1, \ldots, D_m, D_{out}) = D_n$, by definition, $P(j+1) = 1/T^{1}_{m+1}(j+1) = 1/(D_{out} + D^{42}_{m+1}(j+2))$ and from Corollary 4.1 $D^{42}_{m+1}(j+2) = D_n - D_{m+1} = D_n - D_{out}$; therefore, $P(j+1) = 1/(D_{out} + D^{42}_{m+1}(j+2)) = 1/D_n$.
Definition 8: A stage k in a linear micropipeline becomes stable when both logic delays $D^{42}_k(j+2)$ and $D^{24}_{k+1}(j+1)$ become stable. A logic delay becomes stable, or reaches steady state, when its value becomes constant after a certain time (token) $j \ge q$.
Note that, according to Corollary 4.3, throughput is not a function of token index j. It is a constant and equal to the reciprocal of the maximum total stage-delay within a whole linear micropipeline. This is not unexpected since from Corollary 4.1, we know that all logic delays following the stage with maximum stage-delay are always constant. The above discussion is summarized and shown in Figure 4.6(b) when all stages in a micropipeline have reached steady state. In Figure 4.6(b), logic delays $D^{24}_i(j+1) = 0$ and $D^{42}_i(j+2) = D_n - D_i$, where i refers to all stages following the maximum delay stage n.
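The constant throughput stated in Corollary 4.3 can be checked numerically. The sketch below is a minimal illustration, not the thesis's own tool: it iterates the C-element balance equations from the proof of Theorem 2, with the two cases of each proof step folded into a `max(0, ...)`. The function name and the sample delays (including `d_in = d_out = 1`) are assumptions made for illustration.

```python
def output_loop_delays(d_in, stage_delays, d_out, tokens=20):
    """Iterate the fixed stage-delay balance equations.

    Returns D_out + D42_{m+1}(j+2), the output loop delay, for j = 0..tokens-1.
    """
    m = len(stage_delays)
    db = [d_in] + list(stage_delays)          # DB_0 .. DB_m, with DB_0 = D_in
    df = [0] + list(stage_delays) + [d_out]   # DF_1 .. DF_{m+1}; df[0] is unused
    d24 = [0] * (m + 3)                       # D24_k(1) = 0 (initially reset); D24_{m+2} stays 0
    loop_delays = []
    for _ in range(tokens):
        d42 = [0] * (m + 2)                   # D42_0 = 0 by definition
        new_d24 = [0] * (m + 3)
        for k in range(1, m + 2):             # C-elements C_1 .. C_{m+1}
            eq_in = db[k - 1] + d42[k - 1]    # equivalent input delay seen by C_k
            eq_out = df[k] + d24[k + 1]       # equivalent output delay seen by C_k
            d42[k] = max(0, eq_in - eq_out)
            new_d24[k] = max(0, eq_out - eq_in)
        d24 = new_d24
        loop_delays.append(d_out + d42[m + 1])
    return loop_delays


# Hypothetical four-stage pipeline whose slowest stage has delay 17.
print(output_loop_delays(d_in=1, stage_delays=[15, 10, 12, 17], d_out=1)[:5])
```

Under these assumptions every entry equals the maximum total stage-delay, so the throughput is its reciprocal for every token index.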
Up to this point, all logic delays following the stage with maximum total stage-delay have been solved as shown previously. If we can also solve the logic delays prior to the stage with maximum total stage-delay, the whole micropipeline is then fully determined. That is, the traversal of token flow for each j will become solvable. The following definitions and theorems will help us lay the ground for the transient and steady state analysis of logic delays discussed in the next section.
Theorem 5: If $D^{in}_{i-1}(j+2) \le D_i$ for all $j \ge 0$, then $D^{42}_i(j+2) = 0$, where $1 \le i \le m+1$. We define $D^{in}_0(j+2) = D_{in}$ and $D_{m+1} = D_{out}$.
This theorem is obvious as long as the governing equation based upon the Equal loop-delay theorem is written out.
Definition 9: Given an m-stage linear micropipeline (refer to Figure 4.7),
if $\mathrm{Max}(D_{in}, D_1, \ldots, D_m, D_{out}) = D_n$, then $D_n$ is called the 1st local maximum delay;
if $\mathrm{Max}(D_{in}, D_1, \ldots, D_{n-1}) = D_q$, then $D_q$ is called the 2nd local maximum delay;
if $\mathrm{Max}(D_{in}, D_1, \ldots, D_{q-1}) = D_i$, then $D_i$ is called the 3rd local maximum delay.
The naming process continues until $D_{in}$ becomes the pth local maximum delay.

Figure 4.7: Definitions for local maximum delays and maximum regions.
It should be noted that, in general, the rth local maximum delay may not be equal to the rth global maximum delay for $2 \le r \le p$. The rth local maximum delay is defined as the maximum delay among the ones corresponding to the left-hand-side stages of the (r-1)th local maximum delay. However, the rth global maximum delay is always defined over the whole micropipeline. We define the 1st local maximum delay as equal to the 1st global maximum delay. For convenience, they are called maximum delay or maximum stage-delay.
Definition 10: The sth maximum region, as shown in Figure 4.7, is the region containing the stages between the sth local maximum delay stage (inclusive) and the (s-1)th local maximum delay stage (exclusive), where $1 \le s \le p$. The 0th local maximum delay stage is defined as the maximum-delay stage.
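The naming process of Definition 9 and the regions of Definition 10 can be sketched as a short search from right to left. This is only an illustration: the function names are hypothetical, the delays are passed as one list `[D_in, D_1, ..., D_m, D_out]`, and ties are broken by taking the leftmost occurrence, which is an assumption the definitions do not settle.

```python
def local_maximum_stages(delays):
    """Indices of the 1st, 2nd, ... local maximum delays in [D_in, D_1, ..., D_m, D_out]."""
    indices = []
    end = len(delays)                    # current search window is delays[0:end]
    while end > 0:
        window = delays[:end]
        n = window.index(max(window))    # leftmost maximum (tie-breaking is an assumption)
        indices.append(n)
        end = n                          # next search covers only the stages to the left
    return indices


def maximum_regions(indices):
    """Map s >= 2 to the stage indices of the sth maximum region (Definition 10)."""
    return {s: list(range(indices[s - 1], indices[s - 2]))
            for s in range(2, len(indices) + 1)}


# Hypothetical example: D_in = 1, stage delays 15, 10, 12, 17, D_out = 1.
stages = local_maximum_stages([1, 15, 10, 12, 17, 1])
print(stages)                    # [4, 1, 0]
print(maximum_regions(stages))   # {2: [1, 2, 3], 3: [0]}
```

The search always ends at index 0, which matches the statement that naming continues until $D_{in}$ becomes the pth local maximum delay.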
Corollary 5.1: If r represents the stages having local maximum delays, then $D^{42}_r(j+2) = 0$ for all $j \ge 0$.
This corollary can be derived from Theorem 2 and Theorem 5. The result is shown in Figure 4.7.
Corollary 5.2: If stage s corresponds to the bth local maximum delay stage, then $D^{42}_i(2) = D_s - D_i$, where i represents the stages within the bth maximum region.
Corollary 5.3: If $D^{in}_i(j+2) \ge D^{out}_{i+1}(j+1)$ for j = 0, then $D^{24}_{i+1}(j+2)$ will remain zero until $D^{in}_i(j+2) \le D^{out}_{i+1}(j+1)$ for $j \ge k$.
According to the firing rule and Corollary 5.2 and 5.3, $D^{in}_i(j+2)$ will not change until $D^{in}_i(j+2) \le D^{out}_{i+1}(j+1)$ for $j \ge k$.
4.2 Transient Delay And Steady State Analysis
With the definitions and theorems provided in the previous section, we are ready to further explore the transient and steady state analysis of logic delays for the whole micropipeline. Given an m-stage linear micropipeline, assume that the nth stage has the 1st global maximum delay, $D_n$, and the ith stage has the 2nd local maximum delay, $D_i$. Figure 4.8(a) shows the circuit for the transient delay and steady state analysis of the (n-1)th stage. According to the firing rule and Corollary 5.2 and 5.3, $D^{in}_{n-2}(j+2)$ will remain equal to $D_i$ until $D^{in}_{n-2}(j+2) \le D^{out}_{n-1}(j+1)$ for $j \ge k$. This is the situation when stage n-1 becomes stable. Looking at Figure 4.8(b), because we assume stage n-1 turns out to be stable at j = k + 1, it implies that
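$k(D_n - D_i) + D_{n-1} \ge D_i$, i.e. $k \ge \dfrac{D_i - D_{n-1}}{D_n - D_i}$ (4.23)
where k is the minimum integer satisfying this equation.
$D^{24}_n(j+1) = \begin{cases} 0, & \text{for } j = 0 \\ j(D_n - D_i), & \text{for } 1 \le j \le k \\ D_n - D_{n-1}, & \text{for } k+1 \le j \end{cases}$ (4.24)
$D^{42}_{n-1}(j+1) = \begin{cases} D_i - D_{n-1}, & \text{for } j = 1 \\ jD_i - (j-1)D_n - D_{n-1}, & \text{for } 2 \le j \le k \\ 0, & \text{for } k+1 \le j \end{cases}$ (4.25)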
Delay $D^{24}_n(j+1)$ in Eq.(4.24) is considered first. For $1 \le j \le k$, $D^{24}_n(j+1)$ increases as j increases because $(D_n - D_i) > 0$. Is $D_n - D_{n-1}$ for $j \ge k+1$ greater than $j(D_n - D_i)$ at j = k? From Eq.(4.23), assume $D_i - D_{n-1} = Q(D_n - D_i) + R$, i.e. $Q = (D_i - D_{n-1} - R)/(D_n - D_i)$, where Q is an integer and $0 \le R < D_n - D_i$. Since k is the minimum integer satisfying Eq.(4.23), let j = k = Q + 1 for the worst case (the situation when R > 0). Therefore, $D^{24}_n(j+1) = (Q+1)(D_n - D_i) = D_n - D_{n-1} - R$ at j = k, which is less than $D_n - D_{n-1} = D^{24}_n(j+1)$ at j = k + 1. From the above discussion, it is summarized that $D^{24}_n(j+1)$ increases with j and when it reaches $D_n - D_{n-1}$, it would no longer change (becoming stable).
We then consider delay $D^{42}_{n-1}(j+1)$ in Eq.(4.25). For j = 1, $D^{42}_{n-1}(j+1) = D_i - D_{n-1}$. For the second case $2 \le j \le k$, $D^{42}_{n-1}(j+1) = jD_i - (j-1)D_n - D_{n-1} = (D_i - D_{n-1}) + (j-1)(D_i - D_n)$. Because $(D_i - D_n) < 0$, $D^{42}_{n-1}(j+1)$ becomes smaller when j becomes larger. In addition, $D^{42}_{n-1}(j+1)$ could not be negative; therefore, we conclude that $D^{42}_{n-1}(j+1)$ decreases when j increases. It eventually reaches zero and will no longer change (reaching steady state).
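To make the two piecewise expressions concrete, the following sketch evaluates Eq.(4.23), (4.24) and (4.25) for a given token index. The function name and argument names are hypothetical, integer delays are assumed, and the values 17, 15 and 12 are chosen only as an illustration with $D_n > D_i \ge D_{n-1}$.

```python
def stage_n_minus_1_transient(d_n, d_i, d_n1, j):
    """Return (k, D24_n(j+1), D42_{n-1}(j+1)) following Eq.(4.23)-(4.25).

    d_n  : maximum stage-delay D_n
    d_i  : 2nd local maximum delay D_i (assumed d_n > d_i >= d_n1)
    d_n1 : stage-delay D_{n-1}
    Eq.(4.25) starts at j = 1, so the third entry is None for j = 0.
    """
    # Smallest integer k with k*(D_n - D_i) + D_{n-1} >= D_i (integer ceiling).
    k = -(-(d_i - d_n1) // (d_n - d_i))
    if j == 0:
        d24 = 0
    elif j <= k:
        d24 = j * (d_n - d_i)
    else:
        d24 = d_n - d_n1
    if j == 0:
        d42 = None
    elif j <= k:
        d42 = j * d_i - (j - 1) * d_n - d_n1   # equals D_i - D_{n-1} when j = 1
    else:
        d42 = 0
    return k, d24, d42


for j in range(5):
    print(j, stage_n_minus_1_transient(17, 15, 12, j))
```

With these values k = 2, $D^{24}_n(j+1)$ climbs 0, 2, 4 and then stays at 5, while $D^{42}_{n-1}(j+1)$ falls 3, 1 and then stays at 0.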
The above equations (4.23), (4.24) and (4.25) are valid only for the stage n-1 provided that stage n has the maximum stage-delay. However, using a similar approach, they can be generalized to apply to all stages within the 2nd maximum region. As a result, a more general formula can be obtained for these stages. Again, if the maximum delay is $D_n$ and the 2nd local maximum delay is $D_i$, then the equations for transient delays and steady state analysis of the (n-q)th stage, where $i \le n-q \le n-1$ or $1 \le q \le (n-i)$, are
Di







,for 0 < j < kg_
Dn.-, for kg_ +5 jkg (4.27)






,for kg+ 2 kq
,for kq + 1
(4.28)
where $k_0 = 0$ and $k_{n-i} = k_{n-(i+1)} + 1$.
Figure 4.9, which has gone through several time steps, illustrates part of a linear micropipeline with a maximum delay of 17 and a 2nd local maximum delay of 15. The values of transient delays for each logic delay in Figure 4.9 can be easily verified with Eq.(4.26), (4.27) and (4.28) given $D_n = 17$, $D_{n-1} = 12$, $D_{n-2} = 10$, $D_{n-3} = D_i = 15$ and $1 \le q \le 3$. Substituting these values into Eq.(4.26), we get $k_1 = 2$, $k_2 = 5$ and $k_3 = 6$. In other words, stage n-1 reaches steady state at j=3, stage n-2 at j=6 and stage n-3 at j=7. The values corresponding to steady state are enclosed in a box in Figure 4.9.
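The settling tokens quoted for this example can be cross-checked by iterating the C-element balance equations directly. The sketch below is an illustration under stated assumptions: the four stages are treated as a complete pipeline, the surrounding delays `d_in = d_out = 1` are hypothetical (the text does not give them), and a stage is taken as settled from the first token after which both of its logic delays keep their final values.

```python
def simulate(d_in, stage_delays, d_out, tokens):
    """Return one (D24, D42) snapshot per token; snapshot t holds time index t + 2."""
    m = len(stage_delays)
    db = [d_in] + list(stage_delays)          # DB_0 .. DB_m
    df = [0] + list(stage_delays) + [d_out]   # DF_1 .. DF_{m+1}; df[0] unused
    d24 = [0] * (m + 3)                       # initially reset: D24_k(1) = 0
    history = []
    for _ in range(tokens):
        d42 = [0] * (m + 2)
        new_d24 = [0] * (m + 3)
        for k in range(1, m + 2):
            eq_in = db[k - 1] + d42[k - 1]
            eq_out = df[k] + d24[k + 1]
            d42[k] = max(0, eq_in - eq_out)
            new_d24[k] = max(0, eq_out - eq_in)
        d24 = new_d24
        history.append((d24, d42))
    return history


def settle_token(history, stage):
    """First j from which (D24_{stage+1}(j+1), D42_stage(j+1)) stays at its final value."""
    pairs = [(d24[stage + 1], d42[stage]) for d24, d42 in history]
    t = len(pairs) - 1
    while t > 0 and pairs[t - 1] == pairs[-1]:
        t -= 1
    return t + 1                              # snapshot t has time index t + 2 = j + 1


# Stages 1..4 play the roles of n-3, n-2, n-1 and n.
history = simulate(d_in=1, stage_delays=[15, 10, 12, 17], d_out=1, tokens=20)
settle = {stage: settle_token(history, stage) for stage in (3, 2, 1)}
print(settle)   # {3: 3, 2: 6, 1: 7}
```

Stage 3 (n-1) settles at j=3, stage 2 (n-2) at j=6 and stage 1 (n-3) at j=7, in line with the values read from Eq.(4.26).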
Note that $D^{24}_{n-q+1}(j+1)$ increasing with j and $D^{42}_{n-q}(j+1)$ decreasing with j hold for all stages within the 2nd maximum region (actually, this statement is valid for all stages, given a linear micropipeline). Before stage n-1 reaches steady state, none of the logic delays for stages prior to stage n-1, but within the 2nd maximum region, change. At the moment stage n-1 reaches steady state, stage n-2 starts to change by increasing $D^{24}_{n-1}(j+1)$ and decreasing $D^{42}_{n-2}(j+1)$. This domino process continues until stage i, the 2nd local maximum delay stage, reaches its steady state.
In reality, Eq.(4.26), (4.27) and (4.28) hold for all the maximum regions within a linear micropipeline before $D^{24}_{t+1}(j+1)$ starts to change, where t corresponds to all the local maximum delay stages. For example, in the 3rd maximum region, $D_i$ in Eq.(4.26), (4.27) and (4.28) refers to the 3rd local maximum delay, $D_n$ represents the 2nd local maximum delay and $D_{n-r}$ are the delays corresponding to the stages within this region. All stages within maximum regions attempt to reach steady state simultaneously. However, except for stages within the 2nd maximum region, all other stages reach





































Figure 4.9: Numerical example for a four-stage linear micropipeline.
this is that when Eq.(4.23), (4.24) and (4.25) were derived, $D^{24}_{n+1}(j+1) = 0$ for all $j \ge 0$; therefore, $D^{out}_n(j+1) = D_n + D^{24}_{n+1}(j+1) = D_n$. However, for example, in the 3rd maximum region, assuming that $D_i$ is equal to the 2nd local maximum delay, $D^{24}_{i+1}(j+1)$ is not always zero; hence, $D^{out}_i(j+1) = D_i + D^{24}_{i+1}(j+1)$ is not always equal to $D_i$. This is why all stages prior to the 2nd maximum region reach temporary steady state only when $D^{24}_{i+1}(j+1) = 0$. When $D^{24}_{i+1}(j+1)$ starts to change, as it does when stage i+1 becomes stable, $D^{out}_i(j+1)$ changes also. This change moves the stages away from temporary steady state and toward the final steady state. In terms of timing, stages in the 3rd maximum region may reach temporary steady state either earlier or later than stages in the 2nd maximum region. This makes the analytic approach more difficult, if not impossible. However, no matter how complex the situation is, all stages prior to the 1st local maximum delay stage must eventually reach the final steady state with $D^{42}_r(j+1) = 0$ and $D^{24}_{r+1}(j+1) = D_n - D_r$ for $1 \le r \le n-1$ once j is large enough, as shown in Figure 4.10.
4.3 Summary
This chapter defines several terms to be used in the following chapters for conveniently describing the operation and performance of micropipelines. Some theorems and corollaries were also developed to target the transient and steady states of logic delays for each C-element (stage) and the performance of micropipelines. For the stages prior to the maximum delay stage (assuming stage n), logic delay $D^{24}_{i+1}(j+1)$ increases and $D^{42}_i(j+1)$ decreases when j increases, where $1 \le i < n$. Each logic delay stops changing when the steady state of the corresponding stage is reached. However, the logic delays $D^{24}_{i+1}(j+1)$ and $D^{42}_i(j+1)$ for stages following the maximum delay stage are constant, where $n+1 \le i \le m+1$. As a result, the throughput of a linear micropipeline with fixed stage-delay is a constant and equals the inverse of the maximum total stage-delay. Given the difficulty in obtaining analytical solutions to logic delays in closed form for fixed stage-delay micropipelines, doing so for variable stage-delay micropipelines is not practical. Therefore, we will not find the average throughput for variable stage-delay micropipelines; instead, we are more interested in the throughput bounds, which will be discussed in the following chapter.

Figure 4.10: Logic and equivalent input/output delay values for the stages prior to stage n when steady state is reached for all stages.
5. THE PERFORMANCE OF A LINEAR MICROPIPELINE
WITH VARIABLE STAGE-DELAY
In this chapter, the properties of a linear micropipeline with variable stage-delays are discussed. A variable stage-delay comes from the completion-detection technique [6,7,8] used for encoding the data in each stage. As noted in the previous chapter, finding analytical solutions to logic delays for fixed stage-delay micropipelines is very difficult, if not impossible. More difficulty is expected in doing so for micropipelines with variable stage-delays. This prohibits us from analyzing the average throughput in closed form (we will discuss the average throughput in numerical form in Chapter 6). However, we are able to obtain the throughput bounds. Therefore, we are only interested in the throughput bounds of micropipelines in this chapter. First, various representations of logic delays (and hence output loop-delays) are given. Second, based upon one of these representations, a design procedure is introduced and several design guidelines to achieve required bounds are suggested. Finally, an example is used to demonstrate the design procedure and guidelines, given the constraints of lower and upper bounds of output loop-delay.
5.1 Upper And Lower Performance Bounds
From previous discussions, the performance of a micropipeline with fixed stage-delays is limited by the maximum total stage-delay. Ignoring the overhead of a C-element's physical delay, this is approximately the same as a synchronous pipeline's performance (a more detailed comparison will be discussed in Chapter 6). From the performance point of view, an asynchronous approach provides no benefit over the synchronous approach. This is because a constant delay element which is greater than or equal to the worst-case delay of a combinational logic circuit must be inserted at each stage's data-token path. However, in many cases, the strategy of adding worst-case delay in each stage is too conservative and cannot take full advantage of the "average" case performance. Hence, completion detection techniques are developed to improve the possible speedup, leading to variable stage-delay for each stage [6,7,8]. That is, stage-delay is variable and is a function of token index j (time). More precisely, the ith forward delay, ith backward delay, input delay and output delay are represented as $F_i(j+1)$, $B_i(j+1)$, $D_{in}(j+2)$ and $D_{out}(j+1)$, respectively.
In this section, we are interested in obtaining the upper and lower performance bounds of a FIFO. As will be shown later in this thesis, it is also very helpful to know the equivalent input and output delays of each stage. By reviewing our FIFO model, we know that, as long as all of the logic delays $D^{24}_k(j+2)$ ($D^{24}_k(1) = 0$) and $D^{42}_k(j+2)$ become known, all of the performance-related issues can be derived. In the following, different logic delay representations and approximations are examined. The approximation error is also depicted.
5.1.1 Several Representations Of Logic Delays $D^{24}_k(j+2)$ And $D^{42}_k(j+2)$
All definitions and Theorem 1 (Equal loop-delay) stated in Chapter 4 are applicable to variable stage-delay micropipelines by replacing all fixed delays with variable delays. As a result, the two different loop delays for loop i in the variable stage-delay version are stated as follows.
$T^{1}_i(j+1) = F_i(j+1) + D^{24}_{i+1}(j+1) + B_i(j+1) + D^{42}_i(j+2)$
$T^{3}_i(j+2) = B_i(j+1) + D^{42}_i(j+2) + F_i(j+2) + D^{24}_{i+1}(j+2)$
As mentioned in Chapter 4, given an m-stage linear micropipeline, its throughput is
$P(j+1) = 1/T^{1}_{m+1}(j+1) = 1/(D_{out}(j+1) + D^{42}_{m+1}(j+2))$
Obtaining an analytical solution of throughput in closed form (implying that logic delay $D^{42}_{m+1}(j+2)$ should be solved) involves solving a set of linear/nonlinear difference equations derived from all of the C-elements in a pipeline. Linear difference equations are established according to Theorem 1 stated in Chapter 4. The nonlinear difference equations come from C-element modeling, i.e. $D^{24}_i(j+1) \cdot D^{42}_i(j+1) = 0$. Solving a set of linear/nonlinear difference equations is not an easy task. Therefore, logic delays are retained in the representation of output loop delays. As will be seen in the following chapter, this representation is sufficient to explain some interesting characteristics of micropipelines. However, we can "solve" these linear/nonlinear difference equations in numerical form directly by substituting j from 0 to k, where k is any positive integer. Actually, all simulation results to be discussed in Chapter 6 are obtained by solving these difference equations using Matlab [30]. Using an approach similar to that described in the proof of Theorem 2 (Chapter 4), we will prove that output loop delay and logic delays have the following representations.
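$$T^{1}_{m+1}(j+1) = \mathrm{Max}\Big( DF_{m+1}(j+1),\ DB_m(j+1),\ DB_{m-1}(j+1) + \sum_{l=m}^{m}\big(F_l(j+2) - F_l(j+1)\big) - \sum_{l=m+1}^{m+1} D^{24}_l(j+1),\ \ldots,\ DB_0(j+1) + \sum_{l=1}^{m}\big(F_l(j+2) - F_l(j+1)\big) - \sum_{l=2}^{m+1} D^{24}_l(j+1) \Big) \quad (5.1)$$

$$D^{24}_k(j+2) = \mathrm{Max}\Big( 0,\ DF_k(j+1) - DB_{k-1}(j+1) - \sum_{l=k-1}^{k-1} D^{42}_l(j+1+k-l),\ DF_{k+1}(j) + \sum_{l=k}^{k}\big(B_l(j+1+k-l) - B_l(j+k-l)\big) - DB_{k-1}(j+1) - \sum_{l=k-1}^{k} D^{42}_l(j+1+k-l),\ \ldots,\ DF_m(j+1-m+k) + \sum_{l=k}^{m-1}\big(B_l(j+1+k-l) - B_l(j+k-l)\big) - DB_{k-1}(j+1) - \sum_{l=k-1}^{m-1} D^{42}_l(j+1+k-l),\ DF_{m+1}(j-m+k) + \sum_{l=k}^{m}\big(B_l(j+1+k-l) - B_l(j+k-l)\big) - DB_{k-1}(j+1) - \sum_{l=k-1}^{m} D^{42}_l(j+1+k-l) \Big) \quad (5.2)$$

$$D^{42}_k(j+2) = \mathrm{Max}\Big( 0,\ DB_{k-1}(j+1) - DF_k(j+1) - \sum_{l=k+1}^{k+1} D^{24}_l(j+1),\ DB_{k-2}(j+1) + \sum_{l=k-1}^{k-1}\big(F_l(j+2) - F_l(j+1)\big) - DF_k(j+1) - \sum_{l=k}^{k+1} D^{24}_l(j+1),\ \ldots,\ DB_1(j+1) + \sum_{l=2}^{k-1}\big(F_l(j+2) - F_l(j+1)\big) - DF_k(j+1) - \sum_{l=3}^{k+1} D^{24}_l(j+1),\ DB_0(j+1) + \sum_{l=1}^{k-1}\big(F_l(j+2) - F_l(j+1)\big) - DF_k(j+1) - \sum_{l=2}^{k+1} D^{24}_l(j+1) \Big) \quad (5.3)$$

where $1 \le k \le m+1$,
$DF_i(j+1) = F_i(j+1) + B_i(j+1)$,
$DB_i(j+1) = B_i(j+1) + F_i(j+2)$,
$DB_0(j+1) = D_{in}(j+2)$,
$DF_{m+1}(j+1) = D_{out}(j+1)$,
$D^{42}_0(j+v) = 0$,
$D^{24}_{m+2}(j+v) = 0$, $j \ge 0$ and $v \in$ integer.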
<Proof> Consider C1 in Figure 4.5(a). According to Theorem 1 in Chapter 4,
$D_{in}(j+2) + D^{24}_1(j+2) = DF_1(j+1) + D^{24}_2(j+1) + D^{42}_1(j+2)$
If $D_{in}(j+2) \le DF_1(j+1) + D^{24}_2(j+1)$, then
$D^{24}_1(j+2) = DF_1(j+1) + D^{24}_2(j+1) - D_{in}(j+2)$, and (5.4)
$D^{42}_1(j+2) = 0$ (5.5)
If $D_{in}(j+2) \ge DF_1(j+1) + D^{24}_2(j+1)$, then
$D^{24}_1(j+2) = 0$, and (5.6)
$D^{42}_1(j+2) = D_{in}(j+2) - DF_1(j+1) - D^{24}_2(j+1)$ (5.7)
From the conditions and Eq.(5.4) and (5.6),
$D^{24}_1(j+2) = \mathrm{Max}(0, DF_1(j+1) + D^{24}_2(j+1) - D_{in}(j+2))$ (5.8)
From the conditions and Eq.(5.5) and (5.7),
$D^{42}_1(j+2) = \mathrm{Max}(0, D_{in}(j+2) - DF_1(j+1) - D^{24}_2(j+1))$ (5.9)
Consider C2 in Figure 4.5(a). According to Theorem 1,
$DB_1(j+1) + D^{42}_1(j+2) + D^{24}_2(j+2) = DF_2(j+1) + D^{24}_3(j+1) + D^{42}_2(j+2)$
If $DB_1(j+1) + D^{42}_1(j+2) \le DF_2(j+1) + D^{24}_3(j+1)$, then
$D^{24}_2(j+2) = DF_2(j+1) + D^{24}_3(j+1) - DB_1(j+1) - D^{42}_1(j+2)$, and (5.10)
$D^{42}_2(j+2) = 0$ (5.11)
If $DB_1(j+1) + D^{42}_1(j+2) \ge DF_2(j+1) + D^{24}_3(j+1)$, then
$D^{24}_2(j+2) = 0$, and (5.12)
$D^{42}_2(j+2) = DB_1(j+1) + D^{42}_1(j+2) - DF_2(j+1) - D^{24}_3(j+1)$ (5.13)
From the conditions and Eq.(5.10) and (5.12),
$D^{24}_2(j+2) = \mathrm{Max}(0, DF_2(j+1) + D^{24}_3(j+1) - DB_1(j+1) - D^{42}_1(j+2))$ (5.14)
From the conditions and Eq.(5.11) and (5.13),
$D^{42}_2(j+2) = \mathrm{Max}(0, DB_1(j+1) + D^{42}_1(j+2) - DF_2(j+1) - D^{24}_3(j+1))$ (5.15)
Replacing $D^{42}_1(j+2)$ in Eq.(5.15) with Eq.(5.9), we have
$$D^{42}_2(j+2) = \mathrm{Max}\Big( 0,\ DB_1(j+1) - DF_2(j+1) - \sum_{l=3}^{3} D^{24}_l(j+1),\ DB_0(j+1) + \sum_{l=1}^{1}\big(F_l(j+2) - F_l(j+1)\big) - DF_2(j+1) - \sum_{l=2}^{3} D^{24}_l(j+1) \Big) \quad (5.16)$$
The general form of logic delay $D^{42}_k(j+2)$, as shown in expression (5.3), can be easily reached by applying a similar approach to all of the C-elements in a micropipeline. Since $D^{42}_k(j+2)$ is known ($1 \le k \le m+1$) and $T^{1}_{m+1}(j+1) = D_{out}(j+1) + D^{42}_{m+1}(j+2)$, expression (5.1) can easily be verified. Logic delay $D^{24}_k(j+2)$ in expression (5.2) can also be obtained by backward substitution, for example by substituting $D^{24}_2(j+1)$ in Eq.(5.8) with $D^{24}_2(j+2)$ in Eq.(5.14) (note that an index change is required).
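The numerical route mentioned above (substituting j = 0, 1, 2, ... into the difference equations) can be sketched in a few lines. The thesis reports using Matlab; the Python below is only an illustrative stand-in. It applies the pattern of Eq.(5.8), (5.9), (5.14) and (5.15) to every C-element, with $DF_i(j+1) = F_i(j+1) + B_i(j+1)$ and $DB_i(j+1) = B_i(j+1) + F_i(j+2)$. The function names, the uniform delay distributions and their ranges are assumptions made for the example.

```python
import random


def sampled(rng, low, high):
    """Delay source that draws once per (stage, time) key and then remembers the value."""
    cache = {}

    def delay(*key):
        if key not in cache:
            cache[key] = rng.uniform(low, high)
        return cache[key]

    return delay


def loop_delays(m, f, b, d_in, d_out, tokens):
    """Output loop delay D_out(j+1) + D42_{m+1}(j+2) for j = 0..tokens-1.

    f(i, t) and b(i, t) give F_i(t) and B_i(t); d_in(t) and d_out(t) give D_in(t), D_out(t).
    """
    d24 = [0.0] * (m + 3)                     # D24_k(1) = 0 (initially reset)
    result = []
    for j in range(tokens):
        d42 = [0.0] * (m + 2)                 # D42_0 = 0
        new_d24 = [0.0] * (m + 3)
        for k in range(1, m + 2):
            db = d_in(j + 2) if k == 1 else b(k - 1, j + 1) + f(k - 1, j + 2)
            df = d_out(j + 1) if k == m + 1 else f(k, j + 1) + b(k, j + 1)
            eq_in = db + d42[k - 1]           # DB_{k-1}(j+1) + D42_{k-1}(j+2)
            eq_out = df + d24[k + 1]          # DF_k(j+1) + D24_{k+1}(j+1)
            d42[k] = max(0.0, eq_in - eq_out)
            new_d24[k] = max(0.0, eq_out - eq_in)
        d24 = new_d24
        result.append(d_out(j + 1) + d42[m + 1])
    return result


rng = random.Random(0)
t_out = loop_delays(3, sampled(rng, 2, 6), sampled(rng, 1, 4),
                    sampled(rng, 1, 3), sampled(rng, 1, 3), tokens=200)
print(min(t_out), max(t_out))   # observed range of the output loop delay
```

The cache in `sampled` matters because $F_i(j+2)$ appears in $DB_i(j+1)$ for one token and in $DF_i(j+2)$ for the next; both uses must see the same value.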
Note that expressions (5.1), (5.2) and (5.3) are not unique representations for output loop delay and logic delays. In the proof of (5.1), (5.2) and (5.3), the Equal loop-delay theorem is applied to all C-elements from $C_1$ to $C_{m+1}$. If the Equal loop-delay theorem is applied to all C-elements from $C_{m+1}$ to $C_1$ and the substitution is made such that only the same type of logic delays ($D^{24}_k(j+2)$ or $D^{42}_k(j+2)$), but with different index, appear in the same expression, the alternative representations are shown as follows.
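$$T^{1}_{m+1}(j+1) = \mathrm{Max}\big( DF_{m+1}(j+1),\ DB_m(j+1) + D^{42}_m(j+2) \big) \quad (5.17)$$

$$D^{24}_k(j+2) = \mathrm{Max}\Big( 0,\ \min\Big( DF_k(j+1) - DB_{k-1}(j+1) + \sum_{l=k+1}^{k+1} D^{24}_l(j+1),\ DF_k(j+1) - \sum_{l=k-1}^{k-1}\big(F_l(j+2) - F_l(j+1)\big) - DB_{k-2}(j+1) + \sum_{l=k}^{k+1} D^{24}_l(j+1),\ \ldots,\ DF_k(j+1) - \sum_{l=1}^{k-1}\big(F_l(j+2) - F_l(j+1)\big) - DB_0(j+1) + \sum_{l=2}^{k+1} D^{24}_l(j+1) \Big) \Big) \quad (5.18)$$

$$D^{42}_k(j+2) = \mathrm{Max}\Big( 0,\ \min\Big( DB_{k-1}(j+1) - DF_k(j+1) + \sum_{l=k-1}^{k-1} D^{42}_l(j+1+k-l),\ DB_{k-1}(j+1) - \sum_{l=k}^{k}\big(B_l(j+1+k-l) - B_l(j+k-l)\big) - DF_{k+1}(j) + \sum_{l=k-1}^{k} D^{42}_l(j+1+k-l),\ \ldots,\ DB_{k-1}(j+1) - \sum_{l=k}^{m-1}\big(B_l(j+1+k-l) - B_l(j+k-l)\big) - DF_m(j+1-m+k) + \sum_{l=k-1}^{m-1} D^{42}_l(j+1+k-l),\ DB_{k-1}(j+1) - \sum_{l=k}^{m}\big(B_l(j+1+k-l) - B_l(j+k-l)\big) - DF_{m+1}(j-m+k) + \sum_{l=k-1}^{m} D^{42}_l(j+1+k-l) \Big) \Big) \quad (5.19)$$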
When the previous proof is reviewed (e.g., comparing Eq. (5.8) and (5.9)), it is not surprising that expressions (5.2) and (5.19) have similar forms, except for having opposite signs and the disparity of min and Max. A similar argument can apply to expressions (5.3) and (5.18). Expression (5.17) is obtained by the formula $T^{1}_{m+1}(j+1) = D_{out}(j+1) + D^{42}_{m+1}(j+2)$ with k=m+1 in expression (5.19). Note that, in general, expressions (5.1), (5.2), (5.3), (5.17), (5.18) and (5.19) are a function of both k and j. Therefore, for any combination of k and j that would make any term in an element infeasible or undefined, that element is not included in the representation. For example, suppose we are interested in finding $D^{42}_2(3)$ for a three-stage micropipeline using expression (5.19), i.e., m=3, k=2 and j=1. The expression will be
$$D^{42}_2(3) = \mathrm{Max}\Big( 0,\ \min\Big( DB_1(2) - DF_2(2) + D^{42}_1(3),\ DB_1(2) - \big[B_2(2) - B_2(1)\big] - DF_3(1) + D^{42}_1(3) + D^{42}_2(2) \Big) \Big)$$
The expression $DB_1(2) - \sum_{l=2}^{3}\big[B_l(4-l) - B_l(3-l)\big] - DF_4(0) + \sum_{l=1}^{3} D^{42}_l(4-l)$ should not be included since $B_3(0)$, $DF_4(0)$ and $D^{42}_3(1)$ are not defined.
5.1.2 Several Approximations To $D^{42}_k(j+2)$
In the early design phase, an engineer may be more interested in quickly obtaining an approximation of the pipeline's performance, rather than in finding its exact bounds. Exact performance bounds can be obtained during the final design phase by using a simulation tool.
Since $T^{1}_{m+1}(j+1) = D_{out}(j+1) + D^{42}_{m+1}(j+2)$, the upper and lower bounds of output loop delay $T^{1}_{m+1}(j+1)$ become known once the upper and lower bounds of $D^{42}_{m+1}(j+2)$ are known. A similar argument can be applied to equivalent input delays for each stage, $D^{in}_k(j+2) = B_k(j+1) + D^{42}_k(j+2) + F_k(j+2)$. Therefore, we only need to focus on two representations ((5.3) and (5.19)) of logic delay $D^{42}_k(j+2)$, $1 \le k \le m+1$. It should be noted that the exact values of $D^{42}_k(j+2)$ obtained from expressions (5.3) and (5.19) should be the same for any given k and j, even though their representations are different. The major difference is that the sign in front of the logic delay term in expression (5.3) is negative and in expression (5.19) it is positive. Since all of the logic delays are defined as being greater than or equal to zero, the sign difference allows us to approximate the upper and lower bounds of $D^{42}_k(j+2)$ from the right-hand and left-hand sides. Several approximations are discussed as follows. Assume that the upper and lower bounds of each stage are given, for example,
$D^{min}_{in} \le D_{in}(j+2) \le D^{max}_{in}$,
$D^{min}_{out} \le D_{out}(j+1) \le D^{max}_{out}$,
$B^{min}_i \le B_i(j+1) \le B^{max}_i$, and
$F^{min}_i \le F_i(j+1) \le F^{max}_i$, $1 \le i \le m$.
Several notations are adopted.
$\mathrm{Max}(f(\cdot))|_j$: The exact upper bound of $f(\cdot)$ for all j; sometimes also written as $f^{max}$ for convenience.
$\min(f(\cdot))|_j$: The exact lower bound of $f(\cdot)$ for all j; sometimes also written as $f^{min}$ for convenience.
$UP_x(f(\cdot))|_j$: An expression approximating $\mathrm{Max}(f(\cdot))|_j$ based on approximation x.
$LW_x(f(\cdot))|_j$: An expression approximating $\min(f(\cdot))|_j$ based on approximation x.
Note that $UP_x(f(\cdot))|_j$ could be greater than, equal to or less than $\mathrm{Max}(f(\cdot))|_j$ depending on approximation x. A similar relationship can be applied to $LW_x(f(\cdot))|_j$ and $\min(f(\cdot))|_j$.
<Approximation I>:
From expression (5.3), if the logic delay term is dropped, we have
$$D^{42}_k(j+2) \le \mathrm{Max}\Big( 0,\ DB_{k-1}(j+1) - DF_k(j+1),\ DB_{k-2}(j+1) + \sum_{l=k-1}^{k-1}\big(F_l(j+2) - F_l(j+1)\big) - DF_k(j+1),\ \ldots,\ DB_1(j+1) + \sum_{l=2}^{k-1}\big(F_l(j+2) - F_l(j+1)\big) - DF_k(j+1),\ DB_0(j+1) + \sum_{l=1}^{k-1}\big(F_l(j+2) - F_l(j+1)\big) - DF_k(j+1) \Big) \quad (5.20)$$
Apparently, the largest value of $D^{42}_k(j+2)$ for all j, denoted as $\mathrm{Max}(D^{42}_k(j+2))|_j$, will be less than or equal to the largest value of the right-hand side of expression (5.20). Also, the smallest value of $D^{42}_k(j+2)$ for all j, denoted as $\min(D^{42}_k(j+2))|_j$, will be less than or equal to the smallest value of the right-hand side of expression (5.20). We assume that all stage-delays for each stage are independent (e.g., $F_p(j+1)$ and $F_q(j+1)$ are independent where $p \ne q$, $1 \le p \le m$ and $1 \le q \le m$; the same assumption can be applied to $B_p(j+1)$ and $B_q(j+1)$, $F_p(j+1)$ and $B_p(j+1)$, and $F_p(j+1)$ and $B_q(j+1)$). Therefore, we can make the positive term maximum and the negative term minimum, and then
$$\mathrm{Max}(D^{42}_k(j+2))|_j \le \mathrm{Max}\Big( 0,\ DB^{max}_{k-1} - DF^{min}_k,\ DB^{max}_{k-2} + \sum_{l=k-1}^{k-1}\big(F^{max}_l - F^{min}_l\big) - DF^{min}_k,\ \ldots,\ DB^{max}_1 + \sum_{l=2}^{k-1}\big(F^{max}_l - F^{min}_l\big) - DF^{min}_k,\ DB^{max}_0 + \sum_{l=1}^{k-1}\big(F^{max}_l - F^{min}_l\big) - DF^{min}_k \Big) = UP_1(D^{42}_k(j+2))|_j \quad (5.21)$$
$UP_1(D^{42}_k(j+2))|_j$ is the representation of Max(0, ...) in (5.21) for approximation 1, i.e., $\mathrm{Max}(D^{42}_k(j+2))|_j \le UP_1(D^{42}_k(j+2))|_j$.
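If we make the positive term minimum and the negative term maximum, then
$$\min(D^{42}_k(j+2))|_j \le \mathrm{Max}\Big( 0,\ DB^{min}_{k-1} - DF^{max}_k,\ DB^{min}_{k-2} + \sum_{l=k-1}^{k-1}\big(F^{min}_l - F^{max}_l\big) - DF^{max}_k,\ \ldots,\ DB^{min}_0 + \sum_{l=1}^{k-1}\big(F^{min}_l - F^{max}_l\big) - DF^{max}_k \Big) = LW_1(D^{42}_k(j+2))|_j \quad (5.22)$$
$LW_1(D^{42}_k(j+2))|_j$ is the representation of Max(0, ...) in (5.22) for approximation 1, i.e., $\min(D^{42}_k(j+2))|_j \le LW_1(D^{42}_k(j+2))|_j$,
where $DB^{min}_0 = D^{min}_{in}$,
$DB^{max}_0 = D^{max}_{in}$,
$DF^{min}_{m+1} = D^{min}_{out}$,
$DF^{max}_{m+1} = D^{max}_{out}$,
$DB^{max}_i = F^{max}_i + B^{max}_i = DF^{max}_i$,
$DB^{min}_i = F^{min}_i + B^{min}_i = DF^{min}_i$, $1 \le i \le m$.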
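Approximation 1 reduces to a small computation over the stage-delay bounds. The sketch below evaluates the Max(0, ...) forms of (5.21) and (5.22) for one C-element k. It is an illustration only: the function name and argument layout are hypothetical, the lower form follows the stated rule of making the positive term minimum and the negative term maximum, and the sample bounds are made up.

```python
def approximation_1(k, f_min, f_max, b_min, b_max, d_in_range, d_out_range):
    """Return (UP_1, LW_1) for D42_k(j+2).

    f_min, f_max, b_min, b_max are lists indexed 1..m (index 0 is a dummy entry);
    d_in_range and d_out_range are (min, max) pairs.
    """
    m = len(f_min) - 1
    # DB_0 = D_in and DB_i = F_i + B_i for 1 <= i <= m.
    db_max = [d_in_range[1]] + [f_max[i] + b_max[i] for i in range(1, m + 1)]
    db_min = [d_in_range[0]] + [f_min[i] + b_min[i] for i in range(1, m + 1)]
    # DF_{m+1} = D_out and DF_k = F_k + B_k otherwise.
    df_min_k = d_out_range[0] if k == m + 1 else f_min[k] + b_min[k]
    df_max_k = d_out_range[1] if k == m + 1 else f_max[k] + b_max[k]
    up_terms, lw_terms = [0], [0]
    for p in range(1, k + 1):                 # the term that starts from DB_{k-p}
        spread = sum(f_max[l] - f_min[l] for l in range(k - p + 1, k))
        up_terms.append(db_max[k - p] + spread - df_min_k)
        lw_terms.append(db_min[k - p] - spread - df_max_k)
    return max(up_terms), max(lw_terms)


# Hypothetical one-stage pipeline: D_in in [2, 4], F_1 in [1, 3], B_1 in [1, 2], D_out in [1, 5].
up1, lw1 = approximation_1(2, [0, 1], [0, 3], [0, 1], [0, 2], (2, 4), (1, 5))
print(up1, lw1)   # 5 0
```

Adding the output-delay bounds to the result for k = m+1 gives the corresponding approximation of the output loop delay; when every minimum equals its maximum, the sum collapses to the maximum total stage-delay of the fixed-delay case.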
<Approximation 2>:
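From expression (5.3), if the logic delay term is dropped, except for $D^{24}_{k+1}(j+1)$, and since $D^{out}_k(j+1) = DF_k(j+1) + D^{24}_{k+1}(j+1)$, we have a more accurate expression for $D^{42}_k(j+2)$. That is,
$$D^{42}_k(j+2) \le \mathrm{Max}\Big( 0,\ DB_{k-1}(j+1) - D^{out}_k(j+1),\ DB_{k-2}(j+1) + \sum_{l=k-1}^{k-1}\big(F_l(j+2) - F_l(j+1)\big) - D^{out}_k(j+1),\ \ldots,\ DB_1(j+1) + \sum_{l=2}^{k-1}\big(F_l(j+2) - F_l(j+1)\big) - D^{out}_k(j+1),\ DB_0(j+1) + \sum_{l=1}^{k-1}\big(F_l(j+2) - F_l(j+1)\big) - D^{out}_k(j+1) \Big) \quad (5.23)$$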



















1 1))1.1 Max(D"t(j + '
k -1











$= LW_2(D^{42}_k(j+2))|_j$ (5.25)
where $\min(D^{out}_k(j+1))|_j = F^{min}_k + B^{min}_k + \min(D^{24}_{k+1}(j+1))|_j$,
$\mathrm{Max}(D^{out}_k(j+1))|_j = F^{max}_k + B^{max}_k + \min(D^{24}_{k+1}(j+1))|_j$,
$\mathrm{Max}(D^{out}_{m+1}(j+1))|_j = D^{max}_{out}$.
To utilize this approximation, the minimum value of $D^{24}_k(j+1)$ should be known in advance. Since we are interested in a micropipeline which is initially reset, the minimum value of $D^{24}_k(j+1)$ is zero (e.g., $D^{24}_k(1) = 0$). That is, the equal sign of expression (5.21) holds for initially reset micropipelines. Note that the assumptions $\min(D^{out}_k(j+1))|_j = F^{min}_k + B^{min}_k + \min(D^{24}_{k+1}(j+1))|_j$ and $\mathrm{Max}(D^{out}_k(j+1))|_j = F^{max}_k + B^{max}_k + \min(D^{24}_{k+1}(j+1))|_j$ are for the convenience of calculation. The exact expression will be discussed in a later section.
<Approximation 3 >:
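From expression (5.19), if the logic delay term is dropped, we have
$$D^{42}_k(j+2) \ge \mathrm{Max}\Big( 0,\ \min\Big( DB_{k-1}(j+1) - DF_k(j+1),\ DB_{k-1}(j+1) - \sum_{l=k}^{k}\big(B_l(j+1+k-l) - B_l(j+k-l)\big) - DF_{k+1}(j),\ \ldots,\ DB_{k-1}(j+1) - \sum_{l=k}^{m-1}\big(B_l(j+1+k-l) - B_l(j+k-l)\big) - DF_m(j+1-m+k),\ DB_{k-1}(j+1) - \sum_{l=k}^{m}\big(B_l(j+1+k-l) - B_l(j+k-l)\big) - DF_{m+1}(j-m+k) \Big) \Big) \quad (5.26)$$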
Max(D4k2(j + 2))11

































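<Approximation 4>:
From expression (5.19), if the logic delay term is dropped, except for $D^{42}_{k-1}(j+2)$, and since $D^{in}_{k-1}(j+2) = DB_{k-1}(j+1) + D^{42}_{k-1}(j+2)$, we have a more accurate expression for $D^{42}_k(j+2)$. That is,
$$D^{42}_k(j+2) \ge \mathrm{Max}\Big( 0,\ \min\Big( D^{in}_{k-1}(j+2) - DF_k(j+1),\ D^{in}_{k-1}(j+2) - \sum_{l=k}^{k}\big(B_l(j+1+k-l) - B_l(j+k-l)\big) - DF_{k+1}(j),\ \ldots,\ D^{in}_{k-1}(j+2) - \sum_{l=k}^{m-1}\big(B_l(j+1+k-l) - B_l(j+k-l)\big) - DF_m(j+1-m+k),\ D^{in}_{k-1}(j+2) - \sum_{l=k}^{m}\big(B_l(j+1+k-l) - B_l(j+k-l)\big) - DF_{m+1}(j-m+k) \Big) \Big) \quad (5.29)$$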
Max(D4k2(j + 2))Ii
Max(0, min( Max(D;',' 1(1 + 2 ) DP12"1
k










= UP 4(D`k12 (J + 2))I
min(D4k2(1 + 2))Ij
Max(0, min( min(Dr: 1(1 + 2))1j DFSIcla x,
k





min(Dikn 1(1 + 2))li (BriaxBI n)DFIla x,
i=k
min(Dr 1(1 + 2))Ii
1 =k
(BrixB`fin) DF mma+x I))
(5.29)
(5.30)
= LWID4k2(j + 2))1i (5.31)74
where $D^{in,min}_0 = D^{min}_{in}$,
$D^{in,max}_0 = D^{max}_{in}$,
$\min(D^{in}_i(j+2))|_j = DB^{min}_i + \min(D^{42}_i(j+2))|_j$,
$\mathrm{Max}(D^{in}_i(j+2))|_j = DB^{max}_i + \min(D^{42}_i(j+2))|_j$, $1 \le i \le m$.
To utilize this approximation, the minimum value of $D^{42}_{k-1}(j+2)$ should be known in advance. Since, in general, $D^{42}_{k-1}(j+2) \ne 0$ given j=q and k=w where $q \ge 0$ and $1 \le w \le m+1$, the equal signs from expressions (5.26) to (5.31) may not hold for initially reset micropipelines. The assumptions $\min(D^{in}_i(j+2))|_j = DB^{min}_i + \min(D^{42}_i(j+2))|_j$ and $\mathrm{Max}(D^{in}_i(j+2))|_j = DB^{max}_i + \min(D^{42}_i(j+2))|_j$ are for the convenience of calculation only. Exact expressions will be discussed in a later section.
Besides the four approximations given above, other means can make the approximation more accurate, at the price of also making it more complex. For example, instead of ignoring the whole logic delay term in expression (5.19), we can take into account more logic delays when making the approximation.
The above approximations are summarized in Figure 5.1. From the above discussion, we know that there are four approximations of the exact upper bound and four of the exact lower bound. If the micropipeline is considered to be reset initially, then $\mathrm{Max}(D^{42}_k(j+2))|_j = UP_1(D^{42}_k(j+2))|_j = UP_2(D^{42}_k(j+2))|_j$. Otherwise, the errors based on these four approximations are as shown below. Note that the error itself is also an approximation.
Upper Bound: Best Approximation Error (x100%): [(5.24) - (5.30)]/(5.30)
Upper Bound: Worst Approximation Error (x100%): [(5.21) - (5.27)]/(5.27)
Lower Bound: Best Approximation Error (x100%): [(5.25) - (5.31)]/(5.31)
Lower Bound: Worst Approximation Error (x100%): [(5.22) - (5.28)]/(5.28)

[Figure 5.1 labels, upper row: $LW_4$ (5.31), $LW_2$ (5.25), $UP_4$ (5.30), $UP_2$ (5.24); lower row: $LW_3$ (5.28), $LW_1$ (5.22), $UP_3$ (5.27), $UP_1$ (5.21); the spans between them are marked "Appro. error".]
* If a micropipeline is initially reset, expressions (5.21) and (5.24) will coincide with the exact upper bound.
Figure 5.1: Relative relationship among different approximations of $D^{42}_k(j+2)$.
5.1.3 Several Approximations To T 11(i + 1)
As mentioned before, a method for finding the exact bounds of a general system
with choice modeled as a Petri net has been developed [21, 22]. However, the algorithm
used in that work cannot guarantee exact bounds for a system which is data dependent
and contains a mutual-exclusion mechanism. For such systems, only approximate
results can be determined. A tool called "CTSE" accepts a Petri net model as input
and implements this algorithm [21, 22]. According to the author, CTSE is not an
acronym; however, the "C" stands for "conditional" or "choice" and "TSE" stands for
"time separation of events". We would like to compare our approximations, based on
Figure 5.1 and the fact that the output loop delay $T_{m+1}(j+1)$ is the sum of the
output delay $D_{out}(j+1)$ and the logic delay of stage m + 1, with the exact
values obtained through CTSE. Figure 5.2(a) shows a three-stage linear pipeline (m=3)
used for comparison. The corresponding Petri net model for this pipeline is depicted
in Figure 5.2(b). The token marking in Figure 5.2(b) indicates the initial condition of
the pipeline. Table 5.1 shows ten simulations corresponding to different stage-delay
combinations. The first number in brackets represents the minimum value and the
second number represents the maximum value. The result is also plotted in Figure 5.2(c).
Due to the initial reset condition, the output loop delay obtained with the UP1()
approach has the same value as the exact maximum bound, as mentioned earlier. Besides,
since the corresponding logic delay term is 0 by default in the case of m=3, the two
approaches UP1() and UP2() are the same. This is also true for LW1() and LW2(). The
second simulation case, in which UP3() = 19 < 24 = LW2(), shows that UP3() < LW2()
is possible. Note that it is not necessarily true that LW1() has the same value as the
exact minimum bound (refer to the 8th and 9th simulation results), although it does in
most cases in our
simulations. These simulations also show that UP4() approaches the exact maximum












[Figure 5.2(c) plot: Max, UP1, UP2, UP3 and UP4 of $T_{m+1}(j+1)$, and min, LW1, LW2, LW3
and LW4 of $T_{m+1}(j+1)$, for the ten simulations.]
Figure 5.2: Comparison of our approximations with the exact bounds.
(a) A 3-stage linear pipeline.
(b) Petri net model.
(c) Result comparison.
Table 5.1: Comparison of our approximations with the exact bounds
using a three-stage linear pipeline as an example.
Case  1        2        3        4        5        6        7        8        9        10
Din   [2 4]    [2 4]    [2 4]    [22 24]  [6 7]    [6 7]    [6 7]    [36 47]  [36 37]  [6 7]
F1    [12 24]  [20 24]  [22 24]  [12 14]  [10 20]  [10 20]  [10 11]  [20 22]  [20 22]  [20 22]
B1    [4 5]    [4 5]    [4 5]    [2 5]    [2 5]    [2 5]    [10 15]  [4 15]   [4 15]   [4 15]
F2    [22 29]  [25 29]  [12 16]  [12 16]  [2 14]   [12 14]  [12 13]  [22 30]  [22 30]  [22 30]
B2    [2 4]    [2 4]    [2 4]    [2 4]    [3 7]    [3 7]    [4 7]    [1 7]    [1 7]    [1 7]
F3    [20 26]  [10 13]  [10 14]  [10 14]  [2 3]    [29 33]  [12 13]  [12 13]  [12 23]  [22 23]
B3    [4 6]    [4 6]    [4 6]    [1 6]    [1 6]    [1 6]    [2 6]    [2 14]   [2 14]   [2 14]
Dout  [2 4]    [2 4]    [2 4]    [2 4]    [2 4]    [2 4]    [2 4]    [2 4]    [2 4]    [12 14]
Max   42       36       37       34       38       39       28       58       58       46
UP1   42       36       37       34       38       39       28       58       58       46
UP2   42       36       37       34       38       39       28       58       58       46
UP3   32       19       20       20       9        39       19       27       37       37
UP4   39       36       24       24       22       39       21       38       48       38
min   24       24       18       12       4        30       18       22       14       24
LW1   24       24       18       12       4        30       18       25       15       24
LW2   24       24       18       12       4        30       18       25       15       24
LW3   24       14       14       11       3        30       14       14       14       24
LW4   24       24       14       11       4        30       15       22       14       24
To simplify the discussion of the design procedure in a later section, only the UP1()
approximation of the logic delay of stage m + 1 is considered from now on, unless
specified otherwise. We also assume that the minimum value of this logic delay is zero.
As a result, since the output loop delay $T_{m+1}(j+1)$ is the sum of $D_{out}(j+1)$
and this logic delay, the output loop delay has a lower bound,
$T_{m+1}(j+1) \ge D_{out}(j+1)$ (5.32)
and an upper bound based on (5.20) for k = m + 1,
$T_{m+1}(j+1) \le \mathrm{Max}\Big( D_{out}(j+1),\; DB_m(j+1),\; DB_{m-1}(j+1) + \sum_{l=m}^{m}\big(F_l(j+2) - F_l(j+1)\big),\; DB_{m-2}(j+1) + \sum_{l=m-1}^{m}\big(F_l(j+2) - F_l(j+1)\big),\; \ldots,\; DB_1(j+1) + \sum_{l=2}^{m}\big(F_l(j+2) - F_l(j+1)\big),\; D_{in}(j+2) + \sum_{l=1}^{m}\big(F_l(j+2) - F_l(j+1)\big) \Big)$ (5.33)
Expressions (5.32) and (5.33) allow us to find the upper and lower bounds of the
individual output loop delay for each j if both the forward delay $F_i(j+1)$ and the
backward delay $B_i(j+1)$ for each j are known.
If each stage's upper and lower bounds are given, then expressions (5.32) and
(5.33) can be rewritten as
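$T_{m+1}(j+1) \ge D_{out}^{min}$ (5.34)
$T_{m+1}(j+1) \le \mathrm{Max}\Big( D_{out}^{Max},\; DB_m^{Max},\; DB_{m-1}^{Max} + \sum_{l=m}^{m}\big(F_l^{Max} - F_l^{min}\big),\; \ldots,\; DB_1^{Max} + \sum_{l=2}^{m}\big(F_l^{Max} - F_l^{min}\big),\; D_{in}^{Max} + \sum_{l=1}^{m}\big(F_l^{Max} - F_l^{min}\big) \Big)$ (5.35)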














Expressions (5.34) and (5.35) allow us to find the upper and lower bounds of the overall
output loop delay if the upper and lower bounds of both the forward delay $F_i(j+1)$ and
the backward delay $B_i(j+1)$ are known. That is, each output loop delay corresponding to
a different j must be bounded by expressions (5.34) and (5.35). Figure 5.3 shows the
relative relationship among the expressions (5.32), (5.33), (5.34) and (5.35).
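The evaluation of the overall bounds can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: it reads (5.35) as having the same element structure as (5.33) with every delay replaced by its bound, takes $DB_k^{Max} = F_k^{Max} + B_k^{Max}$, and uses $D_{out}^{min}$ as the lower bound of (5.34). The function name `output_loop_delay_bounds` and the data layout are hypothetical, not part of the thesis. Under this reading, the upper bounds computed from the stage-delay bounds of Table 5.1 reproduce the UP1 row of that table.

```python
from typing import List, Tuple

Bound = Tuple[int, int]  # (minimum, maximum)


def output_loop_delay_bounds(d_in: Bound, forward: List[Bound],
                             backward: List[Bound], d_out: Bound) -> Tuple[int, int]:
    """Return (lower, upper) for the overall output loop delay.

    Assumed reading of (5.34) and (5.35), stages listed from input to output:
      lower = Dout_min
      upper = Max(Dout_max,
                  DB_k_max + sum over l = k+1..m of (F_l_max - F_l_min), k = m..1,
                  Din_max  + sum over l = 1..m   of (F_l_max - F_l_min))
    where DB_k_max = F_k_max + B_k_max.
    """
    elements = [d_out[1]]
    slope = 0  # accumulated (F_max - F_min) of the stages downstream of stage k
    for (f_min, f_max), (_, b_max) in zip(reversed(forward), reversed(backward)):
        elements.append(f_max + b_max + slope)
        slope += f_max - f_min
    elements.append(d_in[1] + slope)
    return d_out[0], max(elements)


# Stage-delay bounds of the ten simulations in Table 5.1.
DIN = [(2, 4), (2, 4), (2, 4), (22, 24), (6, 7), (6, 7), (6, 7), (36, 47), (36, 37), (6, 7)]
F1 = [(12, 24), (20, 24), (22, 24), (12, 14), (10, 20), (10, 20), (10, 11), (20, 22), (20, 22), (20, 22)]
B1 = [(4, 5), (4, 5), (4, 5), (2, 5), (2, 5), (2, 5), (10, 15), (4, 15), (4, 15), (4, 15)]
F2 = [(22, 29), (25, 29), (12, 16), (12, 16), (2, 14), (12, 14), (12, 13), (22, 30), (22, 30), (22, 30)]
B2 = [(2, 4), (2, 4), (2, 4), (2, 4), (3, 7), (3, 7), (4, 7), (1, 7), (1, 7), (1, 7)]
F3 = [(20, 26), (10, 13), (10, 14), (10, 14), (2, 3), (29, 33), (12, 13), (12, 13), (12, 23), (22, 23)]
B3 = [(4, 6), (4, 6), (4, 6), (1, 6), (1, 6), (1, 6), (2, 6), (2, 14), (2, 14), (2, 14)]
DOUT = [(2, 4)] * 9 + [(12, 14)]

results = [
    output_loop_delay_bounds(DIN[i], [F1[i], F2[i], F3[i]], [B1[i], B2[i], B3[i]], DOUT[i])
    for i in range(10)
]
lower_bounds = [lo for lo, _ in results]
upper_bounds = [up for _, up in results]
print(upper_bounds)  # compare with the UP1 row of Table 5.1
```

The running variable `slope` accumulates the forward-delay spread of the downstream stages, so each element is built in a single pass from the output side toward the input side.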
Note that expression (5.34) is a sufficient (not necessary) condition for a
micropipeline to satisfy the required lower bound. This is obvious since expression
(5.34) drops the logic delay term of stage m + 1. In other words, that logic delay is
assumed to have a lower bound of zero. The upper bound of the output loop delay, shown
in expression (5.35) (note that the "=" sign holds for an initially reset pipeline),
however, is a sufficient and necessary condition for a micropipeline to satisfy the
required upper bound. The "=" sign in expression (5.35) can be justified if we
substitute $F_i(1) = F_i^{min}$, $F_i(2) = F_i^{Max}$, $B_i(1) = B_i^{Max}$,
$DB_0(1) = D_{in}^{Max}$ and $DF_{m+1}(1) = D_{out}^{Max}$ into expression (5.1) for
j=0, where $1 \le i \le m$. Note that the logic delay term of (5.2), evaluated for the
first data item, is 0 for $1 < k \le m + 1$.
5.1.4 Several Approximations To Di4U + 2)
Just like the logic delay D4k2(j + 2), there are four approximations to Di4(j + 2).





























:Overall output loop delay bound(exact)
Overall output loop delay bound(approximation)
Figure 5.3: The relationship between individual and overall




DF,(j + 1m + k) +(Bi(j + 1 + k 1)131(j + k1))DB k_ i(j + 1)
1=k
M-
D12(j + 1 + k1),
1=k I
DFm+10m + k) +(BIO + 1 + k 1)Bi(j + k1))DBk_10 + 1)
1=k
D12(j + 1 + k1))
1=k I
<Approximation 1>:

























DFminm+ 1 (137in a x )D1411)
1=k
= LW I(D2k4(j2))1i









If the logic delay term is dropped except for the term D4k2_/(j+ 2) and since
Din
1(j + 2) = DB k_ i(j + 1) + D42




DFrxmin(DtP_ i(j + 2))1i,
k
DFmc"k +1(Br'Br in) 1(j + 2))1p
1=k
In




= UP2(D2k4(1 + 2))Ii
min(D2k4(1 + 2))1i
Max(0,




(fflininBrax)Max(DP_ 1(1 + 2))1p
m -I
(WI"Brax)Max(Di: 1(j + 2))1p
1= k




= LW2(D2k4(j + 2))1j (5.39)
where min(D`P(1 + 2))Ii = Fnklin + Btkninmin(Dr(j + 2))1i,
Max(Dikn(j + 2))Ii= F+ Bikfrfax + min(D4k2(1 + 2))1i,
min(Ditj(j + 1))1i = ,
Max(DVj + 1))1i= D11111 at.
To utilize this approximation, the minimum value of D4k2(j + 2) should be known in
advance.Note that the assumption of min(D'IN + 2))1i = + Bmin
min(Df2(1 + 2))1i and Max(D`1,n(j + 2))I= F111" + + min(D4k2(j + 2))1j are for the
convenience of calculation. The exact expression will be discussed in a later section.




DFk(j1)DB k_ j(j + 1) + D14(j + 1),
1=k+1
k -1 k+ I
DF k(j + 1) (F/j + 2)F i(j + 1))DB k(j+ 1) +D14(j + 1),
1= k- 1 1=k
k- I k+ I
DFk(j + I) (F 1(j + 2)F (j + 1))DB+ 1) +D14(I + 1),
1=2 1=3
k -1 k +1
DFk(j + 1) (F i(j + 2)F (j + 1))DB0(j + 1) +D14(j + 1)))
1 =1 1=2
<Approximation 3 >:



























= LW3(D2k4(j + 2))1/ (5.41)
<Approximation 4 >:
If the logic delay term is dropped except for the term D24k++ 1) and since
Dry + 1)= DFk(j + 1) + D2kt i(j + 1), we have a more accurate expression for
D2k4(j + 2).
Max(D2k4(j + 2))I









Max(Dr(j + 1))ii +(Fr' DBir ))
l =1
= UP 4(D2k4(j + 2))li
min(D2k4(j + 2))1i
Max(0, min( min(Dr(j + DB nxl,
k I









min(Dr(j + 1))1 (Fra.Fri)DBrx)) /
1=1
= LW4(D2k4(j + 2))1j (5.43)






min(Dr(j + 1))1i= min(D+1 1(1 + 1))1p
Max(Drt(j + 1))I1= DBrax + min(Dr+I i(j + 1))1j, 1i < in.
To utilize this approximation, the minimum value of Di+4/(j + 1) should be known in
advance. Since D14(1) = 0, the equal sign of expressions (5.41) to (5.43) hold for
initially reset micropipelines.That is, if a pipeline is initially empty, then
min(D2k4(j + 2))1i = LW3(D2k4(j + 2))1i= LW4(D2k4(j + 2))Ii.Figure 5.4 demonstrates
the relationship between exact bounds of Di4(j + 2) and expressions (5.36) to (5.43).
Theassumptionofmin(Dr(j + 1))1j=Dfir + _1(j + 1))Ijand
Max(Dr(j + 1))1j = DBrax + 1(1 + 1))1i is for the convenience of calculation
only.Exact expressions will be discussed in a later section.
5.1.5 Several Approximations To $D_k^{in}(j + 2)$
According to the definition of equivalent input delay Dr (j + 2) = DBk(j + 1)
+ D4k2(j + 2) and from the expression (5.3) of D4k2(j + 2), we have
DT(j + 2)
= Max(DBk(j + 1),















Appro. error Appro. error
(5.43) (5.39) (5.42) (5.38)











* If a micropipeline is initially reset, expressions (5.41) and (5.43)
will coincide with exact lower bound.
Figure 5.4: Relative relationship among different approximations of /44(j + 2).DB k_2(j + 1) +I (Ffi + 2)I-1(j + 1))
1=k- I










DB0(j + 1) +(Flj + 2)F,(j + 1)) Di(j + 1))
1=1 f =2
where 0km,
DB0(j +1) = Di(j + 2).
If the stage-delay bounds are given and pipeline is initially reset, then
Max(DT(j +2))Ij


























= LIV/(D`kn(j + 2))1i (5.46)
Several conclusions can be reached from previous discussions.
1> Max(142(j + 2))1i= UP i(D112(j + 2))1j (5.47)
min(D4k2(j + 2))1iL1471(Df2(j + 2))1i (5.48)
2> Max(DiN + 2))1i= UP 1(14(j + 2))1i (5.49)
min(DiN + 2))1iLIVI(DM + 2))1j (5.50)
3> Max(DP(j + 2))1./. Max(Dt2(j + 2))lj (5.51)
min(DPU + min(142(1 + 2))1j
4> Comparing expressions (5.21) and (5.45), we find
(5.52)
UP i(Dikn(j + 2))1j ={
DB iliclax if UP i(Dt2(j + 2))I. =
FMax + Brcin + up AD42u+ 2))Ii otherwise
0
(5.53)
and, comparing expressions (5.22) and (5.46), we have
if LW 1(142(j + 2))1i =0
Li V i(DVU + 2))Ij FnklinBikVax + LI V AD112(j2))Ii otherwise
(5.54)
Note that the "0" conditions in expressions (5.53) and (5.54) refer to the first
element of expressions (5.21) and (5.22), not to the calculated result (the other
elements could also be zero after calculation). In general,
Max(Dr(j + 2))1jFr' + BM' + Max(Dr(j + 2))!/, and
min(DP(j + 2))1jPknin Bikninmin(Dt2(j + 2))1i
If the representation of DI+ 2) in (5.19) is used to find Dr(j + 2), the representation
of Dr(j + 2) becomes
Dikn(j + 2)
= Max( DBk(j + 1), min(
k -1
DBk_ i(j -I- 1)F (j + 2)Fk(j + 1) + D12(j + I + k1),
1=k-1
DBk_ 1(j + 1) + Fk(j + 2) + Bk(j)DFk+ i(j) + Dr2(j + 1 + k1),
1=k-1
DB k_ i(j + 1) + Fk(j + 2) + Bk(j)
In-1
(Bi(j + I + kI)B i(j + kI))
1=k+1
DF,(j + 1in + k) + D12(j + I + kI),
1=k-1
DB k_ i(j + 1) + Fk(j + 2) + Bk(j)
1=k+1
(Bi(j + 1 + k 1)B i(j + kI))
in
DF,+ i(jm + k) + D42(j + 1 + kI))) (5.55)
1=k-1
The conclusions are
1> Max(D4k2(j + 2))1jUP3(D4k2(j + 2))1i (5.56)
min(D1,2(j + 2))1jLW3(D14,2(j + 2))1i (5.57)
2> Max(DN + 2))1iUP3(14(j + 2))Ii (5.58)
min(DN + LW3(D1P(j + 2))Ii (5.59)
3> Max(DN + 2))iiMax(Dr(j + 2))1i (5.60)92




if UP 3(D14,2(j + 2))I1 = 0
(5.62)
+ Btknin + UP 3(D4k20 + 2))li otherwise
{DB'knui if LW3(Dr(j + 2))1f = 0
Fllkiin + BV'ax + LW 3(D;c12(j + 2 Ai otherwise
(5.63)
5.1.6 Several Approximations To $D_k^{out}(j + 1)$
Similar to the previous derivation and according to the definition of equivalent
output delay Dr(j + 1) = DF k(j + 1) + D2k4+ 1(j + 1), we have the following represen-
tation, if D2k4+ i(j + 2) is represented as the form of (5.2).
Dokut(j + 2)
= Max( DF k(j + 2),
DFk±i(j + 1) + Bk(j + 2)Bk(j + 1)-
1=k
D12(j + 2 + k1),
k+1
DFk+2(j) + (Bi(j + 2 + k 1)B i(j + 1 + k1)) + B k(j + 2)B k(j + 1)
1 =k+ 1
k+1
I Dp(j + 2 + k1),
1=k
m-1




Dr2(j + 2 + k1),
1=kIn
DFi(j + 1in + k) + (B i(j + 2 + k 1)Bi(j + 1 + k1))
1=k+1
+ Bk(j + 2)Bk(j + 1)
1=k
Dr2(j + 2 + k1))(5.64)
The conclusions are
1> Max(Dit i(j + 2))1i j(D1,+1(j + 2))1i (5.65)
min(Dk+ i(j + 2))1iLI V I(D2k4+1(j + 2))1i (5.66)
2> Max(Dr(j + 2))1i j(Dr(j + 2))1j (5.67)
min(Dr(j + LI 1(Dr(j + 2))1i (5.68)
3> Max(Dr(j + 2))1iMax(Dit i(j + 2))1j (5.69)
min(Dr(j + 2))1jmin(D2k4+ i(j + 2))1j (5.70)
4>
UP1(Dr(j + 2))1i =
LIVI(DZ`a(j + 2))li =
{DFAiax if UP 1(D2k4+ 1(1 + 2))1j = 0
B 114' +Pknin + UPADP+ 1(12))Ij otherwise
(5.71)
{DPknin if LW i(D2k4+ i(j + 2))1j = 0
F 14:1' +Blknin +LI 47 AD2k4+ 1( j + 2 DIj otherwise
(5.72)
Note that, UP/(Dr(/)) = DFIA;fax, and LWI(Dr(/))=DF'knin.
If the representation of Di4+1(j + 2) in (5.18) isused to find DM + 2), then
Dr(j + 2) becomes
Drt(j+ 2)
= Max( DF k(j + 2), min(
k+2
DFlc+ (j + 1) + Bk(j + 2)Bk(j + 1) + Dr(j + 1),
1=k+295
5.2 Design Procedure And Guidelines
In this section, we show how to design a linear micropipeline based upon expressions
(5.34) and (5.35), such that the resulting maximum and minimum output loop delays
satisfy a specification. That is, given a delay bound, $Q^{min} \le Q \le Q^{Max}$, of a
combinational logic circuit, how does one design a linear micropipeline such that its
output loop delay $T_{m+1}(j+1)$ satisfies $T^{min} \le T_{m+1}(j+1) \le T^{Max}$? In the
above statement, Q represents the original logic circuit delay before partition, and
$Q^{min}$ and $Q^{Max}$ symbolize the minimum and maximum signal delays, while $T^{min}$
and $T^{Max}$ designate the minimum and maximum output loop delay requirements after the
original logic circuit has been partitioned and implemented using a linear micropipeline.
For clarity, register propagation delays and C-element physical delays are ignored in this
discussion without loss of generality. The effect of these two delays on the performance
of micropipelines will be discussed in Chapter 6. Backward delays are also ignored.
Assume that an m-stage micropipeline is considered: let $F_i^{Max} = a_i T^{Max}$ and
$F_i^{min} = b_i T^{Max}$, $0 \le i \le m + 1$, where $F_0^{Max} = D_{in}^{Max}$,
$F_0^{min} = D_{in}^{min}$, $F_{m+1}^{Max} = D_{out}^{Max}$ and $F_{m+1}^{min} = D_{out}^{min}$.




To satisfy the specification using fixed stage-delay micropipelines, we need
i> at least $\lceil Q^{Max}/T^{Max} \rceil$ pipe-stages if each stage-delay = $T^{Max}$
except for a certain stage, if any (the situation when $Q^{Max}/T^{Max}$ is indivisible),
which has the stage-delay = $Q^{Max} - \lfloor Q^{Max}/T^{Max} \rfloor \cdot T^{Max}$;
ii> at most $\lceil Q^{Max}/T^{min} \rceil$ pipe-stages if each stage-delay = $T^{min}$
except for a certain stage, if any (the situation when $Q^{Max}/T^{min}$ is indivisible),
which has the stage-delay = $Q^{Max} - \lfloor Q^{Max}/T^{min} \rfloor \cdot T^{min}$.
The resulting output loop delay is $T_{m+1}(j+1) = T^{Max}$ for the former condition
and is $T_{m+1}(j+1) = T^{min}$ for the latter condition.
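If this circuit is implemented using a variable stage-delay micropipeline, we can
rewrite expressions (5.34) and (5.35) as
$T_{m+1}(j+1) \ge b_{m+1} T^{Max}$ (5.82)
$T_{m+1}(j+1) \le \mathrm{Max}\Big( a_{m+1} T^{Max},\; a_m T^{Max},\; \big(\sum_{l=m-1}^{m} a_l - \sum_{l=m}^{m} b_l\big) T^{Max},\; \big(\sum_{l=m-2}^{m} a_l - \sum_{l=m-1}^{m} b_l\big) T^{Max},\; \ldots,\; \big(\sum_{l=0}^{m} a_l - \sum_{l=1}^{m} b_l\big) T^{Max} \Big)$ (5.83)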
The goal is to find coefficients $a_i$ and $b_i$, $0 \le i \le m + 1$, such that
expressions (5.82) and (5.83) satisfy the requirement
$T^{min} \le T_{m+1}(j+1) \le T^{Max}$. From $F_i^{Max} = a_i T^{Max}$ and
$F_i^{min} = b_i T^{Max}$, we know
$F_i^{Max} = a_i T^{Max} \ge F_i^{min} = b_i T^{Max} \Rightarrow a_i \ge b_i$, $0 \le i \le m + 1$. (5.84)
In general, there are two terms in each element (equation) of expression (5.35). The
first and second elements have a second term equal to 0. From expression (5.35) and the
upper bound requirement, we know that the first term $F_i^{Max}$, $0 \le i \le m + 1$, in
expression (5.35) satisfies
$F_i^{Max} = a_i T^{Max} \le T^{Max} \Rightarrow a_i \le 1$ (5.85)
From conditions (5.84) and (5.85), we also know
$a_i \ge b_i$ and $a_i \le 1 \Rightarrow b_i \le 1$










From expression (5.83), to satisfy the upper bound requirement, the coefficients of
$T^{Max}$ must not exceed 1. That is,










There are many solutions that satisfy conditions (5.84) through (5.89). One of the pos-
sible solutions is obtained by replacing the inequality signs with equality signs in (5.88)







$a_m = 1$, and
$a_{i-1} = b_i = c_{i-1}$, $1 \le i \le m$, which is substituted into equation (5.87), we get
$(c_1 + c_2 + \cdots + c_{m-1} + 1)\,T^{Max} = Q^{Max}$ (5.90)
From expression (5.84), we have
$a_i \ge b_i \Rightarrow c_i \ge c_{i-1}$, $1 \le i \le m$. (5.91)
If $Q^{Max}$ is a multiple of $T^{Max}$ and if we let $c_1 = c_2 = \cdots = c_{m-1} = 1$,
where $m = Q^{Max}/T^{Max}$, this is equivalent to the fixed stage-delay case with
stage-delay = $T^{Max}$ for each stage. The coefficients $a_0$, $b_0$ and $b_1$ (note that
$a_0 = b_1$ in (5.90)) can be chosen arbitrarily as long as they satisfy conditions (5.84)
through (5.86). If the number of pipe-stages m used to compose a pipeline is given or
limited, the set of possible solutions is reduced. Otherwise, the number of pipe-stages
m can also be a controllable parameter when we design a micropipeline.
Once the $c_i$'s and m are determined, all lower bounds $F_i^{min}$ and upper bounds
$F_i^{Max}$, $0 \le i \le m + 1$, of the stages become known, assuming that the bounds of the
(environmental) input and output delays are also controllable. It is possible that the resulting
m







imposed on this design procedure, as stated above.It is of no concern if
pinin Qmin
1 =1 1 =1
Is there any problem if FininQmin? Before this question is
answered, the effect of partitioning a combinational logic block is discussed first. Given




Vinin and Qmin, and
m
1=1
011' and Qmax if this combinational logic
block is partitioned into m sub-blocks (stages), where $Q_l^{min}$ is the delay lower bound
and $Q_l^{Max}$ is the delay upper bound for stage l, $1 \le l \le m$? Without loss of
generality, Figure 5.5(a) depicts a combinational logic block which is divided into three
stages (m=3) as an example. The longest path within the original block is denoted as G,
which has delay equal to $Q^{Max}$, while the shortest path is designated as I, which has
delay equal to $Q^{min}$. Other paths within this block, which have delays bounded by
$[Q^{min}, Q^{Max}]$, are generically represented as H. In this example,
$\sum_{l=1}^{3} I_l = Q^{min}$ and $\sum_{l=1}^{3} G_l = Q^{Max}$. However, the individual
stage-delay bounds are determined by choosing the maximum and minimum delays within each
stage, i.e., $Q_l^{min} = \mathrm{minimum}[G_l, H_l, I_l]$ and
$Q_l^{Max} = \mathrm{Maximum}[G_l, H_l, I_l]$, $1 \le l \le 3$. The condition
$\sum_{l=1}^{3} Q_l^{min} = Q^{min}$ holds if and only if
$\mathrm{minimum}[G_l, H_l, I_l] = I_l$, $1 \le l \le 3$. Otherwise, for example, in
Figure 5.5(a), if the partition is made such that $\mathrm{minimum}[G_1, H_1, I_1] = I_1$,
$\mathrm{minimum}[G_2, H_2, I_2] = H_2$ (implying $H_2 < I_2$), and
$\mathrm{minimum}[G_3, H_3, I_3] = I_3$, then
$\sum_{l=1}^{3} Q_l^{min} = I_1 + H_2 + I_3 < I_1 + I_2 + I_3 = Q^{min}$
In summary, if a combinational logic block with upper and lower bounds is
partitioned into m stages, the summation of the lower bounds of each stage will be less
than or equal to the original undivided lower bound, i.e.,
$\sum_{l=1}^{m} Q_l^{min} \le Q^{min}$. A similar approach can be applied to obtain the
upper bounds relationship. That is, the summation of the upper bounds of each stage will
be greater than or equal to the original undivided upper bound, i.e.,
$\sum_{l=1}^{m} Q_l^{Max} \ge Q^{Max}$. A numerical example is demonstrated in
Figure 5.5(b). The original combinational block is bounded by [11,20], but after
partitioning the summation of lower bounds becomes 3+3+3=9 (<11) and the summation
of upper bounds becomes 9+5+8=22 (>20).










Figure 5.5: Partitioning a combinational logic block.
(a) A combinational logic block is arbitrarily partitioned into 3 stages.
(b) An example shows the delay bounds change before and after partition.
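The effect of partitioning on the bounds can be checked with a short Python sketch. It is only an illustration: the per-stage path delays below are hypothetical values chosen so that the block bounds [11,20] and the per-stage sums 9 and 22 quoted for Figure 5.5(b) are reproduced (the actual path delays of the figure are not recoverable here), and the function name `partition_bounds` is an assumption. The sketch also assumes that every path crosses every stage.

```python
from typing import Dict, List, Tuple


def partition_bounds(paths: Dict[str, Tuple[int, ...]]) -> Tuple[Tuple[int, int], List[int], List[int]]:
    """Compare the bounds of an undivided block with its per-stage bounds.

    `paths` maps a path name to its delay contribution in each stage.
    Returns ((Q_min, Q_max), per-stage minima, per-stage maxima).
    """
    totals = [sum(delays) for delays in paths.values()]
    per_stage = list(zip(*paths.values()))  # delays of all paths, grouped by stage
    stage_min = [min(stage) for stage in per_stage]
    stage_max = [max(stage) for stage in per_stage]
    return (min(totals), max(totals)), stage_min, stage_max


# Hypothetical longest path G and shortest path I through three stages.
paths = {"G": (9, 3, 8), "I": (3, 5, 3)}
(q_min, q_max), stage_min, stage_max = partition_bounds(paths)

print(q_min, q_max)                    # bounds of the undivided block
print(sum(stage_min), sum(stage_max))  # sums of the per-stage bounds
```

Because the stage minimum may come from a different path in each stage, the sum of the stage minima can only stay equal to or fall below the block minimum, and the sum of the stage maxima can only stay equal to or rise above the block maximum.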
Continuing the discussion of our design solutions, will the condition
$\sum_{l=1}^{m} F_l^{min} \le Q^{min}$ cause a problem? From the previous study, we know
it is safe. What if $\sum_{l=1}^{m} F_l^{min} > Q^{min}$? This indicates that the solution
cannot be implemented physically no matter how the original block is partitioned. There
are three choices. First, we can find another set of coefficients such that
$\sum_{l=1}^{m} F_l^{min} \le Q^{min}$. The second choice is to leave the coefficients
intact and implement this circuit by artificially adding extra delays to each stage
according to the coefficients we obtained. In fact, the condition
$\sum_{l=1}^{m} F_l^{min} > Q^{min}$ relaxes the lower bounds at stages and provides
additional design freedom. Therefore, the third choice is to redesign the stages with,
for example, less hardware complexity by taking advantage of the relaxation in lower
bounds. In the above design procedure, the assumption
$\sum_{l=1}^{m} F_l^{Max} = Q^{Max}$ was made. In reality, this assumption is not
necessary but reduces the possible solutions.
Now, if a linear micropipeline is given and does not satisfy the bounds requirement,
how can we effectively and efficiently modify this micropipeline such that its
output loop delay meets the requirement? Since a lower bound is easy to meet (by
arbitrarily reducing the magnitude of $D_{out}^{min}$), we are more interested in how to
reach the upper bound requirement, i.e., how to reduce the maximum output loop delay.
Four guidelines (or approaches) based upon expression (5.35) are addressed in this
section. The first approach is to increase each stage's lower bound. This approach is
obvious due to the "-" sign in front of the lower bound term $F_l^{min}$ in expression
(5.35). The second approach is to switch stages if allowed. The third approach is to
shift some delays from one stage to another. Finally, the fourth approach is to split a
stage into two or more stages. Usually, the above approaches can be combined to
effectively and efficiently reduce the magnitude of expression (5.35). An example will be
given to demonstrate these approaches in the following section. Among these approaches,
the stage-splitting approach is most commonly used and is further discussed below.
Figure 5.6 illustrates the effect of stage splitting. Splitting a stage i in a
micropipeline corresponds to splitting an equation (element) with its first term equal to
$DB_i^{Max}$ in expression (5.35). Two phenomena occur when a stage i is divided. First,
the magnitude of the original equation is larger than the magnitude of each of the split
equations. That is, in Figure 5.6, the magnitude of
$DB_i^{Max} + \sum_{l=i+1}^{m}(F_l^{Max} - F_l^{min})$ is larger than the magnitude of any
equation within the shaded area. Second, the magnitudes of the other equations
(corresponding to unsplit stages) in expression (5.35) remain unchanged. The above two
phenomena are based upon the assumption of
DBrax =
/=1
Br and DIFinin =
1=1
Dfrnfin, where n is the number of stages into
which stage i is split and Dkivir and DBmijin represent maximum and minimum total
stage-delays for new split stages. Given these phenomena, which stages should be
divided remains to be answered. Our experience in synchronous designs suggests that the
stage with the largest maximum total stage-delay $DB_i^{Max}$ is a good candidate.
Surprisingly, based upon our approach, this is not the case in asynchronous pipelines.
The stages whose corresponding magnitudes in expression (5.35) exceed the required upper
bound are all candidates for further division. Since a larger total stage-delay does not
necessarily imply a larger corresponding magnitude in expression (5.35), splitting the
stage with the largest maximum total stage-delay may not reduce the value of the maximum
upper bound. Generally speaking, splitting inappropriate stages won't change the result-

































+ ( Fr" Frn) )
1 =i +1
Figure 5.6: The equations corresponding to stage i being split into n stages.
$DB_i^{Max} + \sum_{l=i+1}^{m}(F_l^{Max} - F_l^{min}) > T^{Max} > DB_k^{Max} + \sum_{l=k+1}^{m}(F_l^{Max} - F_l^{min})$, splitting stage k
won't change the final value of expression (5.35). However, by splitting stage i, a design
that meets the requirement may be obtained.
How do we know if a stage-splitting approach will help to satisfy the upper
bound? Set the first term of each element whose magnitude exceeds the upper bound
in expression (5.35) to zero; if the magnitudes of the resulting equations still cannot
meet the requirement, stage-splitting does not help in reducing the upper bound at all.
That is, if $\sum_{l=1}^{k}(F_l^{Max} - F_l^{min}) > T^{Max}$, $1 \le k \le m$, for any k,
expression (5.35) will never be satisfied by simply using a stage-splitting approach. In
other words, if the environmental input and output delays cannot be split, then
stage-splitting helps only if $D_{out}^{Max} \le T^{Max}$ and
$D_{in}^{Max} + \sum_{l=1}^{m}(F_l^{Max} - F_l^{min}) \le T^{Max}$. All of these criteria
are based on the fact that the first term in (5.35) can be reduced by stage-splitting but
the second term cannot.
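The two questions raised above (which stages are candidates for splitting, and whether splitting can help at all) can be sketched in Python as follows. This is a minimal sketch under assumptions, not the thesis's procedure verbatim: each stage element of (5.35) is taken to be $DB_k^{Max}$ plus the forward-delay spreads of the downstream stages, splitting is assumed to leave the total spread unchanged, and all function names are hypothetical.

```python
from typing import List, Tuple

Stage = Tuple[int, int, int]  # (F_min, F_max, B_max), listed from input side to output side


def stage_elements(stages: List[Stage]) -> List[int]:
    """Element of (5.35) owned by each stage: DB_k_max plus downstream forward-delay spreads."""
    spreads = [f_max - f_min for f_min, f_max, _ in stages]
    return [f_max + b_max + sum(spreads[k + 1:])
            for k, (_, f_max, b_max) in enumerate(stages)]


def split_candidates(stages: List[Stage], t_max: int) -> List[int]:
    """1-based indices of the stages whose element exceeds the required upper bound."""
    return [k + 1 for k, element in enumerate(stage_elements(stages)) if element > t_max]


def splitting_can_help(d_in_max: int, stages: List[Stage], d_out_max: int, t_max: int) -> bool:
    """True if stage splitting alone could bring every element within t_max."""
    spreads = [f_max - f_min for f_min, f_max, _ in stages]
    # Setting the first term of a stage element to zero leaves only its second term.
    residuals = [sum(spreads[k + 1:]) for k in range(len(stages))]
    # The environmental elements cannot be split, so they must already meet the bound.
    env_ok = d_out_max <= t_max and d_in_max + sum(spreads) <= t_max
    return env_ok and all(r <= t_max for r in residuals)


# Stage bounds of the numerical example in Section 5.3 (backward delays are zero).
stages = [(7, 13, 0), (9, 18, 0), (6, 15, 0)]
print(stage_elements(stages))
print(split_candidates(stages, t_max=29))
print(splitting_can_help(3, stages, 2, t_max=29))
```

With the bounds of the Section 5.3 example and an upper bound of 29, only the first stage is reported as a candidate, even though the second stage has the largest maximum stage-delay.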
If the goal is to obtain a linear micropipeline whose output loop delay satisfies
the bound $T^{min} \le T_{m+1}(j+1) \le T^{Max}$, then it seems that fixed stage-delay
micropipelines can meet this goal very well and efficiently, compared with a variable
stage-delay micropipeline. This is because not only is the number of pipe-stages
($\lceil Q^{Max}/T^{Max} \rceil$) needed to implement a circuit smaller, but the hardware
also costs less. Most variable stage-delay micropipelines use a dual-rail encoding method,
for which hardware costs double those of fixed stage-delay micropipelines. Moreover, fixed
stage-delay micropipelines are even easier to implement. Then why do we need to consider
a variable stage-delay implementation at all? Because the average throughput, the other
performance design metric, is different. Fixed stage-delay micropipelines have fixed
throughput rates, but variable stage-delay micropipelines have variable throughput rates.
As will be seen in Chapter 6, variable stage-delay micropipelines may have better average
throughput than fixed stage-delay micropipelines. In real circuit design, both throughput
bounds and average throughput should be satisfied.
5.3 Numerical Example
In this section, a numerical design example is given to demonstrate the design
procedure stated in the previous section. Several approaches that help an existing
design meet the specification are also included in this section.
5.3.1 Original Design
Consider a combinational logic circuit having delay bounds expressed as
$10 \le Q \le 46$. Design a three-stage micropipeline such that its output loop delay has
bounds $1 \le T_4(j+1) \le 31$, assuming backward delay $B_i(j+1) = 0$, $1 \le i \le 3$,
$j \ge 0$.
From the design specification, we have $Q^{min} = 10$, $Q^{Max} = 46$, $T^{min} = 1$,
$T^{Max} = 31$ and m = 3. The design goal is to find coefficients $a_i$ and $b_i$,
$0 \le i \le 4$. From (5.87), we have
$\sum_{l=1}^{3} a_l \cdot 31 = 46 \Rightarrow a_1 + a_2 + a_3 = 46/31$
From (5.88), we have
b431











One set of coefficients that satisfies the above constraints is
$a_4 = 2/31$, $a_3 = 15/31$, $a_2 = 18/31$, $a_1 = 13/31$ and $a_0 = 3/31$,
$b_4 = 1/31$, $b_3 = 6/31$, $b_2 = 9/31$, $b_1 = 7/31$ and $b_0 = 2/31$.
Note that the value of $b_0$ is arbitrarily assigned as long as it is less than or equal to $a_0$.






Substituting these bounds into expressions (5.34) and (5.35), we have
$T_4(j+1) \ge 1$, and
$T_4(j+1) \le \mathrm{Max}(2,$
$15,$
$18 + (15 - 6),$
$13 + (15 - 6) + (18 - 9),$
$3 + (15 - 6) + (18 - 9) + (13 - 7))$
$= \mathrm{Max}(2, 15, 27, 31, 27) = 31.$
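The substitution above can be re-checked with exact rational arithmetic. The following Python sketch is an illustration only: the coefficient values are read off the worked substitution ($b_0$ is arbitrary, and 2/31 is used here as an assumption), the elements follow the coefficient form of (5.83), and the variable names are hypothetical.

```python
from fractions import Fraction

T_MAX, T_MIN, Q_MAX, M = 31, 1, 46, 3

# Coefficients a_0..a_4 and b_0..b_4 (b_0 is arbitrary as long as b_0 <= a_0).
a = [Fraction(n, 31) for n in (3, 13, 18, 15, 2)]
b = [Fraction(n, 31) for n in (2, 7, 9, 6, 1)]

# Conditions (5.84) and (5.85): a_i >= b_i and a_i <= 1 for every i.
assert all(ai >= bi for ai, bi in zip(a, b))
assert all(ai <= 1 for ai in a)

# The stage upper bounds F_1..F_m must add up to Q_MAX (the use made of (5.87) above).
assert sum(a[1:M + 1]) * T_MAX == Q_MAX

# Lower bound (5.82) and the elements of the upper bound (5.83).
lower = b[M + 1] * T_MAX
elements = [a[M + 1] * T_MAX] + [
    (sum(a[k:M + 1]) - sum(b[k + 1:M + 1])) * T_MAX
    for k in range(M, -1, -1)
]
upper = max(elements)

print(lower, [int(e) for e in elements], upper)
```

Using `Fraction` avoids rounding when the coefficients are multiplied back by $T^{Max}$, so the elements come out as exact integers.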




Now consider a change in the design requirement. The upper bound of the new
specification is changed from 31 to 29, that is, $1 \le T_4(j+1) \le 29$. We can either go
through the whole design procedure and find a new solution (coefficient set) that satisfies
the new requirement, or we can use the following approaches based upon the old solution
to reduce the maximum output loop delay.
Approach 1>: stage splitting
Although $F_2^{Max}$ (18) is the largest upper bound of stage-delays in the micropipeline,
splitting the second stage won't change the maximum output loop delay (31). This
is because the magnitude of the element corresponding to the second stage is 27, which is
less than 31. Instead, the stage whose corresponding element has a magnitude equal
to 31 should be split. In our example, the first stage should be split. If we divide it
into two stages such that
$7 \le F_1(j+1) \le 13 \Rightarrow 2 \le F_{1,2}(j+1) \le 7$, and
$5 \le F_{1,1}(j+1) \le 6$
The new maximum output loop delay becomes
$T_4(j+1) \le \mathrm{Max}(2, 15, 27, 25, 29, 27) = 29.$
Approach 2>: delay shifting
If we shift two units of the first stage's lower bound to the second stage's lower
bound, we then have
$5 \le F_1(j+1) \le 13$ and $11 \le F_2(j+1) \le 18$
The resulting maximum output loop delay becomes
$T_4(j+1) \le \mathrm{Max}(2, 15, 27, 29, 27) = 29.$
Approach 3>: lower bound increasing
By increasing the second stage's lower bound to 11, that is,
$11 \le F_2(j+1) \le 18$
the resulting maximum output loop delay changes to
$T_4(j+1) \le \mathrm{Max}(2, 15, 27, 29, 25) = 29.$
Approach 4>: stage switching
If we switch the first and second stages, such that
$9 \le F_1(j+1) \le 18$ and $7 \le F_2(j+1) \le 13$
the resulting maximum output loop delay becomes
$T_4(j+1) \le \mathrm{Max}(2, 15, 27, 28, 27) = 28.$
All of the above approaches lead to smaller maximum output loop delays. These
approaches can be combined to reduce the maximum output loop delay even more
efficiently and effectively. During the process of modifying a design, there are many
choices in each of the approaches. Some choices result in a smaller maximum output
loop delay. Others cause it to grow larger. Each of the above approaches may have
a side effect, namely making the average throughput smaller (or making the
average output loop delay larger). The best approaches or choices to adopt are those that
will make both average and maximum output loop delays meet the specifications.
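The element lists quoted for the first three approaches can be recomputed with the Python sketch below. It is a hedged illustration: the helper `loop_delay_elements` is a hypothetical name, backward delays are zero as in the example, the elements are listed in the same order as in the text (output side first), and for the stage-splitting case the order of the two sub-stages (the one bounded by [5, 6] placed upstream of the one bounded by [2, 7]) is an assumption chosen because it reproduces the quoted elements.

```python
from typing import List, Tuple


def loop_delay_elements(d_in_max: int, stages: List[Tuple[int, int]], d_out_max: int) -> List[int]:
    """Elements of the upper bound for zero backward delays.

    `stages` holds (F_min, F_max) pairs from the input side to the output side.
    The returned list starts with Dout_max, continues with the stage elements from
    stage m down to stage 1, and ends with the element owned by Din_max.
    """
    elements = [d_out_max]
    spread = 0  # accumulated (F_max - F_min) of the stages already visited
    for f_min, f_max in reversed(stages):
        elements.append(f_max + spread)
        spread += f_max - f_min
    elements.append(d_in_max + spread)
    return elements


D_IN_MAX, D_OUT_MAX = 3, 2

original = [(7, 13), (9, 18), (6, 15)]
split_first = [(5, 6), (2, 7), (9, 18), (6, 15)]   # approach 1 (sub-stage order assumed)
shifted = [(5, 13), (11, 18), (6, 15)]             # approach 2
raised_lower = [(7, 13), (11, 18), (6, 15)]        # approach 3

for name, design in [("original", original), ("stage splitting", split_first),
                     ("delay shifting", shifted), ("lower bound increasing", raised_lower)]:
    elements = loop_delay_elements(D_IN_MAX, design, D_OUT_MAX)
    print(name, elements, max(elements))
```

Keeping the whole element list, rather than only its maximum, shows which stage dominates after each modification and therefore which stage to address next.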
5.4 Summary
In this chapter, performance-related issues of a linear micropipeline with variable
stage-delays are discussed. Given the lower and upper bounds of each stage's forward
and backward delays, the micropipeline's lower and upper throughput bounds can be
easily obtained. The lower output loop delay bound is a sufficient condition for a
micropipeline to meet its lower specification; that is, the "=" sign might not occur under
certain conditions of combinational forward and backward delays for each stage. The
upper output loop delay bound, however, is a sufficient and necessary condition.
Therefore, we are able to find a micropipeline's worst performance (exact value). Based upon
representations of throughput bounds, a design procedure is introduced and several
design approaches are suggested to help achieve a new bound for an existing design
(apparently, there are ample solutions). Each design approach may have side effects,
making the average throughput smaller or the average output loop delay larger. The best
approaches or choices to adopt are those that will make both average and maximum
output loop delays meet the specifications.
6. PERFORMANCE ISSUES FOR
SYNCHRONOUS VS. ASYNCHRONOUS PIPELINES
Given a pipeline circuit, two types of performance parameter, average throughput
and throughput bounds, are usually of concern. In synchronous pipelines, average
throughput is equal to the throughput bounds (upper bound equal to lower bound), which
are limited by the worst stage-delay. The behavior of asynchronous pipelines
(micropipelines), however, is quite different. Since the throughput bounds (related to the
overall output loop-delay bounds) of asynchronous pipelines have already been discussed,
this chapter focuses on average throughput and compares it to the performance
of synchronous pipelines. Instead of adopting the approach from the probability
point of view [31,32], we are interested in using numerical forms to explain some
interesting phenomena based upon the model and expressions derived in previous chapters.
Note that the average throughput (or average output loop delay) discussed in Sections
6.3 and 6.4 is the "average" performance over a limited number of input data sets;
therefore, the conclusions drawn in these sections cannot be directly applied to the case
of an infinite number of input data sets.
The main expressions used for performance comparison in this chapter are (5.1)
and (5.2), which are restated in the following.
T:n+ i(j + 1) = Max(DF,n+ i(j + 1),
DB ,(j + 1),









DB rn_2(j + 1) + (Fi(j + 2)F i(j + 1))> Dr(j + 1),
1= m1 1 =mDB1(j+ 1) +
1=2




(F1(j + 2)Flj + 1))
(F1(j + 2)F1(j + 1))
k-











(111(j + 1 + k 1)131 + k1))DBk_ i(j +I)
1=k
DFm(j + 1m + k) +
DFm+i(jm + k) +





(Bi(j + 1 + k 1)131(j +k1))DBk_i(j + 1)
m-1
D12(j + 1 + k1),
1=k-1
(IV + 1k 1) + k1))DB k_j(j + 1)
D12(j + 1 + k1)) (5.2)
1=k-1
Generally speaking, each element (except for the first and the second) within
expression (5.1) consists of three terms. The first term $DB_i(j+1)$ (including the
environmental input and output delays), representing the sum of forward and backward
delays, is called the delay-sum (or backward total stage-delay, as opposed to the forward
total stage-delay $DF_i(j+1)$). The second term $F_l(j+2) - F_l(j+1)$, expressing the
difference of forward delays, is called the forward delay slope. As defined before, the
third term is the logic delay. The last two terms of (5.1) only occur in asynchronous
circuits and are the results of handshaking. Hence, they are called asynchronous (or
handshaking) parameters.
6.1 Performance Constraints
We will first compare the performance of a synchronous pipeline with that of a
fixed stage-delay asynchronous pipeline. This comparison is based upon the use of
edge-triggered registers: rising (or falling) edge-triggered registers for synchronous
circuits and double-edge-triggered registers [29] for asynchronous circuits.
Double-edge-triggered registers are used in asynchronous circuits because a
micropipeline uses a two-phase handshaking scheme. Details on how clock skew affects
the clock period in synchronous pipelines can be found elsewhere [33]. Some restrictions
are briefly summarized below.
$$2T_{skew,max} \le T_{reg} + T_{change} - T_{hold} \qquad (6.1)$$
$$T_{period} > T_{reg} + T_{settle} + T_{setup} + 2T_{skew,max} \qquad (6.2)$$
where $T_{period}$: clock period
$T_{skew,max}$: maximum clock skew
$T_{reg}$: register propagation delay
$T_{hold}$: register hold time
$T_{setup}$: register setup time
$T_{change}$: the time for the first output bit of the combinational logic to change
$T_{settle}$: combinational logic settle time
Expression (6.1) is required to prevent the same input signal from being latched by two
consecutive registers at the same clock edge (with phase shifted due to clock skew).
The clock period should satisfy condition (6.2) in order to latch valid (stable) data.
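As a minimal sketch of how the two synchronous constraints might be checked numerically, the following Python fragment assumes that (6.1) reads 2*T_skew,max <= T_reg + T_change - T_hold and that (6.2) reads T_period > T_reg + T_settle + T_setup + 2*T_skew,max. The function names and all numeric values are illustrative assumptions and do not come from the thesis.

```python
def hold_constraint_ok(t_skew_max, t_reg, t_change, t_hold):
    """Constraint (6.1), read here as 2*T_skew,max <= T_reg + T_change - T_hold."""
    return 2 * t_skew_max <= t_reg + t_change - t_hold


def min_clock_period(t_skew_max, t_reg, t_settle, t_setup):
    """Right-hand side of (6.2); the clock period must exceed this value."""
    return t_reg + t_settle + t_setup + 2 * t_skew_max


# Hypothetical timing values in arbitrary units, chosen only for illustration.
small_skew = hold_constraint_ok(t_skew_max=0.5, t_reg=1.0, t_change=1.0, t_hold=0.5)
large_skew = hold_constraint_ok(t_skew_max=2.0, t_reg=1.0, t_change=1.0, t_hold=0.5)
period_bound = min_clock_period(t_skew_max=0.5, t_reg=1.0, t_settle=6.0, t_setup=0.5)
```

With these made-up numbers the small-skew case passes the hold check and the large-skew case fails it, which mirrors the failure mode that clock skew introduces.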
On the right-hand side graph of Figure 3.4(a), the physical delay d is lumped with
the delays in both the data-token and space-token paths. To demonstrate the performance
constraints clearly, it is better to separate d from these delays, as shown in Figure 6.1, i.e.
$F_i = F'_i + d$ and $B_i = B'_i + d$. From Figure 6.1, there are two constraints for micropipelines
to work properly, as shown below.
$$B'_i + d + T_{reg} + T_{change} \ge T_{hold} \qquad (6.3)$$
$$F'_i + d \ge T_{reg} + T_{settle} + T_{setup} \qquad (6.4)$$
To prevent new data from overwriting old data that have not yet been latched by the next
register, a relationship like (6.3) should be enforced. Expression (6.4) must be satisfied
to avoid violating the bundled-data convention for micropipelines. Expressions
(6.3) and (6.4) are based upon the assumption that both edge-triggered and
double-edge-triggered registers have the same $T_{reg}$, $T_{hold}$ and $T_{setup}$.
Usually $B'_i$ is zero for simple micropipelines, as we will assume here. In asynchronous
pipelines, expression (6.3) can easily be satisfied in practice. However, it is
difficult to satisfy expression (6.1) in synchronous pipelines. This is because, as die size
grows and transistor feature size continues to shrink, wiring delays play an increasingly
important role compared with gate delays. That is, clock skew (phase delay due to long
wiring) may be longer than $T_{reg}$ and/or $T_{change}$ and may cause expression (6.1) to fail.
From expression (6.2), the throughput of a synchronous pipeline (best case) is
$1/T_{period} = 1/(T_{reg} + T_{settle} + T_{setup} + 2T_{skew,max})$, compared to
$1/(F_i + B_i) = 1/(T_{reg} + T_{settle} + T_{setup} + d)$ for an asynchronous pipeline. The relative value of
$2T_{skew,max}$ and d determines whether asynchronous pipelines are better than, equal to or worse
than their synchronous counterparts. If $2T_{skew,max} = d$ is assumed, circuits implemented
using micropipelines [9] have performance comparable to the same circuits
implemented using synchronous pipelines.
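The throughput comparison above can be sketched in a few lines of Python. The sketch assumes the two throughput expressions read 1/(T_reg + T_settle + T_setup + 2*T_skew,max) and 1/(T_reg + T_settle + T_setup + d); the function names and the numeric values are hypothetical.

```python
import math


def sync_throughput(t_reg, t_settle, t_setup, t_skew_max):
    """Best-case synchronous throughput, taken from the bound in (6.2)."""
    return 1.0 / (t_reg + t_settle + t_setup + 2.0 * t_skew_max)


def async_throughput(t_reg, t_settle, t_setup, d):
    """Fixed stage-delay micropipeline throughput, 1 / (F_i + B_i)."""
    return 1.0 / (t_reg + t_settle + t_setup + d)


def compare(t_reg, t_settle, t_setup, t_skew_max, d):
    """Report which style is faster for the given (hypothetical) timing values."""
    s = sync_throughput(t_reg, t_settle, t_setup, t_skew_max)
    a = async_throughput(t_reg, t_settle, t_setup, d)
    if math.isclose(a, s):
        return "equal"
    return "asynchronous faster" if a > s else "synchronous faster"


# Sweep the C-element delay d around 2*T_skew,max = 1.0 (illustrative values).
verdicts = [compare(1.0, 6.0, 0.5, 0.5, d) for d in (0.5, 1.0, 2.0)]
```

Only the relative size of d and 2*T_skew,max changes the verdict, since the remaining terms are common to both denominators.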
Figure 6.1: One stage of a micropipeline.
6.2 Output Loop Delay < Worst Stage-Delay
It is commonly believed that asynchronous pipelines can achieve better average
performance than synchronous pipelines. The main difference between them is that
in synchronous circuits the next data cannot be processed until the clock has arrived, whereas
in asynchronous circuits the next data can potentially be processed right after the completion
of the current data processing. Therefore, the performance of synchronous circuits is
bounded by the worst-case delay of a combinational logic block, whereas the performance
of asynchronous circuits is bounded by the "average" delay. Usually, this "average" sense
comes only from the worst-case and best-case physical delays. For example, for a
non-pipelined n-bit ripple-carry adder, the time needed to process data that need
no carry ripple (best case) is much less than the time required to process data
that require n carry propagations (worst case). Therefore, the "average" propagation time
over a large set of data lies in between. In this section, it is pointed out that the "average"
sense as stated above is not thorough, especially when a pipeline is considered. More
factors affecting "average" performance will be described in this section.
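The ripple-carry adder example can be illustrated with a small Python experiment. This is a rough sketch, not a timing model: it uses the length of the longest carry chain as a stand-in for the data-dependent propagation delay, and the operand width, sample count and random seed are arbitrary assumptions.

```python
import random


def longest_carry_chain(a, b, n):
    """Longest run of bit positions a single carry travels through when
    adding two n-bit operands (0 means no carry is ever generated)."""
    carry = 0
    chain = 0
    longest = 0
    for k in range(n):
        x = (a >> k) & 1
        y = (b >> k) & 1
        if x & y:
            # A carry is generated here, independent of any incoming carry.
            carry, chain = 1, 1
        elif (x ^ y) and carry:
            # The incoming carry propagates through one more position.
            chain += 1
        else:
            # The carry is absorbed, or there was none to begin with.
            carry, chain = 0, 0
        longest = max(longest, chain)
    return longest


n = 16
rng = random.Random(0)  # fixed seed so the experiment is repeatable
samples = [longest_carry_chain(rng.getrandbits(n), rng.getrandbits(n), n)
           for _ in range(2000)]
average = sum(samples) / len(samples)
best_case = longest_carry_chain(0, 0, n)
worst_case = longest_carry_chain((1 << n) - 1, 1, n)
```

For random operands the average chain length falls strictly between the best case (no carry) and the worst case (a carry rippling through all n positions), which is the "average" sense referred to above.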
Consider, for example, a simple pipeline where the backward delay, $B_i(j+1)$, is
ignored. If the input and output environmental delays can be made arbitrarily small
compared to $F_i(j+1)$, $1 \le i \le m$, the first and the last elements in expression (5.1)
cannot be the maximum value. Moreover, if the rest of the elements in expression
(5.1) have the same probability (this assumption is, in general, not true) of being
selected as the representation of output loop delay, the expectation of output loop delay
(considering delay-sum only) for an m-stage pipeline is the sum of the product of each
stage-delay expectation and its selection probability (equation (6.5)),
where $E[T_{m+1,\text{delay-sum}}]$: expectation (mean) of the delay-sum term of output loop
delay
$E[F_i]$: expectation (mean) of stage-delay for stage i
$p$: probability of each element being selected in expression (5.1)
It appears that equation (6.5) matches the "average" sense for asynchronous pipelines.
However, equation (6.5) is only one of the many factors that make asynchronous
pipelines faster. From expression (5.1), we also notice that the forward delay slope and logic
delay may affect the performance of asynchronous pipelines. The forward
delay slope can be positive, zero or negative. If it is negative, the output loop delay will
be reduced. This implies that not only the delay at instant j, but also the delay difference
(or delay pattern) between adjacent values of j, affects performance. Since logic delay is never
less than zero, this term always contributes positively toward obtaining better performance,
based on expression (5.1). However, if expression (5.17) is considered, logic
delay makes the performance of asynchronous pipelines worse. Both statements are
correct. It is the approach (left-hand side or right-hand side) that makes these two
statements look contradictory.
Figure 6.2 shows the results of a two-stage pipeline simulation for a case in
which the average output loop delay (11.3242) is less than the maximum stage-delay (20).
To simplify the observation and to clarify the simulation result, only the delay
of the first stage in this micropipeline is set to be variable, as shown in Figure 6.2(a)
(assuming $B_1(j+1) = 0$ for all j), while the other delays are set to be constant, i.e., $D^{in} = 9$,
$D_2 = 4$ and $D^{out} = 11$. Figure 6.2(c) lists the specific situations when j = 9 and 10, in which
the output loop delay is $T_3(j+1) = D^B_1(j+1) - D^l_3(j+1)$ according to expression
(5.1) for m = 2. Interestingly, comparing Figure 6.2(a) and Figure 6.2(b), we notice that
the maximum stage-delay (20) is not reflected in the output loop delay with the same
magnitude. That is, even if a micropipeline stage has a larger delay at a given instant





Quantity                   j = 9     j = 10
$B_1(j+1) + F_1(j+2)$      20        12.92
$F_1(j+2) - F_1(j+1)$      8.74      -7.08
$D^l_2(j+1)$               0         0
$D^l_3(j+1)$               6.74      0
Output loop delay          13.26     12.92
(c)
Figure 6.2: Simulation results with $D^{in} = 9$, $D_2 = 4$ and $D^{out} = 11$.
(a) Delay pattern of $F_1(j+1)$ (assuming $B_1(j+1) = 0$).
(b) Output loop delay $T_3(j+1)$.
(c) Several delay values when j = 9 and 10.
example, $T_3(10) = D^B_1(10) - D^l_3(10) = 20 - 6.74 = 13.26$ when j = 9. We speculate
that only part of the maximum stage-delay (13.26) is passed to the output (or environment)
and the rest of it (20 - 13.26 = 6.74) is "absorbed" by its handshaking C-element.
The delay (6.74) absorbed by the C-element is not detectable by the environment,
hence leading to better performance. This kind of performance behavior does not
occur in synchronous pipelines.
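The arithmetic behind the "absorbed" delay can be replayed with the two instants listed in Figure 6.2(c). The sketch below assumes the m = 2 element of expression (5.1) reduces to the delay-sum minus the logic delay of the last C-element; the function and variable names are illustrative.

```python
def output_loop_delay(delay_sum, logic_delay):
    """Delay-sum term minus the logic delay, as read from (5.1) for m = 2."""
    return delay_sum - logic_delay


# (delay-sum, logic delay) pairs for j = 9 and j = 10, from Figure 6.2(c).
instants = {9: (20.0, 6.74), 10: (12.92, 0.0)}
loop_delays = {j: output_loop_delay(ds, ld) for j, (ds, ld) in instants.items()}

# Portion of the 20-unit stage-delay that never reaches the environment.
absorbed_at_j9 = instants[9][0] - loop_delays[9]
```

At j = 9 the environment sees 13.26 rather than 20, and the difference equals the logic delay of 6.74.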
From the above discussion, we demonstrated that both the delay-sum term and
the asynchronous parameters in expression (5.1) influence the performance of asynchronous
pipelines. Considering the effect of the asynchronous parameters gives an added
incentive to employ the asynchronous design approach over the synchronous approach
when negative values of the asynchronous parameters are guaranteed. This is because
the output loop delay becomes the delay-sum term minus the magnitudes of the asynchronous
parameters, making the "average" delay even smaller. A different approach
to describing the same result presented in this section can be found in [34].
6.3 Average Output Loop Delay > Worst Stage-Delay
Several reasons why asynchronous pipelines can perform better than synchronous
pipelines have been explored. Since in practical applications
the number of input data is finite, we will address the next question based on this
assumption: do asynchronous pipelines always perform better than their synchronous
counterparts? Again, a two-stage linear micropipeline is investigated for clarity. The output
loop delay representation is simplified by assuming $D^{in}(j+2)$, $D^B_2(j+1)$ and
$D^{out}(j+1)$ to be constants and the backward delay $B_1(j+1) = 0$ for all j. Figure 6.3
shows the simulation result with $D^{in} = 30.8$, $D_2 = 4$ and $D^{out} = 15$. From Figure 6.3(a) and
Figure 6.3(c), it is found that the average output loop delay (31.4300) happens to be





















Figure 6.3: Simulation result with $D^{in} = 30.8$, $D_2 = 4$ and $D^{out} = 15$.
(a) Delay pattern of $F_1(j+1)$.
(b) Forward delay slope $F_1(j+2) - F_1(j+1)$.
(c) Output loop delay $T_3(j+1)$.
notice that this is caused by the overall effect of the positive forward delay slope
$F_1(j+2) - F_1(j+1)$, as shown in Figure 6.3(b). For example, when j = 8, from the
simulation result we found that $D^l_2(9) = D^l_3(9) = 0$ (not shown in Figure 6.3) and
$T_3(9) = D^{in} + (F_1(10) - F_1(9)) - (D^l_2(9) + D^l_3(9)) = 30.8 + 9.53 - 0 = 40.33$ (by
trial and error, we found that this element of expression (5.1) matches our simulation result).
The positive forward delay slope of 9.53 when j = 8 makes the output loop delay for
the same j larger than the maximum stage-delay by 9.33 (= 40.33 - 31).
From this simulation, we observed that the average output loop delay for asynchronous
pipelines may be larger than the maximum stage-delay for a limited set of
input data. That is, asynchronous pipelines may have worse average performance
than synchronous pipelines, given a set of input data.
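The j = 8 instant above can be reproduced with a short Python sketch. It assumes the relevant element of expression (5.1) has the form "delay-sum plus the forward delay slopes minus the logic delays"; the helper name and argument layout are illustrative assumptions.

```python
def loop_delay_element(delay_sum, forward_slopes, logic_delays):
    """One element of (5.1): delay-sum + sum of forward delay slopes
    - sum of logic delays."""
    return delay_sum + sum(forward_slopes) - sum(logic_delays)


# Values quoted for j = 8: D_in = 30.8, forward delay slope 9.53, logic delays 0.
t3_at_j8 = loop_delay_element(30.8, [9.53], [0.0, 0.0])

# Amount by which the output loop delay exceeds the maximum stage-delay (31).
excess = t3_at_j8 - 31.0
```

A single positive slope of 9.53 is enough to push the output loop delay 9.33 above the worst stage-delay at that instant.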
6.4 Effect Of Delay Sequence (Pattern)
The discussions in sections 6.2 and 6.3 verify the statement that both the delay-sum
and the asynchronous parameters affect the performance of asynchronous pipelines.
Moreover, it has been shown that the average output loop delay can be less than, greater
than or equal to (this is clearly possible, although not mentioned in previous sections)
the maximum stage-delay, given a set of input data. In other words, when the number
of input data is finite, asynchronous pipelines can have better, worse or equal performance
compared with synchronous pipelines, depending on the magnitude of each stage-delay and
its pattern in time. The simulation result in Figure 6.4 shows how the delay pattern of
the forward delay $F_1(j+1)$ and backward delay $B_1(j+1)$ affects the output loop delay.
This simulation is done for a two-stage micropipeline with $D^{in} = 12.8$, $D_2 = 4$ and
$D^{out} = 11$. In the leftmost graph of Figure 6.4, six different cases of output loop delays
are plotted against time. The right-hand side graphs of Figure 6.4 show the correspond-













Figure 6.4: Simulation result with $D^{in} = 12.8$, $D_2 = 4$ and $D^{out} = 11$ for different
cases of forward and backward delay sequences (patterns).
Average output loop delays: case a, 24.3975; case b, 23.5102; case c, 23.6036;
case d, 22.6984; case e, 25.5711; case f, 27.3121.
different j) of $F_1(j+1)$ and $B_1(j+1)$ for all these cases (except for cases a, b and c,
where either $F_1(j+1)$ or $B_1(j+1)$ or both have constant values equal to the average
of these delay elements) are the same except that their orderings are different.
Comparing the average output loop delay for case a and case b, we observed that
decreasing $F_1(j+1)$ monotonically leads to better overall performance. Conversely,
by comparing case a and case c, we found that increasing $B_1(j+1)$ monotonically makes
circuits faster. Case d shows that the average output loop delay is made even smaller
by decreasing $F_1(j+1)$ and increasing $B_1(j+1)$ monotonically. The opposite slopes
of $F_1(j+1)$ and $B_1(j+1)$ applied to this pipeline, as shown in case e, slow down circuit
operation. Among these simulations, case f gives the worst performance.
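The ordering of the six cases can be checked directly from the averages reported in Figure 6.4. The snippet below is only a bookkeeping aid; the dictionary layout and variable names are assumptions.

```python
# Average output loop delay per case, as reported in Figure 6.4.
averages = {
    "a": 24.3975,
    "b": 23.5102,
    "c": 23.6036,
    "d": 22.6984,
    "e": 25.5711,
    "f": 27.3121,
}

# A smaller average output loop delay means better performance.
ranking = sorted(averages, key=averages.get)
best, worst = ranking[0], ranking[-1]

# Improvement of each single-pattern case over the constant-delay case a.
gain_b = averages["a"] - averages["b"]
gain_c = averages["a"] - averages["c"]
```

The ranking confirms that case d is the fastest and case f the slowest, with cases b and c each improving on case a.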
In a certain sense, we can explain from expressions (5.1) and (5.2) why case d
gives the best performance among these cases. As discussed in section 6.2, a negative
forward delay slope and a positive logic delay will potentially make the output loop delay
smaller. Decreasing $F_1(j+1)$ monotonically means a negative forward delay slope,
which implies that better performance might be achieved. If the term
$B_l(j+1+k-l) - B_l(j+k-l)$ in expression (5.2), called the backward delay slope, is positive, a more
positive logic delay is achieved. Increasing $B_1(j+1)$ monotonically guarantees this. One
should note that monotonically decreasing and increasing the forward and backward
delays is not practical in designing real circuits. The main purpose of this section is
to show that the delay pattern does affect the average output loop delay of asynchronous
pipelines. In comparison, there is no such effect in synchronous pipelines. One also
should note that the three terms in expression (5.1) are not independent of each other.
Hence, expression (5.1) should not be used as a guideline to design a high-performance
(average throughput) asynchronous circuit. Establishing an algorithm to arrive at an
optimized design (best performance) for an asynchronous pipeline is not trivial. This
is because the delay pattern in each stage is nondeterministic. The study of asynchronous
pipelines from a probabilistic point of view might be a good start in formalizing
design guidelines to obtain the optimized average throughput [31,32].
6.5 Effect Of Number Of Stages
The effect of delay-sum, forward delay slope and logic delay on the performance
of asynchronous pipelines has been presented in previous sections. However, from
expression (5.1), there remains one more parameter that may affect performance. This
is m, the number of stages in a pipeline. An interesting example demonstrating how
the number of pipeline stages may affect performance is given here. It is well known that
the number of identical stages cascaded to form a synchronous pipeline does not change
the overall pipeline's performance (ignoring clock skew). Is the same statement applicable
to asynchronous pipelines? Figure 6.5(b) shows the output loop delays of a two-stage
and a three-stage asynchronous pipeline, with each stage's forward and backward
delays for both pipelines shown in Figure 6.5(a). From our simulation result
with $D^{in} = 12.3$ and $D^{out} = 11.5$, these two pipelines clearly have different output loop delays.
This is because m determines the number of summation terms having an effect on per-








$$T_4(j+1) = D^{in} + \sum_{i=1}^{3}\big(F_i(j+2) - F_i(j+1)\big) - \sum_{i=2}^{4} D^l_i(j+1)$$
$T_4(j+1)$ contains two more terms, $F_3(j+2) - F_3(j+1)$ and $D^l_4(j+1)$, than


















Figure 6.5: Simulation result with $D^{in} = 12.3$ and $D^{out} = 11.5$.
(a) Delay pattern of $F_i(j+1)$ and $B_i(j+1)$ used for all stages.
(b) Output loop delays for both 2-stage and 3-stage micropipelines.
and $D^l_i(j+1)$ in $T_3(j+1)$ are different from those in $T_4(j+1)$. Note that the results
shown in Figure 6.5 do not include the effects of C-element physical delay and register
propagation delay.
6.6 Summary
The main purpose of this chapter is to demonstrate the performance difference
between synchronous and asynchronous pipelines. The conclusion drawn in this chapter
for a fixed stage-delay micropipeline is that it has constant performance, which
is determined by the maximum stage-delay within the whole micropipeline. Moreover,
its performance is equivalent to that of a synchronous pipeline as long as $2T_{skew,max} = d$,
where d is the physical delay of a C-element. As for the performance of variable
stage-delay micropipelines, we showed that not only stage-delay (delay-sum) but also
stage-delay pattern (the asynchronous parameters: forward delay slope and logic delay) affects
performance. This leads to the conclusion that asynchronous pipelines may have better,
worse or equivalent average performance compared with synchronous pipelines, given a finite
set of input data. The results in this chapter also show that the number of
identical stages cascaded affects the resulting performance.
7. PERFORMANCE EVALUATION OF TWO-DIMENSIONAL
ASYNCHRONOUS PIPELINES
Linear micropipelines are also called one-dimensional micropipelines. The performance
of this kind of pipeline for both fixed stage-delay and variable stage-delay
cases has been studied in previous chapters. If the pipeline is initially reset, the upper
bound of output loop delay calculated using our approach is tight (exact) for both cases.
In this chapter the performance of two-dimensional micropipelines, based upon the linear
pipeline model, is discussed. The two-dimensional pipelines presented in this chapter
are constructed from all of the modules depicted in Chapter 3, Figure 3.5.
Since micropipelines feature modularity and composability in circuit construction,
we would like to exploit this feature as much as possible in evaluating their performance.
This feature reduces the complexity of calculating performance when considering a
larger system. To facilitate the calculation while maintaining the accuracy of the
approximation, the Equal loop-delay theorem is used to derive performance estimations for each
basic module. The result is a set of difference equations. Then, the given stage-delay
bounds are applied to these difference equations to obtain the approximate upper and lower
bounds. When evaluating the performance of a system constructed from these basic
modules, the results (in terms of stage-delay bounds) obtained for each basic module
can be applied directly to the system, instead of constructing huge expressions
(difference equations) describing the system in terms of j.
It seems that the concept of equivalent input and output delays is not so important
when linear pipelines are discussed. One reason is that the output loop delay
can be obtained directly using Eq. (5.1) without utilizing the concept of equivalent input
and output delays. Of course, the same result can be reached using equivalent
input and output delays. As will be seen in the next chapter, the concept of equivalent
input/output delay becomes more important when a two-dimensional pipeline system
is treated. As a result, the expressions for equivalent input and output delays for
each module will be presented.
7.1 Performance Of Asynchronous Pipelines With Fork
A Fork pipeline, formed by an Event AND module, is shown in Figure 7.1(a). The
equivalent input and output delays ($D^{in}_{i-1}(j+2)$, $D^{out}_{i+1,q}(j+1)$ and $D^{out}_{i+1,p}(j+1)$) in this
Fork figure are used to represent environmental input and output delays. The performance
of Fork pipelines is developed using the equivalent delay technique, as shown in
Figure 7.1(b). The result can be applied to Fork pipelines with any number of stages. Note that
to save space, we are only interested in upper bound calculations.
The equivalent input delay $D^{in}_{i,q}(j+2)$ can be represented as follows.
Diing(j + 2) = Dr(j + 1) + Bi(j + 1) + D42(j + 2) + Fi(j + 2) (7.1)
It should be noted that Dr(j + 1) and D12(j + 2) are not independent.Instead, their
relationship is governed by an equation obtained by applying Equal loopdelay theorem
to Ci.That is,
D;" i(j + 2) + D24(j + 2)
= Fi(j + 1) +[44(j + 1) + Dr(j + 1) + Bi(j + 1) + D12(j + 2) (7.2)
+
= Max(0, ± 2)Fi(j + 1)E;14(j + 1) Dr(j + 1) Bi(j + 1))
Therefore,
Dr(j + 1) + 142(j + 2)
= Max(Dr(j + 1), DI" i(j + 2)Fi(j + 1)E44(j + 1)Bi(j + 1)) (7.3)
Note that N4(j + 1) is a shorthand of Dr+1/q(j + 1).
By applying the Equal loop-delay theorem to Event AND $C_a$ in Figure 7.1(a), we
have
Figure 7.1: A two-dimensional micropipeline: Fork.
(a) General representation of pipelines with Fork.
(b) Equivalent circuit when looking into cross sections q and p.
(c) Equivalent circuit when looking into cross section r.
B i(j + 1) + D42(j + 2) + F (j + 2) + Dr(j + 2) +(j + 2)
= B+ 1) + D42(j + 2) + F i(j + 2) + 14,4(j + 2) + Diyj + 2)
D'aiv(j + 2) + Dr(j + 2) = DPaq(j + 2) + D.1,4(j + 2)
Dr(j + 2) = Max(0, D;,4(j + 2)Dr(j + 2)) .5 Max( 0, D2p4(j + 2))(7.4)
DPaq(j + 2) = Max(0, Dr(j + 2)Df,4(j + 2)) < Max(0, Dr(j + 2))(7.5)
Note that DZP(1) = DT(1) = 0 if the pipeline is initially reset. Also note that all delays
in Figure7.1are variable.
From expression (7.4),
Dr(j + 2) 5 Max(0, 14,4(j + 2)) and
Di,4(j + 2)Max( 0,
Dr(j1)DB i(j + 1)DPaq(j + 1)logic delay)
Max(0,
Dr(j ±)DB i(j + I )DPaq(j + 1))
Max(0,
Dr(j + 1)DB i(j + 1))
DqaP(j + 2)Max(0,
Bol,ut(j + 1 )DB i(j + 1))





Considering(7.1), (7.3)and the initial condition whereD94(1)= DZP(1) = 0, we have
Dzi7q(2) = Max(B/I) + F/2),
Din 1(2)F i( I ) + F i(2))
Max(Dinq(2))ii0 = Max(Frax + BM"",
+ Fr ax (7.9)
From (7.1), (7.3) and (7.7), we have
Diin,q(j + 3)= Dr(j + 2) + Bi(j + 2) +1)12(j + 3) + F i(j + 3)
sMax(Fi(j + 3) + Bi(j + 2),
Dr(j + 1)F i(j + 2) + Fi(j + 3)Bi(j + 1) + Bi(j + 2),
Diin(j + 3) + Fi(j + 3)Fi(j + 2))
Max(Dijn,q(j + 3S Max( Frax + Bra x,
Dout,MaxL.FraxFrinBraxffrin
Din,maxFraxpinin) 1
Combining expressions (7.9) and (7.10), we conclude
Max(Dijnij + 2))1i S Max( Frax + Bra x,




= UP/(D iij + 2))li (7.11)
And, the output loop delay will be
Max(71+ 1,q(j + 1 ))Ii s Max( Irqui'max, UP 1(Drq(j + 2))li) (7.12)
Similarly, we have




= UP/(Diin,p(j + 2))1i (7.13)
Max(71+ 1,p(j + 1))1j s Max( Dp"tAlax, UP j(Din,p(j + 2))1j) (7.14)
It is useful to obtain the equivalent output delay by looking into cross section
v, as indicated in Figure 7.1(a), when the system becomes more complicated, as will be
seen in the next chapter. The equivalent output delay $D^{out}_i(j+1)$ can be represented
as follows.
Dra(j + 1) = Fi(j + 1) + Dr(j + 1) + Dr° + 1) + Bi(j + 1), or (7.15)
Drt(j + 1) = Fi(j + 1) + Di,4(j + I) + DfNj + I) + Bi(j + I) (7.16)
/30(1ut(1) = Fi(1) + Bi(1)
Max(Dri(1))1i=0FraxBrax (7.17)
From (7.15), we have
Drt(j + 2) = Fi(j + 2) + Dr(j + 2) + DqaP(j + 2) + Bi(j + 2) (7.18)
Since Dr(j + 2) = Max(0,144(j + 2)Dr(j + 2)) from (7.4), we have
Dr(j + 2) + Dr(j + 2) = Max(Dr(j + 2), DP(j + 2)) (7.19)
The above expression implies that only maximum logic delay (Di4(j + 2) or /44(j + 2))
can be detected when looking into cross section r.That is, logic delay D'aiP(j + 2) or
+ 2) can not be detected. This property makes a system easier to analyze and the
approximation result is more accurate. If the initial condition is also taken into account,
the equivalent circuit can be viewed as shown in Figure 7.1(c). Due to the symmetry
of Fork circuits and from (7.6), (7.18) and (7.19), we have
D7ut(j2)Max( Fi(j + 2) + 13,(j + 2).
Dr(j )Bi(s) + 1) + Bi(j + 2),
Drto + 1)Bi(j + 1) + Bi(j + 2)) (7.20)
Max(Dr(j + 2))1i s Max( Fr" + Brax,
DoutmaxBrayBripin
DOut,MaxBraxB7lin)
Combining (7.17) and (7.21), we conclude





As was done for linear micropipelines, we would like to compare the output loop
delay bounds (UP1,q and UP1,p) based on our approach with the exact upper bounds (Max,q
and Max,p) obtained using CTSE. Figure 7.2(a) shows the Fork circuit and Figure 7.2(b)
depicts its corresponding Petri net model. One dummy transition (Aint) and two dummy
places (6 and 10) are artificially added to the Petri net to model the fork behavior (one
input token generates two output tokens). To model performance correctly, the delays
of these two dummy places are set to zero. All of the C-element physical delays are
assumed to be zero. Since only the upper bound is of concern in most applications,
the simulation results in Table 7.1 show upper bound comparisons only. As observed
from this table, our approximations are greater than or equal to the exact upper bounds,
as expected. Figure 7.2(c) shows this comparison based upon Table 7.1.
Table 7.1: Comparison of our approximations with the exact bounds
using a general Fork pipeline as an example.
                   1        2        3        4        5        6        7        8        9        10
D^in_{i-1}         [2 15]   [26 40]  [2 15]   [26 30]  [26 30]  [6 10]   [2 15]   [6 30]   [16 35]  [2 15]
F_i                [1 10]   [20 22]  [1 10]   [20 22]  [20 22]  [10 11]  [1 13]   [10 31]  [20 31]  [1 10]
B_i                [3 18]   [4 15]   [3 8]    [4 15]   [4 15]   [4 7]    [3 10]   [14 17]  [14 17]  [3 8]
D^out_{i+1,q}      [29 35]  [12 14]  [25 30]  [20 54]  [20 34]  [10 55]  [30 35]  [10 35]  [10 35]  [25 30]
D^out_{i+1,p}      [21 25]  [12 14]  [21 24]  [22 44]  [22 44]  [22 44]  [39 44]  [22 24]  [22 44]  [31 33]
Max,q              35       42       30       57       57       55       63       51       58       47
UP1,q              49       42       38       57       57       55       63       51       58       47
Max,p              59       42       44       67       47       59       44       59       49       33
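The claim that the approximation never falls below the exact bound can be checked mechanically for the q branch, using the Max,q and UP1,q rows of Table 7.1. The following Python sketch only restates the tabulated numbers; the variable names are illustrative.

```python
# Exact upper bounds (Max,q) and approximations (UP1,q) for cases 1 to 10,
# copied from Table 7.1.
exact_q = [35, 42, 30, 57, 57, 55, 63, 51, 58, 47]
approx_q = [49, 42, 38, 57, 57, 55, 63, 51, 58, 47]

# Slack of the approximation over the exact bound, per case.
slack = [up - mx for up, mx in zip(approx_q, exact_q)]

# The approximation is a valid upper bound if no slack is negative.
is_upper_bound = all(s >= 0 for s in slack)

# Case numbers (1-based) where the approximation coincides with the exact bound.
tight_cases = [k + 1 for k, s in enumerate(slack) if s == 0]
```

In this data set the approximation is tight in eight of the ten cases and overestimates the exact bound only in cases 1 and 3.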







































Figure 7.2: A Fork pipeline and its Petri net model.
(a) General representation of a Fork pipeline.
(b) Petri net model.
(c) Result comparison.
7.2 Performance of Asynchronous Pipelines with Join
A Join pipeline, merging two pipelines into one by using an Event AND module,
is shown in Figure 7.3(a). The equivalent input and output delays in this Join figure are
used to represent the other possible circuits. We will develop the performance of Join
pipelines using the equivalent delay technique. The result can be applied to Join
pipelines with any number of stages.
The equivalent input delay $D^{in}_i(j+2)$ can be represented as follows.
Dir(j + 2) = B i(j + 1) + + 2) + Dr(j + 2) + F i(j + 2), or (7.23)
Diin(j + 2) = B1(j + 1) + DI:p2(j + 2) + DPaq(j + 2) + F1(j + 2) (7.24)
By applying the Equal loopdelay theorem to Event AND Ca in Figure 7.3(a), we have
Fi(j + 1) + 1;0_4v i(j + 1) + Bi(j + I) + 142q(j + 2) + Dr(j + 2)
= Fi(j + 1) + D;3+4 1(1 + 1) + B i(j + 1) + D1p2(j + 2) + DiaNj + 2)
1_-YalP(j + 2) + Drq(j + 2) = Tnj + 2) + D1p2(j + 2)
Dr(j + 2) = Max(0,p2(j + 2)D42q(j + 2)) < Max(0, + 2))(7.25)
Dg(j + 2) = Max(0, rY4q2(j + 2)131p2(j + 2))Max(0, D12q(j + 2))(7.26)
Note that both Dr(/) and Dr(1) are not equal to zero in general and they will affect
exact throughput bounds. Their values can be obtained by finding the time difference
for the first token (one at each pipe branch) to travel from environment to the input ends
of Celement Ca. Also note that all of the delays in Figure 7.3 are variable delays.
From (7.25) and (7.26), we have
DqP(j + 2) = Max(0, D4p(j + 2)D12q(j + 2))
Dr(j + 2) + 142q(j + 2) = Max(D12q(j + 2), D4p2(j + 2)) (7.27)
/:)f)q(j + 2) = Max(0,q2(j + 2)D1,p2(j + 2))
+ 2) + D4,p2(j + 2)= Max(D4p2(j + 2), D1q2(j + 2)) (7.28)
Figure 7.3: A two-dimensional micropipeline: Join.
(a) General representation of pipelines with Join.
(b) Equivalent circuit when looking into cross section r.
(c) Equivalent circuit when looking into cross sections q and p.
Therefore, if we look into cross section r in Figure 7.3(a), the equivalent delay that will
be seen is Max(Dr!q(j + 2), DI p2(j + 2)), as shown in Figure 7.3(b). Based on expressions
(7.27) and (7.28), the equivalent input delay Diin(i + 2) represented by expressions (7.23)
and (7.24) can be rewritten as
f);"( j + 2) = B i(j + 1) + Max(Dg(j + 2), D4,p2(j + 2)) + F i(j + 2) (7.29)
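The step from (7.25) and (7.26) to (7.27) through (7.29) rests on a simple identity: adding a delay b to a logic delay of the form Max(0, a - b) gives Max(a, b). The Python check below exercises this identity on a small grid of non-negative integers; the helper name and the test range are arbitrary assumptions.

```python
import itertools


def delay_plus_logic_delay(a, b):
    """b plus the logic delay Max(0, a - b) seen at a C-element input."""
    return b + max(0, a - b)


# Exhaustively compare both sides of the identity on a small integer grid.
grid = range(0, 8)
identity_holds = all(
    delay_plus_logic_delay(a, b) == max(a, b)
    for a, b in itertools.product(grid, repeat=2)
)
```

This is why only the larger of the two branch delays is visible when looking into cross section r: the shorter branch is padded by its logic delay up to the longer one.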
If we are able to find the delay bounds for either (7.27) or (7.28), the delay bounds for
Diin(j + 2) can be obtained. By applying the Equal loopdelay theorem to Celement
C14,, we have
D4p2(j + 2)Max(0,
D;;1(j + 2)DF i(j + 1)DPaq(j + 1)logic delay)
Max(0,
+ 2)DF z(j + 1)DP2(j + 1))
Max(0,
Di1,1(j + 2)DF i(j + 1)) (7.30)
Similarly, we have
D'.12q (j + 2)Max(0,
1,
14(j + 2)DFi(j + 1)Dr(j + 1))
5_ Max(0,
Di4i(j + 2)DF i(j + 1))
Combining expressions (7.29), (7.30) and (7.31), we conclude
Diin(j + 2)Max( F i(j + 2) + B i(j + 1),
+ 2) + F i(j + 2)F i(j + 1),
Di1,1(j + 2) + F i(j + 2)F i(j + 1))




The output loop delay will then have the following form.




As will be shown in the next chapter, obtaining the equivalent output delays by
looking into cross sections q and p will help analyze a more complex circuit. From the
definition of equivalent output delay, we have
LY lq w(j + 1) = Dr° + 1) + Fi(j + 1) + 1:1(j + 1) + Bi(j + 1) (7.34) t,
j + 1 )Dicl,q(j + 1 )Fi(j + 1) + I41_1(j + 1) + Bi(j + 1) (7.35)
For the initial condition case, D'aiP(/) = Max(0, Dipn(2)Diqn(2)) and lef.1(1) = 0, we
have
DZV1)= Max(Fi(I)+ Bi(1),
DP(2) Di,i(2) + F1.(1) + Bi(1)
Max(Dr.:(1))11.,0= Max(FUwax + BI'fax,
Din,MaxDindnin + (7.36)
From (7.34), we have
DZqut(j + 2) = + 2) + Fi(j + 2) + D?"4_1 1(j + 2) + Bi(j + 2) (7.37)
By applying the Equal loopdelay theorem to Celement Ci+1, we have
/3:14F i(j + 2)
= Max(0,
+ 1)Bi(j + 1)Dr(j + 2)Fi(j + 2)logic delay)
BogP(j + 2) + Dr+I 1(1 + 2)S Max(Dr(j + 2),
D'in+nlj + 1)Bi(j + 1)Fi(j + 2)) (7.38)
From (7.25) and (7.30), we know
Dr(j + 2) 5_ Max(0,
D;',1(j + 2)DFi(j + 1)) (7.39)
Max(Dr(j + 2))1iMax(0,
Dpin'maxDF`inin) (7.40)
From (7.37), (7.38) and (7.39), we have
.1)°,qut(j + Max(Fi(j + 2) + Bi(j + 2),
DP(j+ 2) Fi(j + 1) + Fi(j + 2) Bi(j+ 1) + Bi(j + 2),
+ 1) Bi(j + 1) + Bi(j + 2))
Max(Dr,;(j + 2))1iMax(Firc" +
Din,MaxF axFminBrax D M"
Dout,MaxByax
i +1 ifinin)
Combining (7.36) and (7.41), we conclude
Max(1)°,qut(j + 1))1jMax(Frax +BMax,
Din,MaxDin,min axBrax
P q




Max(D°%"(j + Max(Frax + Brax,







Figure 7.4(b) is a Petri net representation of Figure 7.4(a). To compare the output


















Figure 7.4: A Join pipeline and its Petri net model.
(a) General representation of a Join pipeline.
(b) Petri net model.
(c) Result comparison.
obtained using CTSE, several simulations are done and tabulated in Table 7.2. The
graphical presentation is shown in Figure 7.4(c). Again, the C-element's physical delay
is assumed to be zero. The simulations show that our approach gives the same values as
the exact bound. However, a general conclusion that our approach always leads to the
identical result as the exact bound cannot be drawn unless a formal proof is made.
Table 7.2: Comparison of our approximations with the exact bounds
using a general Join pipeline as an example.
                   1        2        3        4        5        6        7        8        9        10
D^in_q             [17 25]  [27 47]  [17 20]  [17 20]  [11 21]  [21 31]  [21 51]  [1 11]   [1 11]   [10 21]
D^in_p             [2 15]   [12 35]  [2 15]   [12 25]  [4 31]   [4 20]   [4 40]   [4 40]   [4 10]   [4 40]
F_i                [5 20]   [15 25]  [5 15]   [15 25]  [6 25]   [12 25]  [22 25]  [22 25]  [22 25]  [12 25]
B_i                [13 18]  [13 18]  [13 18]  [13 28]  [8 14]   [8 14]   [8 24]   [8 24]   [8 24]   [8 24]
D^out_{i+1}        [29 35]  [29 35]  [29 45]  [29 35]  [10 40]  [10 44]  [10 44]  [10 44]  [10 34]  [10 59]
Max                40       57       45       53       50       44       54       49       49       59
UP1                40       57       45       53       50       44       54       49       49       59
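The agreement between the approximation and the exact bound in Table 7.2 can be confirmed with a short Python check. It only compares the tabulated Max and UP1 rows; the variable names are illustrative, and agreement on ten cases is of course not a proof of tightness.

```python
# Exact upper bounds (Max) and approximations (UP1) for cases 1 to 10,
# copied from Table 7.2.
exact = [40, 57, 45, 53, 50, 44, 54, 49, 49, 59]
approx = [40, 57, 45, 53, 50, 44, 54, 49, 49, 59]

# Case numbers (1-based) where the two values differ.
mismatches = [k + 1 for k, (mx, up) in enumerate(zip(exact, approx)) if mx != up]
all_tight = not mismatches
```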
7.3 Performance Of Asynchronous Pipelines With Toggle/XOR Pair
In previous sections, the performance of two-dimensional pipelines (Fork and
Join) constructed using the Event AND (C-element) module has been discussed. In this
section, a different two-dimensional pipeline built with Toggle and XOR (Event OR)
modules is demonstrated and its performance is derived. Figure 7.5(a) depicts a general
form of pipelines with Toggle and XOR modules.
Several approaches to deriving the performance of an XOR/Toggle pipeline are
possible. It appears that using the approach of equivalent input delay by looking into












Figure 7.5: A two-dimensional micropipeline: Toggle/XOR.
(a) General representation of pipelines with Toggle/XOR.
(b) Token index transformation.
(c) Equivalent circuit.
resulting equivalent circuit is shown in Figure 7.5(c). Due to the function of Toggle,
the token index transformation, as shown in Figure 7.5(b), is required when tokens toggle
between two different paths. The token index c ($c \ge 1$) is different from the index j (j > 0) that we
have used so far. Token index c represents a datum sequence which indicates that the datum
is being, or is about to be, processed. For example, the (j+2)th datum for $F_i$,
conventionally denoted as $F_i(j+2)$, is also represented as $F_i(c)$, where c = j + 2. With an
understanding of Toggle's operation (the first output token goes to the dotted output terminal),
the token index transformation becomes self-explanatory, as shown in
Figure 7.5(b).
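The token index transformation can be sketched as a small routing function. This is an assumption-laden reading, not the thesis's definition: it supposes that the dotted Toggle output feeds branch q, so that odd-numbered tokens (c = 1, 3, 5, ...) travel through branch q and even-numbered tokens through branch p, each branch counting its own tokens from 1. The function name `toggle_route` is hypothetical.

```python
def toggle_route(c):
    """Map a global token index c (c >= 1) to (branch, branch-local index).

    Assumes the first token leaves through the dotted output, taken here
    to be branch q, and that the outputs alternate afterwards.
    """
    if c < 1:
        raise ValueError("token index c must be at least 1")
    if c % 2 == 1:
        # Odd tokens 1, 3, 5, ... become tokens 1, 2, 3, ... on branch q.
        return ("q", (c + 1) // 2)
    # Even tokens 2, 4, 6, ... become tokens 1, 2, 3, ... on branch p.
    return ("p", c // 2)


routes = [toggle_route(c) for c in range(1, 7)]
```

Under this reading, global tokens 2j + 1 and 2j + 2 both map to branch-local index j + 1, on branches q and p respectively.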
In turn, we would like to apply the Equal loop-delay theorem to C-elements
$C_i$, $C_{i+1,q}$ and $C_{i+1,p}$. Note that there is no logic delay term for Toggle and XOR modules.
They have only propagation delays. Assume that the propagation delays for Toggle
and XOR are constant and equal to $D_T$ and $D_X$, respectively. By applying the Equal
loop-delay theorem to $C_i$ with the token traveling through the loop r → s → t → u → v → w →
r, as indicated in Figure 7.5(a), we have
$$D_i^{lg}(j_e+2) = \mathrm{Max}\Big(0,\; D_{i-1}^{in}(j_e+2) - \big[F_i(j_e+1) + D_T + D_{i+1,q}^{lg}\big(\tfrac{j_e}{2}+1\big) + D_X + B_i(j_e+1)\big]\Big)$$
where $j_e = 0, 2, 4, 6, \ldots$, or
$$D_i^{lg}(\text{even}) = D_i^{lg}(2j+2) = \mathrm{Max}\Big(0,\; D_{i-1}^{in}(2j+2) - \big[F_i(2j+1) + D_T + D_{i+1,q}^{lg}(j+1) + D_X + B_i(2j+1)\big]\Big) \tag{7.44}$$
where $j_e = 2j$ and $j = 0, 1, 2, 3, 4, \ldots$
Using a similar approach for $C_i$ with the token traveling through the loop r → s → x → y → v → w → r,
as indicated in Figure 7.5(a), we have
$$D_i^{lg}(j_o+2) = \mathrm{Max}\Big(0,\; D_{i-1}^{in}(j_o+2) - \big[F_i(j_o+1) + D_T + D_{i+1,p}^{lg}\big(\tfrac{j_o+1}{2}\big) + D_X + B_i(j_o+1)\big]\Big)$$
where $j_o = 1, 3, 5, 7, \ldots$, or
$$D_i^{lg}(\text{odd}) = D_i^{lg}(2j+3) = \mathrm{Max}\Big(0,\; D_{i-1}^{in}(2j+3) - \big[F_i(2j+2) + D_T + D_{i+1,p}^{lg}(j+1) + D_X + B_i(2j+2)\big]\Big) \tag{7.45}$$
where $j_o = 2j + 1$ and $j = 0, 1, 2, 3, 4, \ldots$
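The even/odd recurrences for $C_i$ can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function names and the numeric arguments are hypothetical, and the formula encodes one reading of the Equal loop-delay recurrence (the logic delay is the input delay minus the loop delay, floored at zero).

```python
def toggle_branch(token):
    """Map a token index (j + 2 in the text, so token >= 2) to the Toggle
    branch and to the index of that branch's C-element logic delay."""
    if token % 2 == 0:            # even tokens j_e + 2 use the upper (q) branch
        return "q", token // 2    # index j_e / 2 + 1
    return "p", (token - 1) // 2  # odd tokens j_o + 2 use index (j_o + 1) / 2


def logic_delay(d_in, f_prev, b_prev, d_lg_branch, d_t=0.0, d_x=0.0):
    """One recurrence step: Max(0, input delay - loop delay).

    d_in        equivalent input delay seen by C_i for this token
    f_prev      forward delay F_i of the previous token
    b_prev      backward delay B_i of the previous token
    d_lg_branch logic delay of the branch C-element selected by Toggle
    d_t, d_x    constant propagation delays of Toggle and XOR (assumed values)
    """
    loop = f_prev + d_t + d_lg_branch + d_x + b_prev
    return max(0.0, d_in - loop)
```

For instance, `toggle_branch(4)` selects the q branch with index 2, matching $j_e = 2$ in the even case.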
Caution should be taken when $C_{i+1,q}$ is considered. Due to Toggle's operation,
the token will not travel through the loop u → v → w → r → s → t. Instead, it will follow
the path u → v → w → r → s → x → y → v → w → r → s → t. Similarly, the token will traverse
the loop y → v → w → r → s → t → u → v → w → r → s → x when $C_{i+1,p}$ is considered.
The resulting logic delays are as shown below.
$$D_{i+1,q}^{lg}(j+2) = \mathrm{Max}\Big(0,\; D_q^{out}(j+1) - \big[D_X + B_i(2j+1) + D_i^{lg}(2j+2) + F_i(2j+2) + D_T + D_{i+1,p}^{lg}(j+1) + D_X + B_i(2j+2) + D_i^{lg}(2j+3) + F_i(2j+3) + D_T\big]\Big) \tag{7.46}$$
$$D_{i+1,p}^{lg}(j+2) = \mathrm{Max}\Big(0,\; D_p^{out}(j+1) - \big[D_X + B_i(2j+2) + D_i^{lg}(2j+3) + F_i(2j+3) + D_T + D_{i+1,q}^{lg}(j+2) + D_X + B_i(2j+3) + D_i^{lg}(2j+4) + F_i(2j+4) + D_T\big]\Big) \tag{7.47}$$
where $j = 0, 1, 2, 3, 4, \ldots$
Now, we are ready to find the delay bounds for $D_{i,q}^{in}(j+2)$ and $D_{i,p}^{in}(j+2)$. From
previous discussion, we have
$$D_{i,q}^{in}(j+2) = D_X + B_i(2j+1) + D_i^{lg}(2j+2) + F_i(2j+2) + D_T + D_{i+1,p}^{lg}(j+1) + D_X + B_i(2j+2) + D_i^{lg}(2j+3) + F_i(2j+3) + D_T \tag{7.48}$$
$$D_{i,q}^{in}(j+2) \ge B_i(2j+1) + B_i(2j+2) + F_i(2j+2) + F_i(2j+3) + 2(D_T + D_X) \tag{7.49}$$
From (7.46), we have
$$D_{i+1,q}^{lg}(j+1) = \mathrm{Max}\Big(0,\; D_q^{out}(j) - \big[D_X + B_i(2j-1) + D_i^{lg}(2j) + F_i(2j) + D_T + D_{i+1,p}^{lg}(j) + D_X + B_i(2j) + D_i^{lg}(2j+1) + F_i(2j+1) + D_T\big]\Big) \tag{7.50}$$
From (7.47), we have
$$D_{i+1,p}^{lg}(j+1) = \mathrm{Max}\Big(0,\; D_p^{out}(j) - \big[D_X + B_i(2j) + D_i^{lg}(2j+1) + F_i(2j+1) + D_T + D_{i+1,q}^{lg}(j+1) + D_X + B_i(2j+1) + D_i^{lg}(2j+2) + F_i(2j+2) + D_T\big]\Big) \tag{7.51}$$
It should be noted that the terms $D_i^{lg}(2j+2)$, $D_i^{lg}(2j+3)$ and $D_{i+1,p}^{lg}(j+1)$ in
(7.48) are not independent. Therefore, if we directly substitute expressions (7.44), (7.45)
and (7.51) into (7.48), we will get a looser (larger) upper bound approximation than if we
identify their relationship and find the upper bound of the group. That is,
$$\mathrm{Max}\big(D_i^{lg}(2j+2) + D_i^{lg}(2j+3) + D_{i+1,p}^{lg}(j+1)\big) \le \mathrm{Max}\big(D_i^{lg}(2j+2)\big) + \mathrm{Max}\big(D_i^{lg}(2j+3)\big) + \mathrm{Max}\big(D_{i+1,p}^{lg}(j+1)\big) \tag{7.52}$$
Their relationship can be described as follows. From (7.45), we have
$$D_i^{lg}(2j+3) = \mathrm{Max}\Big(0,\; D_{i-1}^{in}(2j+3) - \big[F_i(2j+2) + D_T + D_{i+1,p}^{lg}(j+1) + D_X + B_i(2j+2)\big]\Big)$$
$$D_i^{lg}(2j+3) + D_{i+1,p}^{lg}(j+1) = \mathrm{Max}\Big(D_{i+1,p}^{lg}(j+1),\; D_{i-1}^{in}(2j+3) - \big[F_i(2j+2) + D_T + D_X + B_i(2j+2)\big]\Big) \tag{7.53}$$
By substituting $D_{i+1,p}^{lg}(j+1)$ in (7.51) into (7.53), we get
$$\begin{aligned}
D_i^{lg}(2j+3) + D_{i+1,p}^{lg}(j+1) = \mathrm{Max}\Big(&0,\\
&D_p^{out}(j) - \big[D_X + B_i(2j) + D_i^{lg}(2j+1) + F_i(2j+1) + D_T + D_{i+1,q}^{lg}(j+1)\\
&\qquad + D_X + B_i(2j+1) + D_i^{lg}(2j+2) + F_i(2j+2) + D_T\big],\\
&D_{i-1}^{in}(2j+3) - \big[F_i(2j+2) + D_T + D_X + B_i(2j+2)\big]\Big)
\end{aligned}$$
$$\begin{aligned}
D_i^{lg}(2j+3) + D_{i+1,p}^{lg}(j+1) + D_i^{lg}(2j+2) = \mathrm{Max}\Big(&D_i^{lg}(2j+2),\\
&D_p^{out}(j) - \big[D_X + B_i(2j) + D_i^{lg}(2j+1) + F_i(2j+1) + D_T + D_{i+1,q}^{lg}(j+1)\\
&\qquad + D_X + B_i(2j+1) + F_i(2j+2) + D_T\big],\\
&D_{i-1}^{in}(2j+3) + D_i^{lg}(2j+2) - \big[F_i(2j+2) + D_T + D_X + B_i(2j+2)\big]\Big)
\end{aligned}$$
By substituting $D_i^{lg}(2j+2)$ in (7.44) into the above equation and combining with (7.48),
we have
$$\begin{aligned}
D_{i,q}^{in}(j+2) = \mathrm{Max}\big(&F_i(2j+2) + F_i(2j+3) + B_i(2j+1) + B_i(2j+2) + 2(D_T + D_X),\\
&D_{i-1}^{in}(2j+2) - F_i(2j+1) + F_i(2j+2) + F_i(2j+3) + B_i(2j+2) + D_T + D_X - \text{logic delays},\\
&D_p^{out}(j) - F_i(2j+1) + F_i(2j+3) - B_i(2j) + B_i(2j+2) - \text{logic delays},\\
&D_{i-1}^{in}(2j+3) + F_i(2j+3) + B_i(2j+1) + D_T + D_X,\\
&D_{i-1}^{in}(2j+2) + D_{i-1}^{in}(2j+3) - F_i(2j+1) + F_i(2j+3) - \text{logic delays}\big)
\end{aligned} \tag{7.54}$$
Note that although $B_i(2j)$ and $D_p^{out}(j)$ are not defined when $j = 0$ (they appear in
$D_{i+1,p}^{lg}(j+1)$), they will not affect the upper bound representation because when
$D_{i+1,p}^{lg}(1) = 0$, the representation of $D_{i,q}^{in}(j+2)$ is a subset of expression (7.54).
Using a similar approach, we have
$$D_{i,p}^{in}(j+2) = D_X + B_i(2j+2) + D_i^{lg}(2j+3) + F_i(2j+3) + D_T + D_{i+1,q}^{lg}(j+2) + D_X + B_i(2j+3) + D_i^{lg}(2j+4) + F_i(2j+4) + D_T \tag{7.55}$$
$$D_{i,p}^{in}(j+2) \ge B_i(2j+2) + B_i(2j+3) + F_i(2j+3) + F_i(2j+4) + 2(D_T + D_X) \tag{7.56}$$
$$\begin{aligned}
D_{i,p}^{in}(j+2) = \mathrm{Max}\big(&F_i(2j+3) + F_i(2j+4) + B_i(2j+2) + B_i(2j+3) + 2(D_T + D_X),\\
&D_{i-1}^{in}(2j+3) - F_i(2j+2) + F_i(2j+3) + F_i(2j+4) + B_i(2j+3) + D_T + D_X - \text{logic delays},\\
&D_q^{out}(j+1) - F_i(2j+2) + F_i(2j+4) - B_i(2j+1) + B_i(2j+3) - \text{logic delays},\\
&D_{i-1}^{in}(2j+4) + F_i(2j+4) + B_i(2j+2) + D_T + D_X,\\
&D_{i-1}^{in}(2j+3) + D_{i-1}^{in}(2j+4) - F_i(2j+2) + F_i(2j+4) - \text{logic delays}\big)
\end{aligned} \tag{7.57}$$
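By looking into cross section v, there are two possible token traveling paths.
If the token follows the upper path (q path), we have
$$D_i^{out}(j+1) = F_i(2j+1) + D_T + D_{i+1,q}^{lg}(j+1) + D_X + B_i(2j+1) \tag{7.58}$$
From (7.50), we have
$$D_i^{out}(j+1) = \mathrm{Max}\big(F_i(2j+1) + B_i(2j+1) + D_T + D_X,\; D_q^{out}(j) - F_i(2j) - B_i(2j-1) - B_i(2j) + B_i(2j+1) - D_X - D_T - \text{logic delays}\big) \tag{7.59}$$
On the other hand, if the token travels the lower path (p path), we have
$$D_i^{out}(j+1) = F_i(2j+2) + D_T + D_{i+1,p}^{lg}(j+1) + D_X + B_i(2j+2) \tag{7.60}$$
From (7.51), we have
$$D_i^{out}(j+1) = \mathrm{Max}\big(F_i(2j+2) + B_i(2j+2) + D_T + D_X,\; D_p^{out}(j) - F_i(2j+1) - B_i(2j) - B_i(2j+1) + B_i(2j+2) - D_X - D_T - \text{logic delays}\big) \tag{7.61}$$
If we let
$$D_{i-1}^{in,min} \le D_{i-1}^{in}(j+2) \le D_{i-1}^{in,Max},\quad F_i^{min} \le F_i(j+1) \le F_i^{Max},\quad B_i^{min} \le B_i(j+1) \le B_i^{Max},$$
$$D_q^{out,min} \le D_q^{out}(j+1) \le D_q^{out,Max} \quad\text{and}\quad D_p^{out,min} \le D_p^{out}(j+1) \le D_p^{out,Max},$$
then from (7.49), we have
$$\min\big(D_{i,q}^{in}(j+2)\big)\big|_j \ge 2B_i^{min} + 2F_i^{min} + 2(D_T + D_X) \tag{7.62}$$
From (7.54), we have
$$\begin{aligned}
\mathrm{Max}\big(D_{i,q}^{in}(j+2)\big)\big|_j \le \mathrm{Max}\big(&2F_i^{Max} + 2B_i^{Max} + 2(D_T + D_X),\\
&D_{i-1}^{in,Max} + 2F_i^{Max} - F_i^{min} + B_i^{Max} + D_T + D_X,\\
&D_p^{out,Max} + F_i^{Max} - F_i^{min} + B_i^{Max} - B_i^{min},\\
&2D_{i-1}^{in,Max} + F_i^{Max} - F_i^{min}\big)
\end{aligned} \tag{7.63}$$
From (7.56), we have
$$\min\big(D_{i,p}^{in}(j+2)\big)\big|_j \ge 2B_i^{min} + 2F_i^{min} + 2(D_T + D_X) \tag{7.64}$$
From (7.57), we have
$$\begin{aligned}
\mathrm{Max}\big(D_{i,p}^{in}(j+2)\big)\big|_j \le \mathrm{Max}\big(&2F_i^{Max} + 2B_i^{Max} + 2(D_T + D_X),\\
&D_{i-1}^{in,Max} + 2F_i^{Max} - F_i^{min} + B_i^{Max} + D_T + D_X,\\
&D_q^{out,Max} + F_i^{Max} - F_i^{min} + B_i^{Max} - B_i^{min},\\
&2D_{i-1}^{in,Max} + F_i^{Max} - F_i^{min}\big)
\end{aligned} \tag{7.65}$$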





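The maximum value of $D_i^{out}(j+1)$ will be the maximum of (7.59) and (7.61).
That is,
$$\begin{aligned}
\mathrm{Max}\big(D_i^{out}(j+1)\big)\big|_j \le \mathrm{Max}\big(&F_i^{Max} + B_i^{Max} + D_T + D_X,\\
&D_q^{out,Max} - F_i^{min} + B_i^{Max} - 2B_i^{min} - D_T - D_X,\\
&D_p^{out,Max} - F_i^{min} + B_i^{Max} - 2B_i^{min} - D_T - D_X\big)
\end{aligned} \tag{7.66}$$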
To compare the output loop delays (UP1q and UP1p) based on our approach with
the exact upper bounds (Max,q and Max,p) obtained using CTSE, Figure 7.6(a) and
Figure 7.6(b) depict the circuit and Petri net model of Toggle/XOR, respectively. Comparison
results are tabulated in Table 7.3. Based upon this table, the graphic result is
also plotted in Figure 7.6(c). Note that the physical delays of Toggle and XOR are
assumed to be zero in our simulations.




Figure 7.6: A Toggle/XOR pipeline and its Petri net model.
(a) General representation of a Toggle/XOR pipeline.
(b) Petri net model.
(c) Result comparison.
Table 7.3: Comparison of our approximations with the exact bounds
using a general Toggle/XOR pipeline as an example.
Each bracketed entry is a [min Max] delay bound.
Case            1        2        3        4        5        6        7        8        9        10
$D_{i-1}^{in}$  [2 15]   [2 10]   [2 7]    [2 20]   [2 30]   [2 18]   [2 30]   [2 27]   [2 10]   [2 20]
$F_i$           [1 10]   [8 9]    [8 9]    [1 10]   [2 9]    [6 18]   [6 13]   [6 13]   [8 9]    [6 20]
$B_i$           [3 18]   [9 15]   [9 20]   [3 18]   [13 20]  [12 12]  [10 15]  [10 15]  [9 15]   [10 15]
$D_q^{out}$     [29 35]  [29 35]  [29 35]  [29 35]  [19 25]  [19 62]  [9 12]   [9 35]   [29 35]  [9 55]
$D_p^{out}$     [21 29]  [21 69]  [21 49]  [21 29]  [10 19]  [10 19]  [10 45]  [10 40]  [21 29]  [10 53]
Max,q           56       76       61       57       67       62       67       62       48       72
UP1q            56       76       61       57       67       62       67       62       48       72
Max,p           59       69       58       59       67       74       67       62       48       74
UP1p            59       69       58       59       67       74       67       62       48       74
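The tabulated approximations can be checked with a short Python sketch. It is an illustration under stated assumptions, not the tool used for the simulations: the function names are hypothetical, the four terms encode one reading of the upper bounds on the equivalent input delays of the q and p branches derived above, and the output loop delay bound is assumed to be the larger of the equivalent input delay bound and the maximum output delay of the same branch. Under these assumptions the sketch reproduces the UP1q and UP1p rows for the cases tested.

```python
def input_delay_upper(d_in_max, f, b, d_out_other_max, d_t=0, d_x=0):
    """Upper bound on the equivalent input delay of one Toggle/XOR branch.

    f and b are (min, Max) bounds of the forward and backward delays;
    d_out_other_max is the maximum output delay of the *other* branch.
    """
    f_min, f_max = f
    b_min, b_max = b
    return max(
        2 * f_max + 2 * b_max + 2 * (d_t + d_x),
        d_in_max + 2 * f_max - f_min + b_max + d_t + d_x,
        d_out_other_max + f_max - f_min + b_max - b_min,
        2 * d_in_max + f_max - f_min,
    )


def loop_delay_upper(d_in, f, b, d_out_q, d_out_p, d_t=0, d_x=0):
    """Return (UP1q, UP1p); every argument except d_t, d_x is a (min, Max) pair."""
    q_in = input_delay_upper(d_in[1], f, b, d_out_p[1], d_t, d_x)
    p_in = input_delay_upper(d_in[1], f, b, d_out_q[1], d_t, d_x)
    # Assumption: the loop delay bound is the larger of the input-side bound
    # and the maximum output delay of the same branch.
    return max(q_in, d_out_q[1]), max(p_in, d_out_p[1])
```

With the Toggle and XOR delays set to zero, as in the simulations, case 2 of Table 7.3 is evaluated as `loop_delay_upper((2, 10), (8, 9), (9, 15), (29, 35), (21, 69))`.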
7.4 Performance Of Asynchronous Pipelines With Arbiter/Call Pair
Frequently, a circuit is needed that allows two different parties to compete for
one resource. When using asynchronous pipelines, this type of circuit can be implemented
with Arbiter and Call modules, as shown in Figure 7.7(b). The performance of an
asynchronous pipeline with an Arbiter/Call pair is investigated in this section.
The Arbiter, introduced by Sutherland [9], is shown in the far left graph of
Figure 7.7(a). There are no Acknowledge terminals, A1 and A2, attached to the Arbiter at
all. Depending on the application, the Acknowledge terminals can be driven from either the Grant
terminals, G1 and G2, or the Done terminals, D1 and D2. In our performance analysis the
Acknowledge signals, A1 and A2, are assumed to be connected to G1 and G2. This
is shown in the middle graph of Figure 7.7(a). To avoid the confusion of wiring crossover,
the Arbiter is redrawn as shown in the far right graph of
Figure 7.7(a). Assume that the propagation delays for Arbiter and Call are constant and








Figure 7.7: A two-dimensional micropipeline: Arbiter/Call.
(a) Arbiter representations.
(b) General representation of a pipeline with Arbiter/Call.
(c) Equivalent circuit.
ter is assumed to be very close to the output ends and the time for this mechanism to
resolve the competition is very short, such that it is negligible. As a result, the time
for one of the Done terminals (D1 and D2) to release the suspended token (if any) is
assumed to be zero. If e' and f' in Figure 7.7(b) are points right before the mutually
exclusive mechanism, the time for a token to travel from e to e' and from f to f' is a physical
delay equal to $D_A$. On the other hand, the time for a token to be released at e' (or
f') and to arrive at u (or y) is a waiting delay. This waiting delay is similar to the waiting
or logic delay of a C-element as defined before, but in a more general sense.
As long as the delay bound of $D_i^{in}(j+2)$, the equivalent input delay by looking into
cross section I as shown in Figure 7.7(b), is obtained, the performance of an asynchronous
pipeline with an Arbiter/Call pair can be derived. There are four possible token paths
composing the delay of $D_i^{in}(j+2)$. They are r → s → t → u → v → w, r → s → x → y → v → w,
r → s → t → y → v → w and r → s → x → u → v → w. The only difference in terms of
delay among these four paths is the delay from t to u, x to y, t to y, and x to u, denoted
as $D_i^{tu}(j'+1)$, $D_i^{xy}(j'+1)$, $D_i^{ty}(j'+1)$ and $D_i^{xu}(j'+1)$, respectively. All these delays are
the time for a resource to wait for a party to be granted permission. Since the Arbiter operates
on a first-come, first-served basis, $D_i^{in}(j+2)$ is equal to the minimum of these
four delays. That is,
from Figure 7.7(b), we have
$$\begin{aligned}
D_i^{in}(j+2) = \min\big(&B_i(j+1) + D_C + D_i^{tu}(j'+1) + D_C + F_i(j+2),\\
&B_i(j+1) + D_C + D_i^{xy}(j'+1) + D_C + F_i(j+2),\\
&B_i(j+1) + D_C + D_i^{ty}(j'+1) + D_C + F_i(j+2),\\
&B_i(j+1) + D_C + D_i^{xu}(j'+1) + D_C + F_i(j+2)\big)
\end{aligned} \tag{7.67}$$
$$D_i^{in}(j+2) \ge B_i(j+1) + D_C + D_C + F_i(j+2) \tag{7.68}$$
By applying the Equal loop-delay theorem to C-elements $C_{i,q}$ and $C_{i,p}$ in
Figure 7.7(b), we have
Dg(j + 2) = Max(0,
Di; (j + 2)[DA + Dr(j + 1)]) (7.69)
+ 2) = Max(0,
Dr,?(j + 2)[DA + Dfl(j + 1)]) (7.70)
The token traveling of Arbiter/Call is unlike those of FIFO, Fork, Join and
Toggle/XOR. For FIFO, only one token traveling path is possible. As to Fork and Join,
there are two token traveling paths and a token will be split into two parts, with one
traveling in each path. The token traveling of Toggle/XOR is similar to that of Fork
and Join except that the routing path alternates. Therefore, exact expressions describing
token routing can be derived for each of these modules. Since the Arbiter operates on
a first-come, first-served basis, no exact expression can be deduced to represent
its token traveling path. However, we only need to obtain the expression describing
the maximum delay in order to find the maximum bound. The notation $\mathrm{Max}(f(\cdot))|_p$
denotes the maximum value of the function $f(\cdot)$ with respect to all of the possible token
traveling paths, denoted as $p$. This is different from the notation $\mathrm{Max}(f(\cdot))|_j$, which
refers to the maximum value of $f(\cdot)$ with respect to $j$. The most complete representation
for the case we are interested in is $\mathrm{Max}(f(\cdot))|_{j,p}$, which indicates the maximum value
of $f(\cdot)$ with respect to both index $j$ and token traveling path $p$. Since the token traveling
path for FIFO, Fork, Join and Toggle/XOR is known or describable (e.g., the expressions
corresponding to all possible paths for Toggle/XOR are explicitly enumerated), the subscript
$p$ is ignored. Therefore, $\mathrm{Max}(f(\cdot))|_{j,p} = \mathrm{Max}(f(\cdot))|_j$ in these cases. Also note
that, in general, $\mathrm{Max}(f(\cdot))|_{j,p} = \mathrm{Max}(\mathrm{Max}(f(\cdot))|_j)|_p = \mathrm{Max}(\mathrm{Max}(f(\cdot))|_p)|_j$.
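The exchange of the two maximizations can be illustrated with a few lines of Python. The delay values and function names below are hypothetical and serve only to show that the maximum over index $j$ and over path $p$ may be taken in either order, whereas a minimum over paths cannot simply be swapped with a maximum over $j$: the maximum of the minima never exceeds the minimum of the maxima.

```python
# Hypothetical delays f(j, p): one row per token path p, one column per index j.
delays = {
    "path_a": [12, 17, 9],
    "path_b": [15, 11, 16],
}


def max_over_j_then_p(table):
    """Max(Max(f)|j)|p: maximize along each path first, then across paths."""
    return max(max(row) for row in table.values())


def max_over_p_then_j(table):
    """Max(Max(f)|p)|j: maximize across paths for each j first, then over j."""
    return max(max(column) for column in zip(*table.values()))


def max_of_min_over_paths(table):
    """Max over j of the minimum across paths (first-come, first-served choice)."""
    return max(min(column) for column in zip(*table.values()))


def min_of_max_over_j(table):
    """Minimum across paths of each path's maximum over j."""
    return min(max(row) for row in table.values())
```

For the values above, both orders of maximization give 17, while the maximum of the minima (12) is smaller than the minimum of the maxima (16).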
The delay $D_i^{ty}(j'+1)$ is considered first. $D_i^{ty}(j'+1)$ is the time difference
between two tokens arriving at points t and f'. As far as $D_i^{ty}(j'+1)$ is concerned,
the bottom part of the Arbiter can be modeled as a C-element with f', t, h and y corresponding
to the terminals of the C-element in Figure 3.4(b). If a token arrives at t earlier than at
f' for $j' = k$, then $D_i^{ty}(k+1) = 0$. If not, $D_i^{ty}(k+1) > 0$. Therefore, to have the maximum
value of $D_i^{ty}(j'+1)$, the path delay to f' should be as large as possible (longest path to
f') and the path delay to t should be as small as possible (shortest path to t). Considering
the case of $\mathrm{Max}(D_i^{ty}(1))|_p$, we know the token path to f' has two different routes. One
is d → b → f → f' and this corresponds to the consideration of the initial condition. The
other route is h → f → f' and this indicates that the bottom part of the Arbiter (G2 in
Figure 7.7(a)) has granted permission at least once. We discuss these two routes separately.
Route 1 (d → b → f → f'): The other matching token path to t is c → a → e → e' →
N(u → v → w → r → s → t), where N, representing the number of times a token travels around
the upper loop, is a positive integer. Of course, an infinite
number of routings is possible due to the infinite looping. Therefore, to make $D_i^{ty}(1)$
as large as possible among these paths, a route for a token traveling the loop once (N
= 1) is chosen (note that the less time the token travels on the upper section to arrive
at t, the more time this token needs to wait for the incoming token on the lower section
to arrive at f', i.e., $D_i^{ty}(1)$ is larger). That is, based upon the functions of Arbiter and
Call, and the initial condition, we have
Max(D7(1))1p route 1
= Max(0,
Dipn(1)Di,p4(1) [Diqn(1)D24q(1)+ DA + Dr(1)Dc
+ Fi(I) + DM(I)+ Bi(1)+ DO)155
= Max(0,
D';(1)[D`,;7(1)+ pr(1)+ Dc + F,(1) + Bi(I)+ Dc]) (7.71)
Route 2 (h → f → f'): The other matching token path to t which has the shortest
path delay is y → v → w → r → s → x → u → v → w → r → s → t. This implies that a token
must travel the bottom section at least once (i.e., a token appears on G2). We should
consider the timing right after a token at f' is granted. The path y → v → w → r → s → x
determines how much time the token has been traveling in advance in the h → f → f'
path before the token on e' is granted permission and travels through the path
u → v → w → r → s → t. That is,
$\mathrm{Max}(D_i^{ty}(1))|_{p,\,\text{route 2}}$
= Max(0,
D1(j + 2) + DA[Dc + Fi(j + 1) + B1_4Fi(j + I)
+ Bi(j + I) + Dc + Dr(I) + Dc
+ Fi(j + 2) +Di21_i(j + 2) + Bi(j + 2) + Dc])
Substitute D'Ip2(j + 2) in (7.70) into above equation, we have
MaX(D7(1))1p,route2
= Max(0,
DA4DcFi(j + 1) Fi(j + 2) Bi(j + 1) Bi(j + 2)
141_1(j + 1) Dr(1)Df4+11(j + 2),
DP(j + 2) 4DcFi(j + 1) Fi(j + 2) Bi(j + 1) Bi(j + 2)
.11_1(j + 1)Dr(1) f4ifi(j + 2)DI+ 1)(7.72)
A similar concept used in deriving Max(D(1))1p, route2can be applied to find
Max(Dt(j' + 2))1p.Therefore, the representation of Max(D7)(f + 2))Ip would be
Max(D7(1 + 2))1p
= Max(0,
D42(j" + 2) + DA[DcFi(j + 1) + 4i.(j + 1)
+ Bi(j + 1) + Dc + Dr(j"' + I) + Dc
+ Fi(j + 2) + Bo:_41(j + 2) + B i(j + 2) + Dc])
Substitute D1,p2(j" + 2) in (7.70) into the above equation, we have
Max(D17(j1 + 2))11,
= Max(0,
DA4DcFi(j + 1) Fi(j + 2) Bi(j + 1)
B1(j + 2)Dr+1 i(j + 1)D?_4F1(j + 2)Dr(j"' + 1),
D7,1(j" + 2)4DcFi(j + 1) Flj + 2)Bi(j + 1)Bi(j + 2)
i(j + 1)D?1_1(j + 2)Dl (j"' + 1)Dri(j" + 1))(7.73)
Similarly, due to the symmetry of Arbiter, from (7.71), (7.72) and (7.73) we have
Max(Dr(1))1_,p route 1
= Max(0,
DT(1)[D`pn(1) + D7(1) + Dc + F,(1) + Bi(I) + Dc]) (7.74)
Max(Dr(I))1p,route2
= Max(0,
DA4DcFi(j + 1)+ 2) Bi(j + 1) Bi(j + 2)
Dr+I i(j + 1)DlY(1) + 2),
Di;(j + 2)4DcFi(j + 1)Fi(j + 2)Bi(j + 1)Bi(j + 2)
D:L(1 + 1)D'Y(1)D?1.1(j + 2)Dr(j + 1) (7.75)
Max(Dr(j1 + 2))Ip
= Max(0,
DA 4DcFi(j + 1)Fi(j + 2) Bi(j + I)Bi(j + 2)
D:+1(.1 + 1)Dr+1 i(j + 2)DliY(j"' + 1),
Diqn(j" + 2)4DcFi(j + 1)Fi(j + 2)Bi(j + 1)Bi(j + 2)
+ 1)Dr+i j(j + 2)D7(j"' + 1)Dr(j" + 1))(7.76)
Now, we would like to consider the delay representation of $D_i^{tu}(j'+1)$. This
is straightforward, since the token traveling paths corresponding to maximum delay of




D4 42( j" + 2)+ DA[DcFi(j + 1) + D12+4 1(j + 1) + B i(j + I ) + 13c1)
Substitute Dpq(j" + 2) in (7.69) into the above equation and we have
Max(Dtiu(1 + 1))1p
= Max(0,
DAFi(j + 1) Bi(j + 1) 2DcD:`+11(j + 1),
Diqn(j" + 2)Fi(j + 1)Bi(j + 1) 2DcDi?"4 1(j + 1)
Dr(j" + 1)) (7.77)
Again, due to the symmetry of Arbiter, from (7.77), we have
Max(D7(j' + 1))Ip
= Max(0,
DAFi(j + 1) Bi(j + 1)2DcD:`+1 1(j + 1),
Dipn(j" + 2)Fi(j + 1) Bi(j + 1)2 D i(j + 1)
DIN" + 1»
(7.78)
where j, j', j'' and j''' are independent and equal to 0, 1, 2, 3, ...
There are so many different indices (j, j', j'' and j''') because the
token routing paths for this circuit are not fixed, leading to an independent index
relationship. However, the relationship between j in (7.67) and j' is not independent, as will
be shown later.
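If we let
$$D_q^{in,min} \le D_q^{in}(j''+2) \le D_q^{in,Max},\quad D_p^{in,min} \le D_p^{in}(j''+2) \le D_p^{in,Max},$$
$$F_i^{min} \le F_i(j+1) \le F_i^{Max},\quad B_i^{min} \le B_i(j+1) \le B_i^{Max},$$
and $D_q^{in}(1)$ and $D_p^{in}(1)$ represent the initial conditions (assume $D_q^{in,min} \le D_q^{in}(1) \le D_q^{in,Max}$
and $D_p^{in,min} \le D_p^{in}(1) \le D_p^{in,Max}$), we can approximate the upper bound of $D_i^{in}(j+2)$
based on expression (7.67).
$$B_i(j+1) + D_C + D_i^{tu}(j'+1) + D_C + F_i(j+2) \le B_i(j+1) + D_C + \mathrm{Max}\big(D_i^{tu}(j'+1)\big)\big|_p + D_C + F_i(j+2)$$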
According to expression (7.77) and the routing path, we have $j' = j$. The above expression
can be rewritten as
Bi(j+ 1) + Dc + D"(j' + I) + Dc + Fi(j + 2)
= Max(Bfi + I) + Dc + Dc + Fi(j + 2),
D A + Fi(j + 2) Fi(j + 1) 141.1(j + 1),
Di/l(j" + 2) + Fi(j + 2) Fi(j + 1) Di(j + 1) Dr(j" + 1))
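$$\begin{aligned}
\mathrm{Max}\big(B_i(j+1) + D_C + D_i^{tu}(j'+1) + D_C + F_i(j+2)\big)\big|_j \le \mathrm{Max}\big(&F_i^{Max} + B_i^{Max} + 2D_C,\\
&F_i^{Max} - F_i^{min} + D_A,\\
&D_q^{in,Max} + F_i^{Max} - F_i^{min}\big)
\end{aligned} \tag{7.79}$$
Similarly, we have
$$\begin{aligned}
\mathrm{Max}\big(B_i(j+1) + D_C + D_i^{xy}(j'+1) + D_C + F_i(j+2)\big)\big|_j \le \mathrm{Max}\big(&F_i^{Max} + B_i^{Max} + 2D_C,\\
&F_i^{Max} - F_i^{min} + D_A,\\
&D_p^{in,Max} + F_i^{Max} - F_i^{min}\big)
\end{aligned} \tag{7.80}$$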
To find the maximum delay of $B_i(j+1) + D_C + D_i^{ty}(j'+1) + D_C + F_i(j+2)$,
where $j = 0, 1, 2, 3, \ldots$, a two-step approach is adopted. First, we find the maximum
delay based on $\mathrm{Max}(D_i^{ty}(1))|_{p,\,\text{route 1}}$ and $\mathrm{Max}(D_i^{ty}(1))|_{p,\,\text{route 2}}$. From (7.71) and the
routing path, we have $j = 0$.
Bi(1) + Dc + D'iY(1) + Dc + F/2)
11/1)+Dc+ Max(D7(1))1_,p route I+ Pc + Fi(2)
Max(F i(2) + 13/1) + 2Dc,
Dipn(1) +/2)F/1)DT (1)Dtlq(1)Dr(1)) (7.81)
From (7.72) and the corresponding routing path, we have $j' = j + 1$. That is,
131.( + 1) + Dc + D7(1) + Dc + Fi(1 + 2)
= B i(j + 2) + Dc + D''(1) + Dc + Fi(j + 3)
513/j + 2) + Dc + Max(D7(1))1p,route2 + DcFi(j
Max(Fi(j + 3) + 13/1 + 2) + 2Dc,
F/j + 3)F/j + 1)F/j + 2)Blj + 1)
24 xu + DA2DcDi+ 1(J + 1)Di (1) i(j + 2),
Di/J(j + 2) + 1)Fi(j + 2) + F/j + 3)13/j + 1)2Dc
D?_4E1(j + 1)Dr(1)T4_4F i(j + 2)D7(j + 1)(7.82)
Second, we find the maximum delay based on Max(DY(j' + 2))I p.Note that this is
equivalent to finding 13 (j + 1) + Dc + DtY(j' + 2) + Dc + Fi(j + 2). From expression
(7.73) and the corresponding routing path, we know ./ = j + 1.That is,Bi(j + 1) + Dc + + 2) + Dc + F fj + 2)
= B i(j + 2) + Dc + DliY(f + 2) + Dc + Flj + 3)
B i(j + 2) + Dc + Max(ViY(1 + 2))1p + Dc + F fj + 3)
Max( F i(j + 3) + B fj + 2) + 2Dc,
Flj + 3)Fi(j + 1)Fi(j + 2)Bi(j + 1) + DA2Dc
Dr+1(j + 1)141_ i(j + 2)Dr(j"' + 1),
DP (j" + 2) + F i(j + 3)Fi(j + 1)F i(j + 2)Bi(j + 1)
2DcDi:11_1(j + 1)141"(j + 2)
Dp(jw + 1)D7(j" + 1))
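Combining (7.81), (7.82) and (7.83), we have
$$\begin{aligned}
\mathrm{Max}\big(B_i(j+1) + D_C + D_i^{ty}(j'+1) + D_C + F_i(j+2)\big)\big|_j \le \mathrm{Max}\big(&F_i^{Max} + B_i^{Max} + 2D_C,\\
&D_p^{in,Max} + F_i^{Max} - F_i^{min} - D_q^{in,min},\\
&F_i^{Max} - 2F_i^{min} - B_i^{min} + D_A - 2D_C,\\
&D_p^{in,Max} + F_i^{Max} - 2F_i^{min} - B_i^{min} - 2D_C\big)
\end{aligned} \tag{7.84}$$
Similarly, we have
$$\begin{aligned}
\mathrm{Max}\big(B_i(j+1) + D_C + D_i^{xu}(j'+1) + D_C + F_i(j+2)\big)\big|_j \le \mathrm{Max}\big(&F_i^{Max} + B_i^{Max} + 2D_C,\\
&D_q^{in,Max} + F_i^{Max} - F_i^{min} - D_p^{in,min},\\
&F_i^{Max} - 2F_i^{min} - B_i^{min} + D_A - 2D_C,\\
&D_q^{in,Max} + F_i^{Max} - 2F_i^{min} - B_i^{min} - 2D_C\big)
\end{aligned} \tag{7.85}$$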





In conclusion, to find the lower and upper bounds of $D_i^{in}(j+2)$, the following
expressions are observed. From (7.67) and (7.68), we have
$$\min\big(D_i^{in}(j+2)\big)\big|_j \ge B_i^{min} + F_i^{min} + 2D_C \tag{7.86}$$
$$\begin{aligned}
\mathrm{Max}\big(D_i^{in}(j+2)\big)\big|_j
= \mathrm{Max}\big(\min\big(&B_i(j+1) + D_C + D_i^{tu}(j'+1) + D_C + F_i(j+2),\\
&B_i(j+1) + D_C + D_i^{xy}(j'+1) + D_C + F_i(j+2),\\
&B_i(j+1) + D_C + D_i^{ty}(j'+1) + D_C + F_i(j+2),\\
&B_i(j+1) + D_C + D_i^{xu}(j'+1) + D_C + F_i(j+2)\big)\big)\big|_j\\
\le \min\big(&\mathrm{Max}(B_i(j+1) + D_C + D_i^{tu}(j'+1) + D_C + F_i(j+2))|_j,\\
&\mathrm{Max}(B_i(j+1) + D_C + D_i^{xy}(j'+1) + D_C + F_i(j+2))|_j,\\
&\mathrm{Max}(B_i(j+1) + D_C + D_i^{ty}(j'+1) + D_C + F_i(j+2))|_j,\\
&\mathrm{Max}(B_i(j+1) + D_C + D_i^{xu}(j'+1) + D_C + F_i(j+2))|_j\big)\\
\le \min\big(&(7.79),\ (7.80),\ (7.84),\ (7.85)\big)
\end{aligned} \tag{7.87}$$
According to the work by Hulgaard and Burns [21], their algorithm and tool
cannot identify the exact bounds for a system which contains a mutually exclusive component
(such as the Arbiter module) or whose performance is data-dependent (such as
the Select module for data selection, as will be shown in the next section). However,
they can still approximate the exact upper bound from the right direction, i.e., with the
result greater than or equal to the exact upper bound. Figure 7.8(a) shows the circuit
under simulation and Figure 7.8(b) demonstrates its corresponding Petri net representation.
Table 7.4 compares our approximations (UP1) with the approximations obtained
using CTSE (Appro.) for the output loop delay of the Arbiter/Call pipeline. In these
simulations our approximation is never larger than the CTSE approximation and is smaller in five of the ten cases.
This is also depicted in Figure 7.8(c). Note that the physical delays of Arbiter and


















Figure 7.8: An Arbiter/Call pipeline and its Petri net model.
(a) General representation of an Arbiter/Call pipeline.
(b) Petri net model.
(c) Result comparison.
Table 7.4: Comparison of our approximations with the other approximations
using a general Arbiter/Call pipeline as an example.
Each bracketed entry is a [min Max] delay bound.
Case            1        2        3        4        5        6        7        8        9        10
$D_q^{in}$      [10 20]  [10 41]  [17 37]  [27 37]  [20 47]  [21 31]  [21 51]  [10 21]  [10 41]  [27 47]
$D_p^{in}$      [3 4]    [9 30]   [12 35]  [12 25]  [10 35]  [4 20]   [4 40]   [9 40]   [9 40]   [12 35]
$F_i$           [12 20]  [12 30]  [20 25]  [20 25]  [12 25]  [12 25]  [22 25]  [2 15]   [2 25]   [15 25]
$B_i$           [2 4]    [8 10]   [13 20]  [3 10]   [13 18]  [8 14]   [8 24]   [8 14]   [8 14]   [13 18]
$D_{i+1}^{out}$ [10 29]  [20 24]  [19 35]  [19 30]  [9 20]   [10 44]  [10 44]  [10 34]  [10 24]  [29 35]
Appro.          29       48       45       35       48       44       49       53       63       45
UP1             29       40       45       35       43       44       49       34       53       43
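The UP1 row can be cross-checked with a short Python sketch. This is an illustration under explicit assumptions rather than the procedure actually used: the function names are hypothetical, the four path bounds (t to u, x to y, t to y, x to u) encode one reading of the upper bounds derived above, and the output loop delay bound is assumed to be the larger of the resulting equivalent input delay bound and the maximum output delay. With the Arbiter and Call delays set to zero, these assumptions reproduce the tabulated UP1 values for the cases tested.

```python
def arbiter_input_upper(dq_in, dp_in, f, b, d_a=0, d_c=0):
    """Upper bound on the equivalent input delay of an Arbiter/Call pipeline.

    dq_in, dp_in, f, b are (min, Max) pairs; d_a and d_c are the assumed
    constant propagation delays of Arbiter and Call.
    """
    f_min, f_max = f
    b_min, _b_max = b
    base = f_max + b[1] + 2 * d_c

    def same_side(d_in_max):       # paths t -> u (q side) and x -> y (p side)
        return max(base, f_max - f_min + d_a, d_in_max + f_max - f_min)

    def cross_side(own, other):    # paths t -> y and x -> u
        return max(
            base,
            own[1] + f_max - f_min - other[0],
            f_max - 2 * f_min - b_min + d_a - 2 * d_c,
            own[1] + f_max - 2 * f_min - b_min - 2 * d_c,
        )

    # First come, first served: the bound is the minimum over the four paths.
    return min(same_side(dq_in[1]), same_side(dp_in[1]),
               cross_side(dp_in, dq_in), cross_side(dq_in, dp_in))


def arbiter_loop_upper(dq_in, dp_in, f, b, d_out, d_a=0, d_c=0):
    """Assumed loop bound: larger of the input-side bound and Max output delay."""
    return max(arbiter_input_upper(dq_in, dp_in, f, b, d_a, d_c), d_out[1])
```

Case 9 of Table 7.4, for example, corresponds to `arbiter_loop_upper((10, 41), (9, 40), (2, 25), (8, 14), (10, 24))`.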
7.5 Performance Of Asynchronous Pipelines With Select/XOR Pair
It is common to have an application that is data-dependent and needs a mechanism
to steer the incoming token to an appropriate destination based upon the nature of
the data. This mechanism can be implemented using a Select/XOR pair in asynchronous
pipelines, as shown in Figure 7.9(a). The purpose of this section is to find the performance
of asynchronous pipelines with a Select/XOR pair.
We first define the function of Select and its token sequence on the c terminal. When
the token value on c is High, the incoming token on s will be steered to the output
terminal marked T. On the other hand, if the token value on c is Low, the token on s is
routed to the F terminal. Among the values of the token sequence on c, if we let $S_T$ represent
the number of consecutive Highs and $S_F$ the number of consecutive Lows, then we have
$N_T^{min} \le S_T \le N_T^{Max}$ and $N_F^{min} \le S_F \le N_F^{Max}$, where $N_T^{min}$ is the minimum number of consecutive
Highs, $N_T^{Max}$ the maximum number of consecutive Highs, $N_F^{min}$ the minimum number of
consecutive Lows and $N_F^{Max}$ the maximum number of consecutive Lows in a control sequence.









Figure 7.9: A two-dimensional micropipeline: Select/XOR.
(a) General representation of a pipeline with Select/XOR.
(b) The special case when $N_T^{Max} = 0$.
(c) Equivalent circuit.
For example, with the control sequence
c: { HHH L HHH LLLL HH LLLLLLLLL HHHH LL HHHHH }
we have $N_T^{min} = 2$, $N_T^{Max} = 5$, $N_F^{min} = 1$ and $N_F^{Max} = 9$, i.e., $2 \le S_T \le 5$ and $1 \le S_F \le 9$. Note
that, if $N_T^{Max} = 1$ (implying $N_T^{min} = 1$) and $N_F^{Max} = 1$ (implying $N_F^{min} = 1$), this is equivalent
to the case of the Toggle/XOR pair as discussed before. If $N_F^{Max} = 0$ (implying $N_F^{min} = 0$), this
indicates that there are no Lows in the token sequence and the resulting pipeline becomes a
linear pipeline as shown in Figure 7.9(b). The same is true if $N_T^{Max} = 0$. The delays $D_S$ and
$D_X$ in Figure 7.9(b) are the propagation delays for the Select and XOR modules, respectively.
Since the above two special cases have been discussed, this section will focus on
the case when $N_T^{Max} \ge N_T^{min} \ge 1$ and $N_F^{Max} \ge N_F^{min} \ge 1$ (precluding the case of $N_T^{Max} = N_F^{Max} = 1$).
Note that no logic delay or waiting delay occurs in the Select and XOR modules.
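The run-length bounds of a control sequence can be extracted mechanically. The following Python sketch uses a hypothetical helper, `run_length_bounds`, and assumes the sequence is given as a string of `H` and `L` characters, with blanks ignored; applied to the example sequence above it gives the bounds quoted there.

```python
from itertools import groupby


def run_length_bounds(control):
    """Return the minimum and maximum numbers of consecutive Highs and Lows.

    `control` is a string of 'H' and 'L' characters; blanks are ignored.
    A symbol that never occurs yields bounds of 0, which corresponds to the
    linear-pipeline special case.
    """
    runs = {"H": [], "L": []}
    for symbol, group in groupby(control.replace(" ", "")):
        runs[symbol].append(len(list(group)))
    return {
        "NT_min": min(runs["H"], default=0),
        "NT_max": max(runs["H"], default=0),
        "NF_min": min(runs["L"], default=0),
        "NF_max": max(runs["L"], default=0),
    }


example = run_length_bounds("HHH L HHH LLLL HH LLLLLLLLL HHHH LL HHHHH")
```

A strictly alternating sequence such as `"HLHLHL"` gives all four bounds equal to 1, which is the Toggle/XOR case.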
We still employ the equivalent delay technique to approach the performance of
the Select/XOR pipeline. By looking into cross sections q and p in Figure 7.9(a), its
equivalent circuits are shown in Figure 7.9(c). Again, the notations $\mathrm{Max}(f(\cdot))|_p$ and
$\min(f(\cdot))|_p$ denote the maximum and minimum delays of the function $f(\cdot)$ with respect to
all the possible token traveling paths, respectively.
Now, consider the case when $N_T^{min} = 1$. The shortest token traverse route for
$D_{i,q}^{in}(j'+2)$ is u → v → w → r → $N_F^{min}$(s → x → y → v → w → r → s) → t. If $N_T^{min} \ge 2$, the shortest
path becomes u → v → w → r → s → t. Therefore, the traverse path can be represented
as follows in terms of propagation and logic delays.
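$$\begin{aligned}
\min\big(D_{i,q}^{in}(j'+2)\big)\big|_p \ &\text{when } N_T^{min} = 1\\
= {}& D_X + B_i(j+1) + D_i^{lg}(j+2) + F_i(j+2)\\
&+ \big[D_S + D_{i+1,p}^{lg}(j''+1) + D_X + B_i(j+2) + D_i^{lg}(j+3) + F_i(j+3)\\
&\quad + D_S + D_{i+1,p}^{lg}(j''+2) + D_X + B_i(j+3) + D_i^{lg}(j+4) + F_i(j+4) + \cdots\\
&\quad + D_S + D_{i+1,p}^{lg}(j''+N_F^{min}) + D_X + B_i(j+N_F^{min}+1) + D_i^{lg}(j+N_F^{min}+2) + F_i(j+N_F^{min}+2)\big] + D_S
\end{aligned} \tag{7.88}$$
$$\ge \sum_{l=2}^{N_F^{min}+2} F_i(j+l) + \sum_{l=1}^{N_F^{min}+1} B_i(j+l) + (N_F^{min}+1)D_X + (N_F^{min}+1)D_S \tag{7.89}$$
$$\begin{aligned}
\min\big(D_{i,q}^{in}(j'+2)\big)\big|_p \ &\text{when } N_T^{min} \ge 2\\
= {}& D_X + B_i(j+1) + D_i^{lg}(j+2) + F_i(j+2) + D_S\\
\ge {}& F_i(j+2) + B_i(j+1) + D_X + D_S
\end{aligned} \tag{7.90}$$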
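Similarly, when $N_T^{min} = 1$, $\mathrm{Max}(D_{i,q}^{in}(j'+2))|_p$ has the traverse path
u → v → w → r → $N_F^{Max}$(s → x → y → v → w → r → s) → t. It can be represented as follows.
$$\begin{aligned}
\mathrm{Max}\big(D_{i,q}^{in}(j'+2)\big)\big|_p
= {}& D_X + B_i(j+1) + D_i^{lg}(j+2) + F_i(j+2)\\
&+ \big[D_S + D_{i+1,p}^{lg}(j''+1) + D_X + B_i(j+2) + D_i^{lg}(j+3) + F_i(j+3)\\
&\quad + D_S + D_{i+1,p}^{lg}(j''+2) + D_X + B_i(j+3) + D_i^{lg}(j+4) + F_i(j+4) + \cdots\\
&\quad + D_S + D_{i+1,p}^{lg}(j''+N_F^{Max}) + D_X + B_i(j+N_F^{Max}+1) + D_i^{lg}(j+N_F^{Max}+2) + F_i(j+N_F^{Max}+2)\big] + D_S
\end{aligned} \tag{7.91}$$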
The representation of $D_{i+1,p}^{lg}(j''+1)$ depends on the token traveling route. Since the
smaller $D_{i,p}^{in}(j'+2)$ is, the larger $D_{i+1,p}^{lg}(j''+1)$ is, we need to find the shortest token
traveling path of $D_{i,p}^{in}(j'+2)$ in order to get the largest value of $D_{i+1,p}^{lg}(j''+1)$. Among
different routes, the maximum value of $D_{i+1,p}^{lg}(j''+1)$ occurs when the token travels
the upper path (c = High) only once, right before it travels $N_F^{Max}$ times on the lower path
(c = Low). That is, the control sequence is { ...LL H LL...LLL ($N_F^{Max}$ times) H... } (assume
$N_T^{min} = 1$). With this kind of control sequence, we have
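$$\mathrm{Max}\big(D_{i+1,p}^{lg}(j''+1)\big)\big|_p = \mathrm{Max}\Big(0,\; D_p^{out}(j'') - \big[D_X + B_i(j) + D_i^{lg}(j+1) + F_i(j+1) + D_S + D_{i+1,q}^{lg}(j'''+1) + D_X + B_i(j+1) + D_i^{lg}(j+2) + F_i(j+2) + D_S\big]\Big) \tag{7.92}$$
$$D_i^{lg}(j+2) = \mathrm{Max}\Big(0,\; D_{i-1}^{in}(j+2) - \big[F_i(j+1) + D_S + D_{i+1,q}^{lg}(j'''+1) + D_X + B_i(j+1)\big]\Big) \tag{7.93}$$
The following logic delays become obvious with $N_F^{Max}$ consecutive Low control signals. They
are
$$D_{i+1,p}^{lg}(j''+2) = \mathrm{Max}\Big(0,\; D_p^{out}(j''+1) - \big[D_X + B_i(j+2) + D_i^{lg}(j+3) + F_i(j+3) + D_S\big]\Big) \tag{7.94}$$
$$\vdots$$
$$D_{i+1,p}^{lg}(j''+N_F^{Max}) = \mathrm{Max}\Big(0,\; D_p^{out}(j''+N_F^{Max}-1) - \big[D_X + B_i(j+N_F^{Max}) + D_i^{lg}(j+N_F^{Max}+1) + F_i(j+N_F^{Max}+1) + D_S\big]\Big) \tag{7.95}$$
$$D_i^{lg}(j+3) = \mathrm{Max}\Big(0,\; D_{i-1}^{in}(j+3) - \big[F_i(j+2) + D_S + D_{i+1,p}^{lg}(j''+1) + D_X + B_i(j+2)\big]\Big) \tag{7.96}$$
$$\vdots$$
$$D_i^{lg}(j+N_F^{Max}+2) = \mathrm{Max}\Big(0,\; D_{i-1}^{in}(j+N_F^{Max}+2) - \big[F_i(j+N_F^{Max}+1) + D_S + D_{i+1,p}^{lg}(j''+N_F^{Max}) + D_X + B_i(j+N_F^{Max}+1)\big]\Big) \tag{7.97}$$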
If $N_F^{Max}$ is a large number, the complexity of the representation of (7.91) in terms of
the expressions ranging from (7.92) to (7.97) grows rapidly. To reduce the complexity,
we let $N_F^{Max} = 2$ ($N_T^{min} = 1$ is implied in deriving (7.92)) in (7.91). That is,
$$\begin{aligned}
\mathrm{Max}\big(D_{i,q}^{in}(j'+2)\big)\big|_p
= {}& D_X + B_i(j+1) + D_i^{lg}(j+2) + F_i(j+2)\\
&+ \big[D_S + D_{i+1,p}^{lg}(j''+1) + D_X + B_i(j+2) + D_i^{lg}(j+3) + F_i(j+3)\\
&\quad + D_S + D_{i+1,p}^{lg}(j''+2) + D_X + B_i(j+3) + D_i^{lg}(j+4) + F_i(j+4)\big] + D_S
\end{aligned} \tag{7.98}$$
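If we let
$$D_{i-1}^{in,min} \le D_{i-1}^{in}(j+2) \le D_{i-1}^{in,Max},\quad F_i^{min} \le F_i(j+1) \le F_i^{Max},\quad B_i^{min} \le B_i(j+1) \le B_i^{Max},$$
$$D_p^{out,min} \le D_p^{out}(j+1) \le D_p^{out,Max},\quad D_q^{out,min} \le D_q^{out}(j+1) \le D_q^{out,Max},$$
then according to (7.89) and (7.90), we have
$$\begin{aligned}
\min\big(D_{i,q}^{in}(j'+2)\big)\big|_j \ &\text{when } N_T^{min} = 1\\
= {}& \min\big(\min(D_{i,q}^{in}(j'+2))|_p\big)\big|_j\\
\ge {}& (N_F^{min}+1)F_i^{min} + (N_F^{min}+1)B_i^{min} + (N_F^{min}+1)D_X + (N_F^{min}+1)D_S
\end{aligned} \tag{7.99}$$
$$\min\big(D_{i,q}^{in}(j'+2)\big)\big|_j \ \text{when } N_T^{min} \ge 2 \ \ \ge\ F_i^{min} + B_i^{min} + D_X + D_S \tag{7.100}$$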
Similarly, by substituting expressions (7.92) to (7.97) with $N_F^{Max} = 2$ into (7.98) and doing
some simple mathematical manipulations, we have
Max(D1i7q(j + 2))1i
Max(Max(Dii7q(j + 2))1p)li
< Max(3Frax + 3Brax + 3Dx +3Ds,
Din ,Max3 FraxPPM + 2 Brax2Dx + 2Ds,
1
Dout,Max2 Flitlax+ 2 Brax2Dx + 2Ds,
Din,maxnout,max+ 2Frax + 2Br'2 B"nin
t1 P t'
2 Din,Max2 FraxFrinin B MaxDXD
1 S,
2D"t'Alax + FMaxFr" + BYlaxBr,
Din,MaxDout,MaxFrax + 2Brax _B""1" + DX + DS t -1 P i
nout,MaxDin,Max+ 2 PillaxF"ni" + BY"' + DX + DS,
L' P i1 t




+Din,Max+ Pr' + Br'2frini"D xD s,
2 Din'AlaxDout,MaxFraxpunB axBmin
i1 '







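Since the upper path and lower path in Figure 7.9(a) are symmetrical, we can immediately
obtain the upper and lower bounds of $D_{i,p}^{in}(j'+2)$ from (7.99) and (7.101). That is,
$$\min\big(D_{i,p}^{in}(j'+2)\big)\big|_j \ \text{when } N_F^{min} = 1 \ \ \ge\ (N_T^{min}+1)F_i^{min} + (N_T^{min}+1)B_i^{min} + (N_T^{min}+1)D_X + (N_T^{min}+1)D_S \tag{7.102}$$
$$\min\big(D_{i,p}^{in}(j'+2)\big)\big|_j \ \text{when } N_F^{min} \ge 2 \ \ \ge\ F_i^{min} + B_i^{min} + D_X + D_S \tag{7.103}$$
$$\mathrm{Max}\big(D_{i,p}^{in}(j'+2)\big)\big|_j \le \text{expression (7.101) with } D_p^{out,Max} \text{ replaced by } D_q^{out,Max} \tag{7.104}$$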
The overall performance of Figure 7.9(a) can be obtained by substituting expres-
sions (7.99) through (7.104) into Figure 7.9(c).
Figure 7.10(b) shows the Petri net model of the general Select/XOR circuit shown
in Figure 7.10(a). Unfortunately, CTSE does not allow the user to specify finite values
for $N_T^{min}$, $N_T^{Max}$, $N_F^{min}$ and $N_F^{Max}$; that is, the tool only simulates the Petri net with $N_T^{Max} = \infty$
and $N_F^{Max} = \infty$ for the upper bound. Therefore, the output loop delays for both upper
and lower sections are infinite. Table 7.5 summarizes the simulation results based upon
our approach with $N_T^{Max} = 2$, $N_T^{min} = 1$, $N_F^{Max} = 2$ and $N_F^{min} = 1$. Note that the physical delays of
Select and XOR are assumed to be zero in our simulations. The result is also plotted
in Figure 7.10(c).
Table 7.5: The result of our approximation to a general
Select/XOR pipeline.
Each bracketed entry is a [min Max] delay bound.
Case            1        2        3        4        5        6        7        8        9        10
$D_{i-1}^{in}$  [2 15]   [2 10]   [2 10]   [2 10]   [2 10]   [2 15]   [2 15]   [6 30]   [16 35]  [2 15]
$F_i$           [1 8]    [1 18]   [9 10]   [9 15]   [9 15]   [11 18]  [1 13]   [10 21]  [10 11]  [1 10]
$B_i$           [3 8]    [3 8]    [3 10]   [3 10]   [3 10]   [3 8]    [3 10]   [14 17]  [14 17]  [3 8]
$D_q^{out}$     [19 25]  [9 15]   [5 10]   [9 20]   [29 48]  [19 25]  [30 35]  [10 35]  [10 35]  [25 30]
$D_p^{out}$     [11 15]  [11 15]  [11 15]  [12 45]  [2 35]   [11 35]  [29 34]  [22 24]  [22 44]  [31 33]
UP1q            70       79       60       103      85       87       113      117      126      108














Figure 7.10: A Select/XOR pipeline and its Petri net model.
(a) General representation of a Select/XOR pipeline.




In this chapter, based upon the model of the linear pipeline developed in the previous
chapter, the performance of two-dimensional micropipelines, including Fork, Join,
Toggle/XOR, Arbiter/Call and Select/XOR pipelines, is discussed. Several simulations
are provided to compare our approximations with the exact bounds (except for the cases
of Arbiter/Call and Select/XOR pipelines) obtained using CTSE. These simulations
show that our approximations may reach the exact bounds in certain cases,
depending on the set of stage delays. In the simulations, whenever the result based on our
approach does not match the exact bound, it is greater than the exact bound. This indicates that,
based on the simulation results, our approximation approaches the exact bound
from the right-hand side. In the Arbiter/Call pipeline, simulations show that our approach
gives a better approximation than CTSE. With certain assumptions ($N_T^{min}$, $N_T^{Max}$, $N_F^{min}$ and
$N_F^{Max}$ are finite and known), our method can also approximate the upper bound of the Select/XOR
pipeline, whose upper bound cannot be directly obtained using the CTSE tool.
8. PERFORMANCE ANALYSIS AND DESIGN EXAMPLES
OF TWO-DIMENSIONAL PIPELINES
In Chapter 5, the performance analysis of linear (one-dimensional) pipelines (FIFOs)
was discussed thoroughly. Designing a FIFO that satisfies the delay bound
requirement was also practiced. In Chapter 7, the performance of five two-dimensional
pipeline constructs based on the asynchronous modules was analyzed individually.
These two-dimensional pipelines include Fork, Join, Toggle/XOR, Arbiter/Call and
Select/XOR. The purpose of this chapter is to analyze the performance of larger systems
consisting of FIFO, Fork, Join and Toggle/XOR based on the previous analyses. The results
are compared with the exact values. Therefore, Arbiter/Call and Select/XOR are excluded
from this system since their exact bounds cannot be obtained through CTSE. A
design example is also implemented.
8.1 A Performance Analysis Example: Open-Loop System
A system which is called open loop and contains FIFO, Fork, Join and Toggle/XOR
is provided in this section for the performance analysis. The definition of an open-loop
system is best understood through an example and will be discussed
later in this section. A closed-loop system is also possible and will be described
in the next section. Again, only the maximum output loop delay (maximum cycle time) is
calculated. Figure 8.1 shows the open-loop system to be used here. Our goal is to obtain
the maximum value of the output loop delays $T_{i+2,t}(j+1)$, $T_{i+2,u}(j+1)$ and $T_{i+2,p}(j+1)$.
However, to achieve the goal, several constraints, described as follows, are imposed on
the method of approximation developed in the previous chapter.
1. To find 71+4(j + 1), Dipj + 2) and Dabl(j + 2) should be known in advance.














Figure 8.1: A two-dimensional pipeline system, open loop.
3. To obtain $T_{i+2,t}(j+1)$ and $T_{i+2,u}(j+1)$, $D_a^{in}(j+2)$ should be known first.
4. To get $D_a^{in}(j+2)$, $D_d^{out}(j+1)$ should be obtained in advance.
5. To get $D_d^{out}(j+1)$, $D_f^{in}(j+2)$ should be obtained in advance.
6. Di(j + 2) and Drt(j + 1) can be obtained directly from given forward and
backward delays without requiring the knowledge of any other equivalent input/output
delays.
According to the constraints stated above, a flow chart for obtaining
$\mathrm{Max}(T_{i+2,t}(j+1))|_j$, $\mathrm{Max}(T_{i+2,u}(j+1))|_j$ and
$\mathrm{Max}(T_{i+2,p}(j+1))|_j$ is depicted in Figure 8.2. This flow chart also
demonstrates the definition of "open loop." The system is open loop because each of the
three equivalent delay bounds needed to obtain $\mathrm{Max}(T_{i+2,p}(j+1))|_j$ is a
function of forward and backward delays only. If, for example, one of the equivalent input
delay bounds were a function not only of forward and backward delays but also of itself,
the system would be closed loop. The same argument applies to
$\mathrm{Max}(T_{i+2,t}(j+1))|_j$ and $\mathrm{Max}(T_{i+2,u}(j+1))|_j$.
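The open-loop/closed-loop distinction can be read as a property of the dependency flow chart: the system is open loop when no equivalent delay depends, directly or indirectly, on itself. The following is a minimal Python sketch of that check. The node names (`D_a_in`, `D_d_out`, `D_f_in`, and so on) and both dependency maps are hypothetical illustrations, not a transcription of Figure 8.2 or Figure 8.4.

```python
from graphlib import TopologicalSorter, CycleError


def is_open_loop(depends_on):
    """Return True when the dependency graph has no cycle (open loop)."""
    try:
        # static_order() raises CycleError if some quantity depends on itself.
        list(TopologicalSorter(depends_on).static_order())
        return True
    except CycleError:
        return False


# Hypothetical maps: key -> set of quantities that must be known first.
open_loop = {
    "T_t": {"D_a_in"},
    "T_u": {"D_a_in"},
    "D_a_in": {"D_d_out"},
    "D_d_out": {"D_f_in"},
    "D_f_in": set(),      # function of forward/backward delays only
}
closed_loop = {
    "T": {"D_f_in"},
    "D_f_in": {"D_out"},
    "D_out": {"D_f_in"},  # D_f_in ends up depending on itself
}

print(is_open_loop(open_loop))    # True
print(is_open_loop(closed_loop))  # False
```

In the open-loop case, the order returned by `static_order()` is also a valid order in which to evaluate the bounds.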
Based on the previous chapter's result, the expression corresponding to each
equivalent input/output delay in Figure 8.2 is represented as a function of forward and
backward delays and described as follows. The physical delays of Toggle (DT) and XOR















Figure 8.2: The flow chart for obtaining maximum output loop delays of Figure 8.1.
2. Din'Max
Max( 2Fiqc"-F2B 9"x,
Din'Max2Fm'Pnineax i,q t,q i,q
D out,Max ,FMgax inBMaxBmin




Din'AlaxFiriaxDin,min2 Pnwin t,p i7ginFYax Max fr I+1,p t+ 1,p'
Din,MaxFlifax przin Max Df z+1,p i+1,p t+1,p









Din'Max + 2FMax q i,q
DoutAfaxpkfax
' 1,q
PninB Max i,q i,q
Fmin+BMax
i,q
















out,Max MaX pmin _1_ Dt +r
i+1,r i+1,r'
BmaxWan i+1,r i+1,r'
BMax Bmin i+1,r i+1,r'
(8.7)
D ain'mc" +FMax (8.8)
The system in Figure 8.1 is also modeled using a Petri net and simulated using CTSE. The
results based on expressions (8.1) to (8.8) and on CTSE are tabulated in Table 8.1. This
table shows that most of the results using our approach are actually equal to the exact
values. Appendix A lists the CTSE process names and the corresponding Petri net models for
some of the basic modules used to construct the system of Figure 8.1. The actual CTSE code
for each basic module and the CTSE code for the system of Figure 8.1 itself are shown in
Appendix B.
Table 8.1: Comparison of our approximations with the exact bounds
using Figure 8.1 (open-loop system) as an example.
Case        1        2        3        4        5        6        7        8        9        10
D_q^{in}    [2 15]   [2 15]   [2 5]    [12 15]  [12 25]  [12 15]  [2 5]    [10 12]  [10 12]  [10 32]
D_p^{in}    [12 15]  [12 15]  [20 25]  [20 25]  [2 10]   [2 10]   [2 30]   [9 18]   [9 15]   [34 45]
F_{i,q}     [1 1]    [4 11]   [14 19]  [4 19]   [4 9]    [4 9]    [4 9]    [9 18]   [2 8]    [12 18]
B_{i,q}     [13 18]  [13 28]  [3 8]    [3 18]   [3 18]   [3 8]    [3 8]    [4 9]    [4 9]    [4 9]
F_{i,p}     [9 10]   [9 10]   [9 10]   [19 30]  [5 10]   [5 10]   [5 20]   [5 12]   [5 9]    [5 11]
B_{i,p}     [6 18]   [16 18]  [6 8]    [6 8]    [6 8]    [6 8]    [6 8]    [4 7]    [4 7]    [4 7]
F_{i+1,r}   [10 10]  [10 20]  [1 8]    [10 28]  [5 18]   [5 8]    [5 8]    [9 18]   [9 15]   [4 10]
B_{i+1,r}   [3 18]   [13 18]  [3 10]   [3 5]    [3 5]    [3 5]    [3 5]    [9 12]   [9 12]   [11 12]
F_{i+1,p}   [11 20]  [11 20]  [5 10]   [5 15]   [5 15]   [5 10]   [5 15]   [7 10]   [7 10]   [7 9]
B_{i+1,p}   [13 18]  [3 18]   [3 5]    [3 5]    [3 5]    [3 5]    [3 15]   [9 25]   [5 10]   [5 9]
D_t^{out}   [9 35]   [9 25]   [39 55]  [29 45]  [5 10]   [15 60]  [15 40]  [15 30]  [5 10]   [15 20]
D_u^{out}   [20 39]  [20 29]  [11 15]  [11 25]  [31 59]  [31 49]  [31 59]  [12 48]  [12 58]  [22 28]
D_p^{out}   [21 49][21 39] [22 75][22 30]
Max,t       59       88       61       98       74       60       80       78       97       76
UP1,t       59       88       61       100      74       60       80       78       97       76
Max,u       59       88       69       98       70       65       64       78       97       76
UP1,u       59       88       69       100      70       65       80       78       97       76
Max,p       68       87       77       87       91       77       81       68       75       72
UP1,p       68       87       77       87       91       77       81       68       75       72
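The claim that most approximations in Table 8.1 equal the exact values can be checked directly from the last six rows. The sketch below assumes that the "Max" rows are the exact CTSE values and the "UP1" rows are our approximations; the dictionary layout and variable names are illustrative only.

```python
# Values transcribed from the last six rows of Table 8.1.
exact = {
    "t": [59, 88, 61, 98, 74, 60, 80, 78, 97, 76],
    "u": [59, 88, 69, 98, 70, 65, 64, 78, 97, 76],
    "p": [68, 87, 77, 87, 91, 77, 81, 68, 75, 72],
}
approx = {
    "t": [59, 88, 61, 100, 74, 60, 80, 78, 97, 76],
    "u": [59, 88, 69, 100, 70, 65, 80, 78, 97, 76],
    "p": [68, 87, 77, 87, 91, 77, 81, 68, 75, 72],
}

pairs = [(a, e) for k in exact for a, e in zip(approx[k], exact[k])]
matches = sum(a == e for a, e in pairs)          # entries where both agree
conservative = all(a >= e for a, e in pairs)     # approximation never below exact

print(f"{matches} of {len(pairs)} entries match; conservative: {conservative}")
# 27 of 30 entries match; conservative: True
```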
8.2 A Performance Analysis Example: Closed-Loop System
As shown in the previous section, the results for each two-dimensional pipeline module
developed in Chapter 7 can be applied immediately to obtain the output loop delays of an
open-loop system constructed from these modules. Table 8.1 shows that this direct
application gives a good approximation. In this section, a similar approach is applied to
the closed-loop system shown in Figure 8.3. The goal for this system is to obtain the
maximum output loop delay $\mathrm{Max}(T_{i+3}(j+1))|_j$. The dependency in the process of
reaching this goal is depicted in Figure 8.4. Based on this dependency flow chart, the
system in Figure 8.3 is a closed-loop system since $D_f^{in,Max}$, for example, is a function
not only of forward and backward delays but also of the maximum delay $D_f^{in,Max}$ itself.
In terms of token index, $D_f^{in}(j+k)$ is a function of $D_f^{in}(j+l)$ in this closed-loop
system, where $k > l \ge 2$. The equivalent input and output delays looking into different
cross sections are described as follows.
Max(71+ (j + 1))1i 5_ Max( IY1+"max,
Fr+1 + 131;4+S
Din,MaxFMaxFmin f t +2 i+2'
Din,MaxFMaxFmin
i+2 i+21
Din,Max < f-Max( Plax BMax
1+ I,q -1,q,
Dibn'Max + Fnxa'i,qFmin
Di.h".max < Max( Pill' + Bra x,
D outmaxFrax
Din'MaxFMaxrnin) t -1
Deout,Maxmax( FrZ + Bnalp,
Diout,MaxBr+a: min
nout max < Max( Fr_i_ a+ Bn,




























Figure 8.3: A two-dimensional pipeline system, closed loop.
Figure 8.4: The dependency flow chart of Figure 8.3.
Using backward substitution for the expressions from (8.10) to (8.13), we conclude
Din,Maxf
5_ Max( Fr+a+ Br+qq,
P/1" + Brax + min
Fr+a-+ Br:if+ Frax + Bir ic"frninpMax min +1+1,g t+1,g'
2Plax Pnin-I- PYlax + BYax + BM' Irnin+ Fr=Firth t+1,g t+ 1,p t+2 t+2 i +1,p t+1,p
+ Br"Yi +ninPinini,q,
PYI_Fa + Bir_Faj + BM' Bmin FraxFTI" + BY'frinIn i+1,p i+1,p i 1
+ FrIqgF`inni,q,
BMax Bmin+ FM"Pnin + Br"frinin Dout,MaxpMaxpr.ni
i+3 t+ ,pt+ ,p
pMax Fmin +t+1,g i+1,g'
Din'max + FI/1"FTIn + FlaxPith t+l,q t+1,q' i-1
Dfin'MaxFMax Pn+in2 + Bnp
Pinin + Braxfrinin + FM"Pnin) (8.14) t+ 1,g i +1,q
From (8.14), by looking at the very last element, the sum of the $F^{Max} - F^{min}$ and
$B^{Max} - B^{min}$ difference terms added to $D_f^{in,Max}$ is always positive. This
suggests that $D_f^{in,Max}$ could be infinite, since the statement
$D_f^{in,Max} \le D_f^{in,Max}$ + (positive number) always holds for an arbitrary value of
$D_f^{in,Max}$. However, $D_f^{in,Max}$ cannot be infinite in this system. The reason (8.14)
admits an infinite value of $D_f^{in,Max}$ is that the logic delay terms which have a
preceding "−" sign are dropped in this expression. Instead of bringing the logic delay terms
back into (8.14) (which would lead to a very complex expression), one more step is employed
before the regular approach is applied. This extra step is to find a relationship for the
"loop" and then break this loop down.
By looking at Figure 8.5, two different loops are observed. One is a data-token loop, which
is formed by two data-token traveling paths, marked by black solid and gray solid lines. The
other is a space-token loop, which is formed by two space-token routing paths, highlighted
by black dashed and gray dashed lines. The time spent on these paths by the two data tokens
should be the same. That is,
De+ 1,q(j + 1) + Fi+1,q(j + 1) + 14_4 2,q(j + 1) + Dr(j + I )
=D?L,p(j + 1) + Fi+ Lp(j + 1) + D?1_2,p(j + 1) + 1307(j + 1) (8.15)
Df_4F1,q(j + 2) + Fi+ 1,q(j + 2) + D2,q(j + 2) + Drj + 2)
= 1,p(j + 2) + Fi+l,p(j + 2) + Dfl_zp(j + 2) + + 2) (8.16)
Similarly, the traveling time for these two space tokens should be the same. That is,
D4 +2,q(j + 2) + B+ 1,q(j +2) + D4 +2 i,q(j + 3) + Dr(j + 3)
= D4F2,p(j + 2) + Bi+ i,p(j + 2) + + 3) + DIN + 3) (8.17)
If the Equal loop-delay theorem is applied to the C-element $C_1$, we have
D1_1.2,q(j + 2) + Dr(j + 2)= 14_2 2,p(j + 2) + D7(j + 2) (8.18)
Similarly, if the Equal loop-delay theorem is applied to the C-element $C_2$, we have
f;?`+1 i,q(j + 2) + Dr(j + 2)=DffIF Lp(j + 2) + DP (j + 2)
Note that /4:41 q(1) = Df_41p(1) = Dr(1) = Dcq(/) = 0.
Subtracting (8.18) from (8.16), we have
Df_F /,q(j + 2) + Fi+1,q(j + 2) + Df1.2,q(j + 2) f.2,q(j + 2)




Figure 8.5: Two loops: one data-token loop and one space-token loop.
Also, from the C-element's model, we know
Dr+2,q(j + 2) * Dr_2,q(j + 2)= 0




From (8.20), (8.21) and (8.22), we can obtain $\mathrm{Max}(T_{i+3}(j+1))|_j$ by discussing
all of the possible cases, as shown below.
By rearranging (8.20), we obtain
/411F2,g(j + 2)
= Fi+ i,p(j + 2)F i+ 1,q(j + 2) + 130?_f_ Lp(j + 2) + Elf2,p(j + 2) +1_2,q(j + 2)
+ 2)D4 _2F2,p(j + 2) (8.23)
Substituting (8.23) into (8.21), we have
Df2f.2q(j + 2) = 0 or
D42(j + 2) = Fi+1,q(j + 2)F1+1(j + 2) + Dr+1 Lq(j + 2) + 13012(j + 2) i+2,q t+2,p
Dr+I i,p(j + 2)D?_4F2,p(j + 2) (8.24)
Similarly, by rearranging (8.20), we obtain
Dr+12,p(j + 2)
= Fi+ 1,q(j + 2)Fi+ i,p(j + 2) + 141_ 1,q(j + 2) + 14.24_21j + 2) + 141.2,q(j + 2)
Dfl_14,(J + 2)D1+2 2,q(j + 2)
Substituting (8.25) into (8.22), we have
D4
+2,p(j + 2) = 0 or
(8.25)
D4 +2,p(j + 2)= Fi+ i,p(j + 2)Fi+j,q(j + 2) + D?_E1,p(j + 2) + D4 +2,q(j + 2)
+ 2).DE2,q(j + 2)
From (8.24) and (8.26), four cases are possible as described below.
Case 1>: 14+2 2q(j + 2) = 0 and D4 +22,p(j + 2) = 0
(8.26)
In this case, this implies inP(j + 2) = DV(j + 2) = 0.Hence, the equivalent
circuit is drawn in Figure 8.6(a). The Max(T!+3(j + 1))1i has the form of
Max(71+3(j + 1))1i = Max(DZmax,
+ Blitiap (8.27)
Case 2>: D1_2F2,q(j + 2) = 0 and
D4 _22,p(j + 2)= Fi+i,p(j + 2)Fi+1,,i(j + 2) + + 2)
Dr+I 1,q(j + 2)feEzq(j + 2) (8.28)
Since Dg2,q(j + 2) = 0, from (8.18) and the fact of Dr(j + 2) * DI:,,q(j + 2) =
0, we conclude DV(j + 2) = 0 and Dr(j + 2) = D4 21_21j + 2). The equivalent circuit
in this case is drawn in Figure 8.6(b). As long as Dcq(j + 1) (D/z)q(/) = 0) is derived,
Max(71+3(j + 1))1i can be obtained.It is the output loop delay of the FIFO with the
backward delay of the first stage being the sum of B,(j + 1) + IY;q(j + 1). From (8.19),
we know
Dcq(j + 2) = Max(0,
Dr+I 1,q(j + 2)Di+4 j,p(j + 2)) (8.29)
From (8.28), we have
1,q(j + 2)I4`+1 1,p(j + 2)
= Fi+Lp(j + 2)Fi+1,q(j + 2)rel_2,q(j + 2) + 2)
Combining (8.29) and (8.30), we obtain
Diz'q(j + 2)
= Max(0,









Figure 8.6: Equivalent circuits of Figure 8.3 corresponding to different cases.
(a) Equivalent circuit corresponding to case 1.
(b) Equivalent circuit corresponding to case 2.
(c) Equivalent circuit corresponding to case 3.
(d) Equivalent circuit corresponding to case 4.
Max(0,
Fi+Lp(j + 2)Fi+j,q(j + 2))
Therefore,
Bi(j + 1) + Dr(j + 1) 5 Max( Br',
Max+ Fmax ) i+ I,p t+ 1,q
From Figure 8.6(b), we have
Max(71+ (j + 1 ))ii
5 Max( D°+"3Max,
Max+ Bnx,
Fr+a)Lp + B + min
FMax+ Br' +FMax + 1 FT:2,
FAlaxBrax rtylax Fmin Fmin FMaxFmin
`-` t+ 1,p i+ 1,p i+ 1,q i+2 i+ 2'




Case 3>: D12t+ 2 p(j + 2) = 0 and
D4 F207(j+ 2) = F i+ 1,q(j + 2)Fi+ 1,p(j + 2) + 14_E i,q(j + 2)
D24
+ 1,p(> + 2)1)?+2,p(j + 2)
In this case, due to the symmetry of this circuit, we can obtain
$\mathrm{Max}(T_{i+3}(j+1))|_j$ by exchanging p and q in (8.32). That is,
Max(71+ 3(j + 1 ))ii
< Max( D'+ a4max
i3'
Fr_a + Brr2`,
FY' +Max+ Fr+ax2 min
1+ 1,q t+ 1,q
FraxBrax + Fi + Fra'Fmin




i+ I,p i+2 i+2'
Pnin Max min) t+1,q t+2 t+2
(8.33)
The equivalent circuit corresponding to this case is shown in Figure 8.6(c).
Case 4>: + 2) = Fi+ 1,p(j + 2)Fi+ 1,q(j + 2) + 14_4F 1,p(j + 2) + DIF2,q(j + 2)
D?+4 i,q(j + 2)D?:f2,q(j + 2) and
D4 t.24(J + 2) = Fi+ 1,q(j + 2)F i+ I,p(j + 2) + 1,q(j + 2) + Dg_zp(j + 2)
14_1,p(j + 2)1:;?1_2,p(j + 2)
In this case, if the above two equations are added together, the resulting equation
becomes D24i+2q(J + 2) + Di2+42p(J + 2) = 0.Since all logic delays, according to our
model, are greater than or equal to zero, L:14.4q(J + 2) = D?+4 + 2) = 0 in this case.
The equivalent circuit corresponding to this case is depicted in Figure 8.6(d).Since
Di_4E.2,q(J + 2) = Dr+i + 2) = 0 and Dr+12q(1) = D2 +2 p(1) = 0, we have










By looking into the Fork in Figure 8.6(d) and from (8.34) and (8.35), Dithi'max and Dcin.max
can be obtained. That is,




Max( Frax +BMax,Max+ + FMaxF'inin i+ 1,p z+ 1,p t
Din'Max + F 1111. axrinin) t -1
Brax _Bmin
Similarly, we have
D5 Max( Frax + Br',






Now, due to the equivalent FIFO structure, Dfin'Ma" can be easily obtained basedon
(8.36). Same procedure can be applied to obtain Din'max. Once Din'Alax and Din,Max
become known, Max(Ti1.+3(j + MI./ can be approximated through Joinstructure.That
is,
Max(Tli+ 3(j + 1))1j
< max( Dot,MCIX
+ 3'
F im+c'2Y + BMax
F';'141 + Bliw+17 + Fr4a2' min
Frax + Br" + Pnin+ FM=Pnth i +1,q i+1,q i+2 i+2'
Fn.pa-j+ +FMaxF'inin + Br" + FM' FTin t+1,q t+
+ FMaxPnin t+2 i+2'
Din,MaxFrixpninFMaxpnin FMaxFrnin
t -1 t+1,q t+1,q 1+2 i +2'
Frailp + BliY1:-"Lp + Fr_aF-Y min
Fr ax +BMax+ Fmin+ _Fmin
FrFajp + B+ Fr"pninBraxfrininFlitpFTFini,p
+ FMaxFmin
Din,Max+ Fnin + Fr + Fi FTin) L i:4 (8.38)
As a result of discussing the four different cases for the circuit of Figure 8.3, its
maximum output loop delay can be chosen as the maximum of (8.27), (8.32), (8.33)
and (8.38). That is,
Max(71+3(j + 1))Ii
Max(D°V:rx,
FMaxBMax i + 2 i + 2'





fpinBr naxBt in pnin
t+ 1,p z+1,q 1+1,q
FMaxFmin + 2 i+2'
FM ax+ Bra' + 2Fr+a% F'in+inLp + FT:12,
Din,MaxFMaxpininFMaxFain FMaxpmin
i 1 t+1,q t+1,q t+2 i +2'
FMax + FMi _r2Fmin i+1,p i + 2'
Fr ax +Br'+ Fr+qpFmin FMaxpmin
i+2 i+2'
FMaxBraL + + Br" + FY' FTin t+ 1,p t+ 1,p
FMaxFTin t+2 i +2'
FM axBrax + 2 Fr+1
+in1 ,pPin+inLq F111+6L2YFmin
Din'Yax + PraxFritzFrajp t +I+ Fr+a-'2'Ftinn2) (8.39)
Several simulations were run to compare our approximations based on (8.39) with
the exact values obtained through the Petri net model and CTSE simulation. The results are
summarized in Table 8.2.
Table 8.2: Comparison of our approximations with the exact bounds
using Figure 8.3 (closed-loop system) as an example.
Case        1        2        3        4        5        6        7        8        9        10
D^{in}      [6 10]   [2 15]   [2 5]    [12 15]  [12 25]  [12 15]  [2 5]    [10 12]  [10 12]  [10 32]
F_i         [4 9]    [4 11]   [14 19]  [4 19]   [4 9]    [4 9]    [4 9]    [9 18]   [2 8]    [12 18]
B_i         [3 5]    [13 28]  [3 8]    [3 18]   [3 18]   [3 8]    [3 8]    [4 9]    [4 9]    [4 9]
F_{i+1,q}   [25 28]  [9 10]   [9 10]   [19 30]  [5 10]   [5 10]   [5 20]   [5 12]   [5 9]    [5 11]
B_{i+1,q}   [13 15]  [16 18]  [6 8]    [6 8]    [6 8]    [6 8]    [6 8]    [4 7]    [4 7]    [4 7]
F_{i+1,p}   [20 30]  [10 20]  [1 8]    [10 28]  [5 18]   [5 8]    [5 8]    [9 18]   [9 15]   [4 10]
B_{i+1,p}   [19 25]  [13 18]  [3 10]   [3 5]    [3 5]    [3 5]    [3 5]    [9 12]   [9 12]   [11 12]
F_{i+2}     [24 29]  [11 20]  [5 10]   [5 15]   [5 15]   [5 10]   [5 15]   [7 10]   [7 10]   [7 9]
B_{i+2}     [23 28]  [3 18]   [3 5]    [3 5]    [3 5]    [3 5]    [3 15]   [9 25]   [5 10]   [5 9]
D^{out}     [19 25]  [9 25]   [39 55]  [29 45]  [5 10]   [15 60]  [15 40]  [15 30]  [5 10]   [15 20]
Max         60       59       55       58       53       60       42       39       30       46
UP1         70       70       55       96       61       60       51       54       45       46
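The gap between the two bottom rows of Table 8.2 can be quantified as a ratio per case. The sketch below assumes that the "Max" row holds the exact CTSE values and the "UP1" row holds our approximations; the variable names are illustrative.

```python
# Bottom two rows of Table 8.2, cases 1 to 10.
exact = [60, 59, 55, 58, 53, 60, 42, 39, 30, 46]
approx = [70, 70, 55, 96, 61, 60, 51, 54, 45, 46]

# Ratio of approximation to exact value for every case.
ratios = [a / e for a, e in zip(approx, exact)]
worst_case = max(range(len(ratios)), key=ratios.__getitem__) + 1  # 1-based case number
exact_hits = sum(a == e for a, e in zip(approx, exact))

print(f"worst case: {worst_case}, ratio {ratios[worst_case - 1]:.3f}")
print(f"cases where the approximation is exact: {exact_hits}")
```

With these rows the largest ratio is 96/58 in case 4, and the approximation never falls below the exact value.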
The results in Table 8.2 show that the approximation for a closed-loop system can be more
than one and a half times larger than the exact value. These results are not as close to
the exact values as those derived for the open-loop system (see Table 8.1) in the previous
section. However, if we treat the whole closed-loop system as a basic pipeline module and
derive the results by applying the Equal loop-delay theorem to all C-elements in this system
(as done in Chapter 7), we may be able to obtain a result closer to the exact value. This is
because, when the Equal loop-delay theorem is applied, the relationship or dependency among
logic delays, forward delays and backward delays is revealed through the difference
equations. Taking this dependency into account brings the result closer to the exact value.
For example, let
$A(j+1) = F(j+3) - F(j+2)$ (8.40)
$B(j+1) = F(j+2) - F(j+1)$ (8.41)
where $F^{min} \le F(j+1) \le F^{Max}$. Find the upper bound of $A(j+1) + B(j+1)$.
Approach 1> If we consider the dependency of $A(j+1)$ and $B(j+1)$, we have
$A(j+1) + B(j+1) = F(j+3) - F(j+2) + F(j+2) - F(j+1) = F(j+3) - F(j+1)$
$\mathrm{Max}(A(j+1) + B(j+1))|_j = F^{Max} - F^{min}$ (8.42)





From (8.42), (8.43) and (8.44), we have








From (8.42) and (8.45), we know there is an $F^{Max} - F^{min}$ difference between these two
approaches.
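The effect of the shared term $F(j+2)$ can be checked numerically. The sketch below enumerates all admissible delay values on an integer grid and compares the bound that keeps the dependency with the bound obtained when $A$ and $B$ are maximized separately and then added, which is assumed here to be what the second approach does. The function name, the integer grid and the bounds 4 and 9 are illustrative assumptions.

```python
from itertools import product


def bounds_with_and_without_dependency(f_min, f_max):
    """Return (dependent bound, independent bound) for A(j+1) + B(j+1)."""
    values = range(f_min, f_max + 1)  # integer grid of admissible delays
    # Dependent view: A and B share F(j+2), so it cancels in the sum.
    dependent = max((f3 - f2) + (f2 - f1)
                    for f1, f2, f3 in product(values, repeat=3))
    # Independent view: bound A and B separately, then add the two bounds.
    max_a = max(f3 - f2 for f2, f3 in product(values, repeat=2))
    max_b = max(f2 - f1 for f1, f2 in product(values, repeat=2))
    return dependent, max_a + max_b


dep, indep = bounds_with_and_without_dependency(4, 9)
print(dep, indep)  # 5 10
```

The two results differ by exactly $F^{Max} - F^{min}$, which is 5 for the bounds used here.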
Another kind of dependency also affects the accuracy of the approximation. When each module
was discussed in Chapter 7, all of the physical delays, including environmental input and
output delays and forward and backward delays, were assumed to be independent. For example,
in the Join module (Figure 7.3), $D_q^{in}(j+2)$ and $D_p^{in}(j+2)$ are independent of
$F_i(j+1)$ and $B_i(j+1)$. This assumption is valid when an open-loop system is considered.
It is, however, not true for a closed-loop system. For example,
when considering the Join module in Figure 8.3, the environmental input delays will be
Difn(j + 2) and DiN + 2). Also, from Figure 8.4, it is noticed that D f 'Max isa function
of Daut'Alax, and Dia'max is a function of Data'Alax. These imply that the environmental
delay bounds depend on the stage-delays $F_{i+2}^{Max}$, $F_{i+2}^{min}$, $B_{i+2}^{Max}$ or
$B_{i+2}^{min}$, violating our assumption. The concept of this type of dependency can be
further illustrated using a more













Note that expression (8.46) is the approximation of Figure 8.7(a). Also, note that each of
the components in (8.47) is equivalent to the maximum output loop delay representation of a
FIFO. Therefore, (8.47) is equivalent to the maximum of two maximum output loop delays
corresponding to two separate FIFOs, as illustrated in Figure 8.7(b). As a result of the
approximation equivalency of (8.46) and (8.47), the circuits in Figure 8.7(a) and
Figure 8.7(b) are also equivalent as far as the (approximate) maximum output loop delay is
concerned. However, caution should be taken when splitting a Join into two FIFOs. First, the
system should be open loop. That is, the value of
Din'Max is independent of D?"ma" and the value of Dth'max is independent of Ira'max. f 1 g h
Second, this conclusion applies to obtaining $\mathrm{Max}(T_{i+1}(j+1))|_j$ only and nothing else.


























Figure 8.7: A general system with Join module and its approximation equivalent.
(a) A general system with Join module.
(b) Approximation equivalent if the system is open-loop.
For example, in Figure 8.1, to find $\mathrm{Max}(T_{i+2,p}(j+1))|_j$, splitting the Join
module into two FIFOs is valid since the system is open loop. The result is the maximum of
the two individual maximum output loop delays. However, to find
$\mathrm{Max}(T_{i+2,t}(j+1))|_j$ and $\mathrm{Max}(T_{i+2,u}(j+1))|_j$, splitting the Join
module is disallowed. This is because $D_a^{in,Max}$ is required in order to find
$\mathrm{Max}(T_{i+2,t}(j+1))|_j$ and $\mathrm{Max}(T_{i+2,u}(j+1))|_j$, and $D_d^{out,Max}$
is related to $D_a^{in,Max}$ when the Toggle/XOR module is considered. Based on the caution
noted above, $D_d^{out,Max}$ will not be the same before and after the Join is split.
Therefore, the approximation equivalency does not hold when
$\mathrm{Max}(T_{i+2,t}(j+1))|_j$ and $\mathrm{Max}(T_{i+2,u}(j+1))|_j$ are calculated.
8.3 A Design Example
Our approach provides both a method for analyzing the performance of asynchronous circuits
and guidelines (or approaches) for designing such circuits. This is possible because we use
a symbolic, rather than a numerical, approach. A general FIFO example was studied with our
approach in Chapter 5. The design guidelines provided there can also be applied to a more
general pipeline circuit. In this section, we use the open-loop circuit in Figure 8.1 as an
example to demonstrate stage splitting, one of the design approaches.
Given the circuit architecture shown in Figure 8.1, the delays are assumed to be bounded as
follows.
$10 \le D_q^{in}(j+2) \le 15$,
$10 \le D_p^{in}(j+2) \le 12$,
$4 \le F_{i,q}(j+1) \le 9$,
$3 \le B_{i,q}(j+1) \le 8$,
$9 \le F_{i,p}(j+1) \le 30$,
$6 \le B_{i,p}(j+1) \le 8$,
$10 \le F_{i+1,r}(j+1) \le 18$,
$3 \le B_{i+1,r}(j+1) \le 5$,
$5 \le F_{i+1,p}(j+1) \le 15$,
$3 \le B_{i+1,p}(j+1) \le 5$,
$9 \le D_t^{out}(j+1) \le 15$,
$6 \le D_u^{out}(j+1) \le 10$, and
$2 \le D_p^{out}(j+1) \le 9$.
With the delay bounds shown above, the output loop delays $T_{i+2,t}(j+1)$,
$T_{i+2,u}(j+1)$ and $T_{i+2,p}(j+1)$ can be obtained through (8.1) to (8.8). That is,
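1. $D^{out,Max} \le \mathrm{Max}(18 + 5,\ 10 + 5 - 3,\ 15 + 5 - 3)$
$= \mathrm{Max}(23, 12, 17) = 23$ (8.48)
$D_f^{in,Max} \le \mathrm{Max}(30 + 8,\ 12 + 30 - 9)$
$= \mathrm{Max}(38, 33) = 38$ (8.49)
2. $D_a^{in,Max} \le \mathrm{Max}(2 \cdot 9 + 2 \cdot 8,\ 15 + 2 \cdot 9 - 4 + 8,\ 23 + 9 - 4 + 8 - 3,\ 2 \cdot 15 + 9 - 4)$
$= \mathrm{Max}(34, 37, 33, 35) = 37$ (8.50)
3. $D_d^{out,Max} \le$ Max(15 +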
12 + 30102*43 + 15 + 5,
38 + 155+ 53,
9 + 53)





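$\mathrm{Max}(9, 20, 47, 48) = 48$ (8.52)
4. $D_a^{in,Max} \le \mathrm{Max}(2 \cdot 9 + 2 \cdot 8,\ 15 + 2 \cdot 9 - 4 + 8,\ 50 + 9 - 4 + 8 - 3,\ 2 \cdot 15 + 9 - 4)$
$= \mathrm{Max}(34, 37, 60, 35) = 60$ (8.53)
$\mathrm{Max}(T_{i+2,t}(j+1))|_j \le \mathrm{Max}(15,\ 18 + 5,\ 10 + 18 - 10 + 5 - 3,\ 60 + 18 - 10)$
$= \mathrm{Max}(15, 23, 20, 68) = 68$ (8.54)
$\mathrm{Max}(T_{i+2,u}(j+1))|_j \le \mathrm{Max}(10,\ 18 + 5,\ 15 + 18 - 10 + 5 - 3,\ 60 + 18 - 10)$
$= \mathrm{Max}(10, 23, 25, 68) = 68$ (8.55)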








$\mathrm{Max}(T_{i+2,t}(j+1))|_j = \mathrm{UP1}(T_{i+2,t}(j+1))|_j = 68$.
$\mathrm{Max}(T_{i+2,u}(j+1))|_j = \mathrm{UP1}(T_{i+2,u}(j+1))|_j = 68$.
$\mathrm{Max}(T_{i+2,p}(j+1))|_j = \mathrm{UP1}(T_{i+2,p}(j+1))|_j = 48$.
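The arithmetic in (8.48) to (8.50) and (8.53) to (8.55) can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the assignment of each printed number to a particular stage bound is our reading of the numeric terms, chosen because it reproduces them; the variable and function names are hypothetical; and the value 50 for the equivalent output delay of (8.51) is taken as given rather than recomputed.

```python
# Stage-delay bounds of the design example as (min, max) pairs.
F_iq, B_iq = (4, 9), (3, 8)
F_ip, B_ip = (9, 30), (6, 8)
F_r, B_r = (10, 18), (3, 5)            # stage (i+1, r)
D_q_in, D_p_in = (10, 15), (10, 12)
D_t_out, D_u_out = (9, 15), (6, 10)


def spread(bound):
    """Max minus min of a (min, max) pair."""
    return bound[1] - bound[0]


# (8.48) and (8.49): obtained directly from the given bounds.
d_out = max(F_r[1] + B_r[1], D_u_out[1] + spread(B_r), D_t_out[1] + spread(B_r))
d_f_in = max(F_ip[1] + B_ip[1], D_p_in[1] + spread(F_ip))


def d_a_in(d_out_value):
    """(8.50) and (8.53): same four terms, different output-delay input."""
    return max(2 * F_iq[1] + 2 * B_iq[1],
               D_q_in[1] + 2 * F_iq[1] - F_iq[0] + B_iq[1],
               d_out_value + spread(F_iq) + spread(B_iq),
               2 * D_q_in[1] + spread(F_iq))


d_a = d_a_in(50)  # 50 is the value used in the third term of (8.53)
t_t = max(D_t_out[1], F_r[1] + B_r[1],
          D_u_out[1] + spread(F_r) + spread(B_r), d_a + spread(F_r))   # (8.54)
t_u = max(D_u_out[1], F_r[1] + B_r[1],
          D_t_out[1] + spread(F_r) + spread(B_r), d_a + spread(F_r))   # (8.55)

print(d_out, d_f_in, d_a_in(d_out), d_a, t_t, t_u)  # 23 38 37 60 68 68
```

Keeping the bounds as named pairs makes it easy to see which stage contributes the maximum element when a bound is changed.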
For this specific design example, if the specification of $\mathrm{Max}(T_{i+2,p}(j+1))|_j$
is changed to a smaller number, two options, shown in Figure 8.8(a), are available,
according to the maximum elements in (8.49) and (8.52).
1. Reduce $D_f^{in,Max}$: $D_f^{in,Max}$ can be reduced by splitting the FIFO stage. Based
on our assumptions and the discussion of the FIFO (Chapter 5), $D_f^{in,Max}$ is limited to
a minimum value of 33 no matter how many stages it is split into.
2. Split Join: The intention of splitting the Join stage is to reduce the value of
Assume that $F_{i+1,p}$ and $B_{i+1,p}$ are split into two stages and the resulting
pipeline has the following delay relationship. That is,
$F_{i+1,p}^{Max} = F_{i+1,p1}^{Max} + F_{i+1,p2}^{Max}$,
$F_{i+1,p}^{min} = F_{i+1,p1}^{min} + F_{i+1,p2}^{min}$,
$B_{i+1,p}^{Max} = B_{i+1,p1}^{Max} + B_{i+1,p2}^{Max}$,
$B_{i+1,p}^{min} = B_{i+1,p1}^{min} + B_{i+1,p2}^{min}$.




Mjpi + efaxt+ 1,p1
FMax Frin
i+ 1,p2 t+ 1,p2'






Figure 8.8: Design options for helping reach specifications.
(a) Flow chart for improving $\mathrm{Max}(T_{i+2,p}(j+1))|_j$.
(b) Flow chart for improving $\mathrm{Max}(T_{i+2,t}(j+1))|_j$.




The maximum element in (8.56) is
$D_f^{in,Max} + F_{i+1,p1}^{Max} - F_{i+1,p1}^{min} + F_{i+1,p2}^{Max} - F_{i+1,p2}^{min}$




As shown in (8.57), the maximum element is the same before and after splitting the Join
stage. Therefore, splitting the Join stage will not improve our approximation to
$\mathrm{Max}(T_{i+2,p}(j+1))|_j$ in this particular example. In conclusion, based on our
approach, the minimum value of $\mathrm{Max}(T_{i+2,p}(j+1))|_j$ is (33 + 15 − 5) = 43 no
matter which or how many stages are split.
If the specifications of $\mathrm{Max}(T_{i+2,t}(j+1))|_j$ and
$\mathrm{Max}(T_{i+2,u}(j+1))|_j$ are changed to smaller values, several design options,
shown in Figure 8.8(b), are available, according to the maximum element in each expression
from (8.48) to (8.55).
1. Split Fork stage: The intention of splitting the Fork stage is to make the value of
$F_{i+1,r}^{Max} - F_{i+1,r}^{min}$ smaller in order to have a smaller sum in (8.54) and
(8.55). Again, using an approach similar to (8.56) and (8.57), splitting the Fork stage will
not help in reducing the values of $\mathrm{Max}(T_{i+2,t}(j+1))|_j$ and
$\mathrm{Max}(T_{i+2,u}(j+1))|_j$.
2. Reduce $D_a^{in,Max}$ by splitting the Toggle stage: This will help since
$D_d^{out,Max}$ is fixed and the rest of the element,
$F_{i,q}^{Max} - F_{i,q}^{min} + B_{i,q}^{Max} - B_{i,q}^{min}$, will be reduced by
splitting the Toggle stage.
3. Reduce $D_d^{out,Max}$ by splitting the Join stage: This will improve the overall sum of
the maximum element in (8.51) since $D_f^{in,Max}$ is fixed and the rest of the element,
$F_{i+1,p}^{Max} - F_{i+1,p}^{min} + B_{i+1,p}^{Max} - B_{i+1,p}^{min}$, will be reduced by
splitting the Join stage.
4. Reduce $D_f^{in,Max}$ by splitting the FIFO stage: This stage splitting helps, as
described before, but is limited to the minimum value of 33.
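For example, if the specification is changed to
$\mathrm{Max}(T_{i+2,t}(j+1))|_j \le 64$,
$\mathrm{Max}(T_{i+2,u}(j+1))|_j \le 64$,
$\mathrm{Max}(T_{i+2,p}(j+1))|_j \le 47$,
splitting the FIFO into two stages will help meet the new requirement. If we let the
two new stages have
$F_{i,p1}^{min} = 7$, $F_{i,p1}^{Max} = 20$, $F_{i,p2}^{min} = 2$, $F_{i,p2}^{Max} = 10$, and
$B_{i,p1}^{min} = 5$, $B_{i,p1}^{Max} = 6$, $B_{i,p2}^{min} = 1$, $B_{i,p2}^{Max} = 2$,
the new bounds become
$\mathrm{Max}(T_{i+2,t}(j+1))|_j = \mathrm{UP1}(T_{i+2,t}(j+1))|_j = 64$.
$\mathrm{Max}(T_{i+2,u}(j+1))|_j = \mathrm{UP1}(T_{i+2,u}(j+1))|_j = 64$.
$\mathrm{Max}(T_{i+2,p}(j+1))|_j = \mathrm{UP1}(T_{i+2,p}(j+1))|_j = 47$.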
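A stage split is only meaningful if the two new stages together account for the delay of the original stage, following the same additive relationship that was assumed when the Join stage was split. The sketch below checks this for the FIFO split above; the dictionary layout and the helper name `recombine` are illustrative assumptions.

```python
# Bounds of FIFO stage (i, p) before the split, as (min, max) pairs.
original = {"F": (9, 30), "B": (6, 8)}

# Bounds of the two new stages p1 and p2 after the split.
split = {
    "F": [(7, 20), (2, 10)],
    "B": [(5, 6), (1, 2)],
}


def recombine(parts):
    """Add the minima and the maxima of the split stages."""
    return (sum(lo for lo, _ in parts), sum(hi for _, hi in parts))


for name, bound in original.items():
    # Each split pair must add back to the bound of the unsplit stage.
    assert recombine(split[name]) == bound, name

print({name: recombine(parts) for name, parts in split.items()})
# {'F': (9, 30), 'B': (6, 8)}
```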
Because our method is only an approximation, it should be noted that the design options
stated above do not always reflect the real situation. In some cases, several terms dropped
in our method may come into play in determining the maximum output loop delay. However, our
approach is always conservative. Also, if the Toggle is split and the maximum element is
$2D_q^{in,Max} + F_{i,q}^{Max} - F_{i,q}^{min}$, extra effort should be made to obtain a
more accurate result. The reason is that the coefficient 2 of $D_q^{in,Max}$ makes the
approximation worse by ignoring the dependency between, for example, $D_q^{in}(j+2)$ and
$D_q^{in}(j+3)$. Similar statements have been discussed in the previous section. It is worth
mentioning that, if the element with the maximum value in any one of the expressions from
(8.48) to (8.55) changes, the flow chart of design options (Figure 8.8) needs to be changed
accordingly. For example, if the maximum element, equal to 60 and being the third element in
(8.53), is somehow changed to the second element due to different stage bounds, the design
options must also change. In other words, the flow chart of design options is not only
architecture dependent but also stage-delay dependent.
8.4 Summary
In this chapter, the performance of a larger system consisting of FIFO, Fork, Join and
Toggle/XOR is discussed. The simulation results show that our approach gives a closer
approximation for an open-loop system than for a closed-loop system. Our method provides not
only the performance analysis but also design guidelines. The design guidelines (e.g., stage
splitting) do not always reflect the real circuit behavior, since our approach is only an
approximation. However, they do give designers some clues for improving overall circuit
performance in a systematic way, rather than on a trial-and-error basis. They also provide
information that saves design cycle time (e.g., that splitting specific stages does not
help overall performance). It is best to use CTSE simulations together with this approach
to verify the intermediate and final results. Finally, it should be noted that the design
options are functions of both circuit architecture and stage-delays. That is, the same
circuit architecture may result in a different flow chart of design options if the
stage-delays are different.
9. CONCLUSIONS AND FUTURE WORKS
The Micropipeline is one of many asynchronous design methodologies. It is also the design
method whose performance we chose to study. One of the features of micropipelines is their
modularity and composability. We make use of this feature in our performance analysis and
design. Conclusions are drawn in this chapter based on the results and discussion in the
previous chapters. A comparative summary of our approach and CTSE is also included in the
conclusions. Several directions for future work that build on and extend our approach are
suggested at the end of this chapter.
9.1 Conclusions
There are two types of performance measurements in asynchronous circuits. One is the
average throughput and the other is the throughput bounds. Since the stage-delay is totally
random, it is not possible to find an average throughput with the deterministic approach
adopted in this thesis. Instead, a probabilistic approach is suggested. Even so, our
approach reveals that not only the total stage-delay (delay-sum) but also the stage-delay
pattern (the asynchronous parameters: forward delay slope and logic delay) affects the
average throughput. For a finite number of input data, this leads to the conclusion that
asynchronous pipelines may have better, worse or equivalent average performance compared
with synchronous pipelines.
Most of this thesis is devoted to finding the throughput bounds for FIFO, Fork, Join,
Toggle/XOR, Arbiter/Call, Select/XOR and a system composed of some of these modules. Our
approach has two steps. First, several basic modules are chosen. They include FIFO, Fork,
Join, Toggle/XOR, Arbiter/Call and Select/XOR. The output loop delay, equivalent input delay
and equivalent output delay for each basic module are derived based on the Equal loop-delay
theorem. The result is a set of difference equations. The performance approximation can be
obtained with simple mathematical operations on the difference equations, given the bounds
of the stage-delays. That is, the bounds (to save space, the lower bound is not derived for
most of the modules) of the output loop delay, equivalent input delay and equivalent output
delay can be represented in terms of the bounds of the stage-delays. Second, for a larger
system consisting of those basic modules, the performance bounds can be derived directly
from the bounds of the output loop delay, equivalent input delay and equivalent output delay
of those basic modules, which were obtained in the first step. This approach allows a fast
and easy calculation of performance bounds, since rederiving the difference equations for
the whole system is avoided.
Our approach to the output loop delay can be from the left-hand or right-hand side (as when
the FIFO is discussed in Chapter 5). However, most of our efforts are devoted to
approximating the maximum output loop delay from the right-hand side, i.e., our
approximation is greater than or equal to the exact maximum bound. The simulation results
show that our approach gives a closer approximation for an open-loop system than for a
closed-loop system. Since our method is a symbolic approach instead of a numerical approach,
it allows designers to analyze circuit performance while providing design
guidelines/approaches at the same time. These design guidelines/approaches do not always
reflect the real circuit behavior, since our approach is only an approximation. However,
they do give designers some clues for improving the overall circuit performance in a
systematic way, rather than on a trial-and-error basis. They also provide information which
may save design cycle time. This thesis is not intended to replace the CTSE tool. On the
contrary, it is best to use CTSE together with our method to verify the intermediate and
final results. Also, it should be noted that design options are functions of both circuit
architecture and stage-delays. That is, the same circuit architecture may result in
different design options if the stage-delays are different.
In the design of an asynchronous circuit, there are many solutions (stage-delay bounds)
that can meet the design specifications. Which solution is chosen depends on the technology
currently available for implementing the circuit being designed. The other factor that
affects the choice of stage-delays is the average throughput. Choosing different stage-delay
bounds, or changing stage-delay bounds to meet the specification of throughput bounds, may
reduce the average throughput. Therefore, the best choices are those that make both the
average throughput and the throughput bounds meet the requirements.






CTSE: needs advanced mathematics and algorithms
Ours: uses difference equations and simple operations
3. Application
CTSE: general asynchronous circuits modeled with a Petri net
Ours: micropipelines
4. Accuracy
CTSE: exact bounds for circuits that are data-independent and without a
mutual-exclusion mechanism
Ours: an approximation, except for the maximum output loop delay of an
initially reset FIFO
5. Modularity
CTSE: the whole circuit should be simulated again whenever there is a
change, even a small change on this circuit
Ours: only the parts that are affected by this change need to be recalcu-
lated
6. Design
CTSE: provides no direct information to help with design
Ours: provides design procedure and guidelines (approaches)
9.2 Future Works
Several extensions of this research and the results presented in this thesis are
possible.
1. Stage-delay dependency: In the problem setup, we assume that all stage
delays are independent. That is, Fp(j + 1) and Fq(j + 1), Bp(j + 1) and Bq(j + 1),
Fp(j + 1) and Bp(j + 1), and Fp(j + 1) and Bq(j + 1) are independent, where p ≠ q,
1 ≤ p ≤ m and 1 ≤ q ≤ m. In other words, given stage-delay bounds, the delay of
any stage at any moment could be any value within the given bounds,
regardless of the delays of adjacent stages. This is not true in practice. For example,
if a combinational logic block is partitioned into three stages, the delays in these three stages
at any moment are related in a way defined by the implementation and function
of the logic. That is, they are not independent. What is certain
is that a system analyzed with the stage-delay dependency taken into account will have better
performance (throughput bounds) than the same system analyzed without it.
2. More logic delay terms: As shown in Chapter 5, if more logic delay terms
are taken into account when doing the approximation, the result is closer to the exact
bounds, at the price of added complexity. A tradeoff analysis between complexity
and accuracy is worth pursuing.
3. More basic control modules: In this thesis, we confine our performance investigation
to several basic control modules, such as FIFO, Fork, Join, Toggle/XOR, Arbiter/
Call and Select/XOR, and to systems consisting of some of these modules. To allow more
general applications and to give designers more freedom, more basic control modules could
be investigated, for example, Merge, True gate and False gate. Merge is similar to
Select except that the input and output terminals are reversed. A True (False) gate allows an
input token to pass through only when the control signal is true (false).
4. Approximation-equivalent form: Sometimes it is easier to find the performance
bounds and the equivalent input and output delays of a system if some parts of
the system can be transformed into another equivalent form. Since micropipelines
communicate through a handshaking protocol, the exact value of
the equivalent input and output delays at almost any cross section appears to be a function of all of the
stage delays in the system. It would therefore be hard to find an exact equivalent form of the original
circuit or system. For example, finding a form that is exactly equivalent to a Join
circuit could be a challenging task, because obtaining an exact expression for the
performance bounds of a Join circuit in terms of the stage delays (as variables) is itself a challenge,
not to mention obtaining its equivalent form. However, it could be easier to find
its approximation-equivalent form, that is, a form that yields the same approximation result.
For example, the approximation-equivalent form of a Join is shown in
Figure 8.7. Some restrictions may apply to the use of this equivalent form. Deriving
properties may also help in constructing an approximation-equivalent form for a
general circuit.
Of course, there are many other possible extensions to this research besides the
ones stated above. Whatever direction the research takes, the features of simplicity
and modularity are always desirable.
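
As a small numerical illustration of the stage-delay dependency extension (item 1 above), the following Python sketch compares the bounds obtained with and without a dependency between stage delays. It is a toy model under stated assumptions and not the analysis method of this thesis: three stages are assumed to share one combinational logic block whose total delay is fixed at TOTAL, the quantity of interest is taken to be the sum of the first two stage delays, and all names and numbers are hypothetical.

```python
from itertools import product

LO, HI = 2, 6          # assumed bounds on every stage delay (integer grid)
TOTAL = 12             # assumed fixed total delay of the partitioned logic


def bounds(samples):
    """Return (min, max) of d1 + d2 over the given delay triples."""
    sums = [d1 + d2 for d1, d2, _d3 in samples]
    return min(sums), max(sums)


grid = range(LO, HI + 1)

# Independent case: every stage may take any value within its bounds.
independent = list(product(grid, repeat=3))

# Dependent case: the three stage delays must add up to the total logic delay.
dependent = [d for d in independent if sum(d) == TOTAL]

print("independent:", bounds(independent))   # (4, 12)
print("dependent:  ", bounds(dependent))     # (6, 10)
```

The interval obtained under the dependency lies inside the interval obtained under independence, which is the sense in which accounting for the dependency can only tighten the bounds.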
APPENDICES
APPENDIX A: PETRI NET MODELS AND PROCESS NAMES
FOR BASIC MODULES
CTSE process name: (rin,aout) c_element (ain)






CTSE process name: (in) edge_no_token (out)






CTSE process name: (in) edge_with_token (out)






CTSE process name: (rout,ainq,ainp) fork (rinq,rinp,aout)













CTSE process name: (routq,routp,ain) join (rin,aoutq,aoutp)














CTSE process name: (rout,ainq,ainp) toggle (rinq,rinp,aout)
Figure A.6: Process of a Toggle.
APPENDIX B: CTSE CODE FOR THE SYSTEM SHOWN IN FIGURE 8.1

































// define module c_element













// define edge without initial token







// define edge with initial token








// define module fork
define process (rout,ainq,ainp) fork (rinq,rinp,aout)
{
t(rout,ainq,ainp,rinq,rinp,aout); // These transitions are for interface



















// define module join






















// define module toggle



























'r(i+1)ri','a(i+1)ri','a(i+1)ro',
'r(i+1)si','a(i+1)si','a(i+1)so',




instance eq ('aiqi') edge_with_token ('riqi');
delay (eq/p1, din_q_min, din_q_max);
instance ep ('aipi') edge_with_token ('ripi');
delay (ep/p1, din_p_min, din_p_max);
instance et ('a(i+2)ti') edge_no_token ('a(i+2)to');
delay (et/p1, dout_t_min, dout_t_max);
instance eu ('a(i+2)ui') edge_no_token ('a(i+2)uo');
delay (eu/p1, dout_u_min, dout_u_max);
instance ei2p ('a(i+2)pi') edge_no_token ('a(i+2)po');
delay (ei2p/p1, dout_p_min, dout_p_max);
instance cq ('riqi','aiqo') c_element ('aiqi');
instance cp ('ripi','aipo') c_element ('aipi');
instance cr ('r(i+1)ri','a(i+1)ro') c_element ('a(i+1)ri');
instance cs ('r(i+1)si','a(i+1)so') c_element ('a(i+1)si');
instance ci1p ('r(i+1)pi','a(i+1)po') c_element ('a(i+1)pi');
instance ct ('r(i+2)ti','a(i+2)to') c_element ('a(i+2)ti');
instance cu ('r(i+2)ui','a(i+2)uo') c_element ('a(i+2)ui');
instance ci2p ('r(i+2)pi','a(i+2)po') c_element ('a(i+2)pi');
instance toggle_m ('aiqi','a(i+1)ri','a(i+1)si') toggle ('r(i+1)ri','r(i+1)si','aiqo');
delay (toggle_m/p1, fi_q_min, fi_q_max);
delay (toggle_m/p6, bi_q_min, bi_q_max);
instance fifo_f_m ('aipi') edge_no_token ('r(i+1)pi');
delay (fifo_f_m/p1, fi_p_min, fi_p_max);
instance fifo_b_m ('a(i+1)pi') edge_no_token ('aipo');
delay (fifo_b_m/p1, bi_p_min, bi_p_max);
instance fork_m ('a(i+1)ri','a(i+2)ti','a(i+2)ui') fork ('r(i+2)ti','r(i+2)ui','a(i+1)ro');
delay (fork_m/p1, fi1_r_min, fi1_r_max);
delay (fork_m/p6, bi1_r_min, bi1_r_max);
instance join_m ('a(i+1)si','a(i+1)pi','a(i+2)pi') join ('r(i+2)pi','a(i+1)so','a(i+1)po');
delay (join_m/p3, fi1_p_min, fi1_p_max);
delay (join_m/p4, bi1_p_min, bi1_p_max);
}
























//set po_alg = 2;
//set max_iter=1;
print analyze(analysis_t_upper);
print analyze(analysis_u_upper);
print analyze(analysis_p_upper);