A complete family of untimed asynchronous 4-phase pipeline protocols is derived and characterised. This family contains all untimed protocols where data becomes valid before the request signal rises. Starting with a specification of the most parallel such protocol, rules are provided for concurrency reduction to systematically generate the family of all 137 related protocols that can be pipelined. Graphical and textual nomenclatures are developed to represent protocol properties and behaviours. The protocols are categorised according to their behaviours when composed into linear and structured parallel pipelines. Six basic categories emerge, along with several properties such as a single state that determines whether a protocol is fully or half buffered. When equivalence classes are calculated for parallel pipeline behaviours they are dominated by 15 shapes (all of which are delay-insensitive) which are related by a simple lattice. Several published circuits are shown to map to 16 of our 137 family members. This work enhances the understanding of handshake protocols, their properties, and relationships between different implementations in terms of concurrency and behavioural properties.
Introduction
Asynchronous request acknowledge protocols have been employed for years. Yet it is surprising how little is understood of the fundamental behaviour of the protocols when they are composed into systems. This work formally and exhaustively investigates all possible untimed asynchronous latch controller protocols. The behaviour of each protocol is then investigated in linear and parallel configurations to study its concurrent behaviour. A number of properties emerge such as protocol equivalence classes, protocol compatibility sets, behavioural properties such as the ability to latch data in every latch, control the latch without extra state logic, and full lattice representation.
We have found Milner's CCS (Calculus of Communicating Systems) [10] to be a very apt notation for studying protocol families in this way: it is expressive enough to model signal protocols; its semantics conveniently capture event orderings rather than specific timings; and it is compositional, which makes it straightforward to model both linear and parallel pipelines. Further, latch protocols and pipeline structures can be compressed down to the minimal canonical state graph and property checked on CCS's supporting software, the public domain CWB (Concurrency Workbench) [11]. CCS has also been extended to directly support circuit realisations with speed-independent broadcast communication [12].
Our technique is quite straightforward: the most parallel behaviour of a 4-phase latch controller is first described in CCS and the CWB is used to generate its equivalent state graph (32 states). Using a few concurrency reduction rules, states are systematically cut away on the incoming and outgoing channels to generate all less concurrent state graphs that will still obey some related latch controller protocol. The cut-aways are exhaustive: all possible protocols in the family are generated as minimised state graphs. Notice that this paper only describes concurrency reduction for untimed protocols. The rules for timed protocols (burst-mode or relative timed) will be presented elsewhere.
The CCS notation allows us to compose parallel specifications of the channels with their concurrency reducing synchronisation, and reduce these to canonical state-graph specifications. The composition of parallel protocols and their systematic reduction to a minimal canonical representation renders comparison between implementations trivial and assists in validating completeness. Such transformations are not readily possible with STG and Petri-net specifications.
Figure 1 shows LC, a 4-phase latch controller, and its associated latch where the data is stored. The input (upstream) channel handshakes with lr (the left request) and la (the left acknowledgment), and the output (downstream) channel with rr (the right request) and ra (the right acknowledgment). Each channel employs the simple protocol of interleaving request and acknowledgment signals. By convention we overline output signals but not input signals. Note that in this work we have abstracted out the data-path, and only model the protocol.
Previous Work
We quickly review the relevant results presented in [1]. That work made no attempt to be complete and considered only four published 4-phase latch controllers and three idealised protocols. Besides modeling these seven latch controllers singly, it also considered their behaviours when composed into structured pipelines:
1. LP d : a linear pipeline of latch protocols of depth d (see the top part of Figure 2).
2. PP w,d : the structured composition of w d-deep pipelines running in parallel (see the lower part of Figure 2). The fork module F 2 broadcasts lr to both linear pipelines and waits until all have replied before responding with la. The join module J 2 is the inverse of F 2 .
Notice that the specifications of LC, LP d and PP w,d are all in terms of lr, la, rr, ra and can thus be compared and contrasted directly.
The Manchester group has published several 4-phase latch controller circuits, some faster, some more power efficient. One source of variety is the amount of overlap permitted between the recovery phase on the rr↓/ra↓ side and the notification of the arrival of the next data value (lr↑). It is thus not surprising that the Manchester 4-phase latches studied vary in state size (from roughly 18 to 26 states). When combined into linear pipelines LP d , their minimised behaviours settled into a predictable pattern of state sizes from pipeline depth 2; whereas the parallel pipeline patterns PP w,d were regular from depth 1, but did not agree with the linear pipeline pattern (PP w,d was always more state rich). Three new mathematically inspired 4-phase protocols were also examined, two of which did exhibit stable behaviour in that PP w,d ≡ LP d for positive w and d. One of these protocols has 32 states and is the most parallel 4-phase latch protocol achievable.
With a little modeling and analysis on the CWB, it was easy to show the following results for the latches considered: These 4-phase results were checked by running CCS models on the CWB for w,d = 1..8. In addition several equivalent 2-phase results were given formal proofs. None of these are deep, rather they are case rich, shallow and tedious. Preliminary work on the 4-phase proofs shows them to be similarly structured and yet more case rich and tedious. It would be nice to get the proofs mechanised and verified with a proof checker such as HOL.
Structure of the Paper
The structure of the rest of this paper is as follows. In Section 2, we present our specification notation CCS and construct a specification of the most concurrent 4-phase latch protocol LC max which has 32 states when expressed in our normal form as a minimised state graph. In Section 3 we show how the whole family of less state rich (less parallel) 4-phase latch protocols can be derived from LC max through concurrency reduction. Each sub-behaviour is expressed as a state graph and given a unique characterisation. We also tabulate the behaviours when pipelined singly and in parallel. In Section 3.3 we discuss six protocol categories that emerge, including the 15 protocols that are stable: for them LP d ≡ PP w,d . An important side effect for designs with this behaviour is that we can replace quite complicated formal models of parallel PP w,d datapaths by the much simpler LP d model when reasoning about concurrent pipelined designs such as a microprocessor. In Section 4 we partition the state space into protocol equivalence classes when pipelined in linear and parallel configurations. Section 5 arranges the stable circuits (and hence their equivalence classes) into a lattice based on concurrency. Section 6 ties in related work and Section 7 lists some published designs and places them in our family lattice. Finally we summarise the work done so far and some future directions.
LC max : The Maximal 4-phase Protocol
In this section, we model the behaviour of LC max , the 4-phase latch protocol of maximal concurrency, and display its regular behaviour when composed into pipelines. In CCS, our first step in specifying LC max is to describe it as the composition of L, which deals with the incoming channel, and R, which deals with the outgoing channel.
The definition of L simply spells out the order of one cycle of input signals and then repeats forever. The signals are separated by '.' which we may interpret informally as signal precedence. CCS specifies the order of events, but they occur with arbitrary delays rather than strict timings, resulting in all possible concurrent signal interleavings. The protocol it describes may accordingly take an arbitrary time between these signals. The definition of R follows the same pattern. The above specification minimises to a 4 × 4 block of states which (with loop back), via the semantics of CCS, covers every possible interleaving of the 8 signals: from one extreme of just L running, to the other of just R running, and all possible interleavings in between.
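The 4 × 4 block can be illustrated with a small sketch. It is not the CCS specification itself: it assumes the usual 4-phase signal order on each channel (request up, acknowledge up, request down, acknowledge down), and the names `L_CYCLE`, `R_CYCLE` and `interleaved_product` are hypothetical, introduced only for this illustration.

```python
from itertools import product

# Assumed order of one handshake cycle per channel ('^' rising, 'v' falling).
L_CYCLE = ["lr^", "la^", "lrv", "lav"]  # left (input) channel
R_CYCLE = ["rr^", "ra^", "rrv", "rav"]  # right (output) channel


def interleaved_product(left, right):
    """State graph of two independent cyclic processes running in parallel.

    A state (i, j) means L has completed i signals of its current cycle and
    R has completed j signals of its current cycle.
    """
    transitions = {}
    for i, j in product(range(len(left)), range(len(right))):
        transitions[(i, j)] = [
            (left[i], ((i + 1) % len(left), j)),    # L takes its next step
            (right[j], (i, (j + 1) % len(right))),  # R takes its next step
        ]
    return transitions


graph = interleaved_product(L_CYCLE, R_CYCLE)
print(len(graph))  # 16 states, i.e. the 4 x 4 block
```

Every state offers exactly one L step and one R step, so nothing yet prevents either channel from running ahead of the other.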
However, L and R cannot run untrammeled. We now add synchronisations that will (1) stop L from accepting fresh data when the previous data value has not been accepted downstream, and (2) stop R from emitting an rr↑ until a fresh value has been latched; the second synchronisation is written •. The second version of the specification of LC max indicates how to handle these interplays between L and R:
1. First synchronisation: after R has received a signal ra↑, it is sure that the current data value has been captured downstream. R will now unblock L (if it were blocked). Both L and R may continue on.
2. Second synchronisation (•): with space assured, (the unblocked) L is free to capture the next fresh data value and then unblock R (if it were blocked). Both L and R may continue on.
Notice the ordering in process L: lr↑ is followed by the first synchronisation and then by •. Channel L must have an empty latch (make sure that R has received ra↑) before it stores the next value on dIN in the latch. For channel R the conditions are reversed.
We may remark that whereas the placement of the receiving first synchronisation in L and of the receiving • in R is crucial, it is quite in order to shuffle the awakening • to the right in L and the awakening first synchronisation to the right in R. All such shufflings are captured by our cut-away method described in Section 3.
L:
R:
These two synchronisations may be modeled in several ways. One reusable style (see Figure 3) is to define two tokens, one for each of the synchronisations:
1. the first synchronisation by S, the space token, S = gS.pS.S
2. the second synchronisation • by V, the value token, V = gV.pV.V
Each token is taken (by a get handshake, gV or gS) and replaced (by a put handshake, pV or pS). Importantly, puts never delay the sender; but gets will block a requester until permission is granted. This leads to the final form of our specification:
The last line of the specification defines the behaviour of LC max as the composition of the upstream channel process L; the downstream channel process R; and the synchronisation between the two channels with space and value tokens S and V. The handshakes between L, R and S and V are made private (hidden) with \ {gV,pV,gS,pS} so that no other process can tamper with them.
The only synchronising constraint on the input channel of LC max is that L must wait until a slot is free (gS) to accept the value on dIN; and the only constraint on the output channel is that fresh data must be latched (gV). LC max as defined is the most concurrent protocol possible for a latch protocol where data is valid before the rising request on the left channel.
This protocol has 32 states, and d-deep pipelines and parallel pipelines have 16d + 16 states. Figure 4 depicts the minimised state graph of LC max . A middle 4×4 block of states can be iterated for deeper pipelines to give rise to minimised versions of LP d . Runs on the CWB confirm that PP w,d ≡ LP d for w, d = 1..8. Thus LC max exhibits stable behaviour. By inductive argument, we can reason about the overall control signal behaviours of structured widening and thinning parallel pipelines as though they were linear pipelines of the same depth, a much simpler model to grasp.
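As a quick arithmetic check of the 16d + 16 growth quoted above, the following sketch tabulates the state counts for the first few depths. The function name `lc_max_pipeline_states` is hypothetical; the formula is the only thing taken from the text.

```python
def lc_max_pipeline_states(depth: int) -> int:
    """Minimised state count of a depth-d pipeline of LC max stages (16d + 16)."""
    if depth < 1:
        raise ValueError("pipeline depth must be at least 1")
    return 16 * depth + 16


# Each extra stage adds one 4 x 4 block of 16 states.
print([lc_max_pipeline_states(d) for d in range(1, 5)])  # [32, 48, 64, 80]
```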
The Family Derived from LC max
The possible design space for 4-phase protocols is bounded above by LC max which exhibits the largest possible parallelism. A formal method is developed to derive all less concurrent protocols from LC max . This is achieved by creating and applying rules which systematically reduce concurrency by minimal increments. The behaviour of all protocols when pipelined is then tabulated.
A convenient, more compact notation for the minimised state graph of the most concurrent protocol LC max of Figure 4 has been developed. Since all the transitions follow simple patterns, we choose to present our ideas using what we call a shape:
The initial state is marked '+'; other reachable states by 'o', and unreachable states by '.'. Each shape is a graphical representation of a specific handshake protocol, fully specifies its behaviour, and differentiates it from all other protocols. The graphical representation provides intuition about the concurrency and specific behavioural and pipeline properties of all protocols.
Concurrency Reduction Rules
The rules for generating members of the untimed 4-phase family are:
1. The initial idle state must be reachable from all states in the graph. This has the following consequences:
(a) This will restrict the number of states that can systematically be removed from the "left" and "right" side of the state graph. For example, the following is the maximum left cut-away that preserves reachability of the initial state:
(b) Each row in the graph must contain at least one state, otherwise the graph will deadlock (represented as D).
2. Internal holes in the state space are disallowed. Thus the following state graph is deemed illegal:
Such graphs are found to generate very irregular behaviour when pipelined. This rule cuts the search space from over 400,000 protocols to 250.
(a) Disallowing holes in shapes has the consequence that we can generate all possible subbehaviours by listing all viable ways of cutting states away on the left; similarly on the right; and then mechanically generating all combinations of cut-aways.
3. In untimed protocols, inputs lr and ra must always be accepted.
4. The protocol can restrict when outputs rr and la are possible.
(a) The speed-independent set of protocols is a concurrency reduction of the delay-insensitive set after employing output ordering.
The Cut-Away Notation
The following notation is adopted for cut-aways:
1. Labcd means from LC max remove the leftmost a live states (circles) from R1; the leftmost b live states (circles) from R2; etc. Thus cut-away L2112 from LC max results in the shape:
in which each cut-away state is denoted by '.'. Since this shape has 7 reachable states in row 1, 4 in row 2, 8 in row 3, and 7 in row 4, we use the shorthand 7487 where it suits (the shorthand is occasionally ambiguous, whereas the cut-away notation is not).
2. Similarly, Refgh means from LC max remove the rightmost e live states from R1; the rightmost f live states from R2; etc.
The following cut-away patterns emerge from the rules:
1. LEFT: L0000, L1001, L1111, L2002, L2112, L3003, L3113, L2222, L3223, L3333.
There are 10 in all. Any cut-aways of depth 4 would make the initial state an orphan and are rejected. Cut-aways consisting entirely of even numbers (L0000, L2002, L2222) are of the delay-insensitive (DI) class. Those containing odd numbers (L1001, etc.) are of the speed-independent class and employ output ordering.
2. RIGHT: R0000, R0020, R0040, R0022, R0042, R2022, R2042, R2222, R2242, R2262, R0044, R2044, R4044, R2244, R2264, R4244, R4264.
There are 25 in all, but after experimentation only the 17 listed here turn out to yield protocols that implement pipelining. A right cut-away is of the delay-insensitive class when its first two numbers agree and its last two numbers agree (R0000, R0022, R2222, R0044, R2244). The others are of the speed-independent class.
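The cut-away notation and its row-count shorthand can be sketched as below. The sketch assumes that the four rows of LC max hold 9, 5, 9 and 9 states (its shorthand 9599) and that left and right cut-aways simply subtract from each row, which agrees with the L2112 example giving 7487. The helper names are hypothetical, and the code only counts states; it does not check reachability of the initial state.

```python
LC_MAX_ROWS = (9, 5, 9, 9)  # rows R1..R4 of LC max, shorthand 9599


def parse_cutaway(name: str, side: str) -> tuple:
    """Turn a name such as 'L2112' or 'R0044' into its four per-row digits."""
    if len(name) != 5 or name[0] != side or not name[1:].isdigit():
        raise ValueError(f"not a {side} cut-away: {name!r}")
    return tuple(int(ch) for ch in name[1:])


def row_counts(left: str, right: str = "R0000") -> tuple:
    """Live states left in each row after applying both cut-aways."""
    l_cut = parse_cutaway(left, "L")
    r_cut = parse_cutaway(right, "R")
    rows = tuple(b - l - r for b, l, r in zip(LC_MAX_ROWS, l_cut, r_cut))
    if min(rows) < 1:
        # Rule 1: a row with no states cannot be part of a live protocol.
        raise ValueError(f"{left}.{right} empties a row: {rows}")
    return rows


def shorthand(left: str, right: str = "R0000") -> str:
    """Row-count shorthand, e.g. '7487' for L2112 on its own."""
    return "".join(str(n) for n in row_counts(left, right))


print(shorthand("L2112"))                 # 7487
print(sum(row_counts("L1001", "R0000")))  # 30
```

The second line of output matches the 30 states quoted in the text for L 1001 •R 0000 .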
Protocol Categories
These cut-aways allow us to classify pipeline protocols into three families. The delay-insensitive family consists of protocols where both the left and right cut-aways are DI. The speed-independent family consists of protocols where the left or right cut-away employs output ordering. The timed family (not included in this paper) consists of cut-aways that restrict the arrival of inputs lr or ra based on local timing assumptions.
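The DI and SI classification stated above can be expressed as two small predicates. This is a sketch built directly on the digit rules in the text (all-even digits on the left; first pair equal and last pair equal on the right); the function and list names are hypothetical.

```python
LEFT = ["L0000", "L1001", "L1111", "L2002", "L2112",
        "L3003", "L3113", "L2222", "L3223", "L3333"]
RIGHT = ["R0000", "R0020", "R0040", "R0022", "R0042", "R2022",
         "R2042", "R2222", "R2242", "R2262", "R0044", "R2044",
         "R4044", "R2244", "R2264", "R4244", "R4264"]


def left_is_di(name: str) -> bool:
    """A left cut-away is delay-insensitive when all four digits are even."""
    return all(int(ch) % 2 == 0 for ch in name[1:])


def right_is_di(name: str) -> bool:
    """A right cut-away is DI when digits 1-2 agree and digits 3-4 agree."""
    a, b, c, d = name[1:]
    return a == b and c == d


def family(left: str, right: str) -> str:
    """'DI' if both cut-aways are DI, otherwise 'SI' (output ordering used)."""
    return "DI" if left_is_di(left) and right_is_di(right) else "SI"


di_left = [n for n in LEFT if left_is_di(n)]
di_right = [n for n in RIGHT if right_is_di(n)]
print(di_left)                       # ['L0000', 'L2002', 'L2222']
print(di_right)                      # ['R0000', 'R0022', 'R2222', 'R0044', 'R2244']
print(len(di_left) * len(di_right))  # 15
```

The product of 3 DI left cut-aways and 5 DI right cut-aways agrees with the 15 delay-insensitive combinations mentioned elsewhere in the paper.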
We have mechanised the task of generating all possible delay-insensitive and speed-independent pipelined protocols. All 250 have been evaluated on the CWB by running them in linear pipelines of depth 1..8 and parallel pipelines of depth 1..8 and width 1..8.
When the 250 protocols were examined, six categories emerged. The constant protocols are all concurrency-reduced versions of the DI protocol L 0000 •R 2266 . This consists of cutting off the right six columns of the LC max shape of Figure 4. Thus, at least one of the states removed by R2266 is required for pipelining.
Certain protocols can only store data in every other latch when the pipeline is stalled³, a behaviour called half-buffering [8]. Any protocol that does not contain the state marked with × in Figure 4 (or that removes any states in R2) cannot store data in every latch when stalled. Thus, we define this state as the pipeline state. The O(8) category is a subset of this set since these states only occur with R2244, R2264, R4244 and R4264 cut-aways. However, note that even certain delay-insensitive protocols, such as L 0000 •R 2222 , cannot store data in all latches. Protocols that do not include the pipeline state are not useful for certain implementations such as FIFOs.
Protocols in the stable category retain their native shapes when composed in parallel. Further, the linear and parallel pipelines are equivalent:
LP d ≡ PP w,d
This means that a linear portion of such a pipeline may be replaced by a parallel pipe of the same length; and vice versa. Thus such structured pipelines may be thinned or fattened with no effect visible to the external observer of their control signals. This is a very useful guarantee and a handy simplification when
³ Assuming pulse latch clocking is not employed.
converge on a specific concurrent protocol with more concurrency. Thereafter the behaviour maintains the same native protocol shape. Much of the concurrency that is regained in these categories is the return of concurrency that was removed through output ordering.
For example, consider the semi-regular speed-independent protocol L 1001 •R 0000 . It contains 30 states and implements output ordering where la↑ precedes rr↓. The protocol interface becomes identical to LC max when composed in linear pipelines of depth 2 or more. However, it is identical to LC max in all parallel pipelines. This is shown in the following table that gives the number of states in various parallel configurations:
32 48 64 80
This shows that some of the concurrency removed from a protocol is recovered in a regular way when protocols are placed in parallel configurations. Thus a protocol may behave identically to a more concurrent protocol when placed in parallel configurations. This implies that protocol equivalence classes could emerge, as is shown to be true in Sections 4.1 and 4.2. This also implies that inside a protocol equivalence class, certain concurrency reductions might result in more efficient implementations than others. Our results in this area will be reported in future publications.
The family of untimed protocols is rather large. Removing the deadlock and constant categories as being uninteresting for implementing pipelines leaves a family of 137 distinct and useful protocols. This family is tabulated in Table 1.
Additional Properties
Additional important distinguishing properties of the protocols can be graphically represented on Figure 4 and using the cut-away notation:
• Only protocols that contain the state marked with × in Figure 4 will latch data in every pipeline stage when using 4-cycle protocols. The two-phase and O(8) protocols can latch every stage if using a pulsed clock or handshaking through the register.
• The states in which the latch must be transparent and opaque can be represented by a coloring. Based on these colorings, it can easily be shown that certain states require a state variable to control the latch due to the state spaces. Some protocols don't cover the states that require a state marking, and thus result in simpler latch control logic that can be encoded directly from a combinational function of the handshake signals, and even the rr and la signals.
Parallel Protocol Equivalence Classes
Huygens invented the pendulum clock in 1656. In 1665 he noticed that if he put two of his clocks side by side then their pendulums would always synchronise within 30 minutes whatever their out-of-phase initial settings. We have an analogous convergence between different protocols when placed in parallel configurations.
The parallel behaviour of the family of protocols configured in parallel pipelines is represented in Table 2. Linear pipelines are presented in Table 3. These tables are divided into three vertical and five horizontal blocks. The top left of each block is a stable state that is the result of composing two delay-insensitive cut-aways. In Table 1 no particular pattern emerges if we examine by rows or by columns; there is no predictable pattern of rows or columns of just 4's or 5's. This indicates that neither the L nor the R cut-aways are a dominant factor in the pipelined behaviour of our protocols.
Examining block-by-block, the best behaved is the center block with stable shape L 2002 •R 0022 . This shape is very symmetric in its left and right cut-aways, as a pair of rr↓ transitions are pruned by the left cut-away and a pair of la↓ transitions by the right cut-away. In Table 2, all structured parallel pipelines result in the equivalent behaviour of the most parallel shape. Thus in a parallel pipeline, if a less concurrent protocol is implemented, it is indistinguishable from the most parallel delay-insensitive protocol. This implies that any of the protocols that apply concurrency reduction might yield a more efficient implementation with the same delay-insensitive behaviour.
Parallel Pipelines
D 5353 D 5353 D D D D R4244 7355 7355 D 5353 D D D D D D R4264
Linear Pipelines
LP d were evaluated for d = 1..8 over all 137 category 3-6 protocols. All single pipeline protocols showed predictable growth and shape for pipelines of depth 2 and deeper. Thus Table 3 shows state sizes for depth 2, and groups together equivalent protocols.
Three different equivalence sets emerge:
1. There are four 2 × 2 groups of adjacent cut-aways which have identical protocols for LP d where d ≥ 2. In each of the four cases, these protocols converge to the most parallel protocol, that in the top left position of the group. These sets consist of the four shapes that converge to protocols L 0000 •R 0000 , L 0000 •R 2242 , and L 0000 •R 2044 in the first column and L 3223 •R 0000 in the ninth.
2. There are 12 vertically arranged pairs of shapes that exhibit unique LC behaviours but are equivalent when pipelined at depths 2 or greater. In each case they converge to the most state rich shape, the higher of the two. These pairs consist of the protocols in the first and second rows, ninth and tenth rows, and 12th and 13th rows in columns three through eight. Notice, for example, that in rows one and two of
26 26
24 24
Table 3. Linear Pipeline Protocols LP 2
The Family Hierarchy
The cut-away representation L•R of the protocol family provides a direct method of ordering the entire family into a lattice based on protocol concurrency. The protocols are ordered based on state richness: protocol X ≤ protocol Y iff every state in shape X is also a state in shape Y. The easiest way of carrying this out is simply to compare the cut-away definitions of X and Y.
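Let one left cut-away be ordered below another when it removes at least as many states from every row. Similarly for the class of right cut-aways. Then protocol L abcd •R efgh is a proper sub-protocol of another shape when both its left and right cut-aways are ordered below those of that shape and the two shapes differ.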
The process is very simple to mechanise without the need to generate and compare the minimised state graph shapes.
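A minimal sketch of such a mechanisation follows. It assumes that state inclusion between shapes can be decided by comparing cut-away digits row by row (a protocol that cuts at least as much from every row on both sides is contained in the other); the function names are hypothetical.

```python
def digits(name: str) -> tuple:
    """Per-row digits of a cut-away name such as 'L2002' or 'R0044'."""
    return tuple(int(ch) for ch in name[1:])


def is_sub_protocol(x: tuple, y: tuple) -> bool:
    """True if protocol x = (left, right) has no state outside protocol y.

    x is contained in y when x removes at least as many states as y
    from every row, on both the left and the right.
    """
    left_ok = all(a >= b for a, b in zip(digits(x[0]), digits(y[0])))
    right_ok = all(a >= b for a, b in zip(digits(x[1]), digits(y[1])))
    return left_ok and right_ok


def is_proper_sub_protocol(x: tuple, y: tuple) -> bool:
    """Containment between two different protocols."""
    return x != y and is_sub_protocol(x, y)


lc_max = ("L0000", "R0000")
print(is_proper_sub_protocol(("L2222", "R2244"), lc_max))              # True
print(is_sub_protocol(("L2002", "R0000"), ("L0000", "R0022")))         # False
print(is_sub_protocol(("L0000", "R0022"), ("L2002", "R0000")))         # False
```

The last two lines show a pair of protocols that are incomparable, which is why the family forms a lattice rather than a single chain.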
The 15 combinations of delay-insensitive cut-away classes that produce stable shapes are displayed in a lattice in Figure 5. A shorthand notation is used in the lattice to represent the protocols by listing the number of states in each row of the shape. The top of the lattice is 9599 (LC max ) with 32 states, and the least concurrent stable protocol is 5133 with 12 states. Notice, however, that this notation is not unique: the shorthands 7377 and 7355 are each shared by two different protocols in the lattice. The unambiguous L•R notation can be derived from the figure to identify the protocol shape.
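The ambiguity of the shorthand can be reproduced with a short enumeration. The sketch assumes, as the 9599 and 5133 endpoints suggest, that each row count is the LC max row count minus the left and right cut-away digits; the names used are hypothetical.

```python
from collections import defaultdict

LC_MAX_ROWS = (9, 5, 9, 9)
DI_LEFT = ["L0000", "L2002", "L2222"]
DI_RIGHT = ["R0000", "R0022", "R2222", "R0044", "R2244"]


def shorthand(left: str, right: str) -> str:
    """Row-count shorthand of the shape left.right, e.g. '9599' for LC max."""
    rows = (base - int(l) - int(r)
            for base, l, r in zip(LC_MAX_ROWS, left[1:], right[1:]))
    return "".join(str(n) for n in rows)


# Group the 15 delay-insensitive combinations by their shorthand.
by_shorthand = defaultdict(list)
for left in DI_LEFT:
    for right in DI_RIGHT:
        by_shorthand[shorthand(left, right)].append((left, right))

ambiguous = {code: pairs for code, pairs in by_shorthand.items() if len(pairs) > 1}
for code, pairs in ambiguous.items():
    print(code, pairs)
```

Under these assumptions the enumeration yields 13 distinct shorthands for the 15 protocols, with 7377 and 7355 each naming two of them.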
Related Work
Asynchronous designers are well aware of concurrency reduction as a means of modifying protocols to generate more efficient implementations. Some concurrency reduction algorithms have been automated and implemented in CAD tools [2]. Formalisations of concurrency reducing transformations and rules have been previously published. Lines started with a concurrent handshake expansion in CSP, and then applied four reshuffling rules to the handshake signals to reduce concurrency [8]. This produced nine valid protocols, eight being reshufflings of the most concurrent MSFB protocol. McGee and Nowick developed a graphical framework based on signal transition graphs [9]. They formalised three correct-by-construction arc transformation constraints to reduce concurrency, and produced a lattice of protocols.
One significant difference from previous work is the completeness and coverage of the protocol space. The previous work implements subsets of the work presented here. Our formal process-based transformations are complete and exhaustive. All protocols, starting with the most concurrent LC max , are part of our set. The most concurrent protocol in these publications is in the L 0000 •R 0044 protocol equivalence class. This covers only the bottom six protocol equivalence classes in our lattice; the nine more concurrent protocol equivalence classes are not included. Additionally, our work is completely general. We don't impose any constraints on the implementation, and even abstract out the latch control signals. McGee's work focused on characterising a particular implementation style based on dynamic gates and relied upon internal signals such as reset, precharge, and evaluate for their model. This work also derives many characteristics of pipelined protocols that were previously unknown or not clarified elsewhere. For example, Lines characterises protocols in terms of their ability to store data in each latch; the half buffered protocols (such as PCHB) can only store data in every other latch whereas the fully buffered protocols (such as PCFB) store data in all latches upon a pipeline stall [8]. However, no specific property was defined that results in this characteristic behaviour. Section 3.3 defines this property as being directly dependent on the pipeline state in row R2 of right cut-aways. The PCFB protocol L 1001 •R 4044 is fully buffered since no states are removed in R2 of its cut-away R4044; the PCHB L 1001 •R 4264 is half buffered because two states are removed from row two of its right cut-away. Given the pipeline state property we have defined one can observe the shape of any protocol and immediately determine if the protocol is half or fully buffered.
Thus one can quickly prove that Sutherland's Micropipeline [13] is a half buffered protocol, and should not be used in a FIFO. Many other protocols and properties not previously known are presented here, such as the 15 equivalence classes that result when protocols are placed in parallel configurations.
Published Circuits
A selection of published circuits has been examined as shown in Table 4. Of the 28 listed there are only 16 distinct protocols implemented.
In the protocol family investigated in this paper, all but the control handshake signals lr,la and rr,ra are formally hidden from the protocol behaviour. This work does not consider power, area, speed, or whether the latch is normally open or closed. Thus each protocol has a multitude of possible implementations. What the protocol does tell you is how every corresponding implementation will behave at the interface when composed together in a single or parallel pipeline. These circuits can also be placed into the lattice and tables to determine properties of the protocol and study alternate implementations which may be improvements over the current version.
Contributions
In this paper we have presented the family of 4-phase latch protocols with data valid before rising request: their control signal properties and behaviours, and how they compose into homogeneous structured linear and parallel pipelines. We have fully specified every protocol that exists in the family of 4-phase pipeline controllers where data is valid before the rising edge of request. The most concurrent protocol LC max is specified, from which all less concurrent untimed protocols are derived.
A canonical state graph representation for protocols is presented and called a shape. This easily allows us to demonstrate properties of handshake protocols and the result of formal concurrency reduction transformations.
The behaviour of all 250 possible protocols was characterised in linear and parallel pipelines. Six fundamentally different categories emerged. We labeled these as stable (12 of O(16)), regular (60), semi-regular (43), regular 2-phase (22) of which 3 are stable, constant (21), and deadlock (92). Stable behaviours have shapes that are not modified in linear and parallel configurations. This set has an interesting property of defining protocol equivalence classes as noted below, and these protocols were used to define the protocol lattice. Regular protocols are not modified when placed in linear pipelines, but their behaviour is more concurrent when placed in parallel pipelines. Semi-regular protocols exhibit increased concurrency in both linear and parallel pipelines. For all protocols, their maximum concurrency is reached after only two pipeline stages.
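The category counts quoted above can be cross-checked against the totals used elsewhere in the paper. The dictionary below only restates those counts; its name is hypothetical.

```python
# Category sizes as quoted in the text.
CATEGORY_COUNTS = {
    "stable": 12,           # the O(16) stable protocols
    "regular": 60,
    "semi-regular": 43,
    "regular 2-phase": 22,  # 3 of these are also stable
    "constant": 21,
    "deadlock": 92,
}

total = sum(CATEGORY_COUNTS.values())
useful = total - CATEGORY_COUNTS["constant"] - CATEGORY_COUNTS["deadlock"]
stable_total = CATEGORY_COUNTS["stable"] + 3  # add the 3 stable 2-phase protocols

print(total)         # 250
print(useful)        # 137
print(stable_total)  # 15
```

The three totals agree with the 250 generated protocols, the 137 useful family members, and the 15 stable shapes in the lattice.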
Additional properties are derived and mapped to our protocol shapes. We defined the condition that must hold for a controller to be pipelined. This condition is dependent on right cut-away R2266 which overly restricts responses on the upstream channel.
While the interaction between the protocol and the latches was not explicitly modeled, two additional key properties were defined in this work that relate to the latching behaviour of the protocol.
First, an important pipeline property in the presence of stalls is the ability to store data in every latch. This work defined one specific state, the pipeline state, that must exist in any 4-phase protocol in this family to allow it to store data in all latches when stalled. Thus any fully-buffered protocol will contain this state in the shape, whereas half-buffered protocols will not.
Second, this work classifies protocols into two sets: those that require a state variable to control the latch and those that can control the latch using a function on the input and output signals of the gate (possibly using only one handshake signal). This can be a complexity parameter for implementations, as well as provide a reduction in protocol delays or timing requirements in circuit implementations. A coloring on the shape can be derived indicating the states that require an additional latch control state variable for either normally open or normally closed control. Details of these colorings are not presented here due to space limitations.
The protocol shapes were placed into equivalence classes based on the protocol behaviour presented at the interfaces. We found that for parallel pipelines there are 15 equivalence classes of up to 16 different protocols, each dominated by one of the stable protocols. Thus stable protocols have a central role in pipelining. Since linear, independent latches rarely occur, designs that use concurrency reduction techniques to improve performance and power, yet map to a pipeline equivalence class might result in very productive optimisation techniques. Our results in this area will be presented later.
Linear pipelines were also evaluated and placed into equivalence classes. These configurations showed a much finer granularity in equivalence classes, as the largest sets contained only four protocols.
An ordering for arranging protocols into a lattice was defined, and the 15 parallel protocol equivalence classes were placed into a lattice. The lattice, nomenclature, and shape models presented in this paper provide several different methods to compare and contrast protocols and their realisations as circuits. The unique textual representation of the protocols encodes restrictions on the left and right channels and also encodes timing assumptions built into the circuit, including delay-insensitive and speed-independent protocols, those with output ordering, and protocols that have inherent timing in the protocol. The stable protocols, which serve as basins of attraction to the other protocols, are also derived from this naming convention.
A large set of published circuits were then mapped to our protocol family. These circuits include designs using combinational logic, dynamic logic, and C-elements.
The evaluation of tradeoffs between concurrency reduction, energy, and performance across an entire protocol family can now be made. This tradeoff largely occurs due to circuit improvements based on circuit timing, such as output ordering, against the reduced system level concurrency that occurs based on the concurrency reduction. The small number of stable configurations (15) that serve as basins of attraction allow the choice of fixed design zones. There are also subsets of the family that present particularly interesting tradeoffs. Namely, all R0044 and larger cut sets retain full forward concurrency but result in substantially simplified protocols by removing concurrency when recovering from a stall. From a system level perspective this type of concurrency reduction can be extremely beneficial especially if stalls are rare, such as in a data-path. However, this optimisation may perhaps not provide the best protocol when designing FIFO buffers.
The completeness of this work provides information to help designers build circuits that meet their power, performance, and storage needs. This also provides a uniform representation for comparing various implementations of equivalent and similar protocols. This work defines the protocol used for current published circuit implementations.
There is still much to be done in furthering the understanding of asynchronous handshake protocols. Space precludes us from mentioning work completed or underway on mathematical proofs of our results and mathematical transformations that result in the cut-aways, 2-phase latch controllers, rules for timed protocols such as burst-mode and relative timed, and the efficiency of circuits synthesized for a variety of protocols for which there are no known published implementations.
