Abstract: A system-on-a-chip is an interconnection of different pre-verified IP hardware blocks, which communicate using complex protocols. The integration of IP blocks requires some glue logic to interface otherwise incompatible datapaths. This glue logic is called a protocol converter and its manual design proves to be a tedious and time-consuming task. Automatic synthesis is therefore important, but for optimal system-level design it is necessary to consider not just the correctness, but also the quality (in terms of bandwidth and latency of data transfer) of the converter. A good solution to this problem will allow greater use of protocol-level abstraction as a design tool in system design and synthesis. Results are presented on automatic synthesis of a converter between two protocols. It is shown how converter logic which is bandwidth-optimal can be synthesised for datapaths with an arbitrary number of data ports each of which has arbitrary-size first-in first-out (FIFO) storage. An extension of the product FSM converter synthesis algorithm to include FIFO data-paths is presented. In addition the converter bandwidth is identified as a mean cycle graph problem which is solved using maximum mean cycle graph algorithms.
Introduction
In recent years the synthesis of hardware interfaces for protocol-oriented data input/output (I/O) has been a topic of research interest. Protocols represent a convenient abstraction for the specification of module I/O, and progress has been made in automatic synthesis of hardware interfaces from the corresponding protocol definitions.
One key problem is the automatic synthesis of the necessary logic to connect together two hardware blocks with different I/O protocols. This interconnection logic, which comprises a datapath with possible data storage, and appropriate control, is called a protocol converter. Another key problem is the automatic synthesis of finite state machine (FSM) protocol controllers for frame-based protocols such as MPEG and ATM from regular grammar-type languages. These are control-dominated designs, which include large state machines. Expressing the protocols using compact regular grammar-type languages instead of FSMs significantly increases design efficiency.
The second problem is related to the first problem in the following way. Merging the grammars describing two different protocols together and then synthesising the resulting grammar into an FSM implements a protocol controller. This is feasible because of an important property of an interface protocol: it is possible to compose new protocols by merging two existing ones and decompose a protocol into two new protocols by splitting an existing one. In this case, the newly formed grammar describes the protocol converter. In protocol converter synthesis the objective is to derive the grammar (or FSM) for the protocol controller that will synchronise the two incompatible modules. Additional issues arising in protocol converter synthesis are optimising the producer-consumer interconnecting buffer size, and performance (i.e. latency and bandwidth).
This work focuses on protocol converter synthesis, and specifically on a technique that allows automatic synthesis with a wide variety of datapaths and bandwidth optimisation of the resulting converter. The main contribution of this work is an algorithm for automatic synthesis of the converter control logic that is optimal in terms of interface bandwidth for a wide variety of possible protocols and datapaths. The running time of the algorithm is polynomial in the size of the two protocols to be interfaced, when they are expressed as FSMs.
We extend the elegant product FSM construction first published by [1] and later used by [2, 3]. The protocol specifications are modelled as two FSMs, and the converter FSM is derived from the product of the two FSMs. The contribution in this paper is three-fold:
1. We present a precise mathematical framework for expressing the problem of protocol converter synthesis which allows a class of performance issues to be addressed as graph-theoretic questions.
2. Optimal bandwidth control is discovered by finding the maximum profit per time path through the product of these FSMs subject to the constraints. This is the first time that bandwidth, rather than latency, has been optimised in converter synthesis. We show in Section 4.6 that the two are not necessarily identical.
3. The product FSM construction of [1-3] is modified to allow the analysis of arbitrary first-in first-out (FIFO) datapaths. In general the necessary sizes of FIFOs in a system depend on global stochastic system properties, a topic we do not address. However, the storage necessary to optimise interface bandwidth, given that both protocols are transferring data as fast as they can, is a well determined problem that we do address. Our work can be used to determine the minimum datapath storage for an optimum-bandwidth interface.
Related work
Protocol controller synthesis for complex frame-based data communication has been considered in [4-6]. Seawright and Brewer [4] were the first to report the synthesis of hardware from such specifications. In their approach, regular grammar constructs are identified directly with hardware patterns, which were shown to perform better experimentally both in terms of area and delay than standard FSM synthesis. Seawright et al. [8] present a graphical user interface as part of their initial synthesis tool that explicitly uses the structure of the complex data (e.g. ATM cell) as input. This was later commercialised by Synopsys in the Synopsys Protocol Compiler (SPC) [9].
Oberg et al. [5] present a grammar and synthesis tool environment called protocol grammar (PROGRAM) which has the freedom to choose the best possible implementation in terms of area and throughput independent of port size specification. So, whereas SPC facilitates a clock true description, PROGRAM specifies the whole sequence that is associated with the input, and the input and output port sizes are derived according to the throughput constraints posed in the grammar description. PROGRAM was later applied by other members of Oberg's group in another approach called maths to asic (MASIC), to specify protocol glue logic (i.e. a protocol converter) between digital signal processor cores [10, 11]. Global control, configuration and timing (GLOCCT) is specified in a grammar notation in an attempt to automatically build virtual prototypes of the chip early on in the design phase. A technique to estimate the performance of GLOCCT as part of the embedded system is described in [12].
Siegmund and Müller [6] present a protocol specification formalism, SystemC^SV, that is an extension to the SystemC language in a controller synthesis environment (COSYNE) which is ideal for simulation. SystemC^SV also allows embedded sequential behaviours to be specified (e.g. CRC computations inside bit transfer messages). In particular, all protocols that can be expressed with regular grammars and context-free grammars with stack automaton of stack size one can be modelled. Another difference to the approaches of [5] and [8] is that a complete communication architecture, consisting of both the interacting transaction producer and consumer controllers as well as the interconnect between them, is synthesised from one protocol specification (as opposed to two) in the same run of the synthesis algorithm into synthesisable SystemC code. The controller size and maximum operating frequency were comparable to the ones produced by SPC [9].
Somewhat less progress has been made in the area of protocol converter synthesis. Protocol converter synthesis was first attempted by Borriello and Katz [13], whose work is now of limited use because it does not address the 'data correspondence problem': automatically synchronising the timing of the two protocols. Borriello and Katz coined the term transducer to describe a protocol converter; however, this term has not since been widely used in the literature, and we therefore prefer the more natural, but less explicit nomenclature.
Akella and McMillan [1] first suggested pruning a product FSM to synthesise synchronous/asynchronous protocol converters, but their algorithm required manual decisions to reach the final converter. Passerone et al. [2] and Passerone [3] showed how this technique could be automated to generate protocol converters with register datapaths for synchronous systems, and optimised these for latency. The work here is an extension to that reported in [2] and [3], which considers only acyclic FSMs. In contrast this work allows arbitrary cyclic FSMs, which is necessary to compute the bandwidth that is achieved by the protocol converter. The algorithm presented here is also different from that in [2] and [3], where the product FSM is constructed using a depth-first search. The notation used in [2] and [3] has been changed in some cases to simplify the exposition and allow for easier generalisation: in particular to encapsulate data transfer information.
Passerone et al. [14] combined and extended the results of both [2] and [15] to produce a more formal, rigorous and mathematically sound interpretation. In particular, they applied the game-theoretic interface paradigm of [15] and [16] to check for interface compatibility and also to synthesise protocol converters. The ability for the algorithm to generate cyclic FSMs is also an extension to [2] and [3] , but it achieves this manually through a third automaton which specifies the constraint on the converter.
Other kinds of interface synthesis approaches are presented in [17-19]. Filo et al. [17] model concurrent inter-process communication using blocking and non-blocking messages with detailed timing constraints and the synthesis aims to increase performance by making as many communications as possible non-blocking. Madsen and Hald [18] present an approach to interface synthesis based on an algebra which manipulates an abstract communication behaviour between two units by applying transformations that allow both data segmentation and data combination. Finally, Coussy et al. [19] present an IP integration methodology that deduces data exchange delay information during integration of the IP to a shared on-chip-bus. It then uses this, and further data ordering information, to generate a detailed bus-functional model of the IP towards co-simulation.
Converter synthesis
This Section introduces the protocol converter synthesis algorithm using a pass-through datapath. Figure 1 shows how the required converter interfaces to two hardware blocks with protocols P and P' respectively. Each protocol has tuples specifying the binary value of control input lines (I and I'), control output lines (O and O') and a data port (data) carrying arbitrary data items. It is assumed that the two data ports have the same type. The direction of the data port in P is output, and in P' is input. The input to the synthesis algorithm is an FSM description of P and P'. The output of the algorithm is a product FSM to implement a correct converter. The outputs of this FSM drive the control inputs I and I' of P and P'. The inputs of this FSM are the control outputs O and O' of P and P'. The data ports of the two protocols are connected together, and not to the converter.
Protocol specification
Passerone [3] has argued that regular expressions are more easily understood by designers than FSMs. Specification languages equivalent to FSMs have been proposed [7]. In this work we will assume that protocol compilation, for example from regular expressions to the equivalent finite state automata (FSA), is a separate problem, and consider optimal synthesis from an appropriate low-level protocol specification.
It is nevertheless relevant to ask whether the specification that we choose is sufficiently powerful. The choice made here, to use protocols defined by deterministic finite automata, has the merit of theoretical simplicity, and covers a wide range of synchronous protocols. The extensions considered in Section 5.2 are all possible within this framework.
The behaviour of a protocol at any one time is defined by the states of a set of Boolean input and output signals, together with a vector in D, consisting of one integer for each data port, which determines whether the port is producing (positive) or consuming (negative) data items, and is otherwise zero. Formally we define the protocol behaviour to be over the product of the control input space I, control output space O, and $D = \mathbb{Z}^p$. The number p specifies the number of data ports, and is one in the case of a single data port, as in Fig. 1. In the normal case, each port produces or consumes only one item at a time, so possible values of the components of D are $1$, $-1$ and $0$, representing data production, consumption, and neither respectively. The extensions considered later in Section 4.6 will use protocols which produce or consume more than one item.
The product $\Sigma = I \times O \times D$ is called the protocol alphabet. The set of finite strings over this alphabet is, using conventional notation, $\Sigma^*$. Synchronous protocols can be defined as the subset of $\Sigma^*$ containing precisely those strings that are initial segments of traces accepted by the protocol. A protocol is thus formally a grammar.
We use the set of grammars (and hence protocols) that can be generated by deterministic FSA. For convenience, without loss of generality, the deterministic FSA are transformed into the equivalent Mealy FSMs [7], with input and output spaces $(I \times I')$ and $(O \times O')$ respectively. We say that an FSM is deterministic if it can take no more than one transition for a given input and present state. The FSMs representing FSA are deterministic when both the next state and output space are uniquely defined for a given input transition. However, they are in general nondeterministic if the next state is uniquely defined for a given present state, input and output. The FSMs with this latter property are called pseudo-nondeterministic in that the corresponding FSA are deterministic.
We therefore define protocols as pseudo-nondeterministic Mealy FSMs, equipped with an additional output function that determines the output D from the protocol state and inputs. The converter synthesis algorithm can be understood by noting that since the FSMs corresponding to both protocols are in general pseudo-nondeterministic, the state of the FSM representing each protocol can be determined by a corresponding FSM in the controller. The product of these two FSMs thus contains complete information about the state of the system, and from this the required converter FSM can be derived.
Since the datapath of the example shown in Fig. 1 contains no storage, the task of the converter FSM is to synchronise the times when the producer protocol produces data with the times when the consuming protocol consumes data. This has been called the 'data correspondence problem' in the literature. The following Section shows how this can be accomplished.
Synthesis algorithm
The source protocol P is represented formally by a tuple which specifies a pseudo-nondeterministic Mealy FSM, and an additional function d, as in (1). The destination protocol P' is represented in an identical way but with its components denoted by single primes. Throughout this Section we will assume that P and P' each have a single data port, so the functions d and d' return an integer which specifies whether the port acts as a producer or consumer of data:
In (1) S is the set of FSM states, with initial state r. I and O are the finite protocol input and output control spaces. An element of I or O thus represents the state of all the protocol hardware inputs or outputs at a given time, and is itself a binary tuple of arity equal to the number of inputs or outputs respectively. For convenience, in the algorithms that follow we do not explicitly decompose this tuple into its individual components, representing the individual protocol control inputs and outputs. Thus, in the expression $\forall i \in I$ the variable i enumerates all $2^k$ possible sets of control inputs on k wires. T is the set of FSM transitions. $\langle i, x, y, o \rangle \in T$ represents a transition from state x to state y taken when the FSM input tuple is i, and output tuple o. In a convenient abuse of notation we write $\langle R, x, y, o \rangle \in T$ when a transition is taken for a set of inputs $R \subseteq I$. The Mealy FSM behaviour is captured by having multiple transitions from a state, each labelled with a different output tuple, and with possibly the same destination state. The conventional label on the transition $\langle R, x, y, o \rangle \in T$ of the FSM from x to y would be R/o.
In addition d is the protocol data function that specifies data transfer for each state. We assume the protocol has a single data port, in which case d will be $1$ for a state that produces data, and $-1$ for a state that consumes data. d will be zero for a state that both produces and consumes data or a state that neither produces nor consumes data. Note that d can in general depend on I as well as S.
The FSM is input non-deterministic so transition predicates R from a given state to different destination states need not be disjoint. However, the FSM is output deterministic, so where this is the case the corresponding outputs must be different.
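As an illustration of the protocol tuple described above, the following minimal Python sketch stores the components S, r, T and d and checks the output-determinism property. The class name `Protocol`, the helper `is_output_deterministic` and the toy `consumer` are hypothetical and are not taken from the original work; the ordering of the fields is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

# A transition <i, x, y, o>: input tuple, present state, next state, output tuple.
Transition = Tuple[Tuple[int, ...], str, str, Tuple[int, ...]]


@dataclass(frozen=True)
class Protocol:
    """Hypothetical container for a protocol; field order is an assumption."""
    states: FrozenSet[str]                       # S
    reset: str                                   # r
    transitions: FrozenSet[Transition]           # T
    data: Callable[[Tuple[int, ...], str], int]  # d(i, s): +1, -1 or 0


def is_output_deterministic(p: Protocol) -> bool:
    """True if present state, input and output together fix the next state."""
    next_state = {}
    for (i, x, y, o) in p.transitions:
        if next_state.setdefault((i, x, o), y) != y:
            return False
    return True


# Toy consumer with no control inputs and one control output (made up).
# From "idle" the same (empty) input leads to two destinations, told apart
# only by the output: input non-deterministic but output deterministic.
consumer = Protocol(
    states=frozenset({"idle", "wait", "take"}),
    reset="idle",
    transitions=frozenset({
        ((), "idle", "wait", (0,)),
        ((), "idle", "take", (1,)),
        ((), "wait", "take", (0,)),
        ((), "take", "idle", (0,)),
    }),
    data=lambda i, s: -1 if s == "take" else 0,
)
```

Here the converter could infer the next state of `consumer` by observing its output, which is the property the synthesis algorithm relies on.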
We now define formally the construction of a product FSM from the two FSMs representing the two protocols connected by a converter. Throughout this construction we use zero, single and double prime symbols to designate quantities in the first and second protocol FSMs and the product FSM respectively.
Given two protocols P and P', the corresponding product FSM $P'' = P \times P'$ is defined in (2):
where $T''$ is defined as:
Informally, the execution of this non-deterministic product machine represents all possible interleavings of the behaviours of the two protocols. A correct converter must determine values of I and I' at all times to control the product FSM state so that datapath constraints are met. This condition is ensured by pruning from the product FSM all states which violate datapath constraints, or may lead to future violation.
The synthesis algorithm can now be described as four steps, each operating on the product FSM P'' constructed from P and P' as in (2). By construction this product machine tracks the state of both P and P'; however, it does not ensure that datapath constraints are met.
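A product machine of this kind can be sketched in Python as follows. The sketch assumes the usual synchronous product, in which every transition of P is paired with every transition of P' and both machines step together; this is an assumption made for illustration and is not necessarily the exact definition of T'' used in (2). The function name `product_fsm` and the toy transition sets are hypothetical.

```python
from itertools import product


def product_fsm(T, T2, reset, reset2):
    """Return (reachable states, transitions) of an assumed synchronous product.

    T and T2 are sets of transitions (i, x, y, o). A product transition is
    ((i, i'), (x, x'), (y, y'), (o, o')).
    """
    all_tr = {((i, i2), (x, x2), (y, y2), (o, o2))
              for (i, x, y, o), (i2, x2, y2, o2) in product(T, T2)}
    reachable = {(reset, reset2)}
    frontier = [(reset, reset2)]
    while frontier:
        s = frontier.pop()
        for tr in all_tr:
            if tr[1] == s and tr[2] not in reachable:
                reachable.add(tr[2])
                frontier.append(tr[2])
    # Keep only transitions that leave a reachable state.
    return reachable, {tr for tr in all_tr if tr[1] in reachable}


# Toy producer with one control input, and toy consumer with one control output.
T_prod = {((0,), "p0", "p0", ()), ((1,), "p0", "p1", ()),
          ((0,), "p1", "p0", ()), ((1,), "p1", "p0", ())}
T_cons = {((), "c0", "c1", (0,)), ((), "c0", "c0", (1,)),
          ((), "c1", "c0", (0,))}

states, transitions = product_fsm(T_prod, T_cons, "p0", "c0")
```

In this toy case all four product states are reachable. No datapath constraint has been applied yet; that is the role of the pruning steps.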
Step 1 constructs a subset of the product machine P 00 in which all transitions which would violate the datapath constraints (overwrite a data item before it is consumed or consume data before it has been produced) have been removed.
Step 2 ensures that all states or transitions that could lead to datapath constraint violation are removed, recursively. In general the protocol outputs, O and O', can be arbitrary, whereas the protocol inputs, I and I', may be controlled to ensure datapath constraints. Therefore, a good path must exist for every set of outputs and at least one set of inputs. By construction, at the end of Step 2, all transitions in the product machine represent guaranteed safe paths for the two protocols to follow; however, there may be more than one such safe path. The product machine thus represents all possible ways of controlling the two protocols that are feasible.
Step 3 therefore determinises this in such a way that data transfer bandwidth is optimised. Finally in step 4 the converter control FSM is generated with outputs the required protocol inputs, and inputs the given protocol outputs. Formally:
Input: Protocols P and P'.
Output: Converter FSM with control input space $O \times O'$ and control output space $I \times I'$ that implements a converter between P and P', if one exists.
Notation: We define set variables $F_s \subseteq S \times S'$ and $F_{tr} \subseteq T''$. We say that states and transitions of the product machine are dead if they are in $F_s$ or $F_{tr}$ respectively, otherwise they are alive. Two auxiliary functions will simplify the description of the algorithm:
$P(t) = \{ \langle i, s, t, o \rangle \mid \exists\, i, s, o \text{ such that } \langle i, s, t, o \rangle \in T'' \}$
$D(s, o) = \{ \langle i, s, t, o \rangle \mid \exists\, i, t \text{ such that } \langle i, s, t, o \rangle \in T'' \}$ (4)
Thus, P(t) defines the set of transitions from immediate predecessor states to state t in the product FSM P'' and D(s, o) defines the set of transitions in P'' from state s with output o. We define:
Step 1:
Step 1b: $F_s = \{ s'' \mid \forall i'',\ d''(i'', s'') \neq 0 \}$.
This makes states that violate the datapath constraints dead. It is also necessary to deal with transitions that violate this constraint in states that are not dead, so we set:
$F_{tr} := \{ (i'', s'', t'', o'') \in T'' \mid d''(i'', s'') \neq 0 \}$
In step 1 of the algorithm, FSM protocols P and P' are merged together by taking their product. The product FSM is generated recursively by starting from no states and progressively adding states using a depth-first search (DFS) strategy similar to that of [2] and [3]. The difference to the construction in [2] and [3] is that a cyclic FSM is constructed. Because of the properties of DFS on cyclic graphs, the product graph cannot be pruned at the same time as performing DFS as in the case of [2] and [3]. So further steps (steps 2 and 3) are required for the final pruned FSM.
Two data structures are created to assist product state machine computation: a stack, and an FSM.
The stack is a last-in first-out data structure which timestamps each state. A state is pushed onto the stack when it is first created and popped from the stack when the search finishes examining its adjacency list. It therefore prevents endless computation. Every time a new state is pushed onto the stack, the state becomes the root of a new tree in the depth-first forest. A transition from a state to a state earlier in the stack is called a backward transition because the resulting path is a cycle.
The FSM data structure is a cache which stores states that have been explored during DFS. It is used to prevent the algorithm from re-exploring states (hence paths) which have already been explored thus resulting in computational saving. Transitions which point to states on the FSM are called forward transitions. The FSM will eventually include all states in the product FSM. Figure 2 illustrates a DFS search on an arbitrary graph structure. The states are annotated by: time added on stack= time removed from stack (or time added on FSM). The transitions are labeled B or F according to whether they are backward or forward transitions.
During DFS, the product states are unrolled so that states are uniquely defined by $s''$ and $d''(i'', s'')$. A negative value of this integer indicates a violation of causality (i.e. data has been received which has not yet been sent). A positive value implies that data has been lost because no storage is available on the datapath. During the DFS, states which violate data dependencies in this way are marked dead on the FSM, and no further exploration is performed from them.
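The roles of the stack and the FSM cache in the search described above can be illustrated with a short Python sketch. It classifies transitions as backward (target still on the stack) or forward (target already stored in the cache of explored states). The function name `dfs_classify`, the extra label `T` for a transition to a newly created state, and the example graph are all hypothetical; the graph is not the one in Fig. 2.

```python
def dfs_classify(succ, root):
    """Label each transition reached from root during a depth-first search.

    succ maps a state to an ordered list of successor states.
    'B' = backward (target is on the stack, so the path closes a cycle),
    'F' = forward (target already explored and stored in the cache),
    'T' = transition to a state seen for the first time (label is our own).
    """
    stack = []      # states whose adjacency list is still being examined
    cache = set()   # states already fully explored (the "FSM" data structure)
    labels = {}

    def visit(u):
        stack.append(u)
        for v in succ.get(u, []):
            if v in stack:
                labels[(u, v)] = "B"
            elif v in cache:
                labels[(u, v)] = "F"
            else:
                labels[(u, v)] = "T"
                visit(v)
        stack.pop()
        cache.add(u)

    visit(root)
    return labels


example = {"a": ["b", "c"], "b": ["c", "a"], "c": ["a"]}
labels = dfs_classify(example, "a")
```

Because a state is never revisited once it is on the stack or in the cache, the search terminates on cyclic graphs. The recursive form is used here for brevity; a large product machine would need an explicit stack.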
Step 2a: Make dead all P'' transitions that lead to dead states:
$\forall t \in F_s:\ F_{tr} := F_{tr} \cup P(t)$
Step 2b: Make dead all P'' states that have transitions that are all dead for any given output:
$(D(s, o) \neq \emptyset) \text{ and } (D(s, o) \subseteq F_{tr}) \Rightarrow F_s := F_s \cup \{s\}$
This condition means that there is no protocol input that can stop a potential transition to a dead state from s, and therefore s must become dead.
Step 2c: Repeat Steps 2a and 2b until $F_s$ does not change. The output from step 2 is a pruned product FSM $P''_0$, equal to P'' with $F_{tr}$ and $F_s$ deleted from transitions and states respectively.
After forming the product FSM, it is necessary to remove all unsafe states and transitions which lead to the violation of
The backtracking procedure begins from an arbitrary transition in $F_{tr}$ which violates data dependencies. It backtracks, visiting states in the predecessor path, until it reaches a state which is either already in $F_s$ or which remains in P'' as a result of the visit. As in the iterative procedure, the fate of a state's visit is evaluated by determining whether its transitions are all dead for any given output, in which case the state is added to $F_s$ and backtracking resumes from this state. The above process is repeated for all remaining initial transitions in $F_{tr}$. The total number of backtracks performed is bounded by the total number of transitions, so the running time is O(m).
Since step 2 of this procedure combines all protocol and datapath constraints, $P''_0$ will be non-empty if and only if it is possible to design a converter. Figure 3 illustrates the backtracking procedure. Transitions Y -> X, W -> X and W -> Z violate data dependencies and are initially in $F_{tr}$. Even though transition Y -> X causes a data dependency violation in state Y, there is still a valid transition for output 0, so no backtracking is performed from state Y.
State W is added to $F_s$ because there is no valid transition for output 0. Backtracking is performed from state W and $F_{tr}$ is updated with transition V -> W. Even after the exclusion of transition V -> W, there are still valid transitions for both 0 and 1 outputs in state V. Therefore, no further backtracking is performed from state V.
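The iterative form of step 2 can be sketched in Python as a fixed-point computation over the sets of dead states and dead transitions. This is the simple iterative variant, not the O(m) backtracking procedure. The function name `prune` is hypothetical, and the example graph is made up to follow the description of Fig. 3; the actual inputs and outputs on the transitions of Fig. 3, and the extra state U, are assumptions.

```python
def prune(transitions, dead_states, dead_trans):
    """Iterate steps 2a and 2b until no new state or transition becomes dead.

    transitions: set of (i, s, t, o) tuples of the product machine.
    Returns the final (dead_states, dead_trans).
    """
    dead_states, dead_trans = set(dead_states), set(dead_trans)
    states = {tr[1] for tr in transitions} | {tr[2] for tr in transitions}
    changed = True
    while changed:
        changed = False
        # Step 2a: transitions that lead to a dead state become dead.
        for tr in transitions:
            if tr[2] in dead_states and tr not in dead_trans:
                dead_trans.add(tr)
                changed = True
        # Step 2b: a state dies if, for some output, all its transitions are dead.
        for s in states - dead_states:
            by_output = {}
            for tr in transitions:
                if tr[1] == s:
                    by_output.setdefault(tr[3], []).append(tr)
            if any(all(tr in dead_trans for tr in group)
                   for group in by_output.values()):
                dead_states.add(s)
                changed = True
    return dead_states, dead_trans


# Made-up graph in the spirit of Fig. 3 (input labels "a"/"b" are assumed).
T = {("a", "V", "W", 0), ("b", "V", "Y", 0), ("a", "V", "Y", 1),
     ("a", "W", "X", 0), ("b", "W", "Z", 0),
     ("a", "Y", "X", 0), ("b", "Y", "U", 0)}
violating = {("a", "Y", "X", 0), ("a", "W", "X", 0), ("b", "W", "Z", 0)}

F_s, F_tr = prune(T, set(), violating)
```

On this graph W becomes dead, V -> W is added to the dead transitions, and V and Y survive because each still has a live transition for every output.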
Step 3: ... deleted to resolve output non-determinism, $P''_1$.
Step 4: $P''_1$ is relabeled as an FSM with inputs $O \times O'$ and outputs $I \times I'$. This is the required converter control FSM. This reversal of outputs and inputs is necessary because a protocol output corresponds to a converter control FSM input and vice versa. It should therefore be noted that throughout the construction of P'', $O \times O'$ represent the possible inputs to the control FSM from the outputs of the two protocols, etc. The output non-determinism resolved in step 3 represents different controller designs that respect the two protocols, but may have differing performances. Specifically, a nontrivial choice may exist in the protocol inputs allowed in a given product state for a given protocol output.
Let $T_0$ be the set of all transitions in the step 2 output P ...
In the case of simple protocols it is possible to resolve this by assuming one item is transferred in each non-trivial cycle of the product FSM transition graph and calculating the minimum path length from u to u through either s or t.
The shorter path is chosen. This optimises data transfer latency. Section 4 describes a more general way to make this choice.
Example of synthesis algorithm
This example demonstrates the synthesis of a converter between two protocols P and P'. Figure 4 shows the FSM representation of the two protocols, a producer P and a consumer P'. In this case, data transfer function d is independent of I, so data transfer ($d = 1, -1$ respectively for producer and consumer) is indicated by appending a star to the state label. Both protocols transfer data in state 2, and have reset state 0.
P has one control input, with values indicated on transitions. Unlabelled transitions are taken unconditionally. In this case P is input deterministic. The protocol will produce a new data item every two or more cycles, depending on the input in state 1. In contrast, P' has no input. Its output is indicated on transitions in the form /output. The output in state 3 must be observed to determine whether the path from state 3 to state 2 takes one or two cycles, and this path is not controllable by the converter.
The product FSM $P'' = P \times P'$ has 12 states, five of which are shown in Fig. 5. The forks in P'' depend on either the output of P', which the converter cannot control, or the input of P, which the converter can control. The four transitions from state (1,3) thus split into two groups, depending on the P' output. Within each group, the transition taken is controllable.
Fig. 3: Step 2 of the algorithm implemented by backtracking from dead states
Fig. 4: Two protocols
The two states with dotted edges fail in step 1 due to datapath constraints. The (2,2) state must clearly be in the converter FSM, since it is the only state in which data can be transferred. Inspection shows that states (1,3) and (1,1) are also needed. The state (0,0) is also needed, but not shown. All other states fail in step 2; for example (0,3) has a transition to (1,2) if the P' output is one, and therefore fails.
In this example there is no non-determinism to resolve in step 3 because states (2,1) and (1,2) have been removed by datapath constraints. If this had not happened a decision would be made in state (1,3) between transitions to (2,1) and (1,1), and between transitions to (2,2) and (1,2).
Protocol converter synthesis with FIFO datapaths
This Section will show how converter controllers for a wide range of datapaths, characterised by the amount of FIFO storage in the datapath, can be synthesised.
The controllers are optimal in the sense that for the specified datapath they deliver maximum bandwidth, and the synthesis algorithm is polynomial in the size of the pruned product FSM. The bandwidth optimisation does not rely on stochastic aspects of the system.
Datapath considerations
In this Section we assume that the converter to be synthesised has a single port through which data flows unidirectionally. The extension to multiple ports, and bidirectional flows, is discussed in Section 4.6. In this case the datapath consists of a single FIFO, characterised by a maximum number of items stored. Two special cases are when this number is one or zero. A FIFO of length one is equivalent to a single buffer register, a FIFO of length zero corresponds to the previous case of direct connection between the input and output ports.
In order to minimise data latency it is assumed that data falls through the FIFO in zero cycles. Figure 6 shows a typical edge triggered implementation for a single buffer register with this property: an equivalent construction can be used to implement zero cycle fall-through from a one-cycle fall-through FIFO.
Control synthesis
In this Section the algorithm of Section 3.2 is extended to allow arbitrary-sized FIFO datapaths.
Assume a data FIFO size of $N > 0$. The converter FSM must in general keep track of the number of items held in the FIFO. This is implemented by unrolling the product machine P'' from Section 3.2 $N + 1$ times, so that the extended product FSM $P''_e$ has state space $S''_e$ which is a product:
n is an integer and its value encodes the state of the FIFO in terms of the number of items. It can be arbitrarily chosen subject to:
These conditions enforce the datapath constraint that the FIFO must neither underflow nor overflow. The FIFO overflows when the number of items, n, exceeds the capacity of the FIFO, N. FIFO underflow implies a violation of causality, meaning that data is read from the FIFO which has not yet been written. P'' is replaced by $P''_e$ in step 1, and step 1b is therefore no longer necessary.
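The unrolling can be sketched in Python as follows. The sketch assumes that the occupancy after a transition is the current occupancy n plus the net number of items produced in the current product state, and that a transition is allowed only if the result stays within 0..N; the exact update rule of the original equations is not reproduced here, so this rule is an assumption. The names `extend_with_fifo` and `delta` are hypothetical.

```python
def extend_with_fifo(transitions, delta, reset, N):
    """Unroll a product machine over FIFO occupancy n in 0..N.

    transitions: set of (i, s, t, o) for the product machine.
    delta(i, s): assumed net number of items added to the FIFO in state s.
    Returns (extended states, extended transitions, dead transitions).
    """
    start = (reset, 0)                      # FIFO assumed empty at reset
    seen, stack = {start}, [start]
    ext, dead = set(), set()
    while stack:
        s, n = stack.pop()
        for (i, x, y, o) in transitions:
            if x != s:
                continue
            n_next = n + delta(i, s)
            if not 0 <= n_next <= N:
                dead.add((i, (s, n), o))    # would underflow or overflow
                continue
            nxt = (y, n_next)
            ext.add((i, (s, n), nxt, o))
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen, ext, dead


# Toy product machine: state "A" only produces, state "B" only consumes.
T = {((), "A", "B", ()), ((), "B", "A", ())}
delta = lambda i, s: {"A": 1, "B": -1}[s]

states1, ext1, dead1 = extend_with_fifo(T, delta, "A", N=1)
states0, ext0, dead0 = extend_with_fifo(T, delta, "A", N=0)
```

With N = 1 the toy machine alternates between ("A", 0) and ("B", 1). With N = 0 the only transition from the reset state would overflow, so no converter exists for this pair without storage.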
Step 2 proceeds exactly as before, to generate an output $P''_{e0}$. Resolving non-determinacy during step 3 is now more complex. We want to optimise the rate at which items are transferred, since this is the bandwidth of the converter. This is in general not the same as optimising the converter latency, i.e. the time between production and consumption of a given item of data. The converter bandwidth can be calculated without loss of generality over cycles in the extended product FSM. In any such cycle the number of data items produced and consumed is equal, and can be calculated as the sum of $|d(s)|$ along the cycle. In general the cycle taken is protocol dependent, and cannot be predicted. It is, however, reasonable to optimise the converter so that at any time, if both producer and consumer run as fast as possible in the future, data rate is maximum. The exact solution to this problem is presented in Section 4.3.
Maximum cycle mean calculation
Determining the best transition when there is nondeterminacy in $P''_{e0}$ at the end of step 2 is a modified case of the maximum cycle mean graph problem [20].
Let P ... Assign a weight $w(u, v)$ to edge $(u, v) \in E$ to be the maximum number of data items transferred by P in the transition:
Fig. 6: A zero cycle fall-through register
The cycle mean is defined to be the sum of the edge weights divided by the path length, over the cycle. The maximum mean of weights over one cycle then corresponds to the maximum data transfer rate. There are a number of algorithms, surveyed in [20], that find the global maximum cycle mean efficiently. The requirement here is, however, to find a maximum cycle mean separately for every non-deterministic choice. The possible transitions for one such choice are restricted by having a given starting state, s, and transition output value, o. If cycles are restricted to those including a given transition, and restricted maximum cycle means calculated, the transition with the largest restricted maximum cycle mean should be chosen. Efficient solution of this problem has not been addressed directly in the literature, although algorithms to find the global maximum cycle mean exist, mostly based on the work of Karp [21]. Karp's algorithm contains a recurrence which finds $D_k(s, v)$, the maximum summed weight of any path of length k starting with state s and ending in state v.
We use a modified version of this recurrence to find $D_k(s, t, v)$, the maximum summed weight of any path of length k that starts with the transition from state s to state t and ends in state v. This then allows us to compute the restricted maximum cycle means. The maximum cycle mean through s in direction t can then be calculated as:
$$A(s, t) = \max_{k \in [1, n]} \frac{D_k(s, t, s)}{k}$$

This calculation must be repeated for all non-deterministic transitions from s to t, and the transition with the largest A chosen.

The diagram in Fig. 7 shows how Karp's algorithm works by giving the table entries starting from the source s. Each row (column) of circles corresponds to a row (column) of the table D, where each row is identified by an integer and each column by a node. The symbol 'e' represents $-\infty$. The numbers just to the right of each circle are the values stored at the corresponding table entries, e.g. $D[2, c] = 9$ and $D[3, a] = -\infty$. There are two cycles in this graph: $\langle s, a, b, c, s \rangle$ and $\langle s, b, c, s \rangle$. The maximum cycle mean is then $\max\{(3 + 4 + 7 + 2)/4, (2 + 7 + 2)/3\}$, and $\langle s, a, b, c, s \rangle$ is the critical cycle.

Figure 8 shows how Karp's algorithm is modified to determine the restricted maximum cycle means for the digraph in Fig. 7. Figure 8a shows the table entries starting with state s in the direction of state a. There is one cycle in this graph, $\langle s, a, b, c, s \rangle$, with maximum cycle mean $(3 + 4 + 7 + 2)/4$. Figure 8b shows the table entries starting with state s in the direction of state b. There is one cycle in this graph, $\langle s, b, c, s \rangle$, with maximum cycle mean $(2 + 7 + 2)/3$.

Table 1 summarises the worst-case time complexity of this algorithm, in terms of the number of nodes, n, and the number of transitions, m, in the product state machine.
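The restricted calculation can be sketched in a few lines of Python. The recurrence used here is an assumption reconstructed from the description above: $D_1(s, t, v)$ is $w(s, t)$ when $v = t$ and $-\infty$ otherwise, and $D_k(s, t, v)$ is the maximum of $D_{k-1}(s, t, u) + w(u, v)$ over edges $(u, v)$. The function name is hypothetical, and the edge weights are those implied by the cycle sums quoted for Fig. 7.

```python
from fractions import Fraction

NEG_INF = float("-inf")


def restricted_max_cycle_mean(nodes, weights, s, t):
    """Return A(s, t) = max over k in [1, n] of D_k(s, t, s) / k, or None.

    weights: dict mapping an edge (u, v) to its integer weight.
    """
    n = len(nodes)
    # D_1: only the forced first transition s -> t is allowed.
    d = {v: NEG_INF for v in nodes}
    d[t] = weights[(s, t)]
    best = None
    for k in range(1, n + 1):
        if d[s] != NEG_INF:  # a walk of length k has returned to s
            mean = Fraction(d[s], k)
            best = mean if best is None else max(best, mean)
        # Karp-style step: extend every best walk by one edge.
        nxt = {v: NEG_INF for v in nodes}
        for (u, v), w in weights.items():
            if d[u] != NEG_INF:
                nxt[v] = max(nxt[v], d[u] + w)
        d = nxt
    return best


NODES = ["s", "a", "b", "c"]
WEIGHTS = {("s", "a"): 3, ("a", "b"): 4, ("b", "c"): 7, ("c", "s"): 2, ("s", "b"): 2}
mean_via_a = restricted_max_cycle_mean(NODES, WEIGHTS, "s", "a")  # Fraction(4, 1)
mean_via_b = restricted_max_cycle_mean(NODES, WEIGHTS, "s", "b")  # Fraction(11, 3)
```

On this example the transition from s to a would be chosen, since 4 exceeds 11/3. Exact fractions are used so that near-equal means are compared without rounding error.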
Time complexity
Running time is dominated by step 3, due to the complexity of the repeated maximum cycle mean calculation. It should be noted that step 3 operates on a product FSM which has been pruned by step 2. The typical running time is therefore less than this worst-case figure. Further optimisation of step 3 can be implemented by noting the special status of self-transitions in $P''_{e0}$. These must either have non-zero weights, in which case they comprise the maximum cycle mean path, or zero weights, in which case they cannot be part of any maximum mean cycle path and may be removed before calculation of maximum cycle means.
Evaluating the quality of the synthesised converters
The digraph that is output from step 3 is not necessarily connected. We decompose the resulting digraph into individual subgraphs called strongly connected components (SCCs). An efficient algorithm exists which performs this decomposition with two depth-first searches (DFSs) in linear time, $\Theta(m + n)$ [22]. We use the part of the decomposition algorithm in [22] which determines and then generates the sink SCC of the resulting digraph. This represents the set of possible behaviours into which the protocol converter eventually settles when a stimulus is applied. We define the average performance bandwidth of the synthesised converter as the number of items transferred per cycle, averaged over all possible cycles in the sink SCC of $P''_1$.

Table 1: Worst-case time complexity
Step 1    $O(m)$
Step 2    $O(m)$
Step 3    $O(nm^2)$
Step 4    $O(n)$

Fig. 7 The entries for D and the arcs Karp's algorithm visits for a four-node digraph from source node s

Fig. 8 The two tables produced by the restricted maximum cycle mean algorithm starting from s: (a) in the direction of state a; (b) in the direction of state b

Evaluating the average bandwidth over all possible cycles requires determining the number of simple cycles in the SCC. DFS from an arbitrary node in the SCC can be used to detect all possible cycles. The problem with DFS is that it may detect the same cycle more than once, because the nodes constituting the cycle can be reached by more than one path. The number of cycles can therefore be determined by performing DFS from an arbitrary node and storing the cycles which have already been detected, thus avoiding counting the same cycle more than once. The bandwidth for each cycle is then simply the number of items produced (or consumed) per cycle divided by the cycle length. The average bandwidth is equal to the sum of all individual cycle bandwidths divided by the number of cycles.
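A minimal sketch of this averaging step is given below. It takes the edges of a sink SCC, each labelled with the number of items transferred, and averages the per-cycle bandwidth over all simple cycles. Instead of storing the cycles already detected, it avoids duplicates by reporting each cycle only from its lowest-ranked node, which is a swapped-in technique rather than the one described above. The function names and the example weights are illustrative assumptions.

```python
from fractions import Fraction


def simple_cycles(items):
    """Yield every simple cycle exactly once, as a list of nodes.

    items: dict mapping an edge (u, v) to the items transferred on it.
    """
    nodes = sorted({x for edge in items for x in edge})
    rank = {u: i for i, u in enumerate(nodes)}
    succ = {u: [v for (a, v) in items if a == u] for u in nodes}

    def extend(start, path, on_path):
        for v in succ[path[-1]]:
            if v == start:
                yield list(path)
            # Only visit nodes ranked above the start node, so a cycle is
            # reported from its lowest-ranked node and never twice.
            elif rank[v] > rank[start] and v not in on_path:
                path.append(v)
                on_path.add(v)
                yield from extend(start, path, on_path)
                on_path.remove(v)
                path.pop()

    for start in nodes:
        yield from extend(start, [start], {start})


def average_bandwidth(items):
    """Mean over simple cycles of (items transferred / cycle length)."""
    rates = []
    for cycle in simple_cycles(items):
        hops = zip(cycle, cycle[1:] + cycle[:1])
        rates.append(Fraction(sum(items[h] for h in hops), len(cycle)))
    return sum(rates) / len(rates) if rates else None


# Illustrative weights only: two cycles with bandwidths 4 and 11/3.
SCC = {("s", "a"): 3, ("a", "b"): 4, ("b", "c"): 7, ("c", "s"): 2, ("s", "b"): 2}
avg = average_bandwidth(SCC)  # Fraction(23, 6)
```

The number of simple cycles can grow exponentially with the size of the SCC, so this sketch is only practical for small converters.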
Elaboration
The algorithm has been extended to a number of related protocol converter synthesis problems with small changes to the data transfer function d.
1. Protocols with data ports of different widths. The values of d represent the width of the ports, in some unit equal to their greatest common divisor. For a more detailed discussion of this, addressing the issue of data reordering, see [23].
2. Protocols with multiple data ports. The value of d must be an integer-valued vector, with one component for each data port. The producer and consumer data ports must correspond, and therefore vectors for producer and consumer must be the same length, and ordered to determine the correspondence between producer and consumer data ports. In order to maximise bandwidth with multiple data ports an objective function is required that assigns to each value of d a positive scalar weight. This can then be used in the maximum cycle mean driven optimisation of Section 4.3.
3. Protocols with output and input ports. Ports must correspondingly match input and output ports on the other protocol, or no converter will be possible. The direction of a port is indicated by reversing the polarity of d, and multiple ports are handled as above.
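As a small illustration of the first case, the port widths can be expressed in units of their greatest common divisor. The helper name `transfer_units` and the example widths are hypothetical.

```python
from math import gcd


def transfer_units(producer_width, consumer_width):
    """Return (unit, d, d_prime): the common unit in bits and the number
    of units moved per transfer by the producer and by the consumer."""
    unit = gcd(producer_width, consumer_width)
    return unit, producer_width // unit, consumer_width // unit


# A 24-bit producer port feeding a 16-bit consumer port: the unit is
# 8 bits, so the producer moves 3 units per transfer and the consumer 2.
units = transfer_units(24, 16)
```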
Data-dependent protocols, for example where a specific data token is recognised by the two protocols, can be represented by adding appropriate protocol control signals. The presence of the recognised token on a data port would then drive an extra output or input signal on the corresponding protocol.

Figure 9 shows an FSM representation of a burst transfer protocol. The protocol transfers data on entering states 1 and 2 (indicated by *) and has reset state 0. It has two control signals: acknowledge and burst size. The acknowledge signal indicates data transfer when input and data reception when output. The burst size signal indicates the size of the burst (i.e. in this case 1 or 2), which results in different protocol cycles. When output, its value is non-deterministic; when input, its value is determined by the synthesised converter.
Experimental results
In experiments 1, 2 and 3, the burst transfer protocol of Fig. 9 is applied to both the producer P and the consumer $P'$. The protocols differ from one another with respect to the direction of the control signals:
In experiment 1, P consists of input burst size and acknowledge control signals. $P'$ consists of an input burst size signal and an output acknowledge signal. In experiment 2, both P and $P'$ consist of input acknowledge and output burst size signals. Experiment 2 is repeated in experiment 3, only this time converters are synthesised between protocols with incompatible port sizes, using the datapath architecture shown in Fig. 10.

In experiment 4, we applied the methodology presented to model part of the communication interface between AMBA's high-performance AHB bus master and low-performance APB bus slave. The AHB bus has a maximum bus width of 1024 bits and a maximum transfer rate of 1 transfer per clock cycle. The APB has a maximum bus width of 32 bits and a maximum transfer rate of 1 transfer per 2 clock cycles. The AHB bus is modelled to use both a non-deterministic and a deterministic burst transfer mechanism (4/8/16-beat bursts), and the APB bus uses a back-to-back write transfer mechanism. A FIFO with a depth of 16 was chosen as the datapath. It is assumed that the busses are of the same width. The experiments are repeated for an AHB bus with wait states. Wait states are required in the case of bus contention.
4.7.1 Experiment 1:
P resumes its data transfer when an acknowledge signal has been issued. This signal is controllable, which means that a protocol converter will exist in the absence of a FIFO datapath.
The number of identical SCCs comprising the protocol converter increases linearly with increasing FIFO storage size. The average bandwidth therefore remains unaffected when adding a FIFO to the data path.
In step 3 of the algorithm, the non-determinacy is resolved by choosing the edge with the maximum cycle mean. The selected transition optimises the rate at which items are transferred. For example in Fig. 11, the state labelled (0,0,0) has to choose between three non-deterministic transitions, which are indicated with dotted edges, to states (1,1,0), (1,2,0) and (2,2,0). The maximum cycle mean calculation results in (1,1,0) being chosen, whereas minimum data transfer latency scheduling would result in state (2,2,0) being chosen. Resolving the non-determinacy with respect to maximum cycle mean generates a converter with an average bandwidth of 0.33 data items per clock cycle, as opposed to the 0.25 data items per clock cycle that results from resolving the non-determinacy with respect to minimum data transfer latency. The former consists of three states and the latter of two states when $N = 0$ (i.e. with a direct connection between the data busses).

Table 2 shows the results obtained for synthesising converters with varying FIFO sizes. In particular, the average bandwidth A is indicated. Increasing the FIFO size increases the state space, and thus increases the number of options (hence cycles) in the protocol converter. In this case, optimising with respect to minimum data transfer latency produces the same synthesis results because the converter does not determine the burst size signals.
Experiment 2:
The minimum data transfer latency converter was derived by running the Floyd-Warshall algorithm [22], which solves the all-pairs shortest path problem in $O(n^3)$ time, on $P''_0$, and labelling edges with the minimum cycle length. The scheduling heuristic that was used resolved the resulting non-determinism at the end of step 2 by choosing the transition which transfers data and minimises the cycle length. This decision results in optimising the data transfer latency (i.e. the waiting time for the data bus to assume new data) for both producer and consumer.
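The edge labelling can be sketched as follows, under the assumption that the minimum cycle length of an edge $(u, v)$ means the length of the shortest cycle containing it, i.e. one plus the shortest path length from v back to u. The function name and the example graph, which reuses the shape of the digraph in Fig. 7, are illustrative only.

```python
def min_cycle_length_per_edge(nodes, edges):
    """Label each edge (u, v) with the length of the shortest cycle
    through it, using Floyd-Warshall all-pairs shortest paths."""
    inf = float("inf")
    dist = {(u, v): (0 if u == v else inf) for u in nodes for v in nodes}
    for (u, v) in edges:
        if u != v:
            dist[(u, v)] = 1  # every transition takes one clock cycle
    for k in nodes:
        for i in nodes:
            for j in nodes:
                via = dist[(i, k)] + dist[(k, j)]
                if via < dist[(i, j)]:
                    dist[(i, j)] = via
    # An edge with no return path is labelled with infinity.
    return {(u, v): 1 + dist[(v, u)] for (u, v) in edges}


NODES = ["s", "a", "b", "c"]
EDGES = [("s", "a"), ("a", "b"), ("b", "c"), ("c", "s"), ("s", "b")]
lengths = min_cycle_length_per_edge(NODES, EDGES)
```

On this graph the edge from s to b lies on a cycle of length 3 and the edge from s to a on a cycle of length 4, so a heuristic that minimises cycle length would leave s via b.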
The sizes of the converter FSMs are also indicated. After synthesis it is possible to roll up the converter FSM by using a separate counter to store the number of items in the FIFO queue, and implementing the states in $P''$ rather than $P''_e$. In this case the FSM transitions depend on the value of the counter. Table 2 shows the converter FSM sizes for the unrolled and rolled-up cases.

Table 3 shows results obtained for synthesising converters between protocols with specified port sizes $d$ and $d'$. The minimum number of registers in the datapath required to achieve optimum bandwidth, along with the unrolled FSM controller size, are indicated. The number of cycles in the sink SCC and the bandwidth A are also given.
Experiment 3:
When $d = d'$, either the producer or the consumer can be faster in a protocol cycle. Therefore, increasing the number of registers increases the state space, the number of cycles and the bandwidth. Increasing the number of registers beyond unity results in a negligible increase in bandwidth (* indicates that A is non-optimal), which becomes smaller with increasing queue size.
When $d < d'$, the consumer is always faster than the producer in a protocol cycle. The speed of the converter is restricted to that of the producer from reset because of the causality constraint (i.e. the FIFO cannot underflow). The result is a converter with one SCC. As expected, increasing the ratio $d'/d$ increases the number of registers required to achieve optimum bandwidth.
When $d > d'$, the producer is always faster than the consumer in a protocol cycle. The speed of the converter is restricted to the speed of the consumer because of the FIFO overflow constraint. The resulting converter consists of multiple SCCs. The sink SCC is identical to the only SCC in the $d < d'$ cases. This is because the producer and consumer protocols are identical. The remaining SCCs

Table 4 illustrates the results obtained for synthesising converters between the AMBA AHB and APB busses. The algorithm was run on a 3.2 GHz Xeon and the synthesis times are given. The times required to execute steps 1 and 2 are negligible compared to that required to execute step 3 of the algorithm. The numbers in brackets indicate the corresponding results for minimum data transfer latency scheduling.
Experiment 4:

Conclusions
We have defined bandwidth-optimal protocol converters, and posed the converter controller synthesis problem in a way that allows uniform treatment of a variety of datapaths. We have presented a synthesis algorithm that allows fast determination of bandwidth-optimal controllers. The synthesis algorithm has the capacity to solve without structural change a number of converter synthesis problems, as indicated in Section 4.6. The algorithm described in Section 4 can synthesise a converter which is optimal for any given datapath: iterative application would allow brute-force optimisation of the datapath storage. This paper makes a contribution beyond the previous work of [2, 3, 14] in two major ways: (i) unrolling the product state machine, as detailed in Section 4.3.2, is shown to model datapaths with arbitrary storage in a very elegant way; and (ii) protocol bandwidth is measured using restricted maximum cycle means. An algorithm is presented to calculate these and resolve controller non-determinacy accordingly to optimise bandwidth.
Our work shows that the FSM construction first suggested by Akella and McMillan [1] , and modified in [2, 3, 14] , is a powerful technique for automatic synthesis. Its advantage is that where applicable it poses the synthesis data correspondence problem in a way that allows compact exploration of all possible solutions, and choice of the optimal solution. We have shown here that its range of applicability can be extended significantly, at the expense of a small increase in state space size.
In terms of further work, we note that the calculation of maximum cycle means given here is not particularly elegant. A computational optimisation based on the typically sparse nature of d (i.e. it is zero in nearly all states) might be worthwhile. More interestingly, it would be useful to find a fast method that calculates all the required maximum cycle means simultaneously, reusing intermediate results whenever possible.
Protocols as defined here, with multiple input and output ports, can be used to describe the temporal relationship between inputs and outputs in synchronous hardware blocks. This might extend the possible use of this technique to optimise multiple blocks of hardware simultaneously.
We are investigating a further extension of the product FSM synthesis technique in which transitions of the protocol FSMs are annotated with integer time intervals, representing the time between entering the state and the transition being taken. This can in principle be treated with the algorithm described here, at the cost of state proliferation. We propose to reduce the complexity of this system by using a direct representation of time intervals in the product FSM. 
