We consider the problem of synthesizing the asynchronous wrappers and glue logic needed for the correct GALS implementation of a modular synchronous system. Our approach is based on the weakly endochronous synchronous model, which defines high-level, implementation-independent conditions guaranteeing correct desynchronization at the level of the abstract synchronous model. We can therefore factor the synthesis problem into (1) a high-level, implementation-independent phase ensuring the weak endochrony of each synchronous module and (2) the actual wrapper synthesis phase, which is greatly simplified by the high-level assumptions yet flexible enough to produce a variety of efficient implementations. We focus here on the synthesis of delay-insensitive asynchronous wrappers from weakly endochronous synchronous modules, and show how this can be done for a simple DLX processor model.
design flows that allowed the exponential increase in speed and complexity for more than 30 years. The reason for this is twofold:
(i) The synchronous abstraction (footnote 1) facilitates the specification and the analysis of complex systems. Provided that a few high-level constraints ensure compliance with the synchrony hypothesis, the designer can forget about timing and communication issues and concentrate on functionality. The synchronous model features deterministic concurrency and simple composition mechanisms facilitating the incremental development of large systems. Also, synchronous models are usually easier to analyze, verify, and optimize than their asynchronous counterparts, often because the state-transition representations are smaller.
(ii) Until recently, the fundamental ingredients of a synchronous implementation (a global clock distributed over the whole chip with small skew, circuit-wide communication within a single clock cycle) were easily mapped onto the various silicon technologies.
However, the increase in speed and complexity and the decrease in feature size have made technology mapping ever more difficult. As a result, an important research effort has been directed to fields such as clock distribution, skew control, and on-chip interconnect design. The problem with the resulting techniques is that they are often global and rely on a stronger integration of various design phases (such as logic synthesis and placement). This increases the interdependency between functionality and communication and is contrary to the current trends aiming at modular development based on off-the-shelf IPs.
One solution to what seems to be an evolutionary dead-end may come from asynchronous circuit design [6]. Indeed, modularity and component-based implementation figure among the potential benefits of asynchronous design methodologies (along with increased efficiency, lower power consumption and lower electromagnetic interference). The weak point of asynchronous design methods is complexity. Unlike synchronous circuits, asynchronous circuits cannot, in the most general case, separate combinational behavior, sequential behavior, and timing aspects. The state explosion occurs very fast, so that only small circuits can be handled. Moreover, regardless of efficiency considerations, a radical paradigm shift towards asynchronous design is unlikely to occur in the near future, given that most CAD tools are fundamentally synchronous, and that few engineers have adequate training.

Footnote 1: Cyclic, clock-driven execution. During each clock cycle, the behavioral propagation is causal, so that the status of every wire is defined prior to being used in computations. Note that the last requirement empowers the conceptual abstraction that computations and communications are infinitely fast ("zero-time") and take place at discrete points in time, with no duration.
Gathering advantages of both the synchronous and asynchronous approaches, the Globally Asynchronous, Locally Synchronous (GALS) systems are emerging as an architecture of choice for implementing complex Systems-on-Chips. In a GALS system, locally-clocked synchronous components are connected through asynchronous communication lines. Thus, unlike for a purely asynchronous design, the existing synchronous tools can be used for most of the development process, while the implementation can exploit the good modularity properties of asynchronous communication.
The problem. Contribution
This paper addresses the problem of correctly and efficiently implementing a modular synchronous specification as a GALS circuit where inter-component communication is done through asynchronous lines (in our case, FIFOs). This operation, also called desynchronization [14] or GALSification [10] , involves the construction of asynchronous wrappers that control input, output, and clock generation for each synchronous module. The wrappers have two functions:
(i) reconstruct, for each synchronous module, the input synchronization fronts from asynchronous events
(ii) preserve, in a certain sense, the semantics of the synchronous specification in the GALS implementation (which may involve a form of distributed control to ensure the needed global synchronization properties).
The exact problem we consider is that of synthesizing the asynchronous wrappers starting from the specification of the synchronous modules. Our approach is based on the weakly endochronous synchronous model, detailed in Section 2, which defines high-level, implementation-independent conditions guaranteeing correct desynchronization at the level of the abstract synchronous model. The synthesis problem is factored into a high-level, implementation-independent phase ensuring the weak endochrony of each synchronous module (not covered in this paper) and the actual wrapper synthesis phase, simplified by the high-level assumptions. We focus here on the synthesis of delay-insensitive asynchronous wrappers from weakly endochronous synchronous modules. The choice of delay-insensitive logic as implementation domain is determined by its excellent modularity properties, its ability to support concurrency (and thus more efficient implementations), and by the existence of state-of-the-art tools allowing the specification and synthesis of delay-insensitive circuits.
Our main contribution is the introduction of a clear formal framework that allows us to guarantee the correctness of GALS implementation models involving synchronous and asynchronous formalisms used in digital circuit design (synchronous Mealy machines and Petri Nets).
Previous work
The distributed, asynchronous, or GALS implementation of synchronous specifications is a subject that draws more and more attention. Although stated in a purely synchronous framework, the latency-insensitive design of Carloni et al. [2] has been a major source of inspiration in our work. There, the goal is to modify the modules of a synchronous system in such a way as to guarantee the preservation of the semantics when the implementation-level communication lines have arbitrary latencies. As in our case, producing a latency-insensitive implementation consists in synthesizing for each module a synchronous wrapper that controls input, output, and clock generation. The approach guarantees the correctness of the resulting system, but it is inefficient, as the wrappers effectively simulate a unique global clock which runs as slow as the slowest computation or communication in the system. This approach is only applicable to single-clock SoCs and cannot be extended to multi-clocked systems. Another disadvantage of this scheme is that the module waits for all the incoming data on its input channels before it generates its output on each output channel. The designer has no control over which inputs gate the clock. Therefore, a data item not required for a particular computation, or any output channel not ready to accept data, can unnecessarily stall the synchronous module by gating the clock. Hence, this kind of communication scheme is undesirable.
Several papers, like that of Singh and Theobald [16], extend Carloni's approach by allowing latency-insensitive circuits to support execution modes and concurrency, and thus allow multi-rate, on-demand execution. These approaches improve efficiency, in terms of power consumption or speed. However, they do not guarantee the correctness of the implementation w.r.t. the initial synchronous specification. High-level criteria covering the correctness aspects of such implementations have been defined by Potop, Caillaud, and Benveniste [14, 13], but the current paper gives the first hardware implementation of the new concepts.
Leaving the purely synchronous model, we first mention pausible clocking by Yun and Donohue [18]. The goal here is to ensure correct synchronization in the transmission of data between synchronous modules using asynchronous FIFOs. The approach focuses on the elimination of metastability problems for single FIFOs. It cannot deal with synchronization constraints involving several FIFOs, so that (1) the correctness of the implementation must be ensured by other means and (2) the synchronous module has a free-running clock, which increases power consumption.
On the asynchronous side, we mention the work of Cortadella et al. [1] on the fully asynchronous implementation of synchronous specifications. The problem with this approach is that the method is global and intrusive, leaving little room for modular design based on off-the-shelf IPs. Our approach directly aims at modular development, by relying on high-level conditions guaranteeing that a synchronous IP can be embedded in any environment.
More generally, we mention the large number of attempts to combine the advantages of synchrony and asynchrony. Among them, we mention the pioneering work of Seitz [15] on systems with several clock domains, and the thesis of Chapiro [3], which coined the term GALS. More recently, the burst-mode circuits of Yun and Dill [17] represent an interesting intermediate model, but their "fundamental" execution mode, which lacks concurrency, makes the definition of modular design methodologies difficult.
The remainder of the paper is organized as follows: Section 2 presents the formal framework supporting our approach. It defines the microstep model allowing us to represent both synchronous and asynchronous systems, and then introduces weak endochrony (and the related correct implementation results). Section 3 gives background on the modeling framework used and on the theory of regions. Section 4 presents the delay-insensitive architecture and explains how the new method can be applied to a very simple DLX-like processor. Section 5 sums up the steps in the synthesis process and elaborates on the translation of weakly endochronous finite state machines into Petri net models. Section 6 outlines the verification process of the Petri net model developed. Section 7 gives a short conclusion and the directions we currently follow to complete and extend our work.
Weakly endochronous synchronous systems
This section summarizes the results of Potop and Caillaud [13]. It first defines the microstep model that allows us to reason in a unified framework about synchronous and GALS systems. Then, it introduces weak endochrony and the related semantics preservation results. Small examples show how simple synchronous/GALS systems can be modelled.
Microstep transition systems
We start the presentation of the microstep formalism with a small example: a synchronous system with two input channels (b and d) and two output channels (a and c). Channels a and d carry no data, being used only for synchronization (alternatively, data is uninterpreted). The system emits a message on channel a and then waits for one message from either channel b or d (i.e. for whichever comes first). If signal b arrives with value 0, then the system waits for the next clock cycle, where it emits c with value 42 (then, it does nothing forever). If b is received with value 1 or d is received, then the system does nothing forever. The behavior of the system is not specified for b different from 0 or 1. The "clock variable" of the system, denoted with τ, functions as a separator between synchronous cycles.
In a more classical macrostep framework, like that of [14] , this system would be represented by:
However, this compact form hides both I/O and computation causality which are essential aspects of any asynchronous implementation. Hence the need for a new formalism that would bridge between the abstract, macro-step synchronous models and asynchronous formalisms. By analogy with the macrostep model, the initial state and the destinations of clock transitions will be called synchronizing states, and the sequences of microsteps ending with clock transitions shall be called reactions.
General definitions
We model every system, component, and communication line using finite state machines of the form $\Sigma = (S, \hat{s}, V, \bullet\!\rightarrow_{\Sigma})$, where $S$ is a finite set of states, $\hat{s} \in S$ is the initial state, $V$ is a finite set of variables, and $\bullet\!\rightarrow_{\Sigma}$ is the transition relation. The label $l$ of a transition $s \bullet\!\xrightarrow{\,l\,} s'$ is a partial valuation of the variables in $V$. Formally, if $D_v$ denotes the domain of a variable $v$, and if $\bot$ is a special symbol denoting the absence of a value, then the set of all possible labels over $V$ is $L_V = \prod_{v \in V} (D_v \cup \{\bot\})$. We denote with $\mathrm{supp}(l) = \{v \in V \mid l(v) \neq \bot\}$ the support of a label $l$, and we denote with $\bot_V$ the label of empty support. Our state machines are composed by classical synchronized product. If
Our systems and components communicate with each other and with the environment through directed FIFO channels, each channel being represented as a pair of directed variables. We emit a value on channel c by assigning the variable !c, and we receive a value by reading variable ?c. Note that the variables !c and ?c must have the same domain D c .
To represent synchronous and GALS systems, we shall represent the clock signals using special clock variables that carry no data (their domain contains a single value, the clock tick). In our small example, τ is such a clock variable.
We denote with $\mathrm{Traces}_{\Sigma}(s)$ the set of traces of the transition system $\Sigma$ starting in state $s$. Two traces $\varphi_i$, $i = 1, 2$, are asynchronously equivalent, denoted $\varphi_1 \sim \varphi_2$, if their projections $\varphi_i|_{\{?c,!c\}}$ on every communication channel $c$ coincide. If, for every channel $c$, $\varphi_1|_{\{?c,!c\}}$ is a prefix of $\varphi_2|_{\{?c,!c\}}$, then we write $\varphi_1 \leq \varphi_2$. We say that $\varphi_1$ and $\varphi_2$ are non-contradictory if, for every channel $c$, one of the two projections is a prefix of the other.
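The three trace relations above can be made concrete with a small Python sketch. It is an illustration only: it assumes a trace is encoded as a list of labels, each label being a dictionary from variable names (`"?c"`, `"!c"`, or a clock name such as `"tau"`) to values. The encoding and all function names are our own assumptions, not part of the formal framework.

```python
def project(trace, channel):
    """Projection of a trace on the variables {?c, !c} of one channel."""
    keep = {"?" + channel, "!" + channel}
    projected = []
    for label in trace:
        restricted = tuple(sorted((v, x) for v, x in label.items() if v in keep))
        if restricted:  # labels that do not touch the channel are dropped
            projected.append(restricted)
    return projected


def channels(*traces):
    """Channels mentioned in the traces (assumed encoding: '?c' and '!c')."""
    return {v[1:] for t in traces for label in t for v in label
            if v[:1] in ("?", "!")}


def is_prefix(p, q):
    return len(p) <= len(q) and q[:len(p)] == p


def async_equivalent(t1, t2):
    """Projections coincide on every channel."""
    return all(project(t1, c) == project(t2, c) for c in channels(t1, t2))


def async_leq(t1, t2):
    """On every channel, the projection of t1 is a prefix of that of t2."""
    return all(is_prefix(project(t1, c), project(t2, c))
               for c in channels(t1, t2))


def non_contradictory(t1, t2):
    """On every channel, one projection is a prefix of the other."""
    return all(is_prefix(project(t1, c), project(t2, c))
               or is_prefix(project(t2, c), project(t1, c))
               for c in channels(t1, t2))


# Two interleavings of the same per-channel histories.
t1 = [{"!a": 1}, {"?b": 0}, {"tau": None}]
t2 = [{"?b": 0}, {"tau": None}, {"!a": 1}]
print(async_equivalent(t1, t2))
```

Clock variables are ignored by the projections, which is why the two interleavings in the example are related even though their clock transitions occur at different positions.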
Microstep synchronous transition systems
To represent synchronous systems, we shall use finite state machines having exactly one clock variable (the system clock), having only directed and clock variables, and satisfying a number of axioms, which include the synchronous hypothesis. Formally, if τ is a clock variable and D is a set of directed variables, then the transition system Σ = (S,ŝ, V = D ∪ {τ }, •→ ), is a microstep synchronous transition system (µSTS) if it satisfies:
STS2 (synchrony hypothesis): two assignments of a same variable must be separated by a clock transition. More exactly, if
The first axiom identifies the clock transitions which separate the synchronous reactions/clock cycles. The clock transitions are the only ones where the clock variable is present. The second axiom is the actual synchronous hypothesis, which states that during a clock cycle (i.e. between two clock transitions) a communication variable can be assigned at most once.
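As a hedged illustration of the second axiom, the following Python sketch checks a single trace against the synchrony hypothesis. It assumes the same hypothetical encoding of traces as lists of dictionaries, and it treats any label containing the clock variable as a clock transition; the function name and the default clock name `"tau"` are ours.

```python
def satisfies_synchrony(trace, clock="tau"):
    """Check that no variable is assigned twice between two clock transitions."""
    assigned = set()  # variables assigned in the current reaction
    for label in trace:
        if clock in label:
            assigned.clear()  # a clock transition closes the reaction
            continue
        for var in label:
            if var in assigned:
                return False  # second assignment within the same clock cycle
            assigned.add(var)
    return True


ok = [{"!a": None}, {"?b": 0}, {"tau": None}, {"!a": None}]
bad = [{"!a": None}, {"!a": None}, {"tau": None}]
print(satisfies_synchrony(ok), satisfies_synchrony(bad))
```

The check is per trace; establishing the axiom for a whole µSTS would require examining every path between clock transitions, which this sketch does not attempt.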
In addition to these two fundamental axioms, we shall require that our systems satisfy 3 more conditions. The first two simply facilitate the definition of our formal framework:
Axiom STS3 facilitates the definition of the composition by synchronized product, by allowing different systems to evolve independently (do nothing while the others advance) when no synchronization is needed. We shall assume these void transitions present in all the examples of this paper, but we shall not graphically represent them. Axiom STS4 tells us that all transitions can be decomposed into atomic transitions assigning exactly one variable. In our framework, transitions assigning several variables are used to express in a static fashion the concurrency between the composing atoms. The last assumption is more important, as it constrains the class of representable systems to stuttering-invariant ones, meaning that between two synchronous reactions the system can spend any number of clock cycles doing nothing. This hypothesis departs from the classical synchronous model, but we see stuttering-invariance as a prerequisite for efficient multi-rate GALS deployment.
Note that the previous example is not stuttering-invariant. Here are two simple stuttering-invariant systems:
Synchronous and asynchronous composition
Both our modular synchronous systems and GALS implementations are built from microstep synchronous automata using two different composition mechanisms. In both cases, we simplify the model by only allowing point-to-point communication through lossless FIFOs. We use FIFO models, which are transition systems themselves, to represent communication through such synchronous and asynchronous channels.
To represent synchronous communication, we use 1-place synchronous FIFOs (which are µSTSs). The FIFO model associated with a channel c is:
where the transition relation is defined by: c 0
Asynchronous communication involves infinite asynchronous FIFO models (which are not µSTSs):
where the transition relation contains all the transitions of the form:
be composable µSTSs and let τ be a clock variable. Then, the synchronous composition of Σ 1 and Σ 2 over the base clock τ is:
where Σ[τ/τ ] represents the system Σ where the clock variable has been renamed from τ to τ , and C(V ) = {c |?c ∈ V ∨!c ∈ V } is the set of channels associated with a variable set V .
The synchronous composition of the µSTSs Σ 1 and Σ 2 over the base clock τ is a µSTS of clock τ . The result of the synchronous composition is unique up to a renaming of the clock variable (so that we can discard τ from the notation). Moreover, the operator | is associative and commutative (again, modulo clock renaming).
Note that (1) the local clocks are synchronized/renamed in the product over the new global clock and (2) the synchronizing states of | 
AF IF O(c)
The || operator is associative and commutative. The asynchronous composition of two µSTSs is not a µSTS (because it has two clock variables).
Example
Using the small µSTSs Σ 2 and Σ 3 , we illustrate our definitions and give the intuition behind our criteria for correct GALS deployment of synchronous specifications.
The result of the synchronous composition of Σ 2 and Σ 3 is:
Note that we simplified the notation by not representing the state of the two FIFOs SFIFO(a, τ) and SFIFO(b, τ) (the initial state $(s_0, t_0)$ having empty FIFOs, the status of the FIFOs is fully determined in each state). Also note that the composed system is blocked in state $(s_3, t_3)$ because SFIFO(b, τ) cannot take a clock transition (data has been written on it, but not read). The system $\Sigma_2 \mid \Sigma_3$ can therefore deadlock.
The asynchronous composition of Σ 2 and Σ 3 is:
It is essential to note that $\Sigma_2 \,\|\, \Sigma_3$ has traces, like $!a;\ ?a;\ \tau_2;\ !b;\ ?b$, that are not asynchronously equivalent to any of the synchronous traces of $\Sigma_2 \mid \Sigma_3$. Such traces are not covered by the verification done on the synchronous model, meaning that the GALS implementation does not preserve the semantics of the specification. It is also important to note that requiring a one-to-one correspondence between synchronous and asynchronous traces is not a good idea, because for large classes of systems it can be highly inefficient (exploiting the concurrency between different computations to allow the systems to evolve at different rates is a desirable feature because it minimizes communication and power consumption). Indeed, the right correctness criterion for desynchronization is the preservation of the asynchronous traces. Formally, the GALS implementation is correct if any of its traces (executions) can be extended with a finite number of transitions to a trace that is asynchronously equivalent to a synchronous trace. Unfortunately, this criterion is undecidable even for finite systems, but in the next section we shall give sufficient conditions which are decidable.
Weak endochrony
Microstep weak endochrony (or, simply, weak endochrony) is the property guaranteeing that a synchronous component (µSTS) reads its inputs in a fashion that remains predictable even in an asynchronous environment. Weak endochrony requires that every internal choice of the component is visible as a choice over the value (and not presence/absence status) of a directed variable (either input or output). Thus, the behavior of the system becomes predictable in any asynchronous environment, because choices can be determined or observed. With this requirement, the implementation space delimited by weak endochrony is nonetheless very large: Concurrent behaviors are not affected by the previous rule, so that independent system parts can evolve at different speeds. Weak endochrony does not require I/O determinism. Instead, a weakly endochronous component must inform the environment about nondeterministic decisions (the variable used to do so behaves like an oracle that is visible from outside).
Formally, we say that the µSTS Σ = (S,ŝ, V = D ∪ {τ }, •→ ) is weakly endochronous if it satisfies the following four axioms:
From now on, we shall denote with $s.\varphi$ the unique state of $\Sigma$ having the property $s \bullet\!\xrightarrow{\,\varphi\,} s.\varphi$, and the notation is extended to traces.
WE2 (independence):
In a given state, transitions with disjoint labels commute. Formally, if $l_1$ and $l_2$ are disjoint and if $l_1, l_2 \neq \tau$, then:
⇒ ∃s 3 :
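WE3 (clock properties): Non-contradictory reactions in a given state can be united to form a composed transition. Moreover, a strong confluence property holds. Formally, if $s_0 \bullet\!\xrightarrow{\,\tau\,} s_1$ and $\varphi \in \mathrm{Traces}_{\Sigma}(s_0)$ with $\tau \notin \mathrm{supp}(\varphi)$, then:
(ii) if $\varphi; \tau \in \mathrm{Traces}_{\Sigma}(s_0)$, then $\varphi; \tau \in \mathrm{Traces}_{\Sigma}(s_1)$ and $s_0.(\varphi; \tau) = s_1.(\varphi; \tau)$
(iii) if $\varphi; \psi; \tau \in \mathrm{Traces}_{\Sigma}(s_1)$, then there exists $\psi' \leq \psi$ such that $\varphi; \psi'; \tau \in \mathrm{Traces}_{\Sigma}(s_0)$
(iv) if $\varphi; \tau \in \mathrm{Traces}_{\Sigma}(s_0)$ and $\theta; \tau \in \mathrm{Traces}_{\Sigma}(s_0)$, and $\varphi$ and $\theta$ are non-contradictory, then there exists $\rho$ such that $\varphi; \rho \sim \theta$ and $\varphi; \rho; \tau \in \mathrm{Traces}_{\Sigma}(s_0)$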
WE4 (choice):
The same choices must be available on non-contradictory paths starting in a given state. Formally, if $\varphi_i;\ v = x_i \in \mathrm{Traces}_{\Sigma}(s)$ for $i = 1, 2$, and $\varphi_1$ and $\varphi_2$ are non-contradictory, then $\varphi_1;\ v = x_2 \in \mathrm{Traces}_{\Sigma}(s)$.
While their form may seem complex, the axioms of weak endochrony simply require confluence, both inside a reaction and at the level of general traces, in the case where no choice has been made over the value (not the presence/absence status) of a communication variable.
Weak endochrony covers a large class of systems, which is closed under synchronous composition (thus, incremental design is facilitated):

Theorem 2.3 (compositionality) Let $\Sigma_i$, $i = 1, \ldots, n$, be composable weakly endochronous µSTSs. Then $|_{i=1}^{n}\, \Sigma_i$ is weakly endochronous.
Correctness results
While example Σ 3 is weakly endochronous, the same is not true for Σ 2 . There, the choice between reading b and reading r in state s 1 is not visible from the exterior. If the environment provides both b and r, input reading is nondeterministic. On the other hand, if ?b and ?r were concurrent, then the system would be weakly endochronous:
Moreover, the GALS implementation model Σ 4 || Σ 3 preserves the semantics of Σ 4 | Σ 3 :
As expected, the asynchronous composition binds tighter than the synchronous one, but for any trace of Σ 4 || Σ 3 going from (s 0 , t 0 ) to (s 4 , t 4 ) we can find an asynchronously equivalent trace in Σ 4 | Σ 3 . Such a GALS implementation is obviously correct, because it does not introduce new behaviors. In fact, a stronger relation exists between weak endochrony and correct GALS implementation. The weak endochrony of the components and the global correctness of the synchronous specification (absence of deadlocks) imply that the GALS implementation is semantics-preserving (i.e. correct).
Theorem 2.4 (correctness) Let
This theorem gives the basis for the synthesis method proposed in the next section. Indeed, if the components of a deadlock-free synchronous specification are weakly endochronous, then the synthesis of the GALS wrappers can be done locally for each module, without knowledge about the global system. Then, the implementation can be derived by connecting the resulting modules with asynchronous FIFOs of arbitrary length.
Theory of Event Models and Regions
This section gives background on the Petri net model used in this paper. It also introduces the theory of regions.
Petri nets
A Petri net is a model used to represent systems with concurrency. It is a quadruple $N = (P, T, F, M_0)$, where $P$ is a set of places, $T$ is a set of transitions, $F \subseteq (P \times T) \cup (T \times P)$ is the flow relation, and $M_0$ is the initial marking. A transition is enabled when all its predecessor places have a token. An enabled transition can then fire, removing one token from each of its predecessor places and adding one token to each successor place. A labelled PN is a PN with a labelling function $\lambda : T \to A$ associating each transition of the net with a name. In the labelled Petri nets used here, places with a single input and a single output transition may be left implicit and drawn as arcs, and transitions are labelled with symbols from the alphabet, in particular signal transitions (a+, a−).
Important properties of a Petri net:
(i) Liveness: a net is live if every transition can fire infinitely often, from any reachable marking. Liveness ensures complete deadlock freedom.
(ii) Safeness: a net is safe if no marking reachable from $M_0$ assigns more than one token to any place.
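The firing rule and the safeness property can be illustrated by the following Python sketch. It is a minimal explicit-state exploration under our own assumptions: the net is given by hypothetical dictionaries `pre` and `post` mapping each transition to its predecessor and successor places, and the class and function names are ours. The deadlock check is weaker than liveness, and the exploration is only suitable for small bounded nets.

```python
from collections import deque


class PetriNet:
    def __init__(self, pre, post, initial):
        self.pre = pre              # pre[t]: set of predecessor places of t
        self.post = post            # post[t]: set of successor places of t
        self.initial = dict(initial)

    def enabled(self, marking, t):
        return all(marking.get(p, 0) >= 1 for p in self.pre[t])

    def fire(self, marking, t):
        if not self.enabled(marking, t):
            raise ValueError(f"transition {t!r} is not enabled")
        new = dict(marking)
        for p in self.pre[t]:
            new[p] -= 1                     # consume one token per input place
        for p in self.post[t]:
            new[p] = new.get(p, 0) + 1      # produce one token per output place
        return new


def _freeze(marking):
    # Canonical, hashable form of a marking (empty places are omitted).
    return frozenset((p, n) for p, n in marking.items() if n > 0)


def reachable_markings(net, limit=10_000):
    seen = {_freeze(net.initial)}
    queue = deque([net.initial])
    while queue:
        marking = queue.popleft()
        for t in net.pre:
            if net.enabled(marking, t):
                successor = net.fire(marking, t)
                key = _freeze(successor)
                if key not in seen:
                    if len(seen) >= limit:
                        raise RuntimeError("state space too large for this sketch")
                    seen.add(key)
                    queue.append(successor)
    return seen


def is_safe(net):
    return all(n <= 1 for marking in reachable_markings(net) for _, n in marking)


def is_deadlock_free(net):
    # Every reachable marking enables at least one transition.
    return all(any(net.enabled(dict(m), t) for t in net.pre)
               for m in reachable_markings(net))


# A two-place cycle modelling the alternation of clk+ and clk-.
clock = PetriNet(pre={"clk+": {"idle"}, "clk-": {"busy"}},
                 post={"clk+": {"busy"}, "clk-": {"idle"}},
                 initial={"idle": 1})
print(is_safe(clock), is_deadlock_free(clock))
```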
Theory of Regions
The theory of regions for elementary systems was developed by Nielsen et al. [12]. It was subsequently adapted by Cortadella et al. [4] to give a practically useful synthesis procedure for 1-safe nets, implemented in the tool Petrify. Subsets of states of a transition system (TS) that correspond to places of a Petri net are called regions. If r1 and r2 are regions of a TS such that r2 ⊂ r1, then r2 is a subregion of r1. A region is minimal if it contains no other region of the TS as a subregion. A region r is a pre-region of event a if the transitions labelled a exit r. A region r is a post-region of event a if the transitions labelled a enter r.
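To make the notion of region concrete, the following Python sketch tests the usual uniform-crossing condition: a set of states is a region if, for every event, all transitions carrying that event either all enter the set, all exit it, or none crosses its border. The transition system is encoded as a list of hypothetical `(source, event, destination)` triples, and the brute-force enumeration of minimal regions is exponential, so it is meant for toy examples only and is not the algorithm used by Petrify.

```python
from itertools import combinations


def crossing(region, src, dst):
    """How a single transition relates to the border of a state set."""
    if src in region and dst not in region:
        return "exit"
    if src not in region and dst in region:
        return "enter"
    return "none"


def is_region(region, transitions):
    """All transitions with the same event must cross the set the same way."""
    kinds = {}
    for src, event, dst in transitions:
        kinds.setdefault(event, set()).add(crossing(region, src, dst))
    return all(len(k) == 1 for k in kinds.values())


def minimal_regions(states, transitions):
    """Non-trivial regions that contain no smaller region (brute force)."""
    states = sorted(states)
    regions = [frozenset(c)
               for k in range(1, len(states))
               for c in combinations(states, k)
               if is_region(frozenset(c), transitions)]
    return [r for r in regions if not any(other < r for other in regions)]


# Toy TS (our own example): events a and b are concurrent, then c resets.
ts = [("s0", "a", "s1"), ("s0", "b", "s2"),
      ("s1", "b", "s3"), ("s2", "a", "s3"),
      ("s3", "c", "s0")]
for r in minimal_regions({"s0", "s1", "s2", "s3"}, ts):
    print(sorted(r))
```

In this toy system no single state is a region, because each event has one transition that crosses the singleton and one that does not; the minimal regions are the four two-state sets that separate "a not yet fired" from "a fired" and likewise for b.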
Proposed Latency Insensitive Architecture with Intermittent Clock Triggering
This approach is primarily based on an asynchronous handshake protocol. As shown in Fig.1, the locally synchronous system is encapsulated by an asynchronous wrapper. This asynchronous wrapper consists of communication channels and a clock generator. The communication channels consist of a set of input and output FIFOs (shown in Fig.4). We consider that each signal is transmitted from one synchronous island to the other using a dedicated FIFO. Therefore, we have as many FIFOs as there are signals in the system. When data is available in the input FIFOs, it is read by the synchronous module. The clock generator then triggers (clk+) the local clock for computation and generation of outputs. After the output has been generated, it is written to the output FIFOs. The clock is released (clk−) by the clock generator and the synchronous module is ready to read its next set of inputs. The clk+ and clk− transitions mark the start and end of a synchronous computation, respectively. This scheme has two advantages over the previously mentioned communication schemes.
(i) In contrast to the prevalent clock pausing schemes, we do not have a free-running clock. The clock is triggered when the data required for a particular computation has been read and is waiting to be processed. The clock is released after the completion of the computation. This leads to a significant reduction in power consumption.
(ii) In contrast to the prevalent clock gating schemes, the synchronous module is not unnecessarily stalled by the unavailability of an input not required for a particular computation. This leads to an increased efficiency.
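The wrapper cycle described above can be sketched behaviourally in Python. This is not a circuit model: the class `AsyncWrapper`, the channel names, and the `compute` callback are hypothetical, and the set of inputs required for the next computation is simply passed in by the caller, whereas in the actual design it is determined by the control automaton of the module.

```python
from collections import deque


class AsyncWrapper:
    def __init__(self, input_channels, output_channels, compute):
        self.inputs = {c: deque() for c in input_channels}    # input FIFOs
        self.outputs = {c: deque() for c in output_channels}  # output FIFOs
        self.compute = compute   # the synchronous module, one reaction per call
        self.log = []            # observed clock events

    def step(self, required):
        """Try to run one computation that needs the channels in `required`."""
        if any(not self.inputs[c] for c in required):
            return False         # a needed input is missing: clock stays paused
        values = {c: self.inputs[c].popleft() for c in required}
        self.log.append("clk+")  # clock triggered once the inputs are read
        results = self.compute(values)
        for channel, value in results.items():
            self.outputs[channel].append(value)
        self.log.append("clk-")  # clock released after the outputs are written
        return True


# Hypothetical module: forwards the instruction word, incremented by one.
wrapper = AsyncWrapper(["instr", "mem_in"], ["mem_out"],
                       lambda v: {"mem_out": v["instr"] + 1})
wrapper.inputs["instr"].append(41)
wrapper.step({"instr"})          # mem_in is empty but not required
print(list(wrapper.outputs["mem_out"]), wrapper.log)
```

The example shows the two points made in the list: no clock event is produced while a required input is missing, and an empty FIFO that is not needed for the current computation does not stall the module.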
DLX architecture
In this paper we de-synchronize a DLX-like datapath architecture to exemplify the proposed transition from weakly endochronous systems to latency-insensitive circuits. Here, we consider a simple unpipelined DLX-like architecture. Our approach can be directly extended to a pipelined DLX architecture. Fig.2 shows a simplified and abstract view of the overall architecture. The globally synchronous system is partitioned into five main synchronous islands: Instruction Fetch (IF), Instruction Decode (ID), Execution (EX), Memory (MEM) and Write Back (WB). These islands operate at different clock speeds. The vertical dotted lines separate different clock domains. The dashed lines group two synchronous islands, namely Instruction Decode (ID) and Memory (MEM). In this paper we concentrate on the ID block. This block receives instructions from the Instruction Fetch (IF) block and exchanges data with the MEM block. The instruction is decoded into one of the following types: Load (ILoad), Store (IStore), ALU or Move (IMov). In this paper we only deal with the IF-ID and ID-MEM interfaces. Hence, we ignore the last two instruction types. The transition system specification is shown in Fig.3.
Synthesis Methodology
The following steps sum up the synthesis process.
(i) Identify the modules in a synchronous system, which when partitioned from the main system, would enable high performance if their speed is increased independently.
(ii) Build weakly endochronous FSMs for each module. An example of a transition system for such a FSM is shown in Fig.3 .
(iii) Identify the transitions from the FSM, that will become actual transitions of the circuit. This can be done at the level of the control automaton by identifying, in each synchronizing state(destination of an "T"), shortest sequences of transitions that end with an "T". Divide these shortest sequences into greatest sequences where emissions take place after receptions. These greatest sequences will be the reactions of the actual circuit.
(iv) Modify the automaton by inserting "hardware clock transitions" and I/O signalling in the middle of the greatest sequences and by removing the "synchronous clock" transitions, as well as the transitions corresponding to concurrent execution of greatest sequences.
(v) Translate FSM to PN using Petrify [5] .
(vi) Extend the translated Petri net to handle intermittent clock transitions.
(vii) Choose an efficient asynchronous inter-domain communication scheme(e.g. Asynchronous FIFOs).
(viii) Implementation of the controls.
Step (i) is illustrated in Fig.2 and described in Section 4.1, and step (ii) has already been discussed in Section 2. In this section we illustrate steps (iii) to (viii). The Req signals in Fig.1 represent the incoming requests from another module in the system. In Fig.3, the signals ?ILoad and ?IStore of the synchronous automaton correspond to these Req signals. The T signals represent the clock transitions that lead to a synchronizing state, as discussed in Section 2.1. Similarly, the Ack signals in Fig.1 correspond to the Ack signals in Fig.3. The clock transitions T1 and T2 of the synchronous FSM, which lead to the initial state, are replaced by the above-mentioned asynchronous Ack1 and Ack2 handshake signals, which return the Petri net model to its initial state. Fig.3 shows the FSM of the ID module illustrated in Fig.2. S0 is the synchronizing state identified in step (iii), since it is the destination of T1 and T2. Therefore, the shortest sequence is the sequence of transitions:
After applying step (iii) we get a sequence
, that is the greatest sequence where the emission of !WF = 0 takes place after the reception of ?ILoad. The clock transitions are inserted between these transitions in step (iv). This is done in such a way that the clock is only triggered (clk+) when clken is asserted, after all the input signals required for a particular computation have been received. Our current approach assumes that the computation is completed in one clock cycle. The clock is released and the clken signal de-asserted after the emission of the output signals, and the clock is not triggered again until another input or set of inputs is read.
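One possible reading of steps (iii) and (iv) is sketched below in Python. We assume that a sequence is a list of label strings, that labels starting with `?` are receptions and all other labels are emissions, and that a "greatest sequence" is a maximal run of receptions followed by emissions. The function names and the second example sequence are hypothetical, and the clken handshake is omitted.

```python
def greatest_sequences(sequence, clock="T"):
    """Split a reaction into maximal chunks of receptions followed by emissions."""
    chunks, current, emitting = [], [], False
    for label in sequence:
        if label == clock:
            break                       # the synchronous clock transition ends the reaction
        is_reception = label.startswith("?")
        if is_reception and emitting:
            chunks.append(current)      # a reception after an emission opens a new chunk
            current, emitting = [], False
        if not is_reception:
            emitting = True
        current.append(label)
    if current:
        chunks.append(current)
    return chunks


def insert_clock(chunk):
    """Place the hardware clock events between receptions and emissions."""
    receptions = [l for l in chunk if l.startswith("?")]
    emissions = [l for l in chunk if not l.startswith("?")]
    return receptions + ["clk+"] + emissions + ["clk-"]


for chunk in greatest_sequences(["?ILoad", "!WF=0", "T"]):
    print(insert_clock(chunk))
```

For a longer hypothetical reaction such as `["?IStore", "?x", "!MData", "?y", "!z", "T"]`, the first function returns two chunks, each of which would become one reaction of the actual circuit with its own clk+ and clk- events.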
The following steps translate the transition system (TS) into a PN:
(i) For each event a in the TS, a corresponding transition labelled a is generated in the PN.
(ii) For each minimal region $r_i$, a place $p_i$ is generated.
(iii) Place $p_i$ contains a token in the initial marking $M_0$ iff the initial state of the TS belongs to the set of states $r_i$.
(iv) The flow relation is constructed as follows: $a \in p_i^{\bullet}$ iff $r_i$ is a preregion of a, and $a \in {}^{\bullet}p_i$ iff $r_i$ is a postregion of a.
The process defined above is fully automated by the tool Petrify, which takes a textual description of the synchronous automaton as its input. The results obtained are depicted in Fig. 3. This figure illustrates the translation of the ID module, which interfaces with the IF and MEM modules, from a weakly endochronous FSM to a PN. The set of states r1 = {S1, S3} is a region, since all transitions labelled !MData exit r1 and all transitions labelled ?IStore enter r1. Similarly, r3 is a region, since all transitions labelled !MData enter r3 and the transitions labelled T2 exit r3. In contrast, the set of states S = {S0, S1} is not a region: although T1 and T2 enter this set, the transition S1 → S7 labelled !MData exits it, whereas the transition S3 → S6 with the same label does not. The regions r1, r3 and r4 are minimal regions. Hence, region r1 can be mapped to place p1, r3 to p2, r4 to p3, and so on. The regions r1 and r3 are a preregion and a postregion of event !MData, respectively. Hence, p1 is the predecessor place and p2 is the successor place of the transition labelled !MData. Similarly, place p2 leads to transition Ack2, with T2 replaced by Ack2, as mentioned above.
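The region condition used above can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: `TOY_TS` is a hypothetical fragment, not the automaton of Fig. 3. Only the two !MData edges (S1 → S7, S3 → S6) are taken from the text; every other edge, the state S2 and the label `?x` are invented so that the example is self-contained. The function names are likewise assumptions and are unrelated to Petrify's internals.

```python
from collections import defaultdict

# Hypothetical TS fragment: (source state, event label, target state).
TOY_TS = [
    ("S0", "?IStore", "S1"),  # assumed edge
    ("S0", "?x", "S2"),       # assumed edge and label
    ("S2", "?IStore", "S3"),  # assumed edge
    ("S1", "!MData", "S7"),   # edge named in the text
    ("S3", "!MData", "S6"),   # edge named in the text
    ("S7", "T2", "S0"),       # assumed edge
    ("S6", "T2", "S0"),       # assumed edge
]

def crossing(src, dst, states):
    """Classify one transition relative to a candidate set of states."""
    if src in states and dst not in states:
        return "exit"
    if src not in states and dst in states:
        return "enter"
    return "none"

def is_region(states, transitions):
    """A set is a region iff, for every label, all transitions carrying that
    label cross the set in the same way (all enter, all exit, or none cross)."""
    kinds = defaultdict(set)
    for src, label, dst in transitions:
        kinds[label].add(crossing(src, dst, states))
    return all(len(k) == 1 for k in kinds.values())

def preregion_events(states, transitions):
    """Events for which the region is a preregion (their transitions exit it)."""
    return {lab for s, lab, d in transitions if crossing(s, d, states) == "exit"}

print(is_region({"S1", "S3"}, TOY_TS))  # r1 in the text
print(is_region({"S0", "S1"}, TOY_TS))  # the non-region discussed in the text
```

On this toy fragment, {S1, S3} passes the check while {S0, S1} fails it because the two !MData transitions cross the set differently, which mirrors the argument given in the text.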
For the sake of clarity, we have omitted the clock transitions from the synchronous FSM model. Step (vi) extends the Petri net obtained from Petrify to handle an intermittent clock. The clken transitions, shown in Fig. 3, control the actual circuit clock transitions. The theory of regions applied in step (v) cannot be directly extended to handle these transitions, because it does not identify the semantic significance of the clock transitions and treats them like any other input, output or internal signal. Hence the model had to be extended by hand to meet the semantic requirements of the clock in the locally synchronous modules. This is done by identifying the inputs required for a particular computation. When these inputs have been received on their respective channels, the clock enable signal clken is asserted, and the circuit clock is triggered on the assertion of clken. On completion of the computation, detected by the newly introduced completion detection signals CD1 and CD2, the clock enable is de-asserted, preventing further clock ticks. This extension is illustrated in the dotted box named "clock generation" in Fig. 3. For the sake of clarity, the circuit clock is not shown in the figure. This extension follows directly from step (iii) of the synthesis methodology: the weakly endochronous correctness criterion requires that the clock be triggered after all the inputs are read and released after all the outputs are emitted. The clock remains paused otherwise. We have used the modeling tool PEP [11] to extend the Petri net model obtained from Petrify. The dummy signals Dum1 and Dum2 are introduced at this stage to synchronize the inputs that trigger the clock enable (clken) signal.
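The clock-enable behaviour described above can be summarised by a small behavioural model. This is a minimal sketch, not the hand-built Petri net extension: the class `ClockGate`, its methods and the channel names `in_a` and `in_b` are hypothetical, a single `complete()` call stands in for the completion detection signals CD1 and CD2, and the one-tick-per-computation rule reflects the single-cycle assumption stated earlier.

```python
class ClockGate:
    """Behavioural sketch of a pausable clock controlled by clken."""

    def __init__(self, required_inputs):
        self.required = set(required_inputs)  # inputs needed for one computation
        self.received = set()                 # inputs seen since the last computation
        self.clken = False                    # clock enable, initially de-asserted
        self.ticks = 0                        # number of clk+ events produced

    def receive(self, channel):
        """Record an input; assert clken once every required input has arrived."""
        if channel not in self.required:
            raise ValueError(f"unexpected channel: {channel}")
        self.received.add(channel)
        if self.received == self.required and not self.clken:
            self.clken = True
            self.ticks += 1  # one clk+ per computation (single-cycle assumption)

    def complete(self):
        """Completion detected: de-assert clken and wait for fresh inputs."""
        self.clken = False
        self.received.clear()


gate = ClockGate(["in_a", "in_b"])
gate.receive("in_a")   # clock stays paused, one input is still missing
gate.receive("in_b")   # all inputs present: clken asserted, one clock tick
gate.complete()        # outputs emitted: clock paused again
```

The model keeps the clock paused whenever an input is missing, which is the property the weakly endochronous criterion imposes on the wrapper.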
Step (vii) elaborates on the choice of the inter domain communication scheme.
In our design we have chosen asynchronous FIFOs to connect two clocked domains working at different speeds. Several papers have presented different types of FIFOs, addressing clock skew handling, robust interfaces for mixed-timing systems, and the reduction of penalties for long interconnect. Any of these can be applied to our design, depending on the requirements of the system. In the model we use a straightforward standard FIFO, which covers the basic requirements of the system. The model and implementation of such a FIFO are shown in Fig. 4. The signal flags are communicated via dual-rail or otherwise encoded (e.g. 1-of-4) FIFOs. For other control signals, a single-rail FIFO is used. Fig. 4 (a) and (c) represent the PN models of the dual-rail and single-rail FIFOs, respectively, and Fig. 4 (b) and (d) represent the implementations of the respective FIFOs using C-elements.
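To make the role of the C-elements concrete, the sketch below models a Muller C-element and the control chain of a classic Muller pipeline, a common way to build a single-rail asynchronous FIFO. It is an assumption that this resembles the circuit of Fig. 4; the functions `c_element` and `settle` and the three-stage example are purely illustrative.

```python
def c_element(a, b, prev):
    """Muller C-element: the output follows the inputs when they agree,
    otherwise it holds its previous value."""
    return a if a == b else prev

def settle(stages, req, ack):
    """Fire enabled C-elements until the control chain is stable.

    Stage i is driven by its left neighbour (or req for the first stage)
    and by the inverted value of its right neighbour (or ack for the last).
    """
    stages = list(stages)
    changed = True
    while changed:
        changed = False
        for i in range(len(stages)):
            left = req if i == 0 else stages[i - 1]
            right = ack if i == len(stages) - 1 else stages[i + 1]
            new = c_element(left, 1 - right, stages[i])
            if new != stages[i]:
                stages[i] = new
                changed = True
    return stages

filled = settle([0, 0, 0], req=1, ack=0)   # a request ripples through the empty chain
waiting = settle(filled, req=0, ack=0)     # request withdrawn, receiver not yet ready
drained = settle(waiting, req=0, ack=1)    # receiver acknowledges
print(filled, waiting, drained)
```

In this model the last stage keeps its value until the receiving side acknowledges, which is the decoupling that lets two domains run at different speeds.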
In contrast to the basic latency-insensitive approach, which assumes a point-to-point network topology, our approach can be extended to any simple topology. These topologies include ring architectures [7], simple forks and joins [19], etc., thus increasing the efficiency of our approach.
Verification of the PN model
As mentioned in Section 5, the tool Petrify was used to translate the weakly endochronous transition system to a Petri net for logic synthesis. Since part of the model was developed by hand and glued to the model generated by Petrify, the final PN was verified to ensure that it satisfies the overall system specification. Two main properties, as defined in Section 3.1, were verified: safeness and liveness.
PEP was used to verify the safeness property of the original net. We used the in-house tools PUNF [9] and CLP [9] for reachability analysis and verification. PUNF was used to obtain a finite and complete prefix of the Petri net's unfolding. The output of PUNF was fed to CLP to further verify the functional properties of the net. The choice of verification tools was based on their expressiveness and analysis power. The Petri net satisfied the safeness and liveness properties. The net statistics (|s| and |t|) and unfolding statistics (|B| and |E|) are shown in Table 1. A Signal Transition Graph (STG) is obtained from the Petri net model. The STG specification is fed to Petrify for logic synthesis, leading to a circuit implementation.
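For small nets, the two properties can also be illustrated by explicit exploration of the reachable markings. The sketch below is not the unfolding-based method of PUNF and CLP: it enumerates markings directly, and it checks deadlock freedom, which is weaker than liveness. The net `CYCLE` is a hypothetical three-place handshake loop; the place name `p_init` is an assumption, while p1, p2, ?IStore, !MData and Ack2 follow the naming used in the text.

```python
from collections import deque

def enabled(marking, pre):
    """A transition is enabled when each of its input places holds a token."""
    return all(marking.get(p, 0) >= 1 for p in pre)

def fire(marking, pre, post):
    """Return the marking reached by firing a transition (zero entries dropped)."""
    m = dict(marking)
    for p in pre:
        m[p] -= 1
    for p in post:
        m[p] = m.get(p, 0) + 1
    return {p: n for p, n in m.items() if n}

def explore(net, m0, limit=10000):
    """Breadth-first search over reachable markings.

    net maps each transition name to (input places, output places).
    Returns (is 1-safe, is deadlock-free, number of markings visited).
    """
    m0 = {p: n for p, n in m0.items() if n}
    key = lambda m: frozenset(m.items())
    seen = {key(m0)}
    queue = deque([m0])
    safe = deadlock_free = True
    while queue:
        m = queue.popleft()
        if any(n > 1 for n in m.values()):
            safe = False
        successors = [fire(m, pre, post)
                      for pre, post in net.values() if enabled(m, pre)]
        if not successors:
            deadlock_free = False
        for s in successors:
            if key(s) not in seen:
                if len(seen) >= limit:
                    raise RuntimeError("state limit reached; net may be unbounded")
                seen.add(key(s))
                queue.append(s)
    return safe, deadlock_free, len(seen)

# Hypothetical handshake loop: request in, data out, acknowledge back to start.
CYCLE = {
    "?IStore": (["p_init"], ["p1"]),
    "!MData": (["p1"], ["p2"]),
    "Ack2": (["p2"], ["p_init"]),
}
print(explore(CYCLE, {"p_init": 1}))
```

Explicit enumeration grows quickly with concurrency, which is one reason prefix-based tools are preferred for larger models.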
Conclusion
This paper sets the guidelines for a new methodology for the synthesis of the delay-insensitive asynchronous wrappers needed for the correct-by-construction GALS implementation of a modular synchronous system.
The approach is based on the recent results of Potop and Caillaud [13], which define high-level, decidable criteria for the correct GALS implementation of modular synchronous specifications, namely the weak endochrony of the modules and the absence of deadlocks in the global synchronous specification. The synthesis problem is thus reduced to that of synthesizing the asynchronous wrappers for weakly endochronous synchronous modules. This problem can be solved on a local basis, without knowledge about the properties of the global system.
We used an example, a simple model of a DLX-like processor, to present the different phases of the proposed methodology intuitively and to give implementation hints for each of them.
Future work
A formally defined algorithm for the proposed synthesis methodology will be developed. This work will include the extension of the theory of regions to handle intermittent clock transitions. The extension will be incorporated in an automatic synthesis tool, such as Petrify, to enable the translation of weakly endochronous synchronous automata into synthesizable Petri net models.
We also intend to extend the underlying theory in order to simplify the generated logic by taking into consideration:
• closed-system assumptions, for instance in the form of sequential caresets;
• the fact that synchronous specifications are often meant to run in asynchronous environments, under specific input arrival hypotheses (e.g. one event per clock cycle).
