Multi-Clock Latency-Insensitive Architecture and Wrapper Synthesis  by Agiwal, Ankur & Singh, Montek
Multi-Clock Latency-Insensitive Architecture
and Wrapper Synthesis 1
Ankur Agiwal and Montek Singh2
Department of Computer Science
The University of North Carolina at Chapel Hill
Chapel Hill, North Carolina, USA
Abstract
This paper presents an architecture and a wrapper synthesis approach for the design of multi-clock
systems-on-chips. We build upon the initial work on multi-clock latency-insensitive systems by
Singh and Theobald [1], and provide a detailed system architecture with the following capabilities
and beneﬁts: (i) modules are stalled only when needed, thereby avoiding unnecessary stalling, (ii)
adequate metastability resolution is provided, (iii) handshake interfaces between modules are
high-performance and low-latency, i.e., capable of transferring data packets on every clock cycle, (iv) IP
cores with large clock distribution delays are correctly handled, and (v) an automated approach
is provided for wrapper synthesis from formal speciﬁcations. For wrapper synthesis, we chose the
Component Wrapper Language (CWL) from Hitachi/Fujitsu [2] as the speciﬁcation language. Our
synthesis approach has been implemented in a prototype tool. Synthesis results for a small set of
examples are provided.
Keywords: Globally asynchronous locally synchronous systems (GALS), latency-insensitive
systems, synchronizers, system-on-a-chip, clock buﬀer tree delay, Component Wrapper Language
(CWL).
1 Introduction
This paper presents a detailed architecture and a wrapper synthesis approach
for the design of multi-clock systems-on-chips (SoCs). It builds upon the ini-
tial work on multi-clock latency-insensitive systems by Singh and Theobald [1],
1 This work was supported by an IBM Faculty Development Award.
2 Email: ankur@cs.unc.edu and montek@cs.unc.edu
Electronic Notes in Theoretical Computer Science 146 (2006) 5–28
1571-0661 © 2006 Elsevier B.V. 
www.elsevier.com/locate/entcs
doi:10.1016/j.entcs.2005.05.033
Open access under CC BY-NC-ND license.
and provides a more detailed system architecture and an approach to auto-
matically synthesize wrapper circuits.
Latency-insensitive systems were introduced by Carloni et al. [3,4,5] as a
correct-by-construction approach for the design of single-clock SoCs. The key
idea is to separate communication from computation by encapsulating syn-
chronous IP modules within wrappers that render them insensitive to com-
munication latencies. This idea was extended and generalized in several ways
by [1], which proposed an initial multi-clock latency-insensitive system archi-
tecture, and suggested an automated synthesis method for automatic wrapper
speciﬁcation and implementation. This paper attempts to fulﬁll that objec-
tive by contributing a detailed system architecture and an automated wrapper
synthesis method.
The remainder of the paper is organized as follows. Section 2 presents
background and previous work. Section 3 provides an overview of our ap-
proach, and then Section 4 and Section 5 present the system architecture and
the synthesis approach in detail. Results of our wrapper synthesis tool are
provided in Section 6, and ﬁnally, Section 7 gives conclusions.
2 Background and Related Work
As microelectronic chips become faster and more complex, it is becoming in-
creasingly challenging to distribute a single synchronous master clock to an
entire chip while keeping the skew and jitter manageable. In addition, the
increasing amount of integration made possible by Moore’s law is leading to
complex systems-on-chips (SoCs) which often require coordination and com-
munication between multiple distinct clock domains, thereby compounding
the challenge. To ameliorate this challenge, the globally-asynchronous
locally-synchronous (GALS) design paradigm was proposed by Chapiro [6].
The GALS paradigm eliminates the requirement of a global clock by al-
lowing the system to be composed of several synchronous intellectual property
(IP) modules communicating asynchronously. Each module is allowed to run
on its own local clock, while data exchange between any two modules follows
a protocol that typically allows ﬂow control and metastability resolution.
One approach to GALS is to employ clock pausing in which the inactive
phase of a clock generator is stretched whenever its associated module must
be paused [7,8]. When a module has a value to send to another module, it
generates a request along with the output data, and stretches its clock until
it receives an acknowledgement. Likewise, when a module is ready to receive
a value, it stretches its clock and waits for the request.
A significant disadvantage of the pausible clocking approach is the use of a ring
oscillator, as opposed to a crystal oscillator. The ring oscillator suffers from
a significant amount of jitter caused by the repeated stopping and restarting of
the oscillator. In addition, the clock frequency of the ring oscillator is prone
to variation due to changes in operating temperature and voltage [9]. As a
result, the performance of synchronous blocks may be severely degraded, since
stable low-jitter clocking is the key to modern high-performance synchronous
design.
An alternative to clock pausing is to employ clock gating to stall syn-
chronous modules. The modules are typically enclosed within “wrapper cir-
cuits,” which enable the modules to operate correctly even in the presence of
arbitrary delays on their input and output channels. Such systems are called
latency-insensitive systems, for which a comprehensive theory was introduced
by Carloni et al. [3,4,5]. Their approach is a correct-by-construction method-
ology for single-clock latency-insensitive SoC design. The key idea of this
approach is to use clock gating to stall the module whenever any of its input
or output channels is unavailable, thereby making the module's operation tolerant
to arbitrary communication delays. By encapsulating a synchronous mod-
ule inside a specially-designed wrapper circuit, the computation performed by
the synchronous module is eﬀectively decoupled from inter-module commu-
nications. As a result of this encapsulation, the synchronous blocks become
more modular, thereby facilitating design reuse.
Recently, Singh and Theobald [10,1] have proposed several extensions of
the basic latency-insensitive approach to make it more useful and practical
for the design of large-scale SoCs. The first extension allows each synchronous
module to treat its input and output channels in a more ﬂexible manner. In
particular, by proposing a more sophisticated wrapper design, their approach
allows the reading and writing of only those channels that are actually needed,
thereby eliminating unnecessary stalls caused by unavailability of channels
that are not needed in the next operation. The second extension generalizes
inter-module communication from point-to-point channels to more complex
networks of arbitrary topologies. Finally, the third extension allows handling
of multiple clock domains.
This paper further contributes to the theory of latency-insensitive design
by building upon [10,1] to provide a comprehensive architecture and a formal
wrapper synthesis method.
3 Overview of New Approach
This section provides an overview of the challenges that this paper focuses on,
and then highlights the key features of our approach.
3.1 Challenges Addressed
The challenges involved in the design of multi-clock latency-insensitive systems
can be classiﬁed into two categories: (i) challenges for making pre-designed
synchronous modules latency-insensitive, and (ii) challenges for connecting
diﬀerent synchronous modules together. The former is handled in our ap-
proach through use of sophisticated wrapper circuits, whereas the latter is
handled through a generalized communication network that provides not only
connectivity and buﬀering but also ﬂow control and metastability resolution.
These challenges are now examined in more detail.
3.1.1 Latency Desensitization
The task of determining when to stall a synchronous module is not a trivial
one. In general, knowledge of the interface behavior of the module, and the
module’s internal state, may be required to determine most eﬃciently when
the module should be stalled.
Carloni et al. make a safe approximation: stall the module when any of its
input or output channels is unavailable for the next clock tick. This approach
is quite conservative: it assumes that every input and every output channel
is required at every clock tick, and may cause unnecessary stalling. However,
the beneﬁt is simple stall logic in the wrapper: a single AND gate is used to
combine “ready” signals from all of the channels.
Singh and Theobald [10,1] extend Carloni et al.’s approach by allowing the
stall logic to incorporate state information, i.e., generalize the stall logic to
be a full ﬁnite-state-machine instead of a single AND gate. Making the stall
logic stateful has the beneﬁt of reduced need for stalling: the synchronous
module is allowed to operate when the unavailable channels are determined
to be actually unnecessary for the next operation.
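The difference between the two stall policies can be illustrated with a minimal Python sketch. The function names, the channel names, and the sets of "needed" channels are hypothetical and serve only to illustrate the idea; they are not taken from either cited design.

```python
def stall_conservative(valid, ready):
    """Conservative policy: stall unless every channel is available (a single AND)."""
    return not (all(valid.values()) and all(ready.values()))


def stall_state_aware(valid, ready, needed_inputs, needed_outputs):
    """State-aware policy: stall only if a channel needed next is unavailable.

    needed_inputs / needed_outputs are hypothetical sets that a stall FSM
    would derive from its current state.
    """
    inputs_ok = all(valid[ch] for ch in needed_inputs)
    outputs_ok = all(ready[ch] for ch in needed_outputs)
    return not (inputs_ok and outputs_ok)


# Example: input "b" is unavailable, but the next operation only reads "a".
valid = {"a": True, "b": False}
ready = {"x": True}
conservative = stall_conservative(valid, ready)
state_aware = stall_state_aware(valid, ready, {"a"}, {"x"})
```

In this example the conservative policy stalls the module, whereas the state-aware policy lets it proceed because channel "b" is not needed.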
This paper addresses the following challenges:
(i) Automatically generate the FSM speciﬁcation and implementation for
the wrapper stall logic from a high-level interface speciﬁcation
(ii) Handle deep clock distribution networks with latencies of multiple clock
periods
3.1.2 Robust Inter-Module Communication
Gluing together different synchronous modules also poses several challenges:
buffering, flow control, and metastability resolution. Buffering is required
to introduce elasticity in the system, which enables a producer module
to generate data even when a consumer module is momentarily stalled. In
order to ensure that no data gets lost or overrun, adequate flow control is required.
Flow control is typically introduced into the communication network
through use of handshake signals (e.g., “request” and “acknowledge” signals
or, alternatively, “stall” signals). Finally, if the system includes two or more
distinct clock domains, metastability can result whenever a signal crosses over
from one clock domain to another. Although metastability cannot be completely
eliminated, adequate circuitry must be added to ensure its resolution.
This paper provides a communication architecture that addresses the above
challenges.
Fig. 1. Encapsulation of synchronous modules using wrapper circuits: (a) a single wrapped module,
and (b) a wrapped module communicating with other modules
3.2 Overview of Solution
Figures 1 and 2 give an overview of our encapsulation strategy using wrapper
circuits. Assume an unwrapped synchronous IP module has several data input
channels, in-data[i], and several data output channels, out-data[j]. As shown
in Figure 1(a), encapsulation using a wrapper augments the module's interface
with extra control signals that facilitate ﬂow control: in-req[i], in-ack[i], out-
req[j] and out-ack[j].
These request and acknowledge signals augment the data inputs and out-
puts, to eﬀectively implement a bundled datapath [11]. Note, however, that
the synchronous IP module itself does not see the request and acknowledge
signals. The bundling of these control signals with the datapath is only visible
to the module's wrapper, which makes use of these signals to implement flow
control.
Figure 1(b) shows how a wrapped module can be connected to other mod-
ules. Module A is another wrapped synchronous module directly connected to
the wrapped module in the center. Modules B and C are connected through
a join block to the center module, which in turn is connected through a fork
block to modules D and E. Finally, module F is an asynchronous module con-
nected to the center module. The ﬁgure shows a scenario where additional
buﬀering using FIFOs is not needed on the interconnects. However, in general,
asynchronous FIFOs can be placed on the interconnects to provide additional
buﬀering capability, or to segment long wires with unacceptably long laten-
cies. These asynchronous FIFOs are analogous to the relay stations of Carloni
et al.
In this paper, our particular wrapper instantiation assumes that the ﬂow
control signals, i.e., requests and acknowledges, are 2-phase or “transition”
signals. That is, every up or down transition on these signals indicates an
event, and there is no return-to-zero phase. This specific choice is made for
illustration purposes only. Our approach itself is quite general, and the actual
circuits presented can be easily modified to accommodate other signaling
conventions as well (e.g., four-phase).
Fig. 2. Detailed architecture of a module wrapper
Figure 2 shows a wrapped module in detail. There are several key features.
First, the clock signal provided for the synchronous module, clock, is gated
using an enable signal (ck_enable) generated by the stall logic. The gated
clock signal is labeled gclock. The stall logic itself is implemented as a ﬁnite-
state-machine. To account for clock distribution delays of deep clock trees,
another layer of buﬀering (labeled “clock buﬀer tree delay resolver”) is used for
buﬀering the input and output channels of the synchronous module. Incoming
request and acknowledge signals (in-req[i] and out-ack[j]) are synchronized to
the module’s clock using synchronizers which provide metastability resolution.
Finally, blocks labeled “input handshake interface” handle incoming requests
and acknowledges, converting them from 2-phase signals to level signals, and
blocks labeled “output handshake interface” convert level signals back to 2-
phase acknowledge and request outputs.
A key beneﬁcial feature of our approach is the low latency of the wrapper
circuits. In particular, under steady-state operation, new data values can be
read from the input channels, and new values can be generated for the output
channels, on every clock cycle. This eﬃciency is made possible because the
implementation has a latency from the request to acknowledge (i.e., from
in-req[i] to in-ack[i]) as low as a half clock cycle.
Our approach provides for automated synthesis of the stall logic for the
wrapper circuits from high-level speciﬁcations. In particular, our approach as-
sumes that the module’s interface speciﬁcation is available in the Component
Wrapper Language (CWL) [2], and automatically generates FSM speciﬁca-
tions for the wrapper stall logic. Since the stall logic is synchronous, any
conventional synchronous FSM synthesis tool can be used to produce the
gate-level circuit. Our approach uses the popular SIS synthesis tool.
4 New Approach: Architecture
This section presents the new approach in detail. Section 4.1 discusses the
wrapper design in detail. Next, Section 4.2 presents the communication strat-
egy, including ﬂow control and the buﬀering of channels required to handle
clock distribution delays. Finally, Section 4.3 discusses how multiple clock
domains are handled by our approach.
4.1 Stallable modules
Referring to Figure 2, this subsection discusses some of the details of the
implementation and operation of the wrapper circuits.
4.1.1 Synchronizers and Handshake Interfaces
The synchronizers take the incoming asynchronous request and acknowledge
signals, and synchronize them to the module’s clock. Several eﬃcient synchro-
nizer implementations are available in literature. If the sender’s and receiver’s
clocks are derived from the same crystal oscillator and are rationally related,
then it is possible to use synchronizer implementations that have a zero prob-
ability of failure. In general though, if the sender’s and receiver’s clocks are
derived from distinct crystal oscillators, then it is fundamentally impossible to
guarantee freedom from failures, but through careful design, acceptably long
mean times before failure (MTBF) can be achieved.
Fig. 3. Timing Diagram showing latency between in-req and in-ack (the latency varies between 0.5T and 1.5T)
Let us focus on the special case of the sender's and receiver's clocks being
unrelated. Our approach, in this case, is to use the synchronizer implementation
shown at the top of Figure 3. This implementation is referred to in
the literature as the “single flop synchronizer” [12]. This synchronizer implementation
is best used at low to medium frequencies (e.g., up to 300MHz for a
0.13µm process; see Section 4.1.4 for a discussion of MTBF). At higher frequencies,
the single flop synchronizer may not have sufficient time to resolve
metastabilities and, therefore, for adequate reliability, additional D ﬂipﬂops
must be inserted in series, i.e., making it a two-ﬂop (or three-ﬂop, etc.) syn-
chronizer. These additional ﬂipﬂops, of course, will imply higher input-output
cycle times, and therefore reduced input-output throughput, but that is a nec-
essary consequence of the demand for reliability. Timing issues are discussed
further in Section 4.1.4.
The synchronizer and the input handshake interface work together to not
only synchronize the incoming request or acknowledge signal, but also perform
phase conversion: the incoming transition signal is converted to a level signal.
The ﬁrst D ﬂipﬂop in picture at the top of Figure 3 (to the left) performs the
synchronization, while the second D ﬂipﬂop is part of the handshake interface
that performs the phase conversion. In particular, the second D ﬂipﬂop holds
the value of in-req[i] from the previous clock cycle, and then the XOR gate
is used to detect whether in-req[i] has toggled in the current clock cycle. If
in-req[i] has toggled, then valid[i] is asserted, indicating arrival of new inputs
at that channel; else, valid[i] is deasserted. Similarly, a toggle on the out-ack[j]
input indicates that the output channel is ready to receive data, which, after
synchronization, causes ready[j] to be asserted; else, ready[j] is deasserted.
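The transition-to-level conversion described above can be sketched as a small behavioral model in Python. This is only an illustration under simplifying assumptions (ideal flip-flops, no metastability); the class and method names are hypothetical.

```python
class InputHandshake:
    """Behavioral model: a synchronizing flop, a previous-value flop, and an XOR."""

    def __init__(self):
        self.req_sync = 0  # output of the synchronizing D flip-flop
        self.prev = 0      # output of the second D flip-flop (previous req_sync)

    def rising_edge(self, in_req):
        """Clock both flops and return valid for the following cycle."""
        self.prev = self.req_sync
        self.req_sync = in_req
        return self.req_sync ^ self.prev  # 1 only if in_req toggled


iface = InputHandshake()
# in_req toggles twice (0 -> 1, then 1 -> 0), i.e. two data items arrive.
valid_trace = [iface.rising_edge(r) for r in (0, 1, 1, 0, 0)]
```

Each toggle of the request produces a one-cycle pulse on valid, regardless of the direction of the transition.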
Similar to the input handshake interface, an output handshake interface
is used to once again perform phase conversion: the level signals produced
by the FSM, ack_enable[i] and req_enable[j], are latched and converted back
to transition signals in-ack[i] and out-req[j], respectively. This conversion is
achieved by using a negative-edge-triggered toggle flip-flop (labeled “T”). As
a result, an acknowledge or request signal is toggled only if the FSM asserts
the corresponding enable signal.
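As a minimal sketch of this level-to-transition conversion, the toggle flip-flop can be modeled as follows in Python. The class name and the example enable sequence are hypothetical.

```python
class ToggleFlop:
    """Negative-edge-triggered T flip-flop: toggles its output when enabled."""

    def __init__(self):
        self.q = 0

    def falling_edge(self, enable):
        if enable:
            self.q ^= 1
        return self.q


ack = ToggleFlop()
# The level enable is asserted in cycles 0, 2 and 3, so in-ack toggles three times.
ack_trace = [ack.falling_edge(e) for e in (1, 0, 1, 1)]
```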
4.1.2 Stall Logic FSM
The stall logic state machine reads the status of all the communication chan-
nels, and determines if the module should be stalled on the next clock cycle. In
particular, the validity of input channels and the readiness of output channels
are verified, and if, given the current state, it is determined that the module's
operation cannot proceed for the next clock cycle, then the module is stalled
for the next cycle.
As mentioned above, our approach to creating stall logic is quite diﬀerent
from that of Carloni et al. [4]. Unlike [4], our approach makes use of available
information about the synchronous module’s functionality in order to gener-
ate more optimal stall logic. In particular, while the wrappers of [4] stall the
synchronous module whenever any input or any output channel is unavailable,
our approach stalls the module only if an unavailable channel is actually
required for the next clock cycle. Thus, some knowledge of the module’s be-
havior is required in our approach. In particular, our approach assumes that
the module’s interface behavior is given as a formal speciﬁcation. Note that
a formal specification of the module's internal behavior is not required;
an interface specification, or a safe approximation thereof, will suffice. As a
special case, if no interface specification is available, our approach reduces
to that of [4].
Our stall logic FSM is implemented as a classic Mealy machine which pro-
duces several outputs: (i) the signal g, which determines whether or not to
gate the module's clock, (ii) a set of signals ack_enable[i] which selectively
acknowledge those input channels that will be read, and (iii) a set of signals
req_enable[j] which selectively request those output channels that will be written.
As in most common Mealy machine implementations, these outputs are
produced by combinational logic, and are a function of both the present state
of the machine and the inputs to the machine. These outputs are not latched
inside the FSM, though they are latched outside as needed. The next-state
output is latched inside the machine using ﬂipﬂops.
The FSM output that determines the clock gating function, signal g, is
latched on the negative clock edge using a transparent latch before it is used
to gate the module’s clock. This latching action is needed to ensure that
the gated clock, gclock, is free of glitches. The latching is performed on the
negative clock edge (i.e., a half clock period before), so that the gating signal
itself is available in time for the next rising clock edge.
4.1.3 Optimization: Elimination of “Busy Waiting” in FSM
An optimization is proposed that can signiﬁcantly reduce the complexity of
the stall logic FSM. In particular, the FSM is controlled by the gated clock,
gclock, not by the original clock, clock. This has the eﬀect of stalling the state
machine whenever the associated module is stalled. As a result, “busy waiting”
is eliminated from the state machine, thereby simplifying its speciﬁcation and
implementation.
Our choice of a Mealy machine architecture for the stall logic FSM ensures
that there will not be any deadlocks due to this optimization. In particu-
lar, when the clock is stalled, the FSM’s outputs will still respond to input
changes, even though the state bits of the machine are frozen. When the in-
puts to the machine are such that the module’s clock can be restarted, the
FSM’s g output will be asserted, which will in turn restart the module’s clock,
and consequently unstall the FSM as well. Thus, this optimization correctly
eliminates busy waiting without introducing any deadlocks.
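The deadlock-freedom argument can be illustrated with a toy Mealy machine in Python. The machine below (which alternately reads input channel 0 and input channel 1) and all of its names are hypothetical; the sketch only shows that the combinational outputs keep responding to inputs while the state is frozen.

```python
class StallFSM:
    """Toy Mealy machine: state 0 reads input channel 0, state 1 reads channel 1."""

    def __init__(self):
        self.state = 0

    def outputs(self, valid):
        """Combinational outputs: a function of present state and current inputs."""
        g = bool(valid[self.state])  # enable the clock only if the needed channel is valid
        ack_enable = [g and i == self.state for i in range(2)]
        return g, ack_enable

    def gclock_edge(self, valid):
        """State registers are clocked by gclock, so they advance only when g is asserted."""
        g, _ = self.outputs(valid)
        if g:
            self.state = 1 - self.state
        return g


fsm = StallFSM()
g_stalled = fsm.gclock_edge([0, 1])   # channel 0 not valid: clock gated, state frozen
state_while_stalled = fsm.state
g_now, ack_now = fsm.outputs([1, 1])  # outputs respond as soon as channel 0 becomes valid
fsm.gclock_edge([1, 1])               # clock restarts and the state advances
```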
The motivation for this optimization is that, as seen for several examples
(e.g., see Results, Section 6), the stall logic FSM speciﬁcation is typically quite
sparse. That is, there are only a small number of total states in which the
FSM enables the module’s clock; the remaining total states cause the clock to
be disabled. For those total states where the clock is disabled and the module
is stalled, the FSM itself has a “busy wait” loop, i.e., its internal state does
not change. Thus, whenever the synchronous module is stalled, the FSM has
a speciﬁed next state that is identical to the current state.
This optimization simplifies the stall logic by introducing “don’t-cares”
into the FSM specification. In particular, by controlling the stall logic FSM
with gclock instead of the original clock, the flip-flops inside the FSM which
store the state are disabled whenever the module is stalled. As a result, the
next-state function now becomes “don’t-care” for those states where the module
is stalled. Similarly, as shown in Figure 2, the toggle flip-flops as well as the
input data latch are all now controlled by gclock. This makes ack_enable[i] and
req_enable[j] also “don’t-care.”
Fig. 4. Timing Constraint
The net impact is that the FSM speciﬁcation becomes signiﬁcantly less
complex to specify, and upon synthesis, will typically require fewer gates. An
example in Section 5 illustrates this optimization, and experimental results
conﬁrm its beneﬁt.
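The effect of this optimization on an FSM specification can be sketched in Python. The transition table below is a hypothetical two-state example, not one of the paper's benchmarks; `None` stands for a don't-care entry.

```python
def apply_gclock_optimization(table):
    """Replace next-state and enable outputs by don't-cares (None) wherever g is 0."""
    optimized = {}
    for key, (g, next_state, ack_enable) in table.items():
        if g:
            optimized[key] = (g, next_state, ack_enable)
        else:
            # State flops and toggle flops are not clocked while the module is stalled.
            optimized[key] = (g, None, None)
    return optimized


# (state, valid) -> (g, next_state, ack_enable); stalled rows are busy-wait self-loops.
table = {
    ("S0", 0): (0, "S0", 0),
    ("S0", 1): (1, "S1", 1),
    ("S1", 0): (0, "S1", 0),
    ("S1", 1): (1, "S0", 1),
}
optimized = apply_gclock_optimization(table)
dont_cares = sum(1 for (_, ns, _) in optimized.values() if ns is None)
```

A logic synthesis tool is free to choose any value for the don't-care entries, which is what allows the smaller implementation.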
4.1.4 Timing and Mean-Time-Before-Failure (MTBF) Analysis
Figure 3 is the timing diagram for the wrapper circuit. It shows four scenarios,
diﬀering in the arrival times of in-req relative to the next rising clock edge.
At the bottom of the ﬁgure, the latencies from the arrival of in-req to the
generation of its corresponding acknowledgment, in-ack, are shown in units
of the clock period, T. All four of these scenarios assume the single-flop
synchronizer implementation shown at the top of that ﬁgure. (As mentioned
earlier, for clock frequencies higher than a certain limit for the particular
process used, additional ﬂipﬂops may be necessary in the synchronizer to
ensure reliability, which will degrade the latency.)
If a new in-req arrives just before the rising clock edge (i.e., a setup time
before), then the in-ack is produced with a half clock cycle latency; this
scenario is shown rightmost in the ﬁgure. If in-req arrives well before the rising
clock edge, then it must wait until the rising edge before it is latched; in-ack is
produced a half clock cycle later, thereby making the request-to-acknowledge
latency somewhat greater than a half clock cycle. This corresponds to the
ﬁrst two scenarios in the ﬁgure. Finally, the third scenario corresponds to
in-req’s arrival just after the rising clock edge. In this scenario, the
request-to-acknowledge latency is the longest (1.5 clock cycles) because in-req must
wait for an entire clock cycle before it is latched.
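The scenarios above follow from a simple relation: the request waits for the next rising clock edge and is then acknowledged a half cycle later. A minimal Python sketch of this idealized model is given below; the function name is hypothetical and setup time is ignored.

```python
def req_to_ack_latency(arrival_offset, period=1.0):
    """Latency from in-req to in-ack for the single-flop synchronizer.

    arrival_offset is the time of in-req after the most recent rising clock
    edge, with 0 < arrival_offset < period (idealized: setup time ignored).
    """
    if not 0 < arrival_offset < period:
        raise ValueError("arrival_offset must lie strictly inside one clock period")
    wait_for_rising_edge = period - arrival_offset
    return wait_for_rising_edge + period / 2  # in-ack toggles on the following falling edge


# Arrival shortly after, and shortly before, a rising edge (in units of T).
worst = req_to_ack_latency(0.001)
best = req_to_ack_latency(0.999)
```

Under this model the latency approaches 1.5T for arrivals just after a rising edge and 0.5T for arrivals just before one.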
For maximum I/O throughput, the sender must be capable of responding
to an acknowledgment with a new data item within a half clock cycle. If
so, the complete cycle time for reading one data item from an input channel
is no more than a clock period. Similarly, output channels have maximum
throughput when an out-req is acknowledged by an out-ack in no more than
a half clock cycle. Under these conditions, a wrapped module is capable of
handling I/O throughputs that equal its clock rate.
Mean-Time-Before-Failure (MTBF) Analysis
We now present an analysis of the MTBF of our single-ﬂop synchronizer,
and use it to derive the range of frequencies over which its operation will be
reliable. Our analysis follows the reasoning of [12].
To maximize I/O throughput, we limited the request-to-acknowledge latency
to at most a half clock cycle. This time should not only account for the
latencies through the synchronizer ($t_{DFF}$), input handshake interface ($t_{XOR}$),
and the FSM ($t_{FSM}$), but also allow adequate time for metastability resolution
($t_{meta}$), plus a setup time for the output handshake interface ($t_{TFF-setup}$).
Therefore, the following constraint must hold:
$$t_{DFF} + t_{XOR} + t_{FSM} + t_{TFF-setup} + t_{meta} < T/2 \qquad (1)$$
The ﬁrst four terms in the above equation add up to approximately 10 gate
(FO4 inverter) delays. Recall that the FSM latency is simply the latency of
the combinational logic that generates the output (i.e., 2–4 gate delays); there
is no latch on the path from FSM inputs to outputs. From [12], approximately
40 gate delays are suﬃcient for adequate metastability resolution, i.e., for a
synchronizer MTBF of 10,000 years. Since the constraint then requires T/2 to
exceed roughly 10 + 40 = 50 gate delays, our single-flop synchronizer
will work reliably for clock periods of 100 gate delays or more.
As an example, for a 0.13µm CMOS process, an FO4 delay is approximately
30ps. Our single-flop synchronizer, in this case, works reliably up to a
333MHz clock frequency. At higher frequencies, the time available to resolve
metastability shrinks, thereby requiring modiﬁcation of the synchronizer by
adding extra D ﬂipﬂops (i.e., making it a two-ﬂop, three-ﬂop etc. synchro-
nizer). These extra D ﬂipﬂops imply longer request-acknowledge latencies
and therefore an I/O throughput lower than the clock rate. For instance, a
two-flop synchronizer allows T to be as low as 40 gate delays ($t_{meta} = T$), and
therefore can work reliably at clock frequencies up to 833MHz. However, since
the I/O cycle time is now increased to two clock cycles, the maximum I/O
throughput achieved is now half of the clock frequency.
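The frequency limits quoted above follow directly from the gate-delay budget. The arithmetic can be checked with a short Python sketch; the constant and function names are hypothetical, and the numerical values are the approximate ones given in the text.

```python
FO4_DELAY_PS = 30.0      # approximate FO4 delay quoted for a 0.13 um process
LOGIC_GATE_DELAYS = 10   # t_DFF + t_XOR + t_FSM + t_TFF-setup, in gate delays
META_GATE_DELAYS = 40    # resolution time for a 10,000-year MTBF, from [12]


def max_frequency_mhz(period_gate_delays, fo4_ps=FO4_DELAY_PS):
    """Convert a clock period given in gate delays into a frequency in MHz."""
    period_ps = period_gate_delays * fo4_ps
    return 1e6 / period_ps


# Single flop: logic plus resolution time must fit within half a clock period.
single_flop_period = 2 * (LOGIC_GATE_DELAYS + META_GATE_DELAYS)
single_flop_mhz = max_frequency_mhz(single_flop_period)

# Two flops: the text takes T = t_meta = 40 gate delays.
two_flop_mhz = max_frequency_mhz(META_GATE_DELAYS)
```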
4.1.5 Input Data Buﬀering
Data from the input channels, in-data[i], is latched by a special-purpose latch
(see Figure 2). The latch used for this purpose is controlled by an inverted
clock, and an enable input. In particular, only those input channels that
need to be read are actually read. This is accomplished by using the FSM's
ack_enable[i] outputs as a mask to latch data only from
those input channels that are being acknowledged; data from other channels
is neither acknowledged nor latched.
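The masking behavior can be sketched in Python as follows. The function name and the example data values are hypothetical.

```python
def latch_inputs(held, in_data, ack_enable):
    """Return the new latch contents: channel i is captured only if ack_enable[i] is set."""
    return [new if en else old for old, new, en in zip(held, in_data, ack_enable)]


# Channels 0 and 2 are acknowledged; channel 1 keeps its previously held value.
held = latch_inputs([10, 20, 30], [11, 21, 31], [True, False, True])
```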
4.1.6 Handling Clock Distribution Delays
After the clock signal has been conditioned by the stall logic, it still must go
through possibly several stages of ampliﬁcation before it is delivered to the
leaf-level cells inside the synchronous module. As a result, there may be a
signiﬁcant delay between when the stall logic stalls the module, and when
the module is actually stalled. For complex high-speed modules, the clock
distribution delay itself may be multiple clock cycles. An approach to handle
such clock delays is discussed later in Section 4.2.2.
4.2 Communication Networks
4.2.1 Flow control
In our approach, ﬂow control is achieved by augmenting each communication
channel with handshake signals (request and acknowledge). We have used
bundled-data signaling with a two-phase handshake protocol. There is a bundle
of data which carries information (using one wire for each bit) and two control
wires: req and ack. The signal req carries a transition from the sender to the
receiver when the data is valid, while the signal ack carries a transition from
the receiver to the sender when the data has been used. The protocol sequence
is illustrated in Figure 5. Note that this only defines the sequence in which
events must occur; there is no upper bound on the delays between consecutive
events.
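The event ordering of the two-phase bundled-data protocol can be sketched as a small Python model. The class and method names are hypothetical, and delays are abstracted away; only the ordering of events is captured.

```python
class TwoPhaseChannel:
    """Bundled-data channel with transition-signalled req and ack wires."""

    def __init__(self):
        self.req = 0
        self.ack = 0
        self.data = None

    def can_send(self):
        return self.req == self.ack   # the previous item has been acknowledged

    def send(self, value):
        assert self.can_send(), "previous item not yet acknowledged"
        self.data = value             # data must be valid before req toggles
        self.req ^= 1                 # any transition on req signals a new item

    def can_receive(self):
        return self.req != self.ack   # a request is pending

    def receive(self):
        assert self.can_receive(), "no pending request"
        value = self.data
        self.ack ^= 1                 # any transition on ack signals that the data was used
        return value


ch = TwoPhaseChannel()
received = []
for item in ("a", "b", "c"):
    ch.send(item)
    received.append(ch.receive())
```

Because there is no return-to-zero phase, each transfer costs exactly one transition on req and one on ack.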
Fig. 5. The Handshake Protocol
The wrapped module can communicate with another wrapped synchronous
module or with an asynchronous module, as long as pairs of communicating
modules agree on the handshake protocol.
While this paper focuses on a 2-phase handshake protocol implementation,
our approach is quite general, and easily extensible to other protocols as well:
e.g., 4-phase handshaking, pull instead of push protocols, etc.
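
The two-phase protocol described above can be illustrated with a small behavioral model. The following Python sketch is not part of our implementation; the class name Channel and its methods are hypothetical names chosen for illustration. It models req and ack as single bits on which any transition is an event, so the channel is idle whenever req equals ack.

```python
class Channel:
    """Behavioral model of a bundled-data channel with two-phase signaling.

    Hypothetical illustration: names and structure are assumptions,
    not taken from the wrapper implementation.
    """

    def __init__(self):
        self.req = 0      # toggled by the sender
        self.ack = 0      # toggled by the receiver
        self.data = None  # the bundled data wires

    def can_send(self):
        # The channel is idle once the last request has been acknowledged.
        return self.req == self.ack

    def send(self, value):
        if not self.can_send():
            raise RuntimeError("previous item not yet acknowledged")
        self.data = value  # data is placed before the req transition
        self.req ^= 1      # a transition on req signals "data valid"

    def can_receive(self):
        # A pending request shows up as req differing from ack.
        return self.req != self.ack

    def receive(self):
        if not self.can_receive():
            raise RuntimeError("no valid data on the channel")
        value = self.data
        self.ack ^= 1      # a transition on ack signals "data used"
        return value
```

Because only transitions carry meaning, two consecutive transfers leave req and ack back at 0; there is no return-to-zero phase as in a 4-phase protocol.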
4.2.2 Handling Deep Clock Distribution Networks
The clock distribution delays for high-speed complex chips can approach sev-
eral clock periods. This delay results in problems while stopping and restarting
the clock for the latency-insensitive approach. For instance, when the clock
is enabled, the module starts operating after some delay, which increases the
latency of the system. Disabling the clock is worse, however: since the module
stops only after some clock cycles of delay, it will have consumed data from
input channels that might have been invalid. At best, this wastes energy; at
worst, it causes incorrect operation. Ideally, to deal with the clock
distribution delay, the decision to start or stop the clock would be made in
advance, which is not possible. Alternatively, extra buffering must be provided
in the input and output channels to accommodate the clock delays.
Suppose the clock distribution delay is constant and is known in terms
of number of clock cycles. 3 Let us assume that this delay is k cycles. Thus,
when the gclock signal is enabled, it takes k cycles to actually start the module.
Similarly, when the gclock signal is disabled, the module stops after k cycles.
3 The clock distribution delay, when measured in terms of number of clock cycles, will vary
as the clock frequency is changed. Our approach currently assumes a fixed clock delay.
[Figure 6: the Synchronous Module, clocked by gclock, with an Input-Queue on each
input channel in[i] and an Output-Queue on each output channel out[j].]
Fig. 6. Handling deep clock buffer trees with Queues
Our approach is to insert asynchronous queues of length k at each input
and output channel of the module, as shown in Figure 6. Several efficient
implementations of asynchronous queues (FIFOs) are available in the
literature [13,14,15,16]. The queue on the input side is called Input-Queue, while
the queue on the output side is called Output-Queue.
Let’s consider the case when the gclock signal is enabled. By the time the
module starts after enabling gclock, k inputs are stored in the Input-Queue.
So, previous modules which are generating data need not be stalled for k
cycles. Once the module is started, in every clock cycle one data item is taken
from the Input-Queue and one data item is stored in the Input-Queue, which
keeps it full all the time. In contrast, the Output-Queue remains empty,
as output data is generated and consumed each cycle.
Next consider the case when the gclock signal is disabled. It takes k cycles
for the module to stop. The stall signal is sent to the input generating
modules and output consuming modules, so no more input is inserted in the
Input-Queue and no more output is consumed from the Output-Queue. In
the meantime, for the next k cycles, data is read from the Input-Queue, and
data is written to the Output-Queue. After k cycles, the Input-Queue becomes
empty and the Output-Queue becomes full. Now, when the gclock signal is
enabled, while the synchronous module is “waking up,” data is dequeued from
Output-Queue and made available to the receiver for k cycles, while input data
is enqueued in Input-Queue. After k cycles, the Output-Queue becomes empty
and the Input-Queue becomes full; and the module reaches its steady state.
Finally, our approach also correctly handles arbitrary interleavings of stalled
and unstalled cycles, e.g., the case of transient stalls within an otherwise
smooth run. The synchronous module simply mimics the behavior of the
gated clock, except that it is running behind with a constant lag of k cycles.
Since both the Input-Queue and the Output-Queue are sufficiently long
to handle this lag, the wrapped module works correctly, without missing any
data item.
Our approach also has performance and energy benefits through a reduced
need for stalling, and avoidance of garbage computation. In particular, the
presence of the Input-Queue allows the sender to start sending data items even
as the module is “warming up.” That is, once the module’s clock is restarted,
the sender need not wait for k clock cycles before sending data, since these
data items will be properly enqueued in the Input-Queue. Similarly, when the
module’s clock is stalled, the Input-Queue is able to feed valid data to the
module while it is “shutting down.” As a result, the module avoids reading
garbage data from the input channel, thereby avoiding wastage of energy. At
the same time, the Output-Queue ensures that the data items generated by
the module while it is shutting down are safely buﬀered.
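
The queue behavior described in this section can be checked with a cycle-level simulation. The Python sketch below is an illustration only, under simplifying assumptions that are ours: k is at least 1, the sender always has data and sends whenever gclock is enabled, the receiver accepts an item whenever gclock is enabled and the Output-Queue is non-empty, and the module's computation is a stand-in (doubling). The function name simulate and its parameters are hypothetical.

```python
from collections import deque


def simulate(enable, k):
    """Simulate a wrapped module whose clock tree delays gclock by k cycles.

    enable[t] is the gclock enable bit in cycle t (1 = run, 0 = stall).
    Returns (items seen by the receiver, peak Input-Queue occupancy,
    peak Output-Queue occupancy). Assumes k >= 1.
    """
    in_q, out_q = deque(), deque()
    sent = 0
    received = []
    max_in = max_out = 0
    for t, g in enumerate(enable):
        # The module sees each enable decision k cycles late.
        module_on = t >= k and enable[t - k] == 1
        if module_on:
            if not in_q:
                raise RuntimeError("module read an empty Input-Queue")
            out_q.append(in_q.popleft() * 2)  # stand-in computation
            max_out = max(max_out, len(out_q))
        if g:
            # Sender and receiver react to the stall signal without delay.
            in_q.append(sent)
            sent += 1
            max_in = max(max_in, len(in_q))
            if out_q:
                received.append(out_q.popleft())
    return received, max_in, max_out
```

In this model, a run with a 4-cycle stall, a later 1-cycle transient stall and k = 3 never reads an empty Input-Queue, never holds more than k items in either queue, and delivers the results in order with no item lost or duplicated.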
Comparison with Mekie et al.
In a previous approach by Mekie et al. [17], the problem of clock buﬀer
tree delay is resolved by a diﬀerent kind of a wrapper. There are several key
diﬀerences between their approach and ours.
First, their approach uses a pausible clock based on a ring oscillator. While
pausible-clock approaches sometimes mitigate the problem of metastability,
they have their own disadvantages as discussed in Section 2. In particular,
pausible clocks have significantly higher jitter and frequency variation,
compared with clock-gating approaches such as ours that use crystal oscillators.
Second, unlike the Input-Queue used in our approach, their approach does
not provide input-side buﬀering. As a result, their approach lacks the per-
formance and energy beneﬁts that accrue due to the Input-Queue of our ap-
proach. In particular, after the module’s clock is restarted, their approach
requires the sending module to be stalled while the current module is warm-
ing up, thereby leading to unnecessary stalling in the system. Similarly, when
the module’s clock is stalled, the absence of input-side buﬀering in their ap-
proach implies that the module may not have valid input data available while
the module is waiting to be shut down. This garbage computation leads to
wasted energy.
Third, in order to somewhat reduce latency, their approach uses an output
queue that has a bypass path in it, so that when the output queue is empty
data items are sent directly to the receiver, bypassing the queue. While our
approach could also beneﬁt from such a sophisticated queue implementation,
we currently favor simpler implementations consisting of linear FIFOs.
However, several low-latency high-throughput asynchronous FIFOs exist, which
can be a good match for our approach [14,18,15,16].
[Figure 7: a wrapped module in the center with FIFOs on its in and out channels
(each carrying data, req and ack), connected to modules A, B, C, D and E and to
an asynchronous module F through FIFOs and “join” and “fork” blocks.]
Fig. 7. Arbitrary topologies
4.2.3 Arbitrary topologies
The basic approach to latency-insensitive design by Carloni et al. assumes
that all channels in the system are point-to-point channels. However, Singh
and Theobald [10,1] demonstrated that several situations require more general
communication topologies, e.g., those including forks and joins, or other non-
linear structures. For instance, forks and joins may be required if one fast
module is to be replaced by two slow modules working in tandem. For more
details, please see [1].
Accordingly, the approach of this paper allows for arbitrary communication
topologies as shown in Figure 7. The ﬁgure shows how two wrapped modules,
B and C, may be connected to a single channel on the wrapped module in
the center, through a special “join” block in the communication network.
Similarly, data from a single output channel of the module in the center can
be distributed to two wrapped modules, D and E, through a “fork” block.
The speciﬁcation and synthesis of such communication networks, however,
is beyond the scope of this paper.
4.3 Handling Multiple Clock Domains
The key challenges of handling multiple clock domains are: metastability res-
olution, and ﬂow control to handle mismatch of data rates between diﬀerent
clock domains. Our approach has support for both: metastability is resolved
correctly and eﬃciently by the synchronizer (see Section 4.1.1), and ﬂow con-
trol is provided by the communication network (see Section 4.2.1). Therefore,
our approach is suitable for the design of multi-clock SoCs. As an example,
Figure 7 shows a system composed of ﬁve synchronous modules, A-E, each
possibly operating on its own distinct clock, and an asynchronous module F.
5 New Approach: Formal Wrapper Synthesis
This section presents our formal approach of specifying and synthesizing the
stall logic that is critical to the design of wrapper circuits for the synchronous
modules. We choose an interface speciﬁcation language developed by Hi-
tachi/Fujitsu to formally specify a module’s interface behavior. The language
is called Component Wrapper Language (CWL). However, our approach could
be extended to handle other interface languages as well.
Our approach is a combination of syntax-directed translation and synthesis.
In particular, from a given interface specification in CWL, a syntax-driven
step automatically transforms it into a specification of the stall logic FSM. A
synthesis step then automatically converts the FSM specification into a
gate-level netlist implementation of the wrappers.
This section is organized as follows. Section 5.1 provides an overview of
CWL. Then, Section 5.2 presents our synthesis method and an example.
5.1 Background: Component Wrapper Language (CWL)
CWL [2] is a language used to write the external interface specification for an
IP module. The specification expressed in CWL can be refined and used for
the actual design of the IP module, and it can also be used by an IP vendor
to formally specify the module’s interface for the customer. Use of a formal
language such as CWL reduces the labor needed for design and verification,
and is better than timing diagrams and natural-language specifications, which
frequently have errors, missing descriptions and ambiguities.
The format for interface description in CWL is simple. There are four
essential definition items which should be included in every specification:
(i) Port: The port deﬁnition component deﬁnes the external terminals of a
logic module. For each terminal it speciﬁes I/O direction (input/ output/
inout), attribute (clock/ control/ data), bit width and terminal name.
(ii) Alphabet: The alphabet deﬁnition component deﬁnes logical signal values
of ports at a certain time.
(iii) Word: The word definition component defines an alphabet sequence by a
regular expression. For example, “A∗” signifies zero or more occurrences
of A, “B+” signifies one or more occurrences of B, and “C|D” signifies C
or D.
(iv) Sentence: The sentence description component deﬁnes a sequence of
words, which captures the module’s external behavior.
(a)
    interface sample
      port;
        input.clock clk;
        input.data [3:0] d1;
        input.data [3:0] d2;
        output.data [5:0] out;
      endport
      alphabet
        signalset all=
          {clk, d1, d2, out};
          P(reg[3:0]A,reg[5:0]C): {R, A, x, C};
          Q(reg[3:0]B,reg[5:0]C): {R, x, B, C};
        endsignalset
      endalphabet
      word;
        W: [PQ]+;
      endword
      sentence;
        W;
      endsentence
    endinterface
(b)
    interface sample
      port;
        input.clock clk;
        input.control rst_;
        input.control en_;
        output.control wait_;
        input.data [9:0] ad;
        output.data [7:0] dt;
      endport
      alphabet
        signalset all=
          {clk, rst_, en_, ad, wait_, dt};
          I: {R, 0, x, x, 1, z};
          N: {R, 1, 1, x, 1, z};
          W: {R, 1, 1, x, 0, z};
          Q(reg[9:0] A): {R, 1, 0, A, 0, z};
          O(reg[7:0] D): {R, 1, 0, x, 1, D};
        endsignalset
      endalphabet
      word;
        init: I+;
        nop: N;
        read{reg[9:0] A, reg[7:0] D}: Q(A)W*O(D);
      endword
      sentence;
        init [nop | read]+;
      endsentence
    endinterface
Fig. 8. CWL Examples: (a) merge-alternate, (b) 1-port memory (from [2])
Example. Figure 8(a) is an example of a CWL specification for the
merge-alternate module. Here, clk is the input clock to the module, and d1 and d2
are 4-bit input data channels, and out is a 6-bit output data channel. In the
alphabet description, R stands for rising edge, x stands for unknown or “don’t
care,” and z stands for high impedance. The “signalset all” clause describes
the order of ports for the alphabet description. In the next line, alphabet
value P is set to the signal combination of (clk=R, d1=A, d2=x, out=C ). In
“P(reg[3:0]A, reg[5:0]C )”, arguments are used for data reference: A refers to
the 4-bit data value read on input channel d1, and C refers to the 6-bit data
value written to the output channel out. The next line similarly deﬁnes the
alphabet value Q.
This example describes a module whose only legal interface behavior is
given by the sentence that consists simply of a single word W. The word W in
turn is deﬁned by “W: [PQ]+”. Here, alphabets P and Q alternate, starting
with P. Further, P represents the action of reading value A from input channel
d1, and writing value C to output channel out. Similarly, Q represents the
action of reading value B from input channel d2, and writing value C to
output channel out.
Thus, the overall interface behavior of this module is to read values from
input channels d1 and d2 alternately, but write a data value to the output
channel out on every clock cycle.
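
Since a CWL word is a regular expression over alphabet symbols, the legal behaviors of this example can be checked with an ordinary regular-expression matcher. The Python sketch below is only an illustration: it encodes each clock cycle as the one-letter alphabet name (P or Q) and ignores the data arguments A, B and C. The names WORD_W, READS, is_legal and channels_read are hypothetical.

```python
import re

# The word "W: [PQ]+" rewritten in Python regex syntax: one or more
# repetitions of P followed by Q.
WORD_W = re.compile(r"(?:PQ)+")

# Input channel read by each alphabet symbol, as described in the text.
READS = {"P": "d1", "Q": "d2"}


def is_legal(trace):
    """Return True if the per-cycle trace is a complete sentence of W."""
    return WORD_W.fullmatch(trace) is not None


def channels_read(trace):
    """List the input channel read in each cycle of the trace."""
    return [READS[symbol] for symbol in trace]
```

For instance, is_legal("PQPQ") holds, whereas "PPQ" is rejected because d1 would be read in two consecutive cycles.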
5.2 Wrapper Synthesis Approach
Our approach can automatically synthesize wrapper circuits if the module’s
interface description is available in CWL. We now present the algorithm used
in our tool:
Algorithm(input: CWL, output: FSM)
1 s ← read_sentence(CWL);
2 ∀w ∈ s, replace_word(w, alpha_expr(w));
3 T ← parse_tree(s);
4 generate_FSM(T);
5 augment_FSM();
6 busy_wait_optimization();
7 state_minimization();
8 state_assignment();
9 logic_minimization();
The algorithm starts by reading in a sentence of the interface description
given in CWL. Then, in step 2, words in that sentence are substituted by
their corresponding alphabet expression. This results in a complete sentence
which consists of a regular expression in terms of alphabets. For the example
in Figure 8(a), the resulting sentence is [P(A,C) Q(B,C)]+. In step 3, the
sentence is parsed into a parse tree, and in step 4, a state machine speciﬁ-
cation is generated through syntax-directed translation of the parse tree. In
particular, each operator results in a speciﬁcation fragment, as shown in Fig-
ure 9. Each edge is labeled with the conditions that are present in the signal
set corresponding to the next state’s alphabet.
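
The syntax-directed translation of steps 3 and 4 can be sketched in the style of the classical construction of an automaton from a regular expression, in which each operator contributes a fragment with a start and a stop state joined by λ-transitions. The Python code below is a minimal illustration of that general technique and is not the code of our tool; the function names (sym, seq, alt, plus, star, accepts) are hypothetical, and the exact fragments are an assumption modeled on the operators listed in Figure 9.

```python
from itertools import count

_ids = count()  # source of fresh state numbers


def sym(a):
    """Fragment for one alphabet symbol: start --a--> stop."""
    s, t = next(_ids), next(_ids)
    return s, t, [(s, a, t)]


def seq(x, y):
    """X.Y: a lambda edge from X.stop to Y.start."""
    return x[0], y[1], x[2] + y[2] + [(x[1], None, y[0])]


def alt(x, y):
    """X|Y: fresh start and stop states with lambda edges to both branches."""
    s, t = next(_ids), next(_ids)
    edges = [(s, None, x[0]), (s, None, y[0]), (x[1], None, t), (y[1], None, t)]
    return s, t, x[2] + y[2] + edges


def plus(x):
    """X+: a lambda edge from X.stop back to X.start."""
    return x[0], x[1], x[2] + [(x[1], None, x[0])]


def star(x):
    """X*: like X+, with an extra lambda bypass from start to stop."""
    s, t = next(_ids), next(_ids)
    edges = [(s, None, x[0]), (x[1], None, t), (s, None, t), (x[1], None, x[0])]
    return s, t, x[2] + edges


def accepts(frag, trace):
    """Return True if the fragment accepts the whole trace."""
    start, stop, edges = frag

    def closure(states):
        # Follow lambda edges (label None) until no new state is reached.
        stack, seen = list(states), set(states)
        while stack:
            u = stack.pop()
            for a, label, b in edges:
                if a == u and label is None and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen

    current = closure({start})
    for c in trace:
        current = closure({b for a, label, b in edges if a in current and label == c})
    return stop in current
```

For the example of Figure 8(a), the sentence is built as plus(seq(sym("P"), sym("Q"))); removing the λ-transitions from such a fragment yields a state machine of the kind that the later steps augment and minimize.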
Since each input and output channel is augmented with request and ac-
knowledge signals, the FSM is similarly augmented by the valid/ready signals
from these channels in step 5. More formally, for each edge of the state ma-
chine, there should be conditions corresponding to the handshake signals for
each input and output channel that is required for that particular transition.
In particular, the valid signal for each input channel and the ready signal for
each output channel should be 1. Each state transition is also labeled with the
FSM outputs that are generated: the ck enable, acks corresponding to input
channels, and reqs corresponding to output channels are specified to be 1. For
all other input combinations, a self edge to the same state is added to stall
the module.
[Figure 9: FSM fragments, each with a start and a stop state joined by
λ-transitions, for the operators X*, X.Y, X|Y and X+, and for a single
alphabet A.]
Fig. 9. Sentence evaluation for each operator
[Figure 10: three state machines over states S0 and S1. (a) Edges labeled
“read d1 / write out” and “read d2 / write out”. (b) Edges labeled 1X1 /1101
and X11 /1011, plus self edges labeled 0XX /000, XX0 /000, X0X /000 and
XX0 /000. (c) Edges labeled 1X1 /1101 and X11 /1011 only.]
Fig. 10. Wrapper synthesis example: (a) after state machine instantiation, (b) after handshake
augmentation, (c) after busy-waiting optimization
In step 6, the optimization of Section 4.1.3 is applied to eliminate busy
waiting from the FSM. In particular, the target FSM implementation is as-
sumed to be controlled using the gated clock instead of the original clock,
resulting in all of the self edges transforming into don’t-cares.
The state machine resulting from step 6 is synthesized using the popular
SIS [19] synthesis tool. The following three steps are performed: state
minimization (STAMINA), state assignment (JEDI), and logic covering and
minimization (ESPRESSO).
Example. Figure 10 shows the resulting FSM for the CWL example of
Figure 8(a). In Figure 10(b), there are three input signals to the FSM as a
result of the handshake augmentation step: valid_d1, valid_d2 and ready_out,
which indicate the validity of the two input channels, d1 and d2, and the
readiness of the output channel, out. The state machine has four outputs,
listed in the following order: g, ack_d1, ack_d2 and req_out. There are four
input/state combinations that result in stalling, represented as the self edges
in Figure 10(b). When the busy-waiting elimination step is applied, the state
machine speciﬁcation is signiﬁcantly simpliﬁed because the self edges are re-
placed by don’t-care transitions, as shown in Figure 10(c).
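
The augmented state machine of this example is small enough to write out as a transition function. The Python sketch below is an illustration of the behavior described above, not output of our tool; the names SPEC and step are hypothetical, and the stall outputs are assumed to be all zeros.

```python
# For each state: the input channel read in that state, and the next state.
SPEC = {"S0": ("d1", "S1"), "S1": ("d2", "S0")}


def step(state, valid_d1, valid_d2, ready_out):
    """One transition of the augmented FSM.

    Returns (next_state, (g, ack_d1, ack_d2, req_out)).
    """
    needed, next_state = SPEC[state]
    valid = {"d1": valid_d1, "d2": valid_d2}[needed]
    if valid and ready_out:
        # Only the channel needed in this state is acknowledged.
        outputs = (1, int(needed == "d1"), int(needed == "d2"), 1)
        return next_state, outputs
    # Any other input combination: self edge, clock disabled (assumed zeros).
    return state, (0, 0, 0, 0)
```

Note that in state S0 the value of valid_d2 is ignored, so the module is not stalled merely because channel d2, which is not needed in that cycle, has no valid data.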
Table 1
Results of synthesis method
Name             Sentence                    States  Latches  Literals  Literals (opt.)  Reduction
merge-alternate  [A.B]+                      2       1        12        10               17%
merge-diff-freq  [A.A.A.B.B.B.B]+            7       3        35        32               9%
split-3way       [X.Y.Z]+                    3       2        21        17               19%
fork-2way        [C]+                        1       1        6         3                50%
multiplexor      [A|B]+                      1       1        12        10               17%
sequencer        A.W∗.B                      2       1        22        20               9%
1-port memory    I+.[N|[Q.W∗.O]]+            2       1        37        31               16%
2-port memory    I+.[N|[Q.W∗.O]|[R.X∗.P]]+   3       2        71        59               17%
6 Results
Our synthesis tool has been implemented in C++ on an IBM PC running at 2
GHz with 512 MB of memory. This section presents the results of our wrapper
synthesis method for several example CWL module interface descriptions.
Table 1 summarizes the results of our experiments. The ﬁrst column lists
the CWL example used as input to our tool. Since there is no standard CWL
benchmark repository established yet, we generated seven CWL examples;
the table lists the sentence for each. An eighth example in the table (“1-port
Memory”) is from [2], and is reproduced in Figure 8(b) for ease of reference.
The next two columns list the number of symbolic states (after state mini-
mization) in the FSM speciﬁcation generated by our tool, and the number of
state bits after state assignment.
The examples chosen represent several interesting interfaces. The example
merge-alternate is the example of Figure 8(a), and it represents a module that
merges two incoming streams, in an alternating manner, into one outgoing
stream. The next example is also a merge, but it alternates between taking three
inputs from one stream and four inputs from the other stream.
Split-3way splits one data stream into three, whereas fork-2way copies one incoming
stream onto two outgoing ones. The next example acts as a multiplexor,
allowing either of two interfaces to get access to a shared resource. In the
event that both interfaces request access for the same clock cycle, an additional
control signal is assumed to be available to break the tie. The next example
illustrates a transaction sequencer: it processes a transaction on one interface,
followed by an arbitrary number of wait cycles, followed by a transaction on the
other interface. The last two examples are memories, with the 1-port memory
example taken from [2]; the 2-port memory generalizes it.
Our synthesis tool was successfully able to synthesize all of the wrapper
circuits. The time taken for each of the examples was under 1 second.
The two columns labeled “Literals” and “Literals (opt.)” list the number of
literals in the synthesized gate-level implementation of the FSM. The former
represents the cost of the implementation without the busy-waiting
optimization of Section 4.1.3, whereas the latter represents the cost with the
optimization. The last column represents the improvement in solution quality
as a result of the optimization: roughly 9%–19% improvement in most cases, and
50% improvement for fork-2way.
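
The Reduction column can be recomputed from the two literal counts of Table 1. The short Python sketch below does so; the helper name reduction_percent is hypothetical, and rounding to the nearest integer percentage is an assumption that happens to match every row of the table.

```python
# (Literals, Literals (opt.)) for each example, copied from Table 1.
LITERALS = {
    "merge-alternate": (12, 10),
    "merge-diff-freq": (35, 32),
    "split-3way": (21, 17),
    "fork-2way": (6, 3),
    "multiplexor": (12, 10),
    "sequencer": (22, 20),
    "1-port memory": (37, 31),
    "2-port memory": (71, 59),
}


def reduction_percent(before, after):
    """Relative saving in literal count, rounded to a whole percent."""
    return round(100 * (before - after) / before)


REDUCTIONS = {name: reduction_percent(b, a) for name, (b, a) in LITERALS.items()}
```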
7 Conclusions
This paper presented an architecture and a wrapper synthesis approach for the
design of multi-clock systems-on-chips. The architecture provides for a ﬂexible
communication strategy, and a sophisticated wrapper design avoids unnecessary
stalling when unavailable input or output channels are determined not to be
required for the next clock cycle. In addition, latencies of deep clock trees are
handled. The wrapper synthesis approach uses the Component Wrapper Lan-
guage (CWL) from Hitachi/Fujitsu as the speciﬁcation language. Synthesis
results for a small set of examples were promising.
References
[1] Montek Singh and Michael Theobald. Generalized latency-insensitive systems for single-clock
and multi-clock architectures. In Proc. Design, Automation and Test in Europe (DATE),
February 2004.
[2] Fujitsu Ltd., Fujitsu Laboratories Ltd., and Hitachi Ltd. Component wrapper language.
http://www.labs.fujitsu.com/en/techinfo/cwl/index.htm.
[3] Luca P. Carloni, Kenneth L. McMillan, and Alberto L. Sangiovanni-Vincentelli. Latency
insensitive protocols. In Computer Aided Veriﬁcation, pages 123–133, 1999.
[4] Luca Carloni, Kenneth McMillan, and Alberto Sangiovanni-Vincentelli. The theory of latency
insensitive design. IEEE Transactions on Computer-Aided Design, 20(9), September 2001.
[5] L.P. Carloni and A.L. Sangiovanni-Vincentelli. Coping with latency in SoC design. IEEE
Micro, Special Issue on Systems on Chip, 22(5), Sep/Oct 2002.
[6] Daniel M. Chapiro. Globally-Asynchronous Locally-Synchronous Systems. PhD thesis,
Stanford University, October 1984.
[7] Jens Muttersbach, Thomas Villiger, and Wolfgang Fichtner. Practical design of globally-
asynchronous locally-synchronous systems. In Proc. International Symposium on Advanced
Research in Asynchronous Circuits and Systems, pages 52–59, April 2000.
[8] Kenneth Y. Yun and Ryan P. Donohue. Pausible clocking: A ﬁrst step toward heterogeneous
systems. In Proc. International Conf. Computer Design (ICCD), October 1996.
[9] Ajanta Chakraborty and Mark R. Greenstreet. Eﬃcient self-timed interfaces for crossing clock
domains. In Proc. International Symposium on Advanced Research in Asynchronous Circuits
and Systems, pages 78–88. IEEE Computer Society Press, May 2003.
[10] Montek Singh and Michael Theobald. Generalized latency-insensitive systems for GALS
architectures. In Proc. Workshop on Formal Methods for GALS Systems (FMGALS-03), Pisa,
Italy, September 2003.
[11] Al Davis and Steven M. Nowick. An introduction to asynchronous circuit design. Technical
Report UUCS-97-013, Dept. of Computer Science, University of Utah, September 1997.
[12] Ran Ginosar. Fourteen ways to fool your synchronizer. In Proc. International Symposium
on Advanced Research in Asynchronous Circuits and Systems, pages 89–96. IEEE Computer
Society Press, May 2003.
[13] Montek Singh and Steven M. Nowick. MOUSETRAP: Ultra-high-speed transition-signaling
asynchronous pipelines. In Proc. International Conf. Computer Design (ICCD), pages 9–17,
November 2001.
[14] Montek Singh and Steven M. Nowick. High-throughput asynchronous pipelines for ﬁne-grain
dynamic datapaths. In Proc. International Symposium on Advanced Research in Asynchronous
Circuits and Systems, pages 198–209. IEEE Computer Society Press, April 2000.
[15] Ivan Sutherland and Scott Fairbanks. GasP: A minimal FIFO control. In Proc. International
Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 46–53. IEEE
Computer Society Press, March 2001.
[16] Tiberiu Chelcea and Steven M. Nowick. Robust interfaces for mixed-timing systems
with application to latency-insensitive protocols. In Proc. ACM/IEEE Design Automation
Conference, June 2001.
[17] J. Mekie, S. Chakraborty, and D. K. Sharma. Evaluation of pausible clocking scheme for
interfacing high speed IP cores in GALS framework. In Proc. International Conference on
VLSI Design, January 2004.
[18] Montek Singh and Steven M. Nowick. Fine-grain pipelined asynchronous adders for high-speed
DSP applications. In Proceedings of the IEEE Computer Society Workshop on VLSI, pages
111–118. IEEE Computer Society Press, April 2000.
[19] Ellen M. Sentovich, et al. SIS: a system for sequential circuits synthesis. Department of EECS
Technical Report No. UCB/ERL M92/41, U.C. Berkeley, May 1992.
