GALS Test Chip on 130nm Process
David S. Bormann 1,2
Intel Labs
Intel Corporation
Santa Clara, California, USA
Abstract
We present a Globally Asynchronous Locally Synchronous test chip fabricated on a 130nm silicon
process. The primary design goals of this chip were to measure the stability of local clocks on a
deep submicron process technology, to evaluate the difficulties of using GALS in a standard design
flow, and to measure power consumption. The original Asynchronous Wrapper building blocks were used to
construct a configurable data pipeline that can be tuned to emulate the operation of many different
types of algorithms.
Keywords: GALS, asynchronous wrapper, local clock, stretchable clock, AFSM
1 Introduction
Intel envisions a future with wireless communication capabilities on nearly
every piece of silicon. Several requirements lead directly from this Radio Free
Intel vision. First, the radios must be inexpensive. Second, the power needs
for this new capability must be low. Finally, the communications logic should
be easy to integrate into diverse chips.
Globally Asynchronous Locally Synchronous (GALS) logic is a design style
that can help meet these ambitious requirements. The GALS technique maintains
the typical synchronous design methodology for the bulk of the logic but
takes advantage of the robustness and flexibility of an asynchronous interface
between logic blocks. This leads to compact logic, low power, and a highly
modular design that can be easily integrated and updated.
1 The author would like to thank Kirk Skeba for his continued support of this work.
Additionally, the test chip would not have been possible without the help of Serge Rutman,
Dave Clark, Jeen Miin, William Jiang and Samson Huang.
2 Email: david.bormann@intel.com
Electronic Notes in Theoretical Computer Science 146 (2006) 29–40
1571-0661 © 2006 Elsevier B.V.
www.elsevier.com/locate/entcs
doi:10.1016/j.entcs.2005.05.034
Open access under CC BY-NC-ND license.
This paper follows a method of assembling GALS systems that is intended
to be more easily adopted by engineers familiar with synchronous design tech-
niques. The majority of the design is constructed using typical synchronous
methods and the system is then partitioned into locally synchronous regions.
To convert the synchronous system into a GALS system, each region is sur-
rounded by an Asynchronous Wrapper of predeﬁned library components. No
asynchronous design experience is required since the wrapper components can
be dropped in without modiﬁcation.
The remainder of the paper describes a test chip experiment to evaluate
the risks and rewards of including GALS techniques in future high volume
products. Synchronous design is a very well understood process; there need
to be substantial measurable beneﬁts to justify the risks of moving to GALS
design. This experiment was intended to gain experience designing GALS
systems with standard tool ﬂows, to measure the robustness of local clocks,
and to quantify any power savings as compared to a standard clock-gated
synchronous design.
2 Asynchronous Wrapper
Previously the author proposed the Asynchronous Wrapper [2] as a simple
methodology for assembling GALS systems [3] [13] [7] [6]. The Wrapper shown
in Figure 1 consists of a small set of building blocks that surround a locally
synchronous circuit, making it externally appear to be an asynchronous hand-
shake circuit. The module consumes no dynamic power until data arrives
and halts immediately when processing completes. The asynchronous wrap-
per allows a designer to subdivide a globally synchronous circuit into locally
synchronous regions, reducing problems of clock skew and improving modu-
larity and reusability. Since the idea was published, several chips have been
fabricated that prove the concept in working silicon [8] [4].
The Asynchronous Wrapper is made up of a local clock, a collection of
input and output ports to handle external communication, and port control
logic to select what communication needs to take place during each clock cycle.
The wrapper can have any number of input and output ports and need not
be restricted to use in a linear pipeline.
The asynchronous interface on each port uses a four-phase bundled data
handshake protocol. This means that there are four events per handshake
cycle: Req+, Ack+, Req-, and Ack-. We adopt a late data-valid scheme [9]
whereby data is guaranteed to be valid when a Req- transition occurs and may
change at any time after Ack-.
D.S. Bormann / Electronic Notes in Theoretical Computer Science 146 (2006) 29–40
Fig. 1. Asynchronous Wrapper
Fig. 2. Local Clock Generator
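The event ordering of the four-phase protocol and the late data-valid window described above can be sketched in a few lines of Python. This is only an illustrative behavioural model, not the wrapper logic itself; the function names and the list-of-strings trace format are assumptions made for the example.

```python
# Behavioural sketch of a four-phase bundled-data handshake.
# The trace format (a list of event names) is a hypothetical convention.
FOUR_PHASE = ["Req+", "Ack+", "Req-", "Ack-"]


def check_handshake(events):
    """Return True if `events` is a whole number of four-phase cycles in order."""
    if len(events) % 4 != 0:
        return False
    return all(ev == FOUR_PHASE[i % 4] for i, ev in enumerate(events))


def data_valid_windows(events):
    """Late data-valid: data is guaranteed valid from Req- until Ack-.

    Returns (start, end) index pairs into `events`, one per handshake.
    """
    windows = []
    start = None
    for i, ev in enumerate(events):
        if ev == "Req-":
            start = i
        elif ev == "Ack-" and start is not None:
            windows.append((start, i))
            start = None
    return windows
```

For a trace of two back-to-back handshakes, `data_valid_windows` reports one window per handshake, each spanning only the Req- to Ack- interval.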
Local Clock
Each GALS block contains its own Local Clock generator as shown in
Figure 2. The clock generator contains a delay element that models the
delay through the combinatorial logic. The delay can be as simple as an
inverter chain as demonstrated in an encryption chip [8] or more elaborate
schemes [5] [10]; the only requirement is that the local clock generates a clock
period that is consistently longer than the worst-case delay through the local
logic function block.
The clock generator has a single input, Stretch, which is used to stretch
out the low clock phase for as long as necessary to exchange data with the
environment. The Stretch signal is the logical OR of the Stretch signals from
all of the input and output ports within the block. As long as any port is
waiting to receive or send data, the clock will be held low and no dynamic
power will be consumed within the block. If Stretch rises and then falls again
before a nominal clock period expires, however, the Stretch signal will have
no effect.
Fig. 3. Passive Input Port AFSM
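The stretching rule can be illustrated with a small behavioural model. This is a sketch under stated assumptions: the function names are hypothetical, times are measured from the rising clock edge, and the clock is assumed to restart as soon as Stretch is released, ignoring any restart latency of a real generator.

```python
def stretch(port_stretches):
    """Block-level Stretch is the logical OR of every port's Stretch signal."""
    return any(port_stretches)


def cycle_length_ns(nominal_period_ns, port_release_times_ns):
    """Length of one clock cycle, measured from its rising edge.

    port_release_times_ns holds, for each port, the time after the rising
    edge at which that port deasserts its Stretch signal (0.0 if the port
    never stretched in this cycle). Assumption: the clock resumes
    immediately once the last port releases Stretch.
    """
    last_release = max(port_release_times_ns, default=0.0)
    # A Stretch pulse that ends before the nominal period expires has no effect.
    return max(nominal_period_ns, last_release)
```

With a 2 ns nominal period, ports releasing at 0.5 ns and 1.2 ns leave the cycle at 2 ns, while a port that waits 7.5 ns for its partner extends the cycle to 7.5 ns.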
Input Port
The Input Port is activated to receive data whenever Input rises. Upon
activation, the port stretches the local clock by holding StretchI high until new
data arrives. The input port is an asynchronous ﬁnite state machine described
by extended burst mode speciﬁcations and with logic equations generated by
the 3D tool [11] [12]. The logic contains only combinatorial gates but will
not produce glitches. Since the logic does not require a clock, a full four-
phase request-acknowledge handshake can take place within a single cycle.
Therefore, new data can be received in every clock cycle.
The burst-mode speciﬁcation in Figure 3 describes the behavior of the
passive input port. Recall that a burst-mode circuit waits for all transitions
in the input burst to occur before sending the output burst, but that the input
transitions may arrive in any order. A # symbol indicates that the signal is
a directed don’t-care; it may either remain at 0, monotonically change from 0
to 1, or remain at 1. Similarly, a ˜ means that the signal may remain at 1,
monotonically change from 1 to 0, or remain at 0.
To ensure correct behavior of the extended-burst-mode circuit, we must
meet three conditions [11]. The fundamental-mode environmental constraint
requires that a new input burst must not begin until the machine has stabi-
lized. This will be met as long as the environment does not respond to an
output burst faster than the internal state machine can recover. The feedback
delay requirement can be satisﬁed by the synthesis tool using the conservative
unbounded wire delay model. We do not have to worry about the setup time
requirement because we are not using any conditional signals in our speciﬁca-
tions.
Every input burst in the specification must contain at least one compulsory
transition. A compulsory transition is a signal change that can only occur
while in the current state and is required before moving on to the next state.
In the port specifications shown, it was often impossible to provide compulsory
transitions without overly restricting concurrency. For instance, in the passive
input port of Figure 3, we require that the Stretch signal is asserted
immediately after Input+. We do not know whether ReqI will arrive before or
after Input, but AckI+ cannot occur until after both signals have arrived.
The input burst from state 1 to 2 should really only have ReqI+; however,
to provide a compulsory transition we feed back the previous output burst,
StretchI+.
Fig. 4. Active Input Port AFSM
A ReqI+ transition may arrive at a passive input port at any time. The re-
quest will not be acknowledged, however, until the locally synchronous module
asserts Input+. Since we are using a late data-valid bundling convention, the
data is not guaranteed to be valid until after ReqI-. Therefore LatchI+, the
transition which triggers the input buﬀers, must not occur until after ReqI-.
Finally, AckI- must not occur before the data is latched since after AckI- the
data can change at any time.
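The ordering constraints stated above for the passive input port can be collected into a small trace checker. This is a hedged sketch, not the burst-mode specification of Figure 3: the rule table is assembled from the prose, the checker assumes a trace covering a single handshake in which each event appears at most once, and the function name is hypothetical.

```python
# Each (before, after) pair means `before` must precede `after`
# within one handshake of the passive input port.
ORDERING_RULES = [
    ("Input+", "StretchI+"),  # Stretch is asserted after Input+
    ("Input+", "AckI+"),      # no acknowledge until the module asks for input
    ("ReqI+", "AckI+"),       # no acknowledge until the request has arrived
    ("AckI+", "ReqI-"),       # four-phase protocol order
    ("ReqI-", "LatchI+"),     # late data-valid: latch only after ReqI-
    ("LatchI+", "AckI-"),     # data may change after AckI-, so latch first
]


def violations(trace):
    """Return the list of ordering rules broken by a single-handshake trace."""
    position = {event: index for index, event in enumerate(trace)}
    broken = []
    for before, after in ORDERING_RULES:
        if before in position and after in position:
            if position[before] > position[after]:
                broken.append((before, after))
    return broken
```

Both arrival orders of ReqI+ and Input+ pass the check, which reflects the concurrency the port is meant to allow; a trace that latches before ReqI- is reported as a violation.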
The pipeline of this design is implemented with passive input ports and
active output ports. Initially, the modules are inactive while passively waiting
for input data to arrive. When a request arrives the input port captures the
data, completes the handshake and releases the stretch signal to allow the
module to move on to the next clock cycle.
Output Port
The Output Port sends data to another block whenever instructed by
the Port Select. The AFSM and implementation are described in Figure 4.
The block latches data and begins driving it immediately while initiating a
four-phase handshake. The port stretches the local clock until the data is
successfully transmitted and the four-phase handshake is complete. In normal
operation, each computation node in the pipeline will receive and send data in
every clock cycle. Since the input and output ports are separate asynchronous
ﬁnite state machines, they can each independently execute their handshakes
as fast as the environment allows.
Fig. 5. GALS Test Chip Plot
3 GALS Test Chip
When the opportunity to build a test chip first arose, the plan was to build a
GALS 802.11a compatible OFDM baseband [1]. However, another group published
a paper on a similar design [4], so the plan was changed. Instead,
this chip is a more general system that can be configured to emulate the activity
pattern of the OFDM system and many others. The two primary design
goals are to test tunable local clocks and to measure the power consumption of
a GALS system. The power measurements will allow a comparison of GALS
and synchronous power efficiency under various scenarios.
A plot of the GALS test chip is shown in Figure 5. The chip contains a
number of experiments but the ﬁgure only includes the corner of the die with
the GALS design. This design ﬁts in an area approximately 2.5mm by 4.0mm
and runs on a nominal core voltage of 1.2V.
Modes of Operation
The GALS test chip consists of sixteen identical locally synchronous mod-
ules. The system can operate in two major modes. The ﬁrst mode tests the
tunable local clocks in isolation. The second mode conﬁgures the collection of
locally synchronous modules into one or more pipelines.
In the clock test mode, each local clock operates independently and is
used to increment a local counter. All of the local counters are sent into a
multiplexer and can be externally monitored one at a time. Each local clock
can be independently tuned and monitored for stability over time and with
respect to the other clocks. This mode will be used to evaluate the effectiveness
of the tunable local clock implementation in isolation.
Fig. 6. Locally Synchronous Model
In pipeline mode, each module can be conﬁgured to be a source, sink or
computation node. Computation modules perform a conﬁgurable mathemat-
ical operation on the data as it ﬂows through. Source nodes generate data
of a conﬁgurable sequence with a tunable activity pattern. As an example,
a source can generate ﬁve sequential data words every hundred clock cycles.
Sink nodes accept data whenever it is received and end a pipeline segment.
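The source activity pattern in the example above (five words every hundred cycles) can be sketched as follows. The function name and parameters are hypothetical, and the sketch assumes the burst sits at the start of each period, which the text does not specify.

```python
def source_pattern(burst_len, period, n_cycles):
    """Return one flag per clock cycle: True when the source emits a data word.

    Assumption: the `burst_len` output cycles come first in each `period`,
    followed by idle cycles.
    """
    if not 0 <= burst_len <= period:
        raise ValueError("burst_len must lie between 0 and period")
    return [(cycle % period) < burst_len for cycle in range(n_cycles)]
```

Calling `source_pattern(5, 100, 300)` marks the first five cycles of each hundred as output cycles, fifteen in total over three hundred cycles.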
Tunable Local Clock
Each of the sixteen local clocks can be conﬁgured over a range of frequencies
of approximately 55 to 500MHz. The tunable delay is adjusted through a series
of multiplexers. The delay element design is similar to that shown in [5] and
is made up of coarse and ﬁne delay adjustments. The coarse delay setting
controls a series of delay elements of approximately 1, 2, 4, and 8 ns. The ﬁne
delay selector can be used to add from zero to nine approximately 100ps delay
elements. The minimum delay is about 2ns and the maximum delay through
all the elements is approximately 18ns.
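The delay figures above are mutually consistent, which a short calculation makes explicit. The sketch below rests on several assumptions not stated in the text: the coarse elements are selected independently by a 4-bit mask, the 2 ns minimum acts as a fixed base delay, and the clock period equals the total delay (the reading under which 2 ns and 18 ns correspond to roughly 500 MHz and 55 MHz). All names are hypothetical.

```python
COARSE_NS = (1.0, 2.0, 4.0, 8.0)  # approximate coarse delay elements
FINE_STEP_NS = 0.1                # approximate fine delay element (100 ps)
BASE_DELAY_NS = 2.0               # assumed fixed minimum delay


def delay_ns(coarse_mask, fine_count):
    """Total delay for a 4-bit coarse mask and 0 to 9 fine elements."""
    if not 0 <= coarse_mask < 16:
        raise ValueError("coarse_mask must be a 4-bit value")
    if not 0 <= fine_count <= 9:
        raise ValueError("fine_count must be between 0 and 9")
    coarse = sum(d for bit, d in enumerate(COARSE_NS) if (coarse_mask >> bit) & 1)
    return BASE_DELAY_NS + coarse + fine_count * FINE_STEP_NS


def frequency_mhz(coarse_mask, fine_count):
    """Clock frequency, assuming the period equals the total delay."""
    return 1000.0 / delay_ns(coarse_mask, fine_count)
```

Under these assumptions the fastest setting, `frequency_mhz(0, 0)`, gives 500 MHz, and the slowest, `frequency_mhz(15, 9)`, gives a delay of about 17.9 ns, or just under 56 MHz.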
Each clock can be suspended indeﬁnitely by stretching the low clock phase.
The clock is stretched when a module is waiting for either input or output
data or both. Additionally, for test purposes all local clocks can be stretched
through a single global halt signal. The Stretch signal is the logical OR of all
of the possible blocking conditions: a stall by any input or output port or the
global halt. However, the global halt must be handled carefully since it is not
synchronized to the local clock.
To ensure correct operation of the local clock the Stretch signal must be
asserted synchronously. On the other hand, the falling edge of the Stretch
signal can occur at any time without causing problems. If the rising edge
of Stretch arrived just before a rising clock edge, the clock could become
metastable and cause a catastrophic failure. The design of the input and
output ports prevents this situation by only asserting Stretch immediately
after a rising clock edge.
To avoid timing problems with the global halt signal, the halt signal is
synchronized to the local clock at the falling edge. This ensures that the
decision to halt the clock is made at least half a clock period before the
critical rising edge. Metastability can still occur here, but the half clock period
of margin helps to prevent it from causing a problem. The global halt
is only included because it enables the local clocks to be started and stopped
together. This is useful for conﬁguration and for reading out a snapshot of
the local counters and other debug information.
Port Select
The port select component operates diﬀerently depending on whether the
module is a sink, source or computation node. In a clock cycle where data
needs to be exchanged with another pipeline stage, the input or output port
asserts the corresponding Stretch signal after the rising clock edge. If the
data exchange takes place quickly, the stretch signals are cleared before the
end of a nominal clock cycle and the clock period is not aﬀected. However, if
data exchange cannot take place, the clock is stretched and the module waits
without consuming any dynamic power.
The port select logic for sink mode is the simplest; every clock cycle
is an input cycle and the module is only active when data is received. In
source mode, the clock runs continuously for a conﬁgurable number of idle
cycles between one or more output cycles. Computation nodes are inactive
until data is received. Once input data arrives, a computation is performed
on the data for a conﬁgurable number of cycles before sending the data on to
the next pipeline stage.
Logic Core
Each module has a set of conﬁguration registers to control the function of
the node. The mode setting determines whether the node is part of a pipeline
or a free-running local clock. The nominal frequency of each local clock is set
by storing a string of bits to control a series of multiplexers.
The logic core of each module contains a conﬁgurable set of mathematical
functions. Two register bits select which of the four math functions is per-
formed by the node. The functions were selected and designed to have widely
diﬀerent delays through the logic. The four functions were 16-bit addition,
16-bit multiplication, 32-bit addition and 32-bit multiplication. Each function
was created as a separate structural netlist to allow the timing of each to
be optimized independently.
Each computation node can emulate a pipeline of a conﬁgurable depth.
For example, if the depth is set to ﬁve, the module will not generate output
data until ﬁve input data words have been received. The pipelines do not
allow "holes" so output data is only sent when input data arrives. However,
each module has an End of Packet signal that tells the pipeline to ﬂush.
This causes the module to temporarily behave like a source node, sending
output data without accepting new input data. The End of Packet signal is
propagated to the next module when the ﬁnal data word is ﬂushed.
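The activity pattern of the emulated pipeline depth can be modelled with a simple occupancy counter. This is a sketch only: the class and method names are hypothetical, and it assumes the first output word is sent in the same cycle that the input word filling the pipeline arrives, a detail the text leaves open. Only word counts are tracked, not data values.

```python
class DepthEmulator:
    """Counts words 'in flight' in an emulated pipeline of a given depth."""

    def __init__(self, depth):
        if depth < 1:
            raise ValueError("depth must be at least 1")
        self.depth = depth
        self.in_flight = 0

    def on_input(self):
        """Accept one input word; return True if an output word is sent."""
        self.in_flight += 1
        if self.in_flight == self.depth:
            # Pipeline is full: one word leaves as the new one enters,
            # so no "holes" appear in the output stream.
            self.in_flight -= 1
            return True
        return False

    def on_end_of_packet(self):
        """Flush: return how many output words are sent without new input."""
        flushed = self.in_flight
        self.in_flight = 0
        return flushed
```

With a depth of five and seven input words, the model sends three output words while inputs arrive and four more on End of Packet, so every input word is eventually matched by one output word.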
The pipeline depth emulation diﬀers from a true multi-stage pipeline in
several important ways. First, instead of having multiple stages of logic gates
operating in parallel, only one logic function is active in each cycle. Therefore,
power consumption is lower and there is only one set of data values for the entire
module. The output data value is hence always updated based on new input
data after a single clock cycle regardless of emulated pipeline depth. Despite
these diﬀerences, the pipeline emulation allows the modules to behave with
an activity pattern that is similar to the OFDM baseband and to other real
problems.
4 Design Issues
RTL
The system was designed and simulated almost entirely using a Verilog
RTL description. The logic equations for each port’s asynchronous ﬁnite state
machines were typed directly into the Verilog and the initial design worked
in simulations within a few days. The math functions within each block were
instantiated using a structural Verilog netlist. The intent was to have diﬀerent
frequency goals for each of the various functions rather than allow the tools
to optimize all towards a single performance target.
Some diﬃculties arose from the design of the Port Select logic. The AFSM
requires that the control signal to initiate a data exchange must return to
zero between cycles. To allow the ports to receive or send data in every cycle,
the control signal to the AFSM must be cleared before the end of each clock
cycle. The solution was to use a ﬂip-ﬂop with an asynchronous clear. Several
timing issues arise from this implementation. The clear signal to the FF must
also return to zero, and it must do so long enough before the start of the
next clock cycle. In the final stages of tape-out, the clear signal was flagged
because the clear pulse was too narrow for the flip-flops. One of
the design goals with the asynchronous wrapper was to avoid two-sided timing
constraints, but this Port Select implementation did not meet that target.
The diﬃculty in generating the Input and Output port select signals is
due to the fact that they are generated by synchronous logic but received by
asynchronous logic. The signals must have both a rising and a falling edge
within a single clock, but the falling edge cannot be tied to the falling edge
of the clock. Depending on how quickly the environment responds, the falling
edge of the clock might occur before or after the handshake is completed.
Another reason for concern is that the port select signals cannot glitch because
they directly drive the AFSMs.
Synthesis
During synthesis it was expected that there would be some issues in pre-
venting the tools from optimizing the carefully crafted AFSM equations. A
number of attempts were made with "don’t touch" directives but the tool still
made unexpected changes. The solution was to separate out each equation
into its own design unit and to mark the entire unit as "don’t touch." This
solved the optimization problem, but by inspecting the netlist it was discov-
ered that many of the 16 copies of each equation were synthesized slightly
diﬀerently. One implementation was selected for each equation and copied to
all 16 instances in the design.
A better method would have been to create a hard macro for one of the
GALS blocks and then to repeat the entire block 16 times. This would have
signiﬁcantly reduced back-end work and allowed more time to improve the
single block rather than spreading the eﬀort across 16 copies of the same
logic.
Place and Route
Once the design went through place and route some additional timing prob-
lems were experienced. The constraints used in both synthesis and APR were
insuﬃcient to force the tools to assemble the chip in a GALS-friendly manner.
The desire was to keep each of the GALS modules tightly grouped to keep
the local clock tree as small as possible. One of the advantages of the inter-
module asynchronous interconnect is that it will function correctly regardless
of the absolute delay as long as the bundling constraint is met. Instead, the
tools optimized the asynchronous interface between modules and spread out
the module internals. The result was very fast handshake turnaround but
poor performance of the logic within the module. Eventually, by improving
the constraints a satisfactory result was achieved.
Additionally, the timing analysis tools were unable to analyze the timing
loops present in the AFSM logic. The port logic equations generated by the
3D tool contain a number of loops and the static timing analyzer refused to
analyze them. This meant that the timing had to be manually analyzed and
the only automated validation was through back-annotated simulations. We
believe that the standard tools should be able to comprehend the logic but
there was not enough time available to solve the problem during this eﬀort.
5 Conclusion
This paper presented a reconﬁgurable GALS test chip built on a 130nm process
technology. The ﬁrst goal of the design has already been completed: to gain
expertise fabricating GALS logic within a standard ASIC design flow. Although
significant challenges were encountered, the design was completed with limited
resources using a minimally modified commercial design flow. With hindsight,
many of the problems could have been avoided. The fabricated test chips are
anxiously awaited and expected to be available in July 2005.
References
[1] David S. Bormann. Globally asynchronous locally synchronous design for low power radio
baseband. In Proc. WNCG Wireless Networking Symposium, Austin, TX, October 2004.
[2] David S. Bormann and Peter Y.K. Cheung. Asynchronous wrapper for heterogeneous systems.
In Proc. International Conf. Computer Design (ICCD), October 1997.
[3] Daniel M. Chapiro. Globally-Asynchronous Locally-Synchronous Systems. PhD thesis,
Stanford University, October 1984.
[4] Miloš Krstić and Eckhard Grass. GALSification of IEEE 802.11a baseband processor. In Enrico
Macii, Odysseas G. Koufopavlou, and Vassilis Paliouras, editors, Power and Timing Modeling,
Optimization and Simulation (PATMOS), volume 3254 of Lecture Notes in Computer Science,
pages 258–267, September 2004.
[5] S. W. Moore, G. S. Taylor, P. A. Cunningham, R. D. Mullins, and P. Robinson. Self-calibrating
clocks for globally asynchronous locally synchronous systems. In Proc. International Conf.
Computer Design (ICCD), September 2000.
[6] Simon Moore, George Taylor, Robert Mullins, and Peter Robinson. Point to point GALS
interconnect. In Proc. International Symposium on Advanced Research in Asynchronous
Circuits and Systems, pages 69–75, April 2002.
[7] Jens Muttersbach. Globally-Asynchronous Locally-Synchronous Architectures for VLSI
Systems. PhD thesis, ETH, Zürich, 2001.
[8] Jens Muttersbach, Thomas Villiger, and Wolfgang Fichtner. Practical design of globally-
asynchronous locally-synchronous systems. In Proc. International Symposium on Advanced
Research in Asynchronous Circuits and Systems, pages 52–59, April 2000.
[9] Ad Peeters and Kees van Berkel. Single-rail handshake circuits. In Asynchronous Design
Methodologies, pages 53–62. IEEE Computer Society Press, May 1995.
[10] George Taylor, Simon Moore, Steev Wilcox, and Peter Robinson. An on-chip dynamically
recalibrated delay line for embedded self-timed systems. In Proc. International Symposium on
Advanced Research in Asynchronous Circuits and Systems, pages 45–51, April 2000.
[11] Kenneth Y. Yun and David L. Dill. Automatic synthesis of extended burst-mode circuits:
Part I (specification and hazard-free implementation). IEEE Transactions on Computer-Aided
Design, 18(2):101–117, February 1999.
[12] Kenneth Y. Yun and David L. Dill. Automatic synthesis of extended burst-mode circuits:
Part II (automatic synthesis). IEEE Transactions on Computer-Aided Design, 18(2):118–132,
February 1999.
[13] Kenneth Y. Yun and A. E. Dooply. Pausible clocking-based heterogeneous systems. IEEE
Transactions on VLSI Systems, 7(4):482–488, December 1999.
