Design Challenges for a Differential-Power-Analysis Aware GALS-based AES Crypto ASIC  by Gürkaynak, Frank K. et al.
Frank K. Gürkaynak, Stephan Oetiker, Hubert Kaeslin
Norbert Felber, Wolfgang Fichtner
Integrated Systems Laboratory
Swiss Federal Institute of Technology
Zurich, Switzerland
Abstract
In recent years several successful GALS realizations have been presented. The core of a GALS
system is a locally synchronous island that is designed using industry standard synchronous design
methodologies. In principle, any functional synchronous block can be encapsulated as a locally
synchronous island to form a GALS module. There are, however, several important trade-oﬀs and
design decisions involved in doing so. Partitioning a design into several GALS compatible modules
is still the most diﬃcult task facing GALS system designers. The controlling state machine of
a synchronous functional block may need to be enhanced signiﬁcantly to accommodate varying
latencies involved in data transfers between GALS modules.
Such design challenges cannot easily be generalized. In this paper, they are presented based on
the experience of designing a GALS system that implements a cryptographic algorithm. The example
design uses the GALS methodology to improve resistance against cryptographic power attacks. The
problem of side channel attacks against hardware implementations of cryptographic algorithms is
briefly presented first, and the GALS architecture featuring several countermeasures against such
attacks is introduced. The main part of the paper concentrates on the design decisions involved in
the development of this architecture.
Keywords: AES, DPA countermeasures, GALS
1 Introduction
Standard synchronous implementations of cryptographic algorithms are known
to ’leak’ additional information, like variable power consumption, while pro-
cessing cryptographic data. Such leaking information is generally known as
Electronic Notes in Theoretical Computer Science 146 (2006) 133–149
1571-0661 © 2006 Elsevier B.V. 
www.elsevier.com/locate/entcs
doi:10.1016/j.entcs.2005.05.039
Open access under CC BY-NC-ND license.
Fig. 1. Block diagram of a Gals module. [Blocks shown: local clock generator (Lclk), locally
synchronous island, input and output ports with the signals Pen, Ta, Req, Ack, Ri, Ai, DataIN and DataOUT.]
side channels. In recent years, several attacks have shown that it is possi-
ble to extract secret information protected by the cryptographic algorithm
by processing this side channel information. Consequently, much eﬀort has
been invested in reducing various forms of side channel information. Crypto-
graphic hardware designers have been especially interested in the mostly uni-
form power spectrum of asynchronous circuit implementations, as the power
dissipation represents by far the most easily exploited side channel. Not sur-
prisingly, a large number of contributions in the ﬁeld of asynchronous design
are associated with cryptographic hardware. The Gals methodology, which
aims to combine the advantages of asynchronous design with the convenience of
established synchronous design methodologies, is no exception [12].
While there are diﬀerent ﬂavors of the Gals methodology, in this paper
the term Gals will be used for the speciﬁc Gals methodology that is based
on the work of Muttersbach [13]. In this methodology, a locally synchronous
(LS) island that is designed by a standard digital design methodology is con-
verted into a Gals module by adding a self-timed wrapper. This self-timed
wrapper contains a local clock generator that can be paused by asynchronous
port controllers to ensure safe data transfers between interconnected Gals
modules. A simpliﬁed block diagram of a single Gals module with a single
input and output port is shown in ﬁgure 1.
In this paper, design challenges that are encountered when porting a stan-
dard synchronous design to Gals are presented on an example design that
implements a cryptographic algorithm. To explain the design rationale, a
general overview of side-channel attacks against cryptographic hardware and
possible countermeasures is first given in section 2. The insight into how successful
side channel attacks are performed has led to the development of a Gals
based design. The Gals system implements the popular Aes algorithm and
is
is expected to provide increased resistance to such attacks. The architecture
and the different countermeasures provided by this architecture are briefly explained
in section 3. The process of converting a standard digital design into
several Gals-compatible locally synchronous islands involves many trade-offs.
Based on the Aes implementation, the partitioning, the design of suitable port
controllers, and testability are discussed in section 4. The resulting chip combines
the experiences and design methodologies obtained from earlier GALS
implementations, but is a completely new design. Finally, the conclusions are
drawn in section 5.
2 Diﬀerential Power Analysis (DPA) security
Cryptographic algorithms provide methods to convert plaintext information
into ciphertext with the help of a cipherkey. In a good cryptographic algo-
rithm, extracting the plaintext from the ciphertext without knowledge of the
cipherkey is practically impossible. Thus, the security provided by a crypto-
graphic algorithm is determined by its ability to keep the cipherkey secret.
Side channels are deﬁned as sources of information other than the direct
outputs of a system that implements a cryptographic algorithm. The power
consumption, electromagnetic radiation, the time required to complete an op-
eration, as well as the surface temperature of the system can all be considered
side channels. Side channel analysis attacks on cryptographic systems, ﬁrst
demonstrated by Kocher [10], try to extract parts of the cipherkey by observing
these side channels. The vast majority of modern cryptographic hardware is
manufactured using boolean logic gates designed with standard static CMOS
logic. The power consumption of these gates depends heavily on their input
switching activity. Simple Power Analysis (Spa) attacks use direct power
measurements to determine parts of the cipher key. These Spa attacks are
intuitive to understand, and efficient countermeasures against them can
easily be devised.
In Differential Power Analysis (Dpa) attacks [9], statistical methods are applied
to a large number of measurements on the same device. This method is
surprisingly effective, as even the slightest variation in power consumption can
be extracted given a sufficient number of measurements [17]. A Dpa attack targets
a specific operation of the cryptographic algorithm in which a portion of the
secret key (subkey) is combined with data. For an $m$-bit subkey there
are $K = 2^m$ possible subkey values. The bit length of the subkey is chosen
so that the number of subkey values remains manageable. For each
of the $K$ subkey values, $S$ different samples are processed using a
simplified power model of the circuit, and a hypothetical power consumption
matrix $H_{1..K,1..S}$ is estimated. Then the actual power consumption of the
device is measured while it encrypts the same $S$ samples using the unknown
secret key. The result is a vector $P_{1..S}$ that holds the corresponding power
consumption for all $S$ inputs. The correct subkey is revealed by correlating
the hypothetical power consumptions $H_{1..K,1..S}$ with the measured power
consumption $P_{1..S}$. In a successful attack, the correct subkey hypothesis $H_{k_c,1..S}$
will show a ’significantly’ higher correlation to the measured power $P_{1..S}$ than
all other subkey hypotheses.
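The correlation step described above can be illustrated with a small simulation. The sketch below is not the attack setup used by the authors. It assumes a toy 4-bit S-box (the one used in the PRESENT cipher), a Hamming-weight power model and Gaussian measurement noise; all function names and parameter values are hypothetical.

```python
import random

# Toy 4-bit S-box (assumption: borrowed from the PRESENT cipher, not from AES).
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def hamming_weight(x):
    """Number of set bits; used here as a simplified power model."""
    return bin(x).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def measure(samples, secret, noise, rng):
    """Simulated power vector P[1..S]: model leakage plus Gaussian noise."""
    return [hamming_weight(SBOX[d ^ secret]) + rng.gauss(0.0, noise)
            for d in samples]


def dpa_attack(samples, power):
    """Build H[1..K,1..S] and correlate every row with the measured power."""
    k_count = len(SBOX)  # K = 2^m with m = 4
    hypotheses = [[hamming_weight(SBOX[d ^ k]) for d in samples]
                  for k in range(k_count)]
    corr = [pearson(h, power) for h in hypotheses]
    best = max(range(k_count), key=lambda k: corr[k])
    return best, corr


rng = random.Random(1)
samples = [rng.randrange(16) for _ in range(2000)]       # S = 2000 inputs
power = measure(samples, secret=0xB, noise=1.0, rng=rng)
recovered, corr = dpa_attack(samples, power)
print("recovered subkey:", hex(recovered))
```

In this toy setting the correct hypothesis stands out clearly; the countermeasures discussed next aim to shrink that margin so that far more measurements are needed.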
Developing countermeasures against Dpa attacks has been an active re-
search area ever since the discovery of the ﬁrst attacks. The goal of Dpa
countermeasures is to increase the number of samples required to reveal the
subkey to a level where it is not feasible to perform such Dpa attacks. These
countermeasures fall into several categories:
• Using alternative logic styles with data independent switching activity
[12,18,19]: If cryptographic hardware can be designed with logic gates that have
a constant power consumption regardless of their input switching activity,
Dpa attacks would not be successful.
• Algorithmic methods that add random masks to the computation [1,6,15,16],
[2]: In a cryptographic algorithm, at some point the key is combined with
the data in some way. This leads to a power consumption that depends on
the cipherkey and on the data processed by the cryptographic device. Al-
gorithmic methods try to avoid this by combining the data with a random
mask that changes after each operation.
• Generating additional noise to make measurements more diﬃcult [11]: The
attacker is interested in retrieving the variance of power consumption of
only a small subset of the circuit over a number of samples. Any power
consumption of operations performed in parallel, as long as they exhibit
uncorrelated switching activity over the measurements, will be perceived as
’noise’ for the Dpa measurement and makes it more diﬃcult to retrieve the
secret key.
• Confusing the measurements by adding dummy operations to the crypto-
graphic process [4,11,3]: The attacker needs to observe the power consump-
tion of the same operation for a large number of samples. If the exact time
when this particular operation is performed can be varied randomly, the
attacker would be forced to collect more data.
The ﬁrst two alternatives listed above require modiﬁcations to the design
methodology, or the algorithm itself. The last two alternatives, on the other
hand, are applicable to all implementations and should always be considered
while designing custom cryptographic hardware.
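The effect of the last countermeasure can be illustrated numerically. The following sketch uses a toy model that is not taken from the paper: the leakage is the Hamming weight of a known 4-bit value, the targeted operation is executed in one of `slots` time slots chosen at random, and the other slots execute dummy operations on random data. The attacker is assumed to always observe slot 0.

```python
import random


def hamming_weight(x):
    return bin(x).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def correlation_at_slot0(slots, n_samples, rng, noise=1.0):
    """Correlation between the attacker's hypothesis and the power in slot 0."""
    data = [rng.randrange(16) for _ in range(n_samples)]
    power = []
    for d in data:
        real_slot = rng.randrange(slots)          # where the real operation runs
        # Slot 0 carries the real value only if it was picked; otherwise a dummy.
        value = d if real_slot == 0 else rng.randrange(16)
        power.append(hamming_weight(value) + rng.gauss(0.0, noise))
    hypothesis = [hamming_weight(d) for d in data]
    return pearson(hypothesis, power)


rng = random.Random(7)
fixed = correlation_at_slot0(slots=1, n_samples=4000, rng=rng)
shuffled = correlation_at_slot0(slots=4, n_samples=4000, rng=rng)
print("fixed timing:    %.3f" % fixed)
print("random timing:   %.3f" % shuffled)
```

In this model the correlation seen by the attacker drops roughly in proportion to the number of possible time slots, so more measurements are needed to reach the same confidence.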
3 DPA-Aware GALS Architecture
The Gals design methodology gives designers additional methods to imple-
ment Dpa countermeasures. To demonstrate the eﬃciency of such Gals based
Dpa countermeasures, a crypto chip named Acacia was designed. Acacia im-
plements the popular Advanced Encryption Standard (Aes) cryptographic
algorithm [14]. Aes consists of a round function composed of four main op-
erations:
• The AddRoundKey operation is a simple bit-wise Xor operation that adds
a round key to the processed data. It is this AddRoundKey operation that
is the main problem in terms of Dpa attacks.
• SubBytes is a non-linear 8-bit substitution operation that occupies significant
area. Transformations that allow a smaller realization result in significant
penalties in execution time. Typically, the number of parallel SubBytes
operations performed within one clock cycle has a strong influence on the
overall performance of the system, and is referred to as the datapath width
of an Aes architecture.
• The ShiftRows operation is a ﬁxed permutation that, unlike in software
implementations, can be implemented without signiﬁcant resources.
• Finally MixColumns combines four bytes by a simple matrix multiplication.
The Aes cipher using 128 bit cipher key (Aes-128) is composed of 10 iden-
tical rounds with the exception of the last round that does not include the
MixColumns operation. For the Aes modes that use 192 bit (Aes-192) and
256 bit cipher keys (Aes-256) the same round operation is repeated 12 and
14 times respectively. Each round uses a separate round key that is derived
from the cipher key by a key schedule deﬁned in the Aes standard.
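As an illustration of the round structure, the sketch below lists the number of rounds per key length and models the two operations that are cheap in hardware, AddRoundKey and ShiftRows, on a 16-byte state stored column by column as in the Aes standard. SubBytes, MixColumns and the key schedule are omitted, and the function names are illustrative only.

```python
# Number of rounds for each cipher key length (in bits).
ROUNDS = {128: 10, 192: 12, 256: 14}


def add_round_key(state, round_key):
    """Bit-wise XOR of the 16-byte state with a 16-byte round key."""
    return [s ^ k for s, k in zip(state, round_key)]


def shift_rows(state):
    """Rotate row r of the state left by r positions.

    The state is a flat list of 16 bytes in column-major order,
    i.e. the byte in row r and column c is state[r + 4 * c].
    """
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]


state = list(range(16))
print(shift_rows(state))
print(add_round_key(state, [0xFF] * 16))
```

Because ShiftRows is a fixed permutation, it reduces to wiring in hardware, which matches the remark that it needs no significant resources.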
The Acacia architecture shown in ﬁgure 2 contains a large 128-bit dat-
apath, called Goliath, that houses the round key generator, the AddRound-
Key and the ShiftRows operations. Two smaller, identical 32-bit datapaths,
called David, contain the remaining Aes round operations SubBytes and Mix-
Columns.
Acacia has been designed with several layers of countermeasures against
Dpa attacks:
(i) All datapath elements in Acacia are designed to operate continuously.
If the cryptographic schedule is unable to provide ’real’ data during a
clock cycle, the datapath elements are fed from pseudo-random num-
ber generators and perform operations with ’fake’ data. Since exactly
the same hardware is utilized, these ’real’ and ’fake’ operations are not
distinguishable externally.
Fig. 2. Simplified block diagram of Acacia. [Blocks shown: a synchronous Interface with a 128-bit
register; Goliath_Gals containing the key generator, round key memory, AddRoundKey and ShiftRows;
two David_Gals modules, each containing SubBytes, MixColumns and a 32-bit register. Every Gals
module has its own local clock generator.]
(ii) David contains two identical 8-bit SubBytes operators. Prior to executing
one MixColumns operation, a total of four 8-bit SubBytes operations
have to be processed. David can schedule these four independent operations
randomly. In a given clock cycle, both, only one, or neither of the two
SubBytes datapath elements may process ’real’ data while the remaining
datapaths are fed with random data. The controller keeps track of the
progress and executes the MixColumns operation after all four SubBytes
operations have been performed.
(iii) For each Aes round, the results of four 32-bit MixColumns operations
are required. In similar fashion, Goliath can schedule these operations
in any order between the two David datapaths. Once all four operations
are completed, the next round operations are performed.
(iv) All three datapaths are implemented as Gals modules with their own
local clock generators. A special block is used to interface with a syn-
chronous external clock. The operation speed of the Aes crypto core
is totally independent from the external clock. In this arrangement the
attacker can not reduce the clock rate to perform measurements at a rate
that is more convenient for precise monitoring of the supply current. Fur-
thermore the clock rates of the individual Gals modules are independent
and not tied to each other.
(v) All Gals modules use a special local clock generator that can generate
clock pulses with different lengths for each cycle. A pseudo-random
number generator is used to control the period of each clock cycle.
Fig. 3. Simplified timing diagram of the Gals modules of Acacia. Operations where the cryptographic
data is not processed are shown in gray.
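The idea behind the randomized clock of item (v) can be sketched in a few lines. The clock generator of Acacia is not described at this level of detail here, so everything below is an assumption made for illustration: a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) serves as the pseudo-random source, and its three low bits select one of eight hypothetical clock periods.

```python
BASE_PERIOD_NS = 4.0   # hypothetical shortest clock period
STEP_NS = 0.5          # hypothetical delay increment per selection step


def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR with taps 16, 14, 13 and 11 by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def clock_periods(seed, cycles):
    """Return a list of pseudo-random clock periods, one per clock cycle."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("the all-zero state locks up an LFSR")
    periods = []
    for _ in range(cycles):
        state = lfsr16_step(state)
        # The three low bits pick one of eight period lengths. Consecutive
        # states share bits, so a real design would decorrelate them further.
        periods.append(BASE_PERIOD_NS + STEP_NS * (state & 0x7))
    return periods


print(clock_periods(0xACE1, 8))
```

With independent seeds in each module, the clock edges of the modules drift relative to each other from cycle to cycle, which is the property the countermeasure relies on.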
The combination of all the listed countermeasures is expected to provide a
serious challenge to attackers using Dpa techniques. A simpliﬁed timing dia-
gram of Acacia that demonstrates the above mentioned countermeasures can
be seen in Figure 3. In this diagram, the clock signal of each datapath, the
operation that is being processed, and the tasks assigned to the two sub-units are
given. The ’fake’ operations, shown in gray, supply the regular datapath units
with data from pseudo-random number generators. In the figure, Goliath can
be observed to perform an AddRoundKey operation and then schedule three
of the four MixColumns operations in random order between two David datapaths.
The two datapaths, although processing the same operations, will
exhibit different latencies due to both the different clock rates and the varying
number of clock cycles within the datapaths. In the figure, even though
the fourth MixColumns operation is assigned after the second MixColumns
operation, it is processed faster.
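The random scheduling inside one David unit, as visible in the timing diagram, can be mimicked with the following sketch. The scheduling policy is an assumption: in every cycle each of the two SubBytes units independently processes a pending real byte with probability `p_real` and a dummy byte otherwise. The paper does not specify the probabilities or the selection logic used in Acacia.

```python
import random


def schedule_david(rng, p_real=0.4):
    """Return a per-cycle timeline of what the two SubBytes units process."""
    pending = ["Sub1", "Sub2", "Sub3", "Sub4"]
    rng.shuffle(pending)                 # random order of the four real bytes
    timeline = []
    while pending:
        cycle = []
        for _unit in range(2):           # two parallel SubBytes units
            if pending and rng.random() < p_real:
                cycle.append(pending.pop())
            else:
                cycle.append("Dummy")    # unit is fed with random data
        timeline.append(tuple(cycle))
    # MixColumns may only run once all four SubBytes results are available.
    timeline.append(("MixColumns",))
    return timeline


for cycle, ops in enumerate(schedule_david(random.Random(42))):
    print(cycle, ops)
```

Each call produces a different number of cycles and a different ordering, so the latency of one MixColumns request varies from run to run even at a fixed clock rate.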
As mentioned earlier, for a successful Dpa attack the power consumption
of the same operation needs to be measured and compared over a large number
of samples. In a standard synchronous design, the individual clock cycles are
used as clear references for the cryptographic ﬂow. As the power consumption
of Acacia does not contain these references, the attacker is forced to collect
power measurements from a time window that contains multiple independent
operations. A proper evaluation of the Dpa countermeasures implemented in
Acacia with respect to other alternatives will be the topic of a further study.
4 Challenges of Designing GALS Compatible Systems
A Gals design clearly separates communication from functionality. The com-
munication between Gals modules is performed asynchronously using hand-
shake protocols. The functionality, however, remains synchronous. Most of the
published work on Gals has concentrated mainly on the asynchronous
part. Theoretically, any synchronous functional block can be converted into
a Gals module, and the synchronous part has therefore not been given much
attention. However, depending on the exact Gals methodology employed, the locally synchronous
islands must satisfy certain conditions.
In the Gals methodology developed by Muttersbach [13], the LS island
uses a Port Enable (Pen) signal to activate a port controller. Once the data
transfer is complete, the port controller sets a Transfer Acknowledge (Ta)
signal high. The LS island must be designed in a way to accommodate these
signals for all data transfers to and from the island. The Pen signal drives an
asynchronous ﬁnite state machine, and must be free of spurious transitions.
The easiest way to ensure this is to have the Pen signal come directly out
of a register. Similarly, the Ta signal is set by the asynchronous ﬁnite state
machine and must be reliably processed by the LS island. In the most general
case, the input and output timing of an LS island that will be used in a Gals
module can be called tricky at best. To circumvent problems associated with
timing, Gals-compatible LS islands are required to have registers at both
data inputs and outputs.
4.1 GALS Partitioning
One of the leading open problems in the Gals design methodology is how
to partition a given design eﬃciently. Two main approaches for this problem
have been suggested. In methodologies where Gals is primarily seen as a
method to realize very large systems on chip, the size of the LS island that
is encapsulated by a Gals module is determined by the size of the circuit
that can be eﬃciently designed using the existing synchronous design ﬂow
with a reasonable eﬀort. A second approach partitions the design according
to functionality. In terms of eﬃciency, the following two points must be kept
in mind when determining a partitioning for a Gals design:
(i) Each Gals module brings some performance overhead, so a fine-grained
partitioning will have lower performance than a coarse-grained partitioning.
(ii) Basically, Gals modules run independently at their optimum speed until
they need to exchange data. During data transfer, the Gals module is
synchronized to its communication partners. This process may involve
slowing down one or both of the modules. Therefore, Gals modules that
exchange data every clock cycle with each other are hardly eﬃcient.
The ﬁrst constraint favors a partitioning dictated by the circuit complexity,
while the second constraint can more easily be met when using a functional
partitioning. Overall, a functional partitioning seems to oﬀer more advantages.
The security concept developed for Acacia requires multiple LS islands that
are clocked independently. The additional security oﬀered by this approach is
the increased eﬀort required on part of the attacker to determine the state of
the operation. This has two main consequences:
(i) There must be multiple LS islands.
(ii) The LS islands should not exchange data too frequently, as during
data transfers the local clocks of two modules are synchronized to each
other. If two LS islands exchanged data in every cycle, the two clocks would
remain synchronous to each other, which would take away any advantage
gained by using independent clocks.
A fully parallel AES cipher can be realized using around 100,000 gates. This
is not a large amount, even for a relatively mature technology like the 0.25µm
technology used in this project. After a careful analysis of the basic structure
of the Aes algorithm, its base operations are divided into two groups. The first
group consists of the operations ShiftRows and AddRoundKey and is called
Goliath. The second group, called David, consists of the remaining operations
MixColumns and SubBytes.
One round of AES transformations is equivalent to running the operations
in Goliath once, and the operations in David four times. To complete the
operations within David, one 32-bit MixColumns operation and four 8-bit
SubBytes operations are required. David has been designed to include two
parallel SubBytes units and similarly Goliath is designed to interface to two
identical David units. Figure 4 shows the partitioning of the encryption part
of the Aes algorithm (shown on the left) into David and Goliath. This new
arrangement is more suitable for a Gals implementation but results in a
relatively ﬁne-grained partitioning. The performance penalties incurred are a
tradeoﬀ for the increased Dpa security that this architecture is able to oﬀer.
A common misconception is to attribute the additional overhead of a given
partitioning only to the self-timed wrapper. The LS island must be modiﬁed so
that it can properly interface with the asynchronous controllers. It is extremely
difficult to quantify the overhead involved in adapting an LS island so that it can
be part of a Gals module. A synchronous version of Acacia without any
specific Dpa countermeasures has been integrated for comparison purposes.
Fig. 4. Decomposition of the AES encryption datapath (left) into a 128-bit Goliath (middle) and
two 32-bit David datapath units (right). [Blocks shown: 128-bit and 32-bit registers, selection
blocks, AddRoundKey, ShiftRows, MixColumns, and SubBytes split into multiplicative inverse and
affine transform.]
Parameters for both circuits are presented in table 1. It must be noted that the
huge diﬀerence in circuit sizes can largely be attributed to the additional Dpa
countermeasures, and diﬀerent optimization constraints used during synthesis
(area constraints were given a higher priority for the synchronous design). It
can be seen from table 1 that the area overhead of the self-timed wrapper
alone is relatively small (7% for David and 3.3% for Goliath), even for such a
ﬁne grained Gals system. The area of the local clock generator and the port
controllers is given separately in table 1. The number for the port controller
appears large in comparison to the clock generator, since it also contains the
latches required for the data transfer between Gals modules.
The important difference between the two implementations lies in the latency
of the individual blocks. Since the handshaking protocol of Gals requires
that data is stored in registers both at the input and at the output,
the data transfer between Gals modules costs an additional clock cycle
when compared to a typical synchronous solution. Depending on the implementation,
this disadvantage can be partly compensated if one of the Gals modules
has a shorter critical path. Although the Gals solution requires ten instead
of seven clock cycles for an encryption round (an increase of more than 40%),
Acacia is able to complete one round of encryption within 104% of the
time required for the synchronous solution.
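The timing and area figures quoted above can be cross-checked against Table 1. The short script below only redoes the arithmetic. It assumes, as the table totals suggest, that the Gals round time is the sum of eight David cycles and two Goliath cycles at their respective critical paths, that the chip contains two David units and one Goliath, and that the LFSR area is already included in the LS area.

```python
# Round time: cycles multiplied by the critical path (ns) of the clock domain.
sync_round_ns = 7 * 5.84                      # single synchronous clock
gals_round_ns = 8 * 3.98 + 2 * 5.27           # David cycles + Goliath cycles
time_ratio = gals_round_ns / sync_round_ns
cycle_increase = (8 + 2) / 7 - 1

# Self-timed wrapper area (clock generator + port controllers), in square microns.
david_wrapper = 7579 + 6225
goliath_wrapper = 7626 + 11412
david_overhead = david_wrapper / (183007 + david_wrapper)
goliath_overhead = goliath_wrapper / (551194 + goliath_wrapper)

# Chip totals: two David units and one Goliath.
sync_total = 2 * 93123 + 207031
gals_total = 2 * (183007 + david_wrapper) + (551194 + goliath_wrapper)

print("round time: %.2f ns vs %.2f ns (%.1f%%)"
      % (sync_round_ns, gals_round_ns, 100 * time_ratio))
print("cycle increase: %.1f%%" % (100 * cycle_increase))
print("wrapper overhead: David %.1f%%, Goliath %.1f%%"
      % (100 * david_overhead, 100 * goliath_overhead))
print("total area:", sync_total, "vs", gals_total)
```

Under these assumptions the script reproduces the 40.88 ns and 42.38 ns round times, the 7% and 3.3% wrapper overheads, and the area totals to within rounding.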
4.2 Additional Control Complexity
The state machine of a Gals compatible LS island must be able to control
data transfers with diﬀerent latencies between Gals modules. As an example,
consider the task of the controller in Goliath that needs to schedule four
MixColumns operations to two datapaths for one encryption round. Assume
that the ﬁrst operation MixColumns1 was scheduled on the ﬁrst datapath and
the second operation MixColumns2 was scheduled on the second datapath
Table 1
Comparison of the Gals and synchronous implementations of the Aes partitioning used in
Acacia. All numbers given are synthesis results for a 0.25 µm CMOS technology. The GALS
version was configured for high throughput.

                                   Synchronous            Gals + Dpa
                                   David      Goliath     David      Goliath
Area (µm²), LS                     93,123     207,031     183,007    551,194
Area (µm²), LFSRs                  N.A.       N.A.        26,928     73,512
Area (µm²), ClockGen               N.A.       N.A.        7,579      7,626
Area (µm²), Ports                  N.A.       N.A.        6,225      11,412
Area (µm²), TOTAL                  393,277                963,855
Critical path (ns)                 5.43       5.84        3.98       5.27
Latency (clock cycles)             3          1           4          2
Clock frequency (MHz)              170.96                 250.8      189.6
Encryption round (clock cycles)    7                      8          2
Encryption round time (ns)         40.88                  42.38
initially (which is not always the case, as either datapath may be unavailable
for the ﬁrst operation). Either of the datapaths may ﬁnish processing the
assigned operation in a given cycle. Therefore the controller must be able to
handle the following four situations:
(i) Both datapaths continue processing and are not available.
(ii) Both datapaths ﬁnish processing, and MixColumns3 is scheduled on the
ﬁrst datapath and the last operation MixColumns4 is scheduled on the
second datapath.
(iii) MixColumns1 is ﬁnished, and MixColumns3 is scheduled on the ﬁrst dat-
apath.
(iv) MixColumns2 is ﬁnished, and MixColumns3 is scheduled on the second
datapath.
In the last two cases, the remaining MixColumns4 would need to be scheduled
on the ﬁrst available datapath. It is indeed possible that the datapath that was
assigned MixColumns3 will be available before (or at the same time as) the
other one. Note that the situation described here can occur even if none of the
previously described Dpa countermeasures were implemented. The latency of
the datapath, which is in a separate Gals module, relative to the controller
may also change as a result of data transfers with a third Gals module, or
as a result of the unbounded delay of the mutual exclusion elements used in
the synchronization sub-system. Probably the highest penalty paid for a more
complex state machine, apart from the cost of properly implementing it, is the
increased eﬀort that is required for the functional veriﬁcation of such a state
machine.
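The scheduling problem described in this section can be made concrete with a small cycle-based model. It is only a sketch under assumed conditions: each datapath reports completion after a random number of controller cycles (a stand-in for the varying latencies of the David modules), and the controller hands the next pending MixColumns operation to whichever datapath is free. The latency range and all names are hypothetical.

```python
import random


def dispatch_round(rng, min_latency=3, max_latency=8):
    """Schedule MixColumns1..4 on two datapaths with unpredictable latency.

    Returns a log of (cycle, event, operation, datapath) tuples.
    """
    pending = [1, 2, 3, 4]          # MixColumns operations still to be issued
    running = [None, None]          # operation currently on each datapath
    busy_until = [0, 0]             # cycle in which each datapath finishes
    finished = 0
    cycle = 0
    log = []
    while finished < 4:
        for dp in range(2):
            # A datapath may or may not have finished in this cycle, which
            # yields the four situations the controller has to handle.
            if running[dp] is not None and busy_until[dp] <= cycle:
                log.append((cycle, "done", running[dp], dp))
                running[dp] = None
                finished += 1
            if running[dp] is None and pending:
                op = pending.pop(0)
                running[dp] = op
                busy_until[dp] = cycle + rng.randint(min_latency, max_latency)
                log.append((cycle, "start", op, dp))
        cycle += 1
    return log


for entry in dispatch_round(random.Random(3)):
    print(entry)
```

Even this simplified model shows why functional verification becomes harder: the order in which operations complete, and the datapath that receives MixColumns4, differ from run to run.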
4.3 The Case with Port Controllers
In GALS design, the port controllers govern the communication between
GALS modules. The speciﬁc implementation of the port controller can have
signiﬁcant ramiﬁcations on the design of the LS island as well. Muttersbach
[13] has presented a demand-type port controller. This controller, once ac-
tivated by the Pen signal, immediately pauses the local clock generator and
performs the handshake with its communication partner. Until the data trans-
fer is complete, no clock edge is generated, and the LS island is eﬀectively
suspended. If properly implemented, a demand-type port controller would be
able to work without the Ta signal. In this case the controller of the LS island
can be signiﬁcantly simpliﬁed, as pending data transfers need not be taken
into account.
In terms of Dpa security, it is good practice to reveal as little as possible
on the operation. Suspending the LS islands while they wait for new data to
process may reduce the power consumption, but while doing so may also reveal
the eﬀective state of the operation to the attacker. Therefore, demand-type
controllers are not used in Acacia. Both GALS modules are allowed to run
normally, until both have signalled their readiness to transfer data. Within
each GALS module, the Ta signal must be evaluated by the LS island to
determine whether or not the data transfer has been completed.
The port controllers used by Muttersbach are designed to transfer data
in subsequent clock cycles and are triggered by a change in the Pen signal.
This fast mode of operation is not possible if the Ta signal, corresponding to
each data transfer request, needs to be evaluated. In Acacia, a new simplified
port controller that is triggered by a level-sensitive Pen signal has been used
instead. Once activated, these controllers wait until the communication partner signals its
readiness. The clock is only paused momentarily during data transfer. The
port controllers have been designed using a signal transition graph (STG) and
have been converted into two-level logic equations using the Petrify tool [5].
The equations have then been mapped to the gate level netlists manually.
This process requires some experience, but should not be a serious challenge
for designers accustomed to working with highly complicated EDA tools.
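The behaviour described in this section, where both modules keep running until both have signalled their readiness, can be approximated by the following time-step sketch. It abstracts away the asynchronous handshake and the clock pausing entirely and is not derived from the STG of the Acacia port controller: each island raises a level-sensitive `pen` flag when it is ready, the transfer happens once both flags are high, and the waiting island is assumed to spend the intervening steps on dummy operations. All names are illustrative.

```python
def simulate_transfer(ready_step_a, ready_step_b, max_steps=100):
    """Return (transfer_step, dummy_steps_a, dummy_steps_b)."""
    pen_a = pen_b = False
    for step in range(max_steps):
        # Pen is level sensitive: once raised it stays high until Ta is seen.
        pen_a = pen_a or step >= ready_step_a
        pen_b = pen_b or step >= ready_step_b
        ta = pen_a and pen_b        # transfer acknowledged to both islands
        if ta:
            # Steps spent waiting are filled with dummy operations instead
            # of suspending the island, so the wait is not visible outside.
            return step, step - ready_step_a, step - ready_step_b
    raise RuntimeError("communication partner never became ready")


print(simulate_transfer(2, 5))
```

A demand-type controller would instead stop the clock of the first island for the whole waiting period, which is exactly the observable idle phase that Acacia tries to avoid.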
Fig. 5. The one-sided port controller used in Acacia. [Left: STG over the transitions of Pen, Ta,
Req, Ack, Ri and Ai. Right: connections between the synchronous island (Clk, Enable, Done, two
Muller-C elements), the one-sided port controller, and the local clock generator with mutex,
delay logic, Lclk and clock tree.]
4.4 One-Sided Port Controllers
In the most general case, Gals communication requires port controllers and
pausable clocks on both the transmitting and the receiving Gals module
to enable reliable communication. This method of communication places no
additional timing constraints on the system itself. Consider the case where
the clock period of one GALS module is signiﬁcantly longer than that of the
other. Depending on the diﬀerence between the clock periods, and the amount
of additional interruptions that any of the GALS modules may receive, it is
possible to imagine scenarios where only the faster GALS module slows its
own clock to exchange data without interrupting the local clock of the slower
GALS module in any way. Such port controllers, where only one Gals module
adjusts its clock period to synchronize to another clock domain are called
one-sided port controllers. Standard synchronous blocks can be designed to
interface to such one-sided ports reliably.
The communication between the synchronous interface and Goliath (seen
in Figure 2), is governed by a one-sided port controller, whose connections
and STG are shown in more detail in Figure 5. The interface uses only rising-edge
triggered flip-flops and initiates the data transfer with the Enable signal.
A Muller-C element combines this Enable signal with the negated clock of
the Interface, so that the port controller receives the Req signal only during
the second half of the clock period where the value of Enable is stable. The
Interface monitors the Done signal to conclude the data transfer. A second
Muller-C element combines the Ack signal from the port controller and the
clock to ensure that the Done signal only changes its value during the ﬁrst half
of the clock cycle, avoiding setup and hold violations at the Interface. Similar
to the LS islands, the inputs and outputs of the block that communicates
with a one-sided port controller should be registered to simplify the timing
constraints.
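The role of the two Muller-C elements can be illustrated with a behavioural model. A C-element sets its output when all inputs are 1, clears it when all inputs are 0, and otherwise holds its previous value. The sketch below applies this to the request path described above, Req = C(Enable, not Clk). Signal names follow Figure 5, while the class itself is a hypothetical software model and not the gate-level circuit.

```python
class MullerC:
    """Behavioural model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        # The output follows the inputs only when they agree; otherwise it holds.
        if a == b:
            self.out = a
        return self.out


req_gate = MullerC()
req_trace = []
# (Clk, Enable) pairs: Enable changes just after a rising clock edge.
for clk, enable in [(1, 0), (1, 1), (0, 1), (1, 1), (1, 0), (0, 0)]:
    req_trace.append(req_gate.update(enable, 1 - clk))

print(req_trace)
```

In this trace Req rises only once the clock has gone low while Enable is high, and falls only after Enable has been withdrawn and the clock is high again, so the port controller never sees Enable while it may still be changing.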
4.5 Test Coverage
Testing of asynchronous circuits is known to be a difficult task. A distinct
advantage of GALS-based systems is that asynchronous circuits, and thereby
all associated problems, are limited to the self-timed wrapper. Therefore,
rather than developing test solutions that are universally applicable to all
asynchronous circuits, dedicated solutions for a small set of circuits can be
developed [8]. As an example, the stuck-at fault dictionary of Acacia has more
than 154,000 entries, of which only 182 are from asynchronous port controllers.
In Acacia, all LS islands have their own separate scan-based test solution.
The local clock generators include a special test mode in which they multiplex a
test clock to their outputs. In this test mode the scan chains of all LS islands
can be accessed directly by external synchronous automated test equipment.
In this configuration, a test coverage of more than 96% has been obtained for
Acacia. The asynchronous port controllers, the local clock generators and all
registers, together with associated glue logic, that are directly connected to the
GALS interfaces cannot be tested for stuck-at faults with this method. The
remaining stuck-at faults are covered by applying functional test patterns to encrypt
random data values. The combined stuck-at test coverage obtained by both
methods for Acacia is 99.88%.
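The figures quoted above can be put in relation with a few lines of arithmetic. The sketch below treats the rounded values from the text as exact, which is an assumption: the text gives the dictionary size and the scan coverage only as lower bounds ("more than"). All variable names are illustrative.

```python
# Values quoted in the text, treated as exact for this illustration.
TOTAL_FAULTS = 154_000          # entries in the stuck-at fault dictionary
PORT_CONTROLLER_FAULTS = 182    # entries from asynchronous port controllers
SCAN_COVERAGE = 96.0            # percent, scan-based test only
COMBINED_COVERAGE = 99.88       # percent, scan plus functional patterns

# Share of the fault dictionary that belongs to asynchronous port controllers.
async_share = 100.0 * PORT_CONTROLLER_FAULTS / TOTAL_FAULTS

# Coverage added by the functional patterns on top of the scan test.
functional_gain = COMBINED_COVERAGE - SCAN_COVERAGE

# Approximate number of faults left undetected by both methods together.
uncovered = round(TOTAL_FAULTS * (100.0 - COMBINED_COVERAGE) / 100.0)

print(f"port controller share: {async_share:.3f} %")
print(f"gain from functional patterns: {functional_gain:.2f} points")
print(f"faults not covered: about {uncovered}")
```

Because the dictionary has more than 154,000 entries and the scan coverage exceeds 96%, the first two results are upper bounds rather than exact values.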
4.6 Design Methodology
A hierarchical design flow was used in the design of Acacia. The LS islands
were designed using an industry-standard digital design flow. The self-timed
wrapper that converts the LS island into a GALS module consists essentially of
the port controllers, the local clock generator and a few glue logic cells.
Generating the netlist for the self-timed wrapper is a trivial task. Although
automated scripts have been used for this purpose in earlier designs [7], it was
fairly easy to generate the netlists manually for Acacia, since it only requires
a total of five port controllers in three GALS modules.
The Ta signal generated by the port controllers needs to be sampled by
the LS island using its local clock. The port controllers have been designed in
a way to enable the Ta signal after the local clock generator is released. As
can be seen in the block diagram in Figure 5, the local clock signal, after being
released by the mutual exclusion element, travels through some internal logic
in the local clock generator and propagates through the clock tree within the
LS island. The Ta signal must be delayed to match the propagation delay of
the local clock. As the exact amount of the clock tree insertion delay is known
only during the back-end design flow, the delay matching has to be performed
during this stage as well.
Fig. 6. Layout of the Acacia design (block labels in the layout: Synchronous Version, David, David, Goliath, Interface). The left side of the die is occupied by another design.
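The delay matching for the Ta signal can be sketched as a small back-end calculation. The example below is only an illustration under stated assumptions: the function name ta_delay_buffers, the idea of building the delay from identical buffer cells, and all delay values are hypothetical and not taken from the Acacia design. It also assumes that rounding the delay up to a whole number of buffers is acceptable.

```python
import math


def ta_delay_buffers(clock_gen_ps, clock_tree_ps, ta_path_ps, buffer_ps):
    """Estimate how many delay buffers to insert on the Ta signal.

    clock_gen_ps  : delay through the local clock generator logic (ps)
    clock_tree_ps : clock tree insertion delay inside the LS island (ps)
    ta_path_ps    : delay already present on the Ta path (ps)
    buffer_ps     : delay of one (hypothetical) delay buffer cell (ps)

    Returns (number_of_buffers, mismatch_ps).
    """
    if buffer_ps <= 0:
        raise ValueError("buffer delay must be positive")
    # Delay by which the local clock lags behind the Ta signal.
    mismatch_ps = clock_gen_ps + clock_tree_ps - ta_path_ps
    if mismatch_ps <= 0:
        # Ta already arrives no earlier than the clock; nothing to insert.
        return 0, mismatch_ps
    # Round up so that the inserted delay is at least the mismatch.
    return math.ceil(mismatch_ps / buffer_ps), mismatch_ps


# Hypothetical post-layout numbers, in picoseconds.
buffers, mismatch = ta_delay_buffers(150, 900, 200, 60)
print(buffers, mismatch)
```

Since the clock tree insertion delay is only available after clock tree synthesis, such a calculation would have to be repeated whenever the back-end results change.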
Connecting the GALS modules on the top level is just a matter of
providing the required interconnections. Unlike in a synchronous design, where
the input and output timings of all involved modules need to be balanced, no
additional steps need to be performed at this stage.
5 Conclusions
A GALS-based implementation of the AES algorithm has been designed
and sent to fabrication. The layout of the design can be seen in Figure 6.
The design is partitioned into three datapaths, which are realized as separate
GALS modules. Several key topics of migrating a standard synchronous design
into a GALS system have been discussed in this article.
Probably the most important task facing a designer who wants to
implement a GALS system is an efficient partitioning of the design into GALS
modules. Although data transfer between GALS modules can be achieved
reliably, it involves a certain overhead when compared to a synchronous design.
It is therefore advisable to place functional blocks that exchange data
in subsequent blocks within the same GALS module. A functional partitioning
scheme has better chances of achieving this objective.
The LS islands that compose the GALS module need to satisfy certain
conditions to resolve timing constraints at the GALS module level. Using
registered data inputs and outputs, and supporting a simple handshaking
protocol are essential. Implementing these changes may result in additional
latency in the system. Moreover, the state machines that control these LS
islands must be capable of dealing with more cases, which can significantly
increase their complexity. In turn, this increase in complexity also increases
the effort required for the functional verification of the system.
The port controllers used in GALS should be adapted to the specific
requirements of the design. As an example, when data transfers in subsequent
cycles are not required, simplified port controllers with reduced gate count
can be used. Although not described explicitly in this paper, the authors can
imagine that specialized port controllers that are suited for burst data
transfers, or port controllers that implement a 'time-out' feature, may need to be
developed to satisfy future design requirements. It is commonly assumed that
both sides of a GALS data transfer channel are involved in synchronization.
It was shown that, when the nominal period of one module is much shorter
than that of the other, the faster module can be controlled to adapt to the slower
module without interrupting the slower module. Port controllers that
support this one-sided operation can also be used to interface to modules with a
synchronous clock.
Traditionally, design and test automation have been seen as the main
problems facing a GALS implementation. Limiting the asynchronous circuits
strictly to the self-timed wrapper reduces hard-to-solve general problems to
easily solvable cases. For the presented example design, a combined stuck-at
test coverage of 99.88% has been obtained by complementing a scan test
generated with standard design tools with functional test patterns. Similarly,
standard hierarchical design flows can easily be adapted to the GALS design.
Problems that still cannot be addressed by present design tools are simple
tasks that can be performed manually for small designs or easily automated
for larger designs.
References
[1] Akkar, M.-L. and C. Giraud, An Implementation of DES and AES, Secure against Some
Attacks, in: CHES ’01: Revised Papers from the 3rd International Workshop on Cryptographic
Hardware and Embedded Systems, 2002, pp. 309–318.
[2] Blömer, J., J. Guajardo and V. Krummel, Provably Secure Masking of AES, in: Selected Areas
in Cryptography: 11th International Workshop, SAC 2004, 2004, pp. 69–83.
[3] Chari, S., C. S. Jutla, J. R. Rao and P. Rohatgi, Towards sound approaches to counteract power-
analysis attacks, in: CRYPTO ’99: Proceedings of the 19th Annual International Cryptology
Conference on Advances in Cryptology (1999), pp. 398–412.
[4] Clavier, C., J.-S. Coron and N. Dabbous, Differential power analysis in the presence of
hardware countermeasures, in: CHES ’00: Proceedings of the Second International Workshop
on Cryptographic Hardware and Embedded Systems (2000), pp. 252–263.
[5] Cortadella, J., M. Kishinevsky, A. Kondratyev, L. Lavagno and A. Yakovlev, Petrify: a tool
for manipulating concurrent specifications and synthesis of asynchronous controllers, IEICE
Transactions on Information and Systems E80-D (1997), pp. 315–325.
[6] Golic, J. D. and C. Tymen, Multiplicative Masking and Power Analysis of AES, in: CHES ’02:
Revised Papers from the 4th International Workshop on Cryptographic Hardware and Embedded
Systems (2003), pp. 198–212.
[7] Gürkaynak, F. K., S. Oetiker, T. Villiger, N. Felber, H. Kaeslin and W. Fichtner, On the
GALS Design Methodology of ETH Zurich, in: Proceedings of the Formal Methods For Globally
Asynchronous Locally Synchronous (GALS) Architecture FMGALS2003, 2003, pp. 181–189.
[8] Gürkaynak, F. K., T. Villiger, S. Oetiker, N. Felber, H. Kaeslin and W. Fichtner, A functional
test methodology for globally-asynchronous locally-synchronous systems, in: Proc. International
Symposium on Advanced Research in Asynchronous Circuits and Systems, 2002, pp. 181–189.
[9] Kocher, P., J. Jaffe and B. Jun, Differential power analysis, Lecture Notes in Computer Science
1666 (1999), pp. 388–397.
URL citeseer.ist.psu.edu/kocher99differential.html
[10] Kocher, P. C., Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other
systems, Lecture Notes in Computer Science 1109 (1996), pp. 104–113.
URL citeseer.ist.psu.edu/kocher96timing.html
[11] Mangard, S., “Securing Implementations of Block Ciphers against Side-Channel Attacks,”
Ph.D. thesis, Graz University of Technology (2004).
[12] Moore, S., R. Anderson, P. Cunningham, R. Mullins and G. Taylor, Improving smart card
security using self-timed circuits, in: Proc. International Symposium on Advanced Research in
Asynchronous Circuits and Systems, 2002, pp. 211–218.
[13] Muttersbach, J., “Globally-Asynchronous Locally-Synchronous Architectures for VLSI
Systems,” Ph.D. thesis, ETH, Zurich (2001).
[14] National Institute of Standards and Technology (NIST), Advanced Encryption Standard (AES),
FIPS Publication 197 (2001).
[15] Oswald, E., “On Side-Channel Attacks and the Application of Algorithmic Countermeasures,”
Ph.D. thesis, Graz University of Technology (2003).
[16] Pramstaller, N., F. K. Gürkaynak, S. Haene, H. Kaeslin, N. Felber and W. Fichtner,
DPA Resistant AES Crypto-Chip Design, in: Proc. European Solid-State Circuits Conference
(ESSCIRC) (2004), pp. 307–310.
[17] Örs, S. B., F. K. Gürkaynak, E. Oswald and B. Preneel, Power-Analysis Attacks on an
ASIC AES Implementation, in: Proc. of International Conference on Information Technology
(ITCC): Special Track on Embedded Cryptographic Hardware, 2004, pp. 546–552.
[18] Sokolov, D., J. Murphy, A. Bystrov and A. Yakovlev, Design and analysis of dual-rail circuits
for security applications, IEEE Transactions on Computers 54 (2005), pp. 449–460.
[19] Tiri, K. and I. Verbauwhede, Securing Encryption Algorithms against DPA at the Logic
Level: Next Generation Smart Card Technology, in: CHES ’02: Revised Papers from the 4th
International Workshop on Cryptographic Hardware and Embedded Systems, 2003, pp. 125–136.
