In recent years several successful GALS realizations have been presented. The core of a GALS system is a locally synchronous island that is designed using industry-standard synchronous design methodologies. In principle, any functional synchronous block can be encapsulated as a locally synchronous island to form a GALS module. There are, however, several important trade-offs and design decisions involved in doing so. Partitioning a design into several GALS-compatible modules is still the most difficult task facing GALS system designers. The controlling state machine of a synchronous functional block may need to be enhanced significantly to accommodate the varying latencies involved in data transfers between GALS modules. Such design challenges cannot be easily generalized; in this paper, they are presented based on the experience of designing a GALS system that implements a cryptographic algorithm. The example design uses the GALS methodology to improve resistance against cryptographic power attacks. The problem of side channel attacks against hardware implementations of cryptographic algorithms is briefly presented first, and the GALS architecture featuring several countermeasures against such attacks is introduced. The main part of the paper concentrates on the design decisions involved in the development of this architecture.
Introduction
Standard synchronous implementations of cryptographic algorithms are known to 'leak' additional information, such as variable power consumption, while processing cryptographic data. Such sources of leaked information are generally known as side channels. In recent years, several attacks have shown that it is possible to extract secret information protected by the cryptographic algorithm by processing this side channel information. Consequently, much effort has been invested in reducing various forms of side channel information. Cryptographic hardware designers have been especially interested in the mostly uniform power spectrum of asynchronous circuit implementations, as power dissipation is by far the most easily exploited side channel. Not surprisingly, a large number of contributions in the field of asynchronous design are associated with cryptographic hardware. The GALS methodology, which aims to combine the advantages of asynchronous design with the convenience of established synchronous design methodologies, is no exception [12]. While there are different flavors of the GALS methodology, in this paper the term GALS is used for the specific GALS methodology that is based on the work of Muttersbach [13]. In this methodology, a locally synchronous (LS) island that is designed by a standard digital design methodology is converted into a GALS module by adding a self-timed wrapper. This self-timed wrapper contains a local clock generator that can be paused by asynchronous port controllers to ensure safe data transfers between interconnected GALS modules. A simplified block diagram of a single GALS module with a single input and output port is shown in Figure 1.
In this paper, the design challenges encountered when porting a standard synchronous design to GALS are presented using an example design that implements a cryptographic algorithm. To explain the design rationale, a general overview of side-channel attacks against cryptographic hardware and of possible countermeasures is first given in section 2. The insight into how successful side channel attacks are performed has led to the development of a GALS-based design. The GALS system implements the popular AES algorithm and is expected to provide increased resistance to such attacks. The architecture and the different countermeasures it provides are briefly explained in section 3. The process of converting a standard digital design into several GALS-compatible locally synchronous islands involves many trade-offs. Based on the AES implementation, the partitioning, the design of suitable port controllers, and testability are discussed in section 4. The resulting chip combines the experiences and design methodologies obtained from earlier GALS implementations, but is a completely new design. Finally, conclusions are drawn in section 5.
Differential Power Analysis (DPA) security
Cryptographic algorithms provide methods to convert plaintext information into ciphertext with the help of a cipherkey. In a good cryptographic algorithm, extracting the plaintext from the ciphertext without knowledge of the cipherkey is practically impossible. Thus, the security provided by a cryptographic algorithm is determined by its ability to keep the cipherkey secret.
Side channels are defined as sources of information other than the direct outputs of a system that implements a cryptographic algorithm. The power consumption, electromagnetic radiation, the time required to complete an operation, as well as the surface temperature of the system can all be considered side channels. Side channel analysis attacks on cryptographic systems, first demonstrated by Kocher [10], try to extract parts of the cipherkey by observing these side channels. The vast majority of modern cryptographic hardware is manufactured using Boolean logic gates designed with standard static CMOS logic. The power consumption of these gates depends heavily on their input switching activity. Simple Power Analysis (SPA) attacks use direct power measurements to determine parts of the cipherkey. These SPA attacks are intuitive to understand, and efficient countermeasures against them can easily be devised.
In Differential Power Analysis (DPA) attacks [9], statistical methods are applied to a large number of measurements taken on the same device. This method is surprisingly effective, as even the slightest variation in power consumption can be extracted given a sufficient number of measurements [17]. The DPA attack targets a specific operation of the cryptographic algorithm in which a portion of the secret key (the subkey) is combined with data. For an $m$-bit subkey there will be $K = 2^m$ subkey permutations. The bit length of the subkey is chosen so that the number of subkey permutations remains manageable. For each of the $K$ subkey permutations, $S$ different samples are processed using a simplified power model of the circuit, and a hypothetical power consumption matrix $H_{1 \ldots K,\, 1 \ldots S}$ is estimated. Then the actual power consumption of the device is measured while it encrypts the same $S$ samples using the unknown secret key. The result is a vector $P_{1 \ldots S}$ that holds the corresponding power consumption for all $S$ inputs. The correct subkey is revealed by correlating the hypothetical power consumptions $H_{1 \ldots K,\, 1 \ldots S}$ with the measured power consumption $P_{1 \ldots S}$.
Developing countermeasures against DPA attacks has been an active research area ever since the discovery of the first attacks. The goal of DPA countermeasures is to increase the number of samples required to reveal the subkey to a level where it is not feasible to perform such DPA attacks. These countermeasures fall into several categories:
• Using alternative logic styles with data-independent switching activity [12, 19, 18]:
If cryptographic hardware can be designed with logic gates that have a constant power consumption regardless of their input switching activity, DPA attacks would not be successful.
• Algorithmic methods that add random masks to the computation [1, 6, 15, 16, 2]:
In a cryptographic algorithm, at some point the key is combined with the data in some way. This leads to a power consumption that depends on the cipherkey and on the data processed by the cryptographic device. Algorithmic methods try to avoid this by combining the data with a random mask that changes after each operation.
• Generating additional noise to make measurements more difficult [11]:
The attacker is interested in retrieving the variance of the power consumption of only a small subset of the circuit over a number of samples. Any power consumption of operations performed in parallel, as long as they exhibit uncorrelated switching activity over the measurements, will be perceived as 'noise' in the DPA measurement and makes it more difficult to retrieve the secret key.
• Confusing the measurements by adding dummy operations to the cryptographic process [4, 11, 3]:
The attacker needs to observe the power consumption of the same operation for a large number of samples. If the exact time when this particular operation is performed can be varied randomly, the attacker would be forced to collect more data.
The first two alternatives listed above require modifications to the design methodology, or the algorithm itself. The last two alternatives, on the other hand, are applicable to all implementations and should always be considered while designing custom cryptographic hardware.
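The correlation step of the DPA attack described in this section can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the procedure used by any particular attack in the cited literature: the 4-bit S-box (taken from the PRESENT cipher), the Hamming-weight power model, the Gaussian noise level, and all function names are chosen here purely for illustration.

```python
import math
import random

# Illustrative 4-bit S-box (the PRESENT cipher S-box), standing in for the
# operation that combines subkey and data. It is NOT the AES SubBytes table.
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

M_BITS = 4           # subkey length m (assumed)
K = 2 ** M_BITS      # number of subkey candidates, K = 2^m


def hamming_weight(x):
    return bin(x).count("1")


def power_model(data, subkey):
    """Simplified power model: Hamming weight of the targeted intermediate."""
    return hamming_weight(SBOX[data ^ subkey])


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def simulate_measurements(samples, secret, noise_sigma, rng):
    """Stand-in for real measurements: modelled leakage plus Gaussian noise."""
    return [power_model(d, secret) + rng.gauss(0.0, noise_sigma)
            for d in samples]


def dpa_attack(samples, measured):
    """Correlate every row of H (one per subkey candidate) with the vector P."""
    correlations = []
    for k in range(K):
        h_row = [power_model(d, k) for d in samples]   # row k of H
        correlations.append(pearson(h_row, measured))
    best = max(range(K), key=lambda k: correlations[k])
    return best, correlations


rng = random.Random(1)
samples = [rng.randrange(K) for _ in range(2000)]      # S = 2000 inputs
secret = 0xB                                           # unknown to the attacker
measured = simulate_measurements(samples, secret, noise_sigma=1.0, rng=rng)
recovered, correlations = dpa_attack(samples, measured)
print(f"recovered subkey: {recovered:#x}")
```

In this toy setting, raising `noise_sigma` or randomizing when the targeted operation occurs lowers the correlation of the correct candidate, which is the effect that the last two countermeasure categories aim for.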
DPA-Aware GALS Architecture
The GALS design methodology gives designers additional methods to implement DPA countermeasures. To demonstrate the efficiency of such GALS-based DPA countermeasures, a crypto chip named Acacia was designed. Acacia implements the popular Advanced Encryption Standard (AES) cryptographic algorithm [14]. AES consists of a round function composed of four main operations:
• The AddRoundKey operation is a simple bit-wise XOR operation that adds a round key to the processed data. It is this AddRoundKey operation that is the main problem in terms of DPA attacks.
• SubBytes is a non-linear 8-bit substitution operation that occupies significant area. Transformations that allow a smaller realization result in significant penalties in execution time. Typically the number of parallel SubBytes operations performed within one clock cycle has a strong influence on the overall performance of the system, and is described as the datapath width of an AES architecture.
• The ShiftRows operation is a fixed permutation that, unlike in software implementations, can be implemented in hardware without significant resources.
• MixColumns is a linear transformation that operates on 32-bit columns of the processed data.
Acacia has been designed with several layers of countermeasures against DPA attacks:
(i) All datapath elements in Acacia are designed to operate continuously. If the cryptographic schedule is unable to provide 'real' data during a clock cycle, the datapath elements are fed from pseudo-random number generators and perform operations with 'fake' data. Since exactly the same hardware is utilized, these 'real' and 'fake' operations are not distinguishable externally.
(ii) David contains two identical 8-bit SubBytes operators. Prior to executing one MixColumns operation, a total of four 8-bit SubBytes operations have to be processed. David can schedule these four independent operations randomly. In a given clock cycle, both, only one, or neither of the two SubBytes datapath elements may process 'real' data, while the remaining datapaths are fed with random data. The controller keeps track of the progress and executes the MixColumns operation after all four SubBytes operations have been performed.
(iii) For each AES round, the results of four 32-bit MixColumns operations are required. In similar fashion, Goliath can schedule these operations in any order between the two David datapaths. Once all four operations are completed, the next round operations are performed.
(iv) All three datapaths are implemented as GALS modules with their own local clock generators. A special block is used to interface with a synchronous external clock. The operation speed of the AES crypto core is completely independent of the external clock. In this arrangement the attacker cannot reduce the clock rate to perform measurements at a rate that is more convenient for precise monitoring of the supply current. Furthermore, the clock rates of the individual GALS modules are independent and not tied to each other.
(v) All GALS modules use a special local clock generator that can generate clock pulses with different lengths for each cycle. A pseudo-random number generator is used to control the period of each clock cycle.
The combination of all the listed countermeasures is expected to provide a serious challenge to attackers using DPA techniques. A simplified timing diagram of Acacia that demonstrates the above-mentioned countermeasures can be seen in Figure 3. In this diagram, the clock signal of each datapath, the operation that is being processed, and the task assigned to the two sub-units are given. The 'fake' operations, shown in gray, supply the regular datapath units with data from pseudo-random number generators. In the figure, Goliath can be observed to perform an AddRoundKey operation and then schedule three of the four MixColumns operations in random order between the two David datapaths. The two datapaths, although processing the same operations, will exhibit different latencies due to both the different clock rates and the varying number of clock cycles within the datapaths. In the figure, even though the fourth MixColumns operation is assigned after the second MixColumns operation, it is processed faster. As mentioned earlier, for a successful DPA attack the power consumption of the same operation needs to be measured and compared over a large number of samples. In a standard synchronous design, the individual clock cycles are used as clear references for the cryptographic flow. As the power consumption of Acacia does not contain these references, the attacker is forced to collect power measurements from a time window that contains multiple independent operations. A proper evaluation of the DPA countermeasures implemented in Acacia with respect to other alternatives will be the topic of a further study.
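The effect of countermeasures (i), (ii) and (v) on the timing of a single datapath can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the probability with which a unit takes 'real' data, the clock period values, and the function names are hypothetical and do not describe the actual Acacia controller or clock generator.

```python
import random


def david_round(rng, p_real=0.5):
    """Model one David datapath processing four SubBytes operations.

    Returns a list of cycles. Each cycle is a pair describing what the two
    SubBytes units did: an operation index 0..3 ('real' data) or None
    ('fake' data from a pseudo-random number generator).
    p_real is an assumed probability, not a value from the paper.
    """
    pending = [0, 1, 2, 3]
    rng.shuffle(pending)                # random order of the real operations
    schedule = []
    while pending:
        cycle = []
        for _unit in range(2):
            # Each unit independently processes real or pseudo-random data;
            # the same hardware is active in both cases.
            if pending and rng.random() < p_real:
                cycle.append(pending.pop())
            else:
                cycle.append(None)
        schedule.append(tuple(cycle))
    return schedule                     # MixColumns may start after this


def clock_periods(n_cycles, rng, base_ns=10.0, step_ns=1.0, steps=4):
    """Countermeasure (v): a pseudo-random period for every clock cycle.

    base_ns, step_ns and steps are illustrative values only.
    """
    return [base_ns + step_ns * rng.randrange(steps) for _ in range(n_cycles)]


rng = random.Random(2024)
for run in range(3):
    schedule = david_round(rng)
    periods = clock_periods(len(schedule), rng)
    print(f"run {run}: {len(schedule)} cycles, "
          f"{sum(periods):.1f} ns, schedule={schedule}")
```

Even in this simplified model, both the number of cycles and the duration of each cycle change from run to run, so the same SubBytes operation does not occur at a fixed offset in the measured power trace.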
Challenges of Designing GALS Compatible Systems
A GALS design clearly separates communication from functionality. The communication between GALS modules is performed asynchronously using handshake protocols. The functionality, however, remains synchronous. Most of the published work on GALS has concentrated on the asynchronous part. Theoretically, all synchronous functional blocks can be converted into GALS modules, and they have therefore not been given much attention. However, depending on the exact GALS methodology employed, the locally synchronous islands must satisfy certain conditions.
In the GALS methodology developed by Muttersbach [13], the LS island uses a Port Enable (Pen) signal to activate a port controller. Once the data transfer is complete, the port controller sets a Transfer Acknowledge (Ta) signal high. The LS island must be designed in a way to accommodate these signals for all data transfers to and from the island. The Pen signal drives an asynchronous finite state machine, and must be free of spurious transitions. The easiest way to ensure this is to have the Pen signal come directly out of a register. Similarly, the Ta signal is set by the asynchronous finite state machine and must be reliably processed by the LS island. In the most general case, the input and output timing of an LS island that will be used in a GALS module can be called tricky at best. To circumvent problems associated with timing, GALS-compatible LS islands are required to have registers at both data inputs and outputs.
GALS Partitioning
One of the leading open problems in the GALS design methodology is how to partition a given design efficiently. Two main approaches for this problem have been suggested. In methodologies where GALS is primarily seen as a method to realize very large systems on chip, the size of the LS island that is encapsulated by a GALS module is determined by the size of the circuit that can be efficiently designed using the existing synchronous design flow with a reasonable effort. A second approach partitions the design according to functionality. In terms of efficiency, the following two points must be kept in mind when determining a partitioning for a GALS design:
(i) Each GALS module brings some performance overhead; a fine-grained partitioning will therefore have lower performance than a coarse-grained partitioning.
(ii) Basically, GALS modules run independently at their optimum speed until they need to exchange data. During data transfer, the GALS module is synchronized to its communication partners. This process may involve slowing down one or both of the modules. Therefore, GALS modules that exchange data every clock cycle with each other are hardly efficient.
The first constraint favors a partitioning dictated by the circuit complexity, while the second constraint can more easily be met when using a functional partitioning. Overall, a functional partitioning seems to offer more advantages. The security concept developed for Acacia requires multiple LS islands that are clocked independently. The additional security offered by this approach is the increased effort required on part of the attacker to determine the state of the operation. This has two main consequences:
(i) There must be multiple LS islands.
(ii) The LS islands should not exchange data too frequently, as during data transfers the local clocks of two modules are synchronized to each other. If two LS islands exchanged data in every cycle, the two clocks would remain synchronous to each other, which would take away any advantage gained by using independent clocks.
A fully parallel AES cipher can be realized using around 100,000 gates. This is not a large amount, even for a relatively mature technology like the 0.25 µm technology used in this project. After a careful analysis of the basic structure of the AES algorithm, its base operations are divided into two groups. Figure 4 shows the partitioning of the encryption part of the AES algorithm (shown on the left) into David and Goliath. This new arrangement is more suitable for a GALS implementation but results in a relatively fine-grained partitioning. The performance penalties incurred are a tradeoff for the increased DPA security that this architecture is able to offer.
A common misconception is to attribute the additional overhead of a given partitioning only to the self-timed wrapper. The LS island must also be modified so that it can properly interface with the asynchronous controllers. It is extremely difficult to quantify the overhead involved in adapting an LS island so that it can be part of a GALS module. A synchronous version of Acacia without any specific DPA countermeasures has been integrated for comparison purposes. Parameters for both circuits are presented in table 1. It must be noted that the huge difference in circuit sizes can largely be attributed to the additional DPA countermeasures and to the different optimization constraints used during synthesis (area constraints were given a higher priority for the synchronous design). It can be seen from table 1 that the area overhead of the self-timed wrapper alone is relatively small (7% for David and 3.3% for Goliath), even for such a fine-grained GALS system. The area of the local clock generator and the port controllers is given separately in table 1. The number for the port controller appears large in comparison to the clock generator, since it also contains the latches required for the data transfer between GALS modules. The important difference between the two implementations lies in the latency of the individual blocks. Since the handshaking protocol of GALS requires that data is stored in registers both at the input and at the output, the data transfer between GALS modules costs an additional clock cycle when compared to a typical synchronous solution. Depending on the implementation, this disadvantage can be partly compensated if one of the GALS modules has a shorter critical path. Although the GALS solution requires ten instead of seven clock cycles for an encryption round (an increase of more than 40%), Acacia is able to complete one round of encryption within 104% of the time required for the synchronous solution.
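The cycle counts and the round time quoted above imply a ratio between the clock periods of the two designs. As a rough check, assume that each design can be described by a single average clock period; the symbols $T_{\mathrm{GALS}}$ and $T_{\mathrm{sync}}$ are introduced here for illustration only and do not appear in table 1:

```latex
\frac{10}{7} \approx 1.43
\qquad\text{and}\qquad
\frac{10\,T_{\mathrm{GALS}}}{7\,T_{\mathrm{sync}}} \approx 1.04
\;\Longrightarrow\;
\frac{T_{\mathrm{GALS}}}{T_{\mathrm{sync}}} \approx \frac{1.04 \cdot 7}{10} \approx 0.73
```

Under this assumption, the average clock period in the GALS design would be roughly 27% shorter than in the synchronous design, which is consistent with the remark that a shorter critical path can partly compensate for the additional cycles. Since the clock periods in Acacia vary from cycle to cycle and between modules, this ratio should be read as an average only.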
Additional Control Complexity
The state machine of a GALS compatible LS island must be able to control data transfers with different latencies between GALS modules. As an example, consider the task of the controller in Goliath that needs to schedule four MixColumns operations to two datapaths for one encryption round. Assume that the first operation MixColumns 1 was scheduled on the first datapath and the second operation MixColumns 2 was scheduled on the second datapath initially (which is not always the case, as either datapath may be unavailable for the first operation). Either of the datapaths may finish processing the assigned operation in a given cycle. Therefore the controller must be able to handle the following four situations:
(i) Both datapaths continue processing and are not available.
(ii) Both datapaths finish processing, and MixColumns 3 is scheduled on the first datapath and the last operation MixColumns 4 is scheduled on the second datapath.
(iii) MixColumns 1 is finished, and MixColumns 3 is scheduled on the first datapath.
(iv) MixColumns 2 is finished, and MixColumns 3 is scheduled on the second datapath.
In the last two cases, the remaining MixColumns 4 would need to be scheduled on the first available datapath. It is indeed possible that the datapath that was assigned MixColumns 3 will be available before (or at the same time as) the other one. Note that the situation described here can occur even if none of the previously described DPA countermeasures were implemented. The latency of the datapath, which is in a separate GALS module, relative to the controller may also change as a result of data transfers with a third GALS module, or as a result of the unbounded delay of the mutual exclusion elements used in the synchronization sub-system. Probably the highest penalty paid for a more complex state machine, apart from the cost of properly implementing it, is the increased effort that is required for its functional verification.
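The situations listed above can be captured in a few lines of behavioural code. The following is a minimal sketch and not the Goliath controller itself: the function name and data structures are hypothetical, and the assignment is deterministic (lowest free datapath first), whereas Acacia may assign operations in random order.

```python
def schedule_step(pending, busy, finished):
    """One controller step for two datapaths.

    pending  -- list of MixColumns operations not yet assigned, in order
    busy     -- dict: datapath index (0 or 1) -> operation currently assigned
    finished -- set of datapath indices that report completion in this cycle
    Returns the list of operations completed in this step.
    """
    done = []
    for dp in sorted(finished):
        if dp in busy:
            done.append(busy.pop(dp))       # result collected, datapath free
    for dp in (0, 1):
        if dp not in busy and pending:
            busy[dp] = pending.pop(0)       # next operation to a free datapath
    return done


def after(finished):
    """State after one step, starting with MixColumns 1 and 2 in flight."""
    pending, busy = [3, 4], {0: 1, 1: 2}
    schedule_step(pending, busy, finished)
    return busy, pending


# The four situations: neither, both, only the first, only the second
# datapath finishes in the current cycle.
for finished in (set(), {0, 1}, {0}, {1}):
    busy, pending = after(finished)
    print(sorted(finished), "->", busy, "still pending:", pending)
```

A synchronous implementation with fixed latencies would only need the second case; the other three branches, and the bookkeeping behind them, are the kind of additional control complexity discussed in this section.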
The Case with Port Controllers
In GALS design, the port controllers govern the communication between GALS modules. The specific implementation of the port controller can have significant ramifications for the design of the LS island as well. Muttersbach [13] has presented a demand-type port controller. This controller, once activated by the Pen signal, immediately pauses the local clock generator and performs the handshake with its communication partner. Until the data transfer is complete, no clock edge is generated, and the LS island is effectively suspended. If properly implemented, a demand-type port controller would be able to work without the Ta signal. In this case the controller of the LS island can be significantly simplified, as pending data transfers need not be taken into account. In terms of DPA security, it is good practice to reveal as little as possible about the operation. Suspending the LS islands while they wait for new data to process may reduce the power consumption, but may also reveal the effective state of the operation to the attacker. Therefore, demand-type controllers are not used in Acacia. Both GALS modules are allowed to run normally until both have signalled their readiness to transfer data. Within each GALS module, the Ta signal must be evaluated by the LS island to determine whether or not the data transfer has been completed.
The port controllers used by Muttersbach are designed to transfer data in subsequent clock cycles and are triggered by a change in the Pen signal. This fast mode of operation is not possible if the Ta signal corresponding to each data transfer request needs to be evaluated. In Acacia, a new simplified port controller that is triggered by a level-sensitive Pen signal has been used instead. Once activated, these controllers wait until the communication partner signals its readiness. The clock is only paused momentarily during data transfer. The port controllers have been designed using a signal transition graph (STG) and have been converted into two-level logic equations using the Petrify tool [5]. The equations have then been mapped to gate-level netlists manually. This process requires some experience, but should not be a serious challenge for designers accustomed to working with highly complicated EDA tools.
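The difference between suspending an island and letting it run until both partners are ready can be made concrete with a coarse timing model. The sketch below rests on several assumptions that are not taken from the paper: fixed clock periods, a fixed handshake duration, Pen raised exactly on a local clock edge, and hypothetical function and parameter names. It models only the timing of one transfer, not the STG of the actual port controller.

```python
def simulate_transfer(period_a, period_b, pen_cycle_a, pen_cycle_b,
                      handshake=1.0):
    """Timing of one transfer between two GALS modules A and B.

    Each module raises its level-sensitive Pen on the clock edge that starts
    the given local cycle and keeps running afterwards. The transfer starts
    once both Pen signals are high. During the handshake no clock edge is
    generated; an edge that would fall inside it is delayed to its end.
    Each island samples Ta on its first clock edge after the handshake.
    """
    periods = {"A": period_a, "B": period_b}
    pen_time = {"A": pen_cycle_a * period_a, "B": pen_cycle_b * period_b}
    start = max(pen_time.values())        # both partners have signalled Pen
    end = start + handshake               # data latched, Ta asserted
    result = {}
    for name in ("A", "B"):
        period = periods[name]
        # Clock edges generated while waiting: the island is not suspended.
        waiting_edges = int((start - pen_time[name]) // period)
        last_edge = pen_time[name] + waiting_edges * period
        # The next edge is stretched only if it would fall in the handshake.
        next_edge = max(last_edge + period, end)
        result[name] = {"waiting_edges": waiting_edges,
                        "ta_sampled_at": next_edge}
    return result


result = simulate_transfer(period_a=10.0, period_b=7.5,
                           pen_cycle_a=2, pen_cycle_b=5, handshake=3.0)
for name, info in result.items():
    print(name, info)
```

In this example module A is ready first and continues to receive a clock edge while it waits, so its supply current does not reveal that it is idle; only the single edge that collides with the handshake is stretched.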
One-Sided Port Controllers
In the most general case, GALS communication requires port controllers and pausable clocks on both the transmitting and the receiving GALS module to enable reliable communication. This method of communication places no additional timing constraints on the system itself. Consider the case where the clock period of one GALS module is significantly longer than that of the other. Depending on the difference between the clock periods, and the number of additional interruptions that either GALS module may receive, it is possible to imagine scenarios where only the faster GALS module slows its own clock to exchange data, without interrupting the local clock of the slower GALS module in any way. Such port controllers, where only one GALS module adjusts its clock period to synchronize to another clock domain, are called one-sided port controllers. Standard synchronous blocks can be designed to interface to such one-sided ports reliably.
The communication between the synchronous interface and Goliath (seen in Figure 2) is governed by a one-sided port controller, whose connections and STG are shown in more detail in Figure 5. The interface uses only rising-edge triggered flip-flops and initiates the data transfer with the Enable signal. A Muller-C element combines this Enable signal with the negated clock of the Interface, so that the port controller receives the Req signal only during the second half of the clock period, where the value of Enable is stable. The Interface monitors the Done signal to conclude the data transfer. A second Muller-C element combines the Ack signal from the port controller and the clock to ensure that the Done signal only changes its value during the first half of the clock cycle, avoiding setup and hold violations at the Interface. Similar to the LS islands, the inputs and outputs of the block that communicates with a one-sided port controller should be registered to simplify the timing constraints.
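The role of the first Muller-C element can be illustrated with a behavioural model. A C-element sets its output when both inputs are high, clears it when both are low, and otherwise holds the previous value. The sketch below assumes an idealized, delay-free element and samples each clock half once; the class and function names are hypothetical, and the model covers only the generation of Req, not the complete STG of Figure 5.

```python
class CElement:
    """Muller C-element: output follows the inputs when they agree, else holds."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out


def request_waveform(enable_per_cycle):
    """Req = C(Enable, not clk), evaluated in each half of every clock period.

    Enable is launched by a rising-edge triggered flip-flop, so in this model
    it may change only at the start of the first (clk = 1) half of a cycle.
    Returns a list of (cycle, clk, enable, req) tuples.
    """
    c = CElement()
    trace = []
    for cycle, enable in enumerate(enable_per_cycle):
        for clk in (1, 0):          # first half: clk high, second half: clk low
            req = c.update(enable, 1 - clk)
            trace.append((cycle, clk, enable, req))
    return trace


trace = request_waveform([0, 1, 1, 0, 0])
for cycle, clk, enable, req in trace:
    print(f"cycle {cycle} clk={clk} Enable={enable} Req={req}")
```

In this model Req rises only while the clock is low, that is, half a period after Enable was launched and is therefore stable. Req is cleared once Enable and the negated clock are both low.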
Test Coverage
Testing of asynchronous circuits is known to be a difficult task. A distinct advantage of GALS-based systems is that asynchronous circuits, and thereby all associated problems, are limited to the self-timed wrapper. Therefore, rather than developing test solutions that are universally applicable to all asynchronous circuits, dedicated solutions for a small set of circuits can be developed [8]. As an example, the stuck-at fault dictionary of Acacia has more than 154,000 entries, of which only 182 are from asynchronous port controllers.
In Acacia, all LS islands have their own separate scan-based test solution. The local clock generators include a special test mode where they multiplex a test clock to their outputs. In this test mode the scan chains of all LS islands can directly be accessed by external synchronous automated test equipment. In this configuration, a test coverage of more than 96% has been obtained for Acacia. All asynchronous port controllers, the local clock generators and all registers together with associated glue logic directly connected to the GALS interfaces cannot be tested for stuck-at faults with this method. The remaining stuck-at faults are covered by applying functional test patterns to encrypt random data values. The combined stuck-at test coverage obtained by both methods for Acacia is 99.88%.
Design Methodology
A hierarchical design flow was used in the design of Acacia. The LS islands were designed using an industry-standard digital design flow. The self-timed wrapper that converts the LS island into a GALS module basically consists of the port controllers, the local clock generator and a few glue logic cells. Generating the netlist for the self-timed wrapper is a trivial task. Although automated scripts have been used for this purpose in earlier designs [7], it was fairly easy to generate the netlists manually for Acacia, since it only requires a total of five port controllers in three GALS modules. The Ta signal generated by the port controllers needs to be sampled by the LS island using its local clock. The port controllers have been designed to enable the Ta signal after the local clock generator is released. As can be seen in the block diagram in Figure 5, the local clock signal, after being released by the mutual exclusion element, travels through some internal logic in the local clock generator and propagates through the clock tree within the LS island. The Ta signal must be delayed to match the propagation delay of the local clock. As the exact amount of the clock tree insertion delay is known only during the back-end design flow, the delay matching has to be performed during this stage as well.
The interconnection of all GALS modules on the top level is just a matter of providing the required interconnections. Unlike in a synchronous design, where the input and output timings of all involved modules need to be balanced, no additional steps need to be performed at this stage.
Conclusions
A GALS-based implementation of the AES algorithm has been designed and sent to fabrication. The layout of the design can be seen in Figure 6. The design is partitioned into three datapaths which are realized as separate GALS modules. Several key topics of migrating a standard synchronous design into a GALS system have been discussed in this article.
Probably the most important task facing a designer who wants to implement a GALS system is an efficient partitioning of the design into GALS modules. Although data transfer between GALS modules can be achieved reliably, it involves a certain overhead when compared to a synchronous design. It is therefore advisable to place functional blocks that exchange data in subsequent clock cycles within the same GALS module. A functional partitioning scheme has a better chance of achieving this objective.
The LS islands that compose the GALS module need to satisfy certain conditions to resolve timing constraints at the GALS module level. Using registered data inputs and outputs, and supporting a simple handshaking protocol are essential. Implementing these changes may result in additional latency in the system. Moreover, the state machines that control these LS islands must be capable of dealing with more cases which can significantly increase their complexity. In turn, this increase in complexity also increases the effort required for the functional verification of the system. The port controllers used in GALS should be adapted to the specific requirements of the design. As an example when data transfers in subsequent cycles are not required, simplified port controllers with reduced gate count can be used. Although not described explicitly in this paper, the authors can imagine that specialized port controllers that are suited for burst data transfers, or port controllers that implement a 'time-out' feature may need to be developed to satisfy future design requirements. It is common to imagine that both sides of a GALS data transfer channel are involved in synchronization. It was shown that, when the nominal period of one module is much shorter than the other one, the faster module can be controlled to adapt to the slower module without interrupting the slower module. Port controllers that support this one-sided operation can also be used to interface to modules with a synchronous clock.
Traditionally, design and test automation have been seen as the main problems facing a GALS implementation. Limiting the asynchronous circuits strictly to the self-timed wrapper reduces hard-to-solve general problems to easily solvable cases. For the presented example design, a combined stuck-at test coverage of 99.88% has been obtained using standard design tools. Similarly, standard hierarchical design flows can easily be adapted to the GALS design. Problems that can (still) not be
