Abstract: Significant effort has been put into optimizing globally asynchronous, locally synchronous (GALS) systems by treating locally synchronous modules as individual, independent islands of circuits. While existing approaches do improve some system characteristics, their limited scope often limits the improvement they achieve. This paper proposes an optimization approach in which a GALS system is optimized as a whole. The approach allows combinational or sequential sub-modules to move from one synchronous module to another while preserving the functionality of the GALS system. Experimental results show that, if the movements are done properly, the proposed approach provides better results than existing methods. Applying the proposed approach to benchmark circuits yields a 15.8% latency reduction with a 3.3% area overhead on average.
Introduction
The GALS design methodology was proposed to overcome the design limitations of large systems by combining the advantages of synchronous and asynchronous systems. Among its most attractive properties are higher performance and the absence of a sophisticated clock tree, which lowers overall power consumption and eliminates clock skew problems in large systems [1]. These advantages come with very little overhead. Significant effort has been put into optimizing GALS systems by treating locally synchronous (LS) modules as individual, independent islands of circuits. Various system characteristics have been targeted by these local optimizations, such as power consumption [2], latency, and performance [3]. While these approaches do improve some system characteristics, the improvement achieved is often marginal. This is because they inherently treat the LS modules independently and overlook the mutual effects of the modules on each other. This suggests that optimizing the entire system may provide better results.
However, before such system-level optimization can be achieved, a few obstacles need to be overcome. The main one is the set of special interface modules, called wrappers, that mediate between the synchronous and asynchronous parts of the design [4, 5]. The wrappers surround each locally synchronous module and are responsible for all synchronous/asynchronous transactions. In effect, they separate the two realms of synchronous and asynchronous design, which involve different methodologies, design tools, and skill sets.
To overcome these obstacles, GALS system optimization should be able to move parts of an LS module to adjacent modules in such a way that the overall functionality of the system is preserved. This paper explores the possibility of such a scheme, in which a GALS system is optimized in its entirety. The optimized system has the same functionality as the original one but better system characteristics.
The paper is organized as follows: Section two contains a brief review of the prior art. Section three focuses on the optimization methodology. Experimental results are described in section four. Conclusions and future work can be found in the last section.
Various optimization approaches have been proposed for GALS systems, each focusing on one of their characteristics. For instance, a scheme that optimizes individual LS modules by scaling their voltage levels to achieve an energy-efficient processor is proposed in [2]. The authors introduce a GALS processor with multiple clock domains, in which a dynamic voltage scaling scheme adjusts the voltage level of each domain independently. The possibility of scaling the frequency to optimize the system's power consumption is also discussed [2].
Another example is an approach to improve the performance of a GALS system by taking advantage of the delay dependencies of its input vectors [3] . Allowing local clock frequency optimizations, this approach classifies the input data into several classes, each containing the input data that requires the same clock frequency for a given LS module. It has been shown that the selection of a suitable clock period for each input data class can maximize the performance.
On the other hand, retiming has been used extensively as a powerful optimization method to improve the timing characteristics of synchronous circuits [6]. In our previous work, we noted that the wrappers' communication protocol includes timing gaps in which one LS module is waiting for a response from another LS module. We showed that the concept of retiming can be used to reposition some of the combinational logic to an LS module's boundaries, immediately before or after the wrappers. The propagation delay of the repositioned combinational logic has to be calculated precisely so that it matches the timing gaps of its adjacent wrapper as closely as possible. Our work demonstrated that, when applicable, this approach shortens the critical path of an LS module, thereby increasing its operating frequency. While this leads to a performance improvement, the improvement is limited and restricted to that individual module [5].
As can be seen, previous works have focused on local optimizations and lack a global view of the system. The next sections propose such a view.
The optimization concept
The above discussion suggests that considering neighboring LS modules can be beneficial during the optimization process. This notion also provides the additional advantage of making the well-developed techniques for synchronous circuit optimization available. However, a major concern remains: the potential for an inadvertent change of circuit functionality during this process. This can be addressed by following strict rules derived from the well-known retiming concept. These rules guarantee that the overall circuit functionality remains intact [6].
In the original retiming technique, the sequential elements of a circuit (namely, DFFs) are moved across its combinational elements (gates) following two basic rules, join and fork, in order to optimize specific characteristics of the circuit (often its critical path) [6]. Borrowing from this concept, we propose a rule, called the migration rule, to move the combinational and sequential circuit elements of one LS module across its wrappers and its adjacent module's wrappers, so as to 'migrate' those elements into the neighboring modules. This rule is discussed in more detail below.
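As a reminder of how classical retiming keeps a synchronous circuit legal, the following minimal sketch uses the common textbook formulation: each gate v is assigned an integer lag r(v), and a connection from u to v carrying w DFFs ends up with w + r(v) - r(u) DFFs, which must not be negative. The function name `retime`, the edge-list representation, and the gate names are illustrative assumptions, not part of the paper's method.

```python
from typing import Dict, List, Tuple

# (driver gate, load gate, number of DFFs on the connection) - hypothetical model
Edge = Tuple[str, str, int]


def retime(edges: List[Edge], lag: Dict[str, int]) -> List[Edge]:
    """Apply a lag assignment and return the connections with updated DFF counts.

    Gates missing from `lag` are treated as having lag 0.
    Raises ValueError if a connection would need a negative number of DFFs.
    """
    retimed = []
    for u, v, w in edges:
        new_w = w + lag.get(v, 0) - lag.get(u, 0)
        if new_w < 0:
            raise ValueError(f"illegal retiming: {u}->{v} would carry {new_w} DFFs")
        retimed.append((u, v, new_w))
    return retimed


# A three-gate loop; a lag of -1 on B moves one DFF from B's input to B's output.
loop = [("A", "B", 1), ("B", "C", 0), ("C", "A", 1)]
print(retime(loop, {"B": -1}))
```

Under this formulation the number of DFFs around any loop is unchanged, which is the property that keeps the circuit's behavior intact.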
Migration rule
To describe the migration rule, we start with a scenario in which a combinational element, or a block of purely combinational elements, is to be moved from one LS module to another. In this scenario, the migration rule stipulates that three steps are necessary to move such an element or block:
1-Removing the wrappers corresponding to the moving block at source and destination LS modules.
2-Moving the block from the source module to the destination module.
3-Inserting appropriate wrappers at source and destination modules.
After migration, another round of retiming can be performed on the source and destination modules locally if necessary. Fig. 1 shows an example of this process.
Fig. 1. A simple example for migration of a combinational block. a) before migration b) after migration (global retiming) c) after a local retiming
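The three migration steps can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a wrapper is modeled as nothing more than a marker on a connection that crosses an LS module boundary, and the class `GalsSystem`, its fields, and the element names are hypothetical. Timing and handshake details of real wrappers are not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Connection = Tuple[str, str]  # (driver element, load element)


@dataclass
class GalsSystem:
    modules: Dict[str, Set[str]]   # LS module name -> elements it contains
    edges: List[Connection]        # element-level netlist, never modified
    wrappers: Set[Connection] = field(init=False, default_factory=set)

    def __post_init__(self) -> None:
        every_element = set().union(*self.modules.values())
        self.wrappers = self.boundary_edges(every_element)

    def owner(self, element: str) -> str:
        for name, members in self.modules.items():
            if element in members:
                return name
        raise KeyError(element)

    def boundary_edges(self, touching: Set[str]) -> Set[Connection]:
        """Connections that touch `touching` and cross a module boundary."""
        return {
            (u, v) for u, v in self.edges
            if (u in touching or v in touching) and self.owner(u) != self.owner(v)
        }

    def migrate(self, block: Set[str], src: str, dst: str) -> None:
        if not block <= self.modules[src]:
            raise ValueError("block must lie entirely inside the source module")
        # Step 1: remove the wrappers corresponding to the moving block.
        self.wrappers -= self.boundary_edges(block)
        # Step 2: move the block from the source to the destination module.
        self.modules[src] -= block
        self.modules[dst] |= block
        # Step 3: insert wrappers on the connections that now cross a boundary.
        self.wrappers |= self.boundary_edges(block)


system = GalsSystem(
    modules={"LS1": {"G1", "A"}, "LS2": {"G2"}},
    edges=[("G1", "A"), ("A", "G2")],
)
system.migrate({"A"}, "LS1", "LS2")
print(system.wrappers)  # the wrapper now sits between G1 and A
```

Because the netlist in `edges` is never changed, the sketch makes the intent of the rule visible: only the module boundaries and the wrappers on them move, not the logic itself.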
One concern is that migrating node A to LS #2 may violate the critical path constraint of the destination module. This can be addressed by reducing the clock frequency to preserve functionality. The resulting frequency loss can then be recovered with synchronous optimization methods, such as the original retiming algorithm, at the end of this step.
For the migration of sequential sub-modules, minor changes in the destination LS module(s) are required to preserve the system functionality.
Assume that a DFF is to be transferred from LS #1 to LS #2, that the destination LS module has no other input, and that this input is the only connection between the two LS modules. In this situation, migrating the DFF means that LS #1 starts handshaking earlier and the destination module lags by one clock cycle. If there are other inputs from the source module to the destination module, they are not affected by the migration, because the new wrappers inserted in step 3 of the migration process adjust the timing of all inputs between the two modules. If the destination module has inputs from other LS modules, all primary inputs of the destination module should be delayed by the same number of clock cycles (Fig. 2 a and b).
It should be noted that the migration process remains valid even in the presence of feedback loops. This is because appropriate wrappers can be placed in the source and destination modules to handle the timing of data transfer.
Based on the above discussion, a general migration rule can be proposed to transfer combinational and sequential elements or blocks in GALS systems as follows.
Migration Rule: A combinational or sequential element or block in an LS module of a GALS system can be transferred to one or more neighboring LS modules by placing it immediately after the destination module's input wrappers and modifying the wrappers in both LS modules to reflect the new timing. In some cases, extra delay element(s) have to be added to the other inputs of the destination module. Should this be the case, the number of such delay elements is the maximum number of delay elements located on the different paths from the primary inputs to the primary outputs of the migrating block.
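The delay count in the rule can be illustrated with a short sketch. It assumes, for simplicity, that the migrating block is acyclic and is described by a fanout map in which every element without fanout is a primary output of the block. The function name `balancing_delay` and the element names are hypothetical.

```python
from functools import lru_cache
from typing import Dict, List, Set


def balancing_delay(
    fanout: Dict[str, List[str]],  # element -> elements it drives (assumed acyclic)
    dffs: Set[str],                # which elements are delay elements (DFFs)
    inputs: List[str],             # primary inputs of the migrating block
) -> int:
    """Maximum number of DFFs on any input-to-output path of the block."""

    @lru_cache(maxsize=None)
    def depth(node: str) -> int:
        own = 1 if node in dffs else 0
        # Elements with no fanout are treated as primary outputs of the block.
        return own + max((depth(s) for s in fanout.get(node, [])), default=0)

    return max((depth(i) for i in inputs), default=0)


# Two paths from in0 to out: one through two DFFs, one purely combinational.
block = {
    "in0": ["g1"],
    "g1": ["ff1", "g2"],
    "ff1": ["ff2"],
    "ff2": ["out"],
    "g2": ["out"],
    "out": [],
}
print(balancing_delay(block, {"ff1", "ff2"}, ["in0"]))
```

For a purely combinational block the function returns 0, which matches the case in which no extra delay elements are needed on the other inputs of the destination module.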
The Optimization Strategy
The most important advantage of the proposed optimization method is that it removes the barriers between the locally synchronous modules, so that better results can be reached by exploiting a global view of the entire GALS system. At the same time, by placing appropriate wrappers at the inputs and outputs of the LS modules, one can still benefit from the advantages of a GALS system. Furthermore, since the blocks are transferred among LS modules gradually, there is no need to "flatten" the hierarchy, which would make the optimization process more difficult and the repositioning of the wrappers more complicated. Another positive feature of the proposed approach is that it can be used to reach optimal partitioning during the design of a GALS system. In other words, a designer can incorporate the synchronous blocks into the system and leave the task of module definition to a tool based on the proposed concept. Such a tool would partition the design to define the individual LS modules with a particular system characteristic (e.g., performance, power, or frequency) as the optimization target. This scheme is currently being pursued by the authors.
Experimental results
To evaluate the proposed method, 10 benchmark circuits from ISCAS'89 were chosen. Following the methodology described in [7] , the circuits were partitioned into three (for smaller circuits S298, S510, and S1238) and five (for the rest) roughly equal regions. D-type wrappers with output and input controllers were added to each partition to create the LS modules.
The LS modules were optimized to obtain the minimum latency using commercial synthesis tools. The first two columns of Table I show the resulting latency and area. Then, the original retiming algorithm was manually applied to each LS module individually to further reduce each module's latency, as shown in the middle columns of Table I. Finally, the proposed approach was applied to the GALS system. The final latency and area of the circuits can be seen in the last columns of Table I, along with the percentages of latency reduction and area overhead. This area overhead is caused by the DFF(s) and wrapper(s) that may have been added to the system.
Table I. Experimental results
As can be seen in the table, the proposed optimization method reduces system latency by 15.84% on average, compared with only a 4.2% reduction achieved by local retiming. The results show a 3.3% area overhead on average, which appears to be a reasonable trade-off for the considerable delay reduction obtained.
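For readers who want to reproduce this kind of summary from per-circuit measurements, the sketch below shows one plausible way to compute the percentages. It assumes that the average is the arithmetic mean of the per-circuit percentages, which the text does not state explicitly. The numbers in the usage example are placeholders, not values from Table I.

```python
from statistics import mean
from typing import List, Tuple


def percent_reduction(before: float, after: float) -> float:
    """Relative decrease of a metric (e.g. latency), in percent."""
    return 100.0 * (before - after) / before


def percent_overhead(before: float, after: float) -> float:
    """Relative increase of a metric (e.g. area), in percent."""
    return 100.0 * (after - before) / before


def average_reduction(pairs: List[Tuple[float, float]]) -> float:
    """Mean of per-circuit reductions (an assumption about the averaging used)."""
    return mean(percent_reduction(b, a) for b, a in pairs)


# Placeholder (before, after) latencies for two hypothetical circuits.
hypothetical_latencies = [(10.0, 8.0), (20.0, 18.0)]
print(average_reduction(hypothetical_latencies))
```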
Conclusion and future work
In this paper, a new GALS optimization approach with a global view was proposed. The possibility of concurrently manipulating different types of delay elements, and the application of the idea to mixed synchronous and asynchronous circuits, are being studied as future work.
