Abstract: Pausible clocking based globally-asynchronous locally-synchronous (GALS) system design has proven to be a promising approach to SoCs and NoCs. In this paper, we analyze the throughput reduction and synchronization failures introduced by the widely used pausible clocking scheme, and propose an optimized scheme for higher throughput and more reliable GALS design. The local clock generator is improved to minimize the acknowledge latency, and a novel input port is applied to maximize the safe timing region for the clock tree insertion. Simulation results using the IHP 0.13-µm standard CMOS process demonstrate that up to a one-third increase in data throughput and an almost doubled safe timing region for clock tree distribution can be achieved in comparison to the traditional pausible clocking scheme.
I. INTRODUCTION
With the growing complexity of systems-on-chips (SoCs), traditional synchronous digital circuits become increasingly difficult to implement. A major challenge is the distribution of a low skew global clock. The large number of required buffer cells can lead to 40% of the total power dissipation and occupy significant silicon area.
By eliminating the global clock, globally-asynchronous locally-synchronous (GALS) design provides a promising solution for SoCs. The most straightforward way to build GALS systems is to insert synchronizer circuits between different clock domains [1]. Normally a synchronizer consists of two or more cascaded flops, and it introduces latency in data transfer. Another approach to GALS systems is the use of asynchronous FIFOs [2, 3], which results in overheads in both area and power. In recent years, an alternative approach to GALS design, mainly based on pausible local clocks, has been developed [4, 5, 6, 7, 8, 9, 10, 11, 12]. Communication between asynchronous modules is achieved using a pair of request-acknowledge handshaking signals, and the local clocks are paused and stretched, if necessary, to avoid metastability in data transfer.
Until now, most of the silicon-validated pausible clocking systems [13, 14, 15, 16] have been designed based on the scheme proposed and improved in [5, 6, 8]. As a recent example, this scheme is applied in [17] to implement a dynamic voltage frequency scaling (DVFS) NoC. Fig. 1 depicts a point-to-point GALS system based on this well-known scheme and the waveforms of its handshake signals. Each synchronous module is surrounded by an asynchronous wrapper, which mainly consists of a local clock generator and several asynchronous I/O ports. A four-phase bundled-data protocol is used, and a single latch is deployed in the input port to load the data from the output port. The design of asynchronous wrappers is discussed in detail in [13, 14].
Fig. 1 A GALS system (a) and its handshake signals (b)
In this paper, we focus on the analysis and optimization of this widely used pausible clocking scheme. Section 2 studies the acknowledge latency of local clock generators, and demonstrates its impacts on the system throughput and data synchronization. Section 3 shows an improved scheme with minimum latency from the clock generator and maximum safe timing region for the clock tree insertion. The proposed scheme is implemented and evaluated using the IHP 0.13-µm CMOS process in Section 4. Finally, a brief conclusion is given in Section 5.
II. ANALYSIS OF PAUSIBLE CLOCKING SCHEME
A typical local clock generator used in pausible clocking schemes is depicted in Fig. 2 [8, 9, 17, 18] . A programmable delay line is employed to generate the clock signal LClk. An array of MUTEX elements is used to arbitrate between port requests Reqx and the request clock signal RClk. If any Reqx gets acknowledged, LClk will be paused. 
A. Acknowledge Latency of Clock Generator
For a MUTEX element, at any time only one of the two incoming events Reqx+ and RClk+ is allowed to pass, on a first-come, first-served basis. A Reqx+ arriving before RClk+ will be acknowledged immediately by the MUTEX, but a Reqx+ arriving after RClk+ will not be acknowledged until RClk- happens. If Reqx+ and RClk+ arrive simultaneously, the MUTEX element will decide randomly which signal should be acknowledged.
For clarity in the following discussion, we define the request acknowledge window (RAW) in a local clock generator as the duration in each cycle of LClk when the port requests can be acknowledged. For the clock generator shown in Fig. 2, its RAW is the inactive phase of RClk, which corresponds to the active phase of LClk as shown in Fig. 3. Considering a 50% duty cycle of LClk, the duration of the RAW in this clock generator can be deduced as follows:
Fig. 3 Request acknowledged window
Any Reqx+ occurring outside of the RAW will lead to increased acknowledge latency. The worst case happens if Reqx+ arrives concurrently with RClk+ and RClk is acknowledged by the MUTEX. Then Reqx cannot be granted until RClk goes low. Therefore, based on the RAW, we can further derive the maximum acknowledge latency caused by the above local clock generator in equation (2). In the following, we analyze the impacts of the acknowledge latency on system throughput and data synchronization.
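Written out under the 50% duty-cycle assumption, and consistent with the later use of t RAW = T LClkRx /2 for this generator, the two expressions plausibly take the following form. The symbol t_AckLat is a name assumed here for the acknowledge latency.

```latex
\begin{align}
t_{\mathrm{RAW}} &= \frac{T_{\mathrm{LClk}}}{2} \tag{1} \\
\max\left(t_{\mathrm{AckLat}}\right) &= T_{\mathrm{LClk}} - t_{\mathrm{RAW}} = \frac{T_{\mathrm{LClk}}}{2} \tag{2}
\end{align}
```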
B. Throughput Reduction
For simplicity, a point-to-point GALS communication, as shown in Fig. 1, is taken as an example. We first discuss the data transfer from a demand-type output (D-OUT) port to a poll-type input (P-IN) port, and a similar analysis is then applied to the other three communication channels used in point-to-point GALS systems [8, 13].
1) D-OUT Port to P-IN Port Channel
For the receiver equipped with a P-IN port, the local clock LClk Rx will be paused after Req Rx+ occurs [8]. Req Rx will be asserted after a Req P+ is detected, which is generated by the output port in the transmitter running at an independent clock. Therefore, without loss of generality, the arrival time of Req P+, and hence the arrival time of Req Rx+, can be modelled as a uniformly distributed random variable within a period of LClk Rx. Since t RAW = T LClkRx /2 in the above clock generator, there is a 50% probability that the acknowledgement of a Req Rx+ is postponed to the next RAW. Moreover, since the data is sampled in the receiver at the next rising edge of LClk Rx, Data RxS will be delayed by one cycle of LClk Rx. As an example, the latencies of the acknowledge signal Ack Rx and the sampled data Data RxS are depicted in Fig. 4.
For the transmitter equipped with a D-OUT port, its local clock LClk Tx is paused before Req P is asserted by the output port, and LClk Tx will not be released until Req P gets acknowledged [8]. Since Req P will not be acknowledged by the input port until Req Rx+ is acknowledged by the clock generator on the receiver side, there is a latency of up to T LClkRx /2 in acknowledging Req P as well. Consequently, the latency in the receiver is propagated into the transmitter. If the period of LClk Rx is much longer than that of LClk Tx, this latency will result in a multi-cycle suspension of LClk Tx. Because the data is processed synchronously to LClk Tx in the transmitter, the suspension of LClk Tx will eventually result in a delay in data transfer.
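The 50% figure and the T LClkRx /2 worst case can be checked with a small behavioural model. This is a minimal sketch, assuming each LClk Rx cycle begins with the RAW and that a request missing the RAW waits for the start of the next cycle; the function names and the 10 ns period are hypothetical and chosen only for illustration.

```python
# Behavioural sketch of the acknowledge latency of the traditional clock
# generator. Assumption: every LClk cycle starts with the RAW (LClk high),
# and a request outside the RAW is served at the start of the next cycle.

def ack_latency(t_req: float, period: float, t_raw: float) -> float:
    """Waiting time before a request arriving at t_req is acknowledged."""
    phase = t_req % period
    if phase < t_raw:
        return 0.0             # inside the RAW: acknowledged immediately
    return period - phase      # outside the RAW: wait for the next cycle


def extension_probability(period: float, t_raw: float) -> float:
    """Probability that a uniformly distributed request misses the RAW."""
    return 1.0 - t_raw / period


T_RX = 10.0        # hypothetical receiver clock period in ns
T_RAW = T_RX / 2   # RAW of the traditional generator

# Sweep the arrival time uniformly over one clock period.
samples = [ack_latency(i * T_RX / 1000, T_RX, T_RAW) for i in range(1000)]
delayed = sum(1 for s in samples if s > 0)

print("analytic probability:", extension_probability(T_RX, T_RAW))
print("swept probability:   ", delayed / len(samples))
print("worst-case latency:  ", max(samples), "ns")
```

The worst case in the sweep corresponds to a request arriving exactly when the RAW closes, which mirrors the situation where Reqx+ coincides with RClk+ and RClk wins the arbitration.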
For instance, considering a D-OUT port to P-IN port channel, where the periods of LClk Tx and LClk Rx satisfy the following condition:
and the request in the transmitter, Req Tx, is asserted every N cycles of LClk Tx, as shown in equation (4):
After Req Tx is asserted in the transmitter, a Req P + will be generated by the output port controller. Every time LClk Tx is suspended, its inactive phase will be stretched for a period of (T LClkRx /2 -T LClkTx ). Based on conditions (3) and (4), the throughput reduction R Tx caused by the suspension of LClk Tx is deduced in equation (5). We see that the exact percentage is determined by the value of N. With the increase in the value of N, the limit of R Tx reaches 1/3, as shown in equation (6) . It means that up to one-third reduction in data throughput could be introduced by the acknowledge latency. 
2) Other Point to Point Channels
A similar analysis can easily be applied to the other three point-to-point communication channels. In Tab. 1 we present the impacts of acknowledge latency on handshake signals and local clocks for the four channels. It can be seen that the only exception occurs when both the input port and the output port are of demand type. Regardless of whether there is data ready to be transferred, the clocks on both the transmitter side and the receiver side will be paused as soon as the ports get enabled. Although there is no extension in the handshake signals or suspension in the clock signals caused by the acknowledge latency of the local clock generators, this channel is prone to unnecessarily long suspensions in both LClk Rx and LClk Tx, and a large drop in data throughput could happen. Careful design is required when applying this type of channel to reduce system power consumption.
Tab. 1 Impacts of acknowledge latency (columns: Channel Type, Extended Signals, Suspended Clock)
C. Synchronization Failure
A significant benefit of GALS design is that the global clock distribution is simplified into a set of independent local clock networks. For pausible clocking schemes, however, a crucial issue is the synchronization failure caused by the local clock tree insertion delay in receivers. Fig. 6 depicts a failure case occurring in the traditional scheme (Fig. 1). As the clock tree insertion delay is unrelated to the propagation delay of the handshake signals, LClk RxDly+ can arrive at the sampling flip-flop FF simultaneously with the loading of data into the input port latch L. Metastability then occurs in FF.
Data synchronization issues in pausible clocking schemes were first discussed in [7]. The author suggested integrating a clock buffer network into the local ring oscillator, and proposed a pipelined interface to hide the control overhead. This method is only suitable for pipelined systems. Recent work in [12, 19] reveals that, for clock delays satisfying Δ LClkRx < T LClkRx, there are two timing regions in each cycle of LClk Rx, as shown for example in Fig. 7, where negligible synchronization failure probability can be expected [12].
In Fig. 7, Cycle 1 illustrates the situation in which the data is safely sampled by FF before L becomes transparent. It contributes the safe timing region S1 of Δ LClkRx as follows:
where Δd MUTEX denotes the additional delay of the MUTEX to resolve metastability, and d Latch is the delay of L from asserting gate enable (Ack P+) to data being stable. Therefore, the width of the safe regions is bounded and depends on T LClkRx. For small T LClkRx, there is a rather narrow region S1 in which to insert the clock tree. Even for large T LClkRx, only half of a clock period is allowed.
2) Δ LClkRx ≥ T LClkRx
It should be noted that the safe regions within each cycle of LClk Rx, as discussed above, are always aligned with LClk Rx+. A stretch in LClk Rx will lead to a delay in the safe regions of the next cycle. If the clock tree delay meets Δ LClkRx ≥ T LClkRx, this delay in the safe regions also needs to be considered for data synchronization. Take Fig. 8 for instance. During Cycle 1, the rising edge of the delayed clock LClk RxDly falls in the safe region of LClk Rx, and data is sampled correctly by FF. Then LClk Rx is stretched, and the safe regions in Cycle 2 are delayed. But there is another LClk RxDly+ already scheduled in the clock tree before the stretched clock, which arrives at FF without any delay. This eventually leads to a sampling conflict. Below, the stretching on LClk Rx is analyzed according to the type of input port used on the receiver side:
a) P-IN port
In this situation, a maximum T LClkTx /2 suspension on each cycle of LClk Rx could be introduced by the acknowledge latency, as shown in Tab. 1. So the stretching on LClk Rx, and hence the delay of the safe regions, is up to (T LClkTx /2 - T LClkRx). Since T LClkTx is independent of T LClkRx, this delay could be long enough to misalign the safe regions of successive cycles of LClk Rx, as illustrated in Fig. 9. There turns out to be no common safe region for the clock tree insertion. Moreover, if Δ LClkRx > 2T LClkRx, more than one cycle of LClk Rx could be stretched within the clock tree delay, and an accumulated delay in the safe regions should be considered.
Now we can conclude that, for a clock tree insertion delay exceeding one clock period, the uncertainty in clock stretching must be taken into account, and no matter what type of input port is utilized, there is no safe region in the traditional scheme. In fact, in most of the reported pausible clocking systems, the local clock trees were deliberately distributed to satisfy Δ LClkRx < T LClkRx [13, 14, 15, 16].
For a multi-cycle clock tree delay, an asynchronous FIFO was suggested in [20] to synchronize the input data with the delayed clock, which leads to increased latency in the datapath and additional overheads in area and power. An interface circuit using partial handshake signals was shown in [11] for high-speed systems with large clock delay, but there is an unknown nonzero probability of failure in the circuits. For the design of GALS systems insensitive to the clock tree delay, a synchronizing scheme based on locally delayed latching (LDL) was presented in [12, 19]. Since the clocks cannot be paused in the LDL interface, it introduces additional timing constraints on both the asynchronous input port controller and the combinational logic following the sampling register FF, which limits its application. Hence, more stable and efficient synchronizing circuits are required for inserting local clock trees with multi-cycle delay in pausible clocking based GALS systems.
III. OPTIMIZATION OF PAUSIBLE CLOCKING SCHEME
In this section, the pausible clocking scheme is optimized in two respects. The local clock generator is first improved to minimize the acknowledge latency, and then a novel input port, including the data latching mechanism and the port controller, is suggested to maximize the safe region for the clock tree distribution.
A. Optimized Local Clock Generator
Behind the acknowledge latency is the fact that, in Fig. 1, the local clocks on both the transmitter side and the receiver side need to be paused for safe data transfer. To avoid the acknowledge latency, we can deploy an asynchronous FIFO, as in [9], to decouple the local clocks, at the cost of overheads in area and power. A simpler solution, however, is to widen the RAW of the clock generator, as shown in Fig. 10. There are two delay lines in the local ring oscillator: the programmable delay line D0 followed by the fixed delay line D1. The delay lengths of D0 and D1 are as below:
The request clock RClk is now generated by an AND operation between LClkB, the inverted signal of LClk, and L0, the output signal of D0. It is asserted after both LClkB and L0 are high and is de-asserted as soon as LClkB turns low. The on-phase period of RClk in each cycle of LClk is the sum of the delays of the following gates:
If such a delay is shorter than half the period of LClk, the RAW in this clock generator will be wider than that in Fig. 2. For instance, if the period of the clock LClk is 10ns and the sum of the above delays is 1.5ns, the RAW in Fig. 10 is 8.5ns, while it is only 5ns in Fig. 2. Assuming a uniform distribution of the arrival time of Req Rx+ in each cycle of LClk Rx, the probability that a Req Rx introduces a one-cycle latency in the receiver drops from 50% to 15%. Fig. 11 depicts a comparison of RAW, Ack Rx latency and Data RxS latency between the two clock generators.
Fig. 11 Comparison in RAW, Ack and Data RxS
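The 10 ns example can be reproduced with a few lines of arithmetic. The sketch below assumes, as in the text, that the RAW is the off-phase of RClk and that request arrivals are uniformly distributed over the clock period; the function and variable names are hypothetical.

```python
# Worked check of the RAW comparison for a 10 ns clock period.
# Assumption: RAW = period minus the on-phase of RClk, and request
# arrival times are uniform over one period of LClk Rx.

def raw_traditional(period_ns: float) -> float:
    """RAW of the generator in Fig. 2: half of the clock period."""
    return period_ns / 2.0


def raw_optimized(period_ns: float, rclk_on_ns: float) -> float:
    """RAW of the generator in Fig. 10: period minus the RClk on-phase."""
    return period_ns - rclk_on_ns


def p_extended(period_ns: float, raw_ns: float) -> float:
    """Probability that a request arrives outside the RAW."""
    return 1.0 - raw_ns / period_ns


PERIOD_NS = 10.0    # clock period used in the example
RCLK_ON_NS = 1.5    # sum of the gate delays forming the RClk on-phase

results = {
    "traditional": raw_traditional(PERIOD_NS),
    "optimized": raw_optimized(PERIOD_NS, RCLK_ON_NS),
}
for name, raw in results.items():
    prob = p_extended(PERIOD_NS, raw)
    print(f"{name}: RAW = {raw:.1f} ns, P(extended) = {prob:.0%}")
```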
The fixed delay line D1 is employed in Fig. 10 to keep LClk at a 50% duty cycle. Since the delay of D1 is configured to match the total delay of AND0, AND1 and the MUTEX, as shown in (10), the delay path from LClk+ to LClk- and the delay path from LClk- to LClk+ are balanced. It is well known that there is no upper bound on the resolution time of MUTEX elements [21]. A practical solution is to estimate the resolution time based on the mean time between failures (MTBF) according to (11). From [1, 19], 40 FO4 inverter delays are sufficient for metastability resolution, i.e., for an MTBF of 10,000 years, which is long enough for normal applications. Considering the delays of AND0 and AND1 in (10), we can fix the delay length of D1 at 1.5ns, which equals about 50 d FO4. Based on the delay length of D1, the active phase of RClk is shown in equation (12), and furthermore, the duration of the RAW and the maximum acknowledge latency in this optimized clock generator are deduced as shown in equation (13). It reveals that the RAW is determined by the period of LClk. If T LClk > 100 d FO4, which typically represents the shortest clock cycle for standard-cell based SoCs [19], the optimized local clock generator provides a wider RAW than the traditional one shown in Fig. 2.
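The 100 d FO4 threshold follows from a one-line comparison. The derivation below is a sketch that assumes the RAW of the optimized generator is the clock period minus the delay of D1 (the form used in Section IV.B) and that the maximum acknowledge latency is the remainder of the cycle; the subscript "opt" and the name t_AckLat are introduced here for illustration.

```latex
\begin{align*}
t_{\mathrm{RAW,opt}} &= T_{\mathrm{LClk}} - d_{D1},
  \qquad \max\left(t_{\mathrm{AckLat,opt}}\right) = d_{D1} \\
t_{\mathrm{RAW,opt}} > \frac{T_{\mathrm{LClk}}}{2}
  &\iff T_{\mathrm{LClk}} > 2\,d_{D1} \approx 100\,d_{\mathrm{FO4}}
\end{align*}
```

With the delay of D1 fixed at 1.5ns, this means the optimized generator offers the wider window for any clock period above roughly 3ns.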
B. Optimized Input Port
In this section, a double latching mechanism is applied to widen the safe region, and the port controller is improved to minimize the uncertainty on clock stretching.
1) Double Latching Mechanism
To widen the safe regions for a clock tree delay meeting Δ LClkRx < T LClkRx, a double latching mechanism, which is based on the optimized clock generator, is proposed in Fig. 12. The first latch stage L1 loads the data from the transmitter, and then the second latch stage L2 feeds the data into the receiver. Since L1 and L2 are enabled by the acknowledge signals of the MUTEX, there is only one latch transparent at any time. Therefore, data is transferred by two mutually exclusive coupled latches in this scheme, instead of the single latch L in Fig. 1.
Fig. 12 Double latching mechanism
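The mutual exclusion between the two latch stages can be illustrated with a small behavioural model. This is a simplified sketch, not the circuit of Fig. 12: the Latch class, the mutex function and the event list are hypothetical, time is discretized into steps, and ties are resolved in favour of Req Rx, whereas a real MUTEX resolves them randomly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Latch:
    """Level-sensitive latch: follows d while enabled, holds otherwise."""
    q: int = 0

    def step(self, enable: bool, d: int) -> int:
        if enable:
            self.q = d
        return self.q


def mutex(req_rx: bool, rclk: bool, owner: Optional[str]) -> Optional[str]:
    """Grant at most one requester; the current owner keeps the grant
    until it withdraws its request. Ties go to Req Rx in this sketch."""
    if owner == "req" and req_rx:
        return "req"
    if owner == "rclk" and rclk:
        return "rclk"
    if req_rx:
        return "req"
    if rclk:
        return "rclk"
    return None


def simulate(events):
    """events: iterable of (req_rx, rclk, data_from_tx) tuples."""
    l1, l2, owner, trace = Latch(), Latch(), None, []
    for req_rx, rclk, data in events:
        owner = mutex(req_rx, rclk, owner)
        ack_rx = owner == "req"        # Ack Rx enables L1
        rclk_grant = owner == "rclk"   # RClkGrant enables L2
        l1.step(ack_rx, data)
        l2.step(rclk_grant, l1.q)
        trace.append((ack_rx, rclk_grant, l1.q, l2.q))
    return trace


trace = simulate([
    (True, False, 7),    # Req Rx granted: L1 loads 7, L2 holds
    (True, True, 7),     # RClk rises but waits for Req Rx to drop
    (False, True, 9),    # RClk granted: L2 loads 7, L1 ignores 9
    (False, False, 9),   # both latches opaque
])
for row in trace:
    print(row)
```

In every step at most one of the two enables is high, so the value seen by the sampling flip-flop changes only while RClkGrant is asserted.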
During the off-phase of RClk, RClkGrant remains low, and Data Rx is held stably in L2. Any LClk RxDly+ arriving at FF in the inactive phase of RClk can sample Data Rx safely. When RClk turns high, RClkGrant+ is triggered, and L2 is enabled to load Data* Rx. If Req Rx+ occurs simultaneously with RClk+, RClkGrant will be asserted by the MUTEX after a random resolution time. Consequently, any LClk RxDly+ falling in the on-phase of RClk could conflict with the loading of Data* Rx in L2. Therefore, the safe timing region for the clock tree distribution in this double latching mechanism is the off-phase period of RClk as shown in (14), which is exactly the same as the RAW in the optimized clock generator:
Two typical cases of (14) are worth noting. For small T LClkRx, there is half a clock cycle for clock tree insertion. For large T LClkRx, almost the entire clock period is safe. Fig. 13 illustrates that, for any clock period, the width of the safe timing region is approximately doubled in the double latching mechanism.
Fig. 13 Comparison of W S in two mechanisms

2) Optimized Input Port Controller
In the traditional input port controllers, Req Rx- is triggered by Req P-. This means that LClk Rx cannot be released until Req P is de-asserted by the output port controller. This accounts for the large and unpredictable stretching on LClk Rx, another factor leading to synchronization failures for a multi-cycle clock tree delay. In Fig. 14, we present an optimized signal transition graph for the asynchronous input port controller, and the corresponding logic synthesized with Petrify [22]. It is a poll-type input port controller with the transition-sensitive enable signal Pen Rx. Once LClk Rx has been paused, which is indicated by Ack Rx+ from the clock generator, the input port controller will assert both Ack P and Ta Rx, and the combination of Ack P+ and Ta Rx+ will then trigger Req Rx- to de-assert Ack Rx, which signifies the release of LClk Rx. These transitions are highlighted in Fig. 14(a), and their delay determines the on-phase period of Ack Rx and the maximum stretching on LClk Rx. The longest delay path from Ack Rx+ to Req Rx- is shown as the red line in Fig. 14(b), which consists of only 4 complex gates. So the stretching on LClk Rx introduced by this optimized input port controller is small and predictable, as shown below:
We use n CT to denote the maximum number of rising edges in the clock tree at one time, and it also determines the maximum number of the stretching on LClk Rx within the clock tree delay. To sample data correctly, a common safe region is required for the multi-cycle clock tree delay. The location of the common safe region provided by the input port is shown in equation (16), and the width of safe regions for different n CT is deduced in equation (17) . As an example, a scenario for n CT =3 is depicted in Fig. 15 . Note that the transfer acknowledge signal Ta Rx is required in the receiver to indicate the arrival time of valid input data. So an additional latch, which is enabled by signal RClkGrant, is needed in the input port to synchronize Ta Rx to Data Rx .
3) Performance Comparison
Compared to the other schemes proposed in [11, 12, 19, 20] for GALS design with multi-cycle clock tree delays, this timed local clock tree insertion has the following advantages:
a) Low latency. In the double latching mechanism, L1 and L2 get enabled to latch the input data in opposite phases of RClk Rx, and then the FF samples the data immediately at the next rising edge of LClk RxDly. The maximum latency for data synchronization is only one clock cycle. Fig. 16 depicts the typical data flow in the improved input port.
Fig. 16 The typical data flow
b) Low overhead. Except for an additional stage of latches used in the double latching mechanism, there is no extra overhead caused by the input port. Also, there is no timing restriction on the delay of the asynchronous port or the combinational logic following the sampling register FF.
IV. EXPERIMENTAL RESULTS
A. Input Wrapper Simulation
As a simple example, an asynchronous input wrapper with the optimized clock generator and the novel input port was designed and simulated at transistor level using the IHP 0.13-µm CMOS process. The delay slice shown in [18], whose delay is measured to be 0.13ns in the experiment, is used to generate the delay lines. According to the analysis in Section III.A, the delay of D1 is fixed to be 1.56ns (12 delay slices), and the delay of D0 is programmed to be 0.52ns (4 delay slices). Hence, the period of LClk Rx is 4.16ns (240MHz). In the input port controller, the on-phase duration of Ack Rx, t AckRx=1, is measured to be about 0.67ns, which determines the maximum clock stretching on LClk Rx. Based on the above timing parameters, we derive the safe region widths from equation (16).
The simulation waveforms of the input wrapper, obtained using Cadence Virtuoso Spectre for the case of n CT = 2, are shown in Fig. 17, where the port request Req P is asserted every 6.6ns (150MHz) in association with 16-bit input data. First, the exact location of the common safe timing region, which is covered by the green area in Fig. 17, is calculated using equation (16). The clock tree delay is then set to represent a multi-cycle clock tree delay falling inside this common safe region in the receiver. Fig. 17 illustrates the transfer of the first 3 data items in the input wrapper. Each input data item is first loaded in L1 when Ack Rx = 1, then loaded in L2 when RClkGrant = 1, and finally sampled by FF at the next LClk RxDly+. As Ack Rx and RClkGrant are used as gating signals, their active phases should satisfy the minimum pulse width restriction, which is 0.14ns in the IHP 0.13-µm process. According to equation (15), t AckRx=1 is dominated by the delays in the input port controller and is measured to be 0.67ns. As to t RClkGrant=1, it takes its minimum value, min(t RClkGrant=1), when LClk Rx is stretched, which is measured to be 0.28ns. Therefore, safe latching is guaranteed in the double latching scheme.
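The reported timing figures can be cross-checked with simple arithmetic. The sketch below assumes that the clock period equals twice the total delay of D0 and D1, with gate delays neglected, which is consistent with the reported 4.16ns; the variable names are hypothetical.

```python
# Cross-check of the input wrapper timing parameters.
# Assumption: period of LClk Rx = 2 * (d_D0 + d_D1), gate delays neglected.

SLICE_NS = 0.13        # measured delay of one delay slice
MIN_PULSE_NS = 0.14    # minimum pulse width of the process


def delay_line_ns(slices: int) -> float:
    """Delay of a delay line built from the given number of slices."""
    return slices * SLICE_NS


d_d1 = delay_line_ns(12)          # fixed delay line D1
d_d0 = delay_line_ns(4)           # programmable delay line D0
period_ns = 2 * (d_d0 + d_d1)     # ring oscillator period
freq_mhz = 1e3 / period_ns

# Measured active phases of the two latch-enable signals, in ns.
pulses = {"Ack Rx": 0.67, "RClkGrant (min)": 0.28}
margins = {name: width - MIN_PULSE_NS for name, width in pulses.items()}

print(f"D1 = {d_d1:.2f} ns, D0 = {d_d0:.2f} ns")
print(f"T = {period_ns:.2f} ns, f = {freq_mhz:.0f} MHz")
for name, margin in margins.items():
    print(f"{name}: pulse-width margin = {margin:.2f} ns")
```

Both enable pulses exceed the 0.14ns minimum, which is the condition the text relies on for safe latching.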
Once data is loaded in L2, it will be sampled by FF immediately at the next rising edge of LClk RxDly . Even if LClk Rx is stretched and RClkGrant+ is delayed, as seen in the transfer of data D1 in Fig. 17 , there is a sufficient timing margin from loading data in L2 at RClkGrant+ to sampling data by FF at the next LClk RxDly +. Therefore, safe data synchronization is achieved. Also, there is no additional latency in synchronization in the input wrapper. 
B. Point to Point Communication
A point-to-point GALS system as discussed in Section II.B is implemented at gate level to demonstrate the throughput increase from the optimized scheme. On the transmitter side, the delay line is configured to generate a series of different clock periods T LClkTx as shown in Tab. 3. On the receiver side, the delay of the ring oscillator is fixed to be 12.48ns (96 delay slices), thus T LClkRx is 24.96ns. Given T LClkRx and T LClkTx, the parameter N is deduced from (3). For each value of T LClkTx in Tab. 3, the traditional scheme in Fig. 1 was first used to transfer 32 data items. Then the proposed clock generator and input port were applied in the scheme, and the simulation was run for exactly the same duration. Tab. 4 lists the number of data transfers accomplished using the optimized scheme and the percentage improvement in data rate compared to the traditional scheme. It shows that the optimized scheme leads to much higher throughput, and the increase becomes more pronounced for large values of N. As an example, Fig. 18 presents a waveform fragment for N=7. In Fig. 18(a), where the traditional scheme is used, the RAW of the clock generator is T LClkRx /2 = 12.48ns. All Req Tx are extended by T LClkRx /2 and LClk Tx is periodically suspended. In Fig. 18(b), where the optimized clock generator is adopted, the RAW is (T LClkRx - d D1) = 23.4ns, and no extension in Req Tx or suspension in LClk Tx occurs. As a result, 8 data items are transferred in Fig. 18(b), as shown in signal Data Rx, but only 7 data items are transferred in Fig. 18(a). An improvement in throughput of as much as 1/7 is achieved, which matches well with the theoretical deduction in equation (5).
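The numbers quoted for the N=7 case can be verified directly. The sketch below takes the slice delay and the delay of D1 from Section IV.A and the item counts from Fig. 18; it only restates reported values, and the variable names are hypothetical.

```python
from fractions import Fraction

# Receiver-side parameters reported for the point-to-point experiment.
SLICE_NS = 0.13                  # delay of one delay slice
D_D1_NS = 1.56                   # fixed delay line D1 (12 slices)

half_period_ns = 96 * SLICE_NS   # ring oscillator delay, 96 slices
t_rx_ns = 2 * half_period_ns     # period of LClk Rx

raw_traditional_ns = t_rx_ns / 2
raw_optimized_ns = t_rx_ns - D_D1_NS

# Data items transferred in the same time window for N = 7 (Fig. 18).
items_traditional = 7
items_optimized = 8
improvement = Fraction(items_optimized - items_traditional, items_traditional)

print(f"T_LClkRx = {t_rx_ns:.2f} ns")
print(f"RAW: traditional = {raw_traditional_ns:.2f} ns, "
      f"optimized = {raw_optimized_ns:.2f} ns")
print(f"throughput improvement = {improvement} ({float(improvement):.1%})")
```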
V. CONCLUSION
Pausible clocking based GALS design has been widely studied and is accepted as an approach to high-performance SoCs and NoCs. However, some critical issues in the well-known pausible clocking scheme need to be solved. In this paper, we have shown that up to a one-third reduction in throughput can be introduced by the acknowledge latency of pausible clock generators. We have also demonstrated that, due to the uncertainty of clock stretching, there is no safe region for a clock tree with a multi-cycle delay. To address these issues, the pausible clocking scheme has been optimized in two respects. First, the local clock generator is optimized to minimize the acknowledge latency; second, a novel input port is applied to maximize the safe region for the clock tree delay. The proposed scheme has been verified in the IHP 0.13-µm CMOS process. This work contributes to improving throughput and reliability in GALS system design.
