Introduction
Current architectural trends are driven by the observation that simply creating larger and more complex monolithic cores or IP blocks is rarely the best use of growing transistor budgets. A more flexible and scalable approach is to create a network of simpler IP blocks. Technology scaling is exploited by adding additional blocks rather than increasing the complexity of each individual block. This communication-centric methodology aims to exploit a block size that produces an efficient circuit-level implementation and isolates the designer from the need to consider multi-cycle interconnect delays. Furthermore, by restricting each block's complexity we aim to avoid the pitfalls of employing ever more complex and power-hungry techniques to obtain ever-decreasing performance gains. The IP network also provides the flexibility necessary to operate in a fault-tolerant manner, manage power and thermal goals and produce the multi-use platforms dictated by rising design and NRE costs.
Much of the complexity in such a system is shifted from the design of individual IP blocks, concentrating on computation, to their interconnection, management and scheduling. In this environment the simplifying assumptions that a synchronous design style traditionally offers are less evident. In contrast to simply optimising combinational logic within a single clock domain, the process of integration requires us to consider a physically distributed system with significant process parameter, temperature and voltage variations. We must also span clock domains and handle multi-cycle interconnects. System timing is often further complicated by the application of voltage and frequency scaling and static power reduction techniques such as power gating. The challenges posed by the broad range of timing and communication requirements are perhaps more naturally tackled by adopting an event-driven control paradigm.
The techniques presented in this paper are designed to allow independently clocked IP blocks to be interconnected asynchronously, without the complexity of introducing additional clocks and synchronisers during the integration process. These systems are called Globally-Asynchronous Locally-Synchronous (GALS). Schemes are also described that can be used to clock the routers in more complex networked interconnects.
GALS on-chip networks share many of the benefits of fully-asynchronous implementations [1, 2, 11], although locally synchronous routers may be more appropriate in cases where speculation is exploited or complex scheduling operations need to be supported. Removing the global network clock in either case simplifies the process of making dynamic delay/energy trade-offs at the link level and removes the restrictions imposed on network topology by a global clock. The removal of the high-frequency widely distributed clock tree normally associated with an on-chip network is of course also beneficial. An asynchronous or data-driven locally clocked network is also a natural choice when interconnecting multiple chips or the different levels of a die stack.
The use of asynchronous techniques also provides a robust framework for power reduction schemes, such as the voltage scaling of on-chip interconnects and IP blocks.
Robust implementations of self-timed power gating and timing speculation [12] could also be supported using the approaches described here.
Local Clock Generators
A ring oscillator constructed from a tunable delay line and an inverter (Figure 1(a)) may be used as the basis for a flexible on-chip clock generator. The frequency of such a clock generator may be periodically calibrated to an off-chip reference clock [22]. In a GALS system, each synchronous block is clocked from a local clock generator of this type. When a free-running oscillator is employed the datapath clearly plays no role in the generation of the clock. However, by making small modifications to this basic oscillator circuit we will show how interesting and useful interactions with the datapath may be developed.
The circuit illustrated in Figure 1(b) is the starting point for many of the published schemes and those presented here. In this circuit, the ring oscillator has been extended to require both an event on the req input and on the output of the delay-line before the next clock edge is generated. This is enforced by the use of a C-element that operates as an AND-gate for events [31]. By enforcing a strict four-phase handshake on the interface we are guaranteed a minimum clock period determined by the delay-line. In addition, we now have the opportunity to stretch the clock period by delaying the completion of the handshake. The circuit may also be viewed as a single stage of a micropipeline with the output handshake ports connected together [32]. We call this circuit a data-driven clock. As it stands, this clock simply generates a single clock cycle in response to each incoming input request.
If we complete the handshake by simply inserting an inverter between the ack and req ports, as illustrated in Figure 1(c), we produce a simple ring oscillator. A well-documented approach to producing a pausible clock is to interrupt this cycle with a mutual-exclusion element (MUTEX) [31] (see also Appendix A). This produces a clock that will normally oscillate unless we interrupt it by holding req high. This is in contrast to the data-driven clock where a complete handshake must take place during every clock cycle. This pausible clock circuit is illustrated in Figure 1(d).
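As an illustration only, the behaviour of the C-element and of the resulting data-driven clock can be sketched as a small behavioural model. The Python below is not taken from any of the cited designs: the names CElement and data_driven_clock, the abstract time units, and the assumption that each entry in req_times stands for one complete four-phase handshake are ours.

```python
class CElement:
    """Muller C-element: the output changes only when both inputs agree."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:
            self.out = a  # both inputs agree, so the output follows them
        return self.out   # otherwise the previous value is held


def data_driven_clock(req_times, half_period):
    """Return rising-edge times of a data-driven clock (abstract model).

    req_times   -- times at which req rises, one handshake per entry (assumed)
    half_period -- delay-line delay; the minimum period is 2 * half_period
    """
    edges = []
    ready = 0.0  # earliest time the delay-line permits the next rising edge
    for t in sorted(req_times):
        edge = max(t, ready)  # the C-element waits for both events
        edges.append(edge)
        ready = edge + 2 * half_period
    return edges
```

With requests at times 0, 1 and 10 and a delay-line delay of 2, this model places rising edges at 0, 4 and 10: the second edge is held back to respect the minimum period, while the third is simply stretched until its request arrives.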
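The contrast with the data-driven clock can be seen in an equally simple timing model. The sketch below uses a hypothetical function pausible_clock_edges and makes several simplifying assumptions: each (start, end) interval represents req holding the MUTEX, the intervals do not overlap, a request that is pending when a rising edge is due always wins the arbitration, and the time needed to resolve metastability is not modelled.

```python
def pausible_clock_edges(period, pauses, n_edges):
    """Return the first n_edges rising-edge times of a pausible clock.

    period -- nominal clock period (twice the delay-line delay)
    pauses -- non-overlapping (start, end) intervals during which req
              holds the MUTEX and blocks the next rising edge (assumed)
    """
    edges = []
    t = 0.0
    for _ in range(n_edges):
        for start, end in sorted(pauses):
            if start <= t < end:
                t = end  # the edge is postponed until the MUTEX is released
        edges.append(t)
        t += period      # the ring then oscillates freely again
    return edges
```

With a period of 10 and a single pause over the interval 18 to 23, the edges fall at 0, 10, 23 and 33. Unlike the data-driven model, edges are produced even when no request is ever made.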
The majority of the clock generator circuits described are based on either a data-driven or pausible clock template. The ability to stretch or delay the clock may be exploited in a number of ways. The original purpose of clock pausing was to create additional time for metastability to resolve, e.g. to permit the safe transfer of data between different clock domains. An additional reason may be to create a data-driven clock that produces clock edges only when data is available for processing. This type of data-driven operation mimics a high-speed global clock without the associated synchronisation overheads and superfluous switching activity.
Input Ports
We distinguish between three different input behaviours for locally-clocked modules with multiple inputs. In each case we assume that once an input request is made it remains asserted until it is serviced.
• Arbitrated Inputs: At most one input request may be serviced per clock cycle. This requires the inputs to arbitrate for access to the IP block.
• Sampled Inputs: An event triggers the sampling of all input ports. This sampling determines which inputs have data that is ready to be admitted on the next clock cycle. The sampling event is either a (delayed) clock-edge or the arrival of an input request. The precise choice of sampling event depends on the type of local clock generator.
• Synchronised Inputs: A request to admit data is only generated when valid data is present on all inputs.
Each of the behaviours described could be supported by a subset of a block's inputs. The synchronised input behaviour could also be trivially extended to wait for some subset of the block's inputs to become ready.
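The three behaviours can be summarised as selection functions over the set of pending requests. This is a minimal sketch with hypothetical function names; in particular, the fixed lowest-index priority used for the arbitrated case is an assumption made to keep the example deterministic, whereas a real arbiter makes no such guarantee.

```python
def arbitrated(requests):
    """Grant at most one pending request per clock cycle.

    Lowest index wins here; this fixed priority is an assumption.
    """
    for i, pending in enumerate(requests):
        if pending:
            return [i]
    return []


def sampled(requests):
    """Admit every input whose request is pending at the sampling event."""
    return [i for i, pending in enumerate(requests) if pending]


def synchronised(requests):
    """Admit data only when valid data is present on all inputs."""
    if requests and all(requests):
        return list(range(len(requests)))
    return []
```

For the request vector [False, True, True], the arbitrated port admits input 1 only, the sampled port admits inputs 1 and 2, and the synchronised port admits nothing until input 0 also becomes ready.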
Scheduled communications could also be supported by the clock generators presented here, requiring data to be read from a specified input (or written to a specified output) port on a particular clock cycle. The Synchro-Tokens methodology [15] may be used to construct deterministic GALS systems supported by a mechanism for communicating at regular intervals (or recycle periods).
Data-Driven Clocks
The data-driven clock circuit (Figure 1(b)) may be extended to support each of the input behaviours described in Section 3. Each of these circuits is illustrated in Figure 2. While arbitrated and synchronised inputs are trivial to implement, the sampled inputs scenario requires some explanation. A data-driven clock with sampled inputs may be useful in an environment where an IP block may make forward progress regardless of the number of inputs that are ready. One example of such a block may be a locally-clocked router in an on-chip network. In this scenario, the detection of an input port request forces a decision to be made on whether to admit data from each input port on the next clock cycle. In the case of an on-chip router, additional clock cycles would have to be generated to guarantee that packets buffered within the router make forward progress when no new input data is forthcoming. How these additional cycles could be generated is discussed in Section 4.1.
The circuit illustrated in Figure 2 (b) supports a data-driven sampled-input behaviour using a circuit that takes inspiration from the design of a static priority arbiter [5] . The circuit is quiescent until a request is made by one of the input ports. A lock request is then asserted to force each MUTEX to grant either the lock or input port request. Only after it has been determined from which input ports data will be admitted in the next clock cycle, will a new rising clock edge be generated. To improve performance the lock signal is prevented from asserting before the falling clock edge. This prevents inputs from being 'locked out' early in the clock cycle.
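The effect of holding back the lock signal until the falling clock edge can be illustrated with a simple timing abstraction. The function sample_inputs below is hypothetical: it ignores MUTEX and gate delays, and it assumes that a request arriving at exactly the moment the lock is asserted is granted.

```python
def sample_inputs(request_times, falling_edge):
    """Decide which inputs are admitted on the next clock cycle.

    request_times -- arrival time of each port's request, None if idle
    falling_edge  -- time of the falling clock edge; lock is not
                     asserted before this point
    Returns (lock_time, admitted_port_indices).
    """
    pending = [t for t in request_times if t is not None]
    if not pending:
        return None, []  # the circuit stays quiescent
    # The first request triggers the lock, but never before the falling edge.
    lock_time = max(min(pending), falling_edge)
    admitted = [i for i, t in enumerate(request_times)
                if t is not None and t <= lock_time]
    return lock_time, admitted
```

With requests arriving at times 5, 12 and 7 on ports 0, 2 and 3 and a falling edge at time 8, ports 0 and 3 are both admitted. If the lock were allowed to assert as early as time 5, port 3 would be locked out until the following cycle.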
Pipelines and Flushing
In contrast to a pausible clock generator where the clock is normally running, a data-driven clock only produces clock edges in response to input data. In some cases the architecture of the IP block may require additional clock cycles to be generated to complete an operation, e.g. in the case of a pipelined IP block or one that buffers data. These additional clock cycles may be generated in a variety of ways:
• Eager Flush: A further clock cycle is generated without delay if the IP block has useful work to complete. In the case of a pipelined block a counter may be used to request these additional cycles. The counter is initialised to the pipeline depth after each successful input request. The counter subsequently requests clock cycles, decrementing its value on each cycle, until it reaches zero. If a sampled data-driven approach is used the counter would be responsible for asserting a lock request in the case when no valid input data was present. An alternative is to replace the counter with logic that examines datapath signals directly to determine if useful work is outstanding.
• Time-Out Flush: A slightly different approach is to wait for some predetermined time before initialising the counter. Only when the time-out occurs are the additional clock cycles generated. This approach potentially reduces the total number of clock cycles generated by providing an opportunity for new data to push previous values through the pipeline.
• Uninterrupted Flush: Some IP blocks may require an uninterrupted pipeline flush mechanism. In this case the pipeline is flushed before any new input data is allowed to enter the IP block.
• Pull-Driven Flush: Krstic et al [18] suggest switching from a data-driven (push) to a pull-driven mode when no new input requests are forthcoming, although no details of such an implementation are provided.
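The counter-based eager and time-out flush schemes can be compared with a short cycle-level sketch. The function flush_schedule and its step-based notion of time are assumptions made for illustration: each list element represents one opportunity to generate a clock cycle, and timeout counts the idle steps to wait before the counter is initialised (zero gives the eager behaviour).

```python
def flush_schedule(inputs, depth, timeout=0):
    """Return, per step, whether a clock cycle is generated.

    inputs  -- per-step flag, True when an input request is serviced
    depth   -- pipeline depth, used to initialise the flush counter
    timeout -- idle steps to wait before flushing (0 = eager flush)
    """
    counter = 0   # flush cycles still to be requested
    idle = None   # idle steps since the last input; None if nothing is waiting
    clocked = []
    for valid in inputs:
        if valid:
            idle, counter = 0, 0  # new data pushes earlier values forward
            clocked.append(True)
            continue
        if idle is not None:
            if idle >= timeout:
                counter, idle = depth, None  # time-out: start the flush
            else:
                idle += 1
        if counter > 0:
            counter -= 1
            clocked.append(True)
        else:
            clocked.append(False)
    return clocked
```

For a pipeline of depth 2 and inputs serviced on steps 0 and 2 of an eight-step window, the eager scheme generates five clock cycles while a time-out of two steps generates four, because the second input arrives in time to push the first value forward.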
Related Work: Active Clock Handshake Interfaces
The handshake interface on the data-driven clock illustrated in Figure 1(b) is passive, i.e. it can only respond to an external request. By swapping the req and ack ports we can create a data-driven clock with an active handshake port. The clock now acts as a request that the environment must acknowledge. To be able to generate a clock the incoming req signal must be inverted (or held high to indicate that the environment is ready). Kessels et al [17] explore such an approach to enable communication between a fast processor and slow memory.
An active clock handshake interface can still support the full range of input behaviours previously discussed. In addition, it is perhaps more natural in some cases to think of some communications as conditional [17] , rather than scheduled. Arbitrated access to a single passive resource from multiple clock modules may now also need to be considered.
Related Work: Request-Driven Clocking
A data-driven clocking mechanism with a time-out based flushing mechanism appears in a paper by Krstic et al [18, 19] . They use the term request-driven clock to describe their scheme.
Related Work: Clock Stretching
Clock stretching is a form of data-driven clocking where the handshake interface is replaced with a single stretch control signal that is asserted synchronously. The relationship with the basic data-driven clock circuit may be highlighted by redrawing the clock stretching circuit as illustrated in Figure 3(a) . The clock handshake port is now an active one as discussed in Section 4.2. Figure 3(b) simply removes the AND-gate by converting the C-element into an asymmetric gate. Both these circuits are equivalent to Bormann's stretchable clock generator [3, 4] . This type of clock stretching is exploited in a recent processor array targeting DSP applications [34] . In this design 36 processors are independently clocked by local stretchable clock generators.
Seitz describes a clock stretching scheme to permit asynchronous communication between synchronous systems in his chapter of Mead and Conway's book [29]. He indicates that this approach has been used in various proprietary designs since 1968.
Pěchouček observes that the ability to stretch the clock is the only solution which guarantees value-safe communication between independently clocked modules. He describes a system for extending the clock period of a system until metastability has resolved [27] . Pěchouček also outlines a data-driven clocking scheme where the generation of a fixed number of clock cycles is triggered by the arrival of input data. This type of clocking scheme was more recently employed to create an on-chip clock generator for a DSP [25] and a data-driven GALS clocking scheme for a low-power reconfigurable processor [36] . There are no synchronisation issues with such a scheme as the clock is always quiescent when the initial asynchronous data input arrives. This could be achieved in a robust manner by ensuring the ack signal from the data-driven clock is not deasserted until the processing of the data has completed.
Chapiro investigates the use of stretchable clock generators and introduces the term Globally-Asynchronous Locally-Synchronous (GALS) to describe synchronous systems composed using such interfaces [6].
Lim describes the use of a stoppable clock generator [20] where a single input to the clock generator is again used to delay the generation of the next rising clock edge until data is available. Lim also describes the use of a MUTEX to provide an arbitrated input behaviour. 
Pausible Clocks
Pausible clock circuits may be constructed using the simple template provided in Figure 1(d) as a starting point. Figure 4 illustrates how a pausible clock can support each of our input behaviours.
The tree arbiter shown in Figure 4 allows an input request to be initiated while it is determined which input port should be granted access [16] . If necessary, the eager request generation could be omitted. A tree-arbiter implementation is provided in Appendix A. Previous work has illustrated how pausible clock generators may be used to facilitate point-to-point value-safe communication between independently clocked IP blocks [23] . Figure 5 illustrates the receiver side of such a communication (note, the handshake protocol used here is a two-phase one). Data is latched safely in the first input register while it is guaranteed no rising clock edge can take place. After this operation is complete and the input request is removed, a rising clock edge is generated that safely transfers the input data into the synchronous domain. It is our belief that all existing high-throughput pausible clock schemes latch the input data in this manner.
An alternative is to replace the MUTEX element with an arbitrated call (see Figure 6) . A new rising clock edge may now be requested by either the inverted clock or new input data. If the input port is granted we can safely enable the input register and allow new data to enter the synchronous block. This approach reduces the chance that the clock period is extended by removing the need to block the generation of the next rising clock edge while the data is latched and the handshake is completed. An implementation of an arbitrated-call element is provided in Appendix A. 
Related Work: Lim's Operation Module
Lim [20] describes an extension to a data-driven clocking scheme where the synchronous block is designed to make forward progress without requiring a constant stream of input data. The clock is now normally enabled to run by allowing the module itself to make requests for further clock edges. The scheme is illustrated in Figure 7 . If the check input port signal is low, the start clock input to the clock generator will remain high allowing the clock to oscillate. Input data may only be admitted when the check input port signal is asserted. Lim suggests this may be done periodically or every cycle. To enable input data to be admitted on every clock cycle the clock itself could be used as the check input port signal. This produces a circuit close to our pausible clock template. The idea of generating a new rising clock edge independently of which MUTEX input is granted is also exploited in our arbitrated-call based pausible clock generator. The approach may be classified as a pausible clock generator with a scheduled sampling of input ports.
Related Work: Asynchronous Synchroniser Elements, Q-Elements and DFLOPs
The sampling or synchronising mechanism of the pausible clock may be applied at the level of a single register.
The Amulet3 interrupt synchroniser is one example of this approach [13] . The synchroniser circuit is reproduced in Figure 8 . This circuit is one component of the synchronous consumer circuit shown in Figure 5 . One can think of this circuit as a register with an asynchronous (write) handshake interface. The handshake interface is required as the time required to synchronise an input is unknown (and unbounded).
Rosenberger et al developed a technique to build delay-insensitive modules by exploiting input registers with asynchronous handshake interfaces [28]. These Q-modules operate in two distinct phases initiated by falling and rising clock edges. On a falling clock edge each input register (Q-element) samples its input and records this value, without updating its output. The subsequent rising clock edge is now delayed until each input register has acknowledged the completion of this operation. A rising clock edge is then generated prompting each Q-element to copy the synchronised input value to its output. Finally, when this 'update output' operation has been acknowledged by all registers, time is scheduled for the computation itself to take place. We may construct a Q-element from the synchroniser circuit shown in Figure 8. We simply need to add an additional output register that is updated on receipt of a rising clock edge. The acknowledge output (A) is implemented with an SR-latch. Note, the Q-module specification requires this acknowledge to make a transition from high to low to indicate the current input value has been synchronised or read. The acknowledge is reset when the output is updated by a rising clock edge. The circuit is illustrated in Figure 9(a). The Q-element described by Rosenberger et al is a more direct implementation of the required behaviour. A static-logic reimplementation of this is provided in Figure 9(b). This circuit is in the style of the original Q-element, although the emphasis here is on the major components of the circuit and their interfaces. VanScheik and Tinder have described other possible implementations and optimisations [33].
The derivation of a synchroniser circuit is also undertaken by Nystrom and Martin [26] . Here the circuit is designed to also cope with the removal of an input before it has been sampled.
Related Work: Pausible Clocks
The pausible clock circuits described generate a clock even when no input data is present. For some applications it is desirable for the clock generator to enter a sleep state with the clock stopped until new data arrives. Our earlier work explored a sleep mechanism of this type [23] . Here the sleep request is asserted synchronously by the clocked module.
The pausible clock control (PCC) circuit implemented by Yun and Dooply [35] closely resembles the pausible [...]
ETH Zurich have produced many different GALS test chips using pausible clock approaches; a recent paper by Gurkaynak et al provides an overview of their experiences [14].
Output Ports
In this section we briefly discuss different output port behaviours.
• Scheduled: This type of port is used when an output operation must be completed on a particular clock cycle, e.g. when the output of the synchronous block is not registered. This port type will stall the generation of the next clock cycle until the data is successfully consumed.
• Registered: The addition of an output register permits the output operation and the next computation to take place concurrently. A registered output port only need stall the clock when the output becomes blocked for an extended period. At this point any further clock edges must be prevented to ensure data in the output port register is not overwritten.
• Polled: This type of port polls the output to determine when it is safe to send data. The clock is interrupted only in cases where additional time is required to resolve any metastability occurring due to the sampling of the asynchronous output port ready signal. The synchronous block is responsible for coping with blocked output ports.
The implementation of each of these output port behaviours requires no new techniques. Each may be based upon the existing input port and clock generator templates.
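The conditions under which each output port style holds back the next clock edge can be summarised as a single predicate. This is a sketch under stated assumptions: the function must_stall and its string labels are hypothetical, and the brief clock interruption that a polled port may need to resolve metastability is deliberately not modelled.

```python
def must_stall(style, consumer_ready, register_full=False):
    """Return True if the next clock edge must be held back.

    style          -- "scheduled", "registered" or "polled"
    consumer_ready -- True if the environment can accept data now
    register_full  -- True if the output register still holds unsent data
    """
    if style == "scheduled":
        # The output must be consumed on this particular clock cycle.
        return not consumer_ready
    if style == "registered":
        # Stall only if the register would otherwise be overwritten.
        return register_full and not consumer_ready
    if style == "polled":
        # The synchronous block itself copes with a blocked output port.
        return False
    raise ValueError(f"unknown output port style: {style}")
```

A registered port with an empty output register therefore never stalls the clock, even if the consumer is momentarily busy, which is what allows the output operation and the next computation to overlap.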
An example of how previously discussed approaches may be combined to produce specific input and output port behaviours is illustrated in Figure 10 . This clock generator supports both a sampled input port (based upon a pausible clock) and a registered output port implemented using a stretchable clock. Each new port of a different style requires its own handshake port on the clock generator. Figure 2 (c) shows a clock generator with N handshake ports. It should be noted that combining different port types may alter the behaviour of individual ports or prevent some ports accepting any new data. Special care must be taken when combining both ports based on data-driven (stretchable) clocking templates and those constructed from the pausible clock template.
The circuit in Figure 10 is a simple example that could be improved. One extension would be to allow the output port to be clocked as soon as the computation is complete, even in the case when the input port is pausing the clock. An in-depth discussion of such optimisations and each of the possible output port circuits is beyond the scope of this paper.
Clock Tree Insertion Delays
In all the previous examples it has been assumed that the delay imposed by the clock tree is insignificant. In reality this clock insertion delay may vary from a few gate delays to many clock cycles. The precise delay will depend on the number of sequential elements in the synchronous block and its physical size. The design of a traditional synchronous system is mostly unaffected by this delay as there is no reason to distinguish between different clock edges produced by the clock source. If clock gating or the techniques described here are adopted, an association is made between particular clock edges and datapath operations. This forces us to consider the clock tree insertion delay in any analysis of the circuit.
This section outlines how the impact of clock tree insertion delays may be minimised. The analysis is presented for a simple data-driven clock, but applies equally to any local clock wrapper.
The circuit illustrated in Figure 11 (a), shows a simple data-driven clock generator clocking a single input register. The clock tree insertion delay is shown as a chain of buffers. To guarantee that input data is correctly latched we must ensure an input request is only acknowledged after the input register has been clocked. This ack signal must therefore be delayed by at least the time taken to propagate a clock edge through the clock tree [9] .
The concern now is that if the clock insertion delay is greater than half the clock period the clock will always be extended. In this case the clock period is increased from twice the delay of the delay-line to at least twice the clock tree insertion delay.
If we are certain the clock tree insertion delay is less than one clock cycle we can simply add an additional register to buffer the input data. This scheme is illustrated in Figure 11(b). The additional latch holds the input data until it can be clocked into the synchronous module, allowing the handshake to complete quickly. If the clock insertion delay were to exceed one clock cycle this scheme would fail, as new data would be latched in the first input register before the previous data item had been copied to the second. The constraint that there is at most one rising clock edge in the clock tree applies to many of the published aperiodic clocking schemes.
The time between the clocking of the first input register and the second is equal to the clock tree insertion delay. As this delay is fixed we may insert combinational logic between the registers to complete useful work during this period. Naturally, the delay of this combinational logic plus the register setup time must be less than the insertion delay ($t_{dXY} + t_s < t_i$).
It should be noted that an input register will usually require some buffering to distribute clock and enable signals, forcing the ack to be delayed to some degree.
Figure 11. Accounting for the clock insertion delay when generating a data-driven clock (panels (a) and (b)).
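The timing relationships described for the two schemes of Figure 11 can be collected into a few checks. The function names and the example values below are illustrative assumptions; the period returned for the single-register scheme is only the lower bound stated in the text, not an exact figure.

```python
def min_period_single_register(t_delay_line, t_insertion):
    """Lower bound on the clock period when ack waits for the clock tree.

    The period is twice the delay-line delay unless the insertion delay
    is larger, in which case it is at least twice the insertion delay.
    """
    return 2 * max(t_delay_line, t_insertion)


def buffered_scheme_ok(t_delay_line, t_insertion):
    """The extra input register works only if the insertion delay is
    less than one clock period (at most one rising edge in the tree)."""
    return t_insertion < 2 * t_delay_line


def inter_register_logic_ok(t_logic, t_setup, t_insertion):
    """Logic placed between the two input registers must settle, with
    setup time to spare, within the insertion delay."""
    return t_logic + t_setup < t_insertion
```

For example, with a delay-line delay of 1.0 and an insertion delay of 1.5 (arbitrary units), the single-register scheme stretches the period from 2.0 to at least 3.0, whereas the buffered scheme remains valid because 1.5 is less than one period.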
Hiding Small Insertion Delays
In some cases the additional latency incurred by even a relatively small insertion delay will be unacceptable. In these cases, latency can be minimised by considering the clocking of the synchronous module's input registers separately from the clocking of its output and state registers. Two clock trees are now generated, one small low-latency tree to clock the module's input registers and a larger one to clock the module's state and output registers. The larger insertion delay is hidden by initiating a new clock edge early from a tap within the delay-line. The total delay to the tap plus the clock-tree insertion delay ensures the state/output registers are clocked one clock period after the input register. This "skewed tree" scheme is only applicable in cases where the insertion delay of the state/output register clock tree is less than half a clock cycle.
Depending on the output port style employed it may in some cases be necessary to stall the clocking of the output registers. Schemes could also be developed that clocked input, state and output registers at different times from different clock trees. For example, it may be possible in some designs to clock output registers before state registers, thereby reducing latency while maintaining correct operation. The introduction of additional clocks and matched delays moves the design methodology towards a bundled-data asynchronous one.
Multi-Cycle Insertion Delays
If a synchronous IP block is sufficiently large the clock insertion delay may exceed a single clock period. In this case additional input buffering is required to prevent the clock period from being extended significantly. The single input register added in Section 7 must now be extended to a FIFO memory. The number of elements in the FIFO reflects the maximum number of rising clock edges which may be present in the clock tree at one time. Any newly arrived input data must wait at least this number of clock cycles before it is admitted into the synchronous block. We must also guarantee that data is always able to arrive at the head of the input FIFO before the clock edge used to admit it into the synchronous block.
In some schemes, e.g. pausible clocks, data is not necessarily admitted on every clock cycle. In these cases, we must carefully record on which clock cycle data has been scheduled to be admitted. The decision to admit data or not is readily available in many of the clock generators presented. This dual-rail value may be queued and subsequently used to admit data on the correct clock cycle at the associated input register. An outline of this scheme is shown in Figure 12 .
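The queueing of admit decisions can be sketched as a fixed-length delay queue. Both functions below are hypothetical, and the FIFO depth formula (the insertion delay divided by the clock period, rounded up) is our reading of "the maximum number of rising clock edges which may be present in the clock tree at one time" rather than a figure given in the text.

```python
import math
from collections import deque


def fifo_depth(t_insertion, period):
    """Assumed FIFO depth: rising edges in flight in the clock tree."""
    return max(1, math.ceil(t_insertion / period))


def admit_schedule(decisions, depth):
    """Delay each admit decision by `depth` clock cycles.

    decisions -- per-cycle admit flag as decided at the clock generator
    Returns the per-cycle flag seen at the input register, so data is
    admitted on the clock edge that was dispatched for it.
    """
    in_flight = deque([False] * depth)  # decisions travelling with the edges
    at_register = []
    for admit in decisions:
        in_flight.append(admit)
        at_register.append(in_flight.popleft())
    return at_register
```

With a depth of 2, the decision sequence [True, False, True, False, False] appears at the input register as [False, False, True, False, True], which reflects the unavoidable latency of two clock cycles.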
Additional input buffering guarantees correct operation without extending the clock cycle time. The additional latency is unavoidable and can only be tackled by making input requests early with prior knowledge of the delays in the clock generator and clock tree.
The reader may consider the idea of 'promoting' data in the data FIFO so it may be read on an earlier clock cycle. Unfortunately, the clock cycle a data item will be admitted on cannot be rescheduled in bounded time. Furthermore, the clock edges delineating these clock cycles will have been dispatched. As they are already travelling through the clock tree their arrival time at the clock tree leaf cells cannot be influenced.
It should be noted that applying GALS techniques to systems composed from a small number of very large synchronous IP blocks is probably counterproductive even before clock-tree insertion issues are considered.
Related Work: Clock Tree Delays
Sjogren and Myers were the first to highlight the issues associated with clock insertion delays and stoppable clocks [30]. They focus on the need to handle insertion delays that are substantial but less than one clock cycle. Clock insertion delays are hidden in their handshaking protocol with the use of additional pipeline buffering.
It is useful to note that the 'state-holding gate' used in their stoppable clock circuit (and illustrated at the transistor level) is in fact an implementation of the asymmetric C-element as shown in Figure 3(b) .
Other published techniques [21, 10] for coping with large clock insertion delays depart from the value-safe communication principle and use time-safe techniques.
Summary
A wide variety of ingenious aperiodic clocking schemes have been published to date. The aim of this paper has been to illustrate the similarities between many of these approaches. The paper has also presented new mechanisms for supporting different kinds of input and output port and coping with clock insertion delays.
Current work is exploring the verification of local clock generators using the Veraci asynchronous circuit verifier [7] . Verified implementations of different port types together with formalised techniques for their composition, would form the major components of a GALS wrapper synthesis system. A similar system based on a library of Petri net models has also been recently proposed [8] . This work also compares the performance of data-driven and pausible clock schemes. Our current work is also examining the design and performance of on-chip network routers with local data-driven clocks [24] . 
