Abstract
Introduction
Current VLSI technologies offer the means to develop complex Systems on a Chip (SoCs). However, to take advantage of such devices in the production of advanced products while obeying time-to-market and physical constraints, designers must revise traditional design methods [1] [2].
IP Core reuse is regarded as one of the main techniques to reduce SoC development time [3]. However, efficient hardware reuse implies heterogeneous systems where each module works at its optimum operating frequency, achieves ideal latency and throughput values, etc. Ensuring such constraints for each and every processing module in a modern SoC is a daunting task [4]. The main assumption of traditional synchronous design, i.e. the adoption of a single clock signal for the whole system, is rapidly becoming a major limitation, for several reasons: (i) constraining all modules to work at a single clock frequency leads to suboptimal designs; (ii) clock distribution is responsible for a significant part of the total power dissipation in a SoC [5]; (iii) controlling clock skew in a submicron chip that must work at a high operating frequency is becoming intractable [6].
Asynchronous techniques can solve the above-mentioned problems. Nevertheless, the design of fully asynchronous systems suffers from the lack of a widely adopted design technique comparable to the RTL model for synchronous systems, as well as from a lack of mature CAD tool support. A strategy that bridges the gap between synchronous and asynchronous design is the adoption of globally asynchronous, locally synchronous (GALS) techniques [7]. GALS keeps synchronous methods and CAD tools for the design of each synchronous module and transfers the synchronization problem to the interfaces among modules. In this context, the use of NoCs with GALS communication support is emerging as a trend [8] [9] [10] [11]. NoCs are more scalable than traditional shared buses and allow more communication parallelism as well.
GALS design partitions the clock tree, reducing clock buffering needs as well as clock tree size and spread. This has the potential to significantly reduce overall chip power consumption [7]. In addition, the adoption of multiple clock domains enables modular power reduction techniques such as dynamic frequency scaling (DFS) and dynamic voltage scaling (DVS).
This work presents the proposal and evaluation of power control mechanisms in NoC routers supporting GALS systems. These mechanisms dynamically adapt the operating conditions of routers in response to communication requirements.
The remainder of the paper contains five Sections. Section 2 reviews NoCs supporting GALS design and NoC proposals providing power control features. Section 3 presents the architecture of two GALS routers, with their synchronization strategy and power control mechanisms. Finally, Sections 4, 5 and 6 respectively present experimental results, comparisons and conclusions.
Related Work
In this work, asynchronous communication assumes two synchronous modules with unrelated clocks, or with clocks of the same frequency but unrelated phases. The latter case is called a mesochronous (sub-)system. Two modules communicating asynchronously require a synchronizing interface between them, called here an asynchronous interface. Asynchronous interfaces are thus critical components for the overall performance of GALS systems [12].
Several NoC proposals supporting the GALS paradigm are present in the literature. These proposals can be classified according to the router implementation or according to the position of the asynchronous interfaces with respect to routers and IP Cores.
In NoCs employing synchronous routers, each router may have its own independent clock or the clock signal can be shared by all routers. In this case router and IP Core are either asynchronous [10] or mesochronous.
Kim et al. [10] present a synchronous NoC with asynchronous interfaces between the router and its local IP Core(s). The advantage of the approach is to decouple IP Core and router operating frequencies, but the clock tree can still spread along the whole chip.
The DSPIN NoC [14] employs synchronous routers with bi-synchronous FIFOs for the communication between each pair of routers and between a router and an IP Core. In the first case, the FIFO is designed for mesochronous communication, while in the second case it is designed to allow communication between unrelated clock domains. Bjerregaard et al. [15] present the architecture of a mesochronous NoC. The asynchronous interface employs a four-phase handshake protocol.
NoCs implemented with asynchronous routers are another possibility to develop GALS systems [8] [11]. Since such routers have no clock signal, the dynamic power dissipation can be reduced. However, since asynchronous circuits may incur significant area overheads, their static power may increase the overall power dissipated in the NoC.
Although the above NoCs enable the construction of GALS systems, these proposals do not include any power control mechanism in the NoC communication architecture.
Hsu et al. [16] present a frequency scaling low power mechanism, FSLP, to control the power dissipation of a SoC that uses a NoC as communication architecture. However, the power control mechanism is applied to the IP Cores connected to the NoC and not to the routers. The scheme controls the frequency of each IP Core depending on measured and required communication rates.
Simunic et al. [17] also propose a control system to determine the operating frequency of IP Cores connected to a NoC. The system uses power and QoS requirements to determine the best frequency for IP Cores.
Worm et al. [18] present a self-calibrating mechanism to control DVS in a NoC, aiming at a tolerable bit error rate. The proposed mechanism uses a module to detect errors and request retransmission. Retransmission can cause large latencies, thus violating the communication QoS requirements.
Ogras et al. [19] propose a method for partitioning a NoC-based GALS SoC. The objective is to define voltage and frequency islands (VFIs). The flow creates clusters composed of routers and IPs operating at the same frequency and voltage. Within each island, power can be reduced and controlled through the use of DVS. The communication between islands uses interfaces to cross the frequency-voltage domains. The islands are defined and fixed at design time.
This work differs from those mentioned above in its capacity to create frequency islands at run time based on message priorities. The power control mechanism is quite simple and implies low area overhead.
Router Architecture
This Section describes the Hermes GALS and the Hermes GALS Low Power NoC routers (Hermes-G and Hermes-GLP). Both routers are adaptations of the synchronous Hermes router [20]. The Hermes-G and Hermes-GLP NoCs built with the respective routers have a 2D mesh topology, employ the XY routing algorithm and use wormhole packet switching. Fig. 1 shows the general structure of a 3x3 Hermes NoC, where each router has an associated XY address. The detail shows the wire structure of a unidirectional link.
The Hermes-G Router
To obtain the Hermes-G router, the input buffer structure of the original Hermes router was modified to allow the router to support communication among modules operating at distinct frequencies. The bi-synchronous FIFO proposed in [13] replaces the original synchronous FIFO. The choice of this FIFO was motivated by its ability to perform data writing and reading in a single clock cycle. Fig. 2 describes the structure of the bi-synchronous FIFO. The FIFO implementation uses two pointers, one defining the next writing position and another defining the next reading position. The FIFO is either full or empty when both pointers refer to the same address, so the pointers must be compared. Although this procedure is trivial in synchronous circuits, it implies some complexity in asynchronous devices such as the bi-synchronous FIFO, because the pointers are generated by different clocks. The usual solution is to transfer the writing pointer to the receiver clock domain and synchronize it there to generate the empty signal, and mutatis mutandis for the full signal.
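The pointer comparison described above can be illustrated with a minimal single-clock sketch. It assumes that each pointer carries one extra wrap bit so that full and empty can be told apart; this is a common convention and not a detail stated in the text. The depth and all names are hypothetical, and clock domain crossing is deliberately ignored here.

```python
# Full/empty detection by pointer comparison in a FIFO of depth DEPTH.
# Assumption (not from the paper): each pointer has one extra wrap bit.
DEPTH = 8                     # hypothetical depth, must be a power of two
PTR_MASK = 2 * DEPTH - 1      # pointer width is log2(DEPTH) + 1 bits


def advance(ptr: int) -> int:
    """Move a pointer to the next position, wrapping after 2 * DEPTH."""
    return (ptr + 1) & PTR_MASK


def address(ptr: int) -> int:
    """Memory address selected by a pointer (wrap bit dropped)."""
    return ptr & (DEPTH - 1)


def is_empty(wptr: int, rptr: int) -> bool:
    """Empty: same address and same wrap bit."""
    return wptr == rptr


def is_full(wptr: int, rptr: int) -> bool:
    """Full: same address, but the write pointer is one lap ahead."""
    return wptr == (rptr ^ DEPTH)
```

In both states the two pointers select the same memory address; only the wrap bit distinguishes a FIFO that has been drained from one that has been filled.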
Pointer exchange can be accomplished using synchronizers or clock stretching [12]. However, adding a handshake protocol to control the pointer exchange implies additional latency in the FIFO. Cummings [13] presents an elegant solution to pointer exchange using synchronizers. The addresses are translated to Gray code, which guarantees that consecutive addresses are at a Hamming distance of 1. In this way, the metastability problem is confined to a single bit and synchronizers can be employed without handshake. In Fig. 2, wptr represents the writing pointer and rptr the reading pointer, and both are used for pointer exchange. In this FIFO, if synchronization fails, the only consequence is the premature generation of the empty or full signals; there is no data loss or corruption.
Fig. 2 - Structure of a bi-synchronous FIFO.
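The Gray code property used for pointer exchange can be checked with a short sketch. It uses the standard reflected binary Gray code; the function names and the 3-bit address width are illustrative assumptions, not values taken from the FIFO in [13].

```python
def bin_to_gray(value: int) -> int:
    """Convert a binary address to reflected binary Gray code."""
    return value ^ (value >> 1)


def gray_to_bin(gray: int) -> int:
    """Convert a Gray-coded address back to binary."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def max_step_distance(depth: int) -> int:
    """Largest Hamming distance between consecutive Gray-coded
    addresses, including the wrap from depth - 1 back to 0.
    depth is assumed to be a power of two."""
    return max(
        hamming_distance(bin_to_gray(i), bin_to_gray((i + 1) % depth))
        for i in range(depth)
    )


# Hypothetical 3-bit address space (8 positions).
gray_sequence = [bin_to_gray(i) for i in range(8)]
```

For a power-of-two depth, every increment, including the wrap-around, changes exactly one bit of the Gray-coded pointer, whereas a plain binary pointer may change several bits at once (for instance, 3 to 4 flips three bits).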
The Hermes-G NoC may have a bi-synchronous FIFO in each router input port or just in the ports connecting routers. These approaches allow each router to work with its own clock source or to share a clock with its local IP core. To let the IP core connected to a router work with a clock distinct from that of its router, a FIFO must also be inserted in the input port of the IP core Network Interface.
The Hermes-GLP Router
The Hermes-GLP router employs the same synchronization schemes developed for the Hermes-G router, but additionally allows the use of clock gating and dynamic frequency scaling.
The clock gating mechanism consists of disabling the clock of idle modules of the system [21]. This mechanism can be applied at different granularities. Hermes-GLP applies it at the router level. A router is considered idle when all of its ports are idle, meaning that none of them has any data to transmit. To implement clock gating, a clock control module was added to the router design. This module receives a signal from each input port and updates the respective port state accordingly. When all ports report no data to transmit, the clock control disables the router clock input. The router remains in this state until a write operation occurs in one or more of its input ports, setting the port to the transmitting state. This scheme is possible because each port has its own writing clock signal, while all ports share the same reading clock signal, used internally by the router.
The DFS mechanism was implemented using simple glitch-free dynamic clock switching between two or more clocks. More elaborate clock scaling techniques could be used as well.
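The router-level gating rule described above can be summarized in a small behavioral sketch. It is not the hardware implementation: the class, its method names and the port count of five are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List

NUM_PORTS = 5  # hypothetical port count, e.g. four neighbours plus a local port


@dataclass
class ClockControl:
    """Behavioral model of the gating decision in a clock control module."""
    transmitting: List[bool] = field(
        default_factory=lambda: [False] * NUM_PORTS
    )

    def on_write(self, port: int) -> None:
        """A write into an input port sets that port to the transmitting state."""
        self.transmitting[port] = True

    def on_drained(self, port: int) -> None:
        """The port reports that it has no more data to transmit."""
        self.transmitting[port] = False

    @property
    def clock_enabled(self) -> bool:
        """The router clock runs only while at least one port is transmitting."""
        return any(self.transmitting)
```

Because writes are clocked by the sender, `on_write` models an event that can occur while the router clock is stopped, which is what allows the router to wake up again.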
The use of DFS allows the self calibration of router operating frequency based on expressed communication requirements. Routers involved in a high priority communication operate at a higher frequency than routers involved in low priority communication.
To enable dynamic router configuration, each packet carries its own priority in a sideband signal, which is routed along with the packet. Each router receives a set of n priority signals, where n is the total number of ports in the router, computes the greatest indicated priority and selects the operating frequency accordingly. In this way, latency constraints for all packets are more easily met, while the power control mechanism remains simple and requires small area overhead in routers.
Fig. 3 illustrates the structure of a Hermes-GLP clock control module. The sel_clk_in signals at the input ports control the DFS, and each is routed along with the packet to the sel_clk_out signal of some output port. The Port_State signals indicate the state of each port and control clock gating. They are used either to decide which sel_clk_in determines the current router clock or to perform clock switching. Only active ports are considered when determining the router clock frequency.
Fig. 4 shows an example of dynamic clock determination for two concurrent communication flows with different priorities. In the first step (a), the NoC starts with a low priority flow established. In (b), a higher priority flow starts, forcing the routers in its path to work at a higher frequency. Routers that belong to both communication paths are configured to the higher frequency, to meet the most stringent flow requirement. In (c), the higher frequency flow finishes first, which sets router 11 back to a lower frequency and stops routers 21 and 01. When the first flow ends in (d), all routers reach an idle state and all router clocks are stopped.
Fig. 4 - Example of dynamic clock determination and clock gating. The darker the router, the higher its operating frequency.
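The frequency selection rule can be sketched as follows: among the active ports, the greatest priority wins, and a router with no active port has its clock stopped. The frequency table, the encoding of priorities as small integers and the function name are hypothetical; only the signal names sel_clk_in and Port_State come from the text.

```python
from typing import Optional, Sequence

# Hypothetical frequency table in MHz, indexed by priority level.
FREQ_BY_PRIORITY_MHZ = (25, 50, 100)


def select_clock(sel_clk_in: Sequence[int],
                 port_active: Sequence[bool]) -> Optional[int]:
    """Return the router clock frequency in MHz, or None if the clock is gated.

    sel_clk_in[i] is the priority carried on input port i and
    port_active[i] plays the role of Port_State for that port.
    """
    active = [prio for prio, on in zip(sel_clk_in, port_active) if on]
    if not active:
        return None                            # all ports idle: clock stopped
    return FREQ_BY_PRIORITY_MHZ[max(active)]   # greatest active priority wins
```

For a router shared by a low priority and a high priority flow, the function returns the higher frequency; once the high priority port becomes inactive, the same call returns the lower one, and it returns None when no port remains active.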
Switching and Power Analysis
The average power dissipated in a synchronous system can be expressed, according to Najm [22], as:
$$P_{avg} = \frac{1}{2T_c} V_{dd}^2 \sum_{i=1}^{n} C_i \, P_t(x_i) \qquad (1)$$
Here, $T_c$ is the clock period, $V_{dd}$ is the supply voltage, $C_i$ is the total capacitance at node $x_i$, $P_t(x_i)$ is the probability of a transition occurring at the gate output, also called switching activity, and $n$ is the number of logic gates or cells in the circuit [22]. According to Equation 1, the dynamic power of a synchronous module is directly related to the module operating frequency $1/T_c$. This relation is the basis for estimating power dissipation by computing the average activation rate of the routers in the Hermes-GLP NoC and comparing it to the activation rate of Hermes-G. A router working at its maximum operating frequency is considered to have an activation rate of 100%. Routers operating at lower frequencies have activation rates proportional to the ratio between their frequency and the maximum available clock frequency, and routers whose clock is stopped have an activation rate of 0%. The activation rate of a router is thus a function of its state. If at a time $t$ a router $x$ works at $F_{op}(t)$, then its activation rate is expressed by:
$$A_x(t) = \frac{F_{op}(t)}{F_{max}} \times 100\% \qquad (2)$$
where $F_{max}$ is the maximum available clock frequency and $F_{op}(t) = 0$ when the clock is stopped (the symbols $A_x$ and $F_{max}$ are notation assumed here).
The analysis of the Hermes-GLP activation rate was based on simulation. To determine the operating state of each router, SystemC code was written to monitor whether the router has its clock stopped and, when this is not the case, at which frequency level the router is operating. The state is sampled at fixed intervals of 1 ns and registered in a text file. These data are used to compute the average activation rate of each router, employing Equation 3 (a), written here with $A_x(t_i)$ denoting the activation rate of router $x$ at the $i$-th sample (assumed notation):
$$\bar{A}_x = \frac{1}{n} \sum_{i=1}^{n} A_x(t_i) \qquad (3a)$$
where $n$ represents the number of times the SystemC monitor code extracted the router state. The average activation rate of the NoC can be described as the average of the average rates of all routers, as shown in Equation 3 (b), where $r$ represents the number of routers:
$$\bar{A}_{NoC} = \frac{1}{r} \sum_{x=1}^{r} \bar{A}_x \qquad (3b)$$
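The averaging procedure described above can be sketched in a few lines. The sketch assumes that the monitor log has already been parsed into one list of sampled operating frequencies per router, with 0 standing for a stopped clock; the maximum frequency, the data layout and the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List

F_MAX_MHZ = 100.0  # hypothetical maximum available clock frequency


def activation_rate(f_op_mhz: float, f_max_mhz: float = F_MAX_MHZ) -> float:
    """Activation rate in percent for one sample (0 when the clock is stopped)."""
    return 100.0 * f_op_mhz / f_max_mhz


def router_average(samples_mhz: List[float]) -> float:
    """Average activation rate of one router over its n samples."""
    return mean([activation_rate(f) for f in samples_mhz])


def noc_average(samples_by_router: Dict[str, List[float]]) -> float:
    """Average of the per-router averages over the r routers of the NoC."""
    return mean([router_average(s) for s in samples_by_router.values()])


# Hypothetical log: router "00" runs at full speed, half speed, then stops;
# router "01" stays gated for the whole observation window.
log = {"00": [100.0, 100.0, 50.0, 0.0], "01": [0.0, 0.0, 0.0, 0.0]}
```

With this hypothetical log, router "00" averages 62.5%, router "01" averages 0%, and the NoC average is 31.25%.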
Results
This Section presents results from the simulation of the test scenarios employed to validate the functionality of the Hermes-GLP NoC. The test set allows evaluating the potential to reduce the router activation rate, a parameter that the previous Section relates directly to power dissipation.
The simulation scenario employed is presented in Table 1, where each traffic flow always comprises a pair of IP cores exchanging. Table 1 in the Hermes-G and Hermes-GLP when a 90% insertion rate is applied. Note that Hermes-GLP does not introduce significant additional latency when compared to Hermes-G.
Conclusions
Hermes-GLP is a NoC developed to reduce power in SoCs. It is based on a simple DFS mechanism that allows significant power dissipation reduction with a small latency penalty. Simulation results suggest that this strategy is useful and leads to low overhead not only in latency, but in area as well. One important point is that, to ensure the robustness of the proposed DFS scheme, the clock switching process must be guaranteed to be glitch free. The proposed NoC is currently being prototyped in FPGAs. Later versions of Hermes-GLP will aim at ASIC prototyping, example applications with real traffic constraints and the addition of DVS schemes to the ASIC version of the NoC.
