Abstract-The Network-on-Chip (NoC) paradigm is now widely used to interconnect the processing elements (PEs) in a chip multiprocessor (CMP). It has been reported that the NoC consumes about a third of the total power of a multi-core processor. To address this, asynchronous NoC routers have been proposed to eliminate the clocking power, which is typically a large fraction of NoC power consumption. In this work, we present a technique to reduce the standby power of a state-of-the-art asynchronous NoC router. In our approach, the router is put in a known input state when idle. Each gate of the unmodified router whose output in the idle state is 0 is replaced by a logically equivalent gate whose supply pin is connected to a PMOS device with a high threshold voltage. Conversely, each gate whose output in the idle state is 1 is replaced by a logically equivalent gate whose ground terminal is connected to an NMOS device with a high threshold voltage. Our router is inserted in a NoC and verified logically for correct routing functionality. We also simulated it at the circuit level using a 45nm fabrication technology, and show that it has a low wake-up time from sleep, a minimal steady-state routing delay overhead (13%) and area overhead (23%), and an 8.1× lower standby power when compared to an unmodified asynchronous NoC router, which was also implemented. Our leakage improvement is achieved in part by using a novel method to control the leakage of the inverter chain used to drive the sleep signal, something that is not possible with traditional leakage reduction techniques.
I. INTRODUCTION
Recent studies [1] show that asynchronous networks-on-chip (NoCs) are an appealing solution to tackle the synchronization challenge in modern multicore CMP systems, using the Globally Asynchronous Locally Synchronous (GALS) [2] design paradigm. As technology scales and process, voltage and temperature (PVT) variations dominate, distributing a global clock becomes increasingly difficult, and the GALS paradigm has been gaining favor among researchers and practitioners. In the GALS paradigm, individual computing units (PEs of the CMP) are synchronous, while communication among PEs is performed asynchronously. This naturally suggests that the routers in a CMP be designed in an asynchronous manner, while the PEs are synchronous. This allows the designer to absorb the heterogeneity of timing constraints in the system interconnect of such systems.
Asynchronous NoCs, then, are the best means to accomplish such a GALS inspired CMP. In particular, in an asynchronous NoC, each router is implemented in an asynchronous manner. Asynchronous NoCs have several attendant benefits: a) they provide average-case performance instead of worst-case, guard-banded performance, b) they save power since they do not need a clock signal, so the switching power of the clock tree is eliminated, and c) they are robust to PVT variations due to the asynchronous nature of the design.
There are some downsides to asynchronous design (and hence asynchronous NoCs). The main issue is that asynchronous NoCs are generally harder to design and test than synchronous NoCs. There is scant tool support for the design, verification and testing of asynchronous circuits. As a consequence, designers rely on a full- or semi-custom approach for the design of such circuits.
A large fraction of the total CMP power is consumed by the NoC. For example, the NoC in the RAW [3] system consumes as much as 36% of the total power of the CMP. Hence, it is critical to reduce the power of the NoC. Leakage power has become a significant issue in modern VLSI design. As process feature sizes and operating voltages of VLSI designs shrink, (sub-threshold) leakage currents in modern VLSI designs become dominant. When a circuit such as an NoC router is in the standby mode of operation, its power consumption is due to the sum of leakage currents in all its devices.
The leakage current for a PMOS (NMOS) transistor is the I_ds of the device when it is in the cut-off or sub-threshold region of operation. This current [4] is governed by the equation:
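The expression below is a sketch of the widely used BSIM-style sub-threshold current model, written with the symbols this section defines. The W/L prefactor, the gate-source voltage V_gs, and the exact grouping of terms are assumptions made for illustration.

```latex
I_{ds} = \frac{W}{L}\, I_0 \, e^{\frac{V_{gs} - V_T - V_{off}}{n\, v_t}} \left( 1 - e^{-\frac{V_{ds}}{v_t}} \right)
```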
Here I_0 and V_off¹ are constants, while v_t = kT/q is the thermal voltage (26 mV at 300 K) and n is the sub-threshold swing parameter. Also, T is the absolute temperature, k is the Boltzmann constant and q is the electron charge.
From the above, leakage currents increase exponentially with decreasing threshold voltages (which are typically maintained at a fixed fraction of supply voltage). Hence, as supply voltages are reduced, leakage increases exponentially. It is expected that leakage power is comparable to switching power in recent VLSI technologies [5].
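The exponential sensitivity described above can be illustrated numerically. The following is a minimal sketch, assuming a simple exp(ΔV_T / (n·v_t)) dependence; the function names and the example value of n are hypothetical and are not taken from the paper.

```python
import math

BOLTZMANN_K = 1.380649e-23     # Boltzmann constant, J/K
ELECTRON_Q = 1.602176634e-19   # electron charge, C


def thermal_voltage(temp_kelvin: float) -> float:
    """Thermal voltage v_t = kT/q in volts."""
    return BOLTZMANN_K * temp_kelvin / ELECTRON_Q


def leakage_ratio(delta_vt: float, n: float, temp_kelvin: float = 300.0) -> float:
    """Factor by which sub-threshold current falls when V_T rises by delta_vt volts.

    Assumes the current scales as exp(-V_T / (n * v_t)) with all else fixed.
    """
    return math.exp(delta_vt / (n * thermal_voltage(temp_kelvin)))


if __name__ == "__main__":
    # n = 1.5 and a 0.2 V threshold increase are illustrative inputs only.
    print(f"v_t at 300 K: {thermal_voltage(300.0) * 1e3:.2f} mV")
    print(f"leakage reduction factor: {leakage_ratio(0.2, 1.5):.1f}")
```

The same relation read in reverse shows why lowering V_T along with the supply voltage raises leakage so quickly.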
A closer look at Equation 1 shows that I_ds is significantly larger when V_ds ≫ n·v_t. This is because of two effects: a) the last term of Equation 1 is close to unity and b) with a large value of V_ds, V_T is lowered due to drain induced barrier lowering² (DIBL) [6], [4]. Therefore, leakage can be reduced by ensuring that the supply voltage is not applied across a single device. To exploit this, the authors of [7] devised a circuit design approach in which a circuit block is driven with a fixed input vector in the standby mode of operation. Based on the static outputs of each gate for this input vector, the gates were modified such that if a gate had a "1" output during standby, its ground terminal was connected to a high-V_T NMOS switch (the modified cell is referred to as an H cell). Conversely, if a gate had a "0" output during standby, its supply terminal was connected to a high-V_T PMOS switch (the modified cell is referred to as an L cell). This ensured that the supply voltage, during standby, was applied across at least two devices, reducing leakage significantly. It also ensured that the output of each gate did not float, allowing for fast recovery (wake-up) from the standby state when normal operation has to be resumed.
In this paper, we apply the leakage reduction ideas of [7] to the asynchronous router design that was proposed in [1]. The asynchronous router of [1] demonstrated a significant reduction in area, power and energy-per-flit over xpipesLite [8], an existing state-of-the-art synchronous router. The approach of [1] utilizes a two-phase bundled data protocol on the inter-router links, as well as within the router itself. By allowing this router to sleep between packets using the design approach of [7], we are able to demonstrate an 8.1× reduction in leakage power, with a minimal wake-up time and a minimal router delay penalty once the router has been woken up from its standby state.
The key contributions of this paper are:
• We modify the asynchronous NoC router of [1] to make it compatible with the leakage reduction methodology of [7].
¹ Typically
² V_T decreases approximately linearly with increasing V_ds.
978-1-4799-6492-5/14/$31.00 ©2014 IEEE
• We verify the correct operation of the modified NoC at the logic level, by validating its correctness in a 4×4 NoC.
• We simulated the design of the modified and unmodified [1] NoC in HSPICE using a 45nm predictive technology [9], and show that the modified design exhibits 8.1× lower leakage, with a low wake-up time of ~870 ps, and a steady-state flit routing speed penalty of 13% over the unmodified design of [1].
• We use a novel technique to reduce the leakage of the inverter chain that drives the sleep signal to the router blocks, which cannot be realized by existing leakage control approaches.
• Unlike traditional leakage control approaches, our approach allows the Input Port Modules (IPMs) and the Output Port Modules (OPMs) to sleep independently. With traditional leakage control approaches, the entire router must be put in sleep mode at once, reducing the leakage power reduction that can be availed in practice.
The rest of the paper is organized as follows. In Section II, we discuss previous work in the area of asynchronous NoC design, and leakage reduction techniques. Section III describes our approach, and Section IV presents experimental results. Section V presents our conclusions.
II. PREVIOUS WORK
In this section, we first discuss some of the key works in asynchronous design, especially in the context of NoCs. Then we cover previous work in the area of low leakage VLSI design.
For NoC designs, the GALS paradigm [2] has become popular in recent times. Some of the more efficient GALS based NoC approaches that have been published are [10], [11]. Our work is based on the recent efficient asynchronous NoC reported in [1], which uses a very efficient two-phase bundled data protocol, based on transition signaling. Most of the previous approaches, in contrast, utilize four-phase return-to-zero protocols, which require two round-trip messages to be sent per transaction, making them less practical for a state-of-the-art NoC design. The Mousetrap pipeline protocol [12] that is used in [1] is used in our work as well.
Several means for reducing leakage power have been proposed in recent times. Among the most common design approaches is the usage of dual threshold devices [13] in an MTCMOS (multithreshold CMOS) configuration³. Cell inputs and outputs as well as bulk nodes float in an MTCMOS design operating in standby mode. As a result, MTCMOS designs can have a large range of leakage currents, and more problematically, their wake-up times from the standby (equivalently referred to as sleep) state are high. A scheme that fixes these problems, while still yielding the leakage improvement of MTCMOS, is the HL approach [7], which we utilize in our approach.
In [14], the authors address the problem of finding the best vector to utilize when the circuit is in standby mode. It was shown that the best vector has a leakage of about 50-75% of the worst leakage vector. Our approach uses a random vector for standby operation. Our leakage improvement results could potentially be improved by up to 2× if a leakage vector selection algorithm were employed.
In [15], the authors applied an HL-like technique to reduce power in an m-out-of-n design, using a 4-phase asynchronous handshaking protocol. They use a 90nm process technology, obtaining a 94% reduction in leakage power, with a 30% delay penalty, averaged over 5 small ISCAS89 benchmark circuits. In contrast, we demonstrate our leakage reduction technique for a significantly larger design (an asynchronous router), which uses a superior 2-phase bundled data protocol and is implemented in a 45nm technology. Our results are significantly improved (8.1× leakage reduction compared to the regular router, with a modest 13% steady-state flit routing delay penalty). In another work [16] ,
III. OUR APPROACH
In this section, we first introduce the HL leakage reduction technique. Then we present a brief overview of the asynchronous NoC router [1] we based our work on. Next we discuss the Input Port Module (IPM) and Output Port Module (OPM) of the asynchronous router, along with a discussion on how we modified them to reduce their leakage currents.
A. Leakage Reduction with the HL Approach
With decreasing process feature sizes, a significant increase in leakage power as a fraction of total power has led to an increased focus on leakage current control. The simplest approach, multithreshold CMOS (MTCMOS), connects the ground (and supply) terminal of a cell to a high-threshold footer NMOS (and high-threshold header PMOS) transistor. Note that since the PMOS device threshold voltages are negative, the threshold voltage of the PMOS header is actually lower than that of other PMOS transistors in the circuit. In the remainder of this paper, we refer to the absolute value of the threshold voltage of any device. Note that the supply (and ground) signals are driven to each cell through the header (footer) devices. In practice, a single header or footer device can be used for a large number of logic cells. Variants of MTCMOS that are commonly used in industrial practice are the Header-only approach and the Footer-only approach. They are generally preferred since they involve the use of only high-threshold header (footer) devices, unlike MTCMOS, which requires both.
The disadvantage of the MTCMOS, Header-only and Footer-only approaches is that gate outputs float to non-rail values during the standby (equivalently referred to as sleep) mode of operation. This results in variability in the leakage of the circuit, and in large delays and inrush currents during wake-up. To fix this, the HL approach [7] ensures that the gate outputs maintain rail values during sleep mode. The circuit primary inputs are driven to a fixed vector during sleep. Simulating the circuit under this vector yields the output value of each gate in the design (see the left part of Figure 1). The low-leakage HL circuit is obtained as follows:
• If a gate G has an output value of "1" when the sleep vector is applied to the circuit, then the ground pin of G is connected to the high-threshold NMOS footer device. The modified gate is referred to as an H cell, and its leakage is reduced during standby since there are at least two devices between its output and the ground terminal, one of which has a high threshold voltage.
• If a gate G has an output value of "0" when the sleep vector is applied to the circuit, then the supply pin of G is connected to the high-threshold PMOS header device. The modified gate is referred to as an L cell, and its leakage is reduced during standby since there are at least two devices between its output and the supply terminal, one of which has a high threshold voltage.
The HL conversion of the circuit on the left part of Figure 1 is shown on the right side of the same figure. Note that the header and footer devices can be shared across several gates in practice. Our low-leakage asynchronous router shares header and footer devices across several gates.
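The H/L assignment rule above can be sketched in a few lines. This is a minimal illustration for a purely combinational netlist, assuming a hypothetical dictionary-based netlist format and a small hypothetical gate library; it is not the flow used for the router, which also contains latches and feedback.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical netlist format: gate name -> (gate kind, input net names).
# A gate's output net carries the gate's own name.
Netlist = Dict[str, Tuple[str, List[str]]]

GATE_FUNCS: Dict[str, Callable[[List[int]], int]] = {
    "INV": lambda v: 1 - v[0],
    "AND": lambda v: int(all(v)),
    "OR": lambda v: int(any(v)),
    "NAND": lambda v: 1 - int(all(v)),
    "NOR": lambda v: 1 - int(any(v)),
}


def assign_hl_cells(netlist: Netlist, sleep_vector: Dict[str, int]) -> Dict[str, str]:
    """Simulate the netlist under the sleep vector and label each gate H or L.

    A gate whose standby output is 1 becomes an H cell (footer-gated);
    a gate whose standby output is 0 becomes an L cell (header-gated).
    """
    values = dict(sleep_vector)   # known net values, starting with primary inputs
    cell_type: Dict[str, str] = {}
    pending = dict(netlist)
    while pending:
        progressed = False
        for name, (kind, inputs) in list(pending.items()):
            if all(net in values for net in inputs):
                out = GATE_FUNCS[kind]([values[net] for net in inputs])
                values[name] = out
                cell_type[name] = "H" if out == 1 else "L"
                del pending[name]
                progressed = True
        if not progressed:
            raise ValueError("combinational loop or undriven net in netlist")
    return cell_type


# Two-gate example: a NAND driving an inverter, with all inputs at 0 in standby.
example = {"g1": ("NAND", ["a", "b"]), "g2": ("INV", ["g1"])}
example_types = assign_hl_cells(example, {"a": 0, "b": 0})
```

In the example, the NAND output is 1 under the all-zero vector, so it becomes an H cell, and the inverter it drives becomes an L cell.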
B. Overview of Asynchronous Router
The asynchronous router we base our work on is a reimplementation of the asynchronous router reported in [1]. This state-of-the-art router implements a 2D mesh topology, and utilizes a two-phase bundled data routing protocol. In the NoC, each PE has an associated asynchronous router. Routing is performed using 5 Input Port Modules (IPMs) and 5 Output Port Modules (OPMs). One input port and one output port are used to communicate with the associated PE, while the other 4 input and output ports connect to the corresponding ports of the asynchronous routers in the North, South, East and West directions respectively. The routers support a NoC of size up to 16×16 PEs, with an 8-bit address. A packet consists of a head flit, one or more body flits, and a tail flit. The router supports wormhole routing. In other words, once a head flit traverses a router, the routing path is reserved until the tail flit of the same packet leaves the router. Dimension-order routing (DOR-XY) is supported, with the X dimension routing being performed before the Y dimension routing. Each flit is 144 bits wide for our reimplementation of [1] as well as our low-leakage asynchronous router design. The 2 least significant bits of each flit indicate its type (11 for head, 10 for tail and 00 or 01 for body). The head flit reserves an additional 8 bits to store the packet destination address. A key requirement for leakage reduction in the HL methodology is that the inputs to a design stay at a fixed, predetermined value during standby. In our implementation of the HL-based low-leakage asynchronous router, we allow the router to enter standby operation between packets, but not within a packet (i.e. between flits). This is because the capacitances on the supply and ground nodes of the design are high, so sleeping between flits would not give the power supplies enough time to reach their steady-state values.
We also assume that every packet has an even number of flits. This ensures that the REQ and ACK signals have a fixed and identical value (assumed to be "0" in our simulations) before and after a packet traverses the router. As a result, both REQ and ACK have a "0" value during standby, allowing us to maximize leakage power improvements. Without this restriction, the REQ and ACK signals could both be "0" or both be "1" during standby, making it impossible to assign an H or L type to the gates in the fanout of the REQ and ACK signals, thereby increasing leakage power.
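The flit encoding and dimension-order routing described above can be sketched as follows. This is a minimal illustration: the position of the 8-bit address within the head flit (bits 2 to 9), its split into a 4-bit X and a 4-bit Y coordinate, the port names, and the convention that East and North mean increasing X and Y are all assumptions, since the text does not specify them.

```python
from typing import Tuple


def flit_type(flit: int) -> str:
    """Classify a flit from its 2 least significant bits."""
    low = flit & 0b11
    if low == 0b11:
        return "head"
    if low == 0b10:
        return "tail"
    return "body"   # 00 or 01


def dest_address(head_flit: int) -> Tuple[int, int]:
    """Extract (x, y) from a head flit; the bit positions are assumed."""
    addr = (head_flit >> 2) & 0xFF
    return (addr >> 4) & 0xF, addr & 0xF


def xy_route(current: Tuple[int, int], dest: Tuple[int, int]) -> str:
    """Dimension-order routing: resolve X fully before Y."""
    cx, cy = current
    dx, dy = dest
    if dx > cx:
        return "East"
    if dx < cx:
        return "West"
    if dy > cy:
        return "North"
    if dy < cy:
        return "South"
    return "PE"   # the packet has arrived; deliver it to the local PE


# A head flit addressed to (3, 5), seen at router (1, 5), leaves through East.
head = (0x35 << 2) | 0b11
port = xy_route((1, 5), dest_address(head))
```

Four bits per coordinate is the smallest split that covers a 16×16 mesh with an 8-bit address, which is why it is used in this sketch.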
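The reason for the even-flit restriction follows from transition signaling: in a two-phase protocol each flit is signaled by one toggle of REQ (and acknowledged by one toggle of ACK). A minimal sketch, with a hypothetical helper name:

```python
def level_after_packet(num_flits: int, initial: int = 0) -> int:
    """Level of a two-phase handshake wire after num_flits transitions."""
    level = initial
    for _ in range(num_flits):
        level ^= 1   # one transition per flit
    return level


# An even-length packet returns the wire to its initial level; an odd one does not.
even_case = level_after_packet(4)
odd_case = level_after_packet(3)
```

With an even flit count, the wire is back at "0" once the packet has passed, so the standby value is known at design time.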
1) Modified Input Port Module (IPM) for Reduced Leakage: The role of the IPM is to forward flits to the appropriate OPM in the router, and handle wormhole routing. The schematic of IPM4 is shown in Figure 3 . Other IPMs are identical, with slightly altered signal indices.
The original unmodified logic of the IPM is shown in the dotted region of Figure 3 . The modifications to the IPM to enable leakage reductions are shown in Figure 3 in the solid boxes.
Routing begins with a header flit of a packet arriving at IPM4. For the following discussion, we assume its destination address requires that it exit from OPM1. Initially, REQIN and ACKIN are "0" as discussed. DATAIN is driven to IPM4, with the header flit contents. Based on the packet destination address in the header flit, the Packet Route Selector block determines which OPM the
, and it is latched as REQX. The rising of REQX causes the RouteSelected_1 signal to go high, which is latched through the SR latch, so that PacketPathEnabled4_1 goes high. This signal reserves the route from IPM4 to OPM1 until the tail flit is detected by OPM1 (indicated by the TailPassed1_4 signal). This is how wormhole routing is implemented. Only one RouteSelected_i signal can be high at a given time. Note that the REQX signal is driven to all OPMs as REQ4_i, since the ACKi_4 signals are initially all "0". Once the flit is driven out from OPM1, the ACK1_4 signal will be driven high, ensuring that REQ4_i gets driven to "0" for i = 0, 2, 3. Finally, the ACKX signal, which is the XOR of the four ACKi_4 signals, goes high, indicating to the sender that a new flit can be driven on DATAIN. Once the tail flit of the packet passes through OPM1, the TailPassed1_4 signal is driven high, tearing down the wormhole route and driving PacketPathEnabled4_1 low.
The low-leakage HL version of the IPM circuit uses the circuits labeled X, Y and S in Figure 3. Circuit S asserts the IPM sleep signal when all PacketPathEnabled4_i signals are low (no flits being routed), REQIN and ACKIN are both zero (no packet being routed), and the 2 LSBs of DATAIN are not both "1" (no header flit detected). When a new packet arrives (header flit is driven on DATAIN), the IPM sleep signal is driven low. The power supplies are restored to their rail values during this wake-up phase. Once they have been restored, the REQIN signal is driven high, to begin the process of routing in an identical manner as in the paragraph above (for the unmodified router). While flits are being routed, one of the PacketPathEnabled4_i signals is asserted, preventing the IPM from sleeping. This essentially means that the IPM cannot sleep between flits, but only between packets. When the last flit of a packet passes through OPM1, the TailPassed1_4 signal is driven high, driving PacketPathEnabled4_1 low. At this time, the sleep signal is driven high again.
Circuits X and Y show that when the IPM is in standby, all 144 DATAIN signals, as well as the REQIN and the ACKi_4 signals, are driven low with L-type AND2 gates. This is because during sleep, the input vector we select for the IPM is the vector in which all the inputs to the IPM are "0". We refer to circuits X and Y as forcing circuits.
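The sleep condition computed by circuit S can be written as a small Boolean function. This is a sketch of the logic as described in the text, not a transcription of the gate-level circuit in Figure 3; the function and parameter names are hypothetical.

```python
from typing import Sequence


def ipm_sleep(path_enabled: Sequence[int], req_in: int, ack_in: int, data_in: int) -> bool:
    """Return True when the IPM may sleep.

    path_enabled: the PacketPathEnabled signals from this IPM to each OPM.
    data_in: the flit currently on DATAIN; only its 2 LSBs are examined.
    """
    no_route_reserved = not any(path_enabled)
    handshake_idle = req_in == 0 and ack_in == 0
    no_head_flit = (data_in & 0b11) != 0b11
    return no_route_reserved and handshake_idle and no_head_flit


idle = ipm_sleep([0, 0, 0, 0], 0, 0, 0)            # nothing pending
head_arrived = ipm_sleep([0, 0, 0, 0], 0, 0, 0b11)  # head flit on DATAIN
mid_packet = ipm_sleep([0, 1, 0, 0], 0, 0, 0b00)    # a route is reserved
```

Because a reserved route alone keeps the function false, the IPM stays awake even at instants within a packet when REQIN and ACKIN both happen to be "0".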
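The behavior of a forcing circuit can be sketched as a bitwise AND of each input with an enable that is low during standby. Driving the second AND2 input with the complement of the sleep signal is an assumption made here for illustration; the helper name is hypothetical.

```python
FLIT_BITS = 144


def force_low(value: int, sleep: bool, width: int = 1) -> int:
    """AND every bit of a width-bit bus with NOT sleep."""
    enable = 0 if sleep else (1 << width) - 1   # NOT sleep, replicated per bit
    return value & enable


all_ones_flit = (1 << FLIT_BITS) - 1
forced_in_standby = force_low(all_ones_flit, sleep=True, width=FLIT_BITS)
passed_when_awake = force_low(0b1011, sleep=False, width=4)
```

In standby the bus reads as all zeros regardless of what the upstream link drives, which is what fixes the sleep vector.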
Based on the sleep vector inputs, all the gates in the IPM (in the dotted box) are appropriately changed to H or L gates, for leakage reduction.
2) Modified Output Port Module (OPM) for Reduced Leakage: The role of the OPM is to forward flits to the next link, if it is selected by the IPM, based on the destination address of the packet being routed. The schematic of OPM4 is shown in Figure 4 . Other OPMs are identical, with slightly altered signal indices.
The normal logic of the OPM is shown in the dotted region of Figure 4 . The modifications to the OPM to enable leakage reductions are shown in Figure 4 , in the solid boxes.
For the following discussion, we assume the destination address of the packet requires that it exit from OPM4, and the packet arrives from IPM1. As a consequence, PacketPathEnabled1_4 is high. Suppose that one or more other PacketPathEnabledi_4 signals are also high. In this case, the 4-input MUTEX block arbitrates which IPM gets to drive its packet. The structure of the 4-input MUTEX block is explained in [1].
The selected IPM (assume it is IPM1 for our discussion) latches its REQ1_4 signal, and if the ACKOUT of the OPM is different from its REQOUT, the data (DATA1_4) is latched and driven out of the OPM as DATAOUT. Simultaneously, REQOUT changes, signaling the next link that data is available, while IPM1 is sent an acknowledge signal ACK4_1 so it can send the next flit. Once the last flit of a packet is detected (by means of the TailPassed4_1 signal going high), the PacketPathEnabled4_1 signal is driven low (by the IPM circuit, see Figure 3).
The low-leakage HL version of the OPM circuit uses the circuits labeled S, X and Y in Figure 4. Circuit S asserts the OPM sleep signal when all PacketPathEnabled4_i signals are low (no flits being routed) and REQOUT and ACKOUT are both "0" (no packet being routed). When a flit is waiting to be routed through the OPM (i.e. at least one PacketPathEnabled4_i signal is high), the OPM is woken up, and kept awake until all PacketPathEnabled4_i signals are low. Since PacketPathEnabled4_i stays asserted for the length of the packet, wormhole routing is achieved, without the OPM going into standby between flits of the same packet. Similarly, if REQOUT is not equal to ACKOUT, the OPM stays awake. Assuming the OPM is asleep, the rising of one or more PacketPathEnabled4_i signals will wake up the OPM. The power supplies are restored to their rail values during this wake-up phase. Finally, after the entire packet has been routed, the OPM may sleep again when REQOUT equals ACKOUT (transaction completed) and all PacketPathEnabled4_i signals are low.
Circuits X and Y show that when the OPM is in standby, all 144 DATAi_4 signals (for i = 1, 2, 3, 4), as well as the REQi_4 and the ACKOUT signals, are driven low with L-type AND2 gates. This is because during sleep, the input vector we select for the OPM is the vector in which all the inputs to the OPM are "0". We refer to circuits X and Y as forcing circuits.
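The OPM sleep behavior over the lifetime of one packet can be traced with a short behavioral sketch. It models only the sleep condition and the two-phase toggling of REQOUT and ACKOUT; the function names and the choice of which path-enable input is active are hypothetical.

```python
from typing import List, Sequence


def opm_sleep(path_enabled: Sequence[bool], req_out: int, ack_out: int) -> bool:
    """Sleep only when no route is reserved and REQOUT = ACKOUT = 0."""
    return (not any(path_enabled)) and req_out == 0 and ack_out == 0


def sleep_trace(num_flits: int) -> List[bool]:
    """Sleep signal sampled before, during and after one packet."""
    path_enabled = [False] * 4
    req_out = ack_out = 0
    trace = [opm_sleep(path_enabled, req_out, ack_out)]   # idle before the packet
    path_enabled[1] = True                                # head flit reserves the route
    for _ in range(num_flits):
        req_out ^= 1                                      # flit offered on the output link
        trace.append(opm_sleep(path_enabled, req_out, ack_out))
        ack_out ^= 1                                      # next router acknowledges
        trace.append(opm_sleep(path_enabled, req_out, ack_out))
    path_enabled[1] = False                               # tail flit releases the route
    trace.append(opm_sleep(path_enabled, req_out, ack_out))
    return trace


two_flit_trace = sleep_trace(2)
three_flit_trace = sleep_trace(3)
```

For a two-flit packet the OPM sleeps before and after the packet and never in between. For a three-flit packet the final sample is false in this model, because REQOUT and ACKOUT settle at "1", which illustrates why an even flit count is assumed.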
Based on the sleep vector inputs, all the gates in the OPM (in the dotted box) are appropriately changed to H or L gates, for leakage reduction.
3) Discussion on HL versus MTCMOS, Header-only and Footer-only Approaches: There are two key benefits to using HL based leakage control versus the MTCMOS, Header-only or Footer-only approaches.
Sleep Inverter Chain Optimization: Consider the inverter chain required to drive the sleep signal for a design, as shown in Figure 5 a). MP and MN are the high-threshold PMOS and NMOS transistors respectively. In general, it is impossible to connect the 4 inverters I1 through I4 (which buffer the sleep signal) to the MP and MN power gating transistors. This is because once the circuit is in sleep mode, the 4 inverters would be non-functioning, and the circuit would not be able to exit sleep mode again. As a consequence, these inverters would need to be always awake, and hence would leak significantly, especially since they are typically large on account of the large load they drive.
Now, with the HL approach, we can power gate these 4 devices as well, thus reducing leakage power further. Consider Figure 5 b). In this circuit, I1 and I3 are connected to a high-threshold voltage power gating PMOS device MP1. I2 and I4 are connected to a high-threshold voltage power gating NMOS transistor MN1. Two additional always-on inverters (implemented with long-channel, high-threshold voltage devices to reduce leakage power) I5 and I6 are connected to the sleep signal. These inverters drive the gates of MN1 and MP1 respectively. The sleep signal is delayed (using a pair of high-threshold voltage, long-channel inverters) before it drives the inverter chain I1 through I4.
Now consider the case that the circuit is awake, and going into sleep mode (i.e. the signal sleep is rising). MP1 and MN1 immediately go to sleep, since they are driven (undelayed) by S and S̄, the outputs of I6 and I5 respectively. However, after a delay D (the delay of the delay line), the output of I1 still falls since it is connected to the ungated ground signal. Similarly, the output of I2 rises, and the output of I3 falls as well. Finally, the output of I4 rises. Now the power gating devices of the circuit (MP and MN) turn off, and the circuit enters sleep mode.
Consider the case that the circuit is in sleep mode, and is exiting sleep (i.e. the signal sleep falls). In this case, I5 and I6 immediately cause MN1 and MP1 to exit sleep. After a delay D, the sleep signal now drives I1 through I4 in turn (since they are now awake), causing MP and MN to exit sleep as desired.
A similar circuit cannot be devised for the MTCMOS, Header-only or Footer-only approaches. When the sleep signal rises, the inverter chain would immediately enter sleep mode, making it impossible for MN and MP to enter the sleep state. For these alternate leakage approaches, therefore, the inverter chain driving the sleep signal needs to be always awake, increasing leakage power significantly (since the inverter chain typically contains large inverters on account of the large size of MP and MN).
Without the use of the circuit of Figure 5 b), our router achieved a ~6× leakage improvement over the unmodified router. With the use of the circuit of Figure 5 b), this number improved to 8.13×.
Fine-grained Per-IPM and OPM Sleep Ability: In our HL approach, the sleep functionality of the router is achieved in a per-IPM and per-OPM fashion, as described in Section III-B.1 and Section III-B.2. However, in the MTCMOS, Header-only or Footer-only approaches, the sleep functionality can only be implemented on a coarser, per-router basis. This is because the sleep logic for the IPM and OPM (the circuit marked "S" in Figure 3 and Figure 4) requires internal router signals (such as PacketPathEnabled4_i), which are floating signals in the MTCMOS, Header-only or Footer-only approaches. The ability to sleep on a per-IPM or per-OPM basis would avail a much greater leakage reduction under normal operation of the CMP.
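The sequencing argument for the gated inverter chain can be checked with a small logic-level model. This is a behavioral sketch only: an inverter whose required rail is switched off is modeled as holding its previous output, which is an assumption, and analog effects are ignored. The names used are hypothetical.

```python
from typing import List

# I1..I4: I1 and I3 have their supply gated (via MP1),
# I2 and I4 have their ground gated (via MN1).
CHAIN = ["supply", "ground", "supply", "ground"]


def inverter_out(inp: int, gated_rail: str, switch_on: bool, prev_out: int) -> int:
    """Output of one power-gated inverter."""
    target = 1 - inp
    needs_gated_rail = (target == 1) if gated_rail == "supply" else (target == 0)
    if needs_gated_rail and not switch_on:
        return prev_out   # cannot be driven to the new value; modeled as holding
    return target


def propagate(sleep_delayed: int, switches_on: bool, prev_outs: List[int]) -> List[int]:
    """Outputs of I1..I4 for a given delayed sleep input."""
    outs = []
    signal = sleep_delayed
    for rail, prev in zip(CHAIN, prev_outs):
        signal = inverter_out(signal, rail, switches_on, prev)
        outs.append(signal)
    return outs


awake = propagate(0, True, [0, 0, 0, 0])
# Entering sleep: MP1/MN1 are already off when the delayed sleep edge arrives.
asleep = propagate(1, False, awake)
# Exiting sleep: MP1/MN1 are turned on before the delayed edge arrives.
woken = propagate(0, True, asleep)
# If MP1/MN1 were still off at that point, the chain could not change state.
stuck = propagate(0, False, asleep)
```

In the sleep state every inverter output is held by its ungated rail, which is the HL property; the last case shows why the switches for the chain must be driven ahead of the delayed sleep signal.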
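As a rough worked illustration of these two figures, let P denote the standby leakage of the unmodified router (a symbol introduced here only for this calculation), and treat the approximate 6× figure as exact:

```latex
\begin{align*}
P_{\text{HL, ungated chain}} &\approx \frac{P}{6} \approx 0.167\,P, \\
P_{\text{HL, gated chain}} &= \frac{P}{8.13} \approx 0.123\,P, \\
1 - \frac{6}{8.13} &\approx 0.26 .
\end{align*}
```

Under that assumption, gating the sleep inverter chain removes roughly a quarter of the leakage that remains after HL conversion.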
IV. EXPERIMENTAL RESULTS
We implemented the asynchronous NoC router of [1] (henceforth original router), along with our modified HL-based low-leakage asynchronous NoC router (henceforth HL router) in HSPICE [17], in a 45nm PTM [9] high-performance process. We also conducted our experiments using the low-power 45nm PTM process cards, and obtained much higher leakage improvements, but with router delays (for the unmodified as well as the modified designs) that were about 3× larger. Since a NoC router needs to be able to route packets at high speed, we focus on the results from the simulations with the high-performance model cards.
The first step was to design the original router in Verilog. To obtain the HL router, the IPMs and OPMs of the original router were driven with a sleep vector in which all the inputs had the "0" value. The router was simulated logically, and all cells with a "0" output were replaced by L cells, and those with "1" outputs were replaced by H cells.
After this, we thoroughly simulated both the original router and the HL router to verify logical correctness. Both routers were placed in a 4×4 NoC, and were tested at the logical level to ensure correct routing functionality. In the case of the HL router, correct assertion of the sleep signal was verified as well.
The Verilog netlists of both the original router and the HL router were next converted to HSPICE netlists using the V2S [18] tool.
Next, we sized the PMOS header and NMOS footer devices for both the IPMs and OPMs of the HL router. While the router was routing packets in HSPICE, we sized the headers and footers so that there was at most a 100 mV droop in the supply signal, and at most a 100 mV bounce in the ground signal of any OPM or IPM. The data flits being driven through the router were designed for maximal stress on the power and ground networks. The threshold voltages of the headers and footers were 200 mV above the nominal threshold voltages of the PMOS and NMOS devices in the design.
Next, we simulated the original and HL routers to measure their delays, dynamic power, static power, and wake-up time (for the HL router). The results of these HSPICE simulations are presented next.
Note that the dynamic power numbers that are reported are the maximum value of dynamic power, when packets are being routed through every IPM and OPM (in other words, 5 packets are being routed through the asynchronous router), with maximal switching of the bits between flits being routed.
The leakage power numbers that are reported are for the condition when all IPMs and OPMs are in the sleep state. Note that for our HL router, the sleep functionality is implemented on a per-IPM and per-OPM basis, so our router can benefit from a "partial" sleep condition in which some IPMs and OPMs are active, while others are inactive (and can thus enter sleep mode). In the case of the MTCMOS, Header-only and Footer-only approaches to leakage control, the sleep functionality can only be implemented on a per-router basis, dramatically reducing the ability to avail of opportunistic leakage power reductions when only some IPMs and OPMs are active.
Table I reports the delay and dynamic power for the first flit of any packet. Table II reports the delay and dynamic power for the subsequent flits of any packet. Note that for all tables, the actual numerical values are presented for the original router, and relative values are presented for our router. We note that the delay for the first flit is about 2.73× higher for our approach. This is because the time for the IPM to wake up is included in this delay. In general, the delay for the first flit is higher than for subsequent flits, since the first flit requires address decoding and path setup. Importantly, the delay overhead for our approach for subsequent flits is only 13%. The dynamic power consumption for the first flit for our router is about 0.54× that of the original router, while the dynamic power consumption for subsequent flits for our router is about 98% that of the original router.
Table III reports the leakage numbers and wake-up times for the asynchronous routers. Our router achieves an 8.13× improvement in leakage compared to the original router. The wake-up time of our router was about 870 ps. This number can be improved further at the cost of some leakage improvement. However, this number is less than the cycle time of a typical 1 GHz CMP, and is quite acceptable in practice.
Table IV reports the active area numbers of the IPMs and OPMs. We note that the 8.13× leakage improvement of our approach comes at a nominal 23% area cost. Table V reports the area breakdown of the various components of our router. Note that the gates in the forcing circuits contribute the most to the area overhead of our approach.
V. CONCLUSIONS
In this paper, we take a state-of-the-art asynchronous router for a Network-on-Chip (NoC), and modify it for leakage improvement. We use a powerful leakage control technique, which allows internal circuit nodes to stay at rail values during sleep mode. We present a novel leakage reduction approach for the inverter chains that drive the sleep signal, further improving the leakage reductions obtained. This technique cannot be used in traditional leakage control approaches. We implemented and verified (both at the logical and circuit levels) the unmodified and the modified routers, and show that the modified router obtains an 8.13× reduction in leakage in sleep mode, with a 13% delay penalty and a low wake-up delay.
