Although technology scaling has enabled designers to integrate a large number of processors onto a single chip, realizing the chip multi-processor (CMP), problems arising from technology scaling have made power reduction an important design issue. Since interconnection networks dissipate a significant portion of the total system power budget, it is desirable to consider the power efficiency of the interconnection network when designing a CMP. In this paper, we present a variable frequency link for a power-aware interconnection network using the clock boosting mechanism, and apply a dynamic frequency scaling (DFS) policy that judiciously adjusts link frequency based on link utilization. Experimental results show that history-based DFS successfully adjusts link frequency to track actual link utilization over time, demonstrating the feasibility of the proposed link as a power-aware interconnection network for system-on-chip (SoC).
Introduction
Technology scaling has enabled designers to integrate a large number of processors onto a single chip, realizing the chip multi-processor (CMP). High performance CMP architectures have been gaining the attention of the high performance computing community in the past few years. As the demand for network bandwidth increases for CMP, the idea of network-on-chip (NoC) becomes more promising because of the performance, power, and scalability requirements of an SoC design [1].
Although today's processors, built with high-speed circuits and parallel processing, are much faster and far more versatile than their predecessors, they also consume a lot of power. Moreover, an interconnection network dissipates a significant fraction of the total system power budget. For instance, the MIT Raw on-chip network consumes 36% of the total chip power, and the Alpha 21364 microprocessor dissipates 20% of its power in the interconnection network [2]. Therefore, an interconnection network must be designed to be power-aware.
In this paper, we motivate the use of dynamic frequency scaling (DFS) link, where the frequency is dynamically adjusted to minimize power dissipation while maintaining the performance demands. First, we propose a novel DFS link which adopts a clock boosting mechanism [3] , providing fast response time for frequency transitions and low hardware overhead. Next, a DFS policy is introduced that includes the link utilization estimator, DFS algorithm, and link controller. Finally, the DFS policy is applied to the proposed DFS link, demonstrating the power saving in an on-chip interconnection network. To the best of our knowledge, this is the first investigation of power reduction for on-chip interconnection network based on the clock boosting mechanism.
The rest of this paper is organized as follows. Section 2 addresses existing work on power saving for on-chip interconnection network. The proposed DFS link based on the clock boosting mechanism is introduced in Section 3. The implementation and the experimental results are presented in Sections 4 and 5, respectively. Finally, conclusions are drawn in Section 6.
Background

Dynamic voltage/frequency scaling
A communication link in an NoC is capable of scaling its power consumption gracefully, commensurate with the traffic workload. This scalability allows for the efficient execution of energy-agile algorithms. Suppose that a link can be clocked at any nominal rate up to a certain maximum value. This implies that different levels of power will be consumed at different clock frequencies. One option would be to clock all the links at the same rate to meet the throughput requirements. However, if only one link in the design needed to be clocked at a high rate, the other links could be clocked at a lower rate, consuming less power.
The total power consumption in an SoC is the combination of dynamic and static sources. In this paper, our focus is on the dynamic power consumption, which arises from circuit switching activity due to the charging and discharging of the switched capacitance. The dynamic power consumption depends on four parameters: the switching activity factor ($a$), the physical capacitance ($C$), the supply voltage ($V$), and the clock frequency ($f$):
$$P_D = a\,C\,V^2 f \qquad (1)$$
ARTICLE IN PRESS
$$f_{max} = \eta\,\frac{(V - V_{th})^{\beta}}{V} \qquad (2)$$
Eq. (2) establishes the relationship between the supply voltage $V$ and the maximum operating frequency $f_{max}$, where $V_{th}$ is the threshold voltage, and $\eta$ and $\beta$ are experimentally derived constants. Dynamic power consumption can be reduced by lowering the supply voltage. This requires reducing the clock frequency accordingly to compensate for the additional gate delay due to the lower voltage. Applying this approach at run-time, which is called dynamic voltage scaling (DVS), addresses the problem of how to adjust the supply voltage and clock frequency of the link according to the traffic level. The basic idea is that, because of the high variance in network traffic, a link can be slowed down when it is under-utilized without affecting performance. However, DVS requires thousands of clock cycles for a transition between voltage levels and additional hardware overhead for each link.
The other way to manage power consumption is DFS. DFS adapts only the clock frequency while all links in the network are kept at the same voltage, so it does not always reduce the total energy consumption. For instance, the power consumed by a network can be reduced by halving the operating clock frequency, but if it then takes twice as long to forward the same amount of data, the total energy consumed will be similar. DFS is useful when the target system does not support DVS or when the goal is to reduce peak or average power dissipation, indirectly reducing the chip's temperature [4]. An alternative way to save link power is to add hardware such that a link can be powered down when it is not used heavily.
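The distinction between power and energy under DFS can be illustrated with a minimal sketch of the dynamic power relation $P_D = aCV^2f$. All numeric values below (activity factor, capacitance, voltage, frequencies, flit count) are illustrative assumptions rather than measurements from this work, and the helper names are hypothetical.

```python
def dynamic_power(a, c, v, f):
    """Dynamic power P = a * C * V^2 * f, in watts."""
    return a * c * v * v * f


def transfer_energy(a, c, v, f, flits):
    """Energy (joules) to forward `flits` flits, assuming one flit per cycle."""
    time_s = flits / f
    return dynamic_power(a, c, v, f) * time_s


# Illustrative values only (assumptions, not taken from the paper).
A, C, V = 0.5, 1e-12, 1.0        # activity factor, farads, volts
F_FULL, F_HALF = 400e6, 200e6    # clock frequencies in Hz
FLITS = 1000

p_full = dynamic_power(A, C, V, F_FULL)
p_half = dynamic_power(A, C, V, F_HALF)
e_full = transfer_energy(A, C, V, F_FULL, FLITS)
e_half = transfer_energy(A, C, V, F_HALF, FLITS)

print("power ratio  (half/full):", p_half / p_full)
print("energy ratio (half/full):", e_half / e_full)
```

Under these assumptions, halving the frequency at a fixed voltage halves the power, while the energy for the same amount of data stays the same because the transfer takes twice as long.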
Related work
System level power management has been applied to some interconnection networks. Wei and Kim proposed chip-to-chip parallel [5] and serial [6] links. There are three kinds of approaches for DVS. One is an on-line scheme which adjusts the link speed dynamically, based on a hardware prediction mechanism that observes past link traffic activity. Shang et al. [7] developed a history-based DVS policy which adjusts the operating voltage and clock frequency of a link according to the utilization of the link and the input buffer. Worm et al. [8] proposed an adaptive low-power transmission scheme for on-chip networks. They minimized the energy required for reliable communication while satisfying QoS constraints. One potential problem with a hardware prediction scheme is that a misprediction of traffic can be costly from both performance and power perspectives.
Li et al. [9] proposed a compiler-driven approach where a compiler analyzes application code and extracts communication patterns among parallel processors. These patterns and the inherent data dependency information of the underlying code help the compiler decide the optimal voltage/frequency to be used for communication links at a given time frame. Shin and Kim [10] proposed an off-line link speed assignment algorithm for energy-efficient NoC. Given the task graph of a periodic real-time application, the algorithm assigns an appropriate communication speed to each link, while guaranteeing the timing constraints of real-time applications.
Soteriou et al. [11] proposed a software-directed methodology that extends the parallel compiler flow in order to construct a power-aware interconnection network by combining both on-line and off-line approaches. However, current DVS techniques require not only thousands of clock cycles to shift between voltage levels, limiting their ability to respond to high frequency changes in network bandwidth demands [12], but also additional hardware overhead such as a transmitter, receiver, PLL, and adaptive power supply regulator for each link.
Kim et al. [13] proposed dynamic link shutdown (DLS), which powers down links intelligently when their utilization is below a certain threshold level and a subset of highly used links can provide connectivity in the network. An adaptive routing strategy that intelligently uses a subset of links for communication was proposed, thereby facilitating DLS for minimizing energy consumption. Soteriou and Peh [2] explored the design space for turning communication channels on and off based on a dynamic power management technique that depends on hardware counter measurements obtained from the network at run-time. Chen et al. [14] introduced a compiler-directed approach, which increases the idle periods of communication channels by reusing the same set of channels for as many communication messages as possible. Li et al. [15] proposed a compiler-directed technique to turn off communication channels in order to reduce NoC energy consumption. Even though link shutdown saves significant power during idle periods, it incurs a reactivation penalty, including delay and additional power consumption during the transition.
Hsu [16] saved 30% of power consumption in the MPEG core by applying DFS power management mechanism using only three frequency levels (25, 50, and 100 MHz). However, DFS was applied to a core, not to interconnection network, in a tile-based NoC architecture. To the best of our knowledge, this paper is the first proposal which addresses DFS for interconnection network.
The novel contributions of our work are:
(1) a DFS link proposal for on-chip interconnection networks which, compared to a DVS link, offers a fast response time that reduces the frequency transition penalty and a lower hardware cost, making it suitable for system integration;
(2) use of a narrow control period, as compared to conventional DVS control, reducing the misprediction penalty that occurs in a hardware prediction scheme by adjusting the frequency more often;
(3) implementation of a variable frequency link that judiciously adjusts link frequency based on the link utilization estimation, reducing power consumption.
Variable frequency link
The clock boosting router was proposed to increase throughput and reduce latency of an adaptive wormhole router [3] . The key idea of clock boosting mechanism is the use of different clocks in a head flit and body flits because body flits can continue advancing along the reserved path that is already established by the head flit, while the head flit requires the support of complex logic, increasing critical path. Thus, it reduces latency and increases throughput of a router by applying faster clock frequency to a boosting clock in order to forward body flits.
DFS adapts only the clock frequency while all links in the network are kept at the same voltage. The clock boosting router can be modified to support a variable frequency link that is applicable to DFS with negligible hardware cost and a fast response time to frequency changes. In addition, the operating frequency of the system is not limited by the critical path of the route decision logic because only the clock frequency for body flit transmission is changed. Thus, the proposed method not only provides a variable frequency link but also increases interconnection network performance. Also, the fast response time of clock domain changes makes it possible to use a narrow control period for DFS, where the clock frequency is adjusted more frequently.
Fig. 1(a) shows an example of the proposed variable frequency link using the clock boosting mechanism. The system has multiple clock frequencies represented by $F_i$. The link controller selects the boosting clock frequency for the clock boosting router among the supported clock frequencies according to the link utilization level. Fig. 1(b) shows the time-space diagram for variable frequency links. In this example, the link supports three different frequencies ($F_1$, $F_2$, and $F_3$). A conventional variable frequency link changes the clock domain for the entire control period, while the proposed variable frequency link applies different clock frequencies only to the body flit transmission. The original clock frequency ($F_1$ in this example) is still used for the head flit transmission as well as for idle cycles.
In this paper, we use multiples of the original clock frequency (1×, 2×, and 4×) as the boosting clock frequency for the body flit transmission, in order to reduce implementation complexity. Table 1 summarizes the characteristics of a single link with different boosting clock frequencies using TSMC 90 nm technology. By boosting the clock frequency, the throughput of a link is increased at the expense of more power consumption, demonstrating the application of a DFS link for an NoC. The power saving comes from adapting the boosting clock frequency to the traffic load in the given time period. Because the head flit is a small part of a packet, we can achieve power saving by adjusting the operating frequency for body flit transmission. Even though the DFS link in this paper supports three different levels of clock frequency, experimental results show power saving potential for any interconnection network.
History-based DFS
Problem formulation
The DFS link allows different frequencies: $f_1, f_2, \ldots, f_s$. In addition, changing frequencies does not incur too much overhead.
There are two repetitive jobs: making a path decision ($j_h$) for a head flit and forwarding body flits ($j_b$). A path decision is processed before forwarding body flits ($j_h \to j_b$). Table 2 summarizes the parameters used for this study.
DFS link control is performed for each control period $T_c$. The $i$th period is the time interval $[(i-1)T_c, iT_c]$. Let $N = \{1, 2, \ldots, n\}$. We use $f_i$ to indicate the frequency during the $i$th period, with $f_i \in \{f_1, f_2, \ldots, f_s\}$ and $i \in N$. Suppose $w_i^h$ and $w_i^b$ are the numbers of head and body flits arriving at a link during the $i$th period, respectively. It takes $w_i^h / f_1$ to make the path decisions at frequency $f_1$ and $w_i^b / f_i$ to forward the body flits at frequency $f_i$. Thus, forwarding the packets of the $i$th workload takes time $d_i$ as follows:
$$d_i = \frac{w_i^h}{f_1} + \frac{w_i^b}{f_i} \qquad (3)$$
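The forwarding time described above (head flits handled at the original frequency $f_1$, body flits at the boosting frequency $f_i$) can be sketched as follows. The base frequency and the flit counts are illustrative assumptions, and the function name is hypothetical.

```python
def forwarding_time(w_head, w_body, f_base, f_boost):
    """Time to serve one period's workload.

    Head flits (path decisions) run at the base frequency f_base;
    body flits are forwarded at the boosting frequency f_boost.
    """
    return w_head / f_base + w_body / f_boost


F_BASE = 400e6           # assumed original clock frequency in Hz
W_HEAD, W_BODY = 2, 30   # assumed flit counts in one control period

# Forwarding time for each supported boosting multiplier (1x, 2x, 4x).
times = {m: forwarding_time(W_HEAD, W_BODY, F_BASE, m * F_BASE)
         for m in (1, 2, 4)}

for multiplier, d in times.items():
    print(f"{multiplier}x boosting: {d * 1e9:.2f} ns")
```

Because only the body-flit term shrinks with the boosting frequency, the benefit of boosting grows with the share of body flits in the workload.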
From Eq. (1), dynamic power consumption in ith period is
Therefore, the problem of power minimization in DFS is to find a frequency (the value of $f_i$ for $i \in N$) to minimize power consumption.
Table 2. Symbols and meanings of the parameters used for this study.
Symbol    Meaning
$T_c$     Control period for the DFS link control
$f_i$     Boosting frequency in the $i$th period
Link utilization estimation
In applying DFS to a system, predicting the future workload with reasonable accuracy is a critical problem. This requires knowing how many packets will traverse a link at any given time. Two issues complicate this problem. First, it is not always possible to accurately predict future traffic activity. Second, a subsystem can be preempted at arbitrary times due to user and I/O device requests, varying the traffic beyond what was originally predicted. In order to estimate the future workload, we adopt link utilization as an indicator, which is a direct measure of the traffic through a link in each unit of time. Lower link utilization reflects more idle cycles in a link, caused either by network congestion under heavy traffic or by a sparse workload at the incoming port. Conversely, higher link utilization implies more active cycles in which the link passes flits to the destination router.
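The link utilization is measured by sampling a link during a pre-defined control period ($T_c$). The direct link utilization is defined as
$$U_L = \frac{1}{k}\sum_{t=1}^{k} u(t), \qquad u(t) = \begin{cases} 1 & \text{if a flit traverses the link at sample } t \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$
where $k$ denotes the number of samples in a control period.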
The direct estimator only measures whether a link is occupied or not. It does not consider the number of flits traversing the link during the given time. For instance, even though the link utilizations are the same in time durations $\Delta t_1$ and $\Delta t_2$, the number of flits passing through the link can differ according to the boosting clock frequency of the router at those times. Direct estimation can be realized with a counter, reducing the complexity of the estimator and the additional hardware overhead caused by the DFS link controller. A counter at each output port gathers the total number of cycles that are used to pass a flit in each control period by counting $u(t)$ with the 4× clock for accurate measurement (Fig. 2(a)). The function $u(t)$ is assigned to the write enable signal of the router, and the counter value is sampled in each control period to complete the measurement of the link utilization.
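A counter-based direct estimator of this kind can be modelled in a few lines. The sketch below assumes that utilization is the fraction of samples in which the write enable signal is asserted, and that a control period of eight base cycles sampled with the 4× clock yields 32 samples; both are assumptions made for illustration, and the function name is hypothetical.

```python
def direct_utilization(write_enable_samples):
    """Fraction of samples in one control period in which the link is busy.

    `write_enable_samples` models u(t): one 0/1 value per sampling clock edge.
    """
    k = len(write_enable_samples)
    if k == 0:
        raise ValueError("a control period must contain at least one sample")
    busy_count = sum(1 for u in write_enable_samples if u)  # the counter value
    return busy_count / k


# Assumed example: T_c = 8 base cycles, sampled with the 4x clock -> 32 samples.
samples = [1] * 20 + [0] * 12
utilization = direct_utilization(samples)
print(utilization)
```

Note that the estimator reports occupancy only; the same value can correspond to different flit counts at different boosting frequencies, as stated above.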
Network workload exhibits transient fluctuations and long-term transitions. In order to filter out transient fluctuations from the link utilization and to predict the future communication workload, a distributed history-based DVS policy was proposed [7].
The history-based link estimator uses an exponential weighted average to combine the current ($U_L(n)$) and past ($C_L(n-1)$) link utilization history, smoothing and predicting the future link utilization $C_L(n)$ as follows:
$$C_L(n) = \frac{weight \cdot U_L(n) + C_L(n-1)}{weight + 1} \qquad (6)$$
where $C_L(0) = c_0$, $i \in N$, and $weight$ is the contribution factor of the current link utilization level to the history-based link estimator. The hardware overhead is an important factor in the design of the estimator. Soteriou [2] realized the history-based estimator with two shifters and an adder by setting $weight$ equal to 3, reducing the additional hardware overhead caused by the prediction mechanism. Fig. 2(b) shows the hardware circuit for the exponential weighted average. The result of the direct estimator is fed to the exponential weighted average calculator to predict the link utilization. The history-based estimator is thus a cascade of the direct link utilization estimator and an exponential average calculator.
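The smoothing step can be sketched in software as below. The update form, (weight × current + past) / (weight + 1), is an assumption based on the description of the weight as the contribution factor of the current utilization; with weight = 3 the division becomes a right shift by two and the multiplication a shift plus an add. Function names and the sample sequence are hypothetical.

```python
def ewma_update(u_current, c_past, weight=3):
    """Assumed exponential weighted average of current and past utilization."""
    return (weight * u_current + c_past) / (weight + 1)


def ewma_update_fixed(u_current, c_past):
    """Integer version for weight = 3 using only shifts and adds.

    3 * u is computed as (u << 1) + u; division by 4 is a right shift by 2.
    Inputs are counter values (for example 0..32 busy samples per period).
    """
    return ((u_current << 1) + u_current + c_past) >> 2


# Assumed counter readings for four consecutive control periods.
history = 0
trace = []
for count in (32, 32, 0, 0):
    history = ewma_update_fixed(count, history)
    trace.append(history)

print(trace)
```

With a weight of 3 the estimate follows the current measurement closely while still damping single-period spikes; the integer version truncates, so it can differ slightly from the floating-point form.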
DFS algorithm
Given the link utilization, the DFS algorithm dynamically adapts the link frequency to achieve power savings with minimal impact on performance. It prescribes whether to increase the boosting clock frequency to a higher level, decrease it to a lower level, or do nothing. Even if the link utilization estimator predicts the workload correctly, determining how fast to run the network is nontrivial. The algorithm controlling the DFS link trades off power and performance. Intuitively, if the link utilization is going to be high ($C_L \ge p_u$), the boosting clock frequency is increased. Conversely, when the link utilization falls below the threshold value ($C_L < p_l$), the boosting clock frequency is reduced. Here $p_u$ and $p_l$ are the threshold values to increase and decrease the boosting frequency, respectively. In the simplest method, a single value is set for each of $p_u$ and $p_l$. Alternatively, multiple thresholds can be set corresponding to each state (three sets of thresholds, from $p_l^{1\times}$ to $p_l^{4\times}$ and from $p_u^{1\times}$ to $p_u^{4\times}$). In addition, the threshold values can be pre-defined at design-time or optimized at run-time. The pseudo-code of our DFS policy is shown in Algorithm 1.
Algorithm 1. Dynamic frequency scaling.
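A threshold policy consistent with the description above can be sketched as follows. This is a minimal illustration built from the prose, not the original Algorithm 1 listing: it assumes a single pair of thresholds, three boosting levels (1×, 2×, 4×), and one level change per control period. The default thresholds of 80% and 50% are the values used in the experiments; the function and variable names are hypothetical.

```python
LEVELS = (1, 2, 4)  # boosting multipliers 1x, 2x, 4x


def dfs_step(level_idx, predicted_util, p_u=0.8, p_l=0.5):
    """Return the next boosting level index for one control period.

    Raise the level when predicted utilization reaches p_u, lower it when
    the prediction falls below p_l, otherwise keep the current level.
    Only adjacent levels are reachable in a single step.
    """
    if predicted_util >= p_u and level_idx < len(LEVELS) - 1:
        return level_idx + 1
    if predicted_util < p_l and level_idx > 0:
        return level_idx - 1
    return level_idx


# Assumed sequence of predicted utilizations, one per control period.
predictions = [0.90, 0.95, 0.70, 0.30, 0.20, 0.10]
idx = 0
chosen = []
for c_l in predictions:
    idx = dfs_step(idx, c_l)
    chosen.append(LEVELS[idx])

print(chosen)
```

The band between the two thresholds acts as hysteresis: a prediction of 0.70 leaves the level unchanged, which avoids toggling the clock on small fluctuations.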
Link controller
The link controller is implemented as a Moore machine. Each state represents a boosting clock frequency ($f_{1\times}$, $f_{2\times}$, or $f_{4\times}$) with a two-bit value, and the machine output, equal to the state value, is passed to the clock domain multiplexer. In our link controller, there is no direct transition between state $f_{1\times}$ and state $f_{4\times}$; clock domain transitions only occur between adjacent clock frequencies. We assign the state values such that the Hamming distance between state transitions is 1. The clock is the most important and sensitive signal in a system, and glitches during clock domain transitions make the system unstable, resulting in erroneous signals. To ensure constancy of the clock phase during clock domain changes, we set the control period to a multiple of the clock period of the original clock frequency, and the link control function is performed in each control period.
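The state assignment constraint can be checked with a small model of the controller. The particular two-bit codes below (00, 01, 11) are an assumed Gray-style assignment that satisfies the Hamming distance requirement; the text does not state the actual codes, and all names are hypothetical.

```python
# Assumed two-bit state codes; adjacent states differ in exactly one bit.
STATE_CODE = {"f1x": 0b00, "f2x": 0b01, "f4x": 0b11}
ORDER = ["f1x", "f2x", "f4x"]


def hamming(a, b):
    """Number of differing bits between two state codes."""
    return bin(a ^ b).count("1")


def next_state(state, up, down):
    """Moore machine transition: move only between adjacent frequencies."""
    i = ORDER.index(state)
    if up and i < len(ORDER) - 1:
        i += 1
    elif down and i > 0:
        i -= 1
    return ORDER[i]


def output(state):
    """Moore output equals the state code; it drives the clock multiplexer."""
    return STATE_CODE[state]


adjacent_distances = [hamming(STATE_CODE[a], STATE_CODE[b])
                      for a, b in zip(ORDER, ORDER[1:])]
print(adjacent_distances)
```

Because every permitted transition flips a single select bit, the multiplexer select lines never pass through an unintended intermediate code during a frequency change.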
Physical characteristics
A logic description of our router's components was obtained with the synthesis tool from Synopsys using TSMC 90 nm technology. Table 3 summarizes the physical characteristics of each element. The clock boosting router operates at up to 425.5 MHz and the FIFO operates at up to 1.72 GHz. For this paper, the FIFO depth is set to eight. A wider control period imposes more area cost in the DFS control logic because it requires a wider register, adder, and shifter.
The overall design, including a router, eight FIFOs of depth 8, and six DFS controllers, occupies an area of approximately 0.143 mm$^2$ in 90 nm technology (footnote 1). By comparison, the ARM11 MPCore and PowerPC E405, which provide multi-CPU designs, occupy 1.8 and 2.0 mm$^2$ in the same technology, respectively [17, 18]. If the link were integrated within a CMP as an interconnection network, the area overhead imposed by the network would be reasonable, showing the feasibility of the design.
Experimental results
Experimental setup
In order to evaluate the performance of the proposed DFS link, each component was modelled in Verilog HDL. Fig. 3 shows the experimental setup for evaluating the characteristics of a single DFS link. For this experiment, the source router sends packets to the sink router, and a FIFO is located between the routers. Packets arrive at the source router from four inputs (North, West, South, and the Internal node) and are forwarded to the East output port. The power consumed by an intermediate FIFO depends largely on the amount of buffering and the architecture. The depth of the FIFO between links is fixed to eight in this experiment. For measuring throughput and adjusting the incoming traffic, we adopted a standard interconnection network measurement setup [19], where the packet generator is placed in front of an infinite-depth source queue and the input time of each packet is recorded whenever it is generated.
The power consumption of the interconnection network was extracted using 90 nm technology. The RTL description is synthesized to a gate-level netlist with Synopsys Design Compiler [20] using the technology library. As part of this step, physical information such as the RC parasitic file (SPEF), the standard delay format (SDF) file, and the design constraints (SDC) file is also generated for use in gate-level power analysis. The gate-level simulation extracts the latency of each packet and generates a value change dump (VCD) file for the power analysis. Power analysis with the Synopsys PrimeTime PX tool [20] creates a detailed power waveform with nanosecond resolution using the switching and physical information.
... for a 7 × 7 network. It confirms the presence of high variance in network traffic. In this analysis, power consumption under a given traffic pattern is investigated. Even though this traffic pattern cannot realistically reflect all types of traffic that will traverse the network, it provides a reasonable measure of the performance of this method.
DFS link characteristics
The DFS link characteristics for each boosting clock frequency are obtained by simulation under the given workload (see Table 4). The 1× boosting router finishes the entire packet transmission in 24.34 ms, spending more time than the 2× and 4× boosting routers. It also has the highest average and peak latency. The 2× boosting router reduces the average latency by about 81% at the expense of 16% more dynamic power compared with the 1× boosting router. Similarly, the 4× boosting router is much better than the 1× boosting router in terms of latency, reducing the average latency by 87%; however, it consumes 21% more dynamic power. Compared with the 2× boosting router, it reduces the average latency by around 32% at the expense of only 5.1% more dynamic power (footnote 2).
These experimental results demonstrate the feasibility of the clock boosting router as a DFS link for a power-aware on-chip interconnection network on an NoC platform.
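The latency and power percentages reported above for the three boosting levels can be cross-checked by composing the pairwise figures. The short sketch below uses only the rounded percentages quoted in the text, so the composed values are expected to agree with the reported 4× versus 1× figures only to within about one percentage point; the variable names are hypothetical.

```python
# Reported pairwise changes (fractions), taken from the text above.
LATENCY_CUT_2X_VS_1X = 0.81
LATENCY_CUT_4X_VS_2X = 0.32
POWER_INCREASE_2X_VS_1X = 0.16
POWER_INCREASE_4X_VS_2X = 0.051

# Compose the two steps to estimate 4x relative to 1x.
latency_cut_4x_vs_1x = 1 - (1 - LATENCY_CUT_2X_VS_1X) * (1 - LATENCY_CUT_4X_VS_2X)
power_increase_4x_vs_1x = (1 + POWER_INCREASE_2X_VS_1X) * (1 + POWER_INCREASE_4X_VS_2X) - 1

print(f"latency reduction, 4x vs 1x: {latency_cut_4x_vs_1x:.3f}")
print(f"power increase,    4x vs 1x: {power_increase_4x_vs_1x:.3f}")
```

The composed latency reduction is close to the reported 87%, and the composed power increase is close to the reported 21%, which supports reading the 4× latency figure as a reduction by 87% relative to 1×.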
History-based DFS
The history-based DFS policy was applied to the DFS link.
Threshold values for the link controllers, $p_u$ and $p_l$, were set to 80% and 50%, respectively. Fig. 4 shows the simulation result of the history-based DFS policy when the control period ($T_c$) is eight cycles. The link utilization estimator predicts the future workload based on the history of the workload (Fig. 4(b)). The DFS policy dynamically adjusts link frequency according to the link utilization level (Fig. 4(c)). Fig. 4(d) shows that history-based DFS successfully adjusts link frequency to track actual link utilization over time, reducing ...
Fig. 3. Experimental setup.
Footnote 1. The clock boosting router has two disjoint sub-networks, west-to-east and east-to-west, for deadlock-free operation, utilizing eight input FIFOs [3].
Footnote 2. From Eq. (4), the dynamic power of 2× and 4× operation is 26% and 40% more than the power of 1× operation ($P_{D2\times} = 1.26 \cdot P_{D1\times}$ and $P_{D4\times} = 1.4 \cdot P_{D1\times}$), by assuming $P_{D1\times} =$ ...
... router consumes 1.85 and 2.12 mW, respectively, demonstrating the possibility of run-time power management for the given workload.
Choosing a wider control period further slows down the adaptation of the link frequency to the given traffic, exacerbating latency. While there is a trade-off between power and performance for control periods from 8 to 64 cycles, history-based DFS with a control period of 128 cycles consumes more power. It also increases the latency because the control period is too long for the given workload.
For an on-chip interconnection network, latency is a suitable indicator of network performance. The trade-off between power consumption and latency depends on the length of the control period for the DFS policy. Even though a longer control period saves more power, it suffers from excessive latency. For the given workload, a control period of eight cycles is preferable when an application has tight timing requirements. However, a longer control period might be enough to meet the system requirements while saving more power.
In general, each application has its own power and performance demand to complete an assigned task within the desired time budget. A designer should keep in mind the system requirements in applying DFS for the on-chip interconnection network.
Conclusions
We have presented the notion of DFS link with fast response time using clock boosting mechanism for on-chip interconnection network. The history-based DFS policy allowed link frequency to be adjusted judiciously commensurate with the workload, balancing between power dissipation and latency penalty. The proposed DFS link and the corresponding algorithm require a simple hardware implementation, making the proposed new idea a practical option for the future NoC designs.
