In this paper we present NoCEE, a fast and accurate method for extracting energy models for packet-switched Network on Chip (NoC) routers. Linear regression is used to model the relationship between events occurring in the NoC and energy consumption. The resulting models are cycle accurate and can be applied to different technology libraries. We verify the individual router estimation models with many different synthetically generated traffic patterns and data inputs. Characterization of a small library takes about two hours. The mean absolute energy estimation error of the resultant models is 5% (10% max) against a complete gate level simulation. We also apply this method to a number of complete NoCs with inputs extracted from synthetic application traces and compare our estimated results to the gate level power simulations (mean absolute error is 5%). Our estimation methodology has been integrated with commercial logic synthesis flow and power estimation tools (Synopsys Design Compiler and PrimePower), allowing application across different designs. The extracted models show the different trends across various parameterizations of Network on Chip routers and have been integrated into an architecture exploration framework.
INTRODUCTION
Networks on Chip (NoCs) have arisen as a solution to the poor wire scaling and increasing complexity of large System on Chip (SoC) designs. NoCs aim to replace long shared bus wires with scalable switched networks with higher performance, predictable wiring and better interconnect properties. These networks can be made latency insensitive, simplifying the design of complex systems since communication and computation problems can be treated separately.
Although NoCs provide many benefits, they add logic complexity to the communication architecture. Power consumption in the communication architecture is a major bottleneck in current designs [17]; hence, energy-aware optimization of NoCs is of primary importance. Several optimization methods have been proposed to reduce energy consumption through application-specific customizations. These include customized router buffer sizing [15]; custom topology generation [25]; adaptive routing [16]; and mapping of processing elements to tiles [14, 21].
Motivation for this work
Various NoC architectures have been proposed, each with a differing implementation and hence, varying energy consumption characteristics. In order to explore the design space for energy minimization, a method for estimating energy is needed. Typically, a design's energy consumption is evaluated using a register transfer level (RTL) or gate level power simulator. Power simulations are essential for finding implementation bottlenecks but their lengthy run-times make them unsuitable for early design space exploration (days for a typical trace). For this reason, techniques are needed to quickly extract a fast and accurate estimation model. Estimation models also provide useful insights, especially towards architecture parameterization trends, and identification of potential energy reduction areas.
While energy estimation for a whole system is of paramount importance, the energy of specific components allows the system to be laid out to reduce hot spots in the final design, thus improving reliability of the whole design. Cycle accurate power estimation allows peaks of the power spectrum to be more accurately determined, so that the design can be re-engineered to extend battery life, as battery life is heavily influenced by power peaks. Therefore, this work is motivated by not only having to estimate energy of a router, but being able to rapidly estimate the consumption of each of the major components, and also being able to estimate energy in a cycle accurate manner. Finally, in an era where technologies are rapidly changing, it is important to be able to characterize a router built upon different technologies rapidly.
* National ICT Australia is funded through the Australian Government's Backing Australia's Ability initiative, in part through the Australian Research Council.
In this paper we present NoCEE 1 , a methodology for obtaining a fast and accurate energy macro-model for a synthesizable packet switched NoC router. The NoC router is decomposed into constituent router components and a model is built for each component. Linear regression is used to obtain a good fit between the component's events and observed cycle energy obtained from a gate-level simulator. The combination of individual component models is used to predict the cycle energy consumption of a complete router. A single traffic pattern consisting of various traffic loads and random distributions is used to stimulate the NoC router during characterization. We apply our methodology to several 3, 4 and 5 port wormhole mesh routers with uniform FIFO depths at various levels, and validate that good correlation exists between the model and gate level power simulations. The models of several routers are used together to predict the energy consumption of a complete NoC. The predicted energy consumption is compared against several complete NoC gate level power simulations to evaluate its accuracy.
The rest of the paper is organized as follows: Section 2 surveys the related work and states our contribution. Section 3 gives some background and theory. The NoC model is described in Section 4 and Section 5 presents the experimental results. Section 6 concludes the paper.
RELATED WORK
Early work [22] in power estimation attributed a fixed energy to a structural block. These pattern independent models were inaccurate as they did not account for input statistics. Thus, models were formulated to capture the changes in power dissipation when inputs are stimulated [19]. Many pattern dependent models use linear regression to obtain a good fit between the model parameters and energy consumption. Regression-based macro-modeling at the RTL has been extensively applied to various combinational and sequential circuits [2, 13, 20], as well as application specific instruction set processors [7, 12]. However, this technique has yet to be applied to packet switched routers. For the first time, we apply linear regression-based macro-modeling for energy model extraction of an NoC packet switched router.
NoC power estimation has been addressed by several recent research works. In [29], Ye et al. proposed an energy estimation flow to derive bit energy models that are used to evaluate different switch fabrics (e.g., fully-connected, Banyan, Batcher-Banyan) in network routers. Their methodology derives the energy cost of transmitting a bit in the network router from ingress to egress ports but ignores clock power and leakage power. In [26], Wang et al. created a network simulator to estimate the dynamic power of router components from CACTI scaling equations [23]. This simulator was augmented in [6] to support leakage power. The same authors have used this simulator for architectural exploration, to explore energy savings through segmented crossbars [28] and to trade off router complexity against energy [27]. Banerjee et al. [1] addressed leakage power modeling in mesh routers and derived a power-state machine for a wormhole router based on SPICE net-lists. The power-state machine allowed power to be estimated on a cycle accurate basis. However, both [26] and [1] contain models that are tightly coupled with circuit implementations. As such, these models cannot be migrated to different technology libraries without a large amount of re-characterization.
The authors in [4] automate the extraction of a power model for the STBus, a high performance communication architecture supporting shared buses as well as crossbars. They use regression based techniques to obtain a relationship between average power and the architectural parameterizations of the STBus. A packet switched router contains additional components that are not present in the STBus.
Our characterization differs from the models in [1, 26, 29] by being more adaptable to changing technologies and differing router types. We use a semi-automatic system containing linear regression to characterize each router, while the models in [1, 26, 29] have to be manually extracted.
While the work in [4] automates the extraction of a power model for an STBus using linear regression, their work does not address packet switched networks, nor cycle accurate models.
The novel contributions of our work are: 1. the ability to derive accurate, cycle-accurate system-level estimation models for the power consumption of synthesizable NoC routers; 2. a methodology to automate the extraction of the power estimation model for an NoC router library based on regression analysis between control signals and cycle-based power; and 3. a demonstration of the use of this methodology to predict NoC power at the router level, at the NoC level and within a fast estimation framework.
BACKGROUND AND THEORY

The NoC Architecture and NoC Router
In NoCs, longer interconnect wires are broken down into shorter channels with routers or relay stations forming a micro-pipeline between processing elements. Each router is buffered to allow it to be latency insensitive and the communication between the routers uses a predefined protocol and flow control. Although many NoC router variations exist, the major component subsystems remain the same. These are: (i) link controllers; (ii) crossbar switches; (iii) routing units; and (iv) arbitration units (see Figure 1).
As NoC routers have a well-defined structure, the subsystems can be modularized with similar interfaces for different implementations. For example, two routing algorithms, X-Y and west-first routing, can be designed to have the same inputs and outputs (address input and destination port). Standardizing the interfaces between subsystems allows variations to be created and reused in different architectural parameterizations.
CMOS Power Consumption and Estimation
There are two broad categories of power consumption: (i) dynamic power and (ii) static power. Dynamic power is dissipated when the circuit is active and is caused by the charging and discharging of the internal and load capacitance of a gate. Static power is dissipated when a circuit is inactive and consists mostly of source-to-drain sub-threshold leakage current, where the gate does not turn off completely.
Commercial gate-level power simulators use look up tables based on input net transition delay and output net capacitance to determine 
METHODOLOGY
Much of this work is inspired by existing work on regression-based macro-modeling used to characterize smaller circuits at the RTL. We refer the interested reader to [3] for an overview of macro-modeling techniques. In this section, we first describe the procedure to construct an NoC energy macro-model. Next, we describe how model variables are chosen and construct macro-models for each of the major NoC subsystems in a wormhole NoC router.
Overview
There are two major parts in our energy estimation flow: (i) characterization and (ii) energy estimation (see Figure 2). The characterization step builds a macro-model given a set of technology libraries, NoC parameters and traffic patterns (Steps 1-7). The technology libraries and NoC parameters are used to create the synthesizable NoC while the traffic patterns are used to simulate the router under various conditions. The second part, energy estimation, predicts the cycle energy of an NoC router from the macro-models and the event timings from a high level simulation (Steps 8 and 9).
As NoCs have a well-defined structure, macro-models can be built for each parameterization of the major subsystems and reused. We characterize each of the subsystems separately by extracting their models while they are operating within a complete NoC router. This allows multiple subsystems to be characterized together in a single simulation while capturing the frequently occurring behaviors of individual subsystems. A range of possible router configurations is created in order to build models for various subsystem parameterizations (Step 1). We sweep through a range of possible parameters and create a number of NoC router implementations using the NoC generator [5].
Macro-model Inputs and Traffic Generation
To obtain a good model, a single traffic pattern must be generated that exercises the circuit under a wide range of possible conditions (Step 2). For an NoC router, it is possible to control these traffic patterns by varying the packet injection and acceptance rates, the packet destination ports and the switching activity on the data bus (changing the Hamming distance between successive flits). Varying the injection and acceptance rates synthetically exercises the router under different levels of contention. Meanwhile, adjusting the switching activity reveals the NoC router circuit's dependency on the input data. A calibration trace is created that includes a mixture of traffic patterns.
The power and signal values for every cycle are profiled via two separate simulations. The NoC configuration is synthesized and the cycle power is obtained through the gate-level power estimation flow (Steps 3 and 4). An RTL simulation is performed using the same calibration trace to capture the signal waveforms on important control and data signals (Step 5). The signal waveforms are compared against the power simulation to see which signals may affect power dissipation. Events relating to control and data signal values are chosen for inclusion in the macro-model based on their importance to the energy behavior (Step 6).
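The construction of such a calibration trace can be illustrated with a minimal Python sketch. It is not the trace generator used in this work: the function names, the per-cycle Bernoulli injection (standing in for the random arrival distributions), and the tuple layout of a trace entry are all assumptions made for illustration. The sketch sweeps several injection loads in one trace and draws a random Hamming distance for each flit so that data-dependent energy is exercised.

```python
import random


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two flit payloads differ."""
    return bin(a ^ b).count("1")


def flit_with_distance(prev: int, distance: int, width: int,
                       rng: random.Random) -> int:
    """Return a payload that differs from `prev` in exactly `distance` bits."""
    mask = 0
    for bit in rng.sample(range(width), distance):
        mask |= 1 << bit
    return prev ^ mask


def calibration_trace(loads, cycles_per_load, ports=4, width=32, seed=0):
    """Build one trace that sweeps several injection loads (hypothetical format).

    Each entry is (cycle, destination_port, payload, hamming_distance).
    Injection is a per-cycle Bernoulli trial with probability `load`.
    """
    rng = random.Random(seed)
    trace, prev, cycle = [], 0, 0
    for load in loads:
        for _ in range(cycles_per_load):
            if rng.random() < load:
                distance = rng.randint(0, width)  # vary data switching activity
                payload = flit_with_distance(prev, distance, width, rng)
                trace.append((cycle, rng.randrange(ports), payload,
                              hamming(prev, payload)))
                prev = payload
            cycle += 1
    return trace


if __name__ == "__main__":
    for entry in calibration_trace([0.1, 0.6], cycles_per_load=20)[:5]:
        print(entry)
```

Sweeping the load inside a single trace mirrors the idea of exercising the router under several contention levels in one characterization run.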
Using Regression Analysis to Obtain the Macro-model
The energy consumption and signal values at each cycle are used as observations in the regression analysis. Regression analysis is performed between the cycle energy and the macro-model signal observations (Step 7). Statistical parameters such as the $R^2$ and p-values (probability values) from the regression output are used as metrics of the goodness of fit and of the importance of variables [11]. The model parameters are modified and, if necessary, Steps 5 to 7 are repeated until a good fit is found. Steps 3 to 7 are repeated for each of the configurations until the entire library is characterized.
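The fitting step can be sketched in a few lines of Python. This is an ordinary least-squares fit written against the standard library only, not the GNU R flow used in this work; the function names and the synthetic data in the usage example are assumptions, and p-values are not computed. Each observation is one cycle: a vector of event values and the measured cycle energy.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_macro_model(events, energy):
    """Least-squares fit of cycle energy against per-cycle event vectors.

    Returns (betas, r_squared); betas[0] is the residual (intercept) term.
    """
    rows = [[1.0] + [float(v) for v in e] for e in events]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, energy)) for i in range(k)]
    betas = solve(xtx, xty)
    pred = [sum(b * v for b, v in zip(betas, r)) for r in rows]
    mean = sum(energy) / len(energy)
    ss_res = sum((y - p) ** 2 for y, p in zip(energy, pred))
    ss_tot = sum((y - mean) ** 2 for y in energy)
    return betas, 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Synthetic observations: (write_en, read_en, hamming distance) per cycle.
    ev = [(c % 2, (c // 2) % 2, c % 7) for c in range(28)]
    en = [2.0 + 5.0 * w + 3.0 * r + 0.1 * h for w, r, h in ev]
    print(fit_macro_model(ev, en))
```

A low $R^2$ on real data would suggest returning to the signal selection step and trying a different set of events.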
Energy Estimation Using the Macro-model
Once a macro-model is built, it can be used to predict cycle power. Network statistics such as the number of packets sent and bit switches can be used to quickly evaluate the total energy by substitution into the model. Multi-dimensional interpolation is used to obtain an estimate of the model coefficients if they lie between two characterized points. A cycle accurate system-level network simulator is required to reproduce cycle power at the system level. If a network simulator is able to produce a trace of the events (Step 8), a power waveform can be produced by substituting the timing of these events into the macro-model (Step 9).
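As an illustration of Steps 8 and 9, the following Python sketch interpolates coefficients between two characterized points and substitutes an event trace into the resulting model. It is a simplification under stated assumptions: interpolation is shown along a single parameter (FIFO depth) rather than in several dimensions, and the coefficient names, dictionary layout and numeric values are hypothetical.

```python
def interpolate_coeffs(depth, lo_depth, lo, hi_depth, hi):
    """Linearly interpolate each coefficient between two characterized depths."""
    t = (depth - lo_depth) / (hi_depth - lo_depth)
    return {name: lo[name] + t * (hi[name] - lo[name]) for name in lo}


def power_waveform(coeffs, event_trace, f_clk_hz):
    """Per-cycle power (W) from per-cycle event dictionaries.

    coeffs["residual"] is the event-independent energy per cycle (J);
    every other entry is the energy (J) attributed to one unit of that event.
    """
    wave = []
    for events in event_trace:
        energy = coeffs["residual"] + sum(
            coeffs[name] * value for name, value in events.items())
        wave.append(energy * f_clk_hz)
    return wave


if __name__ == "__main__":
    # Hypothetical coefficients for FIFO depths 4 and 8 (joules).
    depth4 = {"residual": 4e-12, "write": 2e-12}
    depth8 = {"residual": 8e-12, "write": 4e-12}
    depth6 = interpolate_coeffs(6, 4, depth4, 8, depth8)
    trace = [{"write": 1}, {"write": 0}, {"write": 1}]
    print(power_waveform(depth6, trace, f_clk_hz=250e6))
```

The total energy of a trace follows from the same coefficients by multiplying each by its event count, which is why summary statistics alone suffice when a waveform is not needed.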
Selection of Macro-model Variables
A macro-model is built by choosing variables that have a strong relationship to energy consumption. In this context, the variables $X_i$ are events that occur in the NoC router. Two types of events are considered for inclusion: control events and data events. Control events occur when a control signal triggers energy consumption; these are (i) signal transitions (0 to 1 or 1 to 0), and (ii) the value of a particular signal being either 0 or 1 in a particular cycle. In some cases, a control event may affect energy in a future cycle (for example, in pipelined operations). Control events can be time-shifted such that a vector of events matches the energy consumption. In NoC routers, energy is dominated by several major components such as FIFO buffers and multiplexers. The control signals of these major components have a strong influence on the total energy consumption.
A larger amount of activity on the data inputs also contributes to higher energy consumption. Hence, the Hamming distance of the data inputs is also considered. Selecting inputs from these two domains results in the following expression for the energy of a component:

$$E_{component} = \beta_{residual} + \sum_{i=1}^{n} \beta_{event_i} X_i$$

where $n$ is the number of events, $\beta_{residual}$ is the energy that is independent of the model variables, and $\beta_{event_i}$ is the regression coefficient for the data or control event $X_i$. For control events, $X_i$ takes the value 1 when a signal is present or 0 when it is not present. In cases where $X_i$ is a data variable such as Hamming distance, it is represented using an integer. In the characterization of our router, we only use the Hamming distance of all the input ports combined. In router designs or situations where individual router ports consume more energy, more data variables may be needed.
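The event types described above can be derived from sampled signal waveforms as in the following Python sketch. The helper names are hypothetical and the waveforms are assumed to be lists with one sample per clock cycle; the sketch covers level events, 0 to 1 transitions, time-shifted events and the Hamming distance data variable.

```python
def level_events(signal):
    """1 in every cycle where the control signal is asserted, else 0."""
    return [1 if v else 0 for v in signal]


def rising_events(signal):
    """1 in every cycle where the signal goes from 0 to 1.

    The signal is assumed to be 0 before the first sampled cycle.
    """
    out, prev = [], 0
    for v in signal:
        out.append(1 if (v and not prev) else 0)
        prev = v
    return out


def shifted(events, delay):
    """Delay an event vector by `delay` cycles, padding the start with zeros."""
    return [0] * delay + events[:len(events) - delay]


def hamming_events(words):
    """Hamming distance between successive data words (0 for the first cycle)."""
    if not words:
        return []
    out = [0]
    for prev, cur in zip(words, words[1:]):
        out.append(bin(prev ^ cur).count("1"))
    return out


if __name__ == "__main__":
    sel_en = [0, 1, 1, 0, 1]
    print(level_events(sel_en), rising_events(sel_en),
          shifted(level_events(sel_en), 1))
    print(hamming_events([0b0000, 0b1111, 0b1010]))
```

Each returned list is one column of observations $X_i$; a time-shifted copy of a column lets the regression attribute energy to the value a signal held in an earlier cycle.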
Model variables should be selected to be independent whenever possible. The decision to include or exclude a model variable can be aided by statistical parameters such as p-values. The process of model selection can be automated by applying a fixed threshold criterion for the removal or selection of parameters, with model variables removed iteratively. Algorithms such as branch and bound can be used to avoid fitting all combinations of variables [11]. In NoC routers, there are very few signals that have a large effect on 
Evaluation of Macro-model Energy Consumption
The cycle energy consumption of a component, $E_{component}$, can be expressed as:

$$E_{component} = \beta_0 + \sum_{i=1}^{n} \beta_i X_i + P_{leakage}\, t_{clk}$$
where $\beta_i$ is the fitted energy coefficient for model parameter $X_i$, $\beta_0$ is the cycle energy not related to any of the predictor variables, $P_{leakage}$ is the average component leakage power, and $t_{clk}$ is the cycle period. Thus, given clock frequency $f_{clk}$, cycle power can be calculated as follows:

$$P_{cycle} = E_{component} \cdot f_{clk}$$
Leakage power and Clock power
The two major sources of independent power are transistor leakage power and clock-related power. For the purposes of system-level modeling, we assume that the variance of leakage power across inputs is small. We observed this phenomenon experimentally, as leakage power for individual components is nearly invariant in the power simulations for the same configuration. We use the average leakage power figures available in the power simulation report as the architecture's fixed leakage power. Clock-related power includes dissipation in the clock network, i.e. buffers, register clock gates and clock inputs to flip-flops. All of these exhibit a near-constant power dissipation because a similar number of transitions occur on every clock cycle. In the following models, the uncorrelated energy composed of the clock and leakage energies will be labeled as $\beta_{x0}$, where $x$ is the component name.
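The effect of treating leakage as a fixed power can be illustrated numerically. The Python sketch below assumes that leakage contributes $P_{leakage} \cdot t_{clk}$ to every cycle, so that its per-cycle energy scales inversely with clock frequency while the event coefficients stay fixed; the function names and all numeric values are hypothetical.

```python
def leakage_energy_per_cycle(p_leakage_w, f_clk_hz):
    """Leakage energy (J) charged to one clock cycle: P_leakage * t_clk."""
    return p_leakage_w / f_clk_hz


def cycle_energy(beta0, betas, xs, p_leakage_w, f_clk_hz):
    """Cycle energy (J): residual + event-dependent part + leakage share."""
    dynamic = sum(b * x for b, x in zip(betas, xs))
    return beta0 + dynamic + leakage_energy_per_cycle(p_leakage_w, f_clk_hz)


def cycle_power(energy_j, f_clk_hz):
    """Average power (W) over one cycle of the given energy."""
    return energy_j * f_clk_hz


if __name__ == "__main__":
    # Hypothetical values: 3 pJ residual, 2 pJ per write, 0.1 pJ per bit flip,
    # 50 uW average leakage.
    for f in (125e6, 250e6, 500e6):
        e = cycle_energy(3e-12, [2e-12, 1e-13], [1, 8], 50e-6, f)
        print(f, leakage_energy_per_cycle(50e-6, f), e, cycle_power(e, f))
```

Only the leakage share changes between the three frequencies in this sketch, which is one way to read a characterization in which the fitted coefficients are similar across operating frequencies.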
NoC Router Model
We develop energy macro-models for a complete router using the procedure described in Section 4.1. For the sake of brevity, we only consider the model for an input-queued router; the output link controller is considered as pass-through wires that consume no energy. Although significant, energy consumed between the routers is not modeled as layout information was not available. Once the length of the wires is known, their capacitance can be found using logic synthesis tools and applied to the macro-model. A macro-model is derived for each of the major router components; these can be summed to obtain the total router energy consumption as follows:

$$E_{router} = E_{ilc} + E_{xbar} + E_{arb} + E_{route} + E_{other}$$

where $E_{ilc}$, $E_{xbar}$, $E_{arb}$, $E_{route}$ and $E_{other}$ are the energy values derived from the component macro-models for the input link controller, crossbar, arbiter, routing unit and miscellaneous glue logic respectively.
In the following sections, we apply this method to the NoCGEN router libraries [5] but this method should be applicable to other NoC libraries such as Xpipes [8] and Proteo [24] .
Link controller
We consider the energy macro-model for a single input-queued link controller (ILC). The significant contributors to the power consumption of the link controller are FIFO memory elements (implemented as RAM or register files) and their control logic (Figure 3(a)). The empty and full signals are determined based on an internal counter. In the target routers, there are two FIFOs in each link controller port (fifo0 and fifo1) due to separate data and address buses. Hence, the power consumption of the link controller can be modeled as:

$$E_{ilc} = E_{fifo0} + E_{fifo1} + E_{count} + E_{ilc\,other}$$

where $E_{fifo0}$ and $E_{fifo1}$ are the FIFO energy consumptions, $E_{count}$ is the counter energy consumption and $E_{ilc\,other}$ is the energy of the less significant components in the link controller.
When a flow control digit (flit) arrives at the link controller, it is written to the storage element. Similarly, when a flit is acknowledged by the next router, the next read address is incremented and the outputs are updated. Energy can be attributed to the reading and writing of the memory elements, hence the read_en and write_en control signals are good candidates for inclusion. When contention occurs, the counters begin incrementing, resulting in a small increase in energy consumption. Hence, the link controller model can be formulated as:
where $E_{dread}$ and $E_{dwrite}$ are the read and write control signals for the data FIFO (fifo0); $E_{aread}$ and $E_{awrite}$ are the read and write control signals for the address FIFO (fifo1); and $E_{count\,en}$ is the counter enable control signal.
Crossbar Switch
A two-port crossbar is depicted in Figure 3(b). In this crossbar implementation, an enable control input sel_en disables the output when a port is unused. Through experimentation, it was found that the two major contributing factors to energy consumption are data bus switching and the initial selection of the crossbar port when the first flit arrives. The initial selection of the crossbar port is mirrored by the 0 → 1 transition of the sel_en signal. The crossbar traversal energy is:
where $E_{dist}$ is the Hamming distance of the inputs of the crossbar and $E_{sel\,0 \to 1}$ is the selection energy when a packet first enters the crossbar.
Arbitration
When an arbitration decision is made, the selected port is stored in a register that is used to control the crossbar port selectors. In a simple priority arbiter, the energy can be decomposed into storage and computation. We selected a signal that occurred in the same cycle as computation and included it in the macro-model. The arbitration energy can be estimated by:
where $E_{sel}$ is the energy consumption during the cycle in which arbitration occurs and $E_{sel_{t-1}}$ is the energy related to the control input from the previous cycle. $E_{sel_{t-1}}$ is an example of the same signal time-shifted to reflect energy dissipated due to the signal's value in a previous cycle.
Route Unit
The route units for both the X-Y routing algorithm and the street sign routing algorithm are purely combinational, and the route is determined from the address presented by the input link controller. Based on this observation, some energy will be dissipated initially when the head flit of a packet passes through the routing unit ($E_{head}$) and a smaller amount for the following flits ($E_{flit}$). The model is as follows:
Other
Several other miscellaneous glue logic circuits consume a small amount of the total energy in the circuit. As the dynamic energy changes are small, we do not characterize these individual components. Instead, the average energy $E_{average}$ is included as part of the additional uncorrelated power that is added to the model.
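The way the component macro-models combine into a router estimate can be sketched in Python. This is an illustrative structure only: the class, the event names (`dwrite`, `sel_rise`, `sel_prev` and so on) and every coefficient value are hypothetical placeholders rather than the characterized values, and the sketch assumes one input link controller per port with a single crossbar, arbiter and route unit model.

```python
from dataclasses import dataclass, field


@dataclass
class ComponentModel:
    """Linear macro-model: residual energy plus one coefficient per event."""
    residual: float                              # uncorrelated energy per cycle
    coeffs: dict = field(default_factory=dict)   # event name -> energy per unit

    def energy(self, events: dict) -> float:
        """Cycle energy for the given event values (missing events count as 0)."""
        return self.residual + sum(
            c * events.get(name, 0) for name, c in self.coeffs.items())


def router_cycle_energy(models, ilc_events, other_events):
    """Sum the component models for one cycle.

    ilc_events: one event dict per input port (one link controller each).
    other_events: event dicts keyed by "xbar", "arb" and "route".
    """
    total = sum(models["ilc"].energy(ev) for ev in ilc_events)
    for name in ("xbar", "arb", "route"):
        total += models[name].energy(other_events.get(name, {}))
    return total + models["other"].energy({})


if __name__ == "__main__":
    # Hypothetical coefficients in pJ, chosen only to exercise the code.
    models = {
        "ilc": ComponentModel(1.0, {"dread": 2.0, "dwrite": 3.0, "aread": 1.0,
                                    "awrite": 1.5, "count_en": 0.5}),
        "xbar": ComponentModel(0.5, {"dist": 0.25, "sel_rise": 2.0}),
        "arb": ComponentModel(0.25, {"sel": 1.0, "sel_prev": 0.5}),
        "route": ComponentModel(0.25, {"head": 1.0, "flit": 0.25}),
        "other": ComponentModel(0.5),
    }
    ilc = [{"dwrite": 1, "awrite": 1}, {}]       # port 0 writes, port 1 idle
    rest = {"xbar": {"dist": 8, "sel_rise": 1}, "arb": {"sel": 1},
            "route": {"head": 1}}
    print(router_cycle_energy(models, ilc, rest))
```

Keeping each component as a separate object reflects the reuse argument made earlier: a link controller model characterized once can be instantiated for every port of a larger router.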
EXPERIMENTAL RESULTS
All experiments were conducted on a dual Opteron 224 with 2 GB of RAM. The routers were generated from the NoCGEN [5] library and generation tool (Step 1). Two types of routers were used in experiments: custom routers (capable of a variable number of ports) and mesh routers. All routers were configured with 32-bit data buses and 16-bit address buses. FIFO buffers were implemented using two-port clock-gated register files with counter logic determining the FIFO full and empty signals. The routers were characterized at operating frequencies of 125 MHz, 250 MHz and 500 MHz. At these three frequencies, the model coefficients were similar, with the per-cycle leakage energy scaled accordingly. The results shown here are for 250 MHz.
A C++ program was written to generate trace files for the VHDL simulation given a random distribution (Poisson, Gaussian, etc.) for packet arrival times and packet length (Figure 2, Step 2). Each trace file contains send packet, idle and receive packet commands that are used to control the traffic in and out of the router ports. Each send packet is annotated with the source and destination addresses to enable control of the destination and the data bus switching activity.
All router designs were synthesized using Synopsys Design Compiler and a 90 nm TSMC standard cell library (tcbn90gtc) with automatic clock gating enabled. Modelsim was used for all the VHDL simulations. As layout information is unavailable until after place and route, we assign an estimated load of 50 fF to the router outputs. Synopsys PrimePower was used to obtain the power estimates and the power waveform. A program was written in C++ to combine the outputs of PrimePower and Modelsim into a time-series comma separated file. The statistical package GNU R was used to perform regression analysis. Perl scripts were used to link all the tools together.
Experiments Conducted
Three separate experiments were performed to ensure the validity of our models: at the single router level; within a complete NoC (multiple routers connected together in a mesh arrangement); and in a fast power estimation environment. First, we characterize the NoC library for two library parameters, number of ports and FIFO depth, using the method described in Section 4. We evaluated the model prediction error against several gate-level power simulations with a separate set of randomly generated traffic patterns (72 traces in total) with varying injected loads. Second, we investigate the port scaling energy characteristics of different components in the router. Next, we create and synthesize various complete NoCs and stimulate these using application traces. The errors between the model and gate simulation experienced when running these applications on complete NoCs are also provided.

by {ports, FIFO depth}) are shown in Table 1. Characterization took about five to ten minutes per configuration and only a couple of hours for the complete set of configuration points. Note that characterization only needs to be done once for each sub-component. For example, the input link controller circuit for a single port may be identical in 3 port and 4 port routers; hence, the model for a 3 port router can be used in the 4 port router model. Table 1 shows that the 3 port and 4 port routers exhibit similar coefficients.
Validation
Once the routers were characterized, we used these macro-models to predict the energy consumption for a set of traffic patterns that were different from the calibration set. These patterns were randomly generated with a fixed applied load. Each router port trace consists of 500 packets sent to random destination ports with a given fixed applied load varying from 0.1 to 0.6 flits per port, with each packet containing five flits. Figures 4(a-c) show the errors across different traffic loads for the 3 port, 4 port and 5 port routers respectively. From these figures, the maximum absolute energy estimation error is 9.9% and the average absolute error is 4.6%. Given application statistics such as the number of packets, data switching rate and time in contention, it takes less than a second to determine the energy consumption. Figures 4(a-c) also show that the model tends to slightly under-predict power when there is a high applied load. This is because our model captures most but not all dynamic effects when there is a large amount of port contention. For cycle-based prediction, the mean absolute relative cycle error is less than 20% across the validation set. A portion of the power waveform from the validation trace of the 4 port mesh router with a FIFO depth of 4 and an applied load of 0.2 flits per port is shown in Figure 5(a). The PrimePower waveform is the dotted line and the predicted waveform is the unbroken line. Figure 4(d) shows the distribution of the cycle-based energy estimation errors (measured energy - predicted energy) for the same trace as in Figure 5.
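The error figures quoted above are of the kind computed by the following Python sketch. The function names and the sample values are hypothetical, and the sign convention (predicted minus measured, relative to measured) is an assumption chosen so that a negative mean indicates under-prediction.

```python
def relative_errors(measured, predicted):
    """Signed relative error per trace: negative means under-prediction."""
    return [(p - m) / m for m, p in zip(measured, predicted)]


def mean_abs_error(measured, predicted):
    """Mean of the absolute relative errors."""
    errs = relative_errors(measured, predicted)
    return sum(abs(e) for e in errs) / len(errs)


def max_abs_error(measured, predicted):
    """Largest absolute relative error."""
    return max(abs(e) for e in relative_errors(measured, predicted))


def mean_signed_error(measured, predicted):
    """Mean signed relative error, useful for spotting a systematic bias."""
    errs = relative_errors(measured, predicted)
    return sum(errs) / len(errs)


if __name__ == "__main__":
    # Hypothetical total energies per validation trace (arbitrary units).
    gate_level = [100.0, 200.0, 50.0]
    model = [95.0, 210.0, 50.0]
    print(mean_abs_error(gate_level, model), max_abs_error(gate_level, model))
```

Reporting the signed mean alongside the absolute figures helps separate a consistent bias, such as under-prediction at high load, from scatter around the measured value.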
Custom Router and Router Port Scaling
An exploration of the energy coefficient trends was undertaken for several larger non-mesh routers. We parameterized the NoC generator to create a custom topology router with the destination port embedded in the address, with the number of ports varying from 4 to 32 and a uniform 4-flit FIFO buffer. Each router was synthesized using the same methods as in the previous experiments and models were extracted. Figure 6 shows the trends in the link controller and crossbar switch. For a single link controller, the residual energy $\beta_{ilc0}$ remains nearly constant. For a complete router, there is one link controller for every input port, hence the residual energy will increase linearly with the number of ports. Figure 6(c) shows the increase in the crossbar Hamming energy $E_{dist}$ with the number of ports. The observed relationship is linear and can be explained by the increased logic depth in larger crossbars. Meanwhile, the crossbar residual energy (Figure 6(d)) increases quadratically with the number of ports. This is due to the quadratic increase in area as the number of input ports and output ports increases.
The macro-models from the previous section can be used to rapidly predict the energy consumption of an application running on any supported network topology (2 × 2, 4 × 2, 4 × 4, etc.). We integrate the energy macro-model into a graph-based NoC performance estimation framework to predict the energy for a complete NoC. This framework estimates the performance and power given the application as a task graph, the NoC architecture as a topology graph, and the energy model as specified in Section 4 (see Figure 7).
Task graphs model the application dependencies in a graph with three types of nodes: communication, computation and handshake nodes. Graph edges define the dependencies between the different types of nodes. The task graph format is similar to the Communication Analysis Graphs presented in [18]. Task graphs are obtained from various sources including synthetically generated task graphs and realistic hand-crafted task graphs [9]. We estimate the energy consumption us-

The synthetic task graphs were created using TGFF 3.1 [10] with arbitrary packet sizes and task lengths. These were created to give larger examples. TGFF is configured to generate a task graph containing computation tasks, communication dependencies, communication volume and processing element execution times. The task execution time and communication properties are summarized in Table 2. Each task is randomly assigned to a processing element. Communication between tasks that have been assigned to the same processor is omitted from the analysis as this is not transmitted over the network.

Table 2: TGFF Benchmark Parameters

Two benchmarks, auto-industry and telecom, were collected from the E3S benchmark suite [9]. These benchmarks contain 24 and 30 tasks respectively. Each benchmark consists of a task graph containing computation tasks (nodes), communication volumes (edge property), communication dependencies, and a library of processing elements that can perform the computation task (node properties). Each task is manually assigned to a processor based on the library of processing elements. This determines the length of a computation task in clock cycles.
To estimate performance and energy, a method similar to that used in [18] was used to estimate the timing of network events. Each router port is reserved when requested and, where there is contention, waiting times are computed. We modify their algorithm to support arbitrary network topologies, routing algorithms and NoC protocol timings. The event durations are annotated in the graph and the event timing is back-annotated into the task graph. Once the graph is completely annotated, a breadth-first search longest path algorithm is performed to compute the starting time and duration of each node. An event list with starting and ending times is then substituted into the energy macro-model. Table 3 shows a summary of the different simulations and their respective predicted and measured energy. The fast power estimation technique produced a power waveform and energy estimate in under a second, while the gate-level simulation took from several minutes to hours to estimate the total energy and power waveforms, e.g. 30 minutes for the TGFF4 simulation. Table 4 summarizes the execution times for the full gate-level simulation and the predicted simulation. The energy estimation errors from the extracted models are well below the estimation errors of commercial estimation tools (typically around 20% absolute error with gate-level simulations) and the simulation time is often three to four orders of magnitude faster than a complete gate simulation. As such, we conclude that our models are accurate enough for the purpose of high-level design space exploration for NoC router architectures.
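The longest path computation over an annotated task graph can be sketched in Python as follows. This is a generic breadth-first (topological order) traversal of a directed acyclic graph, not the modified algorithm of [18]: it assumes node durations are already annotated and does not model port reservation or contention. The function names and the example graph are hypothetical.

```python
from collections import deque


def schedule(durations, edges):
    """Earliest start time of every node in a task graph (a DAG).

    durations: node -> duration in cycles; edges: (predecessor, successor).
    A node starts when its slowest predecessor finishes (longest path).
    """
    succs = {n: [] for n in durations}
    indeg = {n: 0 for n in durations}
    for u, v in edges:
        succs[u].append(v)
        indeg[v] += 1
    start = {n: 0 for n in durations}
    queue = deque(n for n in durations if indeg[n] == 0)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        finish = start[u] + durations[u]
        for v in succs[u]:
            start[v] = max(start[v], finish)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if visited != len(durations):
        raise ValueError("task graph contains a cycle")
    return start


def event_list(durations, edges):
    """(node, start, end) tuples sorted by start time."""
    start = schedule(durations, edges)
    return sorted(((n, s, s + durations[n]) for n, s in start.items()),
                  key=lambda e: (e[1], e[0]))


if __name__ == "__main__":
    durations = {"compA": 10, "comm": 4, "compC": 3, "compB": 6}
    edges = [("compA", "comm"), ("comm", "compB"),
             ("compA", "compC"), ("compC", "compB")]
    print(event_list(durations, edges))
```

The resulting start and end times for communication nodes are the event timings that would be substituted into the router macro-models to obtain a power waveform.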
CONCLUSIONS
We have presented a technique for extracting a fast and accurate energy macro-model from a Network on Chip library. Using this technique, we have characterized a set of routers generated using the NoCGEN framework. Validation of the macro-model against gate level estimates shows that, for the randomly generated test patterns, the mean absolute energy estimation error is less than 5%. We are currently applying these models in an architecture synthesis environment.
