Abstract-A Network-on-Chip (NoC) is an energy-efficient on-chip communication architecture for Multi-Processor System-on-Chip (MPSoC) architectures. In earlier papers we proposed two Network-on-Chip architectures based on packet-switching and circuit-switching. In this paper we derive an energy model for both NoC architectures to predict their energy consumption per transported bit. Both architectures are also compared with a traditional bus architecture. The energy model is primarily needed to find a near-optimal run-time mapping (from an energy point of view) of inter-process communication to NoC links.
I. INTRODUCTION
In the Smart chipS for Smart Surroundings (4S) project [1] we propose a heterogeneous Multi-Processor System-on-Chip (MPSoC) architecture with run-time software and tools. The MPSoC architecture consists of a heterogeneous set of processing tiles interconnected by a Network-on-Chip (NoC), as depicted in Figure 1. The size of a processing tile is assumed to be less than 5 mm² in 0.13 µm technology. By exploiting the available parallelism, the processing tiles can run at a relatively low frequency (below 500 MHz) and still achieve enough performance. The architecture, including the run-time software, can replace inflexible ASICs for future mobile systems. Mobile systems are typically battery powered and have to support a wide range of applications, so they have to be flexible as well as energy-efficient. We consider a set of streaming applications that run for a considerable period (seconds and more), e.g. wireless baseband processing (DAB, DRM, DVB) and multi-media processing (MPEG-2, MPEG-4).

To map these applications on a parallel architecture like an MPSoC we assume the application is represented as communicating parallel processes. One possible representation is a Kahn based process graph model [2], which is a directed graph with nodes representing sequential processes and edges representing FIFO communication between processes.

The MPSoC architecture of the 4S-project is controlled by a central operating system called OSYRES [3], which runs on one of the GPPs of the MPSoC. The main task of OSYRES is to manage the system resources. It tries to satisfy Quality of Service (QoS) requirements, to optimize the resource usage and to minimize the energy consumption.

Jens E. Becker, Jürgen Becker, University of Karlsruhe, Institute for Information Processing Technology (ITIV), D76128 Karlsruhe, Germany
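The Kahn based process graph model mentioned in the introduction can be illustrated with a minimal sketch. The class name, method names and process names below are hypothetical and chosen only for illustration; the sketch is not the representation used by OSYRES or the spatial mapping tool.

```python
from collections import deque


class ProcessGraph:
    """Directed graph: nodes are sequential processes, edges are FIFO channels."""

    def __init__(self):
        self.processes = set()
        self.channels = {}  # (producer, consumer) -> deque used as a FIFO

    def add_channel(self, producer, consumer):
        # Adding an edge implicitly registers both end points as processes.
        self.processes.update((producer, consumer))
        self.channels[(producer, consumer)] = deque()

    def write(self, producer, consumer, token):
        # Writes never block in a Kahn model (unbounded FIFO).
        self.channels[(producer, consumer)].append(token)

    def read(self, producer, consumer):
        # A real Kahn process blocks on an empty FIFO; this sketch simply
        # lets deque.popleft() raise IndexError in that case.
        return self.channels[(producer, consumer)].popleft()


graph = ProcessGraph()
graph.add_channel("demodulator", "decoder")  # hypothetical process names
for token in (1, 2, 3):
    graph.write("demodulator", "decoder", token)
first = graph.read("demodulator", "decoder")
```

Each edge of such a graph is an inter-process communication stream that has to be mapped to NoC links once the processes are assigned to tiles.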
To reduce the energy consumption of the overall application we map each process on the processing tile that can execute it most efficiently. This spatial mapping of processes is performed at run-time by the spatial mapping tool (SMIT) [4]. OSYRES determines when the spatial mapping tool is called. Mapping processes to processing tiles on the MPSoC introduces communication, because data has to be moved to the successive processing tiles.
Traditionally, communication between processing tiles is based on a shared bus. For larger MPSoCs with many processing tiles, however, it is expected that the bus will become a bottleneck from a performance, scalability and energy point of view [5]. Therefore, we propose a multi-hop Network-on-Chip, where the network consists of a set of routers interconnected by links.
In this paper we will derive a simple energy model of two Network-on-Chip architectures. This is primarily needed for the spatial mapping tool. Using this model the tool can find a near-optimal mapping (from an energy point of view) of inter-process communication to NoC links. For this purpose a first-order estimation of the energy consumption is needed and sufficient. A complicated energy model would hamper the spatial mapping tool. A second motivation for deriving an energy model is that we can compare different NoC options (see also section IV). We compare the energy consumption of a solution based on a packet-switched wormhole router with virtual channels, a circuit-switched router with a separate best-effort network and a traditional bus.
One of the first power modeling tools was Orion, a cycle-accurate network power-performance simulator that was proposed in [6]. The capacitance of each network component is derived based on architectural parameters, and activities at each cycle trigger calculations of network power.
The rest of this paper is organized as follows. The evaluated network routers are briefly described in section II. The energy consumption of the logic can be determined as described

For the NoC we defined two networks (packet-switched and circuit-switched) that can both handle guaranteed throughput (GT) traffic and best-effort (BE) traffic. The guaranteed throughput traffic is defined as data streams that have a guaranteed throughput and a bounded latency. The best-effort traffic is defined as traffic where neither throughput nor latency is guaranteed. The BE network handles traffic like configuration data, interrupts, status messages etc.
A. Packet-Switched Network
The packet-switched router implements wormhole switching with virtual channel flow control. The advantage of wormhole routing is that the buffer size is independent of the packet size. The virtual channels are used to decrease the chance of blocking and enable the routing of guaranteed throughput traffic.
The packet-switched router described by Kavaldjiev [7] has five input and five output ports and four virtual channels (VCs) per port. The flits (atomic units) of a packet are labeled with their virtual channel number and they are buffered in four-flit-deep queues at the input ports. Per port four queues are available, one queue per virtual channel.
The access to the crossbar is arbitrated by five round-robin arbiters, one arbiter per crossbar output. This arbitration is sufficient since a conflict can only arise when more than one queue contains flits destined for the same output port. Due to the predictable round-robin arbitration the router is able to handle guaranteed throughput traffic if only a single data stream is assigned to a VC.
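The predictability of round-robin arbitration can be illustrated with a minimal behavioral sketch. The class and parameter names are hypothetical, and the code models only the grant order of one arbiter; it is not the router implementation of [7].

```python
class RoundRobinArbiter:
    """Behavioral model of one round-robin arbiter (one per crossbar output)."""

    def __init__(self, num_requesters):
        self.num_requesters = num_requesters
        # Start so that requester 0 has the highest priority on the first grant.
        self.last_granted = num_requesters - 1

    def grant(self, requests):
        """requests[i] is True when queue i has a flit for this output.

        Returns the index of the granted queue, or None if nobody requests.
        """
        for offset in range(1, self.num_requesters + 1):
            candidate = (self.last_granted + offset) % self.num_requesters
            if requests[candidate]:
                self.last_granted = candidate
                return candidate
        return None


arbiter = RoundRobinArbiter(num_requesters=4)  # 4 is an illustrative value
requests = [True, False, True, False]
grant_order = [arbiter.grant(requests) for _ in range(3)]  # [0, 2, 0]
```

Because the search always starts just after the most recently granted queue, a requesting queue is served after at most `num_requesters - 1` other grants, which is what makes the waiting time bounded.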
The best-effort packets can be assigned to the same output VC. All of the packets competing for the same output VC are tagged by the sender with a unique identifier. Each router has a global counter that counts permanently and whose value is distributed to all inputs. When an output VC is freed, the next packet that takes it is the one whose id equals the current counter value. The uniqueness of the id guarantees conflict-free arbitration, but does not guarantee bandwidth or latency. Since the counter value at the moment a VC is freed is effectively random, fairness is provided.
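The counter-based best-effort arbitration can be sketched as follows. The function name, the modulo range `num_ids` and the choice to let the counter advance until it matches a waiting id are assumptions made for illustration; the source only states that the packet whose id equals the counter value takes the freed VC.

```python
def take_freed_vc(waiting_ids, counter, num_ids):
    """Decide which waiting best-effort packet takes a freed output VC.

    waiting_ids: set of unique packet ids competing for the VC.
    counter:     current value of the free-running global counter.
    num_ids:     counter range (assumed to wrap modulo num_ids).

    Returns (winning_id, next_counter_value); winning_id is None when
    no packet is waiting.
    """
    if not waiting_ids:
        return None, counter
    # Assumption: the counter keeps counting until it equals a waiting id.
    for _ in range(num_ids):
        if counter in waiting_ids:
            return counter, (counter + 1) % num_ids
        counter = (counter + 1) % num_ids
    raise ValueError("waiting ids must lie in range(num_ids)")


winner, counter = take_freed_vc({1, 3}, counter=2, num_ids=4)
```

Since every id is unique, at most one packet can match the counter, so no further conflict resolution is needed. How long a given packet waits depends on the counter value at the moment the VC is freed, which is why neither bandwidth nor latency is guaranteed.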
B. Circuit-Switched Network
The second network is a guaranteed throughput circuit-switched router [8] in combination with a separate best-effort network [9]. By using dedicated techniques for both types of traffic (BE and GT) we can reduce the total area and power consumption.
For the moment the circuit-switched router has five bidirectional ports where one port is connected to a processing tile and four ports via a bi-directional link (16 bit wide per direction) to their neighboring circuit-switched routers. The bidirectional link between two routers consists of uni-directional lanes (e.g. four lanes in each direction). Each lane can be used by a unique data-stream and more than one lane per link increases the flexibility as in time division multiplexed systems. Four lanes of four bits per link have been chosen to reduce the number of wires between routers, but it requires serialization of the 16 bit data items of the processing tiles. The serialization is handled by the data-converter that connects the (16 bit) tile interface to the small (4 bit) lanes.
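The serialization performed by the data-converter can be sketched as below. The function names and the most-significant-nibble-first order are assumptions for illustration; the source does not state in which order the 4-bit parts are sent.

```python
def serialize_word(word, lane_width=4, word_width=16):
    """Split a tile word into lane-sized parts (most significant part first)."""
    if not 0 <= word < (1 << word_width):
        raise ValueError("word does not fit in word_width bits")
    mask = (1 << lane_width) - 1
    parts = word_width // lane_width
    return [(word >> (lane_width * i)) & mask for i in reversed(range(parts))]


def deserialize_parts(parts, lane_width=4):
    """Rebuild the tile word from the parts received on a lane."""
    word = 0
    for part in parts:
        word = (word << lane_width) | part
    return word


nibbles = serialize_word(0xABCD)       # [0xA, 0xB, 0xC, 0xD]
restored = deserialize_parts(nibbles)  # 0xABCD
```

With these widths every 16 bit data item occupies its 4 bit lane for four transfers, which is the price paid for the reduced number of wires between routers.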
To minimize energy consumption the circuit-switched router has fully separated data and control paths and cannot serve best-effort traffic. The best-effort traffic is handled via a separate ring network [9] that can transport packets (16 bit data, 16 bit address) to all the processing tiles and circuit-switched routers. Via the configuration interface of the circuit-switched router a single best-effort packet can configure one lane. On average we can transport the reconfiguration data in less than 1 ms over the BE configuration network. This is fast enough, because the configuration of the crossbar will not change frequently due to the long-lived guaranteed throughput data streams between processing tiles.
III. POWER MEASUREMENTS NETWORK ROUTERS
Benchmarking a NoC router is not a trivial task, because as far as we know no general method has been defined for on-chip networks. In this paper the power estimation of the logic is performed by modeling the design in VHDL. The synthesized VHDL-design is then annotated via a set of test-scenarios. We can estimate the power consumption per scenario using Synopsys Power Compiler [10] and the annotated design.
We expect that the power consumption of a single router depends on at least four parameters:
1) The average load of every data stream. This varies between 0% and 100% of the available bandwidth of a single lane/link.
2) The amount of bit-flips in the data stream. This varies from no bit-flips (i.e. transmitting constant values) to continuous bit-flips.
3) The number of concurrent data streams through the router, which in our case has a maximum equal to the number of lanes or virtual channels (20).
4) The amount of control overhead in the router (e.g. buffers, arbitration).
A. Used Traffic Patterns
To test the parameter sensitivity of our router we defined a set of test scenarios for traffic patterns. This set has three levels for the number of bit-flips:
• Best case (no bit-flips, transmitting only zeros)
• Worst case (continuous bit-flips)
• Typical case (random data with 50% bit-flips)

Furthermore, to vary the amount of traffic that concurrently traverses the router we defined ten scenarios. The scenarios have a variable number of concurrent data streams with a variable load between 0% and 100%. The ten scenarios are listed in Table I.

[Table I: test scenarios. Only fragments of the table survive: "Two times the number of streams of 7"; "9 15"; "Three times the number of streams of 7"; "10 20"; "All the lanes / virtual channels are occupied".]

The left graph of Figure 2 depicts the dynamic power consumption of the packet-switched network depending on the offered load for typical data. The middle graph of Figure 2 depicts the dynamic power consumption of the circuit-switched network plus best-effort router depending on the offered load for typical data. The power consumption of the extra required best-effort network is measured with a separate testbench [9]. The power consumption of this small extra router varied between 8.4 and 12.3 µW/MHz. In this paper we use the measurement of the guaranteed throughput traffic and add the worst-case power consumption of the best-effort network to find the worst-case power consumption of the combination.

We noticed a relatively high offset in the dynamic power consumption. This could be reduced by including clock gating to switch off the inactive lanes. This resulted in the right graph of Figure 2, where the remaining offset is mainly determined by the best-effort network.
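The three bit-flip levels of the traffic patterns can be illustrated with a small generator sketch. The function names, the 16 bit word width and the fixed random seed are assumptions for illustration; they are not the test benches used for the measurements.

```python
import random


def best_case(n_words):
    """No bit-flips: transmit only zeros."""
    return [0] * n_words


def worst_case(n_words, width=16):
    """Continuous bit-flips: every bit toggles on every word."""
    all_ones = (1 << width) - 1
    return [all_ones if i % 2 else 0 for i in range(n_words)]


def typical_case(n_words, width=16, seed=0):
    """Random data: on average half of the bits flip between words."""
    rng = random.Random(seed)
    return [rng.getrandbits(width) for _ in range(n_words)]


def bit_flip_ratio(words, width=16):
    """Fraction of bit positions that toggle between consecutive words."""
    if len(words) < 2:
        raise ValueError("need at least two words to count bit-flips")
    flips = sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))
    return flips / (width * (len(words) - 1))


ratios = {
    "best": bit_flip_ratio(best_case(1000)),
    "worst": bit_flip_ratio(worst_case(1000)),
    "typical": bit_flip_ratio(typical_case(10000)),
}
```

The best and worst cases give ratios of exactly 0 and 1, while the random pattern ends up close to 0.5, matching the "50% bit-flips" of the typical case.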
IV. COMPARING COMMUNICATION ARCHITECTURES
In this section we compare the energy consumption of a bus-based system with the two described networks. For the power consumption of the wires between the components we use a simple linear model that is derived in [9] and is based on the work of Banerjee [ 1].

$$P_{link,dyn} = (0.39 + 0.12 \cdot l_{wire}) \cdot N_{wires} \cdot L_{link}$$

where $l_{wire}$ is the length of the wire in mm, $N_{wires}$ the number of wires and $L_{link}$ the average load of the link. In the next sections we use the energy that is required to transport a single bit over a wire. In these cases $N_{wires}$ and $L_{link}$ are both equal to 1.
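As a worked example, the linear wire model above can be evaluated directly. The function and parameter names are hypothetical, and the result is expressed in the units of the model from [9], which are not stated in this section.

```python
def link_dynamic_power(l_wire_mm, n_wires=1, link_load=1.0):
    """Linear wire model: (0.39 + 0.12 * l_wire) * N_wires * L_link.

    l_wire_mm: wire length in mm.
    n_wires:   number of wires in the link.
    link_load: average load of the link, between 0.0 and 1.0.
    """
    return (0.39 + 0.12 * l_wire_mm) * n_wires * link_load


# Single wire at full load, as used for the energy per transported bit.
# For a 2 mm wire segment: 0.39 + 0.12 * 2 = 0.63
per_bit = link_dynamic_power(2.0)

# A 16-wire link at 50% load scales linearly: 0.63 * 16 * 0.5 = 5.04
half_loaded_link = link_dynamic_power(2.0, n_wires=16, link_load=0.5)
```

The constant term 0.39 does not depend on the wire length, so for short segments it dominates the cost per bit, while the length-dependent term 0.12 per mm grows with the distance between components.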
A. Energy Consumption Model Packet-Switched Router
In Figure 2a we see a high offset in the dynamic power consumption of 55.34 µW/MHz. Above the offset, an almost linear dependency between the load of the streams, the number of streams and the power consumption of the router is visible [12].

It is assumed that the bus system is organized as a regular grid of NxN processing tiles. In a single-master bus system it is assumed that all slave ports have to switch, with the result that the data has to be transported over all wire segments. The minimum

In sections IV-A and IV-B we derived the amount of energy needed to transport a single bit between processing tiles over a network-on-chip. This bit can be used as an address or data bit by the processing tiles. To make a fair comparison between the networks-on-chip and the bus we assume that 50% of the bits are used as address bits. The energy required to transport a data bit is therefore twice the energy described by equations 2 and 3. Using equation 5 and the compensated equations 2 and 3 we compare the average dynamic energy required to transport a data bit between two processing tiles.

We assume a regular grid of NxN processing tiles with a size of 4 mm² each. This results in a wire segment length ($l_{wire}$) equal to 2 mm. The average number of hops in a network-on-chip communication architecture depends on the distribution of the traffic. For uniformly distributed traffic $N_{hop}$ = 2 N. More locally oriented traffic will decrease the average number of hops.

Figure 3 depicts the average required energy per bit depending on the number of tiles in the MPSoC. For the bus we added an extra line, which models a segmented bus structure with two equally sized segments. It is assumed that this will halve the number of wire segments that are used in a bus transfer. The benefit of the Network-on-Chip is clearly visible for larger numbers of tiles.
V. CONCLUSION
In this paper we presented two Network-on-Chip architectures that are compared with a traditional bus architecture. For each architecture we derived a simple energy model that can be used by the spatial mapping tool to optimally map the on-chip communication streams. The energy models for all the architectures are relatively simple due to the derived first-order equations.
The energy models showed a lower energy consumption per bit for the Network-on-Chip architectures than for the bus, especially for larger numbers of processing tiles. The circuit-switched network is the most energy-efficient solution due to the small amount of control and buffering.
For the circuit-switched router a clock-gated implementation was also evaluated. The clock-gated design disables the clock for inactive (not configured) lanes. The implementation showed a relatively large decrease of the offset in dynamic power consumption.
