Abstract: Due to the increasing bandwidth demand on the network-on-chip (NoC), interconnection networks have become a dominant source of energy consumption in systems-on-chip (SoCs) and chip multi-processors (CMPs). Therefore, an energy-efficient NoC is key to successful SoC development. This paper presents an overview of techniques for achieving energy efficiency at different levels of NoC design: (a) the component level, where dynamic voltage scaling (DVS) and dynamic link shutdown (DLS) techniques are reviewed; (b) the circuit level, e.g., reducing the voltage swing of signals; and (c) the architectural level, where specialised tools, such as Wattch and Orion, are discussed.
Introduction
With an ever-growing number of transistors on a single chip (millions), the opportunities for integration offered today are unprecedented (Brooks and Martonoshi, 2001a; Jayaseelan and Mitra, 2009). Two directions that take advantage of such possibilities are the most popular. The first approach, chip multi-processors (CMP), integrates multiple cores on a single chip, while the second approach, systems-on-chip (SoC), integrates all of the necessary components of a computing system, e.g., memory (Adler and Friedman, 1997). Usually, CMPs are general purpose, while SoC design is tied to specific applications, such as consumer electronics (Benini and Micheli, 2001) and multimedia (Chen et al., 2006; Koziri et al., 2007).
Both SoC and CMP involve large volumes of interprocessor communication. The network-on-chip (NoC) paradigm has been widely proposed as the interconnect fabric for SoC (Benini and Micheli, 2001; Dally and Peh, 2000; Dally and Towles, 2001; Sgroi et al., 2001) and CMP (Sankaralingam et al., 2003; Taylor et al., 2002b). Under the NoC paradigm, network components consisting of multiple point-to-point links and switches are integrated in the chip to handle communication. The NoC paradigm has multiple advantages over the traditional global-wiring bus approach. First, NoC handles interconnection in a systematic way according to networking theory. Second, NoC allows multiple asynchronous clock domains, which is particularly useful in SoCs. Third, NoC scales better than the bus approach, and comes closer to achieving the 'linear effort property' principle (Tenhunen and Jantsch, 2003), under which adding a new component should incur an integration cost that depends only on the number of existing components and not on the complexity of the added component.
NoC must be able to provide tight delay guarantees (Taylor et al., 2002a). For this reason, prior micro-architectures were performance-aware rather than energy-aware (Patel et al., 1997). However, as the number of on-chip components increases, the tight delay guarantees lead to higher bandwidth requirements and, as a consequence, the NoC design complexity increases. With higher complexity, the energy consumed by the NoC also becomes significant (Ghosh et al., 2009; Shang et al., 2002). For instance, the MIT RAW NoC consumes 36% of the total system power (Shang et al., 2004). Therefore, power-aware NoC design is of paramount importance to energy-efficient SoC and CMP development. In this paper, we present an overview of the key techniques used in achieving this objective.
For an older survey of the field, the interested reader is referred to Raghunathan et al. (2003). Compared to Tenhunen and Jantsch (2003), our contributions are twofold: (a) we include more recent research results, and (b) we cover thermal issues.
Concerning the latter, many researchers, e.g., Shang et al. (2004), have suggested including thermal issues in power-aware SoC design. This becomes more important when considering that temperature rises exponentially with the increase in power consumption. Therefore, a small increase in power consumption results in a rapid temperature rise (Brooks and Martonoshi, 2001b) that may not be efficiently tackled by current state-of-the-art cooling systems (Shang et al., 2004).
The rest of the paper is organised as follows. Section 2 discusses on-chip architecture and NoC building blocks. In Section 3, optimisation methods at the component level are summarised, with components being routers and links. Circuit-level power reduction techniques are presented in Section 4, while Section 5 surveys architectural-level tools. Thermal efficiency issues are covered in Section 6. Finally, Section 7 concludes the paper. To increase readability, Table 1 lists all the acronyms used in the paper.
Preliminaries
Under the NoC framework, all inter-module SoC or CMP communications, such as those between processors, memories, and peripherals, are carried out using packet transmissions. Therefore, the modular design of SoC and CMP processors is facilitated. Moreover, the structured networks in a NoC experience well-controlled electrical parameters, such as low and predictable cross-talk (Dally and Poulton, 1998). These controlled electrical parameters eliminate timing iterations, increase bandwidth, and enable the use of high-performance circuits. Furthermore, these signalling circuits can also reduce power consumption by a factor of ten and increase propagation velocity by three times (Dally and Towles, 2001).
To further illustrate the inner workings of a NoC, an example NoC architecture specified in Tiwari et al. (1998) is discussed in the subsequent text. A chip is assumed to be split into tiles, which are pieces of the silicon die. Each tile is used to accommodate one or more client modules, such as processors, DSPs, peripheral controllers, and memory subsystems. Figure 1 shows an example division of the chip into nine (3 mm × 3 mm) tiles. The client modules communicate with each other through the network. There are no top-level connections other than the network wires. The network logic occupies a small amount of area between the tiles and consumes a portion of the top two metal layers of the chip, that is, the layers that are used for accommodating interconnection wires. The NoC components are routers and links. An example NoC architecture is shown in Figure 2, where n client modules ($A_1 \ldots A_n$) are directly connected to a single router. The router is responsible for receiving packets and directing them towards their final destination.
To understand the energy-efficient NoC models, we first define 'flit'. A flit is the smallest unit of information, comprising the header, encoding, and packet size information (Chen and Peh, 2003). As per Wang et al.
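(2003), the energy consumed by a flit during transmission is:
$$E_{flit} = (E_{wrt} + E_{rd} + E_{arb} + E_{xb} + E_{lnk}) \cdot H = (E_{buf} + E_{arb} + E_{xb} + E_{lnk}) \cdot H \qquad (1)$$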
where $E_{wrt}$ is the average energy dissipated when a flit is written to the input buffer, $E_{rd}$ is the average energy dissipated when reading a flit from the input buffer, $E_{buf} = E_{wrt} + E_{rd}$ is the average buffer energy, $E_{arb}$ is the average arbitration energy, $E_{xb}$ is the average crossbar traversal energy, $E_{lnk}$ is the average link traversal energy, and $H$ is the number of hops traversed by this flit. Because buffering, arbitration, and crossbar traversal are all part of the routing mechanism, we can say $E_R = E_{buf} + E_{arb} + E_{xb}$. Equation (1) can be reformulated as:
$$E_{flit} = E_R \cdot H + E_{wire} \cdot D \qquad (2)$$
where $E_R$ is the average router energy, $E_{wire}$ is the average link wire transmission energy per unit length assuming optimally-placed repeaters, and $D$ is the Manhattan distance between source and destination. The Manhattan distance is the length of the shortest path between two routers measured along the horizontal and vertical axes of the tile grid, i.e., the sum of the horizontal and vertical offsets. We can observe from Equation (2) that reducing $E_R$ reduces $E_{flit}$ when all other variables are held constant.
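As an illustration of Equation (2), the following Python sketch evaluates the per-flit energy on a mesh of tiles. The class and function names, the energy values, and the assumption that a flit crosses one router and one tile-length link per hop are hypothetical choices made for this example; they are not taken from Wang et al. (2003).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterEnergy:
    """Average per-flit router energies in picojoules (illustrative values only)."""
    e_wrt: float  # buffer write
    e_rd: float   # buffer read
    e_arb: float  # arbitration
    e_xb: float   # crossbar traversal

    @property
    def e_buf(self) -> float:
        # E_buf = E_wrt + E_rd
        return self.e_wrt + self.e_rd

    @property
    def e_router(self) -> float:
        # E_R = E_buf + E_arb + E_xb
        return self.e_buf + self.e_arb + self.e_xb


def manhattan_distance(src, dst):
    """Distance in tiles: sum of the horizontal and vertical offsets."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def flit_energy(router, e_wire_per_mm, src, dst, tile_mm=3.0):
    """E_flit = E_R * H + E_wire * D, assuming minimal routing on a mesh."""
    hops = manhattan_distance(src, dst)   # assumption: one router + one link per hop
    distance_mm = hops * tile_mm          # assumption: each link spans one tile edge
    return router.e_router * hops + e_wire_per_mm * distance_mm


if __name__ == "__main__":
    router = RouterEnergy(e_wrt=1.0, e_rd=1.0, e_arb=0.5, e_xb=2.5)
    # Three hops and 9 mm of wire: 5.0 * 3 + 0.5 * 9.0
    print(flit_energy(router, e_wire_per_mm=0.5, src=(0, 0), dst=(2, 1)))  # 19.5
```

With these made-up numbers the router term (15 pJ) outweighs the wire term (4.5 pJ), which is the situation in which reducing the router energy pays off most.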
In the subsequent text, we present existing research on energy-efficient NoC component design. Before continuing, though, we would like to mention that the NoC terminology has been used in the past to characterise aspects ranging from gate-level physical implementation, through system layout aspects and applications, to design methodologies and tools. One of the reasons for the widespread adoption of this terminology lies in the readily available and widely accepted abstraction models for networked communications (Bierrgaard and Mahadevan, 2006). The open system interconnection (OSI) model of layered network communication can be easily adapted for NoC usage, as proposed in Benini and Micheli (2002).
Achieving energy efficiency in NoC components
The first work recognising the need to consider power consumption constraints in interconnection network design was that of Benini and Micheli (2002), who proposed a power consumption model for routers and links. Wang et al. (2003) studied power models for different NoCs and proposed several micro-architectures for key router components, such as the segmented crossbar, cut-through crossbar, and write-through buffer. They also studied the power saving potential of an existing NoC architecture, termed 'Express Cube'. Maheshwari and Burleson (2001) proposed splitting a monolithic bus architecture into a layered architecture that is more energy efficient due to the reduced capacitive load during bus transfers.
A number of papers use dynamic voltage scaling (DVS) to reduce the power consumed by the NoC. The DVS technique was first proposed for, and is widely used in, microprocessors (Burd, 1998). DVS exploits the variance in processor utilisation: it lowers the frequency (by lowering the supply voltage) when the processor is lightly loaded, and raises it to the maximum when the processor is heavily loaded with instructions. Shang et al. (2002) illustrated a power optimisation mechanism for interconnection networks that applies DVS to network links. As with the instruction workload of a microprocessor, link utilisation also varies, depending on the applications' communication patterns.
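A minimal sketch of a utilisation-driven DVS policy for a single link is given next. It is not the policy of Shang et al. (2002): the operating points in `LEVELS`, the thresholds, the smoothing weight, and all function names are assumptions chosen only to show the mechanism of stepping voltage and frequency down when a link is lightly used and up when it is busy.

```python
# Hypothetical (frequency in MHz, supply voltage in V) operating points.
LEVELS = [(250, 0.9), (500, 1.2), (750, 1.8), (1000, 2.5)]


def next_level(level, utilisation, low=0.3, high=0.6):
    """Step one level up when the link is busy, one level down when it is idle."""
    if utilisation > high and level < len(LEVELS) - 1:
        return level + 1
    if utilisation < low and level > 0:
        return level - 1
    return level


def run_policy(utilisation_trace, start_level=None, weight=0.5):
    """Return the level index chosen after each observation window."""
    level = len(LEVELS) - 1 if start_level is None else start_level
    predicted = utilisation_trace[0] if utilisation_trace else 0.0
    chosen = []
    for observed in utilisation_trace:
        # An exponentially weighted average smooths out short bursts.
        predicted = weight * observed + (1 - weight) * predicted
        level = next_level(level, predicted)
        chosen.append(level)
    return chosen


if __name__ == "__main__":
    trace = [0.1, 0.1, 0.1, 0.9, 0.9, 0.9]
    for window, level in enumerate(run_policy(trace)):
        freq, volt = LEVELS[level]
        print(f"window {window}: {freq} MHz at {volt} V")
```

The smoothing step makes the policy react one window late to the burst in the example trace, which illustrates the latency cost that any history-based policy has to trade against its energy savings.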
Although DVS can reduce energy consumption, there are two pitfalls to be avoided. First, reducing supply voltage and frequency also increases leakage current, which might in fact lead to higher power consumption (Jejuriker et al., 2004). To minimise leakage current, processors support various shutdown modes. For example, the Transmeta Crusoe processor supports various sleep modes, such as normal, auto-halt, quick start, deep sleep, and off, for various types of workload (Jayaseelan and Mitra, 2009). The second challenge has to do with one of the primary reasons that made digital circuits more popular than analogue ones, i.e., noise immunity. Digital circuits exhibit non-linear voltage transfer characteristics. However, with a smaller supply voltage, noise immunity becomes extremely difficult to maintain. This is the case, for instance, in deep submicron (DSM) technology (Hedge and Shanbhag, 2000). Therefore, operating at the maximum or minimum operating voltage (and, as a consequence, frequency) is seldom the optimal option. In fact, the optimal voltage and frequency values can be defined as the ones corresponding to the computational speed at which energy consumption per workload is at its minimum (Irani et al., 2003). Last, to achieve the full benefits of the DVS technique, digital circuits must be designed to accommodate a large ripple in supply voltage.
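The notion of an energy-optimal speed can be made concrete with a deliberately simplified model. The sketch below assumes that dynamic power grows with the cube of the normalised speed (voltage scaled together with frequency) and that static power is constant; the model, the constants `k_dyn` and `p_static`, and the function names are assumptions for illustration and are not taken from Irani et al. (2003) or Jejuriker et al. (2004).

```python
def energy_per_unit_work(speed, k_dyn=1.0, p_static=0.2):
    """Energy needed to finish one unit of work at a normalised speed in (0, 1]."""
    dynamic_power = k_dyn * speed ** 3   # assumption: voltage scales with speed
    time = 1.0 / speed                   # slower speed means longer execution
    return (dynamic_power + p_static) * time


def critical_speed(k_dyn=1.0, p_static=0.2):
    """Speed minimising k_dyn * s**2 + p_static / s, capped at full speed."""
    return min(1.0, (p_static / (2.0 * k_dyn)) ** (1.0 / 3.0))


def best_level(levels, k_dyn=1.0, p_static=0.2):
    """Pick the discrete speed level with the lowest energy per unit of work."""
    return min(levels, key=lambda s: energy_per_unit_work(s, k_dyn, p_static))


if __name__ == "__main__":
    levels = [0.2, 0.4, 0.6, 0.8, 1.0]
    for s in levels:
        print(s, round(energy_per_unit_work(s), 3))
    print("continuous optimum:", round(critical_speed(), 3))
    print("best discrete level:", best_level(levels))
```

Under these assumptions, running slower than the critical speed costs more energy because the static term is paid for a longer time, which is why neither the minimum nor the maximum operating point is optimal.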
To minimise the communication links' power consumption, Taylor et al. (2002b) proposed the dynamic link shutdown (DLS) technique. DLS is based on the premise that shifting the load that goes through underutilised links towards a subset of highly utilised links will allow the former to be shut down completely. To benefit from DLS, Taylor et al. (2002a) present an adaptive routing strategy that intelligently uses a subset of links for communication, thereby facilitating dynamic link shutdowns that minimise energy consumption. They also showed that the proposed DLS technique can provide moderate energy savings with minimal degradation in average network latency.
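The premise of DLS can be sketched as a greedy selection: switch off the least-used links as long as every router remains reachable. This is an illustrative heuristic under assumed inputs, not the adaptive routing strategy of the cited work; the threshold value and all names are hypothetical.

```python
from collections import deque


def is_connected(nodes, links):
    """Breadth-first search over undirected links."""
    nodes = list(nodes)
    if not nodes:
        return True
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(nodes)


def select_links_to_shut_down(nodes, utilisation, threshold=0.2):
    """utilisation maps (node_a, node_b) to the fraction of link capacity in use."""
    active = set(utilisation)
    shut_down = []
    # Consider the least-used links first.
    for link in sorted(utilisation, key=utilisation.get):
        if utilisation[link] >= threshold:
            break
        # Only switch a link off if the network stays connected without it.
        if is_connected(nodes, active - {link}):
            active.remove(link)
            shut_down.append(link)
    return shut_down


if __name__ == "__main__":
    # A 2 x 2 mesh with four routers and four links.
    usage = {(0, 1): 0.05, (1, 3): 0.6, (3, 2): 0.1, (2, 0): 0.5}
    print(select_links_to_shut_down(range(4), usage))
```

The sketch checks connectivity only. A real policy would also have to verify that the remaining links have enough capacity for the rerouted traffic, which is where the latency degradation mentioned above comes from.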
Finally, Chen and Peh (2003) proposed a model for managing leakage power in the interconnection network. They demonstrated that router buffers are the prime candidates for leakage power optimisation and explored power-aware buffer policies that managed to save up to 96.6% of the total buffer leakage power.
Table 2 summarises the primary features of the presented techniques for energy optimisation at the NoC component level. Wang et al. (2003) compared several router-based power reduction techniques, namely, the segmented crossbar, cut-through crossbar, and write-through buffer. These techniques differ in the methodology utilised to access the crossbar switch. Although they achieve the goal of reducing power consumption, DVS (Shang et al., 2002), the segmented crossbar (Wang et al., 2003), and the cut-through crossbar (Wang et al., 2003) suffer from a poor signal-to-noise ratio (SNR). On the other hand, DLS (Taylor et al., 2002b) and Express Cube (Wang et al., 2003) reduce power consumption without sacrificing SNR. However, both Express Cube and DLS may impact the overall NoC latency.
Circuit-level energy optimisation
Circuit-level tools deal mostly with encoding and synchronisation issues at the datagram (flit) level. To optimise power consumption at the circuit-level of NoC various schemes were proposed, with the most popular ones being low-swing voltage driver and reduced supply voltage technique. Voltage swing is the maximum peak voltage that the output circuit can produce before it starts clipping (Zhang et al., 2000) . In Dumitras and Marculescu (2003) , a low-swing (low output) voltage driver using dynamic diode-connected driver (DDCD) architecture is proposed. The resulting DDCD circuit had a simple inverter as a receiver and met the desired goals of low complexity single-wire low-swing driver, albeit at the cost of low SNR values. Lee et al. (2000) proposed an input-multiplexed transmitter to reduce clock load. The transmitter multiplexes the signal and then the multiplexed signal is fed to the clock. To reduce jitter, precision timing circuits based on delay locked loop (DLL) were used. DLL is a digital circuit that compares the phase of one of its outputs to the input clock, in order to generate an error signal which is then integrated and fed back as the control of all delay elements. A sensitive capacitive trimmed receiver also was proposed to ensure reliable operation at very low signal levels needed to reduce power consumption. Another way to minimise power consumption is to reduce the input voltage swing by using a multi-phase low-frequency clock instead of a high-frequency one as proposed in Lu (2008) . The most widely accepted technique is to reduce the voltage swing of signals on wires. To understand the relationship between voltage swing and energy consumption in wires, Wei and Horowitz (2000) proposed the following relationship:
$$P_{wire} \propto \alpha \, C \, V_{DD} \, V_{swing} \qquad (3)$$
where $\alpha$ is the switching activity per clock cycle of the signal that is being transmitted, $C$ is the physical capacitance switched during signal transitions, $V_{DD}$ is the supply voltage, and $V_{swing}$ is the voltage swing across the wire. According to Equation (3), a reduced voltage swing results in lower wire power consumption when all the other variables are held constant. Various schemes, such as the static driver with reduced supply (Nakagome et al., 1993), differential interconnect (Burd, 1998), dynamically enabled drivers (Colshan and Jaroun, 1994), and low-swing bus techniques (Yamauchi et al., 1995), are based on Equation (3). Zhang et al. (2000) measured the performance of these schemes using a worst-case noise analysis method. This method uses a driver circuit that converts a full-swing input into a reduced-swing interconnect signal, which is converted back to a full-swing output by the receiver. Results in Zhang et al. (2000) showed that differential interconnect (Burd, 1998) achieves up to fourfold energy savings and exhibits the highest SNR of all the schemes mentioned above.
Srinivasan and Adve (2003) proposed a novel hybrid NoC structure and a dynamic job distribution algorithm that reduce both system area and power consumption. This optimisation technique reduces the packet drop rate in a variety of applications, such as MPEG4 and MP3 decoders, global positioning system (GPS), and orthogonal frequency division multiplexing (OFDM) demodulators (Kim and Hwang, 2008). However, reducing the supply voltage increases circuit delay. To circumvent this problem, Raghunathan et al. (2003) proposed shortening the critical data link paths. However, reducing the supply voltage to the data link paths introduces new problems, such as an increase in the routing cost, and also exposes the network to lower SNR (Bierrgaard and Mahadevan, 2006).
Therefore, the network becomes more susceptible to interwire crosstalk, supply power noise, and radiation-induced defects. Such a noisy interconnect behaves as an unreliable transport medium that introduces errors into the transmitted signals. In Lauter et al. (2005), channel coding was used to restore the network's reliability. This method is based on adding redundancy to the information symbol vector, resulting in a longer coded vector that is nevertheless distinguishable at the output of the channel (Bierrgaard and Mahadevan, 2006). Besides reducing voltage swing and supply voltage, Fu and Ampadu (2008) also proposed an energy-efficient multiwire error control scheme using Hamming product codes. To design a reliable and energy-efficient NoC, Lauter et al. (2005) advocated that packet retransmission schemes are more energy-efficient than error protection schemes for long wires and strong codes. Strong error correcting codes have higher fault tolerance; however, coding schemes tend to be computationally expensive and, as a consequence, energy inefficient. The same work also showed that error control schemes should be implemented at the network level (end-to-end, from source to destination) rather than at the link level (switch-to-switch). Table 3 lists the pros and cons of the aforementioned circuit-level optimisation methods. The methods discussed in Lee et al. (2000) and Lu (2008) require a small system area and have good SNR, but require an extra power supply. On the other hand, the methods presented in Lauter et al. (2005) and Zhang et al. (2000) do not need an extra power supply but suffer from low SNR. Last, Ferretti and Beerel (2001) combine the benefits of the aforementioned works to some extent; however, the representative model is complex.
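To make the idea of adding redundancy to an information vector concrete, the following sketch implements the textbook Hamming(7,4) code, which protects four data bits with three parity bits and corrects any single-bit error. It is a generic example only; it is neither the Hamming product code of Fu and Ampadu (2008) nor the scheme evaluated by Lauter et al. (2005), and the function names are hypothetical.

```python
def hamming74_encode(data):
    """Map four data bits [d1, d2, d3, d4] to a seven-bit codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(word):
    """Return (data bits, error position); position 0 means no error was seen."""
    word = list(word)
    s1 = word[0] ^ word[2] ^ word[4] ^ word[6]
    s2 = word[1] ^ word[2] ^ word[5] ^ word[6]
    s3 = word[3] ^ word[4] ^ word[5] ^ word[6]
    error_position = s1 + 2 * s2 + 4 * s3   # syndrome read as a 1-based index
    if error_position:
        word[error_position - 1] ^= 1       # flip the faulty bit back
    return [word[2], word[4], word[5], word[6]], error_position


if __name__ == "__main__":
    sent = hamming74_encode([1, 0, 1, 1])
    received = list(sent)
    received[4] ^= 1                        # a single wire glitch on bit 5
    print(hamming74_decode(received))
```

The overhead is visible directly: seven wires (or seven cycles) carry four bits of payload, and the encoder and decoder logic itself consumes energy, which is the cost that has to be weighed against retransmission.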
Architecture-level power modelling tools
Power models have been proposed for a variety of network types, such as on-chip FPGA networks (Zhang et al., 2000) and IP (Ye et al., 2000). All of these studies focused on investigating the power aspect of different network topologies. Their goal was to build tools for estimating average power consumption based on transistor count (Patel et al., 1997) and switch width (Zhang et al., 2000). However, the previous studies did not explore architectural-level information. Therefore, worst-case scenarios cannot be efficiently analysed.
Circuit-level power estimation tools, such as DVS (Shang et al., 2002) and DLS (Taylor et al., 2002a, 2002b), provide excellent accuracy but at the expense of long execution time. Improving simulation time has motivated architectural-level power simulators for processors and memories, such as Wattch (Brooks et al., 2000) and Orion (Wang et al., 2002). Wattch (Brooks et al., 2000) is a framework for analysing and optimising microprocessor power dissipation at the architectural level. Orion (Wang et al., 2002) is a power-performance interconnection network simulator that is capable of providing detailed power characteristics. In addition, it also explores power-performance trade-offs at the architectural level.
As an architecture-level approach to achieving energy efficiency in NoC routers, Shang et al. (2006) proposed a dynamic power management (DPM) scheme, termed PowerHerd. PowerHerd is a distributed scheme for dynamically satisfying peak-power constraints in interconnection networks. In PowerHerd, each router dynamically maintains a local power budget, controls its local power dissipation, and exchanges spare power resources with its neighbouring routers to optimise network performance. Simulations demonstrate that PowerHerd can effectively regulate network power consumption and can meet peak-power constraints with a negligible network-performance penalty.
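The exchange of spare power between neighbouring routers can be sketched as a single redistribution round in which the sum of all local budgets, and hence the global peak-power bound, is preserved. The function below is a simplified illustration under assumed inputs (per-router budgets and predicted demands in watts); it is not the actual PowerHerd protocol, and all names are hypothetical.

```python
def share_spare_power(budgets, demands, neighbours):
    """One round of neighbour-to-neighbour budget exchange.

    budgets and demands map a router to watts; neighbours maps a router to
    the list of routers it is directly linked to. The returned budgets sum
    to the same total as the input budgets.
    """
    new_budgets = dict(budgets)
    for router in sorted(budgets):
        spare = new_budgets[router] - demands[router]
        if spare <= 0:
            continue  # nothing to give away
        for other in neighbours[router]:
            deficit = demands[other] - new_budgets[other]
            if deficit <= 0:
                continue  # this neighbour already has enough
            transfer = min(spare, deficit)
            new_budgets[router] -= transfer
            new_budgets[other] += transfer
            spare -= transfer
            if spare <= 0:
                break
    return new_budgets


if __name__ == "__main__":
    budgets = {"A": 1.0, "B": 1.0, "C": 1.0}
    demands = {"A": 0.4, "B": 1.5, "C": 1.0}
    line = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
    print(share_spare_power(budgets, demands, line))
```

Because budget only moves between routers and is never created, the global constraint holds after every round without any central controller, which is the property a distributed scheme of this kind relies on.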
Achieving thermal efficiency in NoC
Chip temperature is an accumulated effect of both processing and communication components. Because networks consume a significant portion of the chip power budget (Benini and Micheli, 2002; Wang et al., 2003), their power consumption has a substantial thermal impact. Understanding the joint thermal behaviour of all on-chip components, both processors and networks, is key to achieving efficient thermal design. The authors of Brooks and Martonoshi (2001a) and Chen and Peh (2003) were among the first to consider thermal issues in microprocessors. Unlike centralised microprocessors, networks are distributed in nature, imposing unique requirements on thermal modelling and management.
Different dynamic thermal management (DTM) schemes have been proposed in Brooks and Martonoshi (2001b) and Huang et al. (2001) for high-performance microprocessors. Shang et al. (2002) proposed a model, termed ThermalHerd. ThermalHerd is a distributed run-time mechanism that dynamically regulates network temperature. Evaluation using NoC traffic traces from the UT-Austin TRIPS CMP (Hu and Marculescu, 2003) demonstrated that ThermalHerd can effectively regulate network temperature and eliminate thermal emergencies. Moreover, ThermalHerd proactively adjusts and balances the network thermal profile to achieve a lower junction temperature. Furthermore, Borkar (1999) proposed reducing the size of the die (introduced in the Preliminaries). However, reducing the die size results in increased supply current and decreased performance. Therefore, Gonzalez et al. (1997) suggested considering the energy-delay product to overcome this issue. The energy-delay product is the product of the packet-switching delay of the router and the energy consumed by the NoC. However, minimising the energy-delay product requires scaling of the threshold voltage, which results in increased leakage current. Gwennap (1996) suggested that the overall chip energy consumption and thermal efficiency may limit not only what can be integrated onto a chip, but also how fast the chip can be clocked. Gowan et al. (1998) proposed the use of a hierarchical clocking scheme to lower the clock power consumption and overall temperature of a chip. However, the scheme proposed in Gowan et al. (1998) reduces chip performance and increases delay (the circuit-switching delay of the chip).
To lower the energy consumption and increase thermal efficiency, Reinman et al. (2002) suggested a small cache for the router memory. However, because of the limited size of the cache, the proposed technique is not scalable. Buyuktosunoglu et al. (2002) presented two techniques that reduce queue power consumption at routers. Evaluation revealed that, for a given NoC, the proposed techniques resulted in lower power consumption and consequently reduced thermal dissipation. Pering et al. (1998) proposed a model that includes a temperature sensor in hardware and an interrupt capability to notify software when a threshold temperature has been reached. The model can also include an instruction cache throttling mechanism that allows the NoC's bandwidth to be reduced when the system reaches a preset temperature (Seng et al., 2000). However, these models were mainly geared towards improving battery life for portable machines. The Transmeta Crusoe processor includes 'LongRun' technology that dynamically adjusts supply voltage and frequency to reduce power consumption and increase thermal efficiency (Transmeta, 2000). Although voltage and frequency tuning are quite effective at reducing power consumption, the delay in triggering these responses is high. Ghiasi et al. (2000) advocated a model based on the advanced configuration and power interface (ACPI) specification, in which hardware and software cooperate with each other to manage the NoC temperature dynamically. ACPI involves actions such as turning input/output (I/O) devices on or off and managing multiple batteries. Rohou and Smith (1999) considered temperature measurement feedback to guide the system in controlling the temperature of the NoC. Sanchez et al. (1997) performed simulations using Wattch (Brooks et al., 2000) to correlate power dissipation with other NoC parameter statistics. The results showed that the model reduces NoC temperature significantly with a slight degradation of performance.
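The sensor-plus-throttling idea described above can be sketched with a lumped thermal model: when the measured temperature crosses a trigger threshold the component is throttled to a lower power level, and it is released again once it has cooled. The first-order RC model, every parameter value, and the function name are assumptions made for illustration; they do not reproduce any of the cited mechanisms.

```python
def simulate_throttling(steps, p_full=20.0, p_throttled=8.0, t_ambient=45.0,
                        r_thermal=2.0, c_thermal=5.0, dt=1.0,
                        t_trigger=80.0, t_release=75.0):
    """Return a list of (temperature in deg C, throttled flag) per time step."""
    temperature = t_ambient
    throttled = False
    history = []
    for _ in range(steps):
        # Sensor check with hysteresis to avoid rapid toggling.
        if temperature >= t_trigger:
            throttled = True
        elif temperature <= t_release:
            throttled = False
        power = p_throttled if throttled else p_full
        # Lumped RC model: heat generated minus heat conducted to ambient.
        heat_flow = power - (temperature - t_ambient) / r_thermal
        temperature += dt / c_thermal * heat_flow
        history.append((temperature, throttled))
    return history


if __name__ == "__main__":
    managed = simulate_throttling(200)
    unmanaged = simulate_throttling(200, t_trigger=1000.0)
    print("peak with throttling:   ", round(max(t for t, _ in managed), 2))
    print("peak without throttling:", round(max(t for t, _ in unmanaged), 2))
```

With these assumed parameters the unthrottled steady state is 85 °C (ambient plus power times thermal resistance), so the throttle keeps the component a few degrees cooler at the price of running at reduced power for part of the time.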
All the noise sources, such as coupling capacitances between two neighbouring wires, collectively induce a noise voltage on the channel that follows a Gaussian distribution (Sylvester and Keutzer, 2000). This induced voltage results in increased temperature of the NoC (Wingard, 2001). Simulation results in Worm et al. (2002) showed that tangible savings in energy can be attained while achieving more robustness to large variations in actual workload and noise.
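Under the Gaussian noise assumption, the link between signal swing and reliability can be quantified with the standard Q-function. The sketch below assumes zero-mean noise with standard deviation `sigma_noise` and a receiver threshold placed midway in the swing; these modelling choices and the function names are assumptions for illustration rather than details taken from the cited works.

```python
import math


def q_function(x):
    """Tail probability of the standard normal distribution, P(Z > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def bit_error_probability(v_swing, sigma_noise):
    """Probability that noise pushes a bit across a mid-swing threshold."""
    return q_function(v_swing / (2.0 * sigma_noise))


if __name__ == "__main__":
    sigma = 0.1  # hypothetical noise standard deviation in volts
    for swing in (1.0, 0.8, 0.6, 0.4):
        print(f"swing {swing:.1f} V -> error probability "
              f"{bit_error_probability(swing, sigma):.3e}")
```

Lowering the swing for a fixed noise level raises the error probability steeply, which is the quantitative form of the energy versus reliability trade-off discussed for low-swing signalling.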
Several routing algorithms have been developed to achieve thermal efficiency in NoC. Schurgers and Srivastava (2001) proposed an algorithm that minimises communication energy consumption and also balances out the spatial distribution of energy consumption in the network. Hu and Marculescu (2003) suggested an algorithm that avoids local hot spots and thereby simplifies SoC thermal management. Brooks and Martonoshi (2001a, 2001b) proposed a speculation-based algorithm that uses dynamic thermal management to reduce the cooling system costs of NoC.
Conclusions
Energy-efficient NoC design is necessary to optimise power consumption in SoCs and CMPs. Energy efficiency can be achieved through careful optimisation at different levels, namely the component, circuit, and architectural levels. This paper reviewed different techniques and tools to achieve energy efficiency at each of these levels. Adopting a combination of techniques, each targeting a different level, is presumably the most viable approach to designing energy-efficient interconnection networks. That said, one must also tackle thermal issues for the NoC to be truly implementable.
