Abstract-Power supply noise (PSN) is a growing concern in modern multiprocessor system-on-chips (MPSoCs). The advent of new architectures, such as the network-on-chip (NoC), the standard for on-chip communication in MPSoCs, has given rise to new challenges in maintaining reliable and energy-efficient operation. The growing NoC power footprint, increase in the transistor current, and high switching speed of the logic devices exacerbate the peak PSN in the NoC power delivery network (PDN). Hence, preserving power supply integrity in the NoC PDN is critical. In this paper, we propose IcoNoClast, a collection of a novel flow-control protocol (PAF) and an adaptive routing algorithm (PSN-aware routing), to mitigate the PSN in NoCs. Our best scheme achieves ∼15% and ∼12% improvements in the regional peak PSN and energy efficiency across a range of PARSEC benchmarks, with a 4.1% performance overhead and marginal area and power footprints.
I. INTRODUCTION

SUPPLY voltage integrity is a growing concern in modern multiprocessor system-on-chips (MPSoCs). The varying current demand due to the simultaneous switching of the logic devices creates a noise in the power delivery network (PDN), resulting in a drop in the effective supply voltage. This power supply noise (PSN) has a detrimental effect on the performance, reliability, and energy efficiency of various system components. Unfortunately, as we scale down on technology nodes, this problem is poised to grow significantly due to the decreasing feature size, high device density, and interaction among many connected components. As current and upcoming MPSoCs are embracing Network-on-Chips (NoCs) as their de facto standard for on-chip communication, the PSN will negatively impact fault-free communication on them.
Modern day NoCs can consume a significant fraction of the total chip power (∼36% in the 80-tile TeraFLOPS at 65 nm [13]). Moreover, researchers are dedicating an independent power-grid network for the NoC to enable efficient power management [26]. Consequently, conventional techniques [12], [22] to tackle the PSN in cores have little impact on the noise in an NoC PDN. Collectively, these trends make it imperative to control the PSN in an NoC PDN, to ensure fault-free and energy-efficient communication.
Manuscript received July 20, 2016; revised November 13, 2016 and January 9, 2017; accepted February 8, 2017. Date of publication March 20, 2017; date of current version June 23, 2017. This work was supported by the National Science Foundation under Grant CAREER-1253024, Grant CCF-1318826, Grant CNS-1421022, and Grant CNS-1421068. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.
The authors are with the USU BRIDGE LAB, Electrical and Computer Engineering, Utah State University, Logan, UT 84322 USA (e-mail: prabalbasu1989@yahoo.com; jsrajesh34@gmail.com; koushik.chakraborty@usu.edu; sanghamitra.roy@usu.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2017.2673808
In this paper, we uncover an intriguing circuit-architectural insight by establishing a strong correlation between traffic patterns and the peak PSN in the NoC. Using a rigorous crosslayer analysis, we demonstrate that simultaneous and sudden rise in traffic loads within proximal regions in an NoC can lead to a significant voltage noise. Subsequently, we show that existing NoC flow-control protocols and congestion-aware routing algorithms are unable to mitigate the PSN problem effectively. For example, a representative congestion-aware routing scheme (DBAR [19] ) shows negligible 0.1%-1% average PSN improvements in real workloads, over a deterministic dimension order (DOR) routing.
Guided by this cross-layer insight, we explore a combination of static design time and low-complexity runtime approaches to efficiently mitigate voltage noise in an NoC PDN. Our flow-control protocol intermittently allows high and low flit receptions within a single NoC component, while systematically applying a hierarchical approach for scalability. To further improve voltage noise characteristics, we explore a flow-control cognizant adaptive routing that proactively disperses the flit routes. Collectively, our proposed mechanisms incur marginal circuit level implementation overheads, while promoting energy-efficient communication over the NoC by dampening voltage noise on its PDN. To the best of our knowledge, this paper is the first of its kind to investigate voltage noise-aware flow-control and routing algorithms for an NoC. Our contributions in this paper are discussed next.
A. Contributions
1) We thoroughly study the trends in interconnect circuit parameters and their impact on the peak PSN in NoCs (Sections III-D1 and III-D2).
2) We analyze the correlation between the peak PSN and the NoC router activity (Section III-D3).
3) We show that congestion-aware routing schemes cannot effectively mitigate peak PSN (Section VI-B).
4) We propose a couple of runtime solutions, collectively referred to as IcoNoClast, to mitigate the PSN in NoCs. IcoNoClast comprises a novel PSN-aware flow-control (PAF) protocol and an adaptive PSN-aware routing (PAR) algorithm (Section IV).
5) Our best scheme can reduce the regional peak PSN by ∼15% and improve the energy efficiency by ∼12% (across a range of PARSEC benchmarks) compared with a representative routing scheme (DBAR), with a nominal 4.1% average performance overhead and marginal area/power overheads (Section VI).
II. RELATED WORK
Several previous research efforts study the significance of PSN on the chip power performance [20] , [28] . The combined effect of the aggravating transient current and accelerated on-chip component density has raised severe reliability concerns [25] . Work related to our effort of reducing peak voltage noise can be categorized in three domains: 1) characterizing and mitigating PSN in microprocessors; 2) understanding voltage noise in NoC; and 3) flow-control and routing techniques in the NoC.
A. Characterizing and Mitigating Peak Noise in Microprocessors
Supply noise characterization in the microprocessor is a well-researched domain with works dating back two decades. Earlier studies in this domain considered only the switching activity within a core for uniprocessors or processors with small core count [12]. With the growth of manycore processors, recent works have also comprehensively studied the global (chip-wide) effect of synchronization and intercore resonance on on-chip supply voltage noise [14], [22]. Miller et al. [22] observe and eliminate the voltage emergencies caused by barrier synchronization, which leads to destructive core-to-core interference. Xing et al. proposed Orchestrator to reduce voltage droops caused by resonance when multiple cores exhibit the same power activity [14]. However, none of these works explore the impact of the interconnect fabric (NoC) on the PSN. With the growing power footprint of the NoC in MPSoCs, reduction of peak supply noise in NoCs is critical to reliable and energy-efficient computing.
B. Understanding Voltage Noise in NoC
Penolazzi and Jantsch's high level power model for the Nostrum NoC was one of the earliest efforts to develop an empirical function to accurately estimate power fluctuations for an NoC load [24]. Recently, Dahir et al. [8] developed a dedicated tool for the NoC PSN analysis based on their detailed workload model that captures both on-chip communication patterns and power-grid dynamics. They furthered their study and proposed an activity density-based application mapping technique to minimize supply noise in NoCs [7]. Our work in this paper explores flow-control and routing techniques in the NoC to lower the peak voltage noise due to NoC activity.
C. Flow-Control and Routing Techniques
Traditionally, flow-control techniques have been developed to improve the communication efficiency and fault tolerance in NoCs. Michelogiannakis et al. [21] propose elastic buffers to improve the peak throughput and average latency (communication efficiency) in an interconnect network.
Kang et al. [16] explore fault-tolerant flow controls in their work, while Jafri et al. proposed an adaptive flow control, which can dynamically adapt to varying loads to maximize performance and minimize the energy consumption. Furthermore, routing algorithms have been developed to tackle a wide range of NoC challenges. Abundant congestion-aware schemes have been developed to improve communication efficiency under high loads [9], [17], [18]. Similarly, a plethora of fault-tolerant routing techniques have also been proposed [10], [23], [30]. Researchers have even conceived routing algorithms to exploit process variation [27] and tackle aging in the NoC [2]. However, no previous work explores the use of flow control and routing in peak noise mitigation in the NoC.
III. MOTIVATION
In this section, we discuss the impact of the PSN in NoCs. We present the components of supply noise (Section III-A), discuss the supply noise estimation methodology (Section III-B), demonstrate the inefficacy of the core noise mitigation techniques in reducing the NoC PSN (Section III-C), and present the PSN trends (Section III-D) to motivate the need for a PAF protocol (Section III-F).
A. Power Supply Noise
The sources of voltage noise in a PDN are: 1) resistive drop (IR) and 2) inductive drop (L(Δi/Δt)). Voltage drop across the resistances of the power delivery wires causes IR drop, which is proportional to the current (I) in the circuit. Inductive drop, on the other hand, is caused by the wire inductance (L) of the power grid and is proportional to the rate of change of current through the inductance.
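The two components can be illustrated with a minimal Python sketch that evaluates the IR and L(Δi/Δt) terms on a sampled current trace. This is a lumped, single-branch simplification for illustration only, not the distributed RLC grid model used in this paper; the function name, the sampling step dt, and the numeric values are hypothetical.

```python
def supply_noise(currents, resistance, inductance, dt):
    """Per-sample voltage drop: resistive (I*R) plus inductive (L * delta_i / dt).

    currents: sampled branch current, one value per time step (assumed input).
    The first sample has no predecessor, so its inductive term is taken as zero.
    """
    noise = []
    prev = currents[0]
    for i in currents:
        ir_drop = i * resistance                # proportional to the current
        l_drop = inductance * (i - prev) / dt   # proportional to the rate of change
        noise.append(ir_drop + l_drop)
        prev = i
    return noise


# Illustrative, unitless values: a steady current followed by a sudden step.
print(supply_noise([1.0, 1.0, 3.0], resistance=0.5, inductance=0.5, dt=1.0))
```

In this toy trace the steady samples contribute only the resistive term, while the step from 1.0 to 3.0 adds an inductive term on top of the larger IR drop, which is the kind of intermittent peak discussed in Section III-D2.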
B. PSN Estimation Methodology
Accurate estimation of the peak PSN presents methodological challenges. These challenges stem from cycle-accurate tracking of the pipeline activities of an NoC router, and evaluating its effect on the PDN. We briefly present our rigorous cross-layer methodology to tackle these challenges (details in Section V). First, to model emerging trends, we consider a dedicated standalone PDN for the NoC [26] , to discount the effect of the processing cores' activities on the NoC PDN. Second, we collect energy expended by different pipeline stages of the router using the DSENT tool [29] . Third, we estimate the interconnect RLC parameters for various technology nodes from the ITRS report [31] . Fourth, we convert a recently proposed MATLAB-based PSN tool [8] to C++ and integrate it with the Booksim2.0 architectural simulator [15] . Using router activity traces generated by running real workloads, interconnect circuit parameters, and router pipeline energies, the integrated PSN tool derives cycle-accurate supply noise statistics of the NoC.
C. Limitation of Noise Resilient Microprocessor Schemes in Tackling the NoC PSN
Noise resilience schemes in microprocessors are ineffective in tackling the NoC PSN problem due to several key factors. First, researchers are dedicating an independent power-grid network to NoCs for flexible power management [26]. Consequently, conventional techniques to tackle the PSN in cores will have little impact on the noise in an NoC PDN. Second, specific control flow and microarchitectural event sequences may cause voltage emergencies in the microprocessor pipelines (e.g., a high power instruction followed by a low power one [5]). As a result, pipeline techniques are designed to alter such event sequences dynamically. Pipeline droop mitigation techniques can only alleviate voltage noise in the NoC if there is a strong correlation between voltage droop events on the NoC and similar events at the core pipeline. Table I presents our experimental data to demonstrate that, in reality, there is negligible correlation between large voltage droops at the core pipeline and large voltage droops at the NoC. For example, assuming a 10% droop as a threshold for high droop, we find a low 0.02 correlation for the benchmark Barnes. These data were collected using a 64-core Intel Xeon (X5550) processor simulated using the Sniper simulator [6], running representative benchmarks. We carefully analyze occurrence timestamps of processor and NoC droops to estimate their correlation. This intriguing result stems from fundamental differences in the activity characteristics within a processor pipeline and the NoC. Unlike a processor pipeline, the NoC activity is driven by traffic load, rather than individual power footprints of interleaved instructions. Consequently, there is a critical need to explore techniques to mitigate voltage droop in the NoC.
D. Results of PSN Trends
In this section, we investigate the impact of technology scaling on the PSN using our cross-layer framework.
1) Impact of Scaling on Interconnect Parameters: Fig. 1(a) shows a gradual reduction in the global interconnect pitch (M9 global wire [4]) with technology scaling (ITRS 2013 [1]). As the interconnect width decreases, the resistance per unit length of the metal layer increases rapidly. Fig. 1(b) shows the trend of RLC parameters at smaller technology nodes. The NoC is expected to grow in its area footprint to enable communication among an increasing pool of on-chip components. As a collective impact of these trends, communication through an NoC will entail an increasing current for charging/discharging a potentially growing pool of charge reservoirs (device and wire capacitance), while incurring a greater IR drop due to the increased interconnect resistance.
2) Impact of Scaling on the Peak Supply Noise: Simultaneous switching of transistor devices causes a large variation in current, leading to a high inductive noise due to the wire inductance. Such a high inductive noise is responsible for the intermittent peaks in the cyclewise noise profile of a system. As the power supply voltage scales down, the peak supply noise (as a percentage of the supply voltage) increases at lower technology nodes. Fig. 1(c) shows that, in an 8 × 8 NoC running uniform-random synthetic traffic, the peak noise increases from 40% of the supply voltage at the 32-nm technology node to about 80% of the supply voltage at the 14-nm technology node, if the power distribution strategy remains unchanged.
3) Peak Supply Noise Versus Router Activity: Fig. 2 shows the traffic load change (measured as a difference in the number of incoming flits in two consecutive cycles) on the most exercised router (DBAR) of the NoC, for a few representative traffic patterns. The y-axis denotes the percentage of total cycles in which the number of flits served differs by x (x can be 3, 4, or 5) from the previous cycle. For example, for the tornado traffic pattern, the most exercised router spends nearly 12% of the execution time with a high change in the number of flits served in consecutive cycles, denoting sudden bursts of data. Fig. 2 also implies that for a considerable fraction of the total cycles, the router activity changes dramatically, resulting in a large PSN. In Fig. 3, we observe the peak and average voltage noise characteristics for various synthetic traffic patterns, by employing a representative congestion-aware routing algorithm (DBAR). The peak voltage noise data indicate a correlation of the PSN with the bursty nature of synthetic traffic. In this case, the tornado traffic pattern suffers from a high PSN due to the high percentage of load change per cycle, compared with other traffic patterns. The average PSN across traffic patterns is fairly consistent, as the router undergoes only a nominal change in activity for most of the total execution time.
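The load-change metric plotted in Fig. 2 can be sketched in Python as follows. This is a minimal illustration assuming a per-cycle trace of incoming flit counts for one router and assuming that the change is taken as an absolute difference; the function name and the sample trace are hypothetical and are not taken from the simulator.

```python
from collections import Counter


def load_change_profile(flits_per_cycle, levels=(3, 4, 5)):
    """Percentage of consecutive-cycle pairs whose incoming flit count changes by x.

    flits_per_cycle: incoming flits at one router, one entry per cycle (assumed input).
    levels: the values of x to report (3, 4, or 5 in Fig. 2).
    """
    diffs = [abs(b - a) for a, b in zip(flits_per_cycle, flits_per_cycle[1:])]
    if not diffs:
        return {x: 0.0 for x in levels}
    counts = Counter(diffs)
    return {x: 100.0 * counts[x] / len(diffs) for x in levels}


# Hypothetical bursty trace: the router alternates between idle and heavily loaded.
print(load_change_profile([0, 4, 1, 1, 4, 0]))
```

A trace with many large entries in this profile corresponds to the sudden bursts that the text associates with a high peak PSN.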
E. Inefficacy of Existing Routing Schemes in Mitigating PSN
Many congestion-aware routing schemes choose either XY or YX routing paths based on the congestion in the network. Fig. 4 illustrates a concrete example of why such a strategy is ineffective for PSN mitigation. In this example, nodes 0, 1, 2, and 6 inject flits to the destination node 5, in a network employing wormhole flow control. Cases 1 and 2 use XY and YX routes, respectively. We notice that, although the routing paths are different, the per cycle maximum activity (e.g., flit reception) in region A remains unchanged if all flits are delivered in the same cycle. The possibility of multiple incoming flits in the same cycle is presented in Section III-D3. Hence, both cases incur a similar PSN in region A. Since existing routing schemes do not consider the temporal nature of flit delivery to a router, they are inefficient in mitigating the PSN. We present concrete quantitative evidence of this negligible impact of congestion-aware routing schemes on the PSN in Section VI-B. Our proposed schemes monitor and adaptively alter the temporal nature of flit delivery to effectively reduce the PSN in NoCs.
F. Impact of Flow-Control on the NoC PSN
Existing flow-control protocols (e.g., wormhole) are also unable to mitigate the PSN. Simultaneous activity in proximal routers causes a high switching current to be drawn from the NoC PDN, leading to a large drop in the supply voltage. State-of-the-art flow-control policies govern the flit propagation in the NoC using a credit system, which keeps track of the buffer availability in downstream routers. In a 2-D mesh NoC topology, a router can potentially receive a maximum of four flits (not considering the injection/ejection port) in a cycle. As the availability of the credits of a router is visible to the adjacent nodes, proximal routers can potentially receive a large number of flits in the same cycle causing a local spike in the PSN profile of the NoC. As the range of possible credit utilization goes up in high radix topologies (e.g., butterfly), the peak PSN is substantially exacerbated.
Inspired by these circuit-architectural insights, we explore novel flow-control and routing algorithms that work in harmony to mitigate noise in an NoC PDN.
IV. PSN-AWARE RUNTIME ADAPTATIONS
In this section, we present IcoNoClast, a collection of a novel PAF protocol and an adaptive PAR algorithm to mitigate the PSN in an NoC. IcoNoClast aims to dampen high simultaneous current loads in proximal regions, by dynamically altering their respective flit acceptance potentials and proactively dispersing the flit routes in the network. We outline the design challenges (Section IV-A), before presenting the PAF and PAR techniques in detail (Sections IV-B and IV-C). We summarize the advantages of our techniques (Section IV-D) and conclude with the implementation details (Section IV-E).
A. Design Challenges
1) Performance Impact: Runtime adaptations to mitigate PSN should have a low performance overhead.
2) Starvation Avoidance: Throttling the flit acceptance potential of a router can create buffer backpressure in the upstream routers. Under a high flit injection rate, the backpressure can grow so large that it may lead to a starvation. It is important to guarantee freedom from starvation in IcoNoClast.
3) Scalability: A PSN improvement technique should scale with the size of the communication fabric. It is imperative to minimize its implementation overhead so as to sustain its efficacy in future exascale computing.
Fig. 5. [caption start not recovered] (4) is advertised by the least congested router p in cycle x. In cycle y, the router q is the least congested and advertises the highest FLAP (3). In both cycle x and cycle y, the aggregated FLAP of the routers corresponds to the respective MCL-based regional FLAPs. The FLAP in Wormhole flow-control is congestion-aware but agnostic of the regional load.
B. Design of PAF
The design of the PAF involves a hierarchical approach to dictate the maximum current load (MCL) across the NoC, while ensuring a minimal performance impact. We give an overview of the PAF in Section IV-B1, an illustrative example in Section IV-B2, and optimizations of the PAF in Section IV-B3.
1) Hierarchical MCL Allocation:
High concurrent switching of proximal regions is avoided by carefully adjusting the MCL allocated to each region. To realize the MCL allocation principles at different granularities, we define a metric, the flit acceptance potential (FLAP). For a given input channel of a router, the FLAP is set to 1 when it can receive an incoming flit (otherwise it is set to 0). For a router, the FLAP is the aggregate FLAP of its input channels. Similarly, the FLAP of a particular region is the aggregate FLAP of the routers in that region.
At any given time, the FLAP of a router employing wormhole flow control in a 2-D mesh with four input channels is 4, when all of its input channels can receive at least one flit. The PAF allocates variable MCL to each region by dynamically throttling their FLAPs, irrespective of the space availability in the input channel's buffers.
MCL allocation is a hierarchical process that can be applied at multiple spatial granularities. For example, a large region consists of many smaller subregions. The allocated MCL for the large region is distributed among the subregions, ensuring that proximal subregions are not simultaneously allocated with high MCLs. At the lowest granularity, each router's FLAP is managed in a manner that is consistent with the MCL allocation of the entire subregion.
2) Illustrative Example: Fig. 5 depicts the PAF technique using a 4 × 4 2D-mesh NoC, divided into four regions (A, B, C, and D), each comprising four routers. In cycle x, the PAF allocates a high MCL to region A and low MCLs to the proximal regions (B, C, and D). To ensure a fair provisioning, the PAF redistributes the MCL allocation in cycle y, so that region B is allocated with a high MCL, while its proximal regions are allocated with low MCLs.
The allocated MCL translates to a regional FLAP, which is distributed among the routers of a region. For example, in cycle x, a regional FLAP of 13 is distributed among the routers of region A. Router p advertises an FLAP of 4, while the other routers (q, r, and s) advertise an FLAP of 3 each.
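The distribution of a regional FLAP among routers can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the hardware policy of the PAF: it assumes that every router except the least congested one advertises a common low FLAP and that the least congested router receives the remainder, which reproduces the FLAP values quoted for Fig. 5. The function name, the congestion values, and the low_flap parameter are hypothetical.

```python
MAX_ROUTER_FLAP = 4  # four input channels in a 2-D mesh, ignoring injection/ejection


def distribute_regional_flap(regional_flap, congestion, low_flap):
    """Split a regional FLAP among the routers of one region.

    congestion: dict mapping router name -> congestion estimate (assumed input).
    low_flap: FLAP advertised by every router except the least congested one.
    Returns a dict mapping router name -> advertised FLAP.
    """
    least_congested = min(congestion, key=congestion.get)
    high_flap = regional_flap - low_flap * (len(congestion) - 1)
    if not 0 <= high_flap <= MAX_ROUTER_FLAP:
        raise ValueError("regional FLAP cannot be met with this low FLAP")
    return {
        router: high_flap if router == least_congested else low_flap
        for router in congestion
    }


# Cycle x in Fig. 5: regional FLAP of 13, router p is the least congested.
print(distribute_regional_flap(13, {"p": 1, "q": 5, "r": 6, "s": 7}, low_flap=3))
# Cycle y in Fig. 5: regional FLAP of 6, router q is the least congested.
print(distribute_regional_flap(6, {"p": 5, "q": 1, "r": 6, "s": 7}, low_flap=1))
```

The invariant checked here is that the per-router FLAPs always sum to the regional FLAP derived from the allocated MCL, so the region never accepts more flits in a cycle than its allocation permits.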
3) Optimizations of PAF: The generic PAF technique needs multiple optimizations to efficiently tackle the design challenges (Section IV-A).
a) Minimizing performance impact: We explore a few complementary approaches to retain a high performance in the PAF.
• Judicious FLAP management: To avoid a large flit delay in a given region, the PAF allows intermittent high and low FLAPs in a router. For example, in contrast to cycle x, router q advertises more FLAP (3) in cycle y compared with the other routers.
• Topological awareness: The PAF can be adapted based on the network topology and expected traffic pattern. For example, central routers in a mesh typically experience a high resource demand. We can meet this demand by allocating greater FLAPs to the central routers.
• Congestion awareness: We explore two broad classifications of the PAF.
b) Congestion agnostic PAF: This variant of PAF statically allocates high and low FLAPs to the regional routers based on a round-robin fairness scheme. The FLAP allocation policy is not influenced by the network buffer occupancy. We refer to this variant as PAF-Static.
c) Congestion-aware PAF: This variant of the PAF manages the FLAP allocation based on the relative congestion of the network buffers. We consider congestion awareness at the following two granularities.
• Channel granularity: The FLAP of the least congested channel of a router is set to 1, so that it can always receive an incoming flit. The other channels' FLAPs are dictated by the aggregate FLAP of the router. We call this variant of the PAF PAF-CG.
• Router granularity: The least congested router of a region is allocated a high FLAP. However, the other routers are allocated low FLAPs to avoid high simultaneous switching. The aggregate FLAP of the routers is consistent with the allocated MCL of the region. For example, in cycle y in Fig. 5, the least congested router q advertises more FLAP (3) compared with the other routers, each of which advertises an FLAP of 1. The aggregate FLAP (6) of the routers matches the allocated MCL-based regional FLAP. This variant of the PAF is referred to as PAF-RG.
d) Avoiding starvation: Repeated blocking of the flits at the same input channel of a router in successive cycles can cause starvation. To avoid starvation, the PAF adopts a round-robin fairness scheme to restrict flit reception across all the input channels of a router. Moreover, the PAF uses deterministically routed escape VCs, allowing all the possible turns without a deadlock situation.
e) Scalability: The PAF is a hierarchical technique that uses local network information at the smallest regional granularity to ascertain the FLAPs of the routers. As the size of the smallest region remains the same even for a larger NoC, the PAF can scale efficiently with the network size (Section VI-G).
C. PAF Aware Adaptive Routing Algorithm
Dynamically throttling the FLAP of a router may cause an intermittent upsurge in the local PSN due to an increased resource contention. We propose the PAR, a PAF cognizant routing algorithm, to circumvent this upsurge, by steering the flit toward an unthrottled downstream path. Fig. 6 depicts the conceptual overview of the PAR. The PAR primarily makes the routing decision based on the relative regional congestion information, aggregated solely along the minimal paths. If the chosen output channel has a throttled FLAP, the PAR reroutes the flit to an orthogonal output channel, strictly maintaining the minimal path constraint. This strategy reduces the local current spike and the PSN by relieving router contention, but may occasionally increase the network latency by routing some flits toward more congested downstream paths. In a scenario where both the minimal paths are blocked due to throttled FLAPs, the flit adheres to the initial channel assignment and waits in the upstream router for another cycle. The PAR incurs no additional circuit overhead as it utilizes the same information required for the PAF.
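The PAR decision described above can be sketched in Python as follows. This is a minimal behavioral sketch, assuming that the candidate minimal-path output ports, their aggregated regional congestion, and the set of ports with throttled downstream FLAPs are already known at the router; the function name, port labels, and data layout are hypothetical and do not describe the router RTL.

```python
def par_select_output(minimal_ports, congestion, throttled):
    """Pick an output port for a flit under PAF-aware routing.

    minimal_ports: one or two output ports that lie on minimal paths.
    congestion: dict mapping port -> aggregated regional congestion (lower is better).
    throttled: set of ports whose downstream FLAP is currently throttled.
    Returns (port, stall); stall is True when the flit must wait another cycle.
    """
    # Initial assignment: least congested port among the minimal paths.
    preferred = min(minimal_ports, key=lambda port: congestion[port])
    if preferred not in throttled:
        return preferred, False
    # Preferred port is throttled: try the orthogonal minimal-path port.
    for port in minimal_ports:
        if port != preferred and port not in throttled:
            return port, False
    # Both minimal paths are blocked: keep the initial assignment and wait.
    return preferred, True


congestion = {"E": 2, "N": 5}
print(par_select_output(["E", "N"], congestion, throttled=set()))       # least congested port
print(par_select_output(["E", "N"], congestion, throttled={"E"}))       # rerouted orthogonally
print(par_select_output(["E", "N"], congestion, throttled={"E", "N"}))  # stalls for a cycle
```

Because only ports on minimal paths are ever considered, the sketch preserves the minimal path constraint, at the cost of sometimes choosing the more congested of the two directions.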
D. Advantages of PAF and PAR
In this section, we summarize the benefits of the proposed flow-control protocol (PAF) and adaptive routing algorithm (PAR). The PAF obviates the sudden congregation of flits in proximal regions of an NoC, by advertising high and low FLAPs in alternate epochs. As a result, events of large current surge are avoided, leading to a smooth noise profile of the NoC. However, the PAF can inadvertently add to the local PSN in the upstream paths, due to a limited reception of flits. To further alleviate such local noise, the PAR strives to route some flits toward the readily available downstream paths. A reduction in the peak noise essentially translates to a lower voltage guardband for fault-free operation of the NoC. As the routers will consume less energy under a lower voltage guardband, our techniques promote energy-efficient system design by moderating the maximum noise in the NoC.
E. Implementation
Fig. 7 illustrates the implementation of an IcoNoClast router, which involves the FLAP management and congestion management units.
1) FLAP management: Reception of flits in a router is managed by sending a credit_valid signal to the upstream router. We use the credit_valid signal, along with a statically managed, low overhead, round-robin logic, to ascertain the FLAP of a router. Additionally, we feed the credit_valid signal with one of the output bits of a simple one-hot encoded ring counter, to sporadically restrict an incoming flit.
2) Congestion management: We create a low-bandwidth monitoring network to propagate the congestion information among the adjacent routers in a region. The monitoring network involves an aggregation and a propagation module at the router's low overhead port preselection logic [11]. The aggregation module combines the weighted congestion values from the downstream routers and the propagation module transmits the congestion information to the adjacent routers of a region.
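The ring-counter gating of credit_valid can be modeled behaviorally in Python as follows. This is a sketch under assumptions: the counter width, the tapped bit, and the polarity (the credit is withheld in the cycle where the tapped bit is hot) are illustrative choices that the text does not specify, and the class and function names are hypothetical.

```python
class RingCounter:
    """One-hot ring counter: a single hot bit rotates by one position per cycle."""

    def __init__(self, width):
        self.bits = [1] + [0] * (width - 1)

    def step(self):
        # Rotate right by one position.
        self.bits = [self.bits[-1]] + self.bits[:-1]


def gated_credit_valid(buffer_has_space, counter, tap):
    """credit_valid seen upstream: asserted only if the buffer has space
    and the tapped counter bit is not hot (assumed polarity)."""
    return buffer_has_space and counter.bits[tap] == 0


def credit_trace(width, tap, cycles):
    """credit_valid over time for an input channel that always has buffer space."""
    counter = RingCounter(width)
    trace = []
    for _ in range(cycles):
        trace.append(gated_credit_valid(True, counter, tap))
        counter.step()
    return trace


# With a 4-bit counter, the channel is blocked once every four cycles.
print(credit_trace(width=4, tap=0, cycles=8))
```

In this model the counter width sets how often a flit is restricted, and tapping different bits on different input channels would stagger the blocked cycles across channels.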
V. METHODOLOGY
Fig. 8 represents the hierarchy of our cross-layer methodology. Our evaluation can be classified into two stages. Section V-A describes the PSN estimation technique, and Section V-B discusses various performance metrics to evaluate the efficacy of the comparative schemes (Section VI-A).
A. Power Supply Noise Estimation
1) PDN Simulation: Challenges and Solution: SPICE-based simulation of the PDNs of the modern VLSI circuits is computationally prohibitive, due to a large number of grid nodes and circuit elements. On the other hand, growing complexity of the PDNs (different topologies and ON/OFF-chip capacitors, for example) makes it harder to accurately model them. Therefore, establishing precise correlation between architectural events and voltage noise is highly contingent on fast and accurate simulation of the PDN.
Many researchers in the past decade have proposed several models to curtail the simulation time while providing a reasonable accuracy for the node voltage. Using a fast and direct model proposed by Zheng et al. [32], Dahir et al. [8] recently devised a MATLAB-based tool to estimate the PSN for the NoC power grid. The power grid in this tool is modeled as a distributed RLC network. Constant voltage sources and switching capacitors are used to excite the network, and to model the on-chip activity, respectively. Zheng et al. have demonstrated that the model speeds up the simulations by several orders of magnitude compared with SPICE, with a maximum error of 5% in PSN calculation [32].
2) Cycle-Accurate Noise Calculation: We integrate Dahir's PSN tool with Booksim2.0, to tightly couple the stages of architectural evaluation and PSN estimation. The consolidation of the two stages further boosts the simulation speed and helps diagnose the impact of architectural events on the PSN. We generate cyclewise statistics of the voltage at each power-grid node for real workloads. We collect the following data for accurate estimation of the PSN.
• Interconnect RLC parameters: We compute the R, L, and C values of the grid interconnect for the 32-nm technology node using the ASU PTM interconnect model [31] . We obtain the aspect ratio and pitch of the grid interconnect, based on the ITRS interconnect predictions [1] .
• Router pipeline energies: We use the recently proposed DSENT [29] to evaluate the energy of the router pipeline stages, using the router microarchitectural parameters for the 32-nm technology node. DSENT models an NoC router as a combination of input/output buffers, virtual channels, switch allocators, and two-stage crossbars.
• Traffic and router activity dump: We instrument Booksim2.0 in order to dump the router activities (e.g., flit reception, VC allocation, and so on) at each cycle, by running PARSEC benchmarks on an 8 × 8 regular 2-D mesh NoC. To mimic the traffic generated by multiple coscheduled applications in an MPSoC, we superimpose heavy random traffic (with a flit injection rate of 0.15) on top of the original application-induced traffic of the PARSEC benchmarks.
B. Performance Evaluation
Table II details the simulation parameters used in the performance evaluation based on the following metrics.
1) Regional Peak PSN:
When an application runs, each router in the network endures a varied peak PSN [8] . So, running the entire NoC with a single operating voltage to ensure 100% fault coverage is energy inefficient. On the other hand, providing each router with a separate operating voltage increases the complexity and footprint of the voltage regulators. So, we divide an 8 × 8 mesh NoC into 16 regions, each containing four routers, and assign minimum operating voltage at the regional granularity, to ensure fault-free communication. We evaluate the regional peak PSN of the comparative schemes in Section VI-C.
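The regional partitioning and the regional peak PSN metric can be sketched in Python as follows. This is a minimal sketch assuming row-major router numbering and square 2 × 2 regions, which gives 16 regions of four routers in an 8 × 8 mesh; the function names and the sample noise values are hypothetical, and the actual region layout used in the evaluation may differ.

```python
def region_of(router_id, mesh_dim=8, region_dim=2):
    """Map a router (row-major numbering assumed) to its region index."""
    row, col = divmod(router_id, mesh_dim)
    regions_per_row = mesh_dim // region_dim
    return (row // region_dim) * regions_per_row + (col // region_dim)


def regional_peak_psn(noise, mesh_dim=8, region_dim=2):
    """Peak PSN per region.

    noise: dict mapping router id -> non-empty list of per-cycle noise values
           (e.g., as a percentage of the supply voltage).
    Returns a dict mapping region index -> maximum noise seen by any router
    of that region in any cycle.
    """
    peaks = {}
    for router_id, trace in noise.items():
        region = region_of(router_id, mesh_dim, region_dim)
        peaks[region] = max(peaks.get(region, 0.0), max(trace))
    return peaks


# Routers 0 and 9 share region 0; router 2 belongs to region 1.
print(regional_peak_psn({0: [1.0, 5.0], 9: [7.0], 2: [3.0]}))
```

The minimum operating voltage of each region would then be chosen from its own peak rather than from the chip-wide worst case.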
2) Average Network Latency: We use Booksim2.0 as our architectural simulator to run network simulations of the comparative schemes using real workloads. We initially run the simulation for one million cycles and wait for all the flits in the network to get drained. We report the performance overhead of the comparative schemes in terms of overall average network latency in Section VI-D.
3) Energy Delay Product: Mitigating the peak supply noise reduces the minimum voltage guardband required for 100% fault coverage. As a result, all the routers in the network can operate at a reduced supply voltage and consume less energy. We analyze the improvement in router energy using DSENT [29] , and estimate the energy efficiency using energy delay product (EDP).
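The link between a reduced guardband and the EDP can be illustrated with a first-order sketch. It assumes that router energy scales with the square of the supply voltage (a common CMOS approximation for dynamic energy, used here in place of the DSENT-based energy analysis) and that delay scales with the average network latency. The function name and argument names are hypothetical.

```python
def edp_improvement(v_base, v_new, latency_overhead):
    """Estimate the fractional EDP improvement over a baseline.

    v_base: baseline operating voltage (nominal plus baseline guardband).
    v_new: operating voltage with the reduced guardband.
    latency_overhead: fractional latency increase (e.g., 0.05 for 5%).

    Assumes energy ~ V^2 and delay ~ average network latency.
    """
    energy_ratio = (v_new / v_base) ** 2
    delay_ratio = 1.0 + latency_overhead
    # EDP ratio is the product of the energy and delay ratios.
    return 1.0 - energy_ratio * delay_ratio
```

Under this model, a scheme improves the EDP only when the energy saved by the lower voltage outweighs the latency penalty of the flow-control or routing changes.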
4) Area and Power:
We modify the RTL of the open-source Stanford Verilog model of a modern virtual channel NoC router [3] to implement the IcoNoClast techniques. The router is assumed to be a part of a 2-D mesh topology with five input/output ports and eight VCs per port. We synthesize the augmented router RTL with the TSMC 45-nm library using Synopsys Design Compiler and calculate the area and power overheads.
VI. EXPERIMENTAL RESULTS
In this section, we analyze the efficacy and overheads of various comparative schemes (Section VI-A). First, we present the impact of a representative congestion-aware routing scheme on the PSN (Section VI-B). We analyze the improvement in regional peak PSN and performance overheads of IcoNoClast in Sections VI-C and VI-D, respectively. We evaluate the energy efficiency of the schemes in Section VI-E. We provide a comparison of the mean PSN for three synthetic traffic patterns at various injection rates in Section VI-F. Finally, we report the area and power footprints of IcoNoClast in Section VI-G.
A. Comparative Schemes
We consider the comparative schemes presented in Table III. Each scheme is a combination of a flow-control protocol and a routing algorithm.
TABLE III  COMPARATIVE SCHEMES
B. Can Congestion-Aware Routing Tackle PSN?
Fig. 9 shows the improvement in the regional peak PSN with a representative congestion-aware DBAR routing scheme compared with DOR XY routing. Both the routing schemes are used, along with wormhole flow-control. DBAR shows average peak PSN improvements of only 0.1%-1%, across all the benchmarks. Some regions show worse peak PSN with the DBAR, as the DBAR cannot prevent intermittent influx of traffic to proximal uncongested regions, leading to an increase in the PSN.
C. Regional Peak PSN Comparison
Fig. 10 shows the percentage improvement in regional peak PSN of various comparative schemes, with respect to the baseline. The diversity of the improvement stems from a high degree of skew in the router loads for the real benchmarks. We notice that PAF-SP, PAF-CP, and PAF-RP show more pronounced improvements, as the PAR can mitigate local PSN by reducing the intermittent upsurge in resource contention. In the PAF-SP scheme, considering the ferret benchmark, we observe that 14 of the 16 regions see an improvement in the peak PSN, and of these regions, eight regions benefit from a peak PSN improvement greater than 10%. The reduction in peak PSN translates to a lower voltage guardband for these regions resulting in improved energy efficiency. The respective maximum regional PSN improvements observed in all the schemes (Section VI-A) are 8.1%, 13.2%, 6.6%, 14.8%, 7.8%, and 14%, with respective average PSN improvements as 4.7%, 5.7%, 4.1%, 5.5%, 5%, and 5.8%. Some regions show slightly worse peak PSN compared with the baseline, due to occasional increase in local congestion, incurred by the PAF flow-control.
D. Performance Overhead
Fig. 11 shows the network latency overheads of the comparative schemes, with respect to the baseline. PAF-SP, PAF-CP, and PAF-RP incur slightly more overheads, compared with the other schemes, as PAR sometimes takes more congested downstream paths in the network (Section IV-C). We also notice that PAF-SD performs slightly better than PAF-RD due to PAF-Static's inherent fairness in allotting the FLAPs among the regional routers. As PAF-CG does not throttle the FLAPs of the least congested channels, both PAF-CD and PAF-CP incur very low performance overheads. There is a maximum performance degradation of 5.7% (Ferret in PAF-RP) with an average degradation of 4.1%, across all the schemes.
E. Energy Efficiency Comparison
Fig. 12 shows the improvement in energy efficiency of the comparative schemes, in terms of EDP. To calculate the network energy, we assume that each region of the NoC is running with a minimum voltage guardband, required for fault-free communication (Section V-B1). The guardbands are dictated by the peak PSN observed with the pertinent schemes. We notice that all the variants of the PAF incur better EDP, when used along with PAR routing (PAF-SP, PAF-CP, and PAF-RP). We observe a maximum EDP improvement of 12.2% (Swaptions in PAF-SP), with an average improvement of ∼10%, across all the schemes. PAF-SP shows maximum improvements in EDP, among all the schemes.
F. Mean PSN Comparison for Synthetic Traffics
Fig. 13 demonstrates the variation of the mean PSN with packet injection rate, for three traffic patterns (Transpose, Uniform, and Tornado), employing baseline and PAF-RP schemes. The number of flits per packet is 20. In general, the PSN increases with the injection rate due to increased switching activity in the routers at higher injection rates. We observe that PAF-RP consistently incurs lower PSN compared with the baseline, at all the injection rates. The reduction in the PSN (with PAF-RP) also varies across the traffic patterns, and the most PSN mitigation at the highest injection rate is observed in Transpose.
G. Area and Power Footprint
We observe marginal overheads from the synthesized hardware of the PAF variants (Table IV). The congestion management unit incurs more overhead, compared with the simple FLAP management unit.
VII. CONCLUSION
In this paper, we demonstrate that contemporary flow-control protocols and routing algorithms are ineffective in mitigating voltage noise in an NoC PDN. We propose IcoNoClast, a collection of a novel flow-control protocol (PAF) and an adaptive routing algorithm (PAR), to improve the peak PSN in NoCs. Our best scheme improves the regional peak PSN by ∼15% and the EDP by ∼12%, with 4.1% average performance overhead, and marginal area and power footprints.
