BoostNoC: Power Efficient Network-on-Chip Architecture
for Near Threshold Computing
Chidhambaranathan Rajamanikkam Rajesh JS Koushik Chakraborty Sanghamitra Roy
USU BRIDGE LAB, Electrical and Computer Engineering, Utah State University
{chidham, rajesh.js}@aggiemail.usu.edu {koushik.chakraborty, sanghamitra.roy}@usu.edu

ABSTRACT
While near threshold design space provides a promising approach towards energy-efficient computing, it is plagued by
sub-optimal performance. Application characteristics and hardware non-idealities of conventional architectures (optimized
for the nominal voltage) prevent us from fully leveraging the
potential of NTC systems. Further, the popular approach of
increasing the computational core count to compensate for
the performance loss severely burdens the on-chip communication fabric with an increased communication demand.
In this work, we quantitatively analyze the performance
bottleneck created by a conventional NoC architecture in many-core NTC systems. To reclaim the performance lost due to
a sub-optimal NoC, we propose BoostNoC, a power-efficient, multi-layered network-on-chip architecture. BoostNoC
improves the system performance by nearly 2× over a conventional NTC system. Further, we improve the energy efficiency by 1.4× with the use of drowsy routers.

1. INTRODUCTION

Modern many-core chip design is plagued by barriers of
prohibitive energy constraints and restrictive power budgets.
Near threshold computing (NTC) comes as a saving grace to the
energy-efficient computing paradigm by aggressively operating all computing platforms with a supply voltage close to
the transistor threshold voltage. However, the tremendous
increase in energy efficiency comes at the cost of a steep performance loss and performance variability (due to process
variation) [5]. Further, traditional many-core architectures
designed to perform at nominal voltages yield sub-optimal
performance at NTC. While a majority of existing literature
focuses on optimizing the computing cores, research on the
on-chip communication’s impact at NTC has taken a back
seat. In this context, we meticulously evaluate the application level and hardware performance characteristics of many-core NTC systems to specifically isolate the impact of the on-chip communication fabric, the network-on-chip (NoC).
NTC circuits typically employ more devices to exploit application parallelism and compensate for the performance
loss of a single device [5]. A direct consequence of this approach is the increased communication demand on the NoC
owing to simultaneous interaction of many cores. This heightened communication demand, along with the following three
prominent factors, delivers a severe blow to the on-chip communication latency and performance. First, we see an increase in the inter-core packet hop distance by virtue of an increase in the computational core count. Second, the supply
voltage scaling to near threshold results in a massive reduction of the NoC operational frequency. Finally, the unavoidable
effects of process variation (PV) present a tremendous challenge in NTC systems. In this work, we demonstrate that the
traditional on-chip communication fabric creates a severe performance bottleneck in NTC systems. In addition, we seek a solution to regain the lost performance without compromising
on the energy efficiency of the system.

Figure 1: Limitation due to application characteristics.

Contemporary research on NoC topology and architectures
such as clustered NoC [17], hierarchical NoC [12] and tile
based NoC [7,18], have aimed to reduce the inter-core packet
hop distance. While these works are an important step forward, they do not adequately address the challenges of reduced operational frequency and PV induced performance
variation posed by the NTC regime. Hence, to improve the
NoC performance without compromising on the energy efficiency, we propose BoostNoC, a power-efficient, multi-layered
NoC architecture that efficiently caters to the demands of
many-core NTC systems. BoostNoC is made up of two architecturally homogeneous layers that contrast in their design
characteristics. While one layer is optimized for power, the
other is optimized to boost the NoC performance under high
communication loads. To the best of our knowledge, this is the
first work to exploit the unique opportunity presented by the variation in communication load across epochs to efficiently boost the
NoC performance in the NTC regime.
We make the following contributions in this paper:

• We analyze the factors affecting performance in many-core
NTC systems and isolate the impact of the on-chip communication (Section 2).
• We explore the detailed design of a power-efficient multi-layered NoC architecture called BoostNoC to improve the
performance under high communication load in many-core
NTC systems (Section 3).

Bottleneck Estimation | System Comparison
Interconnect          | Ideal system (footnote 1) vs. Ideal system + NoC Latency (1 cycle/hop)
Memory                | Ideal system (footnote 1) vs. Ideal system + Memory Access Latency (∼45ns)

Table 1: Test configurations used to quantitatively analyze the
cause of performance bottleneck in many-core NTC systems.

• We study the traffic characteristics and network utilization
of our BoostNoC architecture and determine that the use
of drowsy routers can further improve the energy efficiency
(Section 4).
• Using a rigorous cross-layered circuit-architectural analysis (Section 5), we evaluate the performance, energy efficiency and peak power improvement of BoostNoC architecture. Our analysis reveals that BoostNoC delivers up to
1.4× higher performance per watt compared to a conventional NoC operating at NTC and nearly 3× that of a NoC
operating at super threshold computing (STC) (Section 6).

2. MOTIVATION

In this section, we quantitatively assess the performance
bottlenecks in an NTC many-core system. The performance
of a many-core NTC system has two major contributing factors: application level and hardware performance characteristics. To
understand application level characteristics, we study the performance scalability of various applications under idealized hardware (footnote 1) in Section 2.1. To carefully understand the
impact of hardware performance characteristics, we decouple two of its major components: off-chip memory latency (Section 2.2.1) and on-chip interconnect latency (Section 2.2.2). Our experimental data clearly demonstrate
that on-chip interconnect latency is the dominant performance bottleneck in an NTC many-core system.

2.1 Application Performance Characteristics
Application speedups from parallel execution are bound
by the prevailing fraction of serial code and do not improve
linearly with an increase in the computational core count.
Since the fraction of serial code varies across applications,
it is critical to understand this application level bottleneck
when we comparatively analyze STC and NTC systems.
Figure 1 shows the effective application speedups obtained
when a representative set of parallel workloads (SPLASH2
benchmarks) are executed on ideal hardware by scaling the
processor count from 1 to 128 cores. The evaluation methodology used for this analysis is presented in detail in Section 5.
We observe that only a couple of applications in this diverse
set of benchmarks can effectively scale beyond 60 cores. Benchmarks like radiosity, cholesky and barnes have nearly ideal
speedup, indicating very little overhead due to the serial
portions of the code. Other applications like water.sp and raytrace have large portions of serial code. Deploying these applications in NTC systems with hundreds of cores will result
in decidedly sub-optimal performance.
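
The bound described above is commonly expressed through Amdahl's law. The following minimal Python sketch illustrates how quickly the speedup saturates as the core count grows. The serial fractions are hypothetical values chosen for illustration and are not measured from the SPLASH2 workloads.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Ideal speedup on `cores` cores when `serial_fraction` of the work is serial."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical serial fractions; they are not taken from the benchmarks.
    for s in (0.001, 0.01, 0.05):
        print(s, [round(amdahl_speedup(s, n), 1) for n in (1, 16, 64, 128)])
```

Even a serial fraction of a few percent keeps the speedup on 128 cores far below the core count, which is the behavior Figure 1 reports for the less scalable applications.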
1 Ideal hardware signifies a system with no hardware performance penalties.
The hardware penalties related to branching, memory access, and on-chip
communication, among others, are all considered to be 1 cycle.

2.2 Hardware Performance Characteristics
To quantify the impact of notable hardware characteristics
such as memory access latency and inter-core communication on system performance, we consider a popular tile based
128-core architecture as our baseline NTC system [11, 18].
The 128 cores are organized as 32 tiles (8 × 4) interconnected
by a mesh network, with each tile consisting of 4 cores. The
test configuration parameters are shown in Table 1 (Section 5 presents a detailed discussion of the methodology).
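
To make the tile organization concrete, the sketch below maps a core to its tile and counts the minimal number of mesh hops between two cores. The row-major numbering of cores and tiles and the uniform-traffic average are assumptions made for illustration; they are not taken from the evaluation setup.

```python
from itertools import product

COLS, ROWS, CORES_PER_TILE = 8, 4, 4  # 8 x 4 tiles with 4 cores per tile


def tile_of(core_id: int) -> tuple[int, int]:
    """Map a core id to (x, y) tile coordinates; row-major numbering is assumed."""
    tile = core_id // CORES_PER_TILE
    return tile % COLS, tile // COLS


def hops(src_core: int, dst_core: int) -> int:
    """Minimal-path hop count between the tiles of two cores (Manhattan distance)."""
    (sx, sy), (dx, dy) = tile_of(src_core), tile_of(dst_core)
    return abs(sx - dx) + abs(sy - dy)


def mean_hops(cols: int, rows: int) -> float:
    """Average hop count over all ordered pairs of distinct tiles (uniform traffic assumed)."""
    tiles = list(product(range(cols), range(rows)))
    dists = [
        abs(ax - bx) + abs(ay - by)
        for (ax, ay) in tiles
        for (bx, by) in tiles
        if (ax, ay) != (bx, by)
    ]
    return sum(dists) / len(dists)
```

Under these assumptions, mean_hops(8, 4) evaluates to 4.0, against roughly 2.67 for a 4 × 4 mesh, which illustrates how the hop distance grows with the tile count.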

2.2.1 Memory Access
Figure 2a illustrates the performance degradation due to
off-chip memory access latency in a 128-core NTC system.
Our analysis shows that memory access is not a prominent cause
of the performance bottleneck in NTC systems. We observe that the
average performance degradation due to memory access latency is a mere 0.7% and the highest degradation suffered is
1.5% for the fft application. The baseline is considered to be
a 128-core NTC system with ideal memory access latency as
shown in Table 1.

2.2.2 On-Chip Communication
Figure 2b shows that the system performance degrades
significantly due to the on-chip communication (network-on-chip) latency in a 128-core NTC system. Compared to an
ideal system, the average performance degradation is 50%, while radiosity and fft suffer from nearly 90%
degradation in performance.
Our evaluations reveal that the following three factors play
a decisive role in the degradation in NoC performance.

• Increase in communication demand: When comparing the volume of packets injected in a 128-core NTC system to an
isopower 16-core STC system (footnote 2), we found that the volume
of injected packets increased by more than 3× in the NTC
system. The rise in core count increases
both inter-core and core-to-memory communication.
• Diverse latency distribution in NTC: Figure 3a illustrates the
distribution of communication latency in a 128-core NTC
system. We observe that, on average, more than 30% of
the packets have a latency greater than 10 cycles. A similar
analysis in the STC system showed that a mere 5% of the
packets have a latency greater than 10 cycles. This diversity in
latency distribution results from the increased inter-core
packet hop distance owing to a rise in the core count.
• Reduced NoC operational frequency: Figure 3b shows that the
packet latency degrades by more than 6×, on average, in
a tile-based 128-core NTC system. Applications such as fft
and radiosity suffer a latency degradation of nearly 16×.
The increase in inter-core packet hop distance, along with
the added detriment of reduced operating frequency, considerably increases the average packet latency.
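
The two latency effects listed above can be examined with a pair of small helpers. This is a sketch under stated assumptions: the list of per-packet latencies is a hypothetical trace format, and the 200MHz and 2.3GHz clocks used in the note below are the system frequencies of Table 2, assumed here to apply to the NoC as well.

```python
def fraction_above(latencies_cycles: list[int], threshold_cycles: int = 10) -> float:
    """Share of packets whose latency exceeds `threshold_cycles`."""
    if not latencies_cycles:
        raise ValueError("need at least one packet latency")
    slow = sum(1 for lat in latencies_cycles if lat > threshold_cycles)
    return slow / len(latencies_cycles)


def cycles_to_ns(cycles: float, freq_hz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds at a given clock frequency."""
    if freq_hz <= 0:
        raise ValueError("freq_hz must be positive")
    return cycles / freq_hz * 1e9
```

For instance, a packet that takes 10 cycles occupies about 50 ns at 200MHz but only about 4.3 ns at 2.3GHz, so a fixed cycle count is 11.5× slower in wall-clock time before any increase in hop count is considered.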

2.3 Significance
The degradation in performance due to application (Section 2.1) and hardware characteristics (Section 2.2) helps us
characterize the demand in NTC systems. Our findings clearly
demonstrate that on-chip communication is a severe bottleneck in many-core NTC systems. Hence, we propose BoostNoC, a novel power-efficient NoC architecture for NTC systems to efficiently reclaim the lost performance.

2 While scaling the 16-core STC system to a 128-core NTC system we ensure
that both systems have a constant power budget.

(a) Performance degradation due to off-chip memory access.

(b) Performance degradation due to a NoC.

Figure 2: Quantitative analysis of hardware characteristics to identify the cause of sub-optimal system level performance in many-core NTC systems.

(a) Distribution of packet latency.

(b) NTC packet latency normalized to its STC counterpart.

Figure 3: Characterizing the loss in NoC performance in NTC. Figure 3a presents the distribution of packet latency (in cycles) and
Figure 3b shows the degradation in packet latency in NTC systems.

3. BOOSTNOC ARCHITECTURE

In this section, we provide a detailed description of the
BoostNoC architecture. Section 3.1 presents the design overview
and we establish the insight behind our approach in Section
3.2. In Section 3.3, we detail our two layers, and analyze the
intricacies of switching between the layers in Section 3.4. Section 3.5 reveals the required hardware control mechanism.

3.1 Design Overview
We envisage a multi-layered NoC architecture, where the
layers are architecturally homogeneous but optimized to contrasting design considerations. Our work in this paper demonstrates a novel incarnation of this concept, BoostNoC, that
exploits the temporal nature of communication demand in
NTC systems. The temporal nature refers to the variation of
communication load across different epochs due to the inherent application characteristics.
Figure 4 illustrates the framework of our novel BoostNoC
architecture. BoostNoC combines two architecturally homogeneous layers that are optimized to contrasting design parameters. Based on the communication load, BoostNoC dynamically switches between the layers. While one layer is
optimized for power efficient data transmission, the other
layer is used to bolster the NoC performance. We detail the
technicalities of BoostNoC in the following sections.

3.2 Temporal Communication Demand
Figure 5 shows the on-chip communication network utilization trend of 4 representative applications running on a
128-core NTC system. The x-axis represents consecutive intervals during the application runtime. In most benchmarks,
we see discernible patterns in the communication demand,
which fluctuates between epochs. In a few epochs the cores are
highly voluble (footnote 3), creating a high load on the communication
fabric, while in other epochs most cores are quiet (low communication demand). This temporal variation of network
utilization can be correlated to the volume of injected packets experiencing long inter-core packet hop distance (footnote 4). Figure
6 illustrates this correlation for the fft benchmark. We see a
sharp rise in network utilization in epochs with a high volume of long distance packets.
Our novel BoostNoC architecture aims to exploit this temporal variation in communication demand by trading off chip
area to bolster the NoC performance and energy efficiency.
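
The per-epoch volume of long distance packets plotted in Figure 6b can be derived from a packet trace along the following lines. This is an illustrative sketch: the trace format of (injection cycle, hop count) pairs and the function name are hypothetical, while the 20000-cycle epoch and the 3-hop threshold follow the definitions used in this section.

```python
from collections import Counter

EPOCH_CYCLES = 20_000   # epoch length used in Figures 5 and 6
LONG_DISTANCE_HOPS = 3  # packets needing more than 3 hops are long distance


def long_distance_per_epoch(packets: list[tuple[int, int]]) -> dict[int, int]:
    """Count long distance packets in each epoch.

    `packets` holds (injection_cycle, hop_count) pairs; this format is hypothetical.
    """
    counts = Counter()
    for cycle, hop_count in packets:
        if hop_count > LONG_DISTANCE_HOPS:
            counts[cycle // EPOCH_CYCLES] += 1
    return dict(counts)
```

Epochs that never see a long distance packet are simply absent from the returned dictionary, so a plotting routine would treat missing keys as zero.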

3.3 BoostNoC Layers
Two architecturally homogeneous layers of NoC routers are
interconnected in a mesh topology to form the BoostNoC architecture. The two layers share the links between the routers
as shown in Figure 4. The two layers are:

• Frugal power usage layer (FruPUL): The routers in this
layer are optimized to operate in the near threshold voltage regime to provide power-efficient operation at a low
communication load.
• Boost performance layer (BoPeL): The routers in this layer
are optimized to operate at the nominal voltage to bolster
the NoC performance under a high communication load.
The objective of BoPeL is to drain the in-flight packets at a
quicker rate and offset the latency degradation caused by
voluminous long distance communication.
At any given time, only one layer plays an active role in
the communication fabric and the other layer is turned off.
FruPUL is the default active layer as the cores are considered to be operating in the NTC regime. During epochs with
high communication loads, BoPeL is activated (and FruPUL
deactivated) to meet the demand and boost the NoC’s performance. The layer switchover mechanism and the cost associated with it are discussed in Section 3.4.

3 High volume of inter-core communication.
4 Considering the tiled architecture, we define packets needing more than 3
hops to reach their destination as long distance communication.

Figure 4: BoostNoC Architecture. The figure also shows the functional diagrams of the router and layer controllers.

(a) barnes

(b) cholesky

(c) radiosity

(d) raytrace

Figure 5: Temporal variation of communication load for 4 different benchmarks. The plots illustrate network communication load
(in %) during consecutive intervals of 20000 cycles for the whole application runtime. We see discernible patterns in all applications.

(a) Communication load - fft

(b) Volume of long distance communication.

Figure 6: Correlation between communication load and volume
of long distance communication for the fft application. Figure
6b shows the volume of long distance packets in consecutive
epochs of 20000 cycles. The x-axis represents the application
runtime in cycles.

Figure 7: Operational phases of the switchover mechanism.

3.4 Switchover Mechanism
The switchover between the layers is the crux of the BoostNoC architecture. The primary constraint while switching
between the two layers is to maintain lossless communication of packets while incurring minimal switching overheads.
Figure 7 illustrates the process of switching between layers.
Keeping the defined constraints in mind, we envisage four
operational phases of the switchover mechanism explained
below in conjunction with Figure 7.

• Pre-initiate: During normal NoC operation, one of the layers is active and the other is powered off. In this interval,
the aggregate buffer occupancy of the routers in the active
layer is carefully monitored. The buffer occupancy information serves as an indicator of the communication load
on the network. It is the cardinal parameter behind the
decision making process involved in switching between
the layers. In Figure 7, we observe that FruPUL is active
and the communication load is being monitored. When
the load increases, the decision to switch to BoPeL is made.
• Initiate: Based on the decision, BoPeL is signaled to switch
on. During the same time, all the routers in FruPUL are instructed to process the in-flight flits in each router and forward them to the input buffers of their respective downstream routers. The flits already present in each router’s
input buffers maintain status quo. We call this process
flit safeguarding. The process of flit safeguarding is allowed
to continue and complete until BoPeL (the other layer) is
switched on and ready to handle traffic.
• Transfer: Once BoPeL signals ready, the packets in the input
buffers of routers in FruPUL (one layer) are transferred to
the corresponding routers in BoPeL (the other layer). The
novel buffer content transfer mechanism overcomes the
need to drain packets from the network and is elaborated
in Section 3.5.1.
• Terminate: On receiving a signal from FruPUL that the buffer
content transfer is successful and that all its buffers are
empty, the layer is signaled to be powered off. Simultaneously, BoPeL is signaled to begin normal operation.
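
The four phases above can be read as a small state machine. The sketch below is one possible encoding and not a description of the hardware. The three boolean flags are hypothetical names standing in for the switch decision, the ready signal of the target layer, and the confirmation that the buffers of the old layer are empty.

```python
from enum import Enum, auto


class Phase(Enum):
    PRE_INITIATE = auto()  # active layer carries traffic; load is monitored
    INITIATE = auto()      # target layer powers on; in-flight flits are safeguarded
    TRANSFER = auto()      # buffer contents move to the target layer
    TERMINATE = auto()     # old layer powers off; target layer begins operation


def next_phase(phase: Phase, *, switch_requested: bool = False,
               target_ready: bool = False, buffers_empty: bool = False) -> Phase:
    """Advance the switchover by one step; the flag names are hypothetical."""
    if phase is Phase.PRE_INITIATE:
        return Phase.INITIATE if switch_requested else phase
    if phase is Phase.INITIATE:
        return Phase.TRANSFER if target_ready else phase
    if phase is Phase.TRANSFER:
        return Phase.TERMINATE if buffers_empty else phase
    return Phase.PRE_INITIATE  # after TERMINATE, monitoring resumes on the new layer
```

Each phase waits for exactly one condition, which mirrors the requirement that no packet is forwarded on the new layer before the old layer has handed over its buffer contents.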

3.5 Hardware Control Mechanism
BoostNoC architecture requires specific hardware enhancements to carry out its functions in an orderly fashion. We
adopt two hardware control mechanisms known as Layer Controller and Router Controller to efficiently resolve and regulate the layer operations in the NoC. Each controller plays
a definitive role to efficiently boost the NoC performance in
NTC systems.

Algorithm 1 Layer Controller Operation
 1: Initialize: Routers = N;                ⊲ Number of routers
 2: Acknowledge: ack_LX
 3: WaitForAcclimatizationPd();
 4: for k = 1 → Routers do
 5:     Evaluate BufferOccupancy(k);
 6: end for
 7: for k = 1 → Routers do
 8:     Evaluate RouterLocation(k);
 9:     if (BufferUsage > Usage threshold) then
10:         RouterRequested++;
11:     end if
12: end for
13: if (RouterRequested > Router significant) then
14:     Enable reqactiv_LX;
15: end if
16: for k = 1 → Routers do
17:     Enable init_fs(k);
18: end for
19: WaitFor respactiv_LX;
20: if (respactiv_LX) then
21:     Enable init_transbuf();
22: end if
23: WaitFor resp_bufempty;
24: if (resp_bufempty) then
25:     Enable term_LX(OLD);
26:     Enable begin_comm;
27: end if

Layer Controller (LC): The role of the layer controller is to
monitor the network communication load by aggregating the
information sent from individual router controllers. It functions like the brain of BoostNoC, and plays a central role in
the decision process to switch between layers. Algorithm
1 shows the basic operation of the LC. As an initial setup,
the LC acknowledges the active layer (ack_LX) and records the
buffer occupancy of the routers in that layer (lines 1-6). It
then continually monitors the information sent by individual router controllers during each epoch and, based on the
rules set in lines 7-15, decides whether a layer switchover
will yield a better outcome. Once the decision is made to
switch between layers, the LC signals to turn on the alternate
layer (reqactiv_LX) and instructs the individual router controllers (RC) to trigger flit safeguarding (init_fs). On receiving a response from the newly activated layer (respactiv_LX),
it instructs all RCs to begin inter-layer buffer content transfer (init_transbuf) and waits for all RCs to signal transfer
completion (resp_bufempty). At this point, the LC terminates
the old layer (term_LX), activates the new layer (begin_comm)
and goes back to monitoring the communication load.
Router Controller (RC): RCs are distributed agents with a
three-fold functionality: (a) to sense local changes in the network, (b) to report gathered information to the LC and (c) to
actuate responses when directed by the LC. Each individual
RC reports its buffer occupancy to the LC at regular intervals (report_bufoc) and waits for a decision. On receiving the
init_fs signal, the RC safeguards its in-flight flits. On receiving init_transbuf, it performs the buffer content transfer detailed in Section 3.5.1, reports successful transfer back to the
LC and waits for begin_comm to restart communication in the
active layer.
Figure 8 illustrates the sequence of handshake signals between LC and RC, highlighting the operation of BoostNoC.
The LC additionally ensures that once a layer is activated, it stays
active for a set minimum period known as the acclimatization period. The acclimatization period is added to amortize the cost
associated with the layer switchover and to avoid
thrashing between layers.
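
The decision rule of the LC, together with the acclimatization period, can be sketched in software as follows. This is a behavioral illustration only and not the hardware implementation. Measuring the acclimatization period in epochs and switching back to FruPUL as soon as the request count falls below the same threshold are assumptions, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LayerController:
    usage_threshold: float       # per-router buffer usage that counts as a request
    router_significant: int      # number of requests needed to trigger a switch
    acclimatization_epochs: int  # minimum number of epochs a layer stays active
    active_layer: str = "FruPUL"
    epochs_in_layer: int = 0

    def step(self, buffer_usage: list[float]) -> str:
        """Process one epoch of RC reports and return the layer active afterwards."""
        self.epochs_in_layer += 1
        if self.epochs_in_layer <= self.acclimatization_epochs:
            return self.active_layer  # still acclimatizing: no switch allowed
        requested = sum(1 for u in buffer_usage if u > self.usage_threshold)
        high_load = requested > self.router_significant
        wanted = "BoPeL" if high_load else "FruPUL"
        if wanted != self.active_layer:
            self.active_layer = wanted
            self.epochs_in_layer = 0  # the new layer starts its own acclimatization
        return self.active_layer
```

Resetting the epoch counter on every switch is what prevents thrashing in this sketch: a short burst of traffic cannot toggle the layers more often than once per acclimatization period.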

3.5.1 Inter-Layer Buffer Content Transfer
The router controller (shown in Figure 4) plays a critical
role in the inter-layer transfer of packets. The router in FruPUL
is connected to its counterpart in BoPeL using a bi-directional
physical link controlled by the RC. The router in each layer
consists of n buffers. Once the process of flit safeguarding
is complete, RC evaluates the buffer occupancy of the active layer. The buffer contents of the active layer are serially
copied to the buffers of the router in the alternate layer by selecting the appropriate MUX and DeMUX signals. A counter
keeps track of all transactions between the two layers and
once the value matches the buffer occupancy estimated before the process, the RC signals the successful completion of
buffer transfer. This process happens simultaneously in the
entire network. The serial transfer and transaction tracking
between the two layers ensure a lossless transition between
the two layers during a switchover. The cost associated with
the switchover directly correlates to the buffer occupancy at
the start of the process and the worst case switchover overhead depends on the buffer size of the routers.
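
A software analogue of the counter-checked serial transfer is sketched below. It models each router buffer as a FIFO queue, which is an assumption, and the function name is hypothetical. The returned count serves as a proxy for the switchover cost, since one flit is moved per step.

```python
from collections import deque


def transfer_buffers(active: list[deque], alternate: list[deque]) -> int:
    """Serially move every flit from `active` buffers to the matching `alternate` buffers.

    Returns the number of single-flit transfers performed.
    """
    expected = sum(len(buf) for buf in active)  # occupancy estimated before the transfer
    transferred = 0                             # counter tracking inter-layer transactions
    for src, dst in zip(active, alternate):
        while src:
            dst.append(src.popleft())           # preserve FIFO order within each buffer
            transferred += 1
    if transferred != expected or any(active):
        raise RuntimeError("buffer transfer incomplete")
    return transferred
```

Because the count equals the buffer occupancy at the start, the worst case in this model is bounded by the total buffer capacity of the router, in line with the observation above.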

4. DROWSY ROUTERS IN BOOST LAYER
In this section, we study the traffic characteristics and network utilization of the BoostNoC and propose the use of drowsy
routers in BoPeL to further improve the energy efficiency.
Breaking down the layer switching rule in Algorithm 1, we
can rationalize that when the communication load on the network is high, BoPeL is activated. Figure 9 exhibits an interesting trend in traffic when BoPeL is active. In most benchmarks,
only a small number of tiles are responsible for the high communication load, indicating that the other routers are usually idle for long durations. On average, nearly 60% of
the routers are idle when operating in BoPeL. We exploit this
phenomenon to improve the energy efficiency of the NoC by
replacing the input buffers in this layer with drowsy SRAMs.
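
The idle-router statistic reported in Figure 9 could be computed from per-router occupancy samples along these lines. The sketch is illustrative: the mapping from router id to a list of per-epoch buffer occupancies is a hypothetical data format, and treating a router as idle when its buffers were empty for the last few samples is an assumed definition.

```python
def idle_fraction(occupancy_history: dict[int, list[int]], idle_epochs: int = 1) -> float:
    """Fraction of routers whose buffers were empty for the last `idle_epochs` samples."""
    if not occupancy_history:
        raise ValueError("no routers reported")
    if idle_epochs < 1:
        raise ValueError("idle_epochs must be at least 1")
    idle = sum(
        1
        for samples in occupancy_history.values()
        if len(samples) >= idle_epochs and all(s == 0 for s in samples[-idle_epochs:])
    )
    return idle / len(occupancy_history)
```

A larger idle_epochs value makes the test more conservative, which would reduce the number of routers considered for the low power mode.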

4.1 Design Details
The routers in the BoPeL operate at the nominal voltage
and hence have a significantly high power consumption. By
introducing drowsy SRAMs as buffers in the router, we add an
additional low power operation mode to improve the energy
efficiency. In this mode, a low voltage is supplied to the inactive routers, thereby reducing the leakage current. The idle
routers are periodically put into drowsy mode and are woken
up when the upstream router requests credit information. A
single cycle cost is added to wake up a router in the drowsy
state [6]. The decision to put the idle routers into the low
power mode can be made by the router controller based on
buffer utilization changes. We evaluate the improvement in
energy efficiency due to drowsy routers in BoPeL in Section 6.

Figure 8: Handshake communication between Layer and
Router controller.

Figure 9: Percentage of idle routers in the BoPeL.

5. METHODOLOGY

Figure 10 presents the comprehensive cross-layer methodology we use to evaluate the efficacy of BoostNoC architectures using three metrics: peak power, performance and energy
efficiency. Architectural simulations are performed to assess
the performance (Section 5.1), while the circuit layer analysis contributes valuable information regarding the design
footprint and power characteristics (Section 5.2). Section 5.3
presents the procedure for device level analysis to obtain process variation parameters and STC to NTC scaling data.

Figure 10: BoostNoC cross-layer methodology.

5.1 Architectural Layer
Multi-core Simulation: We model an Intel Xeon E5 series
processor on the Sniper multi-core simulator [2] with the configuration shown in Table 2. The STC system models 16 cores
interconnected using a NoC (4 × 4 2D mesh topology). The
NTC system models 128 cores in a tile based architecture interconnected using an 8 × 4 2D mesh NoC, with each tile housing 4 cores [11]. We use highly parallel large-set workloads
from the Splash2 benchmark suite to assess the performance
of these systems and collect traces of the communication. We
use Booksim 2.0 [8] to simulate and evaluate the NoC behavior. The Splash2 benchmark suite consists of parallel and well-diversified applications that can scale to 128 cores [21].

Parameters   | STC Configuration | NTC Configuration
Architecture | Intel Xeon Processor E5 Series (both configurations)
Cores        | 16                | 128
Voltage      | 1.0V              | 0.35V
Frequency    | 2.3GHz            | 200MHz
Technology   | 22nm              | 22nm

Table 2: STC and NTC system configuration parameters.

Chen et al. showed that the maximum delay deviation due
to within-die PV is a colossal 200% for the NTC regime at
22nm and thus cannot be discounted [3]. We therefore used
this delay variation to model the PV-affected NTC cores.
NoC Simulation: We model an 8 × 4 2D mesh NoC mimicking
a 32 tile-based NTC system on the Booksim Simulator [8].
The router has a 4-stage pipeline of route computation, virtual channel allocation, switch allocation and switch traversal. We simulate the traces collected from Splash2 benchmarks, observe the NoC behavior and study various traffic characteristics. We implement the BoostNoC architecture
with the functionality detailed in Section 3 and evaluate the performance of the NoC. Our evaluation carefully considers the
impact of PV on the NoC performance.

5.2 Circuit Layer
To estimate the design footprint and hardware overheads
of our architecture, we augment the open source NoC router
RTL [1] with the hardware control mechanisms discussed in
Section 3.5. We synthesize the NoC router RTL with
a 32nm standard cell library using Synopsys Design Compiler.
We use the DSENT power modeling tool [19] to determine
the NoC leakage and dynamic power estimates considering
the PV parameters evaluated in the device layer. The network and router configuration are identical in Sniper, Booksim and DSENT to maintain uniformity.

5.3 Device Layer
We obtain the 22nm PTM model for HSPICE simulations
and customize it in order to generate leakage and dynamic
power behavior at STC and NTC regimes [22]. NTC circuits are highly susceptible to process variation. Our HSPICE
evaluations model the effect of PV based on VARIUS-NTV
[16] and we use these results while scaling from STC to NTC.
The details of our scaling methodology follow.

5.3.1 Power Scaling from STC to NTC
Scaling the entire power from the STC to the NTC region
presents a methodological challenge. HSPICE simulation of
an entire NoC architecture is computationally intensive. To
manage the complexity, we scale the STC power to NTC using the following three categories [4].

• Combinational logic: This is scaled using the STC/NTC characteristics of the canonical 31 fanout-of-4 inverter-chain as
the representative circuit [15].
• Storage elements: We scale the on-chip SRAM power by investigating the power scaling trend from the STC 6T SRAM
cell to the NTC-friendly 10T SRAM cell [20].
• Interconnect: We estimate the interconnect power to be 50%
of the dynamic power based on previous work [4]. Since
scaling the supply voltage equally affects both interconnect power and dynamic power, we assume that their relative weight remains unchanged for STC and NTC.
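
The category-wise scaling above can be pictured with the sketch below. It is not the HSPICE-based procedure used in this work: the dynamic power ratio uses the textbook first-order relation P = a·C·V²·f with activity and capacitance held fixed and leakage ignored, and the per-category factors passed to scale_noc_power are placeholders that would in practice come from circuit simulation. All function and key names are hypothetical.

```python
def dynamic_power_ratio(v_from: float, v_to: float, f_from: float, f_to: float) -> float:
    """First-order dynamic power ratio from P = a*C*V^2*f with a and C held fixed."""
    if min(v_from, v_to, f_from, f_to) <= 0:
        raise ValueError("voltages and frequencies must be positive")
    return (v_to / v_from) ** 2 * (f_to / f_from)


def scale_noc_power(stc_power_w: dict[str, float], factors: dict[str, float]) -> float:
    """Scale each STC power category by its own factor and return the NTC total."""
    missing = set(stc_power_w) - set(factors)
    if missing:
        raise KeyError(f"no scaling factor for: {sorted(missing)}")
    return sum(power * factors[name] for name, power in stc_power_w.items())
```

With the Table 2 operating points (1.0V at 2.3GHz scaled to 0.35V at 200MHz), dynamic_power_ratio gives roughly 0.011, which conveys the order of magnitude of the dynamic power reduction that motivates operating at NTC.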

(a) Normalized system level performance.

(b) Normalized packet latency.

Figure 11: (a) System level performance improvement of our proposed schemes normalized to AlwaysNTC scheme. (b) Normalized
reduction in packet latency of BoostNoC compared to AlwaysNTC scheme.

6. EXPERIMENTAL RESULTS

In this section, we discuss the results obtained from our
simulation of the BoostNoC architecture considering within-die PV. Section 6.1 summarizes the different schemes that
we use in our simulations. We evaluate the effectiveness of
our proposed architectures using three metrics: performance
(Section 6.2), peak power (Section 6.3) and energy efficiency (Section 6.4). We end our results section by presenting the design
footprint in terms of area overhead in Section 6.5.

6.1 Evaluation Schemes
The four schemes evaluated in our simulations are:

• Always NTC: The NoC and the cores are both operated
in the NTC regime throughout the application runtime.
In theory, this scheme is extremely energy efficient at the
cost of a substantial drop in performance. Moreover,
within-die process variation significantly affects the performance/power characteristics of both the cores and
the NoC in this scheme.
• Always STC: In this scheme, the cores operate in
the NTC regime, while the NoC operates at nominal
voltage. This configuration provides the best performance
while taking a significant hit in energy efficiency. The cores
suffer substantially from the effect of process variation. However, the NoC exhibits lower variation in performance/power
characteristics as it operates in the STC regime.
• BoostNoC: Our proposed BoostNoC architecture, discussed
in Section 3, uses two layers (FruPUL and BoPeL) to provide the best of both worlds. The architecture sacrifices
chip area to deliver better performance and energy efficiency. Process variation affects both cores and NoC
significantly. Since the NoC operates in the FruPUL layer during
most of the application runtime, the effect of process variation is high compared to the always STC scheme.
• Drowsy routers in BoPeL or drowsy BoostNoC: This scheme
uses drowsy routers in the BoPeL to further improve energy efficiency.

6.2 Performance Analysis
Figure 11a shows the normalized system level performance
of the BoostNoC architectures considering within-die process
variation. The performance is normalized to the PV-free always NTC scheme. Our results demonstrate that, on average, BoostNoC improves the performance by nearly 2×.
Benchmarks with a high communication demand such as fft
and radiosity show even higher performance improvement
(around 3×). However, applications with low communication demand (barnes and water.sp) are less sensitive to the
boost in operating frequency and hence deliver only 4% improvement in the system level performance.

Figure 12: Normalized peak power of BoostNoC architectures
compared to PV-free AlwaysNTC (Lower is better).

Our evaluations also show that the performance of drowsy
BoostNoC nearly matches that of the BoostNoC. The small difference in performance between the two schemes is due to
the overhead suffered while transitioning from low power
drowsy state to the ON state.
Figure 11b illustrates the packet latency reduction due to
BoostNoC. These results show the communication performance relative to the PV-free always NTC NoC. Drowsy
BoostNoC data is omitted from the plot as its performance is
nearly identical to that of the BoostNoC architecture. On average,
our scheme reduces the packet latency by nearly 40% compared to a conventional always NTC scheme. As expected, always STC performs better than BoostNoC. Our results demonstrate that applications with high communication loads benefit significantly from the BoostNoC architecture.
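
The normalization used in this section can be sketched as follows. This is a minimal illustration and not the evaluation scripts behind Figure 11: the function names are hypothetical, the numeric inputs are placeholders rather than measured data, and performance is assumed to be the inverse of application runtime.

```python
def normalized_performance(baseline_runtime, scheme_runtime):
    """Speedup of a scheme relative to the baseline (higher is better).

    Assumes performance is inversely proportional to runtime.
    """
    if baseline_runtime <= 0 or scheme_runtime <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_runtime / scheme_runtime


def latency_reduction(baseline_latency, scheme_latency):
    """Fractional reduction in average packet latency versus the baseline."""
    if baseline_latency <= 0:
        raise ValueError("baseline latency must be positive")
    return 1.0 - scheme_latency / baseline_latency


if __name__ == "__main__":
    # Placeholder values only; they are not taken from the paper's results.
    print(normalized_performance(baseline_runtime=10.0, scheme_runtime=5.0))
    print(latency_reduction(baseline_latency=100.0, scheme_latency=60.0))
```

With these definitions, a scheme that halves the runtime of the baseline has a normalized performance of 2, and a scheme whose average packet latency is 60% of the baseline has a latency reduction of 40%.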

6.3 NoC Peak Power Analysis
Figure 12 compares the peak power dissipated in all our
different simulation schemes. The values obtained are normalized to the PV-free always NTC peak power, which is expected
to dissipate the least power. BoostNoC suffers from a 30% rise
in the peak power on average, due to the switchover to
BoPeL, which operates at the nominal voltage. A key observation from Figure 12 is the difference in peak power between
BoostNoC and drowsy BoostNoC schemes. By putting the idle
routers into a low power mode, we obtain modest improvements in peak power without significantly compromising the
performance of the NoC.
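
The trade-off between idle-router power savings and wake-up overhead can be illustrated with a toy model. This is a sketch under stated assumptions, not the drowsy-router policy evaluated here: the idle-timeout rule, the class name DrowsyRouter, and the values of idle_threshold and wakeup_cycles are all hypothetical.

```python
from dataclasses import dataclass

ON = "ON"
DROWSY = "DROWSY"


@dataclass
class DrowsyRouter:
    """Toy router that enters a low-power state after a run of idle cycles."""

    idle_threshold: int = 4  # hypothetical: idle cycles before going drowsy
    wakeup_cycles: int = 2   # hypothetical: penalty to return to the ON state
    state: str = ON
    idle_cycles: int = 0

    def tick(self, has_traffic: bool) -> int:
        """Advance one cycle and return the wake-up penalty (in cycles)."""
        if has_traffic:
            # Traffic arriving at a drowsy router pays the wake-up penalty.
            penalty = self.wakeup_cycles if self.state == DROWSY else 0
            self.state = ON
            self.idle_cycles = 0
            return penalty
        self.idle_cycles += 1
        if self.idle_cycles >= self.idle_threshold:
            self.state = DROWSY
        return 0


if __name__ == "__main__":
    router = DrowsyRouter()
    trace = [True, False, False, False, False, True]  # placeholder traffic
    total_penalty = sum(router.tick(t) for t in trace)
    print(total_penalty, router.state)
```

In this model, a router only pays a penalty when traffic arrives after a sufficiently long idle period, which is consistent with the small performance gap reported between BoostNoC and drowsy BoostNoC.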

6.4 NoC Energy Efficiency Analysis
Figure 13 compares the normalized energy efficiency of the
schemes. Performance delivered per watt is an appropriate measure for comparing the schemes, as it accounts
for both performance and power. Our analysis shows
that, though the performance of the always STC scheme is
significantly higher than other schemes, it is highly inefficient. The proposed BoostNoC provides a favorable trade-off
between power and performance, and hence surpasses conventional NTC architectures in energy efficiency by 25%. Drowsy BoostNoC further improves the energy efficiency over always NTC by 40%.
Water.sp has a low runtime and a low communication demand, limiting the duration of operation in BoPeL. These characteristics of water.sp prevent BoostNoC from improving its
energy efficiency. Similarly, the meager improvement in barnes
is due to its high compute and low communication attributes.

Figure 13: Energy efficiency of BoostNoC Architectures normalized to PV-free AlwaysNTC (Higher is better).

6.5 Design Overheads
The overheads due to the cost associated with switching
between layers are accounted for in our performance evaluations. BoostNoC sacrifices chip area to deliver better performance
and energy efficiency. The relative chip area increases by 1.16×
as BoostNoC consists of two architecturally homogeneous layers. However, the design overhead of the LC and the RC logic
is a mere 1.77% of the single layered NoC.

7. RELATED WORK

Two key challenges have prevented us from fully leveraging the potential of near threshold computing: (a) parametric variation and (b) performance loss [5]. To fully understand
the impact of PV and capture the increased sensitivity to PV
at NTC, researchers have developed microarchitectural PV
models [9]. Further, several innovative solutions, such as
the use of PV tolerant memory structures [13], use of multiple voltage-frequency domains [18] and computational core
pipeline weaving [10], among others, have been proposed to
tackle the challenges arising due to PV.
To reclaim the lost performance caused by the reduction
in operating frequency, contemporary works have proposed
circuit-architectural solutions, such as device optimization
by improving channel doping profile [5, 14] and clustered architectures [5, 18]. But the most intuitive approach has been
to increase the number of computational cores to exploit application parallelism [15, 18]. A direct consequence of this
approach has been the tremendous increase in the on-chip
communication demand. While a handful of previous works recognize this increase in communication demand [15], no previous
work tackles the performance bottleneck that arises as a result.
Our work in this paper is distinct in two ways. First, we
clearly demonstrate the performance bottleneck created by
sub-optimal NoC architectures in many-core NTC systems.
Second, we propose a power-efficient multi-layered NoC architecture, BoostNoC, to improve the power, performance
and energy efficiency of the NoC in many-core NTC systems.

8. CONCLUSION
In this paper, we demonstrate that on-chip communication creates a severe performance bottleneck in many-core
NTC systems. We therefore propose BoostNoC, a novel
power-efficient, multi-layered NoC architecture. BoostNoC
effectively switches between the FruPUL (power efficient) and
BoPeL (performance optimized) layers to boost the system performance by 2×. Our analysis shows that BoostNoC with drowsy
routers improves the energy efficiency of the NoC by 1.4×.

Acknowledgments
This work was supported in part by National Science Foundation grants (CNS-1117425, CAREER-1253024, CCF-1318826,
CNS-1421022, CNS-1421068). Any opinions, findings, and
conclusions or recommendations expressed in this material
are those of the authors and do not necessarily reflect the
views of the NSF.

9. REFERENCES
[1] BECKER, D. Open Source NoC Router RTL, August 2012.
[2] CARLSON, T. E. AND OTHERS Sniper: exploring the level of abstraction
for scalable and accurate parallel multi-core simulation. In Proc. of SC
(2011).
[3] CHANG, L. AND OTHERS Practical Strategies for Power-Efficient
Computing Technologies. Proceedings of the IEEE 98, 2, 215–236.
[4] CHEN, H. AND OTHERS Opportunistic turbo execution in NTC:
exploiting the paradigm shift in performance bottlenecks. In Proc. of
DAC (2015), pp. 63:1–63:6.
[5] DRESLINSKI, R. G. AND OTHERS Near-Threshold Computing:
Reclaiming Moore’s Law Through Energy Efficient Integrated Circuits.
Proceedings of the IEEE 98, 2 (2010), 253–266.
[6] FLAUTNER, K. AND OTHERS Drowsy Caches: Simple Techniques for
Reducing Leakage Power. In Proc. of 29th ISCA (2002), pp. 148–157.
[7] JANIDARMIAN, M. AND OTHERS Onyx: A new heuristic
bandwidth-constrained mapping of cores onto tile-based Network on
Chip. IEICE Electronics Express 6, 1 (2009), 1–7.
[8] JIANG, N. AND OTHERS A detailed and flexible cycle-accurate
Network-on-Chip simulator. In ISPASS (2013), pp. 86–96.
[9] KARPUZCU, U. R. AND OTHERS VARIUS-NTV: A microarchitectural
model to capture the increased sensitivity of manycores to process
variations at near-threshold voltages. In DSN (2012), pp. 1–11.
[10] KRIMER, E. AND OTHERS Synctium: a Near-Threshold Stream Processor
for Energy-Constrained Parallel Applications. IEEE Computer
Architecture Letters 9, 1 (Jan 2010), 21–24.
[11] LABS, I. The SCC Platform Overview, 24 May 2010.
[12] LANKES, A. AND OTHERS Hierarchical NoCs for Optimized Access to
Shared Memory and IO Resources. In Proc. of DSD (2009), pp. 255–262.
[13] MUKHOPADHYAY, S. AND OTHERS Low-Power and Process Variation
Tolerant Memories in sub-90nm Technologies. In 2006 IEEE International
SOC Conference (Sept 2006), pp. 155–159.
[14] PAUL, B. C. AND OTHERS Device optimization for ultra-low power
digital sub-threshold operation. In Proc. of ISLPED (2004), pp. 96–101.
[15] PINCKNEY, N. R. AND OTHERS Assessing the performance limits of
parallelized near-threshold computing. In DAC (2012), pp. 1147–1152.
[16] SARANGI, S. AND OTHERS VARIUS: A Model of Process Variation and
Resulting Timing Errors for Microarchitects. IEEE Trans. on
Semiconductor Manufacturing 21 (2008), 3–13.
[17] SEIFI, M. R., AND ESHGHI, M. Clustered NOC, a suitable design for
group communications in Network on Chip. Computers and Electrical
Engineering 38, 1 (2012), 82–95.
[18] SILVANO, C. AND OTHERS Voltage Island Management in Near
Threshold Manycore Architectures To Mitigate Dark Silicon. In DATE,
Dresden, Germany, March 24-28, 2014 (2014), pp. 1–6.
[19] SUN, C. AND OTHERS DSENT - A Tool Connecting Emerging Photonics
with Electronics for Opto-Electronic Networks-on-Chip Modeling. In
NOCS (2012), pp. 201–210.
[20] WESTE, N., AND HARRIS, D. CMOS VLSI Design: A Circuits and Systems
Perspective, 4th ed. Addison-Wesley Publishing Company, USA, 2010.
[21] WOO, S. C. AND OTHERS The SPLASH-2 programs: Characterization
and Methodological Considerations. In ISCA (1995), ACM, pp. 24–36.
[22] ZHAO, W., AND CAO, Y. Predictive Technology Model, June 2012.

