Opportunistic Turbo Execution in NTC: Exploiting the
Paradigm Shift in Performance Bottlenecks
Hu Chen, Dieudonne Manzi, Sanghamitra Roy, Koushik Chakraborty
BRIDGE Lab, Electrical and Computer Engineering, Utah State University
hu.chen@aggiemail.usu.edu, manzidieudonne@yahoo.com,
{sanghamitra.roy, koushik.chakraborty}@usu.edu
ABSTRACT
In this paper, we investigate an intriguing shifting trend in
performance bottlenecks for Near-Threshold Computing (NTC)
processors. Our study demonstrates that the traditional memory latency bottleneck is largely superseded by the bottlenecks of Long Latency Datapaths (LLDs) within a processor
core. To exploit this paradigm shift, we propose Opportunistic Turbo Execution (OTE). OTE dynamically boosts the performance of LLDs, by several factors, improving both performance and energy efficiency in an NTC core. Using a comprehensive circuit-architectural analysis, we demonstrate a
42.2% improvement in energy efficiency over a recently proposed technique, across a range of benchmarks.

1. INTRODUCTION

Near-Threshold Computing (NTC) has emerged as a promising direction to improve the energy-efficiency of integrated
circuits. In NTC, the supply voltage Vdd is set marginally above
the transistor threshold voltage Vth. NTC exploits the super-linear
Vdd-power relationship to reduce the system energy consumption. Inspired by this promise of better energy efficiency,
many recent works have explored key challenges in NTC,
such as recovering performance degradation through concurrency and tackling increased process variation [8, 11, 15].
While these works embody progress in NTC designs, several traditional design practices in Super-Threshold Computing (STC) processors are poised to be fundamentally altered
in the NTC regime.
In this paper, we show that the shift from STC to NTC
can redefine the performance bottlenecks within a processor core. Migrating into the NTC region leads to a 10-100X
processor frequency reduction [8, 15]. The memory, on the
other hand, is expected to operate in the STC region, as the
combined memory traffic from parallel threads remains comparable to STC [19]. Consequently, one of the fundamental
aspects of modern computer systems—the growing gap between processor and memory speed—is now poised to reverse its direction.
We show in this paper that the primary performance
bottlenecks in NTC processors shift from the memory to the Long
Latency Datapaths (LLDs) within the core. One class of LLDs in
a modern microprocessor is the multi-cycle functional units
(MFUs), such as integer multiply/divide. Using a rigorous
analysis, we demonstrate that the relative performance
impact from this LLD class grows by 161X, compared to
memory latency, as we transition from STC to NTC. To
exploit this shift, we propose opportunistic turbo execution
(OTE), a dynamic technique to improve the energy
efficiency of NTC processors. Fundamentally, OTE
dynamically speeds up LLDs by 2-5X to improve
performance, while also reducing leakage energy, the major
source of energy consumption in NTC systems. While
conceptually intriguing, the OTE technique is not feasible in
STC, as an STC pipeline already operates near its minimal
delay region [11]. Consequently, OTE is fundamentally
distinct from the frequency overscaling (10-20%) in some
STC processors [1, 3].
Several works have focused on improving the energy efficiency of NTC circuits. One such recent technique is Super-pipeline, which reduces the energy consumption in the NTC region [17]. By using a deeper pipeline, Super-pipeline moves
the circuit to the dynamic energy dominated region, thereby
reducing the leakage energy. However, this circuit-level technique ignores the performance bottlenecks in an NTC processor that arise at the architectural layer. Lacking this knowledge can be detrimental to the energy efficiency of the Super-pipeline technique. We demonstrate compelling advantages
over Super-pipeline through a cross-layer circuit-architectural
analysis, where we identify and opportunistically ameliorate
the performance bottlenecks in the NTC region.
We make the following contributions in this paper.

• Using a cross-layer methodology combining the architecture, circuit and device layers, we find that the performance
bottleneck shifts from the memory to the LLDs within the
core, as we move from STC to NTC (Section 2).
• We propose OTE, a technique geared for NTC processors. OTE dynamically boosts the LLDs by 2-5X to exploit the shifting trend in performance bottlenecks in NTC
processors (Section 3). Compared to the recently proposed
Super-pipeline technique [17], OTE gives a 42.2% improvement in the NTC energy efficiency. Using synthesis followed by place and route of an ARM processor core
augmented with OTE, we observe marginal overheads in
power (1.2%), wire length (0.98%), and area (0.37%) (Section 5).

2. MOTIVATION
In this section, we explore shifting trends in performance
bottlenecks in a processor as we move from STC to NTC. Using a cycle-accurate simulation methodology (Section 2.1), we
show that LLDs are poised to become the primary performance bottlenecks in NTC processors, replacing the memory
(Section 2.2). Then, we present the opportunity for OTE to
improve the energy efficiency of NTC cores (Section 2.3).

Figure 1: Shifting trends in CPU performance sensitivity from the STC to the NTC regime.

Figure 2: LLD speedup over memory speedup.

2.1 Methodology
Estimating the performance bottlenecks in STC and NTC
presents a methodological challenge. We briefly outline our
cross-layer approach here, while presenting more details in
Section 4. We model a processor core similar to the ARM
Cortex A15 processor [2] on the gem5 simulator [5]. We next
profile 15 SPEC CPU2006 FP benchmarks [4] to investigate
the processor performance bottleneck in both STC and NTC
regions. We empirically evaluate the performance impact
of a 4X speedup in memory access and, separately, in one class
of LLDs, the MFUs. The SimPoint tool [18] is employed to pick
the representative phases of each benchmark. We assume
the STC and NTC processors are clocked at 2GHz and 100
MHz [16], respectively. The memory model is given in Table 1, whereas the MFUs in the processor have latencies of
3 to 33 cycles.

2.2 Performance Bottlenecks in NTC
2.2.1 Comparative Speedups
Figure 1 shows the average sensitivity of processor performance to both memory and LLD speedup, for STC and NTC
regions, respectively. We find that the processor performance
is highly sensitive to memory speedup in STC, but nearly
unresponsive to it in NTC. On the other hand, performance
sensitivity to LLD speedup is much higher in NTC than in
STC. Figure 2 shows this bottleneck transition
from STC to NTC more clearly. We present the geometric mean of the relative performance sensitivity of LLD speedup over memory
speedup in the figure. The relative performance sensitivity in
STC is 0.34, on average. In the NTC region, this sensitivity
grows by 161X, reaching 54.8 on average.

Figure 3: Core Energy Breakdown For NTC.
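As a quick check of the arithmetic in Section 2.2.1, the relative sensitivity is the ratio of the gain from LLD speedup to the gain from memory speedup, averaged across benchmarks with a geometric mean. The sketch below recomputes the growth factor from the two reported means. The helper names are illustrative assumptions, and no per-benchmark data is reproduced here.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive values (used to average per-benchmark ratios)."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

def relative_sensitivity(lld_gain, mem_gain):
    """Ratio of the performance gain from LLD speedup to that from memory speedup."""
    return lld_gain / mem_gain

# Geometric means of the relative sensitivity reported in the text.
STC_RELATIVE = 0.34
NTC_RELATIVE = 54.8

growth = NTC_RELATIVE / STC_RELATIVE
print(f"growth from STC to NTC: {growth:.0f}X")
```

The printed factor rounds to the 161X figure quoted in the text.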

2.2.2 Significance
The transition we observe above could inaugurate a new epoch in computer system design. A vast number
of works in the past few decades have been devoted to mitigating the processor-memory performance gap. In the NTC
era, however, this historic bottleneck will give way to other
performance barriers lying inside the processor datapath, such as
the LLDs shown above. Hence, it is now critical to understand and mitigate the emerging bottlenecks in upcoming NTC processors. Our proposed technique, opportunistic
turbo execution (OTE), detailed in Section 3, aims to mitigate
this bottleneck by dynamically boosting LLDs for turbo execution. A key research question is: while OTE can deliver
a substantial performance boost in an NTC processor, how does it
impact the processor energy consumption, as well as the overall
energy efficiency of the system?

2.3 Energy Efficiency Perspective of OTE
Figure 3 presents the energy breakdown (dynamic and leakage energy) of the ARM Cortex core, with and without OTE
boost. This data has been collected using an elaborate circuit-architectural methodology, which combines architectural simulation with circuit level synthesized hardware and device
level HSPICE modeling, as outlined in Section 4. We find that
in addition to the performance gains in Figure 1, OTE has the
potential to improve energy consumption in the NTC region
by trading leakage energy for dynamic energy.

3. OPPORTUNISTIC TURBO EXECUTION
In this section, we present an overview of the proposed
OTE technique (Section 3.1), the hardware support for OTE
(Section 3.2), pipeline support for variable latency in LLDs
(Section 3.3), and our dynamic OTE algorithm (Section 3.4).

3.1 OTE Overview
Figure 4 shows an overview of the OTE technique in a
microprocessor core. The major augmentations of a microprocessor pipeline to support the normal and boosted mode
for the MFUs include: voltage regulators (VR), power gates
(PG), level shifters (LS), and the OTE controller that dynamically switches between the normal and the boosted mode.
Collectively, these components work in harmony to boost the
energy efficiency of a processor operating in the NTC region.
While we focus on boosting only one class of LLDs (MFUs),
our OTE technique can be applied to other LLDs within the
core as well (e.g., register file).

Figure 4: OTE Overview: the ARM Cortex A15 processor
pipeline is shown, along with the major augmentations for OTE.

3.2 Supporting Two Operating Modes
We next outline the details of our voltage regulator and
level shifters to support two operating modes.
Voltage Regulator: We employ two power supply rails [12]
with off-chip Voltage Regulators (VR) to provide dual supply
voltages in our design. Vdd_H in Figure 4 is the supply voltage
to support the OTE mode, in addition to the already existing
voltage Vdd_L for the normal mode. We use Vdd_H and Vdd_L
values of 0.6V and 0.35V to achieve a 4X OTE boost at the 32
nm technology node. Depending on the decision taken by
the OTE controller (details in Section 3.4), the supply voltage switches between the two rails. Using a transition-time
test setup similar to [12], we observe that the switch
between the normal mode and the OTE (4X) mode completes within one cycle (10ns) in the NTC processor.
Level Shifter: We employ the level shifter proposed in [13] to
support the MFU under Vdd_H. The level shifter is a 24 transistor circuit that allows shifting between two voltage levels,
Vdd_L and Vdd_H. It is controlled by select signals generated
by the OTE controller (Figure 4). This level shifter consumes
a small portion of the cycle time in our design [13].

3.3 Dealing with Variable Latency MFU
Enabling OTE results in variable latencies in the target LLDs
(MFUs). Therefore, the pipeline micro-architecture must be
augmented to allow seamless switchover between the normal and the OTE mode. We discuss two central aspects of
this necessary modification: (a) MFU Modification and (b)
Instruction Level Dependency Tracking.
MFU Modification: MFUs are typically pipelined in stages.
During the OTE mode, we dynamically collapse the number
of pipe stages of an MFU, boosting its performance. To achieve
this goal, we add a MUX circuit between two consecutive
MFU stages. This MUX is controlled by a single-bit select
indicating the OTE mode. In the OTE mode, the MUX redirects the output of one stage directly into the input of the
next stage, bypassing the intervening pipeline register. In the
normal mode, this redirection is turned off, so that
only the pipeline register drives the input of the next stage.
Instruction Level Dependency Tracking: Modern microprocessor pipelines are already equipped with the necessary logic
to deal with variable latency operations [14]. When an instruction is scheduled on an execution unit, the issue queue
logic tracks its expected completion time. For an MFU, a countdown register is used to track its completion time. During
the OTE mode, we modify all countdown registers of target
MFUs to update their latencies. For example, an MFU that
takes three cycles in the normal mode will complete within
a single cycle under OTE (3X). At the end of this completion
time, a wakeup signal is broadcast with the corresponding
tag for that instruction. Dependent instructions can then do
a tag-match and grab the output, enabling correct dataflow
within the pipeline.

3.4 OTE Controller

The OTE controller is responsible for dynamically switching between the normal and the OTE modes. A few key
questions the OTE controller needs to decide at runtime are:
(a) how much to boost the MFU (Section 3.4.1)? (b) when
to switch from the normal to the OTE mode and vice versa
(Section 3.4.2)? and (c) how to preserve functional correctness during the switch-over (Section 3.4.3)? These decisions
involve a tradeoff in energy efficiency benefits and the associated overheads, outlined next.

3.4.1 Effective Choice of OTE Boost
The ideal OTE boost can vary across different workloads.
For example, in Figure 3, a 4X OTE boost is desirable for
benchmark dealII due to its associated energy reduction. On
the other hand, a 4X boost increases the energy for benchmark namd as the reduction in leakage energy is superseded
by the increase in dynamic energy. However, implementing
multiple performance boosts for different applications can
involve a high overhead in managing multiple power rails
and level shifters. Additionally, finding a minimal energy-efficiency point at runtime is impractical for real applications.
Instead, we use a single OTE boost for the MFUs and decide
the boost voltage at design time. In Section 5.2, we explore
different OTE boosts to guide our choice.
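One way to picture the design-time choice is to select the single boost with the lowest average EDP across workloads. The sketch below is a minimal illustration under stated assumptions: the function name `pick_boost`, the use of a geometric mean as the selection criterion, and all numbers in `example` are hypothetical placeholders, not measured results.

```python
import math

def pick_boost(edp_by_boost):
    """Return the boost whose geometric-mean normalized EDP is lowest.

    edp_by_boost maps a boost factor to a list of per-benchmark EDP values,
    each normalized to the un-boosted baseline (lower is better).
    """
    def gmean(xs):
        return math.exp(sum(math.log(x) for x in xs) / len(xs))
    return min(edp_by_boost, key=lambda boost: gmean(edp_by_boost[boost]))

# Placeholder numbers for illustration only; they are NOT measured results.
example = {
    2: [0.80, 0.95],
    4: [0.50, 1.05],
    5: [0.60, 1.20],
}
print(pick_boost(example))
```

A single workload may still prefer a different boost (the second placeholder column gets worse with boost), which is the tradeoff a single design-time setting accepts.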

3.4.2 Dynamic Control of OTE
Figure 1 shows that there is substantial variation in the advantage from speeding up the MFUs across different benchmarks. Even within a single benchmark, different phases can
vary on their respective performance sensitivities to OTE. To
effectively capture both these design aspects and improve

the energy efficiency, we explore dynamic control of OTE.
The fundamental insight of dynamic control is to exploit the
benchmark phase behavior at runtime. When a particular
phase is exhibiting heavy utilization of the MFUs, the OTE
controller switches from a normal mode to OTE. On the other
hand, low utilization of the MFUs results in switchover to
the normal mode. The OTE controller captures these utilization at runtime by employing the performance counters in
the pipeline, which are already present in modern microprocessors. We explore two variants of dynamic OTE, covering
a range of the design space.

• ST: We use a single threshold for the number of cycles
that the MFU is active in a given epoch¹. If the utilization exceeds this threshold in a given epoch, the execution switches to OTE in the next epoch. Otherwise, we
disable OTE in the next epoch.
• HL: In this scheme, we define two thresholds indicating low and high watermarks, respectively. When the
utilization exceeds the high watermark, the execution
switches to OTE. However, unlike ST, we keep OTE
enabled even when the utilization falls below the high watermark. Only when the utilization drops below the low watermark do we disable OTE for the next epoch. This scheme
essentially aims to reduce the number of transitions
and their associated overheads.
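The two variants can be sketched as one epoch-based controller, where ST is the special case of HL with zero spread between the watermarks. This is a minimal sketch, not our hardware implementation: the class and field names are illustrative, the starting mode is assumed to be normal, and the utilization trace is hypothetical. The 5M-cycle threshold and the 10% spread follow the values used in Section 5.3.

```python
from dataclasses import dataclass

@dataclass
class OTEController:
    """Epoch-based OTE mode selection (ST when spread == 0, HL otherwise)."""
    threshold: int = 5_000_000   # MFU-active cycles per epoch (Section 5.3)
    spread: float = 0.0          # 0.10 gives HL(10%), 0.20 gives HL(20%)
    ote_on: bool = False         # assumption: execution starts in the normal mode

    def next_epoch(self, mfu_active_cycles: int) -> bool:
        """Decide the mode for the next epoch from this epoch's MFU utilization."""
        high = self.threshold * (1 + self.spread)
        low = self.threshold * (1 - self.spread)
        if mfu_active_cycles > high:
            self.ote_on = True
        elif mfu_active_cycles < low:
            self.ote_on = False
        elif self.spread == 0.0:
            self.ote_on = False  # ST: not above the threshold, so disable OTE
        # HL: between the watermarks, keep the current mode (hysteresis)
        return self.ote_on

# Hypothetical per-epoch MFU-active cycle counts.
trace = [6_000_000, 5_200_000, 4_800_000, 4_000_000]
st = OTEController()
hl = OTEController(spread=0.10)
st_modes = [st.next_epoch(c) for c in trace]
hl_modes = [hl.next_epoch(c) for c in trace]
print(st_modes)
print(hl_modes)
```

On this trace, ST leaves OTE as soon as utilization dips under 5M cycles, while HL(10%) holds OTE until utilization falls under the 4.5M low watermark.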

3.4.3 Preserving Functional Correctness
A particular challenge during a mode switch arises because of alteration in the expected latencies in the MFUs (see
Section 3.3). To maintain functionally correct execution of
all instructions, we initiate a pipeline flush during the mode
switch. Before instructions are introduced in the pipeline,
the OTE controller modifies all the countdown registers for
MFUs to reflect their respective latencies after the switch.
Subsequently, instructions are fetched and regular processing resumes. We carefully model all circuit-architectural penalties of these aspects in our evaluation (Section 5).
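The switch-over sequence (flush, rewrite countdown registers, resume fetch) can be sketched with a toy pipeline model. Everything here is an illustrative assumption rather than our RTL: the `Pipeline` class, the MFU names, and in particular the rounding rule in `ote_latency`, which rounds up and never goes below one cycle.

```python
import math

def ote_latency(normal_cycles: int, boost: int) -> int:
    """Latency under an n-X boost, never below one cycle (rounding up is an assumption)."""
    return max(1, math.ceil(normal_cycles / boost))

class Pipeline:
    """Toy stand-in for the core: only the state needed for a mode switch."""

    def __init__(self, normal_latencies):
        self.normal = dict(normal_latencies)     # MFU name -> cycles in normal mode
        self.countdown = dict(normal_latencies)  # latencies the issue logic currently uses
        self.in_flight = []                      # instructions currently in the pipeline
        self.ote_mode = False

    def switch_mode(self, enable_ote: bool, boost: int = 4) -> None:
        if enable_ote == self.ote_mode:
            return                               # no transition, so no flush
        self.in_flight.clear()                   # 1. flush the pipeline
        for mfu, cycles in self.normal.items():  # 2. rewrite the countdown registers
            self.countdown[mfu] = ote_latency(cycles, boost) if enable_ote else cycles
        self.ote_mode = enable_ote               # 3. fetch resumes in the new mode

# Hypothetical MFU names; 3 and 33 cycles are the ends of the stated latency range.
core = Pipeline({"imul": 3, "fdiv": 33})
core.in_flight = ["i0", "i1"]
core.switch_mode(True)
print(core.countdown, core.in_flight)
```

Flushing before the registers change means no in-flight instruction is tracked with a stale latency, at the cost of refilling the pipeline on every transition.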

4. METHODOLOGY

Figure 5 shows our extensive cross-layer methodology to
evaluate our energy efficient NTC processor design. In this
pursuit, we combine SPICE level energy characteristics of
NTC and STC circuits in the device layer, synthesis and place
and route based energy analysis for processor components
in the circuit layer, and architectural simulation based power
performance analysis in the architecture and application layers. We next provide the methodology details for each layer:

4.1 Architecture Layer
We model an out-of-order processor core similar to the ARM
Cortex A15 processor [2]. Table 1 shows the processor configuration. The MFUs in the processor have latencies of 3 ∼ 33
cycles. To model OTE, we scale these latencies down by a
factor of n, the speedup of OTE, to as low as a single cycle. We run 15 SPEC CPU2006 FP benchmarks [4] on the
gem5 simulator [5]. For each benchmark, we skip the first
1 billion instructions to avoid the initialization period. Subsequently, we run each benchmark to its completion or 10
billion instructions, whichever happens earlier. Each benchmark is divided into epochs (epoch size = 10 million instructions) and OTE decisions are taken once per epoch. To obtain
core power distribution, we integrate our architectural simulation data with the McPAT tool [9]. McPAT uses technology
parameters representing the 32nm node.

¹ A defined period of execution.

4.2 Device Layer
We customize the PTM 32nm technology model card for
HSPICE to generate the leakage and dynamic power trends
in the STC and the NTC regions [22]. We obtain power characteristics for basic gates like nand, nor, inverter, flip-flop,
and also for a 31 fanout-of-4 inverter-chain and 6T and 10T
SRAM cells, as shown in Figure 5.

4.3 Circuit Layer
To obtain the hardware overhead of our scheme, we add
the major augmentations for OTE (Figure 4) to the RTL of a
FabScalar core modeled for an ARM A15 processor [7]. We
then synthesize this core using the Synopsys Design Compiler (DC) and a 45 nm reduced standard cell library using
only basic gates. We subsequently perform place and route
with the Cadence Encounter tool to get a more accurate estimate of the hardware overhead including the additional
power rail. To obtain power characteristics for the 32nm technology node, we feed the power from our device simulation
of basic gates to the synthesized netlist.

4.3.1 STC-to-NTC Power Scaling
Scaling the entire core power from the STC to the NTC region presents a methodological challenge. HSPICE simulation of an entire processor core is computationally intensive. To
manage the complexity, we adopt several steps. First, using
the core power distribution from McPAT in the architecture
layer, we estimate the relative power consumed by the processor components as shown in Figure 5. Second, we scale
the STC power to NTC using the following three categories:

• Combinational logic: this is scaled using the STC/NTC characteristics of the canonical 31 fanout-of-4 inverter-chain as
the representative circuit [15].
• Storage elements: we scale the on-chip SRAM power by investigating the power scaling trend from the STC 6T SRAM
cell to the NTC-friendly 10T SRAM cell [6, 20].
• Interconnect: McPAT does not give the power results for
the interconnect within the core. Following previous work [10], we estimate the interconnect power to be
50% of the core dynamic power. As both the interconnect
power and the core dynamic power are equally affected by
scaling the supply voltage, we assume that their relative
weight remains unchanged for STC and NTC.
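The three categories above can be combined as in the following sketch. It is a simplification under stated assumptions: it handles dynamic power only, the component-to-category mapping is hypothetical, and the power values and NTC/STC ratios in the usage lines are placeholders rather than results from our HSPICE or McPAT runs.

```python
# Category of each core component (an illustrative, hypothetical mapping).
CATEGORY = {
    "alu": "logic",
    "decoder": "logic",
    "regfile": "storage",
    "l1_cache": "storage",
}

def ntc_core_power(stc_component_power, ntc_over_stc, interconnect_share=0.5):
    """Scale per-component STC dynamic power to NTC and add interconnect power.

    stc_component_power: component -> STC dynamic power (e.g. from McPAT)
    ntc_over_stc: category -> NTC/STC power ratio from device-level simulation
    interconnect_share: interconnect power as a fraction of core dynamic power
    """
    scaled = {
        name: power * ntc_over_stc[CATEGORY[name]]
        for name, power in stc_component_power.items()
    }
    core_dynamic = sum(scaled.values())
    scaled["interconnect"] = interconnect_share * core_dynamic
    return scaled

# Placeholder inputs for illustration only; they are NOT measured values.
stc = {"alu": 2.0, "decoder": 1.0, "regfile": 1.0, "l1_cache": 4.0}
ratios = {"logic": 0.01, "storage": 0.02}
ntc = ntc_core_power(stc, ratios)
print(ntc)
```

Because the interconnect is modeled as a fixed share of core dynamic power, it inherits whatever scaling the logic and storage categories receive.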

Figure 5: Our Cross-Layer Methodology.

Table 1: The Configuration of the ARM Cortex NTC core.

Parameter          Value
ISA                ARM
Frequency          100MHz
Re-Order Buffer    128 entries
Issue Queue        64 entries
Fetch/Dispatch     3/cycle
Issue/Commit       8/cycle
Pipeline Depth     15
Cacheline          64 Bytes
L1 I-cache         32 KB/2-way, 1-cycle
L1 D-cache         32 KB/2-way, 2-cycle
L2 cache           1 MB/16-way, 12-cycle
Memory             B/W: 4GB/s, Latency: 22.5 ∼ 37.5 ns

5. EXPERIMENTAL RESULTS
In this section, we present a comprehensive analysis of our
proposed OTE in a typical NTC processor. We briefly outline
the comparative schemes (Section 5.1), an empirical study
of the OTE boost (Section 5.2), an exploration of dynamic OTE configurations (Section 5.3), an analysis of performance and energy
efficiency (Section 5.4), and the overheads and limitations of
our proposed schemes (Section 5.5).

5.1 Comparative Schemes
• Super-pipeline: This scheme uses a deeper pipeline to transfer the circuit to the dynamic energy dominated region,
thereby reducing the leakage energy [17]. Seok et al. showed
a 50% depth increase in a single pipe stage. To capture the
maximum benefit from Super-pipeline, we optimistically
model a 50% increase in pipeline depth in the entire processing core.
• Fixed-OTE: This is the static (always-on) OTE scheme.
• Dynamic-OTE: In this scheme, OTE is dynamically controlled as described in Section 3.4.2.

5.2 Choice of OTE Boost
Figure 6 presents our design space exploration with four
possible OTE boosts for the Fixed-OTE scheme. We notice
a trend of diminishing improvements in energy efficiency
as we increase the boost strength to 4X. Eventually, at 5X
OTE boost, there is a degradation in energy efficiency, as the
power consumption to support a 5X boost outweighs the performance gain it delivers. We choose a 4X boost at design time.

Figure 6: EDP Results of Four OTE boosts (Lower is better).

5.3 Dynamic OTE Configuration
Figure 7 presents the EDP results for three variants of the
Dynamic-OTE scheme (see Section 3.4.2). In all schemes, we
monitor the runtime utilization of LLDs for an epoch length
of 10M instructions. For the ST mode, we switch to OTE
when this utilization is above 5M cycles. For HL(10%) and
HL(20%), we set the high and low watermarks 10% and
20%, respectively, above and below 5M cycles. We observe
that ST and HL(10%) perform comparably, but a larger spread
in watermarks degrades the energy efficiency of HL(20%).
Henceforth, we use the ST configuration of Dynamic-OTE.

Figure 7: EDP Results of Three Dynamic-OTE Schemes (OTE
boost=4X) (Lower is better).

5.4 Performance and Energy Efficiency
Figures 8 and 9 show the performance improvement and
energy reduction of all three schemes outlined in Section
5.1. These are calculated with respect to a Baseline NTC core
having no OTE or Super-pipelining. We observe that our
proposed schemes are substantially more effective at driving
performance gains, as they can exploit the emerging architectural bottlenecks in the NTC processor. For example, the performance of Dynamic-OTE is 96.1%-178% higher than Super-pipeline, across all the benchmarks. This comparatively poor
result in Super-pipeline stems from the fact that the gain in
clock frequency afforded through it cannot efficiently exploit
the bottleneck from LLDs. Our schemes also offer substantially higher energy reduction, reducing the leakage energy more efficiently by optimizing the delay in LLDs. The
average energy reductions of Fixed-OTE and Dynamic-OTE
are 20.7X and 31.3X, respectively, compared to Super-pipeline.
Collectively, gains in both performance and energy consumption lead to a significant improvement in energy efficiency of the system. Figure 10 shows the comparison of energy efficiency using the energy-delay product (EDP) as the metric.
We observe that the EDP of Fixed-OTE and Dynamic-OTE
are 41.7% and 42.2% lower than the Super-pipeline scheme.
However, Super-pipeline also outperforms our OTE schemes
in some benchmarks like namd. The namd benchmark fares worse under OTE, as it is not sensitive to the increase in pipeline latency and makes little use of the MFUs.

Figure 8: Performance Improvement (Higher is better).

Figure 9: Energy Reduction (Higher is better).

Figure 10: Energy-Efficiency Comparison (Lower is better).

5.5 Overhead and Limitation
The on-chip overhead of our schemes comes from two sources:
control for the LLDs and control for the pipeline (Section
3). Using the circuit methodology detailed in Section 4.3, we
perform synthesis followed by place and route of the ARM
core augmented with our OTE scheme. We observe marginal
overheads in power (1.2%), wire length (0.98%), and area
(0.37%).
Efficacy: The efficacy of the OTE scheme relies on the VR
efficiency. Figure 11 illustrates that the normalized EDP of
the Dynamic-OTE scheme deteriorates from 51.6% to 57.5%
as the VR efficiency decreases from 90% to 50%.

Figure 11: Impact of Voltage-Regulator Efficiency on the EDP
(normalized to the baseline) achieved by Dynamic-OTE.
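For reference, the energy-delay product (EDP) comparisons in this section reduce to simple arithmetic, sketched below. The function names are illustrative assumptions, and the energy and delay values in the usage lines are placeholders rather than measured results.

```python
def edp(energy, delay):
    """Energy-delay product; lower is better."""
    return energy * delay

def edp_reduction_percent(scheme_edp, reference_edp):
    """How much lower (in percent) a scheme's EDP is than a reference scheme's."""
    return 100.0 * (1.0 - scheme_edp / reference_edp)

def normalized_edp(scheme_edp, baseline_edp):
    """EDP expressed as a fraction of the baseline core's EDP."""
    return scheme_edp / baseline_edp

# Placeholder values for illustration only; they are NOT measured results.
baseline = edp(energy=1.0, delay=1.0)
scheme = edp(energy=0.8, delay=0.7)
print(f"normalized EDP: {normalized_edp(scheme, baseline):.2f}")
print(f"EDP reduction: {edp_reduction_percent(scheme, baseline):.1f}%")
```

Because EDP multiplies energy by delay, a scheme that improves both at once compounds the two gains, which is why modest individual improvements can yield a large EDP reduction.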

6. RELATED WORK

Several existing works have explored the principles of the
NTC region. Dreslinski et al. have investigated the opportunities and challenges, as well as possible solutions, associated with the NTC era [8]. Marković et al. [11] outlined
a new design philosophy for the NTC era, where Vdd will
be the most effective parameter to adjust the device delay in
the NTC region. Other works have shown the dominance
of leakage, while exploring throughput centric computing in
NTC [15, 21]. Seok et al. [17] proposed super-pipelining to
improve the energy efficiency of small-scale NTC circuits,
primarily driven by a circuit layer analysis. On the other
hand, Wang et al. [19] explored platform level energy reduction
in an NTC system, but did not consider the core-level energy-efficiency interplay. Our work in this paper is distinct in two
ways. First, we demonstrate a paradigm shift in performance
bottlenecks—the rapidly growing importance of LLDs compared to memory—as we transition from an STC to an NTC
era. Second, we propose a cross-layer technique—OTE—to
improve the energy efficiency in an NTC processor, combining architectural and circuit layer characteristics.

7. CONCLUSION

We identify a shifting trend in performance bottlenecks
in a microprocessor pipeline as we transition from STC to
NTC. To exploit this intriguing change, we propose OTE, which
dynamically boosts long-latency datapaths in the processor
pipeline. Our rigorous circuit-architectural analysis demonstrates a 42.2% improvement in energy efficiency over a recently proposed technique, across a range of benchmarks.

Acknowledgments
This work was supported in part by National Science Foundation grants (CNS-1117425, CAREER-1253024, CCF-1318826,
CNS-1421022, CNS-1421068). Any opinions, findings, and
conclusions or recommendations expressed in this material
are those of the authors and do not necessarily reflect the
views of the NSF.

8. REFERENCES

[1] AMD Turbo Core Technology.
http://www.amd.com/en-us/innovations/software-technologies/turbo-core.
[2] ARM Cortex-A Series.
http://www.arm.com/products/processors/cortex-a/.
[3] Intel Core-i7 Processors.
http://www.intel.com/content/www/us/en/processors/core/core-i7-processor.html.
[4] SPEC CPU2006 benchmarks. http://www.spec.org/cpu2006/.
[5] BINKERT, N., AND OTHERS. The gem5 simulator. SIGARCH Computer
Architecture News 39, 2 (Aug. 2011), 1–7.
[6] CALHOUN, B., AND CHANDRAKASAN, A. A 256-kb 65-nm
sub-threshold SRAM design for ultra-low-voltage operation. In JSSC
(March 2007), vol. 42, pp. 680–688.
[7] CHOUDHARY, N. K., AND OTHERS. FabScalar: composing synthesizable
RTL designs of arbitrary cores within a canonical superscalar template.
In Proc. of ISCA (2011), pp. 11–22.
[8] DRESLINSKI, R. G., AND OTHERS. Near-Threshold Computing:
Reclaiming Moore's Law Through Energy Efficient Integrated Circuits.
Proceedings of the IEEE 98, 2 (2010), 253–266.
[9] LI, S., AND OTHERS. McPAT: An integrated power, area, and timing
modeling framework for multicore and manycore architectures. In
Proc. of MICRO (2009), pp. 469–480.
[10] MAGEN, N., AND OTHERS. Interconnect-power dissipation in a
microprocessor. In Proc. of SLIP (2004), pp. 7–13.
[11] MARKOVIC, D., AND OTHERS. Ultralow-Power Design in Near-Threshold
Region. Proceedings of the IEEE 98, 2 (2010), 237–252.
[12] MILLER, T. N., AND OTHERS. Booster: Reactive core acceleration for
mitigating the effects of process variation and application imbalance in
low-voltage chips. In HPCA (2012), pp. 1–12.
[13] MOHANTY, S. P., AND PRADHAN, D. K. ULS: A dual-Vth/high-kappa
nano-CMOS universal level shifter for system-level power management.
JETC 6, 2 (2010).
[14] PATTERSON, D. A., AND HENNESSY, J. L. Computer Organization and
Design, 4th ed. Morgan Kaufmann, 2009.
[15] PINCKNEY, N. R., AND OTHERS. Assessing the performance limits of
parallelized near-threshold computing. In DAC (2012), pp. 1147–1152.
[16] PU, Y., AND OTHERS. Misleading energy and performance claims in
sub/near threshold digital systems. In Proc. of ICCAD (2010),
pp. 625–631.
[17] SEOK, M., AND OTHERS. Pipeline strategy for improving optimal energy
efficiency in ultra-low voltage design. In DAC (2011), pp. 990–995.
[18] SHERWOOD, T., AND OTHERS. Automatically characterizing large scale
program behavior. In ASPLOS (2002), pp. 45–57.
[19] WANG, H., AND OTHERS. Improving platform energy: chip area trade-off
in near-threshold computing environment. In Proc. of ICCAD (2013),
pp. 318–325.
[20] WESTE, N., AND HARRIS, D. CMOS VLSI Design: A Circuits and Systems
Perspective, 4th ed. Addison-Wesley Publishing Company, USA, 2010.
[21] ZHAI, B., AND OTHERS. Theoretical and practical limits of dynamic
voltage scaling. In DAC (2004), pp. 868–873.
[22] ZHAO, W., AND CAO, Y. Predictive Technology Model, June 2012.

