EFFORT: Enhancing Energy Efficiency and Error
Resilience of a Near-Threshold Tensor Processing Unit

ABSTRACT
Modern deep neural network (DNN) applications demand a remarkable processing throughput that traditional Von Neumann architectures usually cannot meet. Consequently, hardware accelerators comprising a sea of multiply-and-accumulate (MAC) units have recently gained prominence in accelerating DNN inference engines. For example, Tensor Processing Units (TPUs) account for a lion's share of Google's datacenter inference operations. The proliferation of real-time DNN predictions is accompanied by a tremendous energy budget. In a quest to trim the energy footprint of DNN accelerators, we propose EFFORT, an energy-optimized yet high-performance TPU architecture operating in the Near-Threshold Computing (NTC) region. EFFORT promotes a better-than-worst-case design by operating the NTC TPU at a substantially higher frequency while keeping the voltage at the NTC nominal value. To tackle the timing errors caused by such aggressive operation, we employ an opportunistic error mitigation strategy. Additionally, we implement an in-situ clock gating architecture that drastically reduces the MACs' dynamic power consumption. Compared to a cutting-edge error mitigation technique for TPUs, EFFORT enables up to 2.5× better performance at NTC with only a 2% average accuracy drop across 3 out of 4 DNN datasets.

1. INTRODUCTION

Advancements in artificial intelligence have entered a new realm owing to the development of domain-specific architectures dedicated to neural network (NN) processing. The Tensor Processing Unit (TPU), a custom application-specific integrated circuit (ASIC) built by Google, is one such accelerator, built exclusively to handle most of the deep neural network (DNN) inference workloads in their servers. The rapidly increasing workloads call for an increase in the processing speed and deployment volume [8]. This, however, comes at the cost of heavy power usage, which affects the energy efficiency of the system. To preserve the energy efficiency, we operate the TPU in the Near-Threshold Computing (NTC) region, where the transistor supply voltage is scaled down to just above its threshold voltage [4].
Accelerators like TPUs are designed to offer a very high throughput for DNN inference workloads. Although NTC operating conditions can ensure low energy consumption, the throughput declines heavily due to the slower transistors and longer computational delays. As NTC systems are also highly sensitive to voltage and process variations, they fall victim to timing violations, which, in turn, can significantly affect the DNN inference accuracy [4, 9]. This paper underlines the significance of the computational delays and order of execution of the arithmetic units in handling timing violations in NTC TPUs. Additionally, by exploring the predictable data flow pattern in the TPU systolic array, we further enhance its energy efficiency, thereby promoting an error-resilient and energy-efficient TPU design paradigm.
Several timing error resilient schemes have been explored for CPUs and custom ASICs [5, 20, 22]. However, these schemes are inefficient at combating timing violations in TPUs. Razor is one such popular timing violation detection method, which uses a double-sampling flip-flop to detect errors [5]. Using instruction replay, the erroneous data is recomputed and the correct value is propagated to the next stage of the pipeline. The TPU has a massive systolic array of 256 × 256 multiplier-and-accumulate (MAC) units, so an instruction replay in one MAC unit stalls the operation of the entire systolic array, leading to a massive drop in throughput and an increase in energy consumption. TE-Drop is a recently proposed technique to handle timing violations in TPU-like systolic arrays. In this technique, a MAC unit encountering a timing error steals an execution cycle from its downstream MAC and recomputes the correct value [22]. In the process, the downstream MAC's computation is bypassed. However, timing errors in consecutive rows of the same column of MACs, in the same clock cycle, lead to multiple levels of bypassing. Bypassing multiple computations can cause a severe drop in the inference accuracy. Additionally, timing errors encountered in the last row of MACs are not tackled by TE-Drop, which also results in an accuracy drop. A naive approach to tackling timing violations is to allow the erroneous data to flow through the successive stages of operation [15, 20]. This technique underestimates the effects of the erroneous data in DNN computations, as a large number of timing errors causes a significant drop in the inference accuracy [6].
To overcome the drawbacks of these error handling schemes, we propose a unique timing error correction technique that handles timing errors in the same cycle of execution while enhancing the energy efficiency of the TPU. We observe that, in a MAC, the multiplier takes a relatively longer execution time than the accumulator (Section 2.3). Additionally, we observe a predictable data flow pattern in the TPU systolic array (Section 3.3). By analyzing these computational delays and data flow patterns, and by utilizing the computational order of the arithmetic units, we propose EFFORT, an error-resilient, low-power, novel TPU design paradigm. The specific contributions of this work are as follows:
• We experimentally demonstrate that an 8-bit multiplier takes a longer computation time than a 24-bit accumulator (Section 2.3). We exploit the computational delays and operational order of these arithmetic units to tackle timing errors.
• We also observe a predictable data flow pattern in the TPU and utilize this data flow pattern to reduce the energy consumption in the systolic array (Section 3.3).
• We propose a low-overhead dynamic timing error detection/correction and dynamic power management technique, called EFFORT (Section 3.2). Our technique detects the timing errors, obtains the corrected data, and propagates it to preserve the output accuracy. Additionally, we employ a low-overhead clock gating technique to improve the energy efficiency of the TPU (Section 3.3).
• In comparison to TE-Drop [22] and the Baseline-TPU, EFFORT delivers 2× better performance for 3 out of 4 DNN datasets, while incurring only a 2% loss in inference accuracy (Section 5.2).
• We show that EFFORT consumes up to 6% and 27% less power and gives up to 1.06× and 1.35× better performance per unit power than the Baseline-TPU and TE-Drop, respectively (Section 5.3).

Figure 1: (a) CDF of multiplier delay distribution. (b) CDF of accumulator delay distribution. (c) Power consumption of a MAC unit. The CDFs of the delay distributions for the multiplier (Figure 1(a)) and accumulator (Figure 1(b)) show that the multiplier has a higher computational delay than the accumulator. The accumulator takes less than half a clock cycle for its part of the computation. Figure 1(c) portrays the increase in static power and decrease in dynamic power for decreasing frequencies. Voltages are scaled accordingly to depict the shift from STC to NTC.

2. MOTIVATION

In this section, we illustrate the unseen opportunities that can be exploited to tackle timing errors in a TPU systolic array. Section 2.1 sheds light on the background of the TPU systolic array and the inherent opportunities that can be exploited for improved performance. Using the cross-layer methodology in Section 2.2, we investigate the MAC units' delay profiles. Section 2.3 elaborates on the significance of the results and establishes the groundwork for our timing error correction and dynamic power management scheme.

2.1 Background
2.1.1 DNN Accelerators
A DNN obtains its inference using multiple layers of computation. The outputs of the neurons in each layer are referred to as activation streams. An activation matrix is multiplied with the weight matrix in each layer. To accelerate the matrix multiplication, a systolic array of MAC units is employed in DNN accelerators [7]. The TPU, a DNN accelerator, uses a 256×256 systolic array of MACs. The weight matrices are pre-loaded into the MACs. The activation streams flow from left to right in consecutive clock cycles. The activation and weight matrices maintain an 8-bit integer precision, while the accumulator maintains a 24-bit integer precision.

2.1.2 Opportunities in a Systolic Array
The asymmetric delay distributions of the multiplier and accumulator blocks in a MAC unit open up a unique opportunity to tackle timing errors in a systolic array. The accumulate operation in a MAC adds the output of the upstream MAC to the output of its own multiplier block. Due to the relatively long computation time of the multiplication operation, the output from the upstream MAC has ample time to reach the current accumulate unit, presuming the synchronization takes place at the primary output of the MAC. By exploiting this available timing window, the correction of an erroneous operation at the upstream MAC can be overlapped with the multiplication operation of the current MAC, without paying any additional performance penalty.
The wavefront propagation of data in a systolic array leads to a static pattern of busy and idle phases. Such a predictable pattern creates an avenue to conserve the power of the idle MAC units. Next, we briefly discuss our experimental methodology, used to demonstrate these opportunities in the systolic array of an NTC TPU.

2.2 Methodology
We synthesize a multiplier and an accumulator unit at NTC using the 15-nm FinFET library from NanGate [16]. To model process variation (PV) at NTC for FinFETs, we use the VARIUS-NTV models [19]. For a conservative estimate, we consider PV-induced delays in a randomly chosen 2% of the gates in the circuit [17]. We use our in-house statistical analysis tool to investigate the delay distribution of the sensitized path for different inputs to the multiplier and accumulator units.

2.3 Results and Significance
Figures 1(a) and 1(b) show the delay distributions of the multiplier unit and accumulate unit, respectively. The multiplier unit is tested with all possible 8-bit activation values against all possible 8-bit weight values, which results in a total of 65536 16-bit output combinations. Each one of these 65536 outputs serves as one of the inputs to the accumulator, while the other input of the accumulator is fed with its own output from the previous cycle.
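The exhaustive stimulus described above can be sketched in a few lines. This is only an illustration of the input space, not the authors' test bench: it assumes unsigned 8-bit operands and a 24-bit accumulator that wraps on overflow, neither of which is stated in the text, and the names `all_products` and `accumulate` are hypothetical.

```python
ACC_BITS = 24  # accumulator width stated in the text
ACC_MASK = (1 << ACC_BITS) - 1


def all_products():
    """Yield (activation, weight, product) for every unsigned 8-bit pair."""
    for activation in range(256):
        for weight in range(256):
            yield activation, weight, activation * weight


def accumulate(previous_sum, product):
    """Add a product to the running sum, wrapping at 24 bits (assumed behavior)."""
    return (previous_sum + product) & ACC_MASK


if __name__ == "__main__":
    products = [p for _, _, p in all_products()]
    print("input combinations:", len(products))
    print("bits needed for largest product:", max(products).bit_length())
    running = 0
    for p in products:
        # The accumulator's second input is its own previous output.
        running = accumulate(running, p)
    print("final 24-bit accumulator value:", running)
```

Every product fits in 16 bits under the unsigned assumption, which is why the multiplier output can be treated as a 16-bit input to the accumulator.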
From Figures 1(a) and 1(b), it can be inferred that, even if the output of the multiplier unit is sensitized only to the activation stream due to the preloaded weight, the multiplier unit still contributes a higher combinational delay to the MAC operation than the accumulator during a clock period of operation. Figure 1(b) shows that the accumulation requires less than half of the clock period. These disparate timing characteristics of the multiply and accumulate operations create an opportunistic timing window to correct any timing violation in an upstream MAC, thereby preventing any erroneous value from being propagated down the column of the systolic array.

Figure 2: CostCo implemented inside the MAC unit to detect and correct timing violations.

Figure 1(c) depicts the decrease in dynamic power and the dominance of static power in a MAC unit as the region of operation is changed from super-threshold computing (STC) to NTC. The X-axis is normalized to 1 GHz. The operating voltage is set at 0.85 V and scaled linearly to depict the shift in operating conditions. Static energy consumption can be reduced by operating the systolic array at frequencies above the nominal NTC frequency. However, increasing the operating frequency linearly increases the dynamic energy consumption of the MAC units. To curb the increase in dynamic energy, an in-situ clock gating technique can be employed in the systolic array. With this opportunistic window in sight, we explore our performance-enhancing TPU systolic array design, EFFORT.

3. EFFORT DESIGN

In this section, we discuss Energy eFFicient and errOr
Resilient TPU (EFFORT), a novel design paradigm to improve the performance of an NTC TPU by enhancing the
timing error resilience of its MAC units and managing the
dynamic energy consumption of the systolic array. We describe the overview of EFFORT in Section 3.1, and explain
its detailed components in Section 3.2 and Section 3.3.

3.1 Design Overview
Figure 2 demonstrates the high-level design of EFFORT.
We add two key modifications to the baseline NTC TPU.
First, we augment each MAC unit with a novel penalty-free
error detection and correction logic, thus preserving a high
performance. Second, a low-overhead clock gating technique
is implemented to conserve the dynamic power of the systolic
array. These two components are discussed next.

3.2 Costless Correction (CostCo)
In this section, we introduce Costless Correction (CostCo). We augment the conventional Razor [5] with a multiplexer (MUX) and an Exclusive-OR (XOR) gate, as demonstrated in Figure 2. Since CostCo controls its output by comparing the shadow latch with the main latch, it is capable of propagating the correct value to the downstream logic within the same clock cycle in which the timing error is detected. Figure 3 demonstrates the RTL simulation waveforms for two consecutive MAC units within a column of a systolic array, both in the absence and in the presence of a timing violation. Note that we synchronize only the primary output of the MAC units with the system clock to enable our proposed CostCo design.

Figure 3: (a) No timing violation. (b) Timing violation with no correction. (c) Timing violation with correction. CostCo can diagnose a timing violation and propagate the corrected value within one clock cycle.

Figure 3(a) demonstrates the normal output waveforms of two consecutive column MACs in the absence of any timing violation. Figure 3(b) shows how a small additional delay in the input of the first MAC engenders a timing violation in its immediate downstream MAC, leading to an erroneous result. Figure 3(c) exhibits how CostCo can detect the timing violation and propagate the correct value to its succeeding downstream MAC within the same clock cycle. CostCo can be employed as a competent method to tackle timing violations if the downstream combinational logic has a sufficient time window before the next rising edge of the clock to replay its logical operations on the corrected data.
We consider 24 CostCo flip-flops at the output of each MAC that provide the correct values to the downstream MACs in case of a timing error. The output of each MAC is utilized in the accumulation operation in its succeeding MAC. As accumulation in each MAC requires less than 50% of the clock cycle (Section 2.3), we apply a 50% shift to the system clock to provision the CostCo flip-flop clock. This clock shift provides an opportunity to detect timing errors up to 50% beyond the system clock, while guaranteeing that the succeeding MACs have adequate time to accomplish their accumulation operations in the remaining time window. We discuss the hardware overhead and performance gain of this design in Section 5.

Figure 4: (a) Cycle 1 to 4. (b) Cycle 5. (c) Cycle 6. (d) Cycle 7. (e) Cycle 8 to 11. Data flow pattern in a 4 × 4 systolic array for 11 consecutive cycles. Gray MACs are yet to receive their inputs, black MACs have completed their operations, while the rest of the MACs are presently computing their respective outputs.

3.3 Systolic Clock Gating
In EFFORT, we aim to increase the operating frequency while keeping the supply voltage at the nominal NTC value, in order to provide better performance than a baseline NTC TPU. To reduce the power consumption due to high-frequency operation, we exploit the application-independent data flow pattern within the TPU systolic array and employ a low-overhead clock gating technique.

3.3.1 Application Independent Data Flow
Figure 4 demonstrates the pattern of data flow inside a 4 × 4 systolic array for 11 consecutive clock cycles. In this figure, the gray nodes represent the MACs that have not received their data yet, the nodes in black denote the MACs that have completed their operations, and the other colored nodes denote the MACs that are performing their operations. Numbers on black nodes show the cycle in which they completed their operations, numbers on other colored nodes represent the cycle in which they received their first data, and numbers on each edge display the cycle in which the preceding node attempts to activate or deactivate its subsequent nodes, either on the right-hand side or down a row. As Figure 4(a) exhibits, considering the upper left node as the start point, from cycle 1 to 4 all the MACs from the start point down to the main diagonal receive their data in a sequential fashion, while the rest are yet to receive their data. After cycle 4 (Figure 4(b) through Figure 4(d)), as a new set of MACs receive their data in each cycle, another set of MACs accomplish their tasks. Figure 4(e) displays the systolic array after 11 clock cycles, when the entire systolic array operation is completed. We observe that all the MACs on the same diagonal of the systolic array are active or idle in the same cycles.

3.3.2 Clock Gating Components
Based on the activity pattern of a systolic array, we propose a low-overhead clock gating technique to shut down the clock of idle MACs, improving the dynamic energy consumption of the TPU. Since all of the MACs on the same diagonal of the systolic array are active or idle in the same clock cycle, instead of endowing each MAC with a separate clock gating unit, we provide one for each set of MACs on each diagonal. Since an n × n matrix has (2n − 1) diagonals, we thus reduce the total number of required clock gating units from n^2 to (2n − 1). Figure 2 shows that each clock gating unit consists of one flip-flop to register the enable signal for the downstream clock gating unit, and an AND gate to control the clock for its corresponding group of MACs.

3.3.3 MAC Activity Analysis
Generalizing from Figure 4, an n × n systolic array needs (3n − 2) cycles to complete its operation. However, not all MACs are active during each clock cycle. For an n × n systolic array, at each clock cycle in the ranges [Cycle 1, Cycle n] and [Cycle (2n − 1), Cycle (3n − 2)], the total number of active MACs is n × (n + 1)/2. Furthermore, in the interval [Cycle (n + 1), Cycle (2n)], at each cycle i, the number of active MACs changes by (n − (2 × i)), where a positive (negative) value indicates an increase (decrease) in the number of active MACs. By applying our proposed clock gating technique and summing the number of active MACs over the (3n − 2) clock cycles, we reduce the order of active sequential logic from (3n^3) to (n^3). We discuss the experimental results and hardware overhead of this technique in Section 5.
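The diagonal grouping and the cubic-order saving can be checked with a small counting sketch. It rests on one modeling assumption that is not spelled out in the text: the MACs on diagonal d (row + column, counted from zero) are enabled in cycle d + 1 and stay enabled for exactly n cycles. The function names are hypothetical.

```python
def diagonal_groups(n):
    """Group MAC coordinates (row, col) by diagonal index row + col."""
    groups = {}
    for row in range(n):
        for col in range(n):
            groups.setdefault(row + col, []).append((row, col))
    return groups


def active_macs_per_cycle(n):
    """Return the number of un-gated MACs in each of the 3n - 2 cycles.

    Assumption (not from the paper): diagonal d is enabled in cycle d + 1
    and remains enabled for exactly n cycles.
    """
    groups = diagonal_groups(n)
    counts = []
    for cycle in range(1, 3 * n - 2 + 1):
        active = sum(
            len(macs)
            for d, macs in groups.items()
            if d + 1 <= cycle <= d + n
        )
        counts.append(active)
    return counts


if __name__ == "__main__":
    n = 4
    counts = active_macs_per_cycle(n)
    print("clock gating units:", len(diagonal_groups(n)), "instead of", n * n)
    print("active MACs per cycle:", counts)
    print("clocked MAC-cycles with gating:", sum(counts))
    print("clocked MAC-cycles without gating:", n * n * (3 * n - 2))
```

Under this assumption the gated total is n^3 MAC-cycles against n^2 × (3n − 2) without gating, which matches the orders quoted above, and the count at cycle n equals n × (n + 1)/2.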

4. METHODOLOGY
In this section, we expound our comprehensive cross-layer
methodology, used to implement our proposed design and
evaluate its capabilities across DNN applications.

4.1 Device Layer
We measure the delay distributions of basic logic gates (e.g., NOR, NAND, and inverter) by performing HSPICE simulations with the 16-nm Predictive Technology Model [23]. We consider the impact of within-die process variation at NTC by using the VARIUS-NTV model [10]. In addition, we employ the VARIUS-TC model to integrate the FinFET attributes [11]. The delay values are used in the circuit layer (Section 4.2) to analyze the delays in a MAC unit.

4.2 Circuit Layer
We implement a TPU systolic array, as well as the components of our proposed design, in Verilog RTL. We synthesize the developed RTL using Synopsys Design Compiler. We use the synthesized netlists in our in-house statistical timing analysis (STA) tool. The STA tool employs libraries of delay distributions for basic logic gates from the HSPICE simulations (Section 4.1) to provide the delays of the sensitized paths in the MAC circuit. We utilize the sensitized delays for further evaluation of our proposed technique.

4.3 Architecture Layer
We use our in-house TPU systolic array simulator, developed in C++ and based on the detailed architecture of a TPU [7]. The delays from the STA tool (Section 4.2) are incorporated into the TPU simulator to simulate timing errors in the MAC units. We interface Keras [3] with our TPU simulator to replicate a real-life inference engine. The DNN applications (viz., Reuters [1], IMDB [14], MNIST [13], and CIFAR-10 [12]) are trained using Keras. Activation inputs and trained weights from each layer are extracted and separated into 256×256 matrices. Inference accuracy is obtained by combining the output matrices from the simulator.
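One plausible way to separate a layer's matrix into array-sized blocks and recombine the results is sketched below. The text does not say how partial tiles at the edges are handled, so zero padding is an assumption here, and `split_into_tiles` and `combine_tiles` are hypothetical helper names rather than part of the authors' simulator.

```python
def split_into_tiles(matrix, tile=256):
    """Split a 2-D list into tile x tile blocks, zero-padding the edges.

    Returns a dict mapping (tile_row, tile_col) to a tile x tile list of lists.
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    tiles = {}
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            block = [
                [
                    matrix[r][c] if r < rows and c < cols else 0
                    for c in range(c0, c0 + tile)
                ]
                for r in range(r0, r0 + tile)
            ]
            tiles[(r0 // tile, c0 // tile)] = block
    return tiles


def combine_tiles(tiles, rows, cols, tile=256):
    """Reassemble tiles into a rows x cols matrix, discarding the padding."""
    return [
        [tiles[(r // tile, c // tile)][r % tile][c % tile] for c in range(cols)]
        for r in range(rows)
    ]


if __name__ == "__main__":
    # Small 3 x 5 example with 2 x 2 tiles so the padding is easy to inspect.
    m = [[r * 5 + c for c in range(5)] for r in range(3)]
    tiles = split_into_tiles(m, tile=2)
    print(sorted(tiles))
    print(combine_tiles(tiles, 3, 5, tile=2) == m)
```

The `tile` parameter defaults to 256 to mirror the array size in the text; a smaller value is used in the example only to keep it readable.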

Figure 5: Normalized inference accuracies of the 4 DNN datasets for different comparative schemes.

5. EXPERIMENTAL RESULTS

In this section, we compare the efficacy of different schemes for NTC TPU operation. The baseline operating point for our scheme is (0.45 V, 67.5 MHz), which offers an error-free execution of the systolic array. Section 5.1 introduces the comparative schemes. Section 5.2 presents the inference accuracies. Section 5.3 discusses the power savings, and Section 5.4 presents the overheads of EFFORT.

5.1 Comparative Schemes
• Baseline-TPU : This scheme operates an NTC TPU without any error detection and correction. It allows the erroneous data to propagate through all the computation stages in the systolic array [20].
• TE-Drop : This technique handles the timing errors by
dropping the subsequent downstream MAC operation [22].
The erroneous MAC recomputes the output by stealing
the clock cycle from its downstream MAC.
• EFFORT : This is our proposed technique which uses
the opportunistic timing window in the MAC operation
to detect and correct timing errors (Section 3). However,
if a computational delay falls beyond that opportunistic
timing window, an erroneous value will be propagated.
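The EFFORT behavior in the list above can be summarized as a three-way classification of each MAC output by its arrival time. The sketch below is a simplified reading of Section 3.2, assuming the shadow clock trails the system clock by 50% of the period and that the downstream accumulation always fits in the remaining half cycle. The function names and the example delays are hypothetical and are not measured values.

```python
from collections import Counter


def effort_outcome(delay, clock_period, shadow_shift=0.5):
    """Classify one MAC output under the EFFORT scheme.

    delay        -- arrival time of the MAC output after the launching clock edge
    clock_period -- system clock period, in the same unit as delay
    shadow_shift -- shadow clock offset as a fraction of the period (50% in the text)
    """
    if delay <= clock_period:
        return "on_time"      # captured correctly by the main flip-flop
    if delay <= clock_period * (1.0 + shadow_shift):
        return "corrected"    # late, but caught inside the opportunistic window
    return "erroneous"        # beyond the window, so the error propagates


def outcome_histogram(delays, frequency_scale, baseline_period=1.0):
    """Count outcomes when the clock runs at frequency_scale times the baseline."""
    period = baseline_period / frequency_scale
    return Counter(effort_outcome(d, period) for d in delays)


if __name__ == "__main__":
    # Hypothetical path delays, normalized to the baseline clock period.
    delays = [0.30, 0.50, 0.70]
    for scale in (1.0, 2.0, 2.5):
        print(scale, dict(outcome_histogram(delays, scale)))
```

Raising `frequency_scale` shrinks both the clock period and the correction window, which is why more delays fall into the erroneous class at the highest frequencies.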

Figure 6: Power Consumption (Lower is better).

5.2 Inference Accuracy
Figure 5 shows the normalized accuracies for the different comparative schemes at various operating frequencies. The operating voltage is set to 0.45 V for all frequencies. The Y-axis is normalized to the error-free accuracy of the baseline operation, and the X-axis is normalized to the baseline frequency. The error-free accuracies for the datasets are REUTERS: 0.80, IMDB: 0.89, MNIST: 0.98, and CIFAR-10: 0.77.
A modest timing error resilience can be observed in all the schemes up to 1.25× the baseline frequency of operation. Accuracy begins to decline as the number of errors drastically increases at higher frequencies. EFFORT outperforms the other schemes by detecting and correcting most of the timing errors. However, for CIFAR-10, the computational delay at the highest frequency is higher than for the other datasets, which increases the number of undetected errors in EFFORT and, consequently, causes a larger reduction in inference accuracy. Baseline-TPU has a relatively sudden fall in inference accuracy, as propagating errors through successive stages massively deteriorates the quality of the output matrices [6]. Inference accuracy for TE-Drop, however, falls at a slower pace than for the Baseline-TPU. At higher frequencies, due to a large number of timing errors, TE-Drop bypasses a higher number of MAC computations, resulting in inferior accuracies compared to EFFORT. Hence, an NTC TPU enhanced with EFFORT incurs only a 2% average accuracy loss when operated at up to 2.5× the baseline frequency, for 3 out of 4 DNN datasets.

Figure 7: TOPS/Watt (Higher is better).

5.3 Energy Efficiency
Figure 6 shows the average power consumption for the 4 DNN datasets under the different comparative schemes. The power consumption of each scheme is normalized to that of the Baseline-TPU at the baseline frequency. With increasing operating frequency, power consumption steadily increases for all the schemes. However, EFFORT has lower power consumption than the other schemes. The clock gating scheme implemented in EFFORT yields lower dynamic power in MAC units that are idle. Hence, the overall power consumption of the systolic operation is reduced. Thus, EFFORT consumes up to 6% and 27% less power than the Baseline-TPU and TE-Drop, respectively. TE-Drop, due to its Razor flip-flops, has the highest power consumption.
Figure 7 depicts the average energy efficiency, measured in Tera Operations Per Second (TOPS)/Watt, for the 4 DNN datasets at the normalized frequencies. TOPS/Watt for all the schemes is normalized to that of the Baseline-TPU at the baseline frequency. All the schemes have the same TOPS measure. However, TE-Drop has the lowest energy efficiency due to its relatively high power footprint compared to both EFFORT and the Baseline-TPU. Owing to the clock gating, EFFORT boasts the highest energy efficiency. EFFORT delivers up to 1.06× and 1.35× better performance per unit power consumption relative to the Baseline-TPU and TE-Drop, respectively. Hence, EFFORT is a superior NTC TPU design paradigm, offering high energy efficiency while providing high timing error resilience.
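Because all schemes are stated to deliver the same TOPS at a given frequency, the TOPS/Watt ratio between two schemes reduces to the inverse ratio of their power. The short sketch below applies that relation as a consistency check only. The function name is hypothetical, and the reported maxima for power savings and TOPS/Watt need not occur at the same frequency, so the formula is not expected to reproduce every quoted figure exactly.

```python
def perf_per_watt_gain(power_saving):
    """Relative TOPS/Watt gain of a scheme that draws `power_saving` less power
    than a reference scheme while delivering the same TOPS.

    power_saving -- fractional saving, e.g. 0.06 for 6% less power
    """
    if not 0.0 <= power_saving < 1.0:
        raise ValueError("power_saving must be in [0, 1)")
    return 1.0 / (1.0 - power_saving)


if __name__ == "__main__":
    # 6% less power at equal throughput.
    print(round(perf_per_watt_gain(0.06), 2))
```

For a 6% power saving this gives a gain of about 1.06×, in line with the value quoted against the Baseline-TPU.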

5.4 Implementation Overhead
EFFORT incurs hardware overheads due to the clock gating circuit, and the CostCo logic added to each MAC. As
the systolic array takes almost 24% of the TPU die area [7],
EFFORT incurs an area overhead of only 5%.

6. RELATED WORK

Recent studies explore timing error resilience as well as
energy efficiency improvement in DNN accelerators. Reagen
et al. presented a co-design technique across the algorithm,
architecture and circuit level to improve the energy efficiency of DNN by applying selective pruning through lowering SRAM voltages without compromising the accuracy [18].
Chen et al. proposed a run-time pruning technique, called
row stationary, that enhances the efficiency of a convolutional neural network by re-configuring the spatial architecture, in order to map its computations [2]. Zhang et al.
introduced an aggressive voltage underscaling method to improve the energy efficiency of DNN accelerators while keeping accuracy drop less than 1% [22]. Yu et al. presented
a hardware pruning technique which applies SIMD-aware
weight and node pruning synergistically at the design time
to improve the energy efficiency of the DNN by reducing the
size of the underlying hardware [21].

7. CONCLUSION

The increase in processing workloads in real-time DNN applications calls for a DNN accelerator capable of delivering high classification accuracy while efficiently meeting the energy requirements of the system. This paper demonstrates EFFORT, a high-performance, energy-efficient, novel design paradigm for a TPU operating at NTC. EFFORT efficiently detects and tackles timing errors while reducing the power consumption of the TPU. EFFORT delivers up to a 2.5× increase in performance with a minimal drop in accuracy and consumes between 6% and 27% less power in comparison to recently proposed schemes. Additionally, EFFORT gives between 1.06× and 1.35× superior performance per unit power against representative timing error resilient schemes.

8. REFERENCES

[1] Reuters-21578 Dataset.
[2] Chen, Y.-H., et al. Eyeriss: An energy-efficient
reconfigurable accelerator for deep convolutional neural
networks. IEEE Journal of Solid-State Circuits 52, 1 (2016),
127–138.
[3] Chollet, F., et al. Keras. https://keras.io, 2015.
[4] Dreslinski, R. G., et al. Near-Threshold Computing:
Reclaiming Moore’s Law Through Energy Efficient Integrated
Circuits. Proc. of the IEEE 98, 2 (2010), 253–266.
[5] Ernst, D., et al. Razor: A Low-Power Pipeline Based on
Circuit-Level Timing Speculation. In Proc. of MICRO (2003),
pp. 7–18.
[6] Jiao, X., et al. An assessment of vulnerability of hardware
neural networks to dynamic voltage and temperature variations.
In Proceedings of the 36th International Conference on
Computer-Aided Design (2017), IEEE Press, pp. 945–950.
[7] Jouppi, N., et al. Motivation for and evaluation of the first
tensor processing unit. IEEE Micro 38, 3 (2018), 10–19.
[8] Jouppi, N. P., et al. In-datacenter performance analysis of
a tensor processing unit. In Computer Architecture (ISCA),
2017 ACM/IEEE 44th Annual International Symposium on
(2017), IEEE, pp. 1–12.
[9] Karpuzcu, U., et al. Coping with Parametric Variation at
Near-Threshold Voltages. IEEE Micro 33, 4 (July 2013), 6–14.
[10] Karpuzcu, U. R., et al. VARIUS-NTV: A
microarchitectural model to capture the increased sensitivity of
manycores to process variations at near-threshold voltages. In
DSN (2012), pp. 1–11.
[11] Khatamifard, S. K., et al. VARIUS-TC: A modular
architecture-level model of parametric variation for
thin-channel switches. In ICCD (2016), pp. 654–661.
[12] Krizhevsky, A. Learning Multiple Layers of Features from Tiny
Images.
[13] LeCun, Y., and Cortes, C. MNIST handwritten digit database.
[14] Maas, A. L., et al. Learning Word Vectors for Sentiment
Analysis. Association for Computational Linguistics,
pp. 142–150.
[15] Nakhaee, F., et al. Lifetime improvement by exploiting
aggressive voltage scaling during runtime of error-resilient
applications. Integration 61 (2018), 29–38.
[16] NanGate. http://www.nangate.com/?page_id=2328.
[17] Pandey, P., et al. GreenTPU: Improving Timing Error
Resilience of a Near-Threshold Tensor Processing Unit. In
Proc. of DAC (2019), pp. 173:1–173:6.
[18] Reagen, B., et al. Minerva: Enabling low-power,
highly-accurate deep neural network accelerators. In ACM
SIGARCH Computer Architecture News (2016), vol. 44, IEEE
Press, pp. 267–278.
[19] Sarangi, S., et al. VARIUS: A Model of Process Variation
and Resulting Timing Errors for Microarchitects. IEEE Tran.
on Semicond. Manufac. 21 (2008), 3–13.
[20] Whatmough, P. N., et al. Circuit-level timing error
tolerance for low-power DSP filters and transforms. IEEE
Transactions on Very Large Scale Integration (VLSI)
Systems 21, 6 (2012), 989–999.
[21] Yu, J., et al. Scalpel: Customizing dnn pruning to the
underlying hardware parallelism. In ACM SIGARCH
Computer Architecture News (2017), vol. 45, ACM,
pp. 548–560.
[22] Zhang, J., et al. ThUnderVolt: Enabling Aggressive
Voltage Underscaling and Timing Error Resilience for Energy
Efficient Deep Neural Network Accelerators. arXiv preprint
arXiv:1802.03806 (2018).
[23] Zhao, W., and Cao, Y. New Generation of Predictive
Technology Model for sub-45nm Early Design Exploration. T.
Electron Devices 53, 11 (2006), 2816–2823.

