FATE: Fast and Accurate Timing Error Prediction Framework for Low Power
  DNN Accelerator Design by Zhang, Jeff & Garg, Siddharth
FATE: Fast and Accurate Timing Error Prediction
Framework for Low Power DNN Accelerator Design
Jeff (Jun) Zhang Siddharth Garg
New York University
{jz2163, sg175}@nyu.edu
arXiv:1807.00480v1 [cs.LG] 2 Jul 2018
ABSTRACT
Deep neural networks (DNN) are increasingly being accel-
erated on application-specific hardware such as the Google
TPU designed especially for deep learning. Timing specula-
tion is a promising approach to further increase the energy
efficiency of DNN accelerators. Architectural exploration for
timing speculation requires detailed gate-level timing sim-
ulations that can be time-consuming for large DNNs that
execute millions of multiply-and-accumulate (MAC) opera-
tions. In this paper we propose FATE, a new methodology
for fast and accurate timing simulations of DNN accelera-
tors like the Google TPU. FATE proposes two novel ideas:
(i) DelayNet, a DNN based timing model for MAC units;
and (ii) a statistical sampling methodology that reduces the
number of MAC operations for which timing simulations are
performed. We show that FATE results in between 8×−58×
speed-up in timing simulations, while introducing less than
2% error in classification accuracy estimates. We demonstrate
the use of FATE by comparing a conventional DNN
accelerator that uses 2’s complement (2C) arithmetic with
an alternative implementation that uses signed magnitude
representations (SMR). We show that the SMR implementation
provides 18% more energy savings for the same
classification accuracy than 2C, a result that might be of
independent interest.
1. INTRODUCTION
The recent success of deep learning in a range of ma-
chine learning applications [16, 18, 23, 30] has motivated in-
terest in the design of special purpose hardware accelera-
tors for both training and inference of deep neural networks
(DNN). Orders of magnitude energy savings compared to
CPU or GPU based solutions have been realized by re-
cent silicon prototypes that seek to accelerate the compu-
tationally expensive matrix multiplication and convolution
operations needed for DNN inference and training using
highly parallel arrays of multiply-and-accumulate (MAC)
units [4, 5, 8, 15, 20, 22]. Techniques such as zero-skipping
and neural network compression seek to further reduce the
energy consumption of deep learning accelerators by exploit-
ing special properties of DNNs [1,5, 17,21].
In this paper, we are interested in using timing speculation,
which has recently been shown to be another promising
approach to increase the energy efficiency of DNN
accelerators [27, 29]. Conventional digital design methodologies
pessimistically include significant timing margins (or guard-
bands) to guarantee correct functionality in the presence of
process, voltage, and temperature (PVT) variations. The
idea behind timing speculation is to optimistically execute a
chip at a lower voltage, thus reducing energy at the expense of
higher delay and consequently occasional timing errors [9].
Timing errors can either be allowed to propagate if the ap-
plication is itself error tolerant [11,19], or detected using so-
called Razor flip-flops and corrected via safe re-execution [9].
Recent work has proposed a new timing speculation tech-
nique, referred to as TE-Drop, that significantly outperforms
the two aforementioned techniques and enables up to 57%
energy savings with negligible drop in classification accu-
racy [29]. The idea behind TE-Drop is to simply “drop out”
(or skip) erroneous MAC operations instead of re-executing
them, thus eliminating the performance overhead of frequent
re-execution.
Empirically evaluating and comparing timing speculation
methodologies for DNN accelerators is challenging (or even
intractable) for several reasons. For one, gate-level tim-
ing simulations are computationally expensive and orders
of magnitude slower than functional simulations. Second,
DNN accelerators tend to be large: the Google Tensor Pro-
cessing Unit (TPU), for example, utilizes a systolic array
containing 65K (256 × 256) MAC units at its core. In the Google TPU
case, therefore, timing simulations have to be performed on
a gate-level netlist of roughly 42 million gates. Finally, state-
of-the-art DNNs perform several millions of MAC operations
over their multiple layers. Putting these together, we esti-
mate in Section 4.1 that detailed timing simulations of a
state-of-the-art DNN like AlexNet on a batch of 256 input
images take 384 hours!
This paper presents FATE, a new methodology for fast
and accurate timing error simulation of DNN accelerators.
While FATE represents a general methodology, we demonstrate
FATE in the context of systolic array based DNN
accelerators; the systolic array is used in the Google TPU and
is one of the most popularly used architectural paradigms for
DNN acceleration.
FATE builds on two separate but complementary ideas.
Both ideas leverage the fact that DNN accelerators (and
especially systolic arrays) are large, regular arrays of identical
MAC units.
1. DNN based acceleration of timing simulations: instead
of running full gate-level timing simulations, FATE
trains a DNN, which we refer to as DelayNet,
to accurately estimate the delay of a MAC unit as a
function of its inputs. DelayNet is then used along
with functional simulations to evaluate the impact of
timing errors on the DNN accelerator.
2. Sampling based timing error estimation: instead of
simulating all MAC units in the DNN accelerator, FATE
samples and performs timing simulations on only a
subset of MAC units, for instance, a subset of columns
of a systolic array, and probabilistically injects errors
in the remaining MAC units at the same rate.
We show that sampling and DelayNet speed up timing simulations
for DNN accelerators by 8× and up to 58.57×, respectively,
while introducing only 4.3%-6.17% average error in timing
error estimates.
We use FATE to compare timing speculation on systolic
arrays that use either 2’s-complement (2C) based or sign-
magnitude representation (SMR) based MAC units. We
show that systolic arrays designed using SMR MACs have
significantly lower timing error rates compared to their 2C
counterparts and provide greater energy savings for the same
classification accuracy. These results are of independent in-
terest to DNN accelerator designers.
2. RELATED WORK
There is a considerable body of work on timing specula-
tion based low-power design of digital logic [9,10,12], includ-
ing recent work that has evaluated or developed new tim-
ing speculation mechanisms for DNN accelerators [14,27,29].
However, none of these works focus on frameworks to speed-
up timing simulations of DNN accelerators, which is the goal
of FATE.
The problem of fast timing error simulation has been addressed
in prior work as well. Proposed techniques include
the use of symbolic timing analysis tools [24, 25] and, more
recently, DNN based timing estimation [13]. While Jiao et
al.’s work is closest to ours, there are several important
differences. First, Jiao et al. are focused on timing estimation
for microprocessor modules while our target is DNN
accelerators. Second, Jiao et al. train a classifier to predict
whether given inputs result in a timing error or not.
DelayNet, our proposed timing model, outputs a delay value
and can therefore be reused to predict timing errors across
different clock frequency and voltage scaling values. Finally,
FATE incorporates a second idea, i.e., statistical sampling
of MAC operations, that is not studied in Jiao et al.’s work.
De Kruijf et al. [6] propose a micro-architecture level model
to estimate the overhead of timing errors, but the model
still requires detailed gate-level timing simulations. We note
that statistical sampling has been proposed in the past as
a technique to speed up micro-architectural simulations of
multi-core processors [28], but to the best of our knowledge,
we are the first to apply this methodology to timing
speculative DNN accelerators.
3. BACKGROUND
Figure 1: Systolic array based DNN accelerator based on
the Google TPU [15].
This section describes the requisite background relevant
to FATE. We start by briefly describing the design of DNN
accelerators.
3.1 DNN Accelerators
A DNN is structured as a feed-forward network that contains
$L$ nested layers of computation. Layer $i \in [1, L]$ has
$N_i$ “neurons” whose outputs $a^i \in \mathbb{R}^{N_i \times B}$ are called
activations. Here $B$ is the batch size, i.e., the number of inputs
the DNN operates on simultaneously. Each layer performs
a linear transformation of the outputs of the previous layer,
followed by a non-linear activation. The operation of a DNN
can be described mathematically as:
$$a^i = \phi\left(w^i a^{i-1} + b^i\right) \quad \forall i \in [1, L], \tag{1}$$
where $\phi : \mathbb{R}^{N_i \times B} \to \mathbb{R}^{N_i \times B}$ is each layer’s activation function,
$w^i \in \mathbb{R}^{N_i \times N_{i-1}}$ is the weight matrix, and $b^i \in \mathbb{R}^{N_i}$ is
referred to as the bias (added to every column of the batch).
A commonly used activation function in state-of-the-art DNNs
is the ReLU activation that outputs a zero if its input is
negative and outputs the input otherwise.
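As a minimal sketch of Equation (1), the following pure-Python code evaluates a stack of fully-connected layers with ReLU activations. The function names (`relu`, `layer_forward`, `dnn_forward`) and the nested-list matrix layout are illustrative assumptions, not part of the paper's tool-flow.

```python
def relu(x):
    """ReLU: zero for negative inputs, identity otherwise."""
    return x if x > 0.0 else 0.0


def layer_forward(w, a_prev, b):
    """One layer of Equation (1).

    w      : N_i x N_{i-1} weight matrix (list of rows)
    a_prev : N_{i-1} x B activations of the previous layer
    b      : length-N_i bias vector, added to every batch column
    returns: N_i x B activations
    """
    n_in, batch = len(a_prev), len(a_prev[0])
    out = []
    for j, w_row in enumerate(w):
        row = []
        for i in range(batch):
            s = b[j] + sum(w_row[k] * a_prev[k][i] for k in range(n_in))
            row.append(relu(s))
        out.append(row)
    return out


def dnn_forward(weights, biases, a0):
    """Apply all layers in order, starting from the input batch a0."""
    a = a0
    for w, b in zip(weights, biases):
        a = layer_forward(w, a, b)
    return a


# Toy example: one layer, 2 inputs, 2 neurons, batch size B = 2.
weights = [[[1.0, -1.0], [0.5, 0.5]]]
biases = [[0.0, -1.0]]
a0 = [[2.0, 1.0], [1.0, 3.0]]
result = dnn_forward(weights, biases, a0)
```

The triple loop makes the O(N²B) cost of the matrix multiplication explicit, which is the part a systolic array accelerates.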
Multiplying the matrix of weights with a matrix of
activations is the most computationally expensive operation
in DNN execution¹. Consequently, the primary focus of
DNN accelerators is on speeding up matrix multiplications.
For instance, the baseline DNN accelerator shown in Figure 1
uses a systolic array, a grid of N × N MAC units, to
accelerate matrix multiplications.
The operation of a systolic array can be understood as follows.
First, an N × N matrix of weights is loaded into the array from
the weight memory, one weight per MAC unit. Next, activations
are read from the activation memory and flow through
the array from left to right. Each MAC unit receives two
inputs per clock cycle: an activation from the left and a
partial sum from the north. It multiplies its incoming activation
with its stored weight and adds the product to the incoming
partial sum; the resulting sum is sent to the downstream
MAC unit. The final MAC unit in column i outputs, in successive
clock cycles, the dot product of row i of the weight
matrix with each column in the activation matrix. The final
results are stored in accumulation buffers and copied back
into activation memory for subsequent computations.
¹ The time complexity of matrix multiplication is O(N²B)
for a DNN with N neurons in each layer versus only O(NB)
for applying the activation function.
Figure 2: Block-level and timing diagram of TE-Drop proposed
in [29].
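The systolic array operation described in Section 3.1 can be sketched functionally as follows. This is not a cycle-accurate model: it only mimics how a partial sum accumulates as it passes through the MAC units of one column. The names `mac` and `systolic_matmul` and the convention that `w[j][k]` is the weight held by the MAC unit in row k of column j are assumptions made for illustration.

```python
def mac(weight, activation, partial_sum):
    """One MAC unit: multiply the stored weight by the incoming activation
    and add the product to the partial sum arriving from the north."""
    return partial_sum + weight * activation


def systolic_matmul(w, a):
    """Functional model of a weight-stationary systolic array.

    w : N x N weights, w[j][k] held by the MAC in row k of column j
    a : N x B activations, a[k][i] enters row k for batch input i
    returns the N x B matrix whose entry [j][i] is the dot product of
    row j of w with column i of a.
    """
    n, batch = len(w), len(a[0])
    out = [[0] * batch for _ in range(n)]
    for i in range(batch):          # successive batch inputs
        for j in range(n):          # one column of the array per output
            psum = 0
            for k in range(n):      # partial sum flows north to south
                psum = mac(w[j][k], a[k][i], psum)
            out[j][i] = psum
    return out


product = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

In hardware the columns operate concurrently and the batch inputs are pipelined; the sequential loops here only reproduce the arithmetic result.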
3.2 Timing Speculation
Timing speculation was first proposed by Ernst et al. [9]
as a strategy to speculatively operate digital logic at a lower
voltage than that required to guarantee that timing constraints
are met. While lowering voltage quadratically reduces
dynamic power, it also increases the delay of combinational
logic. Consequently, under-volting the logic can result
in timing errors. Errors are detected using double-sampling
flip-flops that latch the combinational logic output twice:
once at the regular clock edge and once using a delayed clock.
If the two outputs are different, a timing error is inferred. The input
that caused the timing error is then re-executed, but this
time with a slower clock so as to guarantee correct execu-
tion. We will refer to this approach as timing error detection
and re-execution (TEDR).
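The quadratic dependence of dynamic power on voltage mentioned above follows from the standard first-order CMOS model. In the sketch below, the symbols α (switching activity), C (switched capacitance), f (clock frequency) and r (voltage underscaling ratio relative to the nominal voltage) are notation introduced here, not taken from the paper, and leakage and error-handling overheads are ignored.

```latex
P_{\mathrm{dyn}} = \alpha\, C\, V_{dd}^{2}\, f
\qquad\Longrightarrow\qquad
\frac{E_{\mathrm{dyn}}(r\,V_{\mathrm{nom}})}{E_{\mathrm{dyn}}(V_{\mathrm{nom}})} = r^{2},
\quad \text{e.g. } r = 0.7 \;\Rightarrow\; 0.49 .
```

Under these assumptions, running at 70% of the nominal voltage with an unchanged clock would roughly halve dynamic energy per operation, which is the saving that timing speculation trades against timing errors.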
For algorithms that are inherently error resilient, a sim-
pler way to deal with timing errors is to simply allow them
to propagate instead of incurring detection and re-execution
overhead [14,19,26]. This is referred to as timing error prop-
agation (TEP).
Recent work has shown that both TEDR and TEP have limited
energy saving potential for DNN accelerators as they
only allow the accelerators to run at relatively low timing
error rates below 1% [14, 29]. While we refer the interested
reader to the prior work for more details as to why this is
the case, we note that Zhang et al. [29] propose a new tim-
ing speculation methodology for DNN accelerators referred
to as TE-Drop.
TE-Drop detects timing errors in the same way as TEDR,
but differs in the way it responds to errors. When a timing
error is detected in a MAC operation, TE-Drop borrows a
cycle from the subsequent MAC to correctly complete its
own execution, but drops (i.e., zeroes out) the contribution
of the subsequent MAC operation to the partial sum (see
block diagram and timing diagram in Figure 2). Empirical
results show that TE-Drop enables execution at timing error
rates as high as 10% with negligible drop in classification
accuracy and no performance loss. We therefore adopt TE-Drop
as our baseline timing speculation mechanism in the
remainder of this paper.
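The TE-Drop behaviour described above can be sketched for a single column of MAC operations. This is a simplified functional model under stated assumptions: the per-MAC delays are given as inputs, a MAC whose cycle is borrowed is skipped entirely, and the function name `te_drop_column` is hypothetical.

```python
def te_drop_column(weights, activations, delays, t_clk):
    """Accumulate one column's partial sum under TE-Drop.

    A MAC whose delay exceeds t_clk still contributes its correct product
    (it borrows the next cycle), but the contribution of the subsequent MAC
    is dropped. Returns (partial_sum, number_of_dropped_macs).
    """
    psum = 0
    dropped = 0
    borrow = False  # True when the previous MAC borrowed this MAC's cycle
    for w, a, d in zip(weights, activations, delays):
        if borrow:
            dropped += 1      # this MAC's product is zeroed out
            borrow = False
            continue
        psum += w * a
        borrow = d > t_clk    # timing error: steal the next cycle
    return psum, dropped


# The second MAC is too slow, so the third MAC's contribution is dropped.
psum, dropped = te_drop_column(
    weights=[1, 2, 3, 4],
    activations=[1, 1, 1, 1],
    delays=[0.5, 1.2, 0.9, 0.4],
    t_clk=1.0,
)
```

In this toy case the exact dot product is 10, while TE-Drop yields 7 after dropping one product; the paper's claim is that DNNs tolerate such drops at error rates up to about 10%.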
4. FATE METHODOLOGY
In this section we describe the FATE methodology. We be-
gin by motivating the need for FATE and then describe the
two new techniques that constitute FATE, i.e., DelayNet,
a DNN based timing model for MAC units, and sampling
based acceleration of timing simulations.
4.1 Motivation
To demonstrate the intractability of running detailed gate-
level timing simulations of a DNN accelerator, we developed
an RTL prototype of a systolic array, modeled on the Google
TPU, with 256×256 MAC units and synthesized it with the
OSU FreePDK 45nm Library. More details of our experi-
mental setup can be found in Section 5. We then scheduled
the four benchmark DNNs, as described in Table 1, on the
prototype and ran detailed post-synthesis timing simulations
using Modelsim for a batch of 256 test inputs. Table 1 shows
the number of MAC operations that each DNN performs,
and the simulation time required to execute a batch of test
inputs in detailed timing simulation mode. Note that running
AlexNet, the largest DNN we simulated, takes more
than 10 days for only 256 inputs, which represent only a
fraction of AlexNet’s test suite. This study motivates the
acute need to speed up timing error simulations for DNN
based accelerators.
4.2 DelayNet
[Figure 3 diagram: Input (72) = {Weight (8), Cur. Activation (8), Pre. Activation (8), Cur. Accumulation (24), Pre. Accumulation (24)}; DelayNN: 72 × 30 × 1; Activation: Sigmoid.]
Figure 3: Architecture of DelayNet. Assuming 8-bit weights
and activations and 24-bit partial sums, DelayNet takes a
72-bit input and outputs a real valued estimate of the MAC
unit delay for the given inputs.
The first component of FATE is DelayNet, a DNN based
timing model for MAC units. DelayNet is motivated by the
observation that MAC units constitute a dominant fraction
of the logic of DNN accelerators; for instance, MAC units
constitute > 99% of the logic of the systolic array shown in Figure 1.
Furthermore, from our simulations we noted that all
timing errors are triggered by MAC units since the control
logic is simple and has low delay. Therefore, any method
that speeds-up timing simulations for the MAC unit pro-
vides commensurate speed-ups in timing simulation of the
full DNN accelerator.
Given the success of deep learning for regression tasks,
we train a 2-layer fully-connected DNN with sigmoid activations,
referred to as DelayNet, to approximate the delay
of MAC units. Ideally, the inputs to DelayNet would be the m-bit
activation inputs, n-bit partial sum inputs and k-bit
weight inputs to the MAC unit from the current and previous
clock periods. However, since our baseline systolic array
is weight stationary [5], we do not require the weight from
the previous cycle. Hence, DelayNet has a 2m + 2n + k bit
input (72 bits for m = k = 8 and n = 24) and a real-valued
delay output. Figure 3 shows the DelayNet architecture.
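A minimal sketch of the DelayNet input encoding and forward pass, assuming the 72 × 30 × 1 shape from Figure 3. The bit ordering, the helper names (`to_bits`, `encode_mac_inputs`, `delaynet_forward`), the random untrained weights, and the use of a sigmoid on the output neuron are all assumptions for illustration; the paper trains the real model in TensorFlow.

```python
import math
import random


def to_bits(value, width):
    """Two's complement bit list (MSB first) of `value` in `width` bits."""
    value &= (1 << width) - 1
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]


def encode_mac_inputs(weight, cur_act, prev_act, cur_psum, prev_psum):
    """Concatenate the MAC inputs into the 72-bit DelayNet input vector:
    8-bit weight, 8-bit current/previous activation,
    24-bit current/previous partial sum."""
    return (to_bits(weight, 8) + to_bits(cur_act, 8) + to_bits(prev_act, 8)
            + to_bits(cur_psum, 24) + to_bits(prev_psum, 24))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(x, w, b):
    """Fully-connected layer followed by a sigmoid."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(row, x)) + bj)
            for row, bj in zip(w, b)]


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (untrained placeholder)."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def delaynet_forward(bits, params):
    """72 -> 30 -> 1 forward pass; output is a normalized delay in (0, 1)."""
    (w1, b1), (w2, b2) = params
    hidden = dense(bits, w1, b1)
    return dense(hidden, w2, b2)[0]


rng = random.Random(0)
params = [init_layer(30, 72, rng), init_layer(1, 30, rng)]
bits = encode_mac_inputs(-3, 17, 5, 1200, -40)
delay = delaynet_forward(bits, params)
```

With trained parameters, `delay` would be compared against the clock period to decide whether a given MAC operation causes a timing error.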
[Figure 4 plot: Mean Square Error (0.002 to 0.018) versus Epoch (0 to 250).]
Figure 4: Training error with respect to number of training
epochs. Each training epoch takes roughly 10 seconds.
DelayNet is trained using inputs and delays obtained from
full timing simulations of the systolic array. During timing
simulations, the systolic array is exercised with training data
for the DNN benchmarks under evaluation. From the timing
simulations, we extract 1M MAC input and corresponding
delay pairs as training data for DelayNet. We are helped by
the fact that DNN benchmarks are accompanied by large
and representative training data sets that reflect the actual
distribution of MAC inputs during test. Figure 4 shows evo-
lution of training error as a function of the training epoch.
We note that the error converges rapidly and the entire network
is trained within 10 minutes with a root mean square
error of 0.038 (we assume delay values are normalized in
[0, 1]).
Once trained, DelayNet can be re-used across experiments.
For example, in Section 5 we use DelayNet to estimate tim-
ing error probabilities and classification accuracy at multiple
different voltage underscaling ratios. Furthermore, although
not illustrated in this paper, DelayNet can be used to compare
different DNN accelerator architectures as long as they
use the same MAC design.
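Because DelayNet outputs a delay rather than an error flag, the same predictions can be re-thresholded for different operating points. The sketch below illustrates this with made-up normalized delays; the function name `timing_error_rate` is hypothetical, and mapping a voltage underscaling ratio to an effective clock threshold would need a delay-versus-voltage model that is not given here.

```python
def timing_error_rate(delays, t_clk):
    """Fraction of MAC operations whose predicted delay exceeds t_clk."""
    if not delays:
        return 0.0
    return sum(1 for d in delays if d > t_clk) / len(delays)


# Illustrative (made-up) normalized delay predictions for six MAC operations.
predicted = [0.21, 0.35, 0.48, 0.62, 0.80, 0.95]

# Re-use the same predictions at three different normalized clock periods.
rates = {t: timing_error_rate(predicted, t) for t in (1.0, 0.7, 0.5)}
```

A classifier trained for one clock period, by contrast, would have to be retrained for each new threshold.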
4.3 Sampling Based Acceleration
Even after speeding-up timing simulations with DelayNet,
the sheer number of MAC operations performed by large
DNNs can still result in prohibitive simulation time. To ad-
dress this problem, we use statistical sampling [28], an ap-
proach that has already been effectively applied in speeding-
up micro-architecture simulations of multi-core processors.
Specifically, we randomly sample a subset of MAC opera-
tions during DNN execution and run either full or DelayNet
based timing simulations on these samples. From the sam-
ples, we then compute the average error probability over
these samples perr (i.e., the fraction of sampled MAC oper-
ations that cause timing errors). Finally, we randomly inject
timing errors in the remaining MAC units with probability
perr.
For systolic array based DNN accelerators, FATE ran-
domly samples a subset of columns on which to perform
timing simulations. Note each column in the systolic array
computes the output of a different neuron. Column sam-
pling has two advantages: (i) we expect the timing error
rates of columns to be correlated since each column has the
same activation inputs; and (ii) each sampled column can
be simulated in parallel since there are no data dependen-
cies between columns of the systolic array. The pseudo-code
for FATE with column sampling is described in Algorithm 1.
Note that the sampling methodology is repeated per layer
of DNN execution. This is because the distribution of acti-
vation inputs can change significantly from one layer to the
next. Indeed, prior work [29] has noted that DNN
timing error rates vary significantly across DNN layers, further
motivating the need to sample each layer separately.
4.4 FATE Tool-flow
The overall FATE Tool-flow, illustrated in Figure 5, contains
two phases. (i) The training phase maps DNN benchmarks
to the synthesized systolic array netlist and performs
detailed gate-level timing simulation with the Standard Delay
Format (SDF) file in Modelsim to extract the ground truth delay
for each MAC operation. With the ground truth delay,
FATE trains the DelayNet as described in Section 4.2 using
TensorFlow. (ii) The prediction phase uses inputs from
the test dataset to perform the functional simulations and
uses sampling to extract a subset of MAC operations from
which perr is estimated using timing simulations. Probabilistic
error injection with probability perr is then used to
determine the classification accuracy versus energy trade-offs
in the presence of timing speculation.
5. EMPIRICAL EVALUATION
We now evaluate the accuracy and speed-up of FATE, and
demonstrate its use in enabling architectural exploration for
timing speculation based DNN accelerators. We begin by
describing our experimental setup.
Table 1: Simulation time for detailed gate-level timing simulations of benchmark DNNs for a test set containing 256 inputs.
Also shown are the architectural parameters and total number of MAC operations for each benchmark DNN.
Name         | Architecture                                                | MAC OPs | Sim. Time (256 Input Batch)
MNIST        | L1-L4 (FC): 784 × 256 × 256 × 256 × 10                      | 3.34e5  | 30 min
Reuters [3]  | L1-L4 (FC): 2048 × 256 × 256 × 256 × 52                     | 6.69e5  | 50 min
TIMIT [2]    | L1-L4 (FC): 1845 × 2000 × 2000 × 2000 × 183                 | 1.21e7  | 24.6 h
ImageNet [7] | L1-L2 (Conv): (224, 224, 3) × (27, 27, 64) × (13, 13, 192)  | 6.56e8  | 384 h
             | L3-L5 (Conv): (13, 13, 384) × (13, 13, 256) × (6, 6, 256)   |         |
             | L6-L8: 4096 × 4096 × 1000                                   |         |
Figure 5: Diagram illustrating FATE Tool Flow.
Algorithm 1: Sampling based Acceleration
Input : w^l, a^l, t_CLK
Output: a^{l+1}
Algorithm Sampling(w^l, a^l, t_CLK)
  a^{l+1} = 0; n_err = 0;
  /* Random selection of q columns */
  I ⊂ [0, N − 1], |I| = q;
  /* Sampling */
  for (i = 0; i < B; i++) do
    for j ∈ I do
      for (k = 0; k < N; k++) do
        /* Delay of MAC can be obtained either from timing
           simulation or from DelayNet */
        if MACdelay(a^l_{ki}, w^l_{jk}, a^{l+1}_{ji}) > t_CLK then
          n_err = n_err + 1;
        end
        a^{l+1}_{ji} += w^l_{jk} a^l_{ki};
      end
    end
  end
  /* Fraction of sampled MAC operations that cause timing errors */
  p_err = n_err / (q · N · B);
  /* Probabilistic error injection */
  for (i = 0; i < B; i++) do
    for j ∉ I do
      for (k = 0; k < N; k++) do
        if Bernoulli(p_err) == 1 then
          a^{l+1}_{ji} = a^{l+1}_{ji};   /* contribution is dropped */
        else
          a^{l+1}_{ji} += w^l_{jk} a^l_{ki};
        end
      end
    end
  end
  return a^{l+1};
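Algorithm 1 can be sketched in runnable form as follows. This is an illustrative pure-Python rendering, not the authors' implementation: `mac_delay` stands in for either gate-level timing simulation or DelayNet, the matrices are nested lists, and p_err is computed as the fraction of sampled MAC operations whose delay exceeds the clock period.

```python
import random


def sampling_layer(w, a, t_clk, q, mac_delay, rng):
    """One layer of FATE with column sampling.

    w : N_out x N_in weights, a : N_in x B activations
    q : number of sampled columns (output neurons), 1 <= q <= N_out
    mac_delay(activation, weight, partial_sum) -> delay of one MAC operation
    Returns (a_next, p_err).
    """
    n_out, n_in, batch = len(w), len(a), len(a[0])
    a_next = [[0.0] * batch for _ in range(n_out)]
    sampled = set(rng.sample(range(n_out), q))

    # Sampling: timing-simulate only the selected columns.
    errors = 0
    for i in range(batch):
        for j in sampled:
            for k in range(n_in):
                if mac_delay(a[k][i], w[j][k], a_next[j][i]) > t_clk:
                    errors += 1
                a_next[j][i] += w[j][k] * a[k][i]
    p_err = errors / (q * n_in * batch)

    # Probabilistic error injection in the remaining columns.
    for i in range(batch):
        for j in range(n_out):
            if j in sampled:
                continue
            for k in range(n_in):
                if rng.random() < p_err:
                    continue  # this MAC's contribution is dropped
                a_next[j][i] += w[j][k] * a[k][i]
    return a_next, p_err


w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 1.0, 1.0]]
a = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
exact = [[sum(wr[k] * a[k][i] for k in range(3)) for i in range(2)] for wr in w]

# No MAC ever violates timing: the result equals the exact product.
ok_out, ok_p = sampling_layer(w, a, 1.0, 2, lambda x, y, z: 0.0, random.Random(1))
# Every MAC violates timing: all unsampled columns are fully dropped.
bad_out, bad_p = sampling_layer(w, a, 1.0, 2, lambda x, y, z: 2.0, random.Random(1))
```

As in the pseudo-code, the sampled columns themselves are accumulated without dropping; only the estimated rate is propagated to the unsampled columns.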
5.1 Setup
DNN Benchmarks.
We evaluate FATE on four popular DNNs: two small
DNNs for MNIST digit classification and Reuters text cate-
gorization [3], and two large state-of-the-art DNNs for TIMIT
speech recognition [2] and image recognition using the ImageNet
dataset [7]. The parameters of each DNN benchmark
are shown in Table 1. Of the four DNNs, three are multi-layer
perceptrons (MLP) and the largest, AlexNet, is a convolu-
tional neural network (CNN). Indeed, although we described
FATE in the context of MLPs, it can be used to evaluate
CNNs as well.
Baseline DNN Accelerator.
All our experiments are performed on a systolic array
based DNN accelerator that closely resembles the Google
TPU. The systolic array in our accelerator has 65K MAC
units arranged in a square grid. Each MAC unit in the ar-
ray operates on inputs represented as 2’s complement (2C)
signed integers with 8-bit weights and activations and 24-bit
partial sums. A cycle-accurate prototype of this accelera-
tor is implemented in fully-synthesizable Verilog and syn-
thesized with the 45 nm OSU FreePDK technology library
using Cadence Genus. Gate-level timing simulations are per-
formed using ModelSim.
5.2 FATE Accuracy and Speed-up
In our experimental results, we evaluate FATE in terms
of accuracy and speed-up with respect to full timing simulations,
which we refer to as Full-Sim. We evaluate two variants of FATE:
• FATE-DNN: simulates all MAC units in the systolic
array but uses DelayNet instead of full gate-level tim-
ing simulations.
• FATE-Samp: uses the proposed sampling method-
ology with sampling parameter q = 32. That is, 32
of the 256 columns in the systolic array are sampled
and simulated fully while probabilistic error injection
is used for the rest.
Note that FATE-DNN and FATE-Samp are orthogonal and
can be used together to offer multiplicative speed-ups. Here,
we characterize each technique’s accuracy and speed-up sep-
arately.
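As a rough worked example of these speed-ups, the sketch below divides the 384-hour AlexNet estimate from Table 1 by the sampling speed-up (256 / 32 = 8×) and by the largest FATE-DNN speed-up reported in Table 2 (58.57×, measured on TIMIT). Applying the TIMIT figure to AlexNet and multiplying the two speed-ups are assumptions made for illustration, not measured results.

```python
def simulated_hours(full_sim_hours, speedup):
    """Simulation time after applying a given speed-up factor."""
    return full_sim_hours / speedup


ALEXNET_FULL_SIM_HOURS = 384.0   # Table 1 estimate for a 256-input batch
SAMP_SPEEDUP = 256 / 32          # q = 32 of 256 columns sampled -> 8x
DNN_SPEEDUP = 58.57              # FATE-DNN speed-up reported for TIMIT

samp_hours = simulated_hours(ALEXNET_FULL_SIM_HOURS, SAMP_SPEEDUP)
dnn_hours = simulated_hours(ALEXNET_FULL_SIM_HOURS, DNN_SPEEDUP)

# If the two techniques compose multiplicatively (assumption):
combined_speedup = SAMP_SPEEDUP * DNN_SPEEDUP
combined_hours = simulated_hours(ALEXNET_FULL_SIM_HOURS, combined_speedup)

print(f"FATE-Samp: {samp_hours:.1f} h, FATE-DNN: {dnn_hours:.1f} h, "
      f"combined: {combined_hours:.2f} h")
```

The first two figures come out at roughly 48 and 7 hours, so the combined figure should be read only as an upper bound on what multiplicative composition could offer.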
Figure 6 shows the per-layer timing error rates for MNIST,
Reuters and TIMIT estimated using Full-Sim, FATE-DNN
and FATE-Samp. Note that we were unable to run full timing
simulations for AlexNet because of the prohibitive run-time
of doing so. From Figure 6 we observe that the timing
error rates obtained using FATE-DNN and FATE-Samp are
close to those obtained from Full-Sim. Both FATE-DNN
and FATE-Samp capture two important qualitative properties
of timing errors: (i) the increase in timing error rate
with voltage scaling, and (ii) the fact that timing error rates
vary significantly from one layer to the next.
[Figure 6 plots: Timing Error Probabilities versus Voltage Underscaling Ratio (0.5 to 1) for Layers 1-3, comparing FATE-Samp, FATE-DNN and Full-Sim. Panels: (a) TIMIT, (b) MNIST, (c) Reuters.]
Figure 6: Timing Error Rates on TIMIT, MNIST, Reuters Test Dataset.
Table 2: FATE Accuracy and Speed-up
Benchmark | FATE-Samp Error | FATE-Samp Speedup | FATE-DNN Error | FATE-DNN Speedup
MNIST     | 2.33%           | 8X                | 2.73%          | 3X
Reuters   | 2.16%           | 8X                | 4.76%          | 5.8X
TIMIT     | 4.3%            | 8X                | 6.17%          | 58.57X
Quantitatively, we note in Table 2 the average error in timing
error estimates obtained from FATE-DNN and FATE-Samp
as well as the speed-ups compared to Full-Sim. Note
that FATE-DNN speed-ups include the training time
for DelayNet. We note that FATE-Samp is more accurate
than FATE-DNN on all three benchmarks; on TIMIT it is also
slower, providing 8× speed-up over Full-Sim with only 4.3%
error in timing error estimates, compared to 58.57× speed-up
with 6.17% error for FATE-DNN.
Next, we used FATE-DNN and FATE-Samp to evaluate
the classification accuracy versus energy trade-offs enabled
by the TE-Drop timing speculation scheme. Recall that as
voltage is reduced, the DNN accelerator’s energy consump-
tion drops but its timing error rate increases. Consequently,
a larger fraction of MAC operations are “dropped” result-
ing in a decrease in classification accuracy. Figure 7 plots
the classification accuracy versus energy trade-off curves ob-
tained using FATE-DNN and FATE-Samp along with the
baseline results from Full-Sim for the MNIST, Reuters and
TIMIT DNNs. We note that the classification accuracy esti-
mated using FATE-DNN and FATE-Samp are always within
2% of the golden value obtained from Full-Sim.
5.3 Architectural Exploration Using FATE
We now illustrate how FATE enables architectural exploration
for low-power, timing speculation based DNN accelerators.
Specifically, we will compare two implementations
of the systolic array based DNN accelerator in Figure 1:
the first makes use of 2’s complement (2C) representation
for weights, activations and partial sums, while the second
makes use of a sign-magnitude representation (SMR).
The reason we believe SMR might be advantageous over
2C from a timing error perspective is as follows. Prior work
and our own empirical observation show that the weights,
activations and partial sums in DNNs cluster around small
positive and negative values (see Figure 8c for example). In
a 2C representation, small positive values are encoded using
logic 1s in the least significant bits (LSBs) while small nega-
tive values are encoded using logic 1s in the most significant
bits (MSBs). Hence, switching activity is distributed across
the LSBs and MSBs. On the other hand, small positive and
negative values are encoded identically (except the sign bit)
for SMR, and hence the switching activity is focused in the
LSBs.
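The encoding argument above can be illustrated by counting how many bits toggle between consecutive small values in each representation. The helper names and the example value sequence are made up for illustration, and toggled bits are only a rough proxy for switching activity, not a delay model.

```python
def twos_complement(value, bits=8):
    """Two's complement encoding of `value` as an unsigned bit pattern."""
    return value & ((1 << bits) - 1)


def sign_magnitude(value, bits=8):
    """Sign-magnitude encoding: MSB is the sign, remaining bits |value|."""
    sign = 1 if value < 0 else 0
    return (sign << (bits - 1)) | abs(value)


def toggled_bits(x, y):
    """Number of bit positions that differ between two encodings."""
    return bin(x ^ y).count("1")


def mean_toggles(encode, values, bits=8):
    """Average number of toggled bits between consecutive values."""
    pairs = list(zip(values, values[1:]))
    total = sum(toggled_bits(encode(a, bits), encode(b, bits)) for a, b in pairs)
    return total / len(pairs)


# Small values alternating in sign, as is typical of DNN weights.
values = [3, -2, 1, -4, 2, -1]
toggles_2c = mean_toggles(twos_complement, values)
toggles_smr = mean_toggles(sign_magnitude, values)

print(f"+3 -> -3 toggles: 2C = "
      f"{toggled_bits(twos_complement(3), twos_complement(-3))}, "
      f"SMR = {toggled_bits(sign_magnitude(3), sign_magnitude(-3))}")
```

For this sequence, 2C toggles most of the eight bits on every sign change, whereas SMR toggles only the sign bit plus a few LSBs.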
In Figure 8a and Figure 8b we confirm this intuition using
full timing simulations on TIMIT. Observe that although
2C and SMR have the same worst-case delay values², the
mean delay of SMR for small negative weights is significantly
smaller than that for 2C.
Using both full timing simulations and FATE on MNIST,
Reuters and TIMIT, we confirm that SMR results in more
favorable classification accuracy versus energy trade-offs com-
pared to 2C. However, since 2C itself performs quite well on
these three benchmarks, the improvements from SMR are
relatively modest.
We now use FATE-Samp to compare 2C versus SMR on
AlexNet, for which full timing simulations on the validation
dataset are infeasible. In Figure 9(a)-(b), we plot the per-
layer timing error rates for 2C and SMR implementations.
In both cases Layer 1 has the highest rate of timing errors,
but the timing error rates for SMR are lower than those for
2C. The timing error rates for other layers are comparable.
Figure 9(c) compares the classification accuracy vs. energy
trade-offs obtained using TE-Drop based timing speculation
on 2C and SMR implementations³. We observe that
SMR significantly outperforms 2C, resulting in up to 14.45%
higher classification accuracy for the same energy savings.
Conversely, SMR enables greater energy savings for similar
classification accuracy. These analyses are enabled by the
simulation speed-ups that FATE provides.
6. CONCLUSION AND FUTURE WORK
In this paper we have presented FATE, a methodology for
fast and accurate timing error rate estimation of timing
speculation based low-power DNN accelerators. Using two novel
techniques, i.e., DNN based delay prediction and statistical
sampling of MAC operations, FATE is able to reduce the
run-time of timing simulations for AlexNet from 384 hours
to between 7 and 48 hours, depending on the approach used.
At the same time, FATE’s timing error rates and DNN
classification accuracy estimates are within 6% and 2% of those
obtained from full simulations. We have used FATE to compare
2C and SMR based implementations of a DNN accelerator
modeled on the Google TPU and show that the SMR
implementation provides more favorable accuracy vs. energy
trade-offs compared to the 2C implementation.
As future work, we would like to provide a rigorous
statistical analysis of FATE, and leverage FATE to compare a
broader range of DNN accelerator architectures under timing
speculation.
² We synthesized both with the same timing target for a fair
comparison.
³ As is common practice, we report both the Top 5 and Top
1 accuracy for ImageNet.
[Figure 7 plots: Classification Accuracy (%) versus Normalized Energy, comparing FATE-Samp, FATE-DNN and Full-Sim. Panels: (a) TIMIT, (b) MNIST, (c) Reuters.]
Figure 7: Accuracy Energy Tradeoff Validation (a) TIMIT, (b) MNIST, and (c) Reuters.
[Figure 8 plots: MAC Delay (ps) versus Quantized Weight Value (-25 to 25) for SMR and 2C. Panels: (a) Maximum MAC Delay versus Weight, (b) Mean MAC Delay versus Weight, (c) Weight Distribution.]
Figure 8: 2C Vs SMR MAC Design, TIMIT Dataset, (a) Max MAC Delay per Weight, (b) Mean MAC Delay per Weight,
and (c) Weight Distribution.
7. REFERENCES
[1] J. Albericio, P. Judd, T. Hetherington, T. Aamodt,
N. E. Jerger, and A. Moshovos. Cnvlutin:
ineffectual-neuron-free deep neural network
computing. In Computer Architecture (ISCA), 2016
ACM/IEEE 43rd Annual International Symposium
on, pages 1–13. IEEE, 2016.
[2] J. Ba and R. Caruana. Do deep nets really need to be
deep? In Advances in neural information processing
systems, pages 2654–2662, 2014.
[3] A. M. d. J. C. Cachopo. Improving methods for
single-label text categorization. Instituto Superior
Técnico, Portugal, 2007.
[4] S. Chakradhar, M. Sankaradas, V. Jakkula, and
S. Cadambi. A dynamically configurable coprocessor
for convolutional neural networks. In ACM SIGARCH
Computer Architecture News, volume 38, pages
247–257. ACM, 2010.
[5] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze.
Eyeriss: An energy-efficient reconfigurable accelerator
for deep convolutional neural networks. IEEE Journal
of Solid-State Circuits, 52(1):127–138, 2017.
[6] M. De Kruijf, S. Nomura, and K. Sankaralingam. A
unified model for timing speculation: Evaluating the
impact of technology scaling, cmos design style, and
fault recovery mechanism. In International Conference
on Dependable Systems and Networks, pages 487–496.
IEEE, 2010.
Figure 9: ImageNet dataset, AlexNet: (a) timing error probabilities (%) of layers 1 to 5 versus voltage underscaling
ratio for the 2's complement MAC, (b) the same for the SMR MAC, and (c) energy-accuracy trade-off, plotting top-1 and
top-5 classification accuracy (%) against normalized energy for both MAC designs.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and
L. Fei-Fei. ImageNet: A large-scale hierarchical image
database. In Computer Vision and Pattern
Recognition, 2009. CVPR 2009. IEEE Conference on,
pages 248–255. IEEE, 2009.
[8] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo,
X. Feng, Y. Chen, and O. Temam. ShiDianNao:
Shifting vision processing closer to the sensor. In ACM
SIGARCH Computer Architecture News, volume 43,
pages 92–104. ACM, 2015.
[9] D. Ernst, S. Das, S. Lee, D. Blaauw, T. Austin,
T. Mudge, N. S. Kim, and K. Flautner. Razor:
circuit-level correction of timing errors for low-power
operation. IEEE Micro, 24(6):10–20, 2004.
[10] M. Fojtik, D. Fick, Y. Kim, N. Pinckney, D. Harris,
D. Blaauw, and D. Sylvester. Bubble Razor: An
architecture-independent approach to timing-error
detection and correction. In International Solid-State
Circuits Conference, pages 488–490. IEEE, 2012.
[11] A. Gebregiorgis et al. Error propagation aware timing
relaxation for approximate near threshold computing.
In Proceedings of the 54th Annual Design Automation
Conference (DAC), pages 1–6. IEEE, 2017.
[12] B. Greskamp, L. Wan, U. R. Karpuzcu, J. J. Cook,
J. Torrellas, D. Chen, and C. Zilles. Blueshift:
Designing processors for timing speculation from the
ground up. In 2009 IEEE 15th International
Symposium on High Performance Computer
Architecture, pages 213–224, Feb 2009.
[13] X. Jiao, A. Rahimi, Y. Jiang, J. Wang, H. Fatemi,
J. P. de Gyvez, and R. Gupta. CLIM: A cross-level
workload-aware timing error prediction model for
functional units. IEEE Transactions on Computers,
2017.
[14] X. Jiao et al. An assessment of vulnerability of
hardware neural networks to dynamic voltage and
temperature variations. In Proceedings of IEEE/ACM
ICCAD, 2017.
[15] N. P. Jouppi, C. Young, N. Patil, D. Patterson,
G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden,
A. Borchers, et al. In-datacenter performance analysis
of a tensor processing unit. arXiv preprint
arXiv:1704.04760, 2017.
[16] A. Karpathy, G. Toderici, S. Shetty, T. Leung,
R. Sukthankar, and L. Fei-Fei. Large-scale video
classification with convolutional neural networks. In
Proceedings of the IEEE conference on Computer
Vision and Pattern Recognition, pages 1725–1732,
2014.
[17] D. Kim et al. ZeNA: Zero-aware neural network
accelerator. IEEE Design & Test, 2017.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton.
ImageNet classification with deep convolutional neural
networks. In Advances in neural information
processing systems, pages 1097–1105, 2012.
[19] F. Nakhaee et al. Lifetime improvement by exploiting
aggressive voltage scaling during runtime of
error-resilient applications. Integration, the VLSI
Journal, 2017.
[20] S. Park, K. Bong, D. Shin, J. Lee, S. Choi, and H.-J.
Yoo. 4.6 A 1.93TOPS/W scalable deep learning/inference
processor with tetra-parallel MIMD architecture for
big-data applications. In Solid-State Circuits
Conference (ISSCC), 2015 IEEE International, pages
1–3. IEEE, 2015.
[21] B. Reagen et al. Minerva: Enabling low-power,
highly-accurate deep neural network accelerators. In
ACM/IEEE ISCA, pages 267–278, 2016.
[22] M. Sankaradas, V. Jakkula, S. Cadambi,
S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P.
Graf. A massively parallel coprocessor for
convolutional neural networks. In Application-specific
Systems, Architectures and Processors, 2009. ASAP
2009. 20th IEEE International Conference on, pages
53–60. IEEE, 2009.
[23] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to
sequence learning with neural networks. In Advances
in neural information processing systems, pages
3104–3112, 2014.
[24] A. Tosson, S. Garg, and M. Anis. Tagged probabilistic
simulation based error probability estimation for
better-than-worst case circuit design. In Very Large
Scale Integration (VLSI-SoC), 2013 IFIP/IEEE 21st
International Conference on, pages 368–373. IEEE,
2013.
[25] L. Wan and D. Chen. DynaTune: circuit-level
optimization for timing speculation considering
dynamic path behavior. In Proceedings of the 2009
International Conference on Computer-Aided Design,
pages 172–179. ACM, 2009.
[26] P. N. Whatmough, S. Das, D. M. Bull, and
I. Darwazeh. Circuit-level timing error tolerance for
low-power DSP filters and transforms. IEEE
Transactions on Very Large Scale Integration (VLSI)
Systems, 21(6):989–999, 2013.
[27] P. N. Whatmough, S. K. Lee, H. Lee, S. Rama,
D. Brooks, and G.-Y. Wei. 14.3 A 28nm SoC with a 1.2
GHz 568nJ/prediction sparse deep-neural-network
engine with >0.1 timing error rate tolerance for IoT
applications. In Solid-State Circuits Conference
(ISSCC), 2017 IEEE International, pages 242–243.
IEEE, 2017.
[28] R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C.
Hoe. Statistical sampling of microarchitecture
simulation. ACM Transactions on Modeling and
Computer Simulation (TOMACS), 16(3):197–224,
2006.
[29] J. Zhang, K. Rangineni, Z. Ghodsi, and S. Garg.
Thundervolt: Enabling aggressive voltage underscaling
and timing error resilience for energy efficient deep
neural network accelerators. 2018.
[30] M.-L. Zhang and Z.-H. Zhou. Multilabel neural
networks with applications to functional genomics and
text categorization. IEEE Transactions on Knowledge
and Data Engineering, 18(10):1338–1351, 2006.
