A Machine Learning Pipeline Stage for Adaptive Frequency Adjustment by Ajirlou, Arash Fouman & Partin-Vaisband, Inna
A Machine Learning Pipeline Stage for Adaptive
Frequency Adjustment
Arash Fouman Ajirlou, Student Member, IEEE, Inna Partin-Vaisband, Member, IEEE
Abstract—A machine learning (ML) design framework is proposed for adaptively adjusting clock frequency based on propagation
delay of individual instructions. A random forest model is trained to classify propagation delays in real time, utilizing current operation
type, current operands, and computation history as ML features. The trained model is implemented in Verilog as an additional pipeline
stage within a baseline processor. The modified system is experimentally tested at the gate level in 45 nm CMOS technology, exhibiting
a speedup of 70% and energy reduction of 30% with coarse-grained ML classification. A speedup of 89% is demonstrated with finer
granularities with 15.5% reduction in energy consumption.
Index Terms—Computer Systems Organization, Microprocessors and microcomputers, Hardware, Pipeline, Processor Architectures,
Pipeline processors, Pipeline implementation, VLSI Systems, Impact of VLSI on system design, VLSI, System architectures,
integration and modeling, Design Methodology, Cost/performance, Machine learning, Classifier design and evaluation
1 INTRODUCTION
The primary design goal in computer architecture is to maximize the performance of a system under
power, area, temperature, and other application-specific
constraints. The heterogeneous nature of VLSI systems and the
adverse effect of process, voltage, and temperature (PVT)
variations have raised challenges in meeting timing con-
straints in modern integrated circuits (ICs). To address these
challenges, timing guardbands have constantly been in-
creased, limiting the operational frequency of synchronous
digital circuits. On the other hand, the increasing variety of
functions in modern processors increases delay imbalance
among different signal propagation paths. Bounded by criti-
cal path delay, these systems are traditionally designed with
pessimistically slow clock period, yielding underutilized IC
performance. Moreover, power efficiency of these underuti-
lized systems also degrades due to the increasing power
leakage. Alternatively, when designed with relaxed timing
constraints, integrated systems are prone to functional fail-
ures. To simultaneously maintain correct functionality and
increase system performance, numerous optimization tech-
niques as well as offline and online models have recently
been proposed including: pipelining, multicore computing,
dynamic frequency and voltage scaling (DVFS), and ML
driven models [1], [2], [3], [4], [5], [6], [7], [8], [9].
Propagation delay in a processor is a strong function
of the type, input operands, and output of the current op-
eration, and computation history [4]. Computation history
accounts for data overwrite and crosstalk noise. Intuitively,
the majority of operations are completed within a small portion
of the clock period, as determined by the slowest path in the
circuit. Based on the path delay distribution reported in [5],
the operational frequency can be doubled for the majority (e.g.,
86.7% in [5]) of instructions in a typical program.
• A. Fouman was with the Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, IL, 60607.
E-mail: afouma2@uic.edu
• I. Partin-Vaisband was with the Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, IL, 60607.
E-mail: vaisband@uic.edu
While multicore approaches have been proposed to
enhance system performance, the scalability of modern
multicore systems is limited by the design complexity of
instruction-level parallelism and thermal design power constraints
[10], [11]. Thus, speeding up single-thread execution
is an important cornerstone for enhancing performance in
modern ICs [12]. This is, therefore, the primary focus of the
proposed approach. To the best of the authors’ knowledge,
this paper is the first to employ ML for adaptively adjusting
the clock frequency at the instruction level. Note that with
the proposed method, the clock frequency is adaptively ad-
justed per instruction in real time, yielding a fundamentally
different approach as compared with the traditional, task-
based dynamic frequency scaling. The main contributions
of this work are as follows:
1) A systematic flow is proposed and implemented as a
unified platform for extracting ML input features from
an instruction and classifying the instruction execution
delay in real time.
2) A random forest (RF) model is trained to classify in-
dividual instructions into delay classes based on their
type, input operands, and the computation history of
the system.
3) A new pipeline stage is integrated within a pipelined
MIPS processor.
4) The proposed method is synthesized and verified on
LegUp [13] benchmark suite of programs with Synop-
sys Design Compiler in 45 nm CMOS technology node.
The rest of the paper is organized as follows. Section
2 describes prior and related work. Section 3 explains the
proposed unified platform and the design methodology.
ML algorithms for classification of instruction delay are
described in Section 4. In Section 5 the implementation
details of the system are introduced. Experimental results
are presented in Section 6. Conclusions and future work
are discussed in Section 7, and the paper is summarized
in Section 8.
arXiv:2007.01820v1 [cs.AR] 2 Jul 2020
2 PRIOR AND RELATED WORK
Multiple approaches have been proposed for efficiently
tuning the operating point (i.e., voltage supply and clock
frequency) of a system at various levels of a computing
system, including application- and task-based methods and
instruction-level speculations.
Predicting timing violations in a constraint-relaxed sys-
tem is impractical with deterministic approaches, due to the
wide dynamic range of input and output signals (typically
32 or 64 bits), variety of operations in a modern processor,
and delay dependence on the runtime and physical char-
acteristics of the system (e.g., crosstalk noise). ML based
approaches for predicting timing violations of individual
instructions have recently been proposed, which consider
the impact of input operands and computation history on
timing violations [4], [14], [15]. While significant for the
design process of next generation scalable high performance
systems, these approaches have several limitations:
1) Instruction output is considered as an ML feature and
exploited in these systems for predicting the timing
characteristics of the individual instructions. These pre-
dictions are, however, carried out before the instruc-
tion execution, when the instruction output is not yet
available, limiting the effectiveness of these methods in
practical systems.
2) The modules under test are studied separately and
evaluated in an isolated test environment without the
effects of other processing elements (e.g., arithmetic
modules, buffers or multiplexers). The high reported
accuracy is, therefore, expected to degrade if the meth-
ods are applied to a complex system (e.g., a practical
execution unit).
3) Power and timing overheads due to additional hard-
ware are not considered in these papers.
Granularity of prediction is another primary concern. A
bit-level ML based method has been proposed in [16] for
predicting timing violations with reduced timing guard-
bands. While up to 95% prediction accuracy has been re-
ported with this method, the excessively high, per bit granu-
larity of the ML predictions is expected to exhibit substantial
power, area, and timing overheads. These overheads are,
however, not evaluated in [16]. Furthermore, a procedure
for recovery upon a timing error is not provided and the
recovery overheads are also not considered.
As an alternative to fine-grain high-overhead ML meth-
ods, multiple coarse-grain schemes for timing error de-
tection and recovery have been proposed to mitigate the
adverse effect of the pessimistic design constraints. A better-
than-worst-case design approach has been introduced in [5].
With this approach, the clock period is set to a statistically
nominal value (rather than worst-case propagation delay)
and the history of timing erroneous program counters is
kept in a ternary content-addressable memory (TCAM).
The TCAM is exploited for predicting timing violations
of the instructions based on previous observations. Note
that the system only warns against those timing violations
that have been previously recorded. Unseen
violations are not predicted with this approach. Owing
to the apparent simplicity of this approach, only bi-state
operating conditions (i.e., nominal and worst-case clock
frequencies) can be efficiently utilized with this method.
With additional frequency domains, the design complexity
and system overheads are expected to increase significantly.
In BandiTS [17], a reinforcement learning approach has
been proposed to estimate the timing error probability (TEP)
within a program time interval, given timing speculation
(TS) ratios, $TSR = t_{clk}/t_{nom}$ for various values of the
reduced clock period $t_{clk}$, and the worst-case clock period
$t_{nom}$. The TS-based TEP problem is modeled in [17] as the
classical multi-armed bandit problem [18], where the TS
ratios and TEPs correspond to, respectively, the arms and
stochastic rewards. The primary limitation of that work is
the lack of details about the hardware implementation and
overheads. In addition, the maximum achievable perfor-
mance gain of only 25% has been reported. Furthermore,
the BandiTS approach exhibits per-task clock granularity and
scales the clock frequency for a batch of instructions. Higher
performance gain is possible with fine-grain, per instruction
clock frequency adjustment, as shown in this paper.
Thermal-aware voltage scaling has been proposed in
[19]. A voltage selection algorithm has been developed and
integrated within FPGA synthesis process to aggressively
scale the core and block RAM voltages, utilizing the avail-
able thermal headroom of the FPGA-mapped design. As
a result, 36% reduction in power consumption has been
demonstrated. Driven by workload and thermal power dis-
sipation, this method, however, supports only coarse-grain
voltage and frequency scaling.
Predicting program error rate in timing-speculative pro-
cessors has been proposed in [20]. A statistical model is
developed for predicting dynamic timing slack (DTS) at
various pipeline stages. The predicted DTS values are ex-
ploited to estimate the timing error rate in a program. The
implementation overheads, and the potential performance
or power consumption gains are, however, not reported
with this approach.
An offline model for TS processors has been introduced
in [21]. This probabilistic model is trained to optimally se-
lect a better-than-worst-case, nominal clock frequency. The
provided hardware-based speculation, however, does not
consider the overall workload or specific finer units, limiting
the fidelity of the method. On the other hand, the adverse effect of
process variations on the propagation delay is considered,
strengthening the approach in [21]. Note that PVT variations
are also considered with the proposed approach of
classifying instructions into delay intervals in real time, as
described in the following sections.
Finally, ML based methods for modeling system behav-
ior have also been proposed. For example, in [6], linear re-
gression has been leveraged for modeling the aging behav-
ior of an embedded processor based on current instruction
and its operands, as well as the computation history and
overall circuit switching activity. As a result, the timing
guardband designed to compensate for aging in digital
circuits can be effectively reduced in the presence of graceful
degradation [6]. Reallocation of delay budget has, however,
not been considered with this method.
ML ICs can exhibit prohibitively high power consumption
and physical size. Furthermore, ML ICs can introduce
additional delay and increase design complexity, depending
upon the application characteristics. To efficiently exploit
ML methods for managing frequency in modern processors,
delay, power, and area of ML ICs should be considered.
3 THE PROPOSED ML BASED FREQUENCY ADJUSTMENT
In this paper, a design methodology is proposed for ML
driven adjustment of operational frequency in pipeline pro-
cessors. With the proposed method, individual instructions
are classified into the corresponding propagation delay
classes in real time, and the clock frequency is accordingly
adjusted to reduce the gap between the actual propagation
delay and the clock period. The classes are defined by
segmenting the worst-case clock period into shorter delay
fragments. Each class is characterized by a specific supply
voltage and clock frequency. The primary design objective
is to maximize system performance within an allocated
energy budget. The overall delay and energy consumption
are evaluated accounting for the additional ML components and for both
correct and incorrect predictions. The proposed scalable
framework allows for other control configurations to be
defined in a similar manner for different design objectives.
The real-time clock adjustment is enabled by the recent
advancement in clock management circuits [24].
In order to evaluate this method, a pipelined, 32-bit MIPS
processor (TigerMIPS [22]) is utilized as the baseline proces-
sor. The ML classifier is designed as an additional pipeline
stage within the pipelined MIPS processor, as shown in
Fig. 1. The inputs to the additional ML pipeline stage are
the current instruction and its operands, as well as the
computation history, as defined by the toggled input bits
(i.e., current inputs are XORed with the previous inputs)
and output of the previous operation. The choice of these
parameters is in accordance with the results in [4] and [6].
These inputs are utilized as ML features for predicting the
delay class of the current instruction based on the trained
ML model. It is important to note that more complex, slower
ML models can also be trained with this methodology, as
long as the design complexity and hardware costs of the
final system meet the specified constraints. To meet the
overall system throughput constraints, the trained models
can be implemented as multiple pipeline stages, mitigating
the additional latency introduced by the ML functions.
Finally, the granularity of the output delay (e.g., three delay
classes are illustrated in Fig. 1) can be varied to meet the
timing constraints within the energy budget.
Fig. 1: The proposed pipeline with the additional ML stage.
In this configuration, six ML features and three delay classes
are illustrated.
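The inputs to the ML stage described above can be sketched in software as follows. This is only an illustration of how the six features could be assembled, not the Verilog implementation used in the paper: the names `Features` and `build_features` and the operand values are hypothetical, and 32-bit operands are assumed to match the baseline processor.

```python
from typing import NamedTuple

MASK32 = 0xFFFFFFFF  # assumption: 32-bit operands, as in the baseline processor


class Features(NamedTuple):
    instr: str        # operation type of the current instruction
    op1: int          # first operand
    op2: int          # second operand
    xop1: int         # toggled bits of the first operand (current XOR previous)
    xop2: int         # toggled bits of the second operand (current XOR previous)
    prev_output: int  # output of the preceding operation


def build_features(instr, op1, op2, prev_op1, prev_op2, prev_output):
    """Assemble the six ML features for one instruction (hypothetical helper)."""
    op1, op2 = op1 & MASK32, op2 & MASK32
    return Features(
        instr=instr,
        op1=op1,
        op2=op2,
        xop1=op1 ^ (prev_op1 & MASK32),
        xop2=op2 ^ (prev_op2 & MASK32),
        prev_output=prev_output & MASK32,
    )


# Example with made-up operand values.
f = build_features("arithmetic", 0x0000000F, 0x000000F0,
                   prev_op1=0x00000005, prev_op2=0x000000F0,
                   prev_output=0x00000014)
print(f)
```

An operand that does not change between consecutive instructions yields a zero toggle feature, which is how the computation history enters the classifier input.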
A systematic flow has been developed, implemented,
and verified on TigerMIPS with LegUp benchmark suite.
The flow comprises three primary phases, as shown in Fig.
2. The individual phases are described in the following
subsections.
3.1 Phase 1: Baseline processor synthesis and profiling
First, the high-level hardware description language (HDL)
model of the baseline processor is synthesized into a gate-level
description model. During this phase, timing information
is generated in the IEEE standard delay format (SDF).
Based on this information, the gate-level simulation (GLS) is
performed and the instruction-level execution profile is gen-
erated. A profile comprises a list of instructions, the fetched
or forwarded operands, the output of the operations, and
the propagation delays. In addition to the execution profile,
post place-and-route (PAR) reports, including timing and
power information, are collected in this phase.
3.2 Phase 2: ML training
In this phase, the gate-level profiles from Phase 1 are parsed
and utilized as ML features. Based on the extracted features,
a preferred ML model is trained in Python with Scikit-learn
ML library [23]. HDL code (e.g., Verilog in this paper) of
the trained model is generated and integrated within the
baseline processor as a single (or multiple) pipeline stage(s)
between the decode and execute stages (see Fig. 1).
3.3 Phase 3: Verification and Evaluation
During this phase, the modified high-level HDL model of
the system with the ML pipeline stage is synthesized and
profiled, as described in Phase 1. To guarantee functional
correctness, the output signal is double-sampled to detect
timing violations, and timing-erroneous instructions are re-
executed with the worst-case clock frequency. Similar to
the baseline iteration, the post PAR reports are extracted
for evaluating the timing and energy characteristics of the
system. Finally, the profiling of the modified system is
executed during this phase to evaluate the overall speedup
of the system.
To optimize the final solution in terms of the operational
frequency and energy consumption, the proposed flow is
executed iteratively with various ML algorithms and clock
fragments, as shown with the feedback in Fig. 2. The clock
signal of the pipeline registers is assumed to be near-
instantly switched based on the individual classification
results, as has been experimentally demonstrated in [24].
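The effect of re-executing timing-erroneous instructions on the overall speedup can be illustrated with a simple timing model. This is a rough sketch under stated assumptions, not the evaluation procedure of the paper: the function name `estimate_speedup` is hypothetical, each instruction is assumed to occupy one clock cycle at the period of its predicted class, and each timing violation is assumed to cost a fixed number of additional cycles at the worst-case period (the default of four follows the re-execution penalty discussed in Section 4.4).

```python
def estimate_speedup(true_classes, predicted_classes, class_periods_ns,
                     penalty_cycles=4):
    """Estimate speedup over always clocking at the worst-case period.

    class_periods_ns[k] is the clock period of delay class k, ordered from
    fastest to slowest; the last entry is the worst-case period.
    Assumption: a violation costs `penalty_cycles` extra worst-case cycles
    in addition to the failed fast cycle.
    """
    if not true_classes:
        raise ValueError("at least one instruction is required")
    worst = class_periods_ns[-1]
    total = 0.0
    for true_k, pred_k in zip(true_classes, predicted_classes):
        total += class_periods_ns[pred_k]      # cycle at the predicted period
        if pred_k < true_k:                    # clock too fast: timing violation
            total += penalty_cycles * worst    # re-execute at the worst-case clock
    baseline = worst * len(true_classes)
    return baseline / total


# Two-class example with the class boundaries used as clock periods (ns).
periods = [2.2, 4.0]
s_ok = estimate_speedup([0, 0, 1, 1], [0, 0, 1, 1], periods)
s_bad = estimate_speedup([0, 0, 1, 1], [0, 0, 1, 0], periods)
print(s_ok, s_bad)
```

In this toy example a single slow instruction predicted as fast is enough to push the estimated speedup below one, whereas predicting a fast instruction as slow only forfeits part of the gain.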
Fig. 2: Systematic flow for designing ML predictor within a typical pipelined processor.
4 MACHINE LEARNING MODELS
Owing to the unique learning characteristics and hardware
trade-offs of neural networks (NNs), support vector machines
(SVMs), and random forest (RF) models, all these
ML models are considered in this paper. Each model is
trained based on the instruction profiles extracted from a
synthetically generated dataset of 3,000 random instructions
per class. The delay boundaries of the individual classes are
experimentally determined with respect to the worst-case
delay of 4 ns as follows: {[0.0, 2.2], (2.2, 4.0]} for the two-class
configuration, {[0.0, 1.8], (1.8, 2.6], (2.6, 4.0]} for the three-class
configuration, and {[0.0, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]}
for the four-class configuration.
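The class boundaries above translate directly into a labeling rule for the profiled propagation delays. The following sketch shows one way to assign a delay (in ns) to its class; the names `CLASS_UPPER_BOUNDS_NS` and `delay_class` are hypothetical, and class indices are assumed to start at zero for the fastest class.

```python
from bisect import bisect_left

# Upper bounds (ns) of the delay classes reported for each configuration.
CLASS_UPPER_BOUNDS_NS = {
    2: [2.2, 4.0],
    3: [1.8, 2.6, 4.0],
    4: [1.0, 2.0, 3.0, 4.0],
}


def delay_class(delay_ns, n_classes):
    """Return the zero-based delay class of a propagation delay."""
    bounds = CLASS_UPPER_BOUNDS_NS[n_classes]
    if not 0.0 <= delay_ns <= bounds[-1]:
        raise ValueError("delay outside [0, worst-case delay]")
    # bisect_left returns the first index i with bounds[i] >= delay_ns,
    # so a delay equal to a bound stays in the lower (right-closed) class.
    return bisect_left(bounds, delay_ns)


print(delay_class(1.8, 3), delay_class(1.81, 3), delay_class(4.0, 4))
```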
The feature vector of the $i$th instruction comprises six
elements, $x_i = (instr, op1, op2, Xop1, Xop2, output)$. The
first feature, instr, comprises four subfeatures, representing
the type of the operation in one-hot format,
\[
instr =
\begin{cases}
1000, & \text{if arithmetic} \\
0100, & \text{if arithmetic with immediate operand} \\
0010, & \text{if logical} \\
0001, & \text{if multiplication or division}
\end{cases}
\]
The subsequent four elements are defined by the operands.
The features op1 and op2 are the first and second operands
of the instruction, and the features Xop1 and Xop2 are the
XORed values of the first and second operands with their
respective previous values. The last feature, output, is the
output of the preceding instruction. The last three elements
of the feature vector are exploited to capture the effect of
computation history on the instruction delay. Note that the
operands and output of the preceding instruction are 32-
bit long, as determined by the 32-bit baseline processor
utilized in this work. Thus, the distribution of these features
significantly differs from the distribution of the operation
type subfeatures. To balance the overall distribution of the
individual features, the input features are preprocessed and
scaled to follow a normal distribution using the quantile transformer
in the Python scikit-learn library. An example of operand
and output features with and without the transformation is
shown in Fig. 3 for arithmetic and logical instructions. Note
that the type subfeatures remain unchanged.
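The preprocessing step can be sketched with scikit-learn's `QuantileTransformer`. The data below is randomly generated stand-in data, not the instruction profiles of the paper, and the column layout (four one-hot type bits followed by five 32-bit columns) is an assumption made for illustration.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
n = 1000

# Hypothetical stand-in data: four one-hot type bits followed by five
# 32-bit columns (op1, op2, Xop1, Xop2, output).
type_bits = np.eye(4)[rng.integers(0, 4, size=n)]
wide = rng.integers(0, 2**32, size=(n, 5)).astype(np.float64)

# Map each 32-bit column onto an approximately normal distribution.
qt = QuantileTransformer(n_quantiles=n, output_distribution="normal",
                         random_state=0)
wide_scaled = qt.fit_transform(wide)

# The one-hot type subfeatures are passed through unchanged.
X = np.hstack([type_bits, wide_scaled])
print(X.shape)
```

Only the operand and output columns are transformed, which brings their scale in line with the binary type subfeatures.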
To evaluate the efficiency and efficacy of the proposed
method, propagation delay classification is investigated
with three common ML algorithms: NN, SVM, and RF. The
configuration of each of the three ML models is described in
the following subsections, including the hyperparameters,
performance, and hardware costs of the individual ML
algorithms. All the algorithms are five-fold cross-validated
based on three thousand randomly generated instructions
per class. While finding an effective metric for stability of the
evaluation is still an open question, k-fold cross-validation
with 5 ≤ K ≤ 20 is typically used, as these K values have
been demonstrated to simultaneously minimize the bias and
variance across many studied test sets [25], [26], [27], [28].
Thus, K = 5 is used in this work. ML accuracy is reported
as the F1-score of delay classification and the resultant
speedup for each benchmark program has been considered
in determining the performance of each ML algorithm.
Hardware cost is evaluated as the number of additional
transistors required for implementing the individual ML
algorithms and has also been considered in determining the
performance of the ML algorithms. Among the evaluated
ML algorithms, the RF classifier is preferred in this work
due to the favorable tradeoff between the performance gain
and hardware costs, as well as the relative simplicity of the
RF algorithm, as explained in the following subsections.
4.1 Neural Networks
NNs excel in learning complex hidden patterns in large
datasets and have exhibited a particular supremacy in vision
and text applications as compared with classical ML algo-
rithms. Following this success, promising results have been
shown with NNs in various hardware related applications
[29], [30], [31].
Fig. 3: A typical feature vector with and without the ML
preprocessing, (a) for arithmetic operation with immediate
operand, and (b) for logical operation. Note that the values
without preprocessing are shown on a logarithmic scale,
while the values with preprocessing are shown on a linear
scale.
To determine the preferred set of hyperparameters for
the two-, three-, and four-class NN models, a grid search is
executed for each multiclass NN over the following ranges:
1) Identity, tanh, logistic, and ReLU activation functions,
2) Stochastic gradient descent [32], lbfgs (a limited-memory
BFGS quasi-Newton optimization algorithm [33]),
and Adam (an adaptive learning rate optimization algorithm
[34]) solvers, and
3) A single $m$-neuron hidden layer ($m \in \{5, 10, 15, 20\}$)
and two hidden NN layers with $m_1$ and $m_2$ neurons
in, respectively, the first and second layers ($m_1 \times m_2 \in
\{20 \times 5, 20 \times 10, 20 \times 15\}$).
The networks are trained using the backpropagation algorithm
for 200 epochs until convergence with a quasi-Newton optimizer.
Note that the number of neurons in the input and
output layers is determined by, respectively, the number
of ML features (nine, including the four instruction type
subfeatures) and the number of ML classes (two, three,
and four). The top ten grid search results (within 1% of
the highest F1-score) are listed in Table 1 for each of the
multiclass NNs in the descending order of the F1-scores.
TABLE 1: Top (within 1% of the highest F1-score) NN configurations and their respective performance metrics (i.e., speedup, hardware cost (in million transistors), and speedup per hardware metric (SPH)).

4 classes
| # | Activation | Solver | Neurons | F1-score | Speedup | HW cost | SPH |
|---|---|---|---|---|---|---|---|
| 1 | tanh | lbfgs | 10 | 0.859 | 1.915 | 2.834 | 0.676 |
| 2 | relu | adam | 20×5 | 0.858 | 1.940 | 6.543 | 0.297 |
| 3 | relu | adam | 20×10 | 0.856 | 1.882 | 9.166 | 0.205 |
| 4 | tanh | adam | 20×10 | 0.856 | 1.911 | 9.166 | 0.208 |
| 5 | tanh | adam | 20×15 | 0.853 | 1.916 | 11.789 | 0.163 |
| 6 | tanh | lbfgs | 20 | 0.850 | 1.934 | 5.671 | 0.341 |
| 7 | relu | lbfgs | 20 | 0.848 | 1.940 | 5.671 | 0.342 |
| 8 | logistic | lbfgs | 20 | 0.848 | 1.836 | 5.671 | 0.324 |
| 9 | tanh | adam | 20×5 | 0.845 | 1.946 | 6.543 | 0.297 |
| 10 | relu | adam | 20×15 | 0.844 | 1.951 | 11.789 | 0.165 |
| Average | | | | 0.852 | 1.917 | 7.484 | 0.302 |
| Positive standard deviation (σ+) | | | | 0.002 | 0.012 | 1.634 | 0.095 |
| Negative standard deviation (σ−) | | | | 0.002 | 0.018 | 0.961 | 0.040 |

3 classes
| # | Activation | Solver | Neurons | F1-score | Speedup | HW cost | SPH |
|---|---|---|---|---|---|---|---|
| 1 | logistic | lbfgs | 20×15 | 0.923 | 1.645 | 11.461 | 0.143 |
| 2 | tanh | adam | 20×15 | 0.922 | 1.642 | 11.461 | 0.143 |
| 3 | logistic | lbfgs | 20×10 | 0.921 | 1.643 | 8.948 | 0.184 |
| 4 | relu | lbfgs | 20×5 | 0.920 | 1.642 | 6.434 | 0.255 |
| 5 | logistic | lbfgs | 15 | 0.920 | 1.642 | 3.925 | 0.418 |
| 6 | relu | adam | 20 | 0.919 | 1.643 | 5.234 | 0.314 |
| 7 | tanh | adam | 20×5 | 0.919 | 1.642 | 6.434 | 0.255 |
| 8 | tanh | lbfgs | 5 | 0.918 | 1.643 | 1.307 | 1.257 |
| 9 | relu | adam | 20×5 | 0.914 | 1.643 | 6.434 | 0.255 |
| 10 | tanh | adam | 20×10 | 0.914 | 1.643 | 8.948 | 0.184 |
| Average | | | | 0.919 | 1.643 | 7.059 | 0.341 |
| Positive standard deviation (σ+) | | | | 0.001 | 3.7E-4 | 1.694 | 0.460 |
| Negative standard deviation (σ−) | | | | 0.001 | 4.0E-4 | 1.148 | 0.049 |

2 classes
| # | Activation | Solver | Neurons | F1-score | Speedup | HW cost | SPH |
|---|---|---|---|---|---|---|---|
| 1 | logistic | adam | 20×15 | 0.972 | 1.682 | 11.134 | 0.151 |
| 2 | identity | lbfgs | 20×10 | 0.972 | 1.682 | 8.730 | 0.193 |
| 3 | identity | lbfgs | 20 | 0.972 | 1.682 | 4.797 | 0.351 |
| 4 | identity | lbfgs | 20×5 | 0.972 | 1.682 | 6.326 | 0.266 |
| 5 | identity | lbfgs | 10 | 0.972 | 1.682 | 2.398 | 0.701 |
| 6 | identity | lbfgs | 5 | 0.972 | 1.682 | 1.198 | 1.404 |
| 7 | identity | lbfgs | 20×15 | 0.972 | 1.682 | 11.134 | 0.151 |
| 8 | relu | adam | 10 | 0.972 | 1.682 | 2.398 | 0.701 |
| 9 | identity | lbfgs | 15 | 0.972 | 1.681 | 3.598 | 0.467 |
| 10 | identity | adam | 5 | 0.969 | 1.678 | 1.198 | 1.400 |
| Average | | | | 0.972 | 1.681 | 5.291 | 0.579 |
| Positive standard deviation (σ+) | | | | 1.0E-4 | 1.8E-4 | 2.252 | 0.294 |
| Negative standard deviation (σ−) | | | | 0.003 | 0.002 | 1.217 | 0.137 |

The hardware cost is determined based on the number
of transistors comprising the NN adders and multipliers.
The transistor count for the individual NN adders and
multipliers is determined based on [35]. The number of
multipliers, $N_{MULT}$, and adders, $N_{ADD}$, in a NN with $L$
layers is determined, respectively, as
\[
N_{MULT} = \sum_{i=1}^{L} m_i \cdot v_i, \tag{1}
\]
and
\[
N_{ADD} = \sum_{i=1}^{L} m_i \cdot (v_i - 1), \tag{2}
\]
where $m_i$ is the number of neurons in each layer, and $v_i$
is the size of the input vector to each layer (or the feature
vector size in the input layer).
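Eqs. (1) and (2) can be evaluated with a few lines of code. In this sketch the function name `nn_op_counts` is hypothetical, and `layer_sizes` is assumed to list every layer that performs multiply-accumulate operations (hidden layers followed by the output layer). Bias additions and activation functions are not counted, consistent with the $(v_i - 1)$ term in Eq. (2).

```python
def nn_op_counts(n_features, layer_sizes):
    """Return (N_MULT, N_ADD) for a fully connected network.

    n_features:  size of the feature vector fed to the first layer
    layer_sizes: neurons per layer (m_i), hidden layers then output layer
    """
    n_mult = 0
    n_add = 0
    v = n_features                 # v_i: input vector size of the current layer
    for m in layer_sizes:
        n_mult += m * v            # Eq. (1): one multiplier per weight
        n_add += m * (v - 1)       # Eq. (2): v_i - 1 additions per neuron
        v = m                      # outputs of this layer feed the next one
    return n_mult, n_add


# Nine input features, hidden layers of 20 and 5 neurons, four output classes.
print(nn_op_counts(9, [20, 5, 4]))
```

Multiplying these counts by per-multiplier and per-adder transistor counts (taken from [35] in the paper, not reproduced here) would give the hardware cost estimate.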
The speedup per hardware cost (SPH) is also listed in
Table 1 for each of the NN configurations. These top NN
results are compared with the SVM and RF top results,
as described at the end of this section. As a general rule,
the learning capacity of a NN increases with the network complexity
(i.e., number of neurons and number of layers). For
a NN to be competitive with or outperform a classical
ML algorithm, a large number of neurons and layers is
required, significantly increasing the system complexity and
hardware overhead of the NN based solutions.
4.2 Support Vector Machines
An SVM classifier generates an optimal hyperplane which separates
data samples in feature space with the objective
to minimize the classification error. A linear SVM can only
classify linearly separable data. Alternatively, to learn
complex nonlinear data patterns, SVM can be combined
with a kernel trick, enabling the feature transformation into
linearly separable space [36]. In this work, a grid search is
performed over the following kernel SVM hyperparameters:
1) Linear, polynomial, and radial basis function (rbf) ker-
nels,
2) Integer degree of flexibility of the polynomial decision
boundary, d ∈ [2, 15], and
3) The influence on the model of a single sample in
a training set with N features and variance Var by
scaling (i.e., gamma = 1/(N · Var)) or not scaling (i.e.,
gamma = 1/N) the kernel coefficient, gamma.
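The two kernel coefficient settings in item 3 correspond to the `gamma="scale"` and `gamma="auto"` options of scikit-learn's SVM classifiers. The helper below reproduces the two formulas with NumPy on a toy matrix; the function name `kernel_gamma` and the data are hypothetical.

```python
import numpy as np


def kernel_gamma(X, mode):
    """Kernel coefficient for a training matrix X (samples x features)."""
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    if mode == "scale":
        # Scaled by the variance of all entries of X: 1 / (N * Var)
        return 1.0 / (n_features * X.var())
    if mode == "auto":
        # Not scaled: 1 / N
        return 1.0 / n_features
    raise ValueError("mode must be 'scale' or 'auto'")


X_toy = [[0.0, 4.0], [4.0, 0.0]]     # two samples, two features, variance 4
print(kernel_gamma(X_toy, "scale"), kernel_gamma(X_toy, "auto"))
```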
The sets of hyperparameters with the highest F1-scores are
listed in Table 2. The speedup, hardware cost, and SPH
metric are also listed in the table for all the SVM configu-
rations. SVM hardware cost is determined as the number
of transistors, based on the method presented in [37]. SVM
often exhibits excellent performance as compared with other
learning algorithms at the expense of higher computational
and design complexity, and accordingly higher power and
area overheads [38]. These tradeoffs are discussed at the end
of this section.
4.3 Random Forest
An RF classifier is an ensemble of decision tree classifiers. The
input samples are split into multiple sample subsets and
each decision tree is trained on one training subset. The
final classification decision for each sample is made based
on the result of averaging the individual tree decisions
(i.e., ensembling). RF models benefit from the accuracy,
training speed, and interpretability of the decision tree
model, while the ensembling mitigates the overfitting
otherwise common to decision tree classifiers. RF is often
preferred in scientific and practical applications [4], [39].
The computational and hardware complexity of RF is a
strong function of the number and depth of the decision
trees. The depth of the individual trees is dependent on
the number of features and their correlation. In this work,
an RF grid search is performed over the following ranges of
hyperparameters:
1) Number of trees in the forest, n_estimators ∈
{1, 10, 50, 100, 200},
2) Maximum number of levels in each tree, max_depth ∈
{10, 20, 30, 40, 50}.
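A grid search over these two hyperparameters with five-fold cross-validation can be set up in scikit-learn as sketched below. The helper name `rf_grid_search` is hypothetical, and the macro-averaged F1-score is an assumption: the text reports F1-scores but does not state the averaging mode.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hyperparameter ranges listed above.
PAPER_GRID = {
    "n_estimators": [1, 10, 50, 100, 200],
    "max_depth": [10, 20, 30, 40, 50],
}


def rf_grid_search(X, y, param_grid=None, n_folds=5):
    """Cross-validated grid search for a random forest delay classifier."""
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid=PAPER_GRID if param_grid is None else param_grid,
        scoring="f1_macro",   # assumption: macro-averaged F1-score
        cv=n_folds,           # five-fold cross-validation by default
    )
    search.fit(X, y)
    return search
```

Calling `rf_grid_search(X, y)` with the default grid trains 25 configurations on each of the five folds; `search.best_params_` and `search.cv_results_` then give the ranking from which a table of top configurations can be built.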
The results of the top estimators (within 1% of the highest
F1-score) are listed in Table 3. The hardware cost of an
RF classifier is evaluated based on the number of required
comparators, O(n_estimators × log2(max_depth)), and reported
in terms of the total number of RF transistors. The transistor
count for a single comparator is determined based on
[40].

TABLE 2: Top (within 1% of the highest F1-score) SVM configurations and their respective performance metrics (i.e., speedup, hardware cost (in million transistors), and speedup per hardware metric (SPH)).

4 classes
| # | kernel | degree | gamma | F1-score | Speedup | HW cost | SPH |
|---|---|---|---|---|---|---|---|
| 1 | poly | 5 | scale | 0.837 | 1.873 | 1323.343 | 0.001 |
| 2 | poly | 4 | scale | 0.834 | 1.899 | 1307.230 | 0.001 |
| 3 | poly | 6 | scale | 0.833 | 1.889 | 1352.670 | 0.001 |
| 4 | poly | 3 | scale | 0.828 | 1.910 | 1291.739 | 0.001 |
| 5 | poly | 7 | scale | 0.827 | 1.923 | 1412.040 | 0.001 |
| 6 | poly | 8 | scale | 0.826 | 1.915 | 1476.320 | 0.001 |
| 7 | poly | 9 | scale | 0.823 | 1.908 | 1534.991 | 0.001 |
| 8 | rbf | N/A | scale | 0.822 | 1.916 | 228.524 | 0.008 |
| 9 | poly | 10 | scale | 0.819 | 1.814 | 1596.913 | 0.001 |
| 10 | poly | 11 | scale | 0.813 | 1.793 | 1654.040 | 0.001 |
| Average | | | | 0.826 | 1.884 | 1317.781 | 0.002 |
| Positive standard deviation (σ+) | | | | 0.003 | 0.010 | 74.701 | 0.006 |
| Negative standard deviation (σ−) | | | | 0.003 | 0.038 | 363.206 | 2.3E-4 |

3 classes
| # | kernel | degree | gamma | F1-score | Speedup | HW cost | SPH |
|---|---|---|---|---|---|---|---|
| 1 | poly | 5 | scale | 0.876 | 1.604 | 755.765 | 0.002 |
| 2 | poly | 4 | scale | 0.875 | 1.617 | 715.028 | 0.002 |
| 3 | poly | 3 | scale | 0.875 | 1.617 | 675.647 | 0.002 |
| 4 | rbf | N/A | scale | 0.875 | 1.617 | 119.954 | 0.013 |
| 5 | poly | 7 | scale | 0.873 | 1.627 | 847.086 | 0.002 |
| 6 | poly | 2 | scale | 0.872 | 1.618 | 677.269 | 0.002 |
| 7 | poly | 6 | scale | 0.872 | 1.601 | 800.553 | 0.002 |
| 8 | poly | 8 | scale | 0.870 | 1.613 | 900.614 | 0.002 |
| 9 | rbf | N/A | auto | 0.869 | 1.617 | 135.990 | 0.012 |
| 10 | poly | 9 | scale | 0.869 | 1.660 | 948.560 | 0.002 |
| Average | | | | 0.873 | 1.619 | 657.647 | 0.004 |
| Positive standard deviation (σ+) | | | | 0.001 | 0.021 | 57.771 | 0.006 |
| Negative standard deviation (σ−) | | | | 0.001 | 0.003 | 374.579 | 7.4E-4 |

2 classes
| # | kernel | degree | gamma | F1-score | Speedup | HW cost | SPH |
|---|---|---|---|---|---|---|---|
| 1 | rbf | N/A | scale | 0.957 | 1.680 | 28.616 | 0.059 |
| 2 | poly | 4 | scale | 0.956 | 1.679 | 169.482 | 0.010 |
| 3 | poly | 5 | scale | 0.955 | 1.681 | 181.843 | 0.009 |
| 4 | poly | 6 | scale | 0.954 | 1.683 | 198.685 | 0.008 |
| 5 | poly | 3 | auto | 0.954 | 1.674 | 323.000 | 0.005 |
| 6 | poly | 3 | scale | 0.953 | 1.677 | 158.120 | 0.011 |
| 7 | poly | 8 | scale | 0.952 | 1.683 | 225.245 | 0.007 |
| 8 | poly | 7 | scale | 0.952 | 1.678 | 212.632 | 0.008 |
| 9 | poly | 2 | scale | 0.951 | 1.666 | 145.830 | 0.011 |
| 10 | rbf | N/A | auto | 0.951 | 1.629 | 27.952 | 0.058 |
| Average | | | | 0.954 | 1.673 | 167.141 | 0.019 |
| Positive standard deviation (σ+) | | | | 9.2E-4 | 0.003 | 29.323 | 0.028 |
| Negative standard deviation (σ−) | | | | 8.3E-4 | 0.022 | 49.4330 | 0.004 |
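The RF hardware cost scaling described in Section 4.3 and the SPH metric can be illustrated numerically. The per-comparator transistor count from [40] is not given in the text, so this sketch calibrates the proportionality constant against a single Table 3 entry (n_estimators = 50, max_depth = 10, 0.120 million transistors). Treating the O(·) expression as an exact proportionality is an assumption, and all function names are hypothetical.

```python
import math


def rf_comparator_units(n_estimators, max_depth):
    """Comparator count up to a constant factor: n_estimators * log2(max_depth)."""
    return n_estimators * math.log2(max_depth)


# Calibration point taken from one Table 3 entry (assumed exact).
REF_COST_M = 0.120                       # million transistors
REF_UNITS = rf_comparator_units(50, 10)  # n_estimators=50, max_depth=10


def rf_hw_cost_m(n_estimators, max_depth):
    """Estimated RF hardware cost in million transistors."""
    return REF_COST_M * rf_comparator_units(n_estimators, max_depth) / REF_UNITS


def sph(speedup, hw_cost_m):
    """Speedup per hardware cost (million transistors)."""
    return speedup / hw_cost_m


print(round(rf_hw_cost_m(50, 30), 3), round(sph(1.835, 0.177), 3))
```

Under this calibration the estimate for 50 trees of depth 30 comes out close to the 0.177 million transistors listed in Table 3, which suggests the reported costs follow the stated scaling.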
4.4 ML Algorithm Tradeoffs
The tradeoffs between the speedup and F1-score are sum-
marized in Fig. 4 for all the classifiers. Note that not in all
the cases speedup increases with F1-score. This is due to the
effect of the type of misclassification on the overall speedup.
For example, if a slow instruction is classified into a faster
class, the result at the output of the execution unit at the
end of the fast clock period is incorrect. Thus, a four-clock-
cycle penalty is incurred to re-execute the slow instruction,
compensating for the combined latency of the re-executed
IF, ID, ML, and EX stages. Alternatively, if a fast instruction
is misclassified into a slow class, the execution still results
7TABLE 3: Top (within 1% of the highest F1-score) RF
configurations and their respective performance metrics
(i.e., speedup, hardware cost (in million transistors), and
speedup per hardware metric (SPH)).
4 classes
max depth n estimator F1-score Speedup HW cost SPH
1 30 50 0.852 1.835 0.177 10.357
2 10 200 0.850 1.925 0.480 4.012
3 30 200 0.850 1.889 0.709 2.666
4 10 100 0.849 1.913 0.240 7.978
5 50 200 0.849 1.833 0.815 2.249
6 50 100 0.846 1.856 0.407 4.554
7 20 200 0.845 1.874 0.624 3.004
8 50 50 0.843 1.836 0.204 9.011
9 30 100 0.842 1.838 0.354 5.187
10 10 50 0.840 1.902 0.120 15.859
Average 0.847 1.870 0.413 6.488
Positive standard deviation(σ+) 0.002 0.016 0.137 2.638
Negative standard deviation(σ−) 0.002 0.014 0.078 1.250
3 classes
# max_depth n_estimators F1-score Speedup HW cost SPH
1 20 50 0.949 1.879 0.156 12.040
2 30 100 0.947 1.856 0.354 5.238
3 20 200 0.946 1.851 0.624 2.966
4 40 200 0.945 1.847 0.768 2.403
5 40 50 0.944 1.843 0.192 9.591
6 50 200 0.944 1.839 0.815 2.257
7 10 200 0.944 1.848 0.480 3.853
8 30 200 0.942 1.814 0.709 2.561
9 10 100 0.940 1.828 0.240 7.621
10 10 50 0.939 1.826 0.120 15.224
Average 0.944 1.843 0.446 6.375
Positive standard deviation(σ+) 0.002 0.008 0.117 2.764
Negative standard deviation(σ−) 0.002 0.007 0.111 1.360
2 classes
# max_depth n_estimators F1-score Speedup HW cost SPH
1 40 50 0.981 1.688 0.192 8.785
2 30 200 0.981 1.686 0.709 2.380
3 30 100 0.981 1.683 0.354 4.750
4 40 200 0.981 1.686 0.768 2.193
5 10 50 0.981 1.686 0.120 14.058
6 20 200 0.981 1.683 0.624 2.696
7 10 200 0.980 1.686 0.480 3.515
8 50 200 0.980 1.687 0.815 2.070
9 10 100 0.980 1.685 0.240 7.024
10 20 100 0.979 1.683 0.312 5.393
Average 0.980 1.685 0.461 5.286
Positive standard deviation(σ+) 2.0E-04 0.001 0.111 2.401
Negative standard deviation(σ−) 4.3E-04 0.001 0.104 1.034
in a correct answer, albeit with a potential loss of performance
gain. In addition, if a fast instruction is classified into a
nominal-delay class (for example, in the case with three
delay classes), the overall performance of the system is still
increased (but not maximized) as compared with execution
in the slowest delay class (as designed for the worst-
case clock period). To understand the significance of speed
and overhead in the overall performance of individual ML
classifiers, the SPH metric is considered. The SPH results (as
determined based on Tables 1-3) are shown in Fig. 5 for
NN, SVM, and RF classifiers in two-, three-, and four-class
configurations. Based on these results, RF exhibits the best
tradeoff between hardware cost and speedup, as well as
the lowest design complexity and hardware overhead. The RF
classifier is, therefore, preferred in this work as a demon-
stration vehicle of the proposed framework.
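The asymmetric effect of the two misclassification types described above can be illustrated with a simple timing model. The following sketch is an assumption-laden illustration rather than the authors' evaluation procedure: the function names, the normalized clock periods, and the instruction mixes are hypothetical, and the four re-execution cycles are assumed to be charged at the worst-case clock period.

```python
def accuracy(pairs):
    """Fraction of (true_class, predicted_class) pairs that match."""
    return sum(t == p for t, p in pairs) / len(pairs)


def estimated_speedup(pairs, periods, penalty_cycles=4):
    """Speedup over a baseline that runs every instruction at the worst-case period.

    pairs   -- list of (true_class, predicted_class); class 0 is the fastest
    periods -- clock period per class, ascending, periods[-1] is the worst case
    Assumption: a slow instruction predicted as faster wastes the fast cycle and
    then pays penalty_cycles cycles at the worst-case period for re-execution.
    """
    worst = periods[-1]
    total = 0.0
    for true_cls, pred_cls in pairs:
        total += periods[pred_cls]          # cycle spent at the predicted period
        if pred_cls < true_cls:             # result invalid, re-execute
            total += penalty_cycles * worst
    return len(pairs) * worst / total


periods = [0.5, 1.0]  # hypothetical normalized periods for two delay classes

# Classifier A: 95% accurate, all errors are slow instructions predicted as fast.
clf_a = [(0, 0)] * 80 + [(1, 1)] * 15 + [(1, 0)] * 5
# Classifier B: 90% accurate, all errors are fast instructions predicted as slow.
clf_b = [(0, 0)] * 70 + [(0, 1)] * 10 + [(1, 1)] * 20

speedup_a = estimated_speedup(clf_a, periods)
speedup_b = estimated_speedup(clf_b, periods)
print(f"A: accuracy={accuracy(clf_a):.2f}, speedup={speedup_a:.3f}")
print(f"B: accuracy={accuracy(clf_b):.2f}, speedup={speedup_b:.3f}")
```

Under these assumptions, classifier B yields a higher speedup (about 1.54) than classifier A (about 1.29) despite its lower accuracy, because its errors only waste slack and never trigger re-execution.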
Fig. 4: Speedup vs. F1 for two-, three-, and four-class config-
urations based on Tables 1-3.
5 IMPLEMENTATION
The proposed framework is implemented with an RF model
within TigerMIPS and evaluated based on the LegUp bench-
marks. The details of the implementation are described in
this section.
5.1 Unified Platform
A holistic platform is developed based on the proposed
system design methodology, as illustrated in Fig. 2. The
framework is unified within a shell programming platform
supported with several peripheral programs developed in
C++ and Python. The synthesis steps, as described in Fig. 2,
are sequentially executed from Start to Finish.
During the first phase, Synopsys Design Compiler is
called with the high-level HDL model of the baseline pro-
cessor. The profiler triggers are added to the system and
GLS is performed in Modelsim.
The second phase is triggered upon the completion of the
instruction profiling. An external parser program is called
to transform the instruction profiles into the ML feature
data structure and eliminate outliers. The model is trained
to classify propagation delays into a user-defined number of
classes based on a user-specified learning algorithm and
delay boundaries. The ML accuracy and estimated speedup
are evaluated upon completion of the training. If the design
requirements are met, the ML software model is transformed
into high-level HDL code. Otherwise, the ML model is
retrained with new parameters.
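The labeling step of the second phase, in which profiled delays are mapped to classes according to user-defined delay boundaries, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the delay values, and the boundaries are hypothetical, and a delay equal to a boundary is assumed to fit within the class bounded by it.

```python
from bisect import bisect_left


def delay_to_class(delay_ns, boundaries_ns):
    """Map a propagation delay to a delay class (class 0 is the fastest).

    boundaries_ns is an ascending list of upper delay bounds; k boundaries
    define k + 1 classes. A delay equal to a boundary is assigned to the
    class bounded by it (assumption), since it still fits that clock period.
    """
    return bisect_left(boundaries_ns, delay_ns)


# Hypothetical profiled delays (ns) and boundaries for three delay classes.
profiled_delays = [0.4, 1.0, 1.5, 2.0, 2.5]
boundaries = [1.0, 2.0]

labels = [delay_to_class(d, boundaries) for d in profiled_delays]
print(labels)
```

The resulting labels serve as training targets for the selected learning algorithm; changing the number of classes only requires a different boundary list.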
Upon training completion, the HDL code of the ML
model is instantiated within the original HDL model of
the baseline processor. Finally, the procedure in Phase 1 is
repeated in Phase 3 with the modified processor model,
and the overall system performance and overheads are
evaluated.
5.2 Baseline Processor
The proposed framework is demonstrated on TigerMIPS.
In addition to the basic MIPS units, such as Instruction
Fetch (IF), Instruction Decode (ID), Execute (Exe), Memory
access (Mem), and Write-back (WB), TigerMIPS comprises
advanced units, such as a forwarding unit, a branch handling
unit, stall logic, and instruction and data caches, which are
common in modern pipelined processors.
Fig. 5: Speedup per hardware cost (SPH) for two-, three-, and
four-class configurations. The hardware cost is evaluated
based on the number of transistors needed to realize each
classifier. The SPH is highest with the RF classifier
as compared with the SVM- and NN-based classifiers for
each of the classifier configurations. Numbers correspond
to data listed in Tables 1, 2, and 3.
5.3 Synthesis and Profiling
The baseline model is synthesized in 45 nm NanGate CMOS
technology node with Synopsys Design Compiler. Upon
completion of the synthesis, triggers are implemented in
Verilog HDL, enabling data and timestamp sampling at the
input and output of the execution unit within the MIPS
pipeline. The profiling is performed based on GLS with
Modelsim simulator.
5.4 Integration, Verification and Evaluation
The trained ML model is first validated in Python. The
HDL code of the validated ML model is integrated into
the baseline processor. Finally, the modified processor is
synthesized and its functionality is verified through GLS.
The post PAR reports are utilized to evaluate the modified
system with respect to specified design constraints.
6 EXPERIMENTAL RESULTS
To demonstrate the framework, the LegUp high-level synthesis
benchmark suite coupled with the LLVM compiler toolchain
[41] is utilized for profiling and verification during GLS. The
trained RF model is tested with nine standard benchmark
programs available within the LegUp benchmark suite and
an additional synthetically generated benchmark with one
million random instructions. The F1-score is shown in Fig.
6 for two, three, and four ML delay classes, yielding an
F1-score above 95% for the majority of the programs with two
delay classes. The resultant speedup for the individual benchmarks
is shown in Fig. 7, including the practical speedup (with
the misclassification penalty), no-penalty speedup (without
the misclassification penalty), and ideal speedup (with 100%
classification accuracy). The energy overhead due to the
additional ML hardware and classification errors is listed
in Table 4. To account for delay overheads due to the
misclassification of a slow instruction into a higher per-
formance class, a re-execution penalty of four clock cycles
(compensating for the IF, ID, ML, and EX stages) is considered
within the performance results, as reported in Fig. 7. The no-
penalty speedup is also presented in Fig. 7, visualizing the
penalty due to the misclassification of a fast instruction into
a slow class. Note that the overall speedup with the four-class
configuration is higher than the speedup with the two-class
configuration, despite the higher classification accuracy with
two delay classes. However, the higher misclassification rate
with four delay classes yields higher re-execution energy
consumption, as listed in Table 4. Also, note that a negative
energy overhead indicates a reduction in the overall energy
consumption (i.e., power-delay product).
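Since energy is treated here as the power-delay product, the energy overhead can be related to the power overhead and the speedup. The sketch below assumes the relation (1 + power overhead) / speedup − 1, which is not stated explicitly in the text but closely reproduces the values listed in Table 4; the function name is hypothetical.

```python
def energy_overhead(speedup, power_overhead):
    """Relative change in energy, with energy taken as the power-delay product.

    speedup        -- practical speedup relative to the baseline (e.g., 1.923)
    power_overhead -- fractional power increase (e.g., 0.385 for 38.5%)
    A negative result indicates an overall energy reduction.
    """
    return (1.0 + power_overhead) / speedup - 1.0


# (practical speedup, power overhead) for three four-class rows of Table 4.
rows = {
    "rand1M": (1.923, 0.3850),
    "adpcm": (1.497, 0.5649),
    "fir": (2.908, 0.2594),
}

for name, (speedup, power) in rows.items():
    print(f"{name}: {100 * energy_overhead(speedup, power):.2f}%")
```

The computed values agree with the tabulated energy overheads (-27.99%, 4.55%, and -56.7%) to within rounding, illustrating how a power increase of tens of percent can still yield a net energy reduction when the speedup is sufficiently high.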
Performance comparison between the proposed method
and state-of-the-art (ML and non-ML) DVFS approaches is
listed in Table 5. For example, both the proposed framework
and the approach in [5] consider binary classification with
two execution delay classes. The proposed method exhibits
3.5 times higher speedup gain and 33% energy savings as
compared with 3% energy overhead, as reported in [5]. As
compared with the adaptive approach in [24], the proposed
method exhibits up to 4.9 times increase in performance gain
with 50% less energy savings. Alternatively, a 3.85 times
higher performance gain is demonstrated as compared to
[24] with similar energy savings.
The power overhead per instruction for the two-, three-, and
four-delay-class configurations is also determined for the
programs in the LegUp benchmark suite. The average
power overhead (due to the additional ML stage and re-
execution of misclassified instructions) is shown in Fig. 8.
The average power is linearly reduced with the increasing
number of program instructions, exhibiting an overhead of
less than 0.02 microwatts in practical applications with more
than one million instructions. Furthermore, the additional
average power consumption rapidly converges for the various
numbers of classes, as shown in Fig. 8. Thus, when optimiz-
ing the number of delay classes in processors with large
workloads, power overhead is a secondary factor. Finally, the
steeper decrease in the power overhead with the four-class
Fig. 6: Inference RF classification based on the LegUp benchmark suite with two, three, and four classes.
Fig. 7: Experimental speedup with the proposed ML framework with two, three, and four delay classes. Practical, no-
penalty, and ideal speedups are presented for each benchmark and class. The practical speedup considers the experimental
classification accuracy and delay overheads due to misclassification of a slow instruction into a fast class. The no-penalty
speedup considers the experimental accuracy, but disregards the idle time due to misclassification of a fast instruction into
a slow class. Finally, the ideal speedup is the theoretical maximum with 100% classification accuracy.
configuration supports the previous assertion regarding
the gain-overhead tradeoff with finer granularity of delay
classes: as the number of instructions increases, the higher
accuracy with four-class configuration mitigates the adverse
effects of misclassifications on the overall system frequency.
7 CONCLUSIONS AND FUTURE WORK
The proposed unified framework facilitates efficient utiliza-
tion of the time and hardware resources in the system. In
addition, this approach enables the design of ML pipeline
stages, while satisfying design constraints, as shown in Fig.
2. Finally, classification of instructions into delay intervals
in real time alleviates the path propagation variances im-
posed by PVT variations and system aging. To enhance
the performance gain, the proposed approach should be
preferred for those applications and systems characterized
by considerable variations in the propagation delay of the
individual instructions.
This method is practical with pipelined, MIPS-like pro-
TABLE 4: Experimental power and energy overhead of the
proposed ML method.
4 classes
Benchmark Practical speedup Power overhead Energy overhead Instruction count
rand1M 1.923 38.5% -27.99% 1000000
adpcm 1.497 56.49% 4.55% 30197
aes 2.087 45.14% -30.46% 11223
blowfish 1.633 65.01% 1.02% 199759
fft 1.165 42.45% 22.28% 11001
fir 2.908 25.94% -56.7% 7024
gsm 1.382 45.77% 5.45% 7671
jpeg 1.792 55.04% -13.49% 1133161
sha 1.657 62.07% -2.22% 345576
sra 2.840 20.62% -57.53% 1775
Average 1.889 45.7% -15.51% 274738.7
Positive standard deviation(σ+) 0.35 5.81% 8.70% 374831.68
Negative standard deviation(σ−) 0.17 6.58% 15.50% 92682.62
3 classes
Benchmark Practical speedup Power overhead Energy overhead Instruction count
rand1M 1.765 23.3% -30.151% 1000000
adpcm 1.389 33.69% -3.768% 30197
aes 1.987 27.06% -36.051% 11223
blowfish 2.125 39.27% -34.456% 199759
fft 1.957 25.47% -35.872% 11001
fir 2.222 16.14% -47.727% 7024
gsm 1.465 27.87% -12.729% 7671
jpeg 1.786 32.62% -25.74% 1133161
sha 1.578 37.93% -12.566% 345576
sra 2.013 7.18% -46.745% 1775
Average 1.829 27.05% -28.58% 274738.7
Positive standard deviation(σ+) 0.11 3.09% 8.41% 374831.68
Negative standard deviation(σ−) 0.13 5.76% 4.84% 92682.62
2 classes
Benchmark Practical speedup Power overhead Energy overhead Instruction count
rand1M 1.530 14.29% -25.323% 1000000
adpcm 1.418 22.4% -13.654% 30197
aes 1.818 17.1% -35.595% 11223
blowfish 1.818 27.23% -30.024% 199759
fft 1.646 16.14% -29.45% 11001
fir 1.818 10.5% -39.225% 7024
gsm 1.635 18.34% -27.642% 7671
jpeg 1.665 22% -26.727% 1133161
sha 1.818 27.23% -30.024% 345576
sra 1.818 5.5% -41.975% 1775
Average 1.699 18.073% -29.964% 274738.7
Positive standard deviation(σ+) 0.05 2.84% 3.49% 374831.68
Negative standard deviation(σ−) 0.07 3.06% 3.24% 92682.62
cessors, in which the overall delay is dominated by the delay
of the execution stage. Although the proposed method
is explored in this work with a single-core system, fur-
ther increases in energy efficiency and the overall system
performance are expected if the approach is adjusted for
modern processor architectures with out-of-order execution
and multicore processors with multiple frequency domains.
To exploit the positive impact of out-of-order execution and
Fig. 8: Power overhead per instruction for the two-, three-, and
four-delay-class configurations based on the benchmarks in Table 4.
TABLE 5: Comparison between the proposed method and
existing state-of-the-art methods.
Algorithm Performance gain Energy overhead ML based
SLoT [4] 23% N/A Yes
Early Prediction [5] 20% 3% No
Clim [14] 24% N/A Yes
SLBM [16] 15% N/A Yes
Adaptive Clock Management [24] 18.2% -30.4% No
This work (2 classes) 70% -30% Yes
This work (3 classes) 83% -28.6% Yes
This work (4 classes) 89% -15.5% Yes
multicore systems on performance and energy efficiency in
commercial class processors, the following methodologies
should be considered.
7.1 Single-Core, Single-Clock Delay-Based Out-of-
Order Execution
To support out-of-order execution, instructions within a
delay class should be bundled into a delay-class specific
reservation station (RS). Instructions stored in an RS are
individually executed at a constant frequency until the
RS is emptied or a dependency is determined, preventing
further execution of instructions in the RS. Such bundling of
instructions reduces the number of clock signal transitions
among various frequencies, increasing the performance and
power efficiency of the system.
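The effect of bundling on the number of frequency transitions can be illustrated with a small sketch. This is a hypothetical model, not part of the evaluated system: the function names and the instruction stream are illustrative, and data dependencies between instructions, which would limit reordering in practice, are ignored.

```python
from collections import defaultdict


def count_frequency_transitions(classes):
    """Number of clock frequency changes when executing in the given order."""
    return sum(1 for a, b in zip(classes, classes[1:]) if a != b)


def bundle_by_class(classes):
    """Group instruction indices into one reservation station per delay class."""
    stations = defaultdict(list)
    for index, delay_class in enumerate(classes):
        stations[delay_class].append(index)
    return dict(stations)


# Hypothetical predicted delay classes of nine consecutive instructions.
stream = [0, 2, 0, 1, 2, 0, 1, 1, 0]

stations = bundle_by_class(stream)
# Drain each reservation station completely before switching frequency
# (assumption: no dependency prevents this order).
bundled_order = [cls for cls in sorted(stations) for _ in stations[cls]]

in_order_transitions = count_frequency_transitions(stream)
bundled_transitions = count_frequency_transitions(bundled_order)
print(in_order_transitions, bundled_transitions)
```

For this stream, in-order execution requires seven frequency changes, whereas draining the three reservation stations one at a time requires two. Dependencies that interrupt a bundle would reduce this gain.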
7.2 Single-Core, Multi-Clock Delay-Based Out-of-Order
Execution
As in the single-clock case, to support out-of-order execution,
instructions should be bundled based on the delay classes and
stored within the matching RSs. To support multi-clock
execution, the ALUs and FPUs within the execution unit
should be operated at different clock frequencies, as deter-
mined by the granularity of the delay classes. Intuitively, the
parallelization of execution from different delay classes with
this approach decreases the number of clock adjustments,
increasing the system performance and energy efficiency.
7.3 Multi-Core, Multi-Clock Delay-Based Out-of-Order
Execution
To leverage the advantages provided by processing with
multiple clock domains in multicore systems, bundled in-
structions within the individual clock domains (as defined
in subsection 7.1) should be shared among all the system
clock domains, mitigating the additional cost of multiple
clocking (as described in subsection 7.2). To enable the
sharing of bundles, efficient bundle scheduling and low
overhead communication channels are required. While the
number of clock adjustments is expected to further reduce
with this approach, additional overheads due to intelli-
gent communication of bundles among the cores should
be considered. Alternatively, by partially or fully replacing
the traditional DFS, DVFS, and thread scheduling mecha-
nisms, additional savings are expected with the proposed
approach. Finally, the proposed method can be adjusted in
a similar manner to classify instruction propagation delay
of various pipeline stages.
Existing approaches are focused on offline speculations,
statistical models, per-task (workload-based) frequency scal-
ing, and prediction of timing errors at an operating point of
a system. Alternatively, the proposed method demonstrates
the benefits of fine-grain, instruction-level frequency ad-
justment, simultaneously utilizing most of the clock period
slack and mitigating the adverse effects of PVT variations
and aging.
8 SUMMARY
In this work, an additional ML pipeline stage is proposed
for increasing the overall system performance by enhancing
the temporal resource utilization. This additional stage is
designed to classify instructions into propagation delay
classes. The system clock frequency is adaptively adjusted
based on the individual delay class predictions. Pipelining
is exploited to mitigate the effect of the ML stage latency
on the overall system performance. Practical ML features
are extracted based on current instruction and computation
history. ML hardware and misclassification power and delay
overheads are considered within the reported results. Tiger-
MIPS is utilized as the baseline processor. The processor is
enhanced with the ML predictor and simulated with the
LegUp benchmark suite. Based on the experimental results,
up to 89% performance gain is achieved with four delay
classes with 15.5% energy saving. Alternatively, the reduc-
tion of 30% in energy consumption with 70% performance
gain is demonstrated with two delay classes. A unified shell
programming platform with peripheral programs is designed
to provide a systematic design flow for ML-driven pipelined
processors.
REFERENCES
[1] Fields B, Bodík R, Hill MD. Slack: Maximizing performance un-
der technological constraints. In Proceedings 29th Annual Interna-
tional Symposium on Computer Architecture 2002 May 25 (pp.
47-58). IEEE.
[2] Zyuban V, Brooks D, Srinivasan V, Gschwind M, Bose P, Strenski
PN, Emma PG. Integrated analysis of power and performance for
pipelined microprocessors. IEEE Transactions on Computers. 2004
Jun 21;53(8):1004-16.
[3] Kumar R, Farkas KI, Jouppi NP, Ranganathan P, Tullsen DM.
Single-ISA heterogeneous multi-core architectures: The potential
for processor power reduction. In Proceedings of the 36th annual
IEEE/ACM International Symposium on Microarchitecture 2003
Dec 3 (p. 81). IEEE Computer Society.
[4] Jiao X, Jiang Y, Rahimi A, Gupta RK. SLoT: A supervised learn-
ing model to predict dynamic timing errors of functional units.
In Proceedings of the Conference on Design, Automation & Test
in Europe 2017 Mar 27 (pp. 1183-1188). European Design and
Automation Association.
[5] Hashemi SH, Ajirlou AF, Soltani M, Navabi Z. Early prediction
of timing critical instructions in pipeline processor. In 2016 15th
Biennial Baltic Electronics Conference (BEC) 2016 Oct 3 (pp. 95-
98). IEEE.
[6] Moghaddasi I, Fouman A, Salehi ME, Kargahi M. Instruction-
level NBTI Stress Estimation and its Application in Runtime
Aging Prediction for Embedded Processors. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems. 2018
Jun 12.
[7] Gepner P, Kowalik MF. Multi-core processors: New way to achieve
high system performance. In International Symposium on Parallel
Computing in Electrical Engineering (PARELEC’06) 2006 Sep 13
(pp. 9-13). IEEE.
[8] Hu Z, Buyuktosunoglu A, Srinivasan V, Zyuban V, Jacobson H,
Bose P. Microarchitectural techniques for power gating of execu-
tion units. In Proceedings of the 2004 international symposium on
Low power electronics and design 2004 Aug 9 (pp. 32-37). ACM.
[9] Wu Q, Pedram M, Wu X. Clock-gating and its application to
low power design of sequential circuits. IEEE Transactions on
Circuits and Systems I: Fundamental Theory and Applications.
2000 Mar;47(3):415-20.
[10] Wang S, Ananthanarayanan G, Zeng Y, Goel N, Pathania A,
Mitra T. High-throughput CNN inference on embedded ARM
big.LITTLE multi-core processors. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems. 2019 Sep 30.
[11] Rapp M, Sagi M, Pathania A, Herkersdorf A, Henkel J. Power-
and Cache-Aware Task Mapping with Dynamic Power Budget-
ing for Many-Cores. IEEE Transactions on Computers. 2019 Aug
20;69(1):1-3.
[12] Isci C, Buyuktosunoglu A, Cher CY, Bose P,
Martonosi M. An analysis of efficient multi-core global power
management policies: Maximizing performance for a given power
budget. In Proceedings of the 39th annual IEEE/ACM interna-
tional symposium on microarchitecture 2006 Dec 9 (pp. 347-358).
IEEE Computer Society.
[13] Canis A, Choi J, Aldham M, Zhang V, Kammoona A, Anderson JH,
Brown S, Czajkowski T. LegUp: high-level synthesis for FPGA-
based processor/accelerator systems. In Proceedings of the 19th
ACM/SIGDA international symposium on Field programmable
gate arrays 2011 Feb 27 (pp. 33-36). ACM.
[14] Jiao X, Rahimi A, Jiang Y, Wang J, Fatemi H, De Gyvez JP, Gupta
RK. Clim: A cross-level workload-aware timing error prediction
model for functional units. IEEE Transactions on Computers. 2017
Dec 14;67(6):771-83.
[15] Zhang JJ, Garg S. FATE: fast and accurate timing error predic-
tion framework for low power DNN accelerator design. In 2018
IEEE/ACM International Conference on Computer-Aided Design
(ICCAD) 2018 Nov 5 (pp. 1-8). IEEE.
[16] Jiao X, Rahimi A, Narayanaswamy B, Fatemi H, de Gyvez
JP, Gupta RK. Supervised learning based model for predicting
variability-induced timing errors. In 2015 IEEE 13th International
New Circuits and Systems Conference (NEWCAS) 2015 Jun 7 (pp.
1-4). IEEE.
[17] Zhang JJ, Garg S. BandiTS: dynamic timing speculation using
multi-armed bandit based optimization. In Design, Automation &
Test in Europe Conference & Exhibition (DATE), 2017 2017 Mar 27
(pp. 922-925). IEEE.
[18] Whittle P. Multi-armed bandits and the Gittins index. Journal
of the Royal Statistical Society: Series B (Methodological). 1980
Jan;42(2):143-9.
[19] Khaleghi B, Salamat S, Imani M, Rosing T. FPGA Energy Efficiency
by Leveraging Thermal Margin. arXiv preprint arXiv:1911.07187.
2019 Nov 17.
[20] Assare O, Gupta R. Accurate Estimation of Program Error Rate for
Timing-Speculative Processors. In Proceedings of the 56th Annual
Design Automation Conference 2019 2019 Jun 2 (p. 180). ACM.
[21] De Kruijf M, Nomura S, Sankaralingam K. A unified model for
timing speculation: Evaluating the impact of technology scal-
ing, CMOS design style, and fault recovery mechanism. In 2010
IEEE/IFIP International Conference on Dependable Systems &
Networks (DSN) 2010 Jun 28 (pp. 487-496). IEEE.
[22] Moore, S. and Chadwick, G., 2011. The Tiger “MIPS” processor.
[23] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel
O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas
J. Scikit-learn: Machine learning in Python. Journal of machine
learning research. 2011;12(Oct):2825-30.
[24] Jia T, Joseph R, Gu J. 19.4 An Adaptive Clock Management Scheme
Exploiting Instruction-Based Dynamic Timing Slack for a General-
Purpose Graphics Processor Unit with Deep Pipeline and Out-
of-Order Execution. In 2019 IEEE International Solid-State Circuits
Conference-(ISSCC) 2019 Feb 17 (pp. 318-320). IEEE.
[25] G. James, et al., “An Introduction to Statistical Learning,” New
York: Springer, Vol. 112, 2013.
[26] M. Kuhn and K. Johnson, “Applied Predictive Modeling,” New York:
Springer, Vol. 26, 2013.
[27] R. Kohavi, “A Study of Cross-Validation and Bootstrap for Accu-
racy Estimation and Model Selection,” Proc. of the International
Joint Conference on Artificial Intelligence, Vol. 14, No. 2, pp. 1137-
114, 1995.
[28] G. Forman and S. Scholtz, “Apples-to-Apples in Cross-Validation
Studies: Pitfalls in Classifier Performance Measurement.” ACM
SIGKDD Explorations Newsletter, Vol. 12, No. 1, pp. 49-57, 2010.
[29] Yue J, Liu R, Sun W, Yuan Z, Wang Z, Tu YN, Chen YJ,
Ren A, Wang Y, Chang MF, Li X. 7.5 A 65nm 0.39-to-140.3
TOPS/W 1-to-12b Unified Neural Network Processor Using
Block-Circulant-Enabled Transpose-Domain Acceleration with 8.1×
Higher TOPS/mm² and 6T HBST-TRAM-Based 2D Data-
Reuse Architecture. In 2019 IEEE International Solid-State Circuits
Conference-(ISSCC) 2019 Feb 17 (pp. 138-140). IEEE.
[30] Lee J, Lee J, Han D, Lee J, Park G, Yoo HJ. 7.7 LNPU: A 25.3 TFLOPS/W
sparse deep-neural-network learning processor with fine-grained
mixed precision of FP8-FP16. In 2019 IEEE International Solid-State
Circuits Conference-(ISSCC) 2019 Feb 17 (pp. 142-144). IEEE.
[31] Lee J, Lee J, Han D, Lee J, Park G, Yoo HJ. 7.7 LNPU: A 25.3 TFLOPS/W
sparse deep-neural-network learning processor with fine-grained
mixed precision of FP8-FP16. In 2019 IEEE International Solid-State
Circuits Conference-(ISSCC) 2019 Feb 17 (pp. 142-144). IEEE.
[32] Ruder S. An overview of gradient descent optimization algo-
rithms. arXiv preprint arXiv:1609.04747. 2016 Sep 15.
[33] Liu DC, Nocedal J. On the limited memory BFGS method for large
scale optimization. Mathematical programming. 1989 Aug 1;45(1-
3):503-28.
[34] Kingma DP, Ba J. Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980. 2014 Dec 22.
[35] Asadi P, Navi K. A new low power 32×32-bit multiplier. World
Applied Sciences Journal. 2007;2(4):341-7.
[36] Hofmann M. Support vector machines-kernels and the kernel
trick. Notes. 2006 Jun 26;26(3).
[37] Mitran J, Bouillant S, Bourennane E. Classification boundary
approximation by using combination of training steps for real-
time image segmentation. In International Workshop on Machine
Learning and Data Mining in Pattern Recognition 2003 Jul 5 (pp.
141-155). Springer, Berlin, Heidelberg.
[38] Kulkarni A, Pino Y, Mohsenin T. SVM-based real-time hardware
Trojan detection for many-core platform. In 2016 17th International
Symposium on Quality Electronic Design (ISQED) 2016 Mar 15
(pp. 362-367). IEEE.
[39] Zhang X, Wang W, Zheng X, Ma Y, Wei Y, Li M, Zhang Y.
A Clutter Suppression Method Based on SOM-SMOTE Random
Forest. In 2019 IEEE Radar Conference (RadarConf) 2019 Apr 22
(pp. 1-4). IEEE.
[40] Cheng SW. A high-speed magnitude comparator with small tran-
sistor count. In 10th IEEE International Conference on Electronics,
Circuits and Systems, 2003. ICECS 2003. Proceedings of the 2003
2003 Dec 14 (Vol. 3, pp. 1168-1171). IEEE.
[41] Lattner C, Adve V. LLVM: A compilation framework for life-
long program analysis & transformation. In Proceedings of the
international symposium on Code generation and optimization:
feedback-directed and runtime optimization 2004 Mar 20 (p. 75).
IEEE Computer Society.
[42] Agarwal K, Sylvester D, Blaauw D. Modeling and analysis of
crosstalk noise in coupled RLC interconnects. IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems.
2006 Apr 24;25(5):892-901.
Arash Fouman Ajirlou (S’17) received the
Bachelor of Science degree in computer
engineering from University of Tehran, Tehran,
Iran, in 2017. He started the PhD program
with Department of Electrical and Computer
Engineering at the University of Illinois at
Chicago, in 2018. He was a research assistant
in the school of Electrical and Computer
Engineering at University of Tehran between
2015 and late 2017. From 2017 to late 2018,
he served as the secretary of the Electrical and
Computer Engineering committee in Alumni Association of Faculty of
Engineering, University of Tehran. In 2018, prior to starting his PhD
in computer engineering at University of Illinois at Chicago, he was a
digital designer in the engineering department of Ofogh Tajrobe Moj
company, Tehran, Iran.
His primary interests are embedded systems and high-
performance/low-power computing systems, with an emphasis on
machine learning and self-governing systems. His current focus is
on utilizing machine learning methodologies to improve processor
performance and energy consumption.
Dr. Inna Partin-Vaisband (S’12–M’15) received
the Bachelor of Science degree in computer en-
gineering and the Master of Science degree in
electrical engineering from the Technion-Israel
Institute of Technology, Haifa, Israel, in 2006 and 2009,
respectively, and the Ph.D. degree
in electrical engineering from the University of
Rochester, Rochester, NY in 2015. She is cur-
rently an Assistant Professor with the Depart-
ment of Electrical and Computer Engineering at
the University of Illinois at Chicago.
Between 2003 and 2009, she held a variety of software and hardware
R&D positions at Tower Semiconductor Ltd., GConnect Ltd., and IBM
Ltd., all in Israel. Her primary interests lie in the area of high-perfor-
mance integrated circuits and VLSI system design. Her research is cur-
rently focused on innovation in the areas of AI hardware and hardware
security. Yet another primary focus is on distributed power delivery and
locally intelligent power management that facilitates performance scal-
ability in heterogeneous ultra-large scale integrated systems. Special
emphasis is placed on developing robust frameworks across levels of
design abstraction for complex heterogeneous integrated systems. Dr.
P.-Vaisband is an Associate Editor of the Microelectronics Journal and
has served on the Technical Program and Organization Committees of
various conferences.
