Decision Tree Based Hardware Power Monitoring for Run Time Dynamic Power
  Management in FPGA by Lin, Zhe et al.
Zhe Lin∗, Wei Zhang∗ and Sharad Sinha†
∗Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong
†School of Computer Engineering, Nanyang Technological University, Singapore
{zlinaf, wei.zhang}@ust.hk, sharad_sinha@ieee.org
Abstract—Fine-grained runtime power management tech-
niques could be promising solutions for power reduction. There-
fore, it is essential to establish accurate power monitoring
schemes to obtain dynamic power variation in a short period
(i.e., tens or hundreds of clock cycles). In this paper, we leverage
a decision-tree-based power modeling approach to establish fine-
grained hardware power monitoring on FPGA platforms. A
generic and complete design flow is developed to implement the
decision tree power model which is capable of precisely estimating
dynamic power in a fine-grained manner. A flexible architecture
of the hardware power monitoring is proposed, which can be
instrumented in any RTL design for runtime power estima-
tion, dispensing with the need for extra power measurement
devices. Experimental results of applying the proposed model
to benchmarks with different resource types reveal an average
error of about 4% for dynamic power estimation. Moreover, the
overheads of area, power and performance incurred by the power
monitoring circuitry are extremely low. Finally, we apply our
power monitoring technique to the power management using
phase shedding with an on-chip multi-phase regulator as a
proof of concept and the results demonstrate 14% efficiency
enhancement for the power supply of the FPGA internal logic.
I. INTRODUCTION
With the growth of capacity and complexity of field pro-
grammable gate arrays (FPGAs), and the increasing usage
of FPGAs in data centers, the reduction of the FPGA power
consumption is becoming an important issue. To alleviate this
problem, FPGA vendors provide gate-level power analysis
tools for power estimation, e.g., Xpower analyzer (XPA) from
Xilinx and Powerplay from Altera. These tools are fully aware
of the internal hardware implementation and they can provide
accurate power estimation during design time, thus helping
designers to tune their circuits using power optimization
techniques during development periods. Nevertheless, there
is a burgeoning interest in applying runtime power man-
agement techniques, e.g., dynamic voltage frequency scaling
(DVFS) [1] and task scheduling techniques [2]. Such runtime
strategies make it a necessity to be aware of the runtime
dynamic power consumption for the applications running on
FPGA. One common approach to obtain runtime power of
FPGAs is to use power measurement circuits. This method,
however, suffers from two major disadvantages: (1) it requires
additional board area to integrate the power measurement
devices; and (2) the power detection period is long, on the
order of milliseconds [3]. As dynamic power management
approaches, such as DVFS and current control using on-chip
regulators, become feasible and promising [4], dynamic power
monitoring at a granularity of tens or hundreds of cycles is
required to support fine-grained power reduction.
Within this context, our goal is to establish a fine-grained,
accurate yet light-weight dynamic power monitoring scheme
for FPGA-based designs, leveraging the state-of-the-art ma-
chine learning theory. We first propose a generic and complete
design flow capable of capturing the key features necessary for
power modeling on FPGA and generating samples for power
modeling. Based on that, we exploit a decision-tree-based
power modeling method and develop an in-situ supporting
architecture which can be efficiently integrated into the RTL
designs with extremely low overheads of performance, power
consumption and resource utilization. The power estimation
period can be as fine-grained as tens of clock cycles, fa-
cilitating more possibilities of power management strategies.
The experiments reveal that the proposed decision-tree-based
power model exhibits salient improvement in both accuracy
and area overheads in comparison to traditional linear re-
gression methods. Furthermore, we propose model ensemble
and runtime phase shedding with an on-chip multi-phase
regulator to tackle practical problems using our hardware
power monitoring scheme.
In general, our contributions can be summarized as follows:
• A platform-independent and complete synthesis flow to
extract features, generate activity traces and power traces
as samples for power model establishment.
• A runtime dynamic power monitoring scheme based on
decision tree learning theory, with complete analysis from
feature selection to model optimization.
• A light-weight and in-situ hardware realization of the
power monitoring scheme, including activity counters and
an area-efficient memory-based decision tree with small
overheads of power, performance and resource.
• A model ensemble strategy: we experimentally quantify
the error of aggregating the pre-trained power models as
an ensemble model and shed light on its viability for
library-based and IP-based designs.
• A proof of concept for fine-grained phase shedding
using an on-chip multi-phase voltage regulator for FPGA
internal logic.
(arXiv:2009.01434v1 [cs.AR], 3 Sep 2020)
The rest of this paper is organized as follows. Section II
gives a general discussion about modern power modeling ap-
proaches. Section III and Section IV demonstrate the complete
design flow and Section V elaborates the monitoring hardware.
Experimental results are discussed in Section VI and finally,
we conclude the paper in Section VII.
II. RELATED WORK
In the literature, researchers have shown great interest in
dynamic power modeling for FPGAs, as it is one of the
main challenges for FPGA-based designs. Studies of FPGA
power modeling have been conducted at different abstraction
levels. In [5], a low-level model was developed by capturing
average switching power based on switch-level macromodels
for LUTs and registers. Likewise, [6] derived power models
from basic operators, e.g., adders and multipliers. These
low-level models are specific to their targeted FPGA devices
and are therefore difficult to migrate to other FPGA families.
In contrast, power models at a high abstraction level can be
generalized more readily and in a shorter time, using a set of
key signals without the need to look into lower-level details.
In recent work [7]–[9], linear regression has been
employed for high-level power models. The work in [7]
established a regression model using IO toggle rate and
resource utilization in a log-arithmetic format. It attempted to
formulate a generic model suitable for all applications running
on FPGA. However, conditions that differ from those used for
training, such as a variation in IO port size, significantly
degrade the estimation accuracy. The works in [8], [9] extracted
the toggle rate of a small set of internal signals and built the
model in embedded processors, resulting in a LUT resource
overhead of 7% in [8] and 9% in [9] for their tested
applications, as well as around 5% CPU time in both cases.
Furthermore, as studied in [10], [11], the power behaviors of
complex arithmetic units are generally non-linear. Hence, a
linear model is intrinsically limited in accuracy: as the sample
size grows and more non-linear power patterns appear, it
cannot fit them, which is known as the underfitting problem in
machine learning theory. In light of this problem, the objective
of our work is to leverage a decision tree learning model with
the ability to adaptively learn different power patterns from
samples captured under various situations. To the best of our
knowledge, our work proposes the first approach to establish
non-linear dynamic power models in FPGA using
state-of-the-art machine learning theory. Furthermore, the
overheads of area, power and performance incurred by our
proposed power monitoring are trivial. The resulting area
overhead of the power monitor is negligible in comparison with
prior work [8], [9].
III. AUTOMATIC SYNTHESIS FLOW
Given the original register transfer level (RTL) designs,
we propose a complete automatic synthesis flow, from power
and activity trace generation to decision tree model
establishment, as shown in Fig. 1. The flow can be decomposed
into three sub-flows: (1) activity trace flow (ATF); (2) power
trace flow (PTF); and (3) model synthesis flow (MSF). An
activity trace is defined as the switching activities of a set of
signals in the design in an estimation period, whereas a power
trace is the power value of the target design in an estimation
period. Note that our implementation targets the Xilinx tool
chain and Modelsim, but the generic methodology is applicable
to other vendor tools.
[Figure 1: flow diagram with three sub-flows. Blocks shown for the activity trace flow (ATF) and the power trace flow (PTF): RTL description files, input stimulus, synthesis, placement, routing, HDL netlist, feature info, timing simulation, gate-level power estimate, activity trace, power trace. Blocks shown for the model synthesis flow (MSF): feature selection, model tuning, data learning.]
Fig. 1. Automatic synthesis flow.
A. Activity trace flow (ATF)
Dynamic power is introduced by signal transitions, which
dissipate power by charging and discharging the load
capacitors. With the improvement in leakage power control in
recent devices, power management is mainly concerned with
dynamic power, which is the focus of our work. Dynamic
power dissipation is given by Equation (1), which relates
switching activity αi and capacitance Ci on net i, supply
voltage Vdd and operating frequency f. Here voltage and
frequency are pre-set constants, capacitance is determined by
the device, and the switching activities depend entirely on the
runtime execution of the application.
$$P_{dyn} = \sum_{i \in N} \alpha_i C_i V_{dd}^{2} f \tag{1}$$
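As a rough illustration of Equation (1), the sketch below sums the per-net contributions for a handful of nets. The net list, capacitance values and function name are hypothetical and chosen only to show how the terms combine; they do not come from any real device.

```python
def dynamic_power(nets, vdd, freq_hz):
    """Sum alpha_i * C_i over all nets and scale by Vdd^2 * f, as in Equation (1)."""
    return sum(alpha * cap for alpha, cap in nets) * vdd ** 2 * freq_hz

# Hypothetical nets: (switching activity alpha_i, capacitance C_i in farads).
nets = [(0.10, 2e-12), (0.25, 1e-12), (0.05, 4e-12)]

p_100mhz = dynamic_power(nets, vdd=1.0, freq_hz=100e6)  # watts
p_50mhz = dynamic_power(nets, vdd=1.0, freq_hz=50e6)    # watts
```

With voltage and frequency fixed, only the activity terms change at runtime, which is why the monitoring scheme observes switching activity.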
In order to generalize the dynamic power models at high
level, a set of key signals are captured and we monitor
their switching activities formulated in Equation (2), where
a(·) is the activity function which returns the difference of
signal transition counts s(·) on the signal set sig over the
estimation period from tstart to tend. For a large design
containing millions of nets, it is vital to identify a subset of
discriminative and informative nets that are strongly indicative
of the power. The number of transitions of an identified signal
in an estimation period serves as a feature for the power model
synthesis.
$$a(sig) = s(sig, t_{end}) - s(sig, t_{start}) \tag{2}$$
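A minimal sketch of Equation (2) on a sampled one-bit waveform follows. The waveform and the helper names are hypothetical. The sketch counts every transition, whereas the hardware counter described in Section V counts positive edges only.

```python
def cumulative_transitions(samples):
    """Return s(sig, t) for every sample index t: transitions seen up to t."""
    counts = [0]
    for prev, cur in zip(samples, samples[1:]):
        counts.append(counts[-1] + (prev != cur))
    return counts

def activity(samples, t_start, t_end):
    """a(sig) = s(sig, t_end) - s(sig, t_start), as in Equation (2)."""
    s = cumulative_transitions(samples)
    return s[t_end] - s[t_start]

waveform = [0, 1, 1, 0, 1, 0, 0, 1]   # hypothetical 1-bit signal, one value per cycle
feature = activity(waveform, 2, 7)    # transitions inside the window [2, 7]
```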
In order to identify the most indicative signals, we first
run vector-based timing simulation with randomly generated
input vectors and then use Xpower to export the signals
ranked by activity. The simulation should run for a
sufficiently long time (i.e., orders of magnitude longer
than an estimation period) to cover as many situations as
possible. We select a number of candidate signals with the
highest activities, and then perform feature selection in MSF.
Feature selection filters out the redundant signals showing
repetitive behaviors and leaves a smaller subset of signals
with discriminative features across different input vectors,
as discussed in Section IV-B. Note that the identified
signals are the lower-level nets in the HDL netlist exported
by the command write_verilog after placement and routing. The
HDL netlist is basically composed of primitives (e.g., instances
of LUTs or DSPs) and connections. The advantages of
using the HDL netlist are two-fold: it enables us to extract the
identified signals, and it preserves the original mapping when
the activity counters are added afterwards to implement
the model in hardware for runtime monitoring. Finally, timing
simulations are conducted using Modelsim to generate .saif
files for power estimation in PTF, with the identified features
recorded in activity traces simultaneously.
B. Power trace flow (PTF)
We go through synthesis and implementation again using
the netlist in the PTF flow. After that, we apply the .saif
files derived from ATF to the Xpower analyzer to derive the
corresponding power traces for different estimation periods.
To attain high-confidence power values, we need to set proper
constraints for IO toggling and all the clock specifications.
In addition, more than 25% of the signal toggling should be
covered in the .saif files [12]. The static power is dependent on
the ambient temperature. We target Virtex-7 series FPGAs
with Xilinx high performance low power (HPL) technology,
for which the static power is small in magnitude and shows
small variation over a wide range of temperatures, i.e., within
1 W for the temperature range from -40 to 60 degrees
centigrade [13]; this also conforms with our experimental
results. Hence, we conduct our experiments at an ambient
temperature of 25 degrees centigrade to approximate the
static power, and we mainly investigate the dynamic power.
C. Model synthesis flow (MSF)
After we collect a sufficient number of activity traces and
the corresponding power traces, our aim is to develop a
runtime dynamic power monitoring scheme based on the
decision tree learning algorithm. As introduced later in
Section IV-A, a decision tree can be decomposed into a series
of if-then-else rules, which makes it intrinsically suitable for
an area-efficient hardware implementation with a series
of comparators. Moreover, the decision tree caters for models
with different complexities by tuning its inherent parameters
(e.g., maximum depth, minimum number of samples to split a
node). It is also highly adaptable: it improves when
more samples are provided, as indicated by the learning curves
in Section VI-A. In comparison, the traditionally employed
linear regression model has limited ability to increase its
accuracy further when more training samples are available.
Nevertheless, the decision tree is prone to the overfitting
problem. In order to build an accurate decision tree regression
model, we systematically present the essential steps of
decision tree model establishment in Section IV.
IV. DECISION TREE MODEL ESTABLISHMENT
A. Background of decision tree
The decision tree is a nonparametric hierarchical model for
supervised learning which learns the samples in the form of
a tree structure. At each step of the tree building process, a
decision rule is generated, in which the target feature is
compared with a coefficient. According to the result of
the comparison, the sample set is split into two more
homogeneous subsets, as shown in Fig. 2. A tree node can be
categorized as: (1) decision/internal node: a node where the
contained subdivision of the sample set can be further split by
a feature; or (2) leaf/terminal node: a node associated with
an output which cannot be further split. The decision nodes
represent the decision rules whereas the leaf nodes provide
the classification or regression results. We use the CART [14]
algorithm to develop decision tree models.
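[Figure 2, graphical part: the root decision node holds the rule feature1 ≤ coeff1. Its True branch is a leaf node with output value1. Its False branch is a second decision node with the rule feature2 ≤ coeff2, whose True and False branches are leaf nodes with outputs value2 and value3.]
Textual part:
if feature1 ≤ coeff1:
    Out = value1
else if feature2 ≤ coeff2:
    Out = value2
else:
    Out = value3
Fig. 2. Graphical and textual representation of decision tree regression.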
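To make the rule-generation step concrete, the following sketch searches for the single rule `feature <= threshold` that best separates a set of samples. It assumes a squared-error impurity, which is a common choice for regression trees but is not stated in the text above, and it tries observed feature values as thresholds rather than midpoints. The sample data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(feature, power):
    """Return (threshold, error) of the rule `feature <= threshold` that
    minimises the summed squared error of the two resulting subsets."""
    best = (None, sse(power))
    for threshold in sorted(set(feature))[:-1]:
        left = [p for f, p in zip(feature, power) if f <= threshold]
        right = [p for f, p in zip(feature, power) if f > threshold]
        err = sse(left) + sse(right)
        if err < best[1]:
            best = (threshold, err)
    return best

# Hypothetical samples: transition counts of one signal and measured power (W).
counts = [3, 5, 8, 40, 42, 45]
power = [0.10, 0.11, 0.10, 0.30, 0.31, 0.29]
threshold, error = best_split(counts, power)
```

Applying the same search recursively to each subset yields the nested if-then-else structure of Fig. 2.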
B. Feature selection
In ATF, we have already screened out a set of candidate
signals with high activities. Before we develop the power
models, we apply feature selection to preserve only the
signals containing informative features across different input
vectors in different estimation periods. Feature selection can
mitigate the overfitting problem and reduce the resource
overhead induced by the activity counters used to monitor
signals. We leverage the recursive feature elimination
algorithm, which has been successfully applied in other
domains [15], to identify the required features. In each
iteration, the decision tree model is trained and the feature
importance values are computed according to the node
impurity at different splits, which is also known as the Gini
importance [14] in the decision tree. The features with the
least importance are eliminated and the remaining features
are used to re-train the decision tree model and update the
feature importance. The insignificant features are recursively
pruned away and a subset containing 20% of the features is
retained for the final model establishment. This maximum
subset size is empirically determined based on the observation
that the final decision tree uses 10% to 20% of the features
when we extract one hundred features for each application.
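The elimination loop can be sketched in plain Python as follows. This is an illustrative stand-in rather than the toolbox routine used in the experiments: the tree is a small greedy regression tree, the importance of a feature is taken as the total squared-error reduction of the splits that use it, and one feature is dropped per iteration. The data, signal names and parameter values are hypothetical.

```python
import random

def sse(ys):
    """Sum of squared deviations from the mean, used as node impurity."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def grow(rows, ys, features, depth, importance):
    """Greedy regression tree; credits each split's impurity drop to its feature."""
    if depth == 0 or len(ys) < 2:
        return
    parent, best = sse(ys), None
    for f in features:
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, ys) if r[f] <= t]
            right = [y for r, y in zip(rows, ys) if r[f] > t]
            gain = parent - sse(left) - sse(right)
            if best is None or gain > best[0]:
                best = (gain, f, t)
    if best is None or best[0] <= 1e-12:
        return
    gain, f, t = best
    importance[f] += gain
    for go_left in (True, False):
        pairs = [(r, y) for r, y in zip(rows, ys) if (r[f] <= t) == go_left]
        grow([r for r, _ in pairs], [y for _, y in pairs],
             features, depth - 1, importance)

def recursive_feature_elimination(rows, ys, features, keep, depth=3):
    """Retrain, drop the least important feature, repeat until `keep` remain."""
    features = list(features)
    while len(features) > keep:
        importance = {f: 0.0 for f in features}
        grow(rows, ys, features, depth, importance)
        features.remove(min(features, key=importance.get))
    return features

# Hypothetical data: 10 candidate signals, only sig0 and sig1 drive the power.
rng = random.Random(0)
names = [f"sig{i}" for i in range(10)]
rows = [{n: rng.randrange(20) for n in names} for _ in range(120)]
power = [0.01 * r["sig0"] + (0.5 if r["sig1"] > 9 else 0.0) for r in rows]
selected = recursive_feature_elimination(rows, power, names, keep=2)
```

Keeping 2 of 10 candidates mirrors the 20% subset size mentioned above; fewer surviving features also means fewer activity counters in hardware.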
C. Model tuning and training
The problem of learning an optimal decision tree is known
to be NP-complete under several aspects of optimality. To
boost the performance of our target model, we tune a
set of essential hyper-parameters which largely determine the
accuracy of training, validation and testing. We focus on the
parameters defined in Table I. The most influential parameter
for the decision tree is the maximum depth, which controls how
deep a tree can grow. A deep tree gives a high degree of
model complexity but is prone to overfitting the data.
Conversely, a shallow tree often fails to learn the samples
well. The tree depth should be tuned for each application to
strike a good balance between training and testing accuracy.
In addition, the other three parameters should be coordinated
with the tree depth and tuned to avoid the overfitting
problem [14].
We experimentally determine the candidate parameter values
suitable for our benchmarks: {3, 4, 5, 6, 7, 8} for maximum
depth, {5, 10, 15, 20} for minimum split sample and minimum
leaf sample, and {0.001, 0.01, 0.02, 0.03, 0.04, 0.05} for
minimum leaf impurity. To determine the set of
hyper-parameters that best fits the data, the k-fold cross
validation approach [16] is employed. We adopt ten-fold cross
validation: the training set is further divided into ten subsets.
A decision tree model is trained using nine of them and the
model is evaluated with a score using the remaining subset as
the validation set. This procedure is repeated ten times with a
different validation set each time and we take the average to
get the overall score of the set of parameters, as a quantitative
indicator of performance. We try every combination of the
hyper-parameters and conduct cross validation to quantify the
performance of each model. The set of hyper-parameters with
the highest score is deployed to develop the model. After
that, we use both the training set and the validation set to
train the model in the model developing process. Finally, we
assess the model with the test set.
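The search described above can be outlined as follows. The dictionary keys are informal labels for the four hyper-parameters of Table I, not the argument names of any particular library, and `toy_score` is a hypothetical placeholder for "train a tree on the training folds and return its score on the validation fold".

```python
from itertools import product

# Candidate values listed in the text; key names are illustrative labels only.
grid = {
    "max_depth": [3, 4, 5, 6, 7, 8],
    "min_split_sample": [5, 10, 15, 20],
    "min_leaf_sample": [5, 10, 15, 20],
    "min_leaf_impurity": [0.001, 0.01, 0.02, 0.03, 0.04, 0.05],
}
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, val

def select_best(n_samples, fit_and_score, k=10):
    """Return the combination with the highest mean validation score."""
    folds = list(k_fold_indices(n_samples, k))

    def mean_score(params):
        return sum(fit_and_score(params, tr, va) for tr, va in folds) / k

    return max(combinations, key=mean_score)

def toy_score(params, train_idx, val_idx):
    # Hypothetical scorer standing in for model training and validation.
    return -abs(params["max_depth"] - 5) - params["min_leaf_impurity"]

best = select_best(1600, toy_score)
```

The grid contains 6 × 4 × 4 × 6 = 576 combinations, so ten-fold validation trains 5760 small trees per benchmark, which remains cheap for shallow trees.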
D. Model ensemble strategy
A large-scale hardware design traditionally requires the
collaborative participation of a group of designers, each of
whom focuses on an independent part of the complete design.
In light of this, we explore the practical usage of our model
when different power models are developed independently for
different parts of the design and are eventually aggregated
to behave as an ensemble power model. The model
ensemble flow is shown in Fig. 3. Since the dynamic power
of different components is derived from different decision
tree models, we directly add up the dynamic power estimated
by the corresponding decision tree models in order to obtain
the overall power consumption. The experiment in Section VI-D
shows that the additional error is as small as 1.2% compared
to the re-trained decision tree model.
TABLE I
HYPER-PARAMETERS FOR DECISION TREE TUNING.
Name                  | Description
Maximum depth         | The maximum depth that a tree can grow to.
Minimum split sample  | The minimum number of samples used to split a decision node.
Minimum leaf sample   | The minimum number of samples necessary to determine a leaf node.
Minimum leaf impurity | The minimum percentage of samples giving different output at a leaf node.
The model ensemble facilitates model reuse: a power
model trained for a specific function can be saved in a library
and be integrated into an aggregated model directly when
needed. It is applicable for library-based and IP-based designs
with encapsulated power models. With the increasing usage of
FPGAs in data centers to accelerate multiple functions loaded
at runtime, such a fast and accurate ensemble method will be
particularly useful and necessary.
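A minimal sketch of the aggregation step is shown below, assuming each component model is available as a callable that maps its own activity counts to a power value. The two stub models, their thresholds and the signal names are hypothetical.

```python
def make_ensemble(models):
    """Return a predictor that sums the dynamic power predicted by each model.

    `models` is a list of (predict, feature_names) pairs; each predictor only
    sees the activity counts of its own component.
    """
    def predict(activity):
        return sum(p({n: activity[n] for n in names}) for p, names in models)
    return predict

# Hypothetical pre-trained models for two independently developed components.
def model_a(features):
    return 0.12 if features["a_sig"] <= 40 else 0.30

def model_b(features):
    return 0.05 if features["b_sig"] <= 25 else 0.09

ensemble = make_ensemble([(model_a, ["a_sig"]), (model_b, ["b_sig"])])
total = ensemble({"a_sig": 55, "b_sig": 10})   # 0.30 W + 0.05 W
```

Because the models are only summed, a model stored in a library can be added or removed without retraining the others.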
[Figure 3: samples for app 1 through app N each feed a model training step, producing Model 1 through Model n; the models are combined by summation (Ensemble: ∑ Model i) and passed to model assessment.]
Fig. 3. Model ensemble flow.
V. HARDWARE WRAPPER ARCHITECTURE
We propose an area-efficient hardware wrapper to realize the
proposed power monitoring in FPGA. The hardware wrapper
can be decomposed into two parts: activity counter and deci-
sion tree regression engine.
A. Activity counter
The activity counter design is shown in Fig. 4. A positive
edge detector first identifies the positive edges of the input
signal. Its output is valid for a single cycle, acting as the
enable signal of the counter. We propose two realizations of
the counter design: the LUT-based counter and the DSP-based
counter. The LUT-based counter only utilizes LUT resources,
whereas the DSP-based counter is implemented using the
primitive COUNTER_LOAD_MACRO, which instantiates
a dynamically loadable up counter occupying one DSP48 unit
with a maximum data width of 48 bits. These two counter
templates give flexibility to developers who are aware
of the application's resource utilization. Since we count the
positive edges of selected signals, the number of clock cycles
in an estimation period is an upper bound on the signal
activity, which can be used to set the maximum bit width
of the counters. The counters are reset at the beginning of every
detection cycle by the decision tree regression engine.
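The behaviour of the counter can be approximated with a small cycle-level software model. This is a simplified sketch, not the RTL: the edge detector is reduced to a single previous-sample register, and the class and function names are hypothetical. The width helper applies the upper bound stated above, i.e. the counter must be able to hold the number of cycles in one estimation period.

```python
class ActivityCounter:
    """Cycle-level model: a positive edge detector enabling an up counter."""

    def __init__(self, width):
        self.mask = (1 << width) - 1
        self.prev = 0       # previous sample of the monitored signal
        self.count = 0

    def clock(self, signal, rst=False):
        """Advance one clock cycle and return the current feature value."""
        if rst:
            self.count = 0
        elif signal == 1 and self.prev == 0:    # rising edge detected
            self.count = (self.count + 1) & self.mask
        self.prev = signal
        return self.count

def counter_width(cycles_per_period):
    """Smallest bit width that can hold `cycles_per_period` without overflow."""
    return max(1, cycles_per_period.bit_length())

counter = ActivityCounter(width=counter_width(300))
for bit in [0, 1, 1, 0, 1, 0, 1, 1]:           # hypothetical monitored signal
    feature_value = counter.clock(bit)
```

For a 3 µs estimation period at a 10 ns clock (300 cycles), this bound gives a 9-bit counter.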
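[Figure 4: the monitored signal passes through a positive edge detector built from two D flip-flops clocked by Clk; the detector output drives the enable (EN) input of a LUT/DSP counter with Clk and Rst inputs, whose output is the feature value.]
Fig. 4. Activity counter architecture.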
B. Decision tree regression engine
We propose a memory-based decision tree regressor as
shown in Fig. 5. There are some studies on decision tree
implementation in hardware [17] that maximize the throughput
of decision tree computation. Our objective is different from
prior work in that throughput is not the prime consideration
in our solution. Instead, our decision-tree-based monitoring
scheme only operates once in an estimation period and we
therefore target reducing power and resource overheads in
the first place. In our solution, the decision tree structure is
completely preserved in a memory element. Additional
peripheral control units are incorporated to orchestrate
activity counter control, tree node decoding and branch
decisions. To summarize, the complete hardware structure in
our proposed solution can be divided into three subsystems:
(1) a feature controller; (2) a decision tree finite state
machine (FSM); and (3) a decision tree structure memory.
To achieve periodic power estimation, the feature controller
buffers feature values from the activity counters, triggers the
FSM to enter its operating states and periodically resets the
activity counters. We use a user-specified parameter to set the
period of the power estimation. The decision tree FSM has
four states: idle (I), node reading (N), stalling (S) and result
outputting (R). The FSM starts execution by moving from idle
to node reading. In the node reading state, the FSM fetches
the node information and tree structure from the memory
and completes the if-then-else branch decision by comparing
the addressed feature with the decoded tree node coefficient.
The stalling state operates together with the node reading state
to ensure the correctness of memory reading. The execution
finally terminates by moving to the result outputting state,
which gives the final result and sets the indication signal when
the tree leaf has been reached. The decision tree structure
memory is the fundamental component which preserves the
complete tree structure and the feature addresses for rule
decisions, as shown in Fig. 6. It is implemented in block
memory on the FPGA.
The maximum execution time for a single invocation is
2n+1 cycles, where n is the maximum depth of the tree. Note
that the decision tree structure is completely preserved in the
structure memory, meaning that the proposed hardware
wrapper is generally applicable to all decision trees regardless
of depth or pruning structure. In addition, the features
are the numbers of signal transitions in an estimation period
and are therefore unsigned integers. Correspondingly,
the coefficients can also be converted to an unsigned integer
format for the rule decision, without loss of precision.
Following this observation, no floating point operations are
required in the hardware wrapper design, which contributes
to the high area efficiency of its implementation. Furthermore,
our proposed hardware wrappers are applied to HDL netlists,
which are lower-level designs comprising primitives and
connections that describe the mapping of the RTL designs
after placement and routing. As a result, the integration
of the monitoring circuits preserves the RTL mapping.
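The traversal can be illustrated with a software model of the structure memory and the FSM. The memory layout follows the field names of Fig. 6, but the concrete tree, the integer milliwatt results and the cycle accounting (two cycles, N and S, per decision node plus one R cycle) are assumptions chosen to be consistent with the stated 2n+1 bound, not a description of the actual RTL.

```python
import math

def quantise(coeff):
    """For unsigned integer features, x <= coeff is equivalent to x <= floor(coeff)."""
    return math.floor(coeff)

# One memory word per node (hypothetical tree of depth 2).
# Non-leaf: (is_leaf, coeff, left_addr, right_addr, act_addr); leaf: (is_leaf, result).
TREE_MEMORY = [
    (False, quantise(120.5), 1, 2, 0),   # addr 0: feature[0] <= 120 ?
    (True, 150),                         # addr 1: leaf, result in mW
    (False, quantise(33.5), 3, 4, 1),    # addr 2: feature[1] <= 33 ?
    (True, 220),                         # addr 3: leaf, result in mW
    (True, 310),                         # addr 4: leaf, result in mW
]

def run_engine(memory, features):
    """Walk the tree like the FSM: N and S states per decision node, then R."""
    addr, cycles, trace = 0, 0, ["I"]
    while not memory[addr][0]:
        _, coeff, left, right, act_addr = memory[addr]
        trace += ["N", "S"]
        cycles += 2
        addr = left if features[act_addr] <= coeff else right
    trace.append("R")
    cycles += 1
    return memory[addr][1], cycles, trace

result, cycles, trace = run_engine(TREE_MEMORY, features=[200, 30])
```

Because only integer comparisons and address selection are involved, the same walk maps onto a comparator, a multiplexer and a block memory read per node.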
[Figure 5: activity counters 1 to N and a clock counter connect to the feature controller (select and register; signals Act_sel, Act_value, Cal_start, Rst). The decision tree FSM (states I, N, S, R) exchanges Addr, Coeff, Next index, Is_leaf and Result with the decision tree structure memory and outputs Done and Result.]
Fig. 5. Decision tree regression engine.
[Figure 6: each word of the decision tree structure memory encodes one node. Non-leaf word fields: Is_current_leaf, Current_coeff_val, Next_left_addr, Next_right_addr, Current_act_addr, implementing the rule Feature[addr] ≤ Coeff with Y/N branches. Leaf word fields: Is_current_leaf, unused bits, Result.]
Fig. 6. Decision tree memory structure.
VI. EXPERIMENTAL RESULTS
Our proposed activity trace flow and power trace flow are
implemented in Vivado 2016.4 and we utilize Modelsim SE
10.3 for the generation of the .saif files. The targeted FPGA
platform is Virtex-7 XC7V2000tflg1925-1. The model synthe-
sis flow is developed based on the Scikit-learn 0.18.1 [18]
machine learning toolbox. We applied our methodology to
develop the decision-tree-based power models and respec-
tively build hardware wrappers for several benchmarks in
Chstone [19], Polybench [20] and Machsuite [21]. These
benchmark suites are C-based benchmarks and we derive the
synthesizable Verilog version using Vivado HLS 2016.4.
A. Model assessment
[Figure 7: six learning-curve panels, (a) to (f). x-axis: sample number (200 to 1400); y-axis: score (negative MAE). Curves in each panel: Train(tree), Validation(tree), Train(linear), Validation(linear).]
Fig. 7. Learning curve: (a) Atax; (b) Bicg; (c) GemmNcubed; (d) Matrixmult; (e) Hybrid 1; (f) Hybrid 2.
We systematically evaluate the model accuracy using
different sets of benchmarks, which can be categorized by the
utilized resources. We define the following three types of
benchmarks: LUT-based, DSP-based and hybrid. The LUT-
based benchmarks (i.e., Atax and Bicg) mainly use LUT
resources, whereas the DSP-based benchmarks (i.e., Gemm-
Ncubed and Matrixmult) utilize DSPs as the major resource
type. The hybrid benchmarks are a combination of benchmarks
showing a large proportion of resource utilization of both
LUTs and DSPs. The LUT-based and DSP-based benchmarks
are tested under the clock period of 10 ns and the hybrid
benchmarks are evaluated with the clock period of 15 ns.
The estimation period is 3 µs. We collect 2000 samples for
each benchmark by randomly invoking the applications with
random input vectors. We select 20% of the samples as the test
set, whereas the other 80% will be used to train the models as
well as perform cross validation. We compare our decision-
tree-based models with linear regression models employed
by prior research work [7]–[9]. In the feature selection step
for linear models, we use the feature weights as the feature
importance described in Section IV-B for a fair comparison.
The resource utilization and the mean absolute error (MAE)
in percentage for dynamic power consumption are shown in
Table II. The average MAE percentage is 4.36% for our
proposed decision tree model, whereas linear regression
gives an average error of 17.31%. From Table II, we can
see that the advantage of the decision tree over the linear
model tends to be larger for DSP-based and hybrid designs.
LUT power is intrinsically easier to fit with a linear regression
model, while the DSPs are inclined to have non-linear power
patterns; as reported in [10], [11], complex arithmetic
units generally exhibit non-linear power behaviors.
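The reported averages can be reproduced directly from the MAE columns of Table II. The short check below uses only the table values; the variable names are illustrative.

```python
# MAE (%) per benchmark, copied from Table II.
dtree_mae = {"Atax": 4.14, "Bicg": 2.58, "GemmNcubed": 4.51,
             "Matrixmult": 3.54, "Hybrid 1": 5.78, "Hybrid 2": 5.61}
linear_mae = {"Atax": 12.46, "Bicg": 15.67, "GemmNcubed": 17.80,
              "Matrixmult": 18.81, "Hybrid 1": 20.78, "Hybrid 2": 18.34}

avg_dtree = sum(dtree_mae.values()) / len(dtree_mae)      # about 4.36
avg_linear = sum(linear_mae.values()) / len(linear_mae)   # about 17.31

# Per-benchmark gap in percentage points between the two models.
gap = {name: round(linear_mae[name] - dtree_mae[name], 2) for name in dtree_mae}
```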
The learning curves for training and cross validation further
reveal the difference between the decision tree models and
linear regression models in their capability to learn from
samples, as shown in Fig. 7. Note that the power consumption
differs considerably across benchmarks, and thus the absolute
MAE varies widely between benchmarks even with similar
relative accuracy.
TABLE II
BENCHMARK AND MODEL ASSESSMENT.
Benchmark   | Resource (%)                  | MAE (%)
            | LUT   | DSP   | FF    | BRAM  | Dtree | Linear
Atax        | 4.82  | 0     | 0.21  | 0     | 4.14  | 12.46
Bicg        | 2.72  | 0     | 0.14  | 0     | 2.58  | 15.67
GemmNcubed  | 1.16  | 53.33 | 1.46  | 0     | 4.51  | 17.80
Matrixmult  | 6.24  | 100   | 3.65  | 0     | 3.54  | 18.81
Hybrid 1    | 53.82 | 97.78 | 20.04 | 3.99  | 5.78  | 20.78
Hybrid 2    | 56.91 | 100   | 8.21  | 5.69  | 5.61  | 18.34
Regarding the linear regression, the learning curves imply a
high-bias scenario: the error is high and the model ceases
to improve in accuracy when given more training samples. In
general, a high-bias situation means that the model underfits
the training data. This also accounts for the deterioration
in training accuracy as non-linear patterns increase with
more samples. In comparison, the decision tree model exhibits
a superior ability to learn from the training samples as the
sample size gets larger. As a result, the decision trees
exhibit notably lower errors than the linear regression
models.
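The high-bias behavior can be illustrated with a toy example. The sketch below is not the experiment of this paper: it assumes a synthetic, step-shaped target (a stand-in for a non-linear power pattern) and compares a least-squares line with a depth-1 regression tree, both written from scratch so that the example is self-contained. All function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def fit_stump(xs, ys):
    """Depth-1 regression tree: pick the split that minimizes squared error."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - mean_l) ** 2 for y in left)
               + sum((y - mean_r) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, mean_l, mean_r)
    _, t, mean_l, mean_r = best
    return lambda x: mean_l if x < t else mean_r


def mae(model, xs, ys):
    """Mean absolute error of a fitted model on the given samples."""
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


def step_samples(n):
    """Synthetic data (assumption): output jumps from 1.0 to 3.0 at x = 0.5."""
    xs = [i / n for i in range(n)]
    ys = [1.0 if x < 0.5 else 3.0 for x in xs]
    return xs, ys


# Training MAE for a small and a large sample set.
results = {}
for n in (20, 200):
    xs, ys = step_samples(n)
    results[n] = (mae(fit_line(xs, ys), xs, ys), mae(fit_stump(xs, ys), xs, ys))
    print(n, results[n])
```

On this toy target the line keeps a sizable training error regardless of the sample count, while the single split fits the step exactly. That is the underfitting pattern described above, in a deliberately simplified setting.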
B. Frequency variation
In this experiment, we evaluate how the estimation accuracy
of our proposed model is affected when the operating
frequency varies. We use the pre-trained power models for
Atax and GemmNcubed from Section VI-A to predict power
consumption and verify the models’ scalability when the
frequency changes at runtime. These models are trained under
a clock period of 10 ns with the accuracy shown in Table II.
We employ the same number of estimation cycles (i.e., 300
cycles) and run our design flow to collect new activity and
power traces for testing under different frequencies. Note
that the power model produces estimates for the original
period of 10 ns, and we compute the predicted power by
scaling this estimate by the ratio of frequencies (i.e.,
$f_{current}/f_{model}$), because the dynamic power
consumption is proportional to the frequency as shown in
Equation (1). The results are shown in Fig. 8. Compared with
the baseline MAE for the clock period of 10 ns, the error
degradation is within 0.2% under different frequencies.
Therefore, our proposed model can be used under different
operating frequencies with low generalization errors.
[Figure 8: two bar charts of MAE (%) versus clock period (10, 12, 14, 16, 18 and 20 ns). Bar labels as extracted: (a) 4.14, 4.25, 4.23, 4.27, 4.15, 4; (b) 4.51, 4.26, 4.39, 4.11, 4.13, 4.34.]
Fig. 8. Accuracy for frequency variation: (a) Atax (b) GemmNcubed.
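The frequency scaling step described in this subsection amounts to a single multiplication. A minimal Python sketch follows; the function and parameter names are hypothetical, and it assumes that only the dynamic power estimate is scaled. Since f = 1/T, the ratio of frequencies equals the inverse ratio of clock periods.

```python
def scale_power(p_model_mw: float, model_period_ns: float,
                current_period_ns: float) -> float:
    """Scale a power estimate made at the training clock period.

    Dynamic power is proportional to frequency, so
    P_current = P_model * f_current / f_model
              = P_model * T_model / T_current.
    """
    if model_period_ns <= 0 or current_period_ns <= 0:
        raise ValueError("clock periods must be positive")
    return p_model_mw * model_period_ns / current_period_ns


# Model trained at a 10 ns period; design now runs at a 20 ns period.
# The 120 mW input is an illustrative value, not a measured one.
print(scale_power(120.0, model_period_ns=10.0, current_period_ns=20.0))
```

Halving the frequency halves the scaled estimate, which is the proportionality the text relies on.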
C. Model overhead
The decision tree hyper-parameter settings for the tested
benchmarks are presented in Table III. We analyze the
overhead of integrating our proposed power models into the
benchmarks from three aspects: resource utilization,
operating frequency and power dissipation. The first three
benchmarks utilize DSP-based activity counters while the
others use LUT-based counterparts. The activity counter width
can be tuned according to the estimation period, as stated in
Section V-B. Here, we uniformly set the width to 20 bits,
which is sufficient to cover the highest activity for a wide
range of estimation periods. For each application, ten to
twenty selected signals are monitored, each of which is
equipped with an activity counter. As shown in Table IV, the
monitoring circuits consume less than 0.01% of LUTs and FFs,
0.2% of BRAMs and 0.4% of DSPs. Note that by using LUT-based
activity counters instead of DSP-based ones, we can dispense
with DSP resources altogether. In contrast, the prior works
using linear models [8], [9] report LUT overheads of 7% and
9%, respectively, and a software overhead of 5% of CPU time.
The decision tree exhibits much higher area efficiency
because it mainly relies on integer comparisons, whereas the
linear model requires a large number of floating-point
additions and multiplications, which accounts for its high
area overhead. The power dissipation of our decision tree
model is extremely low, ranging from 2 mW to 6 mW (Table IV).
In addition, the operating frequency fluctuates only
slightly, with a maximum degradation of 0.70 MHz (1%). In
conclusion, our proposed decision-tree-based monitoring
hardware demonstrates low overheads in resource utilization,
operating frequency and power dissipation. It can be
efficiently integrated into RTL designs for on-board power
monitoring.
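To make the point about integer comparisons concrete, the following Python sketch walks a small regression tree whose inputs are activity counter values. It is an illustration under stated assumptions rather than the hardware described in this paper: the `Node` layout, the thresholds and the leaf power values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One node of a hypothetical regression tree over activity counters."""
    signal: int = 0                  # index of the monitored signal
    threshold: int = 0               # integer toggle-count threshold
    low: Optional["Node"] = None     # taken when count <= threshold
    high: Optional["Node"] = None    # taken when count > threshold
    power_mw: Optional[int] = None   # leaf value; None for internal nodes


def estimate_power(root: Node, counts: List[int]) -> Tuple[int, int]:
    """Walk the tree; return (leaf power in mW, number of comparisons)."""
    node, comparisons = root, 0
    while node.power_mw is None:
        comparisons += 1
        node = node.low if counts[node.signal] <= node.threshold else node.high
    return node.power_mw, comparisons


# Illustrative depth-2 tree; all thresholds and leaf values are made up.
tree = Node(signal=0, threshold=120,
            low=Node(power_mw=35),
            high=Node(signal=1, threshold=40,
                      low=Node(power_mw=60),
                      high=Node(power_mw=95)))

print(estimate_power(tree, [80, 10]))    # low-activity window
print(estimate_power(tree, [200, 75]))   # high-activity window
```

Each estimate costs at most one integer comparison per tree level, so a tree with a maximum depth of 4 to 6 needs no more than six comparisons per estimation period. A linear model instead needs one multiplication and one addition per monitored signal.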
TABLE III
DECISION TREE HYPER-PARAMETER SETTINGS.
Benchmark     Max depth   Min split sample   Min leaf sample   Min leaf impurity
Atax          5           20                 20                0.01
Bicg          5           5                  5                 0.001
GemmNcubed    5           20                 20                0.03
Matrixmult    4           5                  5                 0.03
Hybrid 1      6           5                  5                 0.02
Hybrid 2      6           20                 10                0.05
TABLE IV
MODEL OVERHEAD ANALYSIS.
              Resource (in number)           Freq     Power
Benchmark     LUT     DSP     FF      BRAM   (MHz)    (mW)
Atax          127     7       198     2.5    +1.14    3
Bicg          125     6       176     0.5    +2.32    2
GemmNcubed    149     9       242     2.5    +1.35    4
Matrixmult    108     0       325     0.5    0        4
Hybrid 1      156     0       415     2.5    -0.70    5
Hybrid 2      162     0       508     2.5    +1.62    6
D. Model ensemble
In order to verify the accuracy of the ensemble strategy,
we combine Atax, Bicg and GemmNcubed with the pre-trained
power models and run the design flow to generate testing
samples. We then quantify the accuracy of aggregating the
separately pre-trained models into an ensemble model by
adding up the predicted power of the different components.
For comparison, we re-train a monolithic power model using
the new samples. The MAE of the ensemble model and the
re-trained model is 5.52% and 4.32%, respectively, so the
accuracy deterioration of the ensemble model is 1.2%. This is
mainly owing to the changes in placement and routing.
Overall, the model ensemble strategy dispenses with the need
for re-training new models at the cost of moderate accuracy
degradation.
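The aggregation step of the ensemble strategy is a plain sum over the component models. The Python sketch below shows this under stated assumptions: the three stand-in models, their thresholds and their output values are hypothetical placeholders for the separately pre-trained decision trees.

```python
from typing import Callable, Dict, Sequence

# A component power model maps activity counts to a power estimate in mW.
PowerModel = Callable[[Sequence[int]], float]


def ensemble_power(models: Dict[str, PowerModel],
                   activity: Dict[str, Sequence[int]]) -> float:
    """Sum the power predicted by each pre-trained component model."""
    return sum(model(activity[name]) for name, model in models.items())


# Stand-in component models (made-up thresholds and leaf values).
models: Dict[str, PowerModel] = {
    "atax": lambda c: 40.0 if c[0] <= 100 else 70.0,
    "bicg": lambda c: 25.0 if c[0] <= 50 else 45.0,
    "gemm": lambda c: 90.0 if c[0] <= 300 else 150.0,
}
activity = {"atax": [120], "bicg": [30], "gemm": [310]}

total_mw = ensemble_power(models, activity)
print(total_mw)

# Accuracy gap reported in the text: ensemble MAE minus re-trained MAE.
print(round(5.52 - 4.32, 2))
```

No component is re-trained here. Only the outputs are added, which is why changes in placement and routing of the combined design are not captured by the ensemble.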
E. Fine-grained phase shedding for on-chip multi-phase voltage regulator
Phase shedding for on-chip regulators has been shown in [3]
to prominently improve efficiency for processors by reducing
the number of phases in light-load situations. We similarly
investigate the viability of phase shedding in FPGAs for the
internal logic supplied by Vccint. However, unlike
processors, which internally incorporate multiple power
states, applications running on FPGAs are fully customized by
designers and intrinsically lack states that indicate the
runtime power and could guide the phase shedding decision.
Following this observation, we leverage our proposed power
estimator to supervise a fine-grained phase shedding strategy
for an on-chip regulator.
We synthesize a two-stage power delivery system consisting
of both on-chip and off-chip multi-phase voltage regulators
for Vccint using the tool PowerSoC [22]. Both the on-chip and
off-chip regulators are buck converters with the parameters
specified in [22]. To approximate the power loss induced by
the off-chip to on-chip and on-chip to die parasitic
resistances, we estimate the fabrication details, including
the resistance of package vias and wires, interposer vias and
wires, micro-bumps and TSVs, based on [23]–[25]. The number
of these network components for the targeted device is
estimated from [25], [26]. Finally, the power delivery system
for the internal logic is constructed under a nominal power
of 20 W. The on-chip regulator has five phases, and its
voltage scaling time between Vccint and ground is within
20 ns, which is short enough for safe fine-grained control
with our 3 µs power estimation period.
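The timing margin can be checked with a little integer arithmetic. The sketch below assumes that the 3 µs estimation period corresponds to the 300 estimation cycles at a 10 ns clock period used in Section VI-B; the constant names are ours.

```python
CLOCK_PERIOD_NS = 10           # clock period used for model training
ESTIMATION_CYCLES = 300        # cycles per power estimate (Section VI-B)
REGULATOR_TRANSITION_NS = 20   # upper bound on the voltage scaling time

estimation_period_ns = CLOCK_PERIOD_NS * ESTIMATION_CYCLES
margin = estimation_period_ns // REGULATOR_TRANSITION_NS

print(f"estimation period: {estimation_period_ns} ns")
print(f"regulator transitions per estimation period: {margin}")
```

Under these assumptions a regulator transition takes well under 1% of one estimation period, so the regulator settles long before the next power estimate arrives.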
We first experimentally determine the optimal number of
phases that maximizes the power efficiency of the power
delivery network under different power values, as shown in
Fig. 9. Then, we deploy a look-up-table-based phase shedding
approach [3] and determine the optimal number of phases to
use at runtime based on the power of the internal logic,
including static power and monitored dynamic power. The power
efficiency improvement is derived according to Equation (3),
where $i$ denotes the index of a specific estimation period,
and $P_{n_{opt}}$ and $P_{n_{max}}$ are the power under the
optimal number of phases and the maximum number of phases,
respectively. The power overhead induced by phase shedding,
$P_{loss}$, is calculated according to [22]. Experiments show
that the efficiency improvements for Hybrid 1 and Hybrid 2,
using the last 400 sample sequences from Section VI-A, are
13.6% and 14.4%, respectively, with the internal logic power
ranging from 1 W to 11 W. As a result, the proposed runtime
power monitoring hardware provides fine-grained power
information for runtime phase shedding of an on-chip
regulator, improving the efficiency of the power delivery
network by 14% on average. It is also promising for future
FPGAs with integrated voltage regulators and for CPU-FPGA
systems-on-chip, where it can enable further power savings.
$$\mathit{Eff}_{impv} = 1 - \frac{\sum_{i}^{N} \left( P_{n_{opt}}(i) + P_{loss}(i) \right)}{\sum_{i}^{N} P_{n_{max}}(i)} \tag{3}$$
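The look-up-table decision and Equation (3) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the evaluated implementation: the power boundaries between phase counts are hypothetical, since the actual crossover points come from the efficiency curves in Fig. 9, and the sample power values are made up.

```python
import bisect
from typing import List, Sequence

# Hypothetical upper power bounds (W) for using 1, 2, 3 and 4 phases.
# Above the last bound, all 5 phases are enabled.
PHASE_UPPER_BOUNDS_W = [2.0, 5.0, 9.0, 14.0]


def optimal_phases(total_power_w: float) -> int:
    """Look up the number of phases for the estimated internal logic power."""
    return bisect.bisect_left(PHASE_UPPER_BOUNDS_W, total_power_w) + 1


def efficiency_improvement(p_opt: Sequence[float], p_loss: Sequence[float],
                           p_max: Sequence[float]) -> float:
    """Equation (3): 1 - sum(P_nopt(i) + P_loss(i)) / sum(P_nmax(i))."""
    if not (len(p_opt) == len(p_loss) == len(p_max)):
        raise ValueError("all sequences must cover the same periods")
    return 1.0 - sum(o + l for o, l in zip(p_opt, p_loss)) / sum(p_max)


# Total power = static power + monitored dynamic power, per period.
static_w = 0.5
dynamic_w = [0.8, 3.1, 9.7]
phases: List[int] = [optimal_phases(static_w + d) for d in dynamic_w]
print(phases)

# Made-up per-period power values (W), used only to exercise Equation (3).
print(efficiency_improvement([8.0, 9.0], [0.5, 0.5], [10.0, 10.0]))
```

A sorted boundary list with a binary search keeps the runtime decision to a handful of comparisons per estimation period, in keeping with the lightweight monitoring hardware.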
[Figure 9: power delivery network efficiency (0% to 90%) versus power consumption (1 W to 20 W), comparing "Max # of phases" with "Opt # of phases", with region labels for 1, 2, 3, 4 and 5 phases.]
Fig. 9. Efficiency of power delivery network for FPGA internal logic.
VII. CONCLUSION
In this work, we leverage state-of-the-art machine learning
techniques to establish a novel decision-tree-based dynamic
power monitoring approach for applications running on FPGAs.
The proposed design flow and the decision tree regression
model can be generally applied for fine-grained power
prediction in FPGAs. We also develop lightweight, in-situ
monitoring hardware for the developed power model, which can
be efficiently integrated into RTL designs with extremely low
overheads in area, power and performance. We investigate our
proposed methodology on three different types of benchmarks:
LUT-based, DSP-based and hybrid benchmarks. Experimental
results reveal that the decision tree models outperform the
traditional linear models, reducing the mean absolute error
by more than 10 percentage points. In addition, the
decision-tree-based power monitoring exhibits a high
capability to learn from samples as the training sample size
increases, whereas linear regression is prone to
underfitting. Moreover, we explore a model ensemble method
for library-based and IP-based designs, which incurs an
additional 1.2% error compared with a completely re-trained
model. Furthermore, we utilize our proposed power monitoring
scheme to guide the phase shedding of an on-chip multi-phase
regulator as a case study, and the results demonstrate a 14%
average improvement in the efficiency of the power supply for
the FPGA internal logic. In future work, we plan to enhance
the power monitoring scheme with static power compensation
and leverage fine-grained runtime power monitoring for
advanced power management schemes.
REFERENCES
[1] P. Mantovani et al., “An FPGA-based infrastructure for fine-grained
DVFS analysis in high-performance embedded systems,” in DAC, 2016.
[2] A. Lösch et al., “Performance-centric Scheduling with Task Migration
for a Heterogeneous Compute Node in the Data Center,” in DATE, 2016.
[3] H. Asghari-Moghaddam et al., “VR-scale: runtime dynamic phase
scaling of processor voltage regulators for improving power efficiency,”
in DAC, 2015.
[4] B. Keller, “Opportunities for Fine-Grained Adaptive Voltage Scaling
to Improve System-Level Energy Efficiency,” Master’s thesis, EECS
Department, UCB, 2015.
[5] F. Li et al., “Architecture evaluation for power-efficient FPGAs,” in
FPGA, 2003.
[6] C. Najoua et al., “Accurate dynamic power model for FPGA based
implementations,” IJCSNS, 2012.
[7] A. Lakshminarayana et al., “High level power estimation models for
FPGAs,” in ISVLSI, 2011.
[8] M. Najem et al., “Method for dynamic power monitoring on FPGAs,”
in FPL, 2014.
[9] E. Hung et al., “KAPow: A System Identification Approach to Online
Per-Module Power Estimation in FPGA Designs,” in FCCM, 2016.
[10] D. Lee et al., “Learning-based power modeling of system-level black-
box IPs,” in ICCAD, 2015.
[11] A. Bogliolo et al., “Regression-based RTL power modeling,” TODAES,
2000.
[12] Xilinx, “Vivado Design Suite User Guide,” 2017.
[13] M. Klein and S. Kolluri, “Leveraging power leadership at 28 nm with
Xilinx 7 series FPGAs,” Xilinx whitepaper, 2013.
[14] L. Rutkowski et al., “The CART decision tree for mining data streams,”
Information Sciences, 2014.
[15] H. Ravishankar et al., “Recursive feature elimination for biomarker
discovery in resting-state functional connectivity,” in EMBC, 2016.
[16] J. D. Rodriguez et al., “Sensitivity analysis of k-fold cross validation in
prediction error estimation,” TPAMI, 2010.
[17] F. Saqib et al., “Pipelined decision tree classification accelerator imple-
mentation in FPGA (DT-CAIF),” TC, 2015.
[18] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” JMLR,
no. Oct, 2011.
[19] Y. Hara et al., “Chstone: A benchmark program suite for practical c-
based high-level synthesis,” in ISCAS, 2008.
[20] L.-N. Pouchet, “Polybench: The polyhedral benchmark suite,” URL:
http://www.cs.ucla.edu/pouchet/software/polybench, 2012.
[21] B. Reagen et al., “Machsuite: Benchmarks for accelerator design and
customized architectures,” in IISWC, 2014.
[22] X. Wang et al., “An Analytical Study of Power Delivery Systems for
Many-Core Processors Using On-Chip and Off-Chip Voltage Regula-
tors,” TCAD, 2015.
[23] Z. Xu et al., “Modeling of power delivery into 3D chips on silicon
interposer,” in ECTC, 2012.
[24] R. Wrona et al., “Resistance Measurements of BGA Contacts During
Reliability Tests,” in ISSE, 2006.
[25] R. Chaware et al., “Assembly and reliability challenges in 3D integration
of 28nm FPGA die on a large high density 65nm passive interposer,”
in ECTC, 2012.
[26] Xilinx, “7 Series FPGAs Packaging and Pinout,” 2011.
