FORECASTER: A Continual Lifelong Learning Approach to Improve Hardware
  Efficiency by Nguyen, Phat et al.
Phat Nguyen1, Abhishek Taur1, Abdullah Muzahid1, Arnav Kansal2, and Mohamed Zahran2
1Department of Computer Science and Engineering, Texas A&M University
2Department of Computer Science, New York University
ABSTRACT
Computer applications are continuously evolving. However,
significant knowledge can be harvested from older applica-
tions or versions and applied in the context of newer ap-
plications or versions. Such a vision can be realized with
Continual Lifelong Learning. Therefore, we propose to em-
ploy continual lifelong learning to dynamically tune hardware
configurations based on an application’s behavior. The goal of
such tuning is to maximize hardware efficiency (i.e., max-
imize an application’s performance while minimizing the
hardware’s energy consumption). Our proposed approach,
FORECASTER, uses deep reinforcement learning to contin-
ually learn during the execution of an application as well
as propagate and utilize the accumulated knowledge during
subsequent executions of the same or a new application. We
propose novel hardware and ISA support to implement
deep reinforcement learning. We implement FORECASTER
and compare its performance against prior learning-based
hardware reconfiguration approaches. Our results show that
FORECASTER can save as much as 17.5% system power over
the baseline set up with all resources. On average, FORE-
CASTER saves 16% system power over the baseline setup
while sacrificing an average of 4.7% of execution time.
1. INTRODUCTION
Computer architects are on a continuous quest to find the best
hardware design for different program types. We cannot have
different application specific hardware designs for different
program types because this will be prohibitively expensive.
What makes things even more challenging is that a single
program passes through different phases during its execu-
tion lifetime and each phase has a different best hardware
configuration. This paper presents a step toward a solution.
Previously, designers gathered profiling information about
a program's execution on a given piece of hardware and then
used this information to enhance either the hardware or
the program. However, this means that each program must
be instrumented first and the information gathered during
profiling is used on that application only. Our proposed idea
is based on a simple hypothesis: any phase of a program
execution has a best hardware configuration. Each phase has
certain characteristics. So, if there is a different program with
a phase with similar characteristics, then it can use the same
configuration to get the best performance. Therefore, if we
can learn the best configuration for different program phases,
we can use that to get the best configuration for new, unseen
programs. In other words, there is a finite set of patterns
along which hardware/software interactions can occur to give
the best performance. For example, given a cache configuration,
there is a finite set of memory access patterns that yield few
cache misses. Conversely, given a memory access pattern, we can
build the cache configuration that yields the lowest number
of misses. The main goal of this paper is to design
hardware with configurable knobs that learns from its
interaction with programs so that it can reconfigure itself
to achieve the best performance for new, unseen programs.
As the hardware executes more programs,
it learns more patterns and can achieve better performance
for more and more programs.
There are several challenges that need to be tackled in order
to reach this goal. First, what are the knobs to be changed?
There are many structures that can be designed to be
reconfigurable. Our main criterion is to pick knobs that have the
biggest impact on performance and power and, at the same
time, can be reconfigured with the least hardware cost and
modification. Second, how do we learn the patterns of the
hardware/software interaction in order to suggest the best
configuration? A pattern here means profiling information such
as telemetry collected from performance counters. For each
pattern, there is a hardware configuration that leads to the best
performance and power, or any other metric that needs to be
optimized. The number of patterns is large and, depending
on the number and type of knobs, the number of hardware
configurations is also large. This is why straightforward
classifiers such as a Bloom filter may not be a viable option.
Using a neural network in a deep-learning setup does not lead to
the best results from an early stage because it requires a large
number of examples in the training phase. Therefore, we need
some kind of unsupervised learning approach. We read profiling
information and we make changes to the hardware based on this
information. That is, we make changes to the environment and we
get feedback about how well we do. This is a description of
reinforcement learning.
The main contribution of this paper is a hardware scheme,
called FORECASTER, that uses deep reinforcement learning to
learn continuously from one execution to another and to
reconfigure certain knobs to get the best performance and
power for different programs. We implemented FORECASTER using
the Multi2Sim [32] simulator. Our experimental results using
PARSEC benchmarks show that the proposed technique can save as
much as 17.5% power over the baseline with all resources. On
average, our scheme saves 16% system power over the baseline
setup while sacrificing only an average of 4.7% execution time.
arXiv:2004.13074v1 [cs.AR] 27 Apr 2020
The rest of the paper is organized as follows: Section 2
presents some background materials; Section 3 describes the
main idea of FORECASTER; Section 4 shows the detailed
implementation of FORECASTER; Section 5 presents some
experimental results; Section 6 highlights some related work;
and finally, Section 7 concludes our work.
2. BACKGROUND
2.1 Hardware Adaptation
There is a considerable amount of prior work on reconfig-
urable architecture [3, 4, 8, 21, 33]. However, unlike FORE-
CASTER, the majority of the work did not use any learn-
ing [3,21]. Among the learning-based approaches, most used
offline training [4, 8, 33]. Only a few approaches utilized
online training; however, they focused on a single hardware
structure [12].
Choi and Yeung [6] distribute microarchitectural resources
in an SMT processor using a hill-climbing algorithm.
Bitirgen et al. [4] propose a scheme that combines the
performance prediction models of multiple applications to get
an aggregate performance prediction for the overall resource
distribution. The scheme is coupled with a limited
probabilistic search technique to find the resource distribution
that improves performance the most. Petrica et al. [21] present
Flicker, a general-purpose multicore architecture that
dynamically adapts to varying limits on allocated power. A Flicker
core has reconfigurable lanes through the pipeline that allow
tailoring an individual core to the running application with
lower overhead.
Dubach et al. [8] propose the use of machine learning to
dynamically optimize the efficiency of a processor’s components
such as the Arithmetic Logic Unit, instruction queues,
register file, caches, branch predictor, and the pipeline depth.
During program execution, as soon as a phase change is
detected, the hardware starts to collect counters on a predefined
profiling configuration. These counters represent the usage of
the hardware resources in that interval. The model then
predicts the optimal configuration and the system is reconfigured
accordingly for the rest of the phase. Unlike our approach,
Dubach et al. propose learning for each program separately.
Moreover, they use offline training, which could limit the
adaptability of the model to future unseen programs.
There is also other work on utilizing profiling information
for optimization. However, the majority of this work is
related to software optimization [19]. For hardware designs,
profiling information has traditionally been used to make
design choices before fabrication [11].
2.2 Reinforcement Learning
2.2.1 Overview
Reinforcement learning is a subfield of machine learning
concerned with autonomous agents that can learn without
supervision to optimize an objective [10]. In reinforcement
learning, the knowledge of the agent is built through trial
and error. At each time step, the agent takes an action and
observes feedback from the environment about how well it is
doing and how close it is to the goal. A reinforcement
learning problem typically consists of three main components:
• a set of states that represent the environment at different
time steps;
• a set of possible actions that the agent can take;
• a reward function that issues a reward for each action
of the agent.
A state is defined as the information about the condition
of the environment at a time step. The agent observes this
information and selects the most appropriate action. As
the agent takes the selected action, the environment
transitions from the current state to another state. After that, the
reward function assigns the agent a reward. The value
of this reward depends on how good the state-action pair
is. Since the name of this reward function is Q-function,
the reward is called Q-value. The agent accumulates
knowledge by storing and updating these Q-values after each
time step.
Ipek et al. [12] formulate DRAM scheduling as a reinforce-
ment Q-learning problem with the goal of optimizing bus
utilization and throughput. In every clock cycle, the agent
picks one of six possible actions available to the scheduler.
The agent is given a numerical Q-value of 1 whenever it is-
sues a command that increases the data bus utilization and 0
otherwise. The state is defined as a combination of attributes
that represent the state of the controller’s transaction queue.
Ipek et al. show that this approach can improve the bus uti-
lization and bandwidth efficiency by a significant amount
compared to the state-of-the-art DRAM scheduler.
2.2.2 Deep Q-learning
Early reinforcement Q-learning techniques use a Q-table
to store the Q-values and take the state-action pairs as the
indices. As modern problems become more and more complex,
this approach becomes inefficient since the Q-table size
inflates with the number of state-action pairs. The answer
to this issue is Deep Q-learning (DQN), which is a cross
between reinforcement Q-learning and deep learning techniques.
In DQN, the reward table is replaced by a multi-layer
neural network that predicts the Q-value for any particular
state-action pair.
There is a significant amount of work on the application
of DQN [17,27,28,29]. Mnih et al. [17] apply DQN to seven
Atari 2600 games and show that it outperforms all previous
reinforcement learning approaches. Moreover, this DQN
model also manages to beat a human expert in three out of
seven games. The model uses only the raw pixels of the
application screen as input, and outputs the expected future
reward of the taken action. DeepMind Technologies uses DQN
in the series of AlphaGo programs [27, 28, 29] to solve the
game Go. The original AlphaGo version [27] outperforms all
previous Go programs and is the first Go program to beat a
professional human player. The latest version of the series,
AlphaZero, has the capability of teaching itself three different
games: Go, chess, and shogi [28]. However, at the time of
writing, there has not been any published research on applying
DQN to hardware optimization.
3. MAIN IDEA: FORECASTER
FORECASTER periodically collects hardware telemetry
during the execution of a program. The telemetry consists of
various hardware event counters maintained by the processor
architecture. FORECASTER uses the hardware telemetry in a
deep reinforcement learning algorithm to predict the optimal
configurations of tunable hardware resources. The goal of
the predicted configurations is to maximize the efficiency
of the hardware. The overall workflow of FORECASTER is
shown in Figure 1. FORECASTER reconfigures the hardware
resources according to the prediction and receives a reward
after a while. FORECASTER receives a positive reward if the
hardware efficiency improves due to reconfiguration. Oth-
erwise, it receives a zero or negative reward. Rewards are a
feedback mechanism to encourage configurations associated
with positive rewards while discouraging non-positive reward
related configurations. Based on the reward, FORECASTER
updates the Q-values (used by the reinforcement learning
algorithm) so that efficiency boosting configurations are pre-
dicted more frequently. Thus, FORECASTER continually
improves its prediction during the execution of an application.
The next time the same or a new application executes, FORE-
CASTER reuses the Q-values learned from prior executions
and continues to improve its prediction accuracy. In other
words, FORECASTER keeps on learning from one execution
to the next both within and across applications, thereby real-
izing continual lifelong learning with the goal of maximizing
the hardware efficiency. In the next few sections, we will
elaborate on different steps of FORECASTER.
3.1 Initial Configuration
When an application starts, FORECASTER starts with the
maximum amount of hardware resources. This prevents any
slowdown at the beginning. Progressively, FORECASTER tries
to reconfigure tunable hardware resources to maximize the
hardware efficiency. We use IPC^3/Power as the metric for
hardware efficiency. A similar metric has been used in prior
work [8]. As for tunable hardware resources, we choose L2
and L3 caches as well as the Branch Target Buffer (BTB)
and prefetcher. We choose caches because they are the most
energy hungry resources in a chip [13]. We choose the other
resources because they can be easily clock-gated without in-
trusive changes to the pipeline circuitry. Although we demon-
strate the effectiveness of FORECASTER with these 4 tunable
resources, we argue that FORECASTER is general enough
to accommodate any number of tunable resources. Table 1
shows the tunable resources and possible configurations.
Tunable Resource        Configuration
BTB size                0.5K, 1K, 1.5K, and 2K entries
Prefetcher              On, Off
L2 (private) cache      256K, 512K, 768K, and 1024K bytes
L3 (shared LLC) cache   4M, 8M, 12M, and 16M bytes
Table 1: List of tunable hardware resources. The initial
configuration is the maximum amount of each resource (Section 3.1).
3.2 Collecting HW Telemetry and Making Pre-
dictions
A program usually goes through distinct phases during its
execution [25]. Some phases may benefit from more caches
while others might benefit from having a larger BTB. We
collect hardware telemetry as an approximation of how a
program behaves. Modern processors provide hundreds of
hardware event counters as their telemetry. After inspecting
every hardware event counter, we choose the n counters most
relevant to the tunable hardware resources. Let us denote
the set of counters (i.e., hardware telemetry) as T = {T_i}_{i=1}^{n},
where each T_i is an individual counter. FORECASTER collects
these counters at a regular interval. At the beginning of
each interval, FORECASTER uses the counters for predicting
configurations of tunable resources.
FORECASTER uses reinforcement learning, more specifically
a Deep Q-learning (DQN) model, for prediction. In this
model, the current configuration of the hardware resources,
C, as well as the behavior of the program as specified by
the telemetry, T, is provided as a state, S. In other words,
S = <T, C>. Given a state, S_t, at a time period, t, the deep
Q-learning model predicts Q-values for all possible actions
in that state using a deep neural network (DNN). Each action
indicates a different configuration of the hardware resources.
Thus, if we have N possible actions, the model predicts N
different Q-values, one for each action, A_i, where 1 ≤ i ≤ N.
The Q-value associated with action A_i, say Q_{A_i}, is an
estimate of how good the new configuration (corresponding to A_i)
is at maximizing the hardware efficiency. A higher Q-value
implies a better configuration. Therefore, FORECASTER chooses
the action related to the maximum Q-value.
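As a rough illustration of this prediction step, the following Python sketch builds a state from telemetry and configuration and picks the action with the largest Q-value. The function names, the example values, and the `toy_q_network` stand-in for the DNN are hypothetical and are not taken from the paper.

```python
import random

def build_state(telemetry, config):
    """Concatenate telemetry T and configuration C into a state S = <T, C>."""
    return list(telemetry) + list(config)

def select_action(q_values):
    """Return the index of the action with the maximum predicted Q-value."""
    return max(range(len(q_values)), key=lambda i: q_values[i])

def toy_q_network(state, num_actions):
    """Hypothetical stand-in for the DNN: one pseudo-random Q-value per action."""
    rng = random.Random(len(state))  # fixed seed keeps the toy output repeatable
    return [rng.uniform(-1.0, 1.0) for _ in range(num_actions)]

# Example values are made up for illustration only.
state = build_state([0.4, 0.1, 0.2, 0.2, 0.1], [512, 8, 0, 0.5])
q_values = toy_q_network(state, num_actions=54)
best = select_action(q_values)
```

In a real system the toy network would be replaced by the trained DNN; only the arg-max selection is meant to mirror the text.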
Naively designating one action for each configuration leads
to a large number of actions. For example, based on Table 1,
we can have 4*4*2*4=128 possible configurations and hence,
the same number of actions. Reinforcement learning with a
large action space takes a long time to train due to the sparsity
of rewards [24]. Therefore, in order to reduce the number
of actions, we express each action in terms of the changes
in configurations. Suppose ↑, ↓, and = indicate that a
resource should be increased, decreased, or kept at the same
level, respectively. If a resource has only two configurations,
we can use ON and OFF to indicate those configurations. With
this notation, we can define an action as A = <R_i^a>_{i=1}^{n},
where R_i represents the i-th resource for 1 ≤ i ≤ n and a
represents a change in R_i’s configuration such as ↑, ↓, =, ON,
or OFF. For example, suppose the current configuration is
denoted by C = <L2_512, L3_8, PF_OFF, BTB_0.5>. Then, an
action <L2_↑, L3_↓, PF_OFF, BTB_↑> will create a new
configuration denoted by <L2_768, L3_4, PF_OFF, BTB_1>. With this
new approach, the number of actions is reduced from 128 to
3*3*2*3=54, i.e., less than half of the initial actions. With
the reduced action space, the overall prediction process is
illustrated in Figure 2.
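The reduced action space can be enumerated in a few lines of Python. This is a minimal sketch under stated assumptions: the names `LEVELS`, `STEP`, `ACTIONS`, and `apply_action` are hypothetical, and clamping at the smallest and largest level is an assumption, since the text does not say how out-of-range moves are handled.

```python
from itertools import product

# Allowed levels per resource, taken from Table 1.
LEVELS = {
    "L2": [256, 512, 768, 1024],   # KB
    "L3": [4, 8, 12, 16],          # MB
    "BTB": [0.5, 1.0, 1.5, 2.0],   # K entries
}
STEP = {"up": 1, "down": -1, "same": 0}

# Each action is a tuple (L2 change, L3 change, prefetcher setting, BTB change).
ACTIONS = list(product(STEP, STEP, ["on", "off"], STEP))

def apply_action(config, action):
    """Apply a relative action to a configuration dict and return the new one."""
    l2, l3, pf, btb = action
    new = {"PF": pf}
    for name, change in (("L2", l2), ("L3", l3), ("BTB", btb)):
        levels = LEVELS[name]
        idx = levels.index(config[name]) + STEP[change]
        # Assumption: moves past the smallest or largest level are clamped.
        idx = min(max(idx, 0), len(levels) - 1)
        new[name] = levels[idx]
    return new

current = {"L2": 512, "L3": 8, "PF": "off", "BTB": 0.5}
updated = apply_action(current, ("up", "down", "off", "up"))
```

The enumeration yields 3*3*2*3 = 54 actions, and the example call reproduces the transition from <L2_512, L3_8, PF_OFF, BTB_0.5> to <L2_768, L3_4, PF_OFF, BTB_1> described in the text.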
3.3 Reconfiguring HW Resources
FORECASTER reconfigures the tunable resources accord-
ing to the predicted configurations. Now, we describe how
each resource is reconfigured.
3.3.1 Branch Target Buffer (BTB)
The BTB has 4 possible configurations (Table 1). Therefore,
we can partition the BTB into 4 sections: B1, B2, B3, and B4
(Figure 3). For the first configuration (i.e., 0.5K entries),
sections (B2, B3, B4) are clock-gated. Similarly, for the second
and third configurations, sections (B3, B4) and (B4) are
clock-gated, respectively. The last configuration does not
clock-gate any section at all. Section B1 is never
clock-gated because its entries are used in
all configurations. We add reconfiguration logic that creates
the appropriate clock-gating signals to enable the appropriate
sections. Moreover, for each configuration, the indexing
logic needs to reconfigure the indexing bits accordingly. In a
multicore processor with one BTB per core, FORECASTER
reconfigures all BTBs to the same configuration. This is
done to simplify the prediction and reconfiguration logic
in FORECASTER.
Figure 1: Overall workflow of FORECASTER. (Steps shown: start
execution with maximum resources and load Q-values; periodically
collect HW telemetry; predict new configuration; reconfigure
resources; calculate efficiency and rewards, and update Q-values;
save Q-values at the end of execution.)
Figure 2: FORECASTER uses hardware telemetry to predict
configurations. (T_t and C_t are the inputs; the outputs are
Q-values Q_1 ... Q_m for actions A_1 ... A_m; the action A_i
associated with the maximum Q-value is selected and applied to
change C_t into C_{t+1}.)
Figure 3: Logic for reconfiguring the BTB. (Sections B1 to B4,
reconfiguration logic, indexing logic, and clock.)
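The mapping from a BTB configuration to its clock-gated sections, as described in Section 3.3.1, can be written down as a small Python sketch. The function name `gated_sections` is hypothetical, and the sketch assumes that each of the four sections holds 0.5K entries, which follows from the four sizes in Table 1.

```python
BTB_SECTIONS = ["B1", "B2", "B3", "B4"]  # assumed: each section holds 0.5K entries

def gated_sections(btb_size_k):
    """Return the BTB sections that are clock-gated for a size given in K entries."""
    active = int(btb_size_k / 0.5)  # number of sections that stay enabled
    if active < 1 or active > len(BTB_SECTIONS):
        raise ValueError("unsupported BTB size")
    return BTB_SECTIONS[active:]
```

For the smallest configuration this returns B2, B3, and B4, and for the largest it returns an empty list, so B1 is never gated.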
3.3.2 Prefetcher
Prefetcher is used either completely or not at all. Therefore,
the prefetcher is clock-gated entirely or not at all. So, the
reconfiguration logic simply generates a single clock-gating
signal for the entire prefetcher.
Figure 4: Logic for reconfiguring L2 and L3 caches. (Way
reconfiguration logic and clock, driving the tag, state, and data
arrays of Way 0, Way 1, and so on.)
3.3.3 L2 and L3 Caches
In order to reconfigure caches, FORECASTER makes three
design choices. First, FORECASTER does not clock-gate an
entire set. As a result, the address decoding logic remains
unchanged. Second, from each set, FORECASTER clock-gates
only invalid lines; it never clock-gates a valid line.
Third, whenever more lines in a single set satisfy the selection
criteria than are required, FORECASTER randomly chooses some of
them to clock-gate. Figure 4 shows the schematic for reconfiguring
the caches. The way selection logic first determines how
many lines need to be clock-gated in each set. Then, it
selects which ways to clock-gate based on the selection criteria
and sends a signal to those ways. Because only invalid lines
are clock-gated, FORECASTER does not need to worry about
cache coherence issues. During the clock-gating process, the
cache controller blocks any incoming request to that
particular cache set. The request is handled after the clock-gating is
complete.
Figure 5: Timing of various steps of FORECASTER. (At the start of
interval t, FORECASTER collects HW telemetry H_t and predicts
configuration C_t, calculates efficiency E_t and reward R_t,
updates Q-values, and reconfigures according to C_t; the same
steps repeat at the start of interval t+1 for C_{t+1}.)
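The way selection step of Section 3.3.3 can be sketched in Python as follows. The function name `choose_ways_to_gate` and the list-of-booleans representation of a cache set are hypothetical. The behavior when a set has fewer invalid lines than requested (gate only what is available) is an assumption consistent with the rule that valid lines are never clock-gated.

```python
import random

def choose_ways_to_gate(valid_bits, num_to_gate, rng=random):
    """Pick ways of one cache set to clock-gate; only invalid lines are eligible."""
    candidates = [way for way, valid in enumerate(valid_bits) if not valid]
    if len(candidates) <= num_to_gate:
        # Assumption: never gate a valid line, even if fewer lines are gated.
        return sorted(candidates)
    # More eligible lines than needed: choose among them at random.
    return sorted(rng.sample(candidates, num_to_gate))
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible in a simulator.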
3.4 Calculating Rewards and Updating
Q-values
FORECASTER collects the hardware telemetry, H_t, at the
beginning of an interval, t, and determines the new
configuration, C_t, using the Q-value, Q_t, predicted by the DNN.
Then, it reconfigures the hardware resources accordingly and
continues the program execution. The timing is shown in Figure 5.
At the beginning of the next interval, t+1, FORECASTER
calculates the new efficiency that results from the
reconfiguration. FORECASTER compares the new efficiency with the
old one (the one calculated at the beginning of interval t).
If the efficiency increases, FORECASTER receives a reward of +1.
If the efficiency remains unchanged, FORECASTER receives a
reward of 0. If the efficiency decreases,
FORECASTER receives a reward of −1. Based on the reward,
FORECASTER uses the commonly used temporal difference
method to update Q_{t-1} (the Q-value predicted at interval
t-1) [30]. In this method, the new Q-value, Q'_{t-1}, is calculated
using the following equation:
Q'_{t-1} = (1 − α) Q_{t-1} + α [r + γ Q_t]
Here, r is the reward, α is the learning rate, and γ is the
discount factor. The DNN uses backpropagation to learn Q'_{t-1}.
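The reward rule and the temporal difference update of Section 3.4 translate directly into code. The following is a minimal Python sketch; the function names are hypothetical, and the values of the learning rate and discount factor in the example are made up for illustration.

```python
def reward(old_efficiency, new_efficiency):
    """Return +1 if efficiency improved, 0 if unchanged, -1 if it dropped."""
    if new_efficiency > old_efficiency:
        return 1
    if new_efficiency < old_efficiency:
        return -1
    return 0

def td_update(q_prev, q_curr, r, alpha, gamma):
    """Compute Q'_{t-1} = (1 - alpha) * Q_{t-1} + alpha * (r + gamma * Q_t)."""
    return (1 - alpha) * q_prev + alpha * (r + gamma * q_curr)

# Illustrative numbers only: alpha and gamma are not taken from the paper.
r = reward(old_efficiency=1.0, new_efficiency=1.2)
target = td_update(q_prev=0.5, q_curr=0.8, r=r, alpha=0.1, gamma=0.9)
```

With these illustrative inputs the target works out to 0.9 * 0.5 + 0.1 * (1 + 0.9 * 0.8) = 0.622, which is the value the DNN would then be trained toward.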
3.5 Continual Lifelong Learning
Use of DQN in FORECASTER provides a natural way to
implement continual lifelong learning. In the DQN model,
FORECASTER learns by training a DNN with Q-values. To
continue learning from one execution to the next, FORE-
CASTER stores the DNN topology and weights in a file at the
end of each execution. Section 4 presents an extension to the
ISA that is used to read the topology and weights of the DNN
and write them back in a special file. At the beginning of the
next execution, FORECASTER loads the topology and weights
of the DNN and continues to learn from where it left off in the
last execution. If multiple applications are running concurrently
in a processor, each application will store its own DNN
topology and weights at the end of the respective execution.
In that case, FORECASTER uses an offline process to merge
the networks periodically (e.g., once a day). For merging,
FORECASTER uses TensorFlow [1] to load all the DNNs, and
generates a number of random state samples. For each state
sample, FORECASTER calculates the Q-value of each action
using all the DNNs, takes an average of the Q-values, and
retrains the largest DNN. Thus, the largest DNN accumulates
the knowledge of all DNNs. At the beginning of the next
execution of an application, FORECASTER loads this DNN
and continues execution.
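The data-generation part of this merging process can be sketched in plain Python. The sketch is an assumption-laden illustration: the name `build_merge_targets` is hypothetical, each network is modeled as a callable that maps a state to a list of Q-values, states are drawn uniformly from [0, 1), and the retraining of the largest DNN (done with TensorFlow in the paper) is left out.

```python
import random

def build_merge_targets(networks, state_dim, num_samples, seed=0):
    """Return (state, averaged Q-values) pairs for retraining the largest DNN."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_samples):
        # Assumption: random states are sampled uniformly from [0, 1).
        state = [rng.random() for _ in range(state_dim)]
        predictions = [net(state) for net in networks]  # one Q-vector per DNN
        averaged = [sum(qs) / len(qs) for qs in zip(*predictions)]
        dataset.append((state, averaged))
    return dataset
```

The resulting pairs would serve as supervised training targets for the largest network.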
4. IMPLEMENTATION
In this section, we outline the implementation of Deep
Q-Learning in FORECASTER as well as the extension to the
ISA.
4.1 Deep Q-Learning (DQN) Module
We propose to add a DQN module in the chip. The module
contains a Neural Processing Unit (NPU) for implementing
the DNN along with additional buffers such as input and
replay buffers, and a control logic. Figure 6 shows the high
level schematic of the DQN module.
There are several NPU designs in the literature [2, 9, 23]. We
propose to use an NPU similar to the one proposed by
Esmaeilzadeh et al. [9]. It consists of a number of Processing
Elements (PEs) and a scheduler. Each PE implements an
individual artificial neuron. Each PE contains input and weight
registers, a multiplier, an adder, a partial sum register, and a
comparator. Input and weight registers along with the adder,
multiplier, and partial sum register are used to calculate the
dot product of inputs and weights. The comparator is used
to implement the ReLU activation function [18]. The scheduler
schedules each layer of the DNN in the PEs starting from the
input layer. After calculating the Q-values, the current input
and Q-values are stored in the replay buffer. When
FORECASTER receives a reward and calculates the updated Q-value,
the replay buffer provides the saved inputs and Q-values to the
NPU to learn the new Q-value.
Figure 6: Details of the DQN module. (Components shown: a Neural
Processing Unit with PE1 to PE8 and a scheduler; each PE with an
input register, weight register, multiplier, adder, partial sum
register, and comparator; an input buffer; a replay buffer; and
control logic.)
The control logic sequences the operations to implement
the DQN algorithm. The control logic also contains three
special registers to store learning rate, discount factor and
exploration ratio. The learning rate and discount factor are
used to calculate the new Q-value (Section 3.4). The exploration
ratio dictates how often the module chooses a random
exploratory action as opposed to an action based on the
maximum Q-value. DQN uses exploratory actions to try
actions that would otherwise never have been selected. This
is done to find a potentially better action than the one based
on prior knowledge.
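One common way to realize such an exploration ratio is an epsilon-greedy rule, sketched below in Python. Treating the exploration ratio as a per-decision probability is an assumption, and the function name `choose_action` is hypothetical.

```python
import random

def choose_action(q_values, exploration_ratio, rng=random):
    """Pick a random action with probability exploration_ratio, else the greedy one."""
    if rng.random() < exploration_ratio:
        return rng.randrange(len(q_values))  # exploratory action
    return max(range(len(q_values)), key=lambda i: q_values[i])  # greedy action
```

With a ratio of 0 the module always follows the maximum Q-value, and with a ratio of 1 every action is random.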
4.2 Communicating Telemetry and Reconfig-
uration Decision
Based on our experiments (Section 5.2.6) and intuition,
we select the following hardware counters as telemetry: (i)
number of integer instructions, (ii) number of logical
instructions, (iii) number of floating point instructions, (iv) number
of memory access instructions, and (v) number of control
flow instructions. Each core collects the telemetry
independently and sends it to the DQN module after every n (e.g.,
n = 10,000) instructions. When the DQN module has received
telemetry covering at least N (e.g., N = 500,000) instructions
in total, FORECASTER assumes the start of a new interval. The DQN
module aggregates the telemetry and normalizes each counter
with respect to the total number of instructions in the interval
that just finished. The DQN module then predicts the new
configuration and sends a reconfiguration message to each core.
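The aggregation and normalization performed by the DQN module can be sketched as a small Python class. The class name `TelemetryAggregator`, the counter keys, and the dictionary-based sample format are hypothetical; only the five counter types and the example interval size come from the text.

```python
COUNTERS = ["int", "logic", "fp", "mem", "ctrl"]  # the five telemetry counters

class TelemetryAggregator:
    def __init__(self, interval_size=500_000):
        self.interval_size = interval_size
        self.totals = dict.fromkeys(COUNTERS, 0)
        self.instructions = 0

    def add_sample(self, counts, instructions):
        """Add one per-core sample; return normalized telemetry when an interval ends."""
        for name in COUNTERS:
            self.totals[name] += counts[name]
        self.instructions += instructions
        if self.instructions < self.interval_size:
            return None  # interval still in progress
        # Normalize each counter by the total instructions of the finished interval.
        normalized = [self.totals[name] / self.instructions for name in COUNTERS]
        self.totals = dict.fromkeys(COUNTERS, 0)
        self.instructions = 0
        return normalized
```

Each core would call `add_sample` after every n instructions, and a non-`None` return value marks the start of a new interval.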
4.3 ISA Extension
We extend the ISA with instructions to set and get DQN
configurations. We propose a fixed format for DQN
configurations. The format is as follows: <BOC>, Layer1, Layer2,
..., LayerN, <EOL>, Weight1, Weight2, ..., WeightM, <EOW>,
LearningRate, DiscountFactor, ExplorationRatio, <EOC>.
Here, <BOC>, <EOL>, <EOW>, and <EOC> are special
markers (values) that indicate the beginning of the configuration,
end of layers, end of weights, and end of the configuration,
respectively. We propose two instructions: setconf %ri, %rc
and getconf %rc, %ri. setconf %ri, %rc writes the
configuration value at address [%ri] into the configuration
register %rc. On the other hand, getconf %rc, %ri reads from
the configuration register %rc into the address [%ri]. In
order to initialize the DQN module with a particular
configuration, FORECASTER needs to invoke a function that executes
a sequence of setconf instructions in a loop until the <EOC>
marker is reached. Similarly, in order to save the current
configuration of the DQN module, FORECASTER executes
a sequence of getconf instructions in a loop until the <EOC>
marker is reached.
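The fixed configuration format can be illustrated with a short Python sketch that flattens a DQN configuration into a marker-delimited stream and parses it back. The function names are hypothetical, and the markers are modeled as strings purely for illustration; the paper only says they are special values.

```python
# Markers are modeled as strings here; their real encoding is not specified.
BOC, EOL, EOW, EOC = "<BOC>", "<EOL>", "<EOW>", "<EOC>"

def encode_config(layers, weights, learning_rate, discount, exploration):
    """Flatten a DQN configuration into the marker-delimited stream."""
    return [BOC, *layers, EOL, *weights, EOW,
            learning_rate, discount, exploration, EOC]

def decode_config(stream):
    """Parse a stream produced by encode_config back into its parts."""
    if stream[0] != BOC or stream[-1] != EOC:
        raise ValueError("missing <BOC> or <EOC> marker")
    eol, eow = stream.index(EOL), stream.index(EOW)
    layers = stream[1:eol]
    weights = stream[eol + 1:eow]
    learning_rate, discount, exploration = stream[eow + 1:-1]
    return layers, weights, learning_rate, discount, exploration
```

A save routine would emit such a stream one value at a time with getconf, and a load routine would consume it with setconf until it reaches `<EOC>`.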
5. EXPERIMENTAL EVALUATION
Figure 7: Average amount of (a) L2, (b) L3, and (c) BTB
turned off during the execution of streamcluster.
5.1 Experimental Setup
Table 2 shows the parameters of the simulated hardware
that we use to conduct the experiments. We use a modified
version of Multi2Sim [32] and McPAT [15] to simulate the
experimental hardware and its power consumption. The
PARSEC 3.0 benchmark suite is used with small inputs. Due
to resource and time constraints, all benchmarks are run to
completion or for 1 billion instructions, whichever comes
first. The interval size N is set at 0.5M instructions.
Parameter               Value
CPU                     8-core @ 2.4 GHz, SMT off
Private L1 cache (I/D)  32KB, 64B line, 8-way
Private L2 cache        1024K, 64B line, 8-way
Shared L3 cache         16M, 64B line, 16-way
Coherence protocol      Directory-based MOESI
Table 2: Parameters of the simulated hardware.
We conduct three experiments on three versions of FORE-
CASTER:
• Experiment 1: FORECASTER is implemented with a
giant table to store and update the Q-values. All
applications are run five times. Each run starts with an
empty Q-table. This version is essentially an adaptation
of the prior reinforcement learning-based approach [12] to
the current usage scenario.
• Experiment 2: FORECASTER is implemented with a
deep neural network to predict the Q-values. All
applications are run five times. Each run starts with an
untrained neural model.
• Experiment 3: FORECASTER is implemented with a
deep neural network to predict the Q-values. All
applications are run five times. Each run starts with the
trained neural model inherited from the previous
execution.
Figure 8: Normalized power consumption of five executions of applications.
In the first experiment, each run starts with an empty Q-
table, which means there is no knowledge accumulation be-
tween executions. This technique is basically the Q-learning
adopted from [12]. In the second experiment, we replace
the Q-table with a deep neural network to see how good
DQN is compared to the vanilla Q-learning. Experiment 3 is
similar to experiment 2 except an execution starts with the
model taken from the previous execution. The purpose of
this experiment is to investigate the efficacy of knowledge
accumulation.
5.2 Results
5.2.1 In-flight Analysis
Figures 7(a), 7(b), and 7(c) show how FORECASTER manages
the hardware resources during an execution of streamcluster.
On average, FORECASTER can turn off 64%, 66%, and 66% of the L2
cache, L3 cache, and BTB, respectively. FORECASTER
also deactivates the prefetcher for 26% of all intervals.
Similar behavior can be seen for other programs in the benchmark
suite. FORECASTER is able to determine the best size for
each structure for each phase. This can be seen from the
repetitive pattern in the figures, which maps to phases in each
program. In this paper, we use static phases consisting of a
fixed number of instructions. In the future, we plan to use phase
detection techniques [7, 26], which is expected to make the scheme
even more efficient.
5.2.2 Power Consumption
Experimental results show that FORECASTER with
continuous learning uses the least power compared to the other
techniques and is similar to the best static configuration, as
shown in Figure 8. On average, FORECASTER with accumulated
knowledge can save 16% of power across all applications
compared to the baseline. This is 2% more than the version
without continuous learning and 8% more than the version
with the basic Q-table.
5.2.3 Efficiency
The efficiency of each experiment is shown in Figure 9. In
general, our scheme outperforms the baseline configuration
in all benchmarks except canneal. Interestingly, the Q-
table version gives the best efficiency compared to the other
two versions with the neural network. This may be because
the Q-table does not require much time to learn compared
to the neural network. Due to the time constraint, only two
executions of canneal are completed for experiment 3. That
is why the neural network does not perform as expected.
5.2.4 Performance
Figure 10 shows that there is not much IPC degradation
when using FORECASTER. Specifically, the system IPC when
running swaptions is virtually unchanged across all versions
of FORECASTER. The Q-table version of FORECASTER has
the most consistent performance, as it causes only a 1.2% IPC
overhead on average. This result is comparable to the best
static configuration. In canneal, the two versions with the deep
neural network perform badly, as they degrade the system
IPC by about 15%. One reason is that it takes time to train
the neural network before it reaches reasonable accuracy.
Figure 9: Normalized efficiencies of five executions of applications.
The normalized execution time, measured in number of
cycles, is shown in Figure 11. Overall, the execution
time overhead incurred by FORECASTER is less than 5%.
FORECASTER tends to perform better in multi-threaded
applications such as streamcluster, swaptions, and fluidanimate
than in single-threaded applications such as canneal.
This is because FORECASTER makes only one prediction
for all cores, and the prediction depends largely on
the resources of the most heavily used core. For example,
a single-threaded program uses only one core, so the
L2 cache of that core is mostly occupied. However, when
FORECASTER reconfigures the hardware, it turns off the same
amount of L2 cache on every core, even though the L2 caches
on the other cores are mostly empty. This is a limitation of
FORECASTER that can be the subject of future research.
5.2.5 Cost
The cost of the proposed design can be divided into three
parts: delay or latency cost, hardware cost, and power
consumption cost. As for the latency cost, reading the hardware
telemetry and making a reconfiguration decision does not
happen on the critical path. The hardware continues in its
old configuration until a decision is made on a new
configuration.
The hardware cost consists of the DQN hardware and the
extra hardware used to implement the knobs. The DQN uses a
seven-layer neural network with six neurons per layer. There
is also an input layer of 10 neurons and an output layer of one
neuron. We therefore use eight processing elements (PEs):
the input layer takes two cycles, since eight PEs must do the
work of 10 neurons, and each subsequent layer takes one cycle.
Each PE is a simple execution unit that can do one fused
multiply-add operation per cycle, similar to the execution
units found in traditional graphics processing units (GPUs).
The PEs are organized in a design similar to the
neural processing unit (NPU) described in [9]. We also need
two extra registers for the old Q value and the new Q value
(calculated by the neural network based on the reward). A
simple computation unit is needed to calculate the new Q
value as shown in Section 3.4.
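The cycle accounting above can be sketched as follows. This is only an illustration, not the authors' hardware model: it assumes that a layer with n neurons occupies ceil(n / 8) cycles on the eight PEs, and that the seven 6-neuron layers sit between the 10-neuron input layer and the 1-neuron output layer. The function names are hypothetical.

```python
import math

# Assumption: each PE evaluates one neuron per cycle, following the
# text's accounting (10 neurons on 8 PEs -> 2 cycles, 6 neurons -> 1 cycle).
NUM_PES = 8

def layer_cycles(neurons, num_pes=NUM_PES):
    """Cycles needed for one layer when neurons are spread over the PEs."""
    return math.ceil(neurons / num_pes)

def inference_cycles(layer_sizes, num_pes=NUM_PES):
    """Total cycles for one forward pass, processing layers one after another."""
    return sum(layer_cycles(n, num_pes) for n in layer_sizes)

# Assumed layout: input layer (10), seven layers of 6 neurons, output layer (1).
LAYERS = [10] + [6] * 7 + [1]

if __name__ == "__main__":
    for size in LAYERS:
        print(size, "neurons ->", layer_cycles(size), "cycle(s)")
    print("total cycles:", inference_cycles(LAYERS))
```

Under these assumptions, only the input layer needs more than one cycle, so adding more PEs than the widest layer has neurons would not shorten the forward pass.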
The hardware needed for the knobs is straightforward. The
prefetcher is simply clock-gated, as its knob is on/off. The
BTB also uses clock gating, depending on the configuration.
We have four configurations, so a small 2-to-4 decoder
suffices, as shown in the reconfiguration logic of Figure 3.
Clock gating the cache ways is simplified by the fact that the
way-reconfiguration logic, shown in Figure 4, never gates a
valid entry, so no change to the cache controller or coherence
hardware is needed. The way-reconfiguration logic is not
complicated because it exploits the fact that large caches
(such as the L3) are usually partitioned. We therefore have
one logic circuit per partition.
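As a minimal illustration of the decoder mentioned above, the sketch below models a generic 2-to-4 decoder in software: a 2-bit configuration value raises exactly one of four select lines. What each select line gates in the actual design is given by Figure 3 and is not modeled here; the function name is hypothetical.

```python
from typing import List

def decode_2to4(select: int) -> List[int]:
    """Return the four one-hot output lines of a 2-to-4 decoder.

    `select` is the 2-bit configuration value (0..3). Exactly one
    output line is 1; the others are 0.
    """
    if not 0 <= select <= 3:
        raise ValueError("select must fit in two bits (0..3)")
    return [1 if line == select else 0 for line in range(4)]

if __name__ == "__main__":
    for cfg in range(4):
        print(cfg, decode_2to4(cfg))
```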
The power consumption of the above hardware is not high
due to several factors. First, the extra hardware is activated
only at the end of each program phase to make a prediction
and reconfigure the knobs. Second, the extra power
consumption is much smaller than the power saved by gating
the reconfigured structures. Finally, there are several options
for designing the neural network, ranging from executing it as
a software component on a CPU or GPU to designing it as
a digital ASIC [9], an FPGA [14], or an analog ASIC [5, 16]. Each
approach has its own characteristics of area, power, and cost.
5.2.6 Sensitivity Analysis
Figure 10: Normalized IPCs of five executions of applications
Figure 11: Normalized number of cycles taken between
experiments
We conduct three additional experiments to determine
the optimal interval size, history length, and number of
counters to collect. The history length experiment shows how
far into the past we should look when determining
the best configuration for the current interval. Simulation
results show that increasing the history length from 1 to 2
intervals reduces the efficiency gain by 3%, as shown in
Figure 12.
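The text does not spell out how a longer history is presented to the model. One plausible encoding, assumed here purely for illustration, concatenates the counter vectors of the last h intervals into a single state vector. The class and method names below are hypothetical.

```python
from collections import deque

class HistoryState:
    """Keeps the counter vectors of the last `history_length` intervals.

    Assumption: the model input is the concatenation of these vectors,
    oldest interval first. Missing history is zero-filled.
    """

    def __init__(self, history_length, counters_per_interval):
        self.counters_per_interval = counters_per_interval
        self.history = deque(
            ([0.0] * counters_per_interval for _ in range(history_length)),
            maxlen=history_length,
        )

    def push(self, counters):
        """Record the counters of the interval that just finished."""
        if len(counters) != self.counters_per_interval:
            raise ValueError("unexpected number of counters")
        self.history.append(list(counters))  # the oldest entry drops out

    def vector(self):
        """Flatten the stored intervals into one state vector."""
        return [value for interval in self.history for value in interval]

if __name__ == "__main__":
    state = HistoryState(history_length=2, counters_per_interval=2)
    state.push([0.1, 0.2])
    state.push([0.3, 0.4])
    print(state.vector())
```

With this encoding the input size grows linearly with the history length, so a longer history also means a larger network input to train.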
The number-of-counters experiment shows how many counters
are needed to best represent an interval. We test
three sets: 3-counter, 5-counter, and 8-counter sets. Below
is the list of the 8 counters that we collect:
• Normalized number of dispatched integer instructions.
• Normalized number of dispatched logic instructions.
• Normalized number of dispatched floating-point instructions.
• Normalized number of dispatched memory instructions.
• Normalized number of dispatched control instructions.
• Minimum free space across all L2 caches.
• Free space of the shared L3 cache.
• Branch predictor misprediction rate.
The 8-counter set includes all of the counters above. The
5-counter set includes the five normalized dispatched-instruction
counters, leaving out the last 3 counters. The 3-counter set
includes only the numbers of dispatched integer, memory, and
control instructions. Figure 13 shows that the set of 5 counters
gives the best efficiency. The set of 3 counters does not have
enough representation power, while the full set of 8 counters
is redundant.
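The three counter sets described above can be written down as a small lookup table. The sketch below is only an illustration: the short counter names are hypothetical labels for the eight counters in the list, and `select_counters` is a helper introduced here, not part of the original design.

```python
# Hypothetical short names for the eight counters listed above, in order.
COUNTERS = [
    "dispatched_int",
    "dispatched_logic",
    "dispatched_fp",
    "dispatched_mem",
    "dispatched_ctrl",
    "min_l2_free_space",
    "l3_free_space",
    "branch_mispredict_rate",
]

COUNTER_SETS = {
    8: COUNTERS,                      # all counters
    5: COUNTERS[:5],                  # dispatched-instruction counters only
    3: ["dispatched_int", "dispatched_mem", "dispatched_ctrl"],
}

def select_counters(sample, set_size):
    """Pick the values of one counter set from a full per-interval sample."""
    return [sample[name] for name in COUNTER_SETS[set_size]]

if __name__ == "__main__":
    # Made-up sample values, used only to show the selection.
    sample = {name: float(i) for i, name in enumerate(COUNTERS)}
    for size in (3, 5, 8):
        print(size, select_counters(sample, size))
```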
The third experiment determines how large the interval
should be. We test interval sizes of 0.25M, 0.5M, 1M, and
2M instructions. Simulation results show that an interval
size of 0.5M instructions gives 0.02%, 0.11%, and 0.11%
more efficiency gain than 2M, 1M, and 0.25M instructions,
respectively.
Figure 12: Efficiency comparison between different his-
tory lengths
5.2.7 Summary
Overall, the continuous learning version of FORECASTER
can save up to 17.5% of power consumption in some
applications and 16% on average compared to the baseline setup.
It gives an efficiency gain of 4% while sacrificing 4.7% of
execution time.
6. RELATED WORK
Figure 13: Efficiency comparison between different num-
ber of counters
Figure 14: Efficiency comparison between different in-
terval sizes
Tarsa et al. [31] propose a lightweight ML framework that
can be distributed through firmware updates to the microcon-
troller for post-silicon CPUs. The ML model is first trained
offline with a diverse collection of applications to avoid sta-
tistical blind spots. During execution, the CPU dynamically
sets the issue width of a clustered hardware component while
clock-gating unused resources based on the prediction of the
ML model.
Pan et al. [20] present a multi-level reinforcement learning
framework (MLRL) to address the scalability issue of
dynamic power management in multi-core processors. MLRL
effectively reduces the exponential decision process to a linear
problem by exploiting a hierarchical paradigm. In MLRL,
core states and Q-values are propagated from the bottom to
the top of the tree structure, then decisions are propagated
back down the tree, providing an efficient control mechanism.
Ravi et al. [22] propose CHARSTAR, a clock-tree-aware
resource optimizing mechanism. CHARSTAR incorporates
a multi-layer perceptron with one hidden layer to predict
the optimal configuration in each execution phase. The
neural network takes into account the clock hierarchy and the
topology overhead in order to improve the power savings.
However, the offline-trained model may soon become obsolete
for future, unseen programs. In addition, CHARSTAR only works
for single-threaded programs, and a multi-threaded version
may cause a super-linear increase in the size of the neural
network model.
7. CONCLUSIONS
This work presents the potential of dynamically tuning
hardware components to save power with a small
performance overhead. Our scheme, FORECASTER, when it
incorporates a continuously learning deep neural network, can
save up to 17.5% of power consumption compared to the
baseline configuration. On average, FORECASTER reduces
the power usage by 16% while sacrificing 4.7% of execution
time, leading to a 4% efficiency gain. Future research
may focus on improving the efficacy of FORECASTER as well as
extending the control of FORECASTER over more hardware
resources to achieve further efficiency gains.
REFERENCES
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S.
Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow,
A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser,
M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray,
C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar,
P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals,
P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng,
“TensorFlow: Large-scale machine learning on heterogeneous systems,”
2015, software available from tensorflow.org. [Online]. Available:
http://tensorflow.org/
[2] M. M. u. Alam and A. Muzahid, “Production-run software failure
diagnosis via adaptive communication tracking,” in Proceedings of the
43rd International Symposium on Computer Architecture, ser. ISCA
’16. Piscataway, NJ, USA: IEEE Press, 2016, pp. 354–366. [Online].
Available: https://doi.org/10.1109/ISCA.2016.39
[3] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and
S. Dwarkadas, “Memory hierarchy reconfiguration for energy and
performance in general-purpose processor architectures,” in
Proceedings of the 33rd Annual ACM/IEEE International Symposium
on Microarchitecture, ser. MICRO 33. New York, NY, USA: ACM,
2000, pp. 245–257. [Online]. Available:
http://doi.acm.org/10.1145/360128.360153
[4] R. Bitirgen, E. Ipek, and J. F. Martinez, “Coordinated management of
multiple interacting resources in chip multiprocessors: A machine
learning approach,” in Proceedings of the 41st Annual IEEE/ACM
International Symposium on Microarchitecture, ser. MICRO 41.
Washington, DC, USA: IEEE Computer Society, 2008, pp. 318–329.
[Online]. Available: https://doi.org/10.1109/MICRO.2008.4771801
[5] V. Calayir, M. Darwish, J. Weldon, and L. Pileggi, “Analog
neuromorphic computing enabled by multi-gate programmable
resistive devices,” in Proceedings of the 2015 Design, Automation &
Test in Europe Conference & Exhibition, ser. DATE ’15. San
Jose, CA, USA: EDA Consortium, 2015, pp. 928–931.
[6] S. Choi and D. Yeung, “Learning-based smt processor resource
distribution via hill-climbing,” in 33rd International Symposium on
Computer Architecture (ISCA’06), Jun 2006, pp. 239–251.
[7] A. S. Dhodapkar and J. E. Smith, “Managing multi-configuration
hardware via dynamic working set analysis,” in Proc. 29th
International Symposium on Computer Architecture, 2002.
[8] C. Dubach, T. M. Jones, E. V. Bonilla, and M. F. P. O’Boyle, “A
predictive model for dynamic microarchitectural adaptivity control,” in
2010 43rd Annual IEEE/ACM International Symposium on
Microarchitecture, Dec 2010, pp. 485–496.
[9] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neural
acceleration for general-purpose approximate programs,” in
Proceedings of the 2012 45th Annual IEEE/ACM International
Symposium on Microarchitecture, ser. MICRO-45. Washington, DC,
USA: IEEE Computer Society, 2012, pp. 449–460. [Online].
Available: https://doi.org/10.1109/MICRO.2012.48
[10] L. Graesser and W. L. Keng, Foundations of Deep Reinforcement
Learning: Theory and Practice in Python. Boston, MA, USA:
Addison-Wesley Professional, 2018.
[11] H. Hubert and B. Stabernack, “Profiling-based hardware/software
co-exploration for the design of video coding architectures,” in IEEE
Transactions on Circuits and Systems for Video Technology, Sep 2009,
pp. 1680 – 1691.
[12] E. Ipek, O. Mutlu, J. F. Martínez, and R. Caruana, “Self-optimizing
memory controllers: A reinforcement learning approach,” in
Proceedings of the 35th Annual International Symposium on
Computer Architecture, ser. ISCA ’08. Washington, DC, USA: IEEE
Computer Society, 2008, pp. 39–50. [Online]. Available:
https://doi.org/10.1109/ISCA.2008.21
[13] C. Isci, A. Buyuktosunoglu, C. Cher, P. Bose, and M. Martonosi, “An
analysis of efficient multi-core global power management policies:
Maximizing performance for a given power budget,” in 2006 39th
Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO’06), 2006, pp. 347–358.
[14] M.-J. Li, A.-H. Li, Y.-J. Huang, and S.-I. Chu, “Implementation of
deep reinforcement learning,” in Proceedings of the 2019 2nd
International Conference on Information Science and Systems, ser.
ICISS 2019. New York, NY, USA: Association for Computing
Machinery, 2019, pp. 232–236. [Online]. Available:
https://doi.org/10.1145/3322645.3322693
[15] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P.
Jouppi, “McPAT: An integrated power, area, and timing modeling
framework for multicore and manycore architectures,” in 2009 42nd
Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO), Oct 2009, pp. 469–480.
[16] D. Maliuk and Y. Makris, “An analog non-volatile neural network
platform for prototyping rf bist solutions,” in Proceedings of the
Conference on Design, Automation & Test in Europe, ser. DATE
’14. Leuven, BEL: European Design and Automation
Association, 2014.
[17] V. Mnih, K. Kavukcuoglu, and D. Silver, “Human-level control
through deep reinforcement learning,” Nature, vol. 518, Feb 2015, pp.
529–533.
[18] V. Nair and G. E. Hinton, “Rectified linear units improve restricted
boltzmann machines,” in Proceedings of the 27th International
Conference on Machine Learning, ser.
ICML’10. Madison, WI, USA: Omnipress, 2010, pp. 807–814.
[19] D. Novillo, “Samplepgo - the power of profile guided optimizations
without the usability burden,” in 2014 LLVM Compiler Infrastructure
in HPC, Nov 2014, pp. 22–28.
[20] G.-Y. Pan, J.-Y. Jou, and B.-C. Lai, “Scalable power management
using multilevel reinforcement learning for multiprocessors,” ACM
Trans. Des. Autom. Electron. Syst., vol. 19, no. 4, Aug. 2014. [Online].
Available: https://doi.org/10.1145/2629486
[21] P. Petrica, A. M. Izraelevitz, D. H. Albonesi, and C. A. Shoemaker,
“Flicker: A dynamically adaptive architecture for power limited
multicore systems,” in Proceedings of the 40th Annual International
Symposium on Computer Architecture, ser. ISCA ’13. New York, NY,
USA: ACM, 2013, pp. 13–23. [Online]. Available:
http://doi.acm.org/10.1145/2485922.2485924
[22] G. S. Ravi and M. H. Lipasti, “Charstar: Clock hierarchy aware
resource scaling in tiled architectures,” in Proceedings of the 44th
Annual International Symposium on Computer Architecture, ser. ISCA
’17. New York, NY, USA: Association for Computing
Machinery, 2017, pp. 147–160. [Online]. Available:
https://doi.org/10.1145/3079856.3080212
[23] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M.
Hernández-Lobato, G.-Y. Wei, and D. Brooks, “Minerva: Enabling
low-power, highly-accurate deep neural network accelerators,” in
International Symposium on Computer Architecture (ISCA), 2016.
[Online]. Available: http://vlsiarch.eecs.harvard.edu/wp-
content/uploads/2016/05/reagen_isca16.pdf
[24] M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. van de
Wiele, V. Mnih, N. Heess, and J. T. Springenberg, “Learning by
playing solving sparse reward tasks from scratch,” in Proceedings of
the 35th International Conference on Machine Learning, ser.
Proceedings of Machine Learning Research, J. Dy and A. Krause,
Eds., vol. 80. Stockholmsmässan, Stockholm, Sweden: PMLR,
10–15 Jul 2018, pp. 4344–4353. [Online]. Available:
http://proceedings.mlr.press/v80/riedmiller18a.html
[25] T. Sherwood, E. Perelman, and B. Calder, “Basic block distribution
analysis to find periodic behavior and simulation points in
applications,” in Proceedings of the 2001 International Conference on
Parallel Architectures and Compilation Techniques, ser. PACT ’01.
USA: IEEE Computer Society, 2001, pp. 3–14.
[26] T. Sherwood, S. Sair, and B. Calder, “Phase tracking and prediction,”
SIGARCH Comput. Archit. News, vol. 31, no. 2, pp. 336–349, May
2003. [Online]. Available: https://doi.org/10.1145/871656.859657
[27] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den
Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam,
M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner,
I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel,
and D. Hassabis, “Mastering the game of go with deep neural
networks and tree search,” Nature, vol. 529, pp. 484–489, 2016.
[28] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez,
M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap,
K. Simonyan, and D. Hassabis, “A general reinforcement learning
algorithm that masters chess, shogi, and go through self-play,” Science,
vol. 362, no. 6419, pp. 1140–1144, 2018. [Online]. Available:
https://science.sciencemag.org/content/362/6419/1140
[29] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang,
A. Guez, T. Hubert, L. R. Baker, M. Lai, A. Bolton, Y. Chen, T. P.
Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and
D. Hassabis, “Mastering the game of go without human knowledge,”
Nature, vol. 550, pp. 354–359, 2017.
[30] R. S. Sutton and A. G. Barto, Reinforcement Learning: An
Introduction. Cambridge, MA, USA: A Bradford Book, 2018.
[31] S. J. Tarsa, R. B. R. Chowdhury, J. Sebot, G. Chinya, J. Gaur,
K. Sankaranarayanan, C.-K. Lin, R. Chappell, R. Singhal, and
H. Wang, “Post-silicon cpu adaptation made practical using machine
learning,” in Proceedings of the 46th International Symposium on
Computer Architecture, ser. ISCA ’19. New York, NY, USA:
Association for Computing Machinery, 2019, pp. 14–26. [Online].
Available: https://doi.org/10.1145/3307650.3322267
[32] R. Ubal, J. Sahuquillo, S. Petit, and P. López, “Multi2sim: A
simulation framework to evaluate multicore-multithreaded processors,”
in 19th International Symposium on Computer Architecture and High
Performance Computing, Oct 2007, pp. 62–68.
[33] J. Wildstrom, P. Stone, E. Witchel, and M. Dahlin, “Machine learning
for on-line hardware reconfiguration,” in IJCAI 2007, Proceedings of
the 20th International Joint Conference on Artificial Intelligence,
Hyderabad, India, January 6-12, 2007, 2007, pp. 1113–1118. [Online].
Available: http://ijcai.org/Proceedings/07/Papers/180.pdf
