Parallel Implementation of Reinforcement Learning Q-Learning Technique for FPGA by Matheus, Torquato
 Cronfa -  Swansea University Open Access Repository
   
_____________________________________________________________
   
This is an author produced version of a paper published in:
IEEE Access
                               
   
Cronfa URL for this paper:
http://cronfa.swan.ac.uk/Record/cronfa49022
_____________________________________________________________
 
Paper:
Da Silva, L., Torquato, M. & Fernandes, M. (2019).  Parallel Implementation of Reinforcement Learning Q-Learning
Technique for FPGA. IEEE Access, 7, 2782-2798.
http://dx.doi.org/10.1109/ACCESS.2018.2885950
 
 
 
 
 
 
 
Parallel Implementation of
Reinforcement Learning Q-learning
Technique for FPGA
LUCILEIDE M. D. DA SILVA1, MATHEUS F. TORQUATO2, and MARCELO A. C. FERNANDES3
1Federal Institute of Education, Science and Technology of Rio Grande do Norte (IFRN), Santa Cruz, RN, Brazil (e-mail: lucileide.dantas@ifrn.edu.br)
2College of Engineering, Swansea University, Swansea, Wales, UK (e-mail: m.f.torquato@swansea.ac.uk)
3Department of Computer Engineering and Automation, Federal University of Rio Grande do Norte (UFRN), Natal, RN, Brazil (e-mail: mfernandes@dca.urn.br)
Corresponding author: Marcelo A. C. Fernandes (e-mail: mfernandes@dca.urn.br).
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) - Finance Code 001.
ABSTRACT Q-learning is an off-policy reinforcement learning technique whose main advantage is
the possibility of obtaining an optimal policy by interacting with an environment whose model is
unknown. This work proposes a parallel fixed-point Q-learning architecture implemented on field
programmable gate arrays (FPGA), focused on optimizing the system processing time. Convergence
results are presented, and the processing time and occupied area are analyzed for scenarios with
different numbers of states and actions and for various fixed-point formats. Studies of the accuracy
of the Q-learning response and of the resolution error associated with a reduced number of bits were
also carried out for the hardware implementation. Architecture implementation details are provided.
The entire project was developed using the System Generator platform (Xilinx), with a Virtex-6
xc6vcx240t-1ff1156 as the target FPGA.
INDEX TERMS FPGA, Q-learning, Reinforcement learning, Reconfigurable computing.
I. INTRODUCTION
Reinforcement learning is an artificial intelligence formalism that allows an agent to learn from
interaction with the environment in which it is placed [1]. This approach is suited to situations
in which there is not enough information about the behavior the agent must adopt to reach its
objective; that is, an agent without previous knowledge learns through interaction with the
environment, receiving rewards for its actions and thereby finding the optimal policy [2].
Implementing the Q-learning reinforcement learning technique in hardware enables the design of
systems that are faster than their software equivalents, opening up possibilities for its use in
problems that require meeting tight time constraints and/or processing a large data volume. It is
also possible to reduce power consumption by lowering the clock rate in applications where
processing speed is not relevant, or is less limiting than the need for low power consumption.
Navigation algorithms in mobile robotics applications, in general, respond within hundreds of
milliseconds, and this property allows solutions on dedicated hardware to work at a low clock
frequency compared with software solutions embedded on microcontrollers and microprocessors.
Real-time applications may have different time restric-
tions. Some examples of applications with the greatest re-
strictions are: systems for monitoring signals in health fa-
cilities, industrial systems control, digital communication
systems, robots and even cars and aircraft. Traditional mech-
anisms and methods are not always able to overcome the bar-
riers imposed by the more challenging time constraints. The
research and development of artificial intelligence hardware
algorithms for real-time applications has grown significantly
in recent years due to their sampling time performance poten-
tial [3]–[8]. One of the purposes for the Q-learning technique
implementation in hardware is to accelerate the algorithm
processing and to obtain a faster optimal policy so that it can
be used in high demanding applications.
Another motivation for the development of this work is
the possibility to accelerate applications with great data flow
such as in Big Data processing. Another application with the
same burden of handling large amounts of data is Bioinfor-
matics which, usually, needs to handle a large amount of
genomic sequencing data [9]. It is also possible to use this
approach in Data Mining applications to discover relevant
information, which happens to be masked in large amounts
of data [10].
Unlike general-purpose processors that usually have their
VOLUME 4, 2016 1
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS
clock at the maximum throughput, on field programmable
gate arrays (FPGAs) the clock depends on what is running
on it. Using a clock rate less than the maximum theoretical
operating frequency causes the dynamic power consumed to
decrease. The lower the clock, the lower the consumption
[11], making it suitable as well for low consumption systems.
For this work, FPGA was chosen because it provides high
performance with a low operating frequency through the
exploration of parallelism [12]. The latest FPGAs can deliver
ASIC-like performance and density with the advantages of
reduced development time, ease and speed of reprogram-
ming, not to mention its flexible architecture [13].
Thus, this work proposes a modular and parallel architecture for implementing the Q-learning
technique on FPGA reconfigurable hardware with the purpose of reducing processing time, allowing
the algorithm to be used both in systems with fast dynamics and large data flows and in low power
consumption applications. The development of this work, as well as all simulations and results,
was carried out using the Xilinx System Generator development platform [14] configured to work
with an FPGA Virtex-6 xc6vcx240t-1ff1156 [15].
A. RELATED WORK
Machine learning, artificial intelligence and signal process-
ing have been widely used in many recent applications. It is
important to note two important new features for these types
of applications: the amount of data that needs to be processed
is constantly growing and mobile and robotic systems are
becoming increasingly important [16]. As a result, several
machine learning algorithms and artificial intelligence hard-
ware implementations can be found in the related literature.
In [3], an overview of hardware implementations of artificial neural networks and fuzzy systems
is presented, highlighting the main limitations, advantages and disadvantages of various
application techniques. The author also analyzes various hardware performance parameters,
bottlenecks and the cost-benefit trade-off intrinsic to the various implementation methodologies.
In [5], the implementation of an FPGA hardware architecture for an associative memory neural
network applied to image recognition systems is described, with a detailed study of the network
performance, including data such as occupancy rate, processing speed and consumption of the
system in hardware.
However, little can be found regarding hardware imple-
mentation of fixed and programmable architecture for the Q-
Learning reinforcement learning technique.
In [17], a hardware pipeline architecture was described for the mechanism that selects the best
action in a Q-learning state. According to the author, the algorithm delay increases with the
number of actions and is the bottleneck of the system; this delay can be reduced by implementing
a pipeline architecture to select the best action in the state. It was also proved mathematically
that the value function converges to an optimal policy despite the pipeline architecture.
However, no hardware solutions were presented for the other mechanisms inherent to the Q-learning
algorithm, nor were the implementation details, the occupation analysis or the tool used for the
hardware design presented. In [18], a hardware architecture of the SARSA (or on-line Q-learning)
reinforcement learning algorithm was developed for a dynamic power management application. The
main difference between SARSA and Q-learning is that SARSA is on-policy, that is, it learns
action values related to the policy it follows, while Q-learning is off-policy, not depending on
the policy being used. The author converted the SARSA algorithm into equivalent hardware,
modeling the architecture in VHDL (ModelSim). The architecture implements a power management
system that is able to change policy according to its workload. The proposal was simulated and
synthesized for the Xilinx Spartan 2E. Implementations of Deep Q-learning algorithms on FPGA are
presented in [19]–[21], where the results show the advantages of using FPGA compared to GPU and
CPU. However, the works [19]–[22] apply a semi-parallel implementation technique, which differs
from the approach proposed in this work, which uses a fully parallel implementation. The fully
parallel approach allows high throughput when compared with other implementation techniques.
Certain applications in signal processing and machine learning impose technical limitations on
the hardware. In addition, the amount of data that needs to be processed is constantly growing
[16]. A practical alternative for the design of hardware architectures is the use of
reconfigurable devices such as FPGAs, which provide density and performance similar to an ASIC
(Application Specific Integrated Circuit) with the advantage of rapid and flexible prototyping
[4]. Due to their reconfigurable nature, new functions can always be added and the system can be
upgraded as needed [13]. Another relevant advantage of FPGAs is the possibility of designing
hardware modules that work in parallel, thus increasing the system processing capacity [23],
which allows different parts of the algorithm to be executed simultaneously in order to reduce
the overall processing time. This reduction makes FPGAs an interesting alternative, with better
performance than conventional microprocessors such as CPUs or GPUs, especially for applications
with severe time restrictions.
Some papers found in the literature point out the design
of parallel algorithms to increase their processing capacity.
In [24], a decomposition technique for the Markov Decision
Process (MDP) is approached in sub-problems, presenting
a structure for the parallelization of reinforcement learning
techniques. This technique, according to the work, is able
to decrease processing time by up to ten times. In [25],
an implementation of the Q-learning algorithm is proposed
in a massively parallel machine using a Parallel Virtual
Machine (PVM) message exchange paradigm with a cache-
based communication scheme. This work presents significant
convergence results and speedup, pointing to parallelization as an interesting
way of reducing the training time for policy learning. The work shown in [26] presents a
comparative study of several parallel implementations of the
Q-learning algorithm with computer clustering architecture.
The parallel Q-Learning (PQL) methods studied were the
State Division Learning Method (SDLM), the Prioritized
Field Learning Method (PFLM) and the Parallel Fuzzy Q-
Learning (PFQL). A parallel algorithm with multi-agent
learning using Q-learning is proposed in [27]. The algorithm
is called PQL with Co-allocation of Storage and Processing
(PCSP), and it uses a table partition strategy for sharing Q-
table information among the processing nodes. The works
presented in [24]–[27] aim to improve Q-learning processing using High-Performance
Computing (HPC); however, this alternative has been identified as costly in terms of
power consumption per unit of processing.
Other works support the use of FPGA by presenting its advantages over other platforms for
applications with time restrictions. In [12], a comparative performance study involving FPGA,
GPU and CPU in image processing problems is conducted. Despite the possibility of using
parallelism in multi-core microprocessors, which improves performance for a large number of
applications, the cores are all grouped and data transfer between them is very limited. Only for
some simple problems (e.g. naive algorithms) is the GPU able to achieve performance similar to
the FPGA. For more sophisticated algorithms (e.g. shared arrays), GPUs do not show the same
performance, since they have memory access limitations resulting from their architecture.
FPGAs provide hardware platforms suitable for deploying
software algorithms [23]. From the theoretical basis pre-
sented, it is possible to conclude that the low execution
time of FPGA devices in comparison with its software coun-
terparts is the main reason for its use as a platform for
the development of the Q-learning Reinforcement Learning
Technique.
B. MAIN CONTRIBUTIONS
The contribution of this work is a parallel hardware architecture on FPGA for the Q-learning
reinforcement learning technique. The main idea is the development of a modular and parallel
architecture that enables either an increase in the algorithm execution speed or lower power
consumption by decreasing the clock frequency. The intrinsic properties of the FPGA, such as
flexibility and parallel processing, were fundamental to achieving this goal. The parallelization
of the data flow on FPGA allows the Q-learning technique to be used in applications with
significant data flow and strict processing time restrictions. Another possible application of
this architecture is in low power systems, where the system clock can be slowed down to reduce
power consumption.
C. PAPER ORGANIZATION
This paper is organized as follows.
This first section presented a brief introduction, in which the problem addressed was put in
context. A review of the literature and of the state of the art was also conducted, and the main
objectives were presented.
Section II presents the theoretical foundation of reinforcement learning, exploring the main
characteristics and advantages of the algorithm implemented in hardware, Q-learning.
Section III gives a detailed description of the architecture development and implementation,
describing the various modules used to build the algorithm in hardware.
Section IV validates the system by simulating a few problems on the presented architecture and
comparing the results with those obtained by simulating the same problems in software. The
hardware synthesis for different implementation scenarios is also analyzed, together with
parameters such as occupied area and throughput (or sampling frequency).
Section V presents the final considerations.
II. Q-LEARNING TECHNIQUE
This section aims to discuss the concepts and uses of rein-
forcement learning, emphasizing the technique used in this
work, Q-learning.
Reinforcement learning is the maximization of numerical rewards by mapping events defined by
states and actions [1]. The agent is not told which action to take, as in other forms of machine
learning; instead, it must find out which actions produce the best reward in each state from
interactions with the environment. As described in [1], the agent determines an action to be
performed from the situations encountered in the environment. The executed action transforms the
environment and changes the state with respect to reaching the goal. These modifications are
transmitted to the agent through a reward and the next state.
The algorithm goal is to find a sequence of actions that
determines an optimal policy, defined as the state mapping in
actions that maximize the sum of the reinforcement values.
Figure 1 summarizes the described agent-environment inter-
action process, where:
• $s_k$ is a representation of the environment state, where $s_k \in S$ and $S$ is the set of
possible states;
• $a_k$ is a representation of an action, where $a_k \in A$ and $A$ is the set of possible
actions in the state $s_k$;
• $r_{k+1}$ is a numerical reward, a consequence of the action $a_k$ taken;
• $s_{k+1}$ is the new state.
Q-learning [28] is one of the learning techniques classified as an off-policy temporal
difference method, since the convergence to optimal values of Q does not depend on the policy
being used. The future reward function in the state s when performing an action a, denoted as
Q(s, a), is learned through interactions with the environment. It is considered the most popular
RL algorithm and has been proposed as a way to iteratively learn an optimal policy when the
system model is not known.
FIGURE 1: Agent-environment interaction in reinforcement learning [1]. The agent sends the action
$a_k$ to the environment and receives the state $s_k$ and the reward $r_k$; the environment
produces $s_{k+1}$ and $r_{k+1}$.
The equation for updating the value function of the state-
action pairs Q(s, a) is based on the value-action function
expressed as:
$$Q_{k+1}(s_k^n, a_k^z) = Q_k(s_k^n, a_k^z) + \alpha\left[r(s_k^n, a_k^z) + \gamma \max\left(Q(s_{k+1}^n, a_{k+1}^z)\right) - Q_k(s_k^n, a_k^z)\right] \qquad (1)$$
where
• $k$ is the discretization instant, with sampling period $T_s$;
• $s_k^n$ is the n-th environment state in the k-th iteration;
• $a_k^z$ is the z-th action taken in the n-th state $s_k^n$, also in the k-th iteration;
• $Q_k(s_k^n, a_k^z)$ is the accumulated result for the agent having chosen the action $a_k^z$
in the state $s_k^n$ at the instant $k$;
• $r(s_k^n, a_k^z)$ is the immediate reinforcement received in $s_k^n$ for taking action
$a_k^z$;
• $s_{k+1}^n$ is the future state;
• $\max(Q(s_{k+1}^n, a_{k+1}^z))$ is the Q value corresponding to the maximum value function in
the future state;
• $\alpha$ and $\gamma$ are positive constants smaller than one that represent the learning
coefficient and the discount factor, respectively.
The learning coefficient determines to what extent new information replaces previous
information, while the discount factor, the closer it is to one, reduces the influence of
immediate rewards and gives more weight to long-term ones. The reward function r indicates the
immediately promising actions and the value function Q indicates the total accumulated gain.
When the agent moves from a state to a future one, Q-learning propagates the value function
estimate of the new state back to the previous state.
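The effect of the two coefficients can be checked numerically with a minimal Python sketch of a single application of Equation 1. The function name `q_update` and the numbers used are illustrative assumptions, not values from the paper, and floating point is used here instead of the fixed-point formats of the hardware.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """Apply Equation 1 once to a single state-action value.

    q_sa       -- current estimate Q_k(s, a)
    reward     -- immediate reinforcement r(s, a)
    max_q_next -- largest Q value available in the future state
    alpha      -- learning coefficient
    gamma      -- discount factor
    """
    target = reward + gamma * max_q_next      # discounted estimate of the return
    return q_sa + alpha * (target - q_sa)     # move the old value toward the target


# Hypothetical numbers chosen only to make the arithmetic easy to follow.
q_half = q_update(q_sa=0.0, reward=1.0, max_q_next=0.5, alpha=0.5, gamma=0.5)
q_full = q_update(q_sa=2.0, reward=1.0, max_q_next=0.5, alpha=1.0, gamma=0.5)
print(q_half, q_full)
```

With a learning coefficient of one the old estimate is discarded and replaced by the target, while smaller values blend the old and new information.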
A relevant aspect of the Q-learning reinforcement learning
technique is that the choice of actions to be performed during
the process of estimating the Q(s, a) value function can be
performed by any method of exploration/exploitation or even
randomly. As demonstrated by [28], if each action-state pair
is visited an infinite number of times the value function Q
will converge with probability 1 to its optimal value, using a
sufficiently small alpha learning coefficient.
Figure 2 shows the Q-learning algorithm pseudocode. It
has as inputs the learning coefficient, the time discount rate
and the reward function. It begins with the initialization of the
Q values function matrix and initial state s0. The algorithm
chooses an action from among the possible ones for the
current state and observes the next state and reward. The
value of Q is updated, the new state is defined and then the
process is repeated until it returns the updated Q matrix.
Q ← Q_0(s_0^n, a_0^z)
s ← s_0
while k < steps limit do
    Draw an action a_k^z for the state s_k^n
    Execute the action a_k^z
    Observe the state s_{k+1}^n and update Q_k(s_k^n, a_k^z) according to:
        Q_{k+1}(s_k^n, a_k^z) = Q_k(s_k^n, a_k^z) + α[r(s_k^n, a_k^z) + γ max Q(s_{k+1}^n, a_{k+1}^z) − Q_k(s_k^n, a_k^z)]
    s_k ← s_{k+1}
end while
return matrix Q
FIGURE 2: Q-learning algorithm.
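As a software reference for the pseudocode in Figure 2, the loop can be sketched in Python as below. This is a minimal floating-point sketch under stated assumptions: the function `q_learning`, the tables `NEXT_STATE` and `REWARD`, and the four-state chain environment are hypothetical and serve only to exercise the algorithm; they are not taken from the paper.

```python
import random


def q_learning(next_state, reward, alpha, gamma, steps, s0=0, seed=0):
    """Tabular Q-learning with uniformly random action draws (pure exploration)."""
    n_states = len(next_state)
    n_actions = len(next_state[0])
    rng = random.Random(seed)                          # seeded for repeatability
    q = [[0.0] * n_actions for _ in range(n_states)]   # Q <- Q0 (all zeros here)
    s = s0                                             # s <- s0
    for _ in range(steps):                             # while k < steps limit
        a = rng.randrange(n_actions)                   # draw an action for the state
        s_next = next_state[s][a]                      # execute it, observe next state
        target = reward[s][a] + gamma * max(q[s_next])
        q[s][a] += alpha * (target - q[s][a])          # update rule of Equation 1
        s = s_next                                     # s_k <- s_{k+1}
    return q


# Hypothetical chain of 4 states: action 0 moves left, action 1 moves right.
# Only the move from state 2 into state 3 is rewarded.
NEXT_STATE = [[0, 1], [0, 2], [1, 3], [2, 3]]
REWARD = [[0.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

q = q_learning(NEXT_STATE, REWARD, alpha=0.5, gamma=0.5, steps=20000)
greedy = [row.index(max(row)) for row in q]            # best action per state
print(greedy)
```

Because the actions are drawn at random, the learned policy is read from the returned matrix afterwards by taking the action with the largest Q value in each state.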
Although the Q-learning convergence criterion requires state-action pairs to be visited
infinitely many times, in practice it is possible to reach satisfactory values by executing a
sufficiently large number of iterations (considering the task to be learned). For a problem of
18 states and 2 actions, as described in [2], where Q-learning was used to determine an optimal
selection policy between beamforming and power control of an adaptive antenna array, the Q matrix
converges after approximately 2500 iterations. In [29], Q-learning was used for adaptive thermal
management of multicore systems, in order to improve their reliability and extend their useful
life. In this second work, the RL algorithm learned the relation between the core mapping, the
core frequency and its temperature, defining the thermal stress intervals as states and the
threads, voltages and operating frequencies as actions. For a quad-core Intel processor, the
system was modeled with 12 states and 8 actions, requiring a total of 5500 iterations for the Q
value function to converge. In [30], the Q-learning algorithm was used to optimize the policy for
choosing among three controllers applied to a tank system, in order to take advantage of the
positive characteristics of each of them, thus optimizing the system output. In this case the
error signal was discretized into 41 intervals, which characterize the admissible states, and the
choice of one of the three controllers constitutes the actions of the problem. A system with 41
states and 3 actions, choosing the actions randomly, converges to an optimal policy in
approximately 8000 iterations.
III. IMPLEMENTATION DESCRIPTION
In this section the implementation details of the developed architecture are described. Section
III-A presents an overview of the FPGA hardware architecture for Q-learning and defines the
notation used to describe the structure. The following subsections discuss the particularities of
each system module, detailing the mechanisms used for the hardware implementation of the
algorithm shown in Figure 2.
A. PROPOSED ARCHITECTURE OVERVIEW
An overview of the developed hardware architecture is presented in Figure 3. The system receives
the FPGA clock as input, and the initial state value, $s_0$, must be randomly initialized in the
REG1 register. The whole system is detailed from this diagram, where the main mechanisms of
action choice, value function calculation, state-action pair update and future state selection
are explained. The system is designed to operate with N states and Z actions, and therefore with
N × Z possible state-action pairs. The architecture was developed to parallelize the algorithm
execution as much as possible in order to decrease the Q-learning processing time.
The notation used in the figures is described below:
• $k$ is the discretization time, with sampling time $T_s$, from which the transfer rate, known
as throughput, can be measured. The throughput (or sampling frequency) can be expressed as
$F_s = 1/T_s$ in samples per second (Sps) or iterations per second (Ips);
• $s_k^n$ is the n-th environment state in the k-th iteration;
• $a_k^z$ is the z-th action taken in the n-th state $s_k^n$, also in the k-th iteration;
• $\mathbf{u}_k^n$ is a vector of Z elements that enable or disable the registers that store the
value function in the n-th state;
• $\mathbf{r}^n$ is a constant vector with the immediate reinforcement for the Z actions of the
n-th state;
• $s_{k+1}^n$ is the n-th future state;
• $\max Q^n$ is the Q value corresponding to the action with the highest reinforcement value in
the state, updated at each iteration;
• $\mathbf{Q}_k^n$ is a vector containing the elements of the value function assigned to the Z
actions of the n-th state;
• $\alpha$ and $\gamma$ are positive constants smaller than one that represent the learning
coefficient and the discount factor, respectively.
The architecture is composed of five main module types: the GA module, responsible for randomly
choosing the actions of the algorithm; the EN modules, which determine which state-action pair
should be updated; the RS modules, responsible for storing the reward values; the S modules,
responsible for the calculation of the Q value function; and the SEL module, where the future
state selection and the storage of the Q value function take place. Each of the system modules is
detailed individually in the following sections.
B. GA - ACTION DRAW
As seen in the Q-learning pseudocode shown in Figure 2 (Section II), it is necessary to draw the
action $a_k^z$ for the state $s_k^n$. For this purpose, a Pseudo Random Number Generator (PRNG)
was implemented. The generator draws, from all possible actions (0 to Z − 1), the action that
will be taken. Each z-th action is represented by a word of $\log_2(Z)$ bits. The first state
$s_0$ is randomly initialized, among all possible states (0 through N − 1), in the REG1 register,
and has a size of $\log_2(N)$ bits.
The pseudo random number generator is the starting point
for executing the algorithm. From the second iteration, the
following states are defined by feedback, as a consequence
of the system actions and the actions continue to be randomly
defined.
For the algorithm to converge, all state-action pairs must be visited a sufficiently large
(ideally infinite) number of times. For this purpose, a pseudo-random number generator based on
the numerical congruence described in [31] is used. The expression used to implement the PRNG is
$$a_k^z = P_1 \times a_{k-1}^z + P_2 \pmod{Z}, \qquad (2)$$
where the values $P_1$ and $P_2$ are constants, $a_k^z$ is an integer between 0 and Z − 1 and
$a_{k-1}^z$ is the value from the previous instant of the pseudo-random number series. The
internal architecture of the random number generator is illustrated in Figure 4.
Here, the constant $P_2$ was set to zero. The modulo operation (mod Z) is obtained from the
intrinsic overflow behavior of the multiplier, the wrap-around (i.e. values that exceed the
maximum number of bits are brought back within the representable range by keeping only the least
significant bits).
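The equivalence between wrap-around and the modulo operation, for a word of b bits, can be illustrated with a short Python sketch of Equation 2 with $P_2 = 0$. The function name `prng_step`, the multiplier 5, the 4-bit word and the seed 1 are assumptions chosen for readability; the paper does not state the constants it uses.

```python
def prng_step(a_prev, p1, n_bits):
    """One step of a_k = P1 * a_{k-1} (mod 2**n_bits), with P2 = 0.

    Keeping only the n_bits least significant bits of the product
    mimics the wrap-around of a fixed-width hardware multiplier.
    """
    mask = (1 << n_bits) - 1
    return (p1 * a_prev) & mask


# Hypothetical constants: multiplier 5, 4-bit word, seed 1.
P1, N_BITS, SEED = 5, 4, 1

seq = []
a = SEED
for _ in range(8):
    a = prng_step(a, P1, N_BITS)
    seq.append(a)
print(seq)

# Masking and the explicit modulo agree for every 4-bit input.
same = all(prng_step(x, P1, N_BITS) == (P1 * x) % 16 for x in range(16))
print(same)
```

Since $P_2$ is zero, a zero seed would stay at zero forever, so the register must start from a nonzero value. In this toy configuration the sequence also repeats after a few steps, which is one reason to use a word much wider than the number of actions.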
However, problems can occur when the number of possible actions is not a power of two. To
overcome this limitation, the scheme shown in Figure 5 was adopted. A PRNG with a number of
possible combinations ($N^{o}_{\max}$) much larger than the number of desired actions is used.
The interval containing all combinations is then divided into Z equal intervals, where Z is the
desired number of actions. If the number drawn is $0 < x < C_1$, the action $a_k^z$ will be
$a_k^0$. If the number drawn is $C_1 < x < C_2$, the action will be $a_k^1$, and so on until
$x > C_{Z-1}$, when the action will be $a_k^{Z-1}$. This division is made using comparators and
combinational logic.
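The interval division can be sketched in Python as a chain of threshold comparisons. The function `action_from_draw` is hypothetical, the thresholds are rounded down to integers, and a draw that falls exactly on a threshold is assigned to the upper interval; the text does not specify the boundary case, so that choice is an assumption.

```python
def action_from_draw(x, n_max, n_actions):
    """Map a draw x in [0, n_max) to an action index in [0, n_actions).

    The thresholds are C_i = i * n_max / n_actions for i = 1 .. n_actions - 1,
    rounded down. Each comparison stands for one hardware comparator.
    """
    thresholds = [i * n_max // n_actions for i in range(1, n_actions)]
    action = 0
    for c in thresholds:
        if x >= c:          # assumption: a draw equal to C_i goes to the upper interval
            action += 1
    return action


# Hypothetical case: an 8-bit generator (256 combinations) and 3 actions.
N_MAX, Z = 256, 3
counts = [0] * Z
for x in range(N_MAX):
    counts[action_from_draw(x, N_MAX, Z)] += 1
print(counts)
```

When the number of combinations is not divisible by Z the intervals cannot be exactly equal, and the imbalance shrinks as the number of combinations grows relative to Z.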
Once the state-action pair is determined at each iteration, it is known which element of the
value function matrix $Q_k(s_k^n, a_k^z)$ must be updated. As the actions are chosen
pseudo-randomly, the architecture performs only the exploration (training) of the environment by
the agent to obtain the optimal policy, without addressing exploitation.
C. EN - UPDATE MODULE
FIGURE 3: Overview of the proposed architecture. [Block diagram: the GA module supplies $a_k^z$
to N parallel chains of ENn, RSn and Sn modules, whose outputs $s_{k+1}^n$, $\max Q^n$ and
$\mathbf{Q}_k^n$ feed the SEL module; SEL returns $s_k^n$ through a register and
$\gamma.\max Q$.]
FIGURE 4: Pseudo random number generator architecture. [Register holding $a_{k-1}^z$ and a
multiplier by $P_1$ producing $a_k^z$.]
FIGURE 5: Hardware implementation of the random number generator. [Combinational comparison
against the thresholds $c_1 = N^{o}(\max)/Z$, $c_2 = 2N^{o}(\max)/Z$, ...,
$c_{Z-1} = (Z-1)N^{o}(\max)/Z$.]
The update modules, called EN, are responsible for selecting which state-action pair
$(s_k^n, a_k^z)$ will be updated. Each n-th module, ENn, is a combinational logic block whose
inputs are the k-th state, $s_k^n$, and the action $a_k^z$, and whose output is a vector of Z
elements, here denoted as $\mathbf{u}_k^n$ and represented by
$$\mathbf{u}_k^n = \begin{bmatrix} u_k^{n,0} \\ u_k^{n,1} \\ \vdots \\ u_k^{n,Z-1} \end{bmatrix} \qquad (3)$$
where $u_k^{n,z}$ is a bit that, when at high logic level, indicates that the element of the Q
value function matrix for the n-th state and the z-th action must be updated. Considering the
outputs of all N modules, there are N × Z outputs, the same number of state-action pairs.
However, only one output from one of these modules will be at high logic level in each iteration.
The logical operation that determines $\mathbf{u}_k^n$ is expressed as
$$\mathbf{u}_k^n = \begin{cases} (1 \gg a_k^z) \,\|\, \mathbf{A} & \text{if } s_k^n = n \\ \mathbf{A} & \text{otherwise} \end{cases} \qquad (4)$$
where $\gg$ is the logical right shift operator and $\mathbf{A}$ is a vector of Z zeros.
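The behavior described by Equations 3 and 4 amounts to a one-hot enable that is active only in the module whose index matches the current state. A minimal Python sketch follows; the function `enable_vector` is hypothetical, and placing bit $u_k^{n,z}$ at list index z is an assumption about bit ordering that the text does not fix.

```python
def enable_vector(state, action, n, n_actions):
    """Output of the n-th EN module for the current state and drawn action.

    Returns a list of n_actions bits. All bits are zero unless the current
    state is n, in which case only the bit at the drawn action is set.
    """
    if state != n:
        return [0] * n_actions                       # the all-zeros vector A
    return [1 if z == action else 0 for z in range(n_actions)]


# Hypothetical case: N = 3 states, Z = 4 actions, current state 2, drawn action 1.
N, Z = 3, 4
outputs = [enable_vector(state=2, action=1, n=n, n_actions=Z) for n in range(N)]
for n, u in enumerate(outputs):
    print(n, u)

total_enabled = sum(sum(u) for u in outputs)         # bits set across all modules
```

Summing the bits over all N modules gives exactly one enabled register per iteration, which matches the statement that a single state-action pair is updated at a time.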
The values of the value function $Q_k(s_k^n, a_k^z)$ for each state-action pair are stored in
N × Z registers whose enable inputs are connected to the outputs of the EN modules. Therefore,
the value function is only updated when one of the elements of one of the $\mathbf{u}_k^n$
vectors is at high logic level.
D. RS - REWARD FUNCTION MODULE
The values of the immediate reinforcements, or rewards, are stored in the N modules called RS.
The reward function $\mathbf{r}^n$ is a vector of Z elements, represented as
$$\mathbf{r}^n = \begin{bmatrix} r^{n,0} \\ r^{n,1} \\ \vdots \\ r^{n,Z-1} \end{bmatrix}, \qquad (5)$$
which indicates the immediately promising actions in that state. Each n-th RS module has Z
constants, $r^{n,z}$, one for each of the Z actions of the n-th state. These constants express
the goal the agent wants to achieve. Each $r^{n,z}$ consists of a word of B bits. Actions leading
to the target state have a positive reinforcement value $r^{n,z}$. Undesired actions in the state
receive a negative reinforcement value $r^{n,z}$. The actions that lead to other states receive
$r^{n,z} = 0$.
E. SN - VALUE FUNCTION CALCULATION MODULE
The Q-learning hardware architecture, as observed in the main diagram shown in Figure 3, is
parallelized with respect to its states ($s_k^n$). The module for the n-th state, Sn, is
subdivided into two modules with different functions. Its configuration is illustrated in
Figure 6.
FIGURE 6: Sn module architecture. [Inputs $a_k^z$, $\mathbf{u}_k^n$, $\mathbf{r}^n$, $\alpha$ and
$\gamma.\max Q$; internal modules SFn and Qn; outputs $s_{k+1}^n$, $\max Q^n$ and
$\mathbf{Q}_k^n$.]
In the SFn module, the future state, $s_{k+1}^n$, is determined locally, from the drawn action
$a_k^z$ only. In the Qn module, the elements of the value function vector $\mathbf{Q}_k^n$ are
calculated, and the value function corresponding to the action with the highest value,
$\max Q^n$, is determined for the n-th state.
1) Qn - Value Function Calculation
Each n-th module Qn computes the vector
$$
Q_k^n =
\begin{bmatrix}
Q_k^{n,0} \\
Q_k^{n,1} \\
\vdots \\
Q_k^{n,Z-1}
\end{bmatrix}.
\qquad (6)
$$
$Q_k^n$ is a vector of Z elements, where each element $Q_k^{n,z}$ is
formed by B bits. The set of N vectors, $Q_k^n$, forms the value function
matrix $Q_k(s_k^n, a_k^z)$, which can be expressed as
$$
Q_k(s_k^n, a_k^z) =
\begin{bmatrix} Q_k^0 & Q_k^1 & \cdots & Q_k^{N-1} \end{bmatrix}
=
\begin{bmatrix}
Q_k^{0,0} & Q_k^{1,0} & \cdots & Q_k^{N-1,0} \\
Q_k^{0,1} & Q_k^{1,1} & \cdots & Q_k^{N-1,1} \\
\vdots & \vdots & \ddots & \vdots \\
Q_k^{0,Z-1} & Q_k^{1,Z-1} & \cdots & Q_k^{N-1,Z-1}
\end{bmatrix}.
\qquad (7)
$$
The inputs of this module are the enable vector $u_k^n$, the
reward function $r^n$, the learning coefficient α, and the Q value
of the highest-valued action in the future state, discounted
by γ (γ.maxQ). Part
of the internal architecture of the Sn modules is illustrated
in Figure 7; it is itself a parallel architecture, parallelized
with respect to the system actions. The Q matrix
is updated at each iteration. To do so, each of the N Sn modules
uses 2 × Z adders (SUM1nz and SUM2nz), Z subtractors (SUBnz), and
Z multipliers (MULTnz) to
implement Equation 1, the fundamental Q-learning
equation. Additionally, Z registers are used to store $Q_k^n$.
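The per-action datapath can be sketched as follows. This is an illustrative model only: it assumes that Equation 1 is the standard Q-learning update, Q ← Q + α(r + γ·maxQ − Q), and that the operators are chained in the order SUM1, SUB, MULT, SUM2, followed by a register with enable. The function names are hypothetical.

```python
def q_module_step(q_row, u_row, r_row, alpha, gamma_max_q):
    """One iteration of a hypothetical Qn module for a single state n.

    All Z actions are processed independently, as in the parallel datapath.
    """
    new_row = []
    for q, u, r in zip(q_row, u_row, r_row):
        target = r + gamma_max_q               # SUM1: reward plus discounted maxQ
        error = target - q                     # SUB: difference to the stored value
        candidate = q + alpha * error          # MULT by alpha, then SUM2
        new_row.append(candidate if u else q)  # REG: loads only when enabled
    return new_row


def comp(q_row):
    """Comparator COMPn: the largest of the Z values of state n (maxQn)."""
    return max(q_row)


# Example: state with rewards [-500, 100, 0, -500], action index 1 enabled.
q_next = q_module_step([0.0, 0.0, 0.0, 0.0], [0, 1, 0, 0],
                       [-500, 100, 0, -500], alpha=0.8, gamma_max_q=0.0)
print(q_next, comp(q_next))
```

In the sketch, disabled registers simply keep their previous contents, which mirrors the role of the enable vector in the architecture.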
In addition to calculating the value function, module Sn
also calculates the Q value of the highest-valued action
in the n-th state. This variable is
shown in Figure 7 as maxQn and is obtained by
comparing (COMPn) all Z elements of the vector $Q_k^n$ of
the n-th state.
2) SFn - Future State
A third functionality is also implemented in the Sn
module: determining what the future state $s_{k+1}^n$
would be for the action $a_k$ drawn by the GA module. The structure shown
in Figure 8 represents the part of the
Sn module, the internal module SFn, responsible for
this functionality. Since the architecture is parallelized, a
future state is determined in each of the N modules Sn, taking
into account only the action $a_k$ drawn. The SEL module
then decides which of the N future states $s_{k+1}^n$
continues in the algorithm and becomes the current state
in the next iteration.
Therefore, the Sn block delivers three pieces of information to the
system: the value function vector $Q_k^n$ for the actions of the n-th
state, the value function of the highest-valued action,
maxQn, and the n-th future state $s_{k+1}^n$
determined from the action taken.
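[Figure 7: block diagram of the Qn module. One datapath per action, each with adders SUM1 and SUM2, a subtractor SUB, a multiplier MULT and a register REG with enable input (SUM1n1 ... SUM1nZ, SUM2n1 ... SUM2nZ, SUBn1 ... SUBnZ, MULTn1 ... MULTnZ, REGn1 ... REGnZ). Inputs: $u_k^{n,0}, \ldots, u_k^{n,Z-1}$, $r^{n,0}, \ldots, r^{n,Z-1}$, α and γ.maxQ. Outputs: $Q_k^{n,0}, \ldots, Q_k^{n,Z-1}$ and, through the comparator COMPn, maxQn.]
FIGURE 7: Qn module architecture - function value calculation.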
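[Figure 8: block diagram of the SFn module. The multiplexer MUX1n receives the candidate future states $s_{k+1}^{n,0}, s_{k+1}^{n,1}, \ldots, s_{k+1}^{n,Z-1}$, uses $a_k$ as selector and outputs $s_{k+1}^n$.]
FIGURE 8: SFn module architecture - future state choice.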
F. SEL - FUTURE STATE SELECTION MODULE
The last module of the architecture is the SEL module. Its
structure is shown in Figure 9. It is the junction point of the
algorithm, where the information computed in parallel by the previous
modules meets. This module determines the
next state to be explored by the architecture. It is also
where the value of the highest-valued action in the future state, maxQ,
is determined and where the N vectors $Q_k^n$ are assembled to
construct the system value function matrix $Q_k(s_k^n, a_k^z)$.
To determine the future state $s_{k+1}$, the future states
$s_{k+1}^n$ from all N modules Sn are fed into a multiplexer, MUX2,
whose selector is the current state $s_k$. This future
state value is fed back to the beginning of the architecture
and becomes the current state in the next iteration.
To determine the value of the highest-valued action
in the future state, maxQ, the future state $s_{k+1}$ is used as
the selector of the multiplexer MUX3, whose inputs are the
N values maxQn, one from each of the N modules Sn.
The value of maxQ is multiplied by the
discount factor γ and fed back to the inputs of the Sn
modules for the calculation of the vectors $Q_k^n$ that make up
the value function.
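The two selections made in the SEL module can be illustrated with a short sketch. It assumes that MUX2 is addressed by the current state and MUX3 by the selected future state, as described above; the function name and the example values are hypothetical.

```python
def sel_module(current_state, next_state_candidates, max_q_per_state, gamma):
    """Hypothetical model of SEL: returns (s_{k+1}, gamma * maxQ).

    next_state_candidates[n] is the future state proposed by module Sn.
    max_q_per_state[n] is maxQn, the largest Q value of state n.
    """
    next_state = next_state_candidates[current_state]   # MUX2, selector s_k
    max_q = max_q_per_state[next_state]                  # MUX3, selector s_{k+1}
    return next_state, gamma * max_q                     # multiplication by gamma


# Example with N = 6: every module proposes a future state for the drawn action.
candidates = [3, 4, 5, 3, 4, 5]
max_q = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
s_next, gamma_max_q = sel_module(current_state=1, next_state_candidates=candidates,
                                 max_q_per_state=max_q, gamma=0.8)
print(s_next, gamma_max_q)
```

Only one of the N candidates is used in each iteration; the remaining candidates are computed in parallel and discarded.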
IV. RESULTS
In this section, simulation and hardware synthesis results for
the architecture proposed in this work are presented.
Simulations and syntheses were carried out for different scenarios,
in which the numbers of states and actions were varied. All
scenarios were simulated and synthesized for different bit
resolutions. The simulation results were used to validate the
hardware architecture and to evaluate the resolution error
caused by the number of bits. The synthesis results allowed the
system to be analyzed with respect to important parameters for the design of
hardware architectures, such as occupation rate and sampling
time.
The applicability of the proposed architecture to real
problems is also analyzed in this section. Applications found in
the literature that use the Q-learning algorithm to train
the agent were synthesized on FPGA in order to obtain the
throughput, Fs, and the time of convergence to the optimal
policy.
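[Figure 9: block diagram of the SEL module. MUX2 receives the future states $s_{k+1}^0, s_{k+1}^1, \ldots, s_{k+1}^{N-1}$, with $s_k$ as selector, and outputs $s_{k+1}$. MUX3 receives maxQ0, maxQ1, ..., maxQN−1, and its output is multiplied by γ to give γ.maxQ. The vectors $Q_k^0, Q_k^1, \ldots, Q_k^{N-1}$ form $Q_k(s_k^n, a_k^z)$.]
FIGURE 9: SEL module architecture.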
A. SIMULATION RESULTS
To simulate and validate the Q-learning architecture,
a scenario was analyzed in which a robot moves in an arena
with the aim of reaching a certain region of it. The
number of states in this problem represents the granularity
of the arena, while the actions represent the possible
movements of the robot. The arena was divided into six regions (states:
s1, s2, s3, s4, s5 and s6) and the robot had four possible
directions of movement (actions: a1 - up, a2 - down, a3 - left,
a4 - right). The problem is illustrated in Figure 10
and was simulated with fixed-point digital representation for
five different resolutions. The notation used is [n.b], where n
is the total number of bits, b bits represent the fractional part
and (n − b) bits represent the integer part.
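The [n.b] notation can be made concrete with a small quantizer. This is a sketch under stated assumptions: a signed two's-complement word, truncation toward minus infinity and saturation at the limits. The paper does not state the rounding or overflow mode in this section, and the function name is hypothetical.

```python
import math


def to_fixed(value: float, n: int, b: int) -> float:
    """Quantize value to an assumed signed fixed-point format [n.b].

    n is the total word length and b the number of fractional bits,
    so the resolution is 2**-b and the integer part has (n - b) bits.
    """
    scale = 1 << b                       # 2**b steps per unit
    lowest = -(1 << (n - 1))             # most negative raw integer
    highest = (1 << (n - 1)) - 1         # most positive raw integer
    raw = math.floor(value * scale)      # truncation (assumption)
    raw = max(lowest, min(highest, raw)) # saturation (assumption)
    return raw / scale


# The learning coefficient 0.8 in two of the formats used in the paper.
print(to_fixed(0.8, 24, 14))   # close to 0.8
print(to_fixed(0.8, 12, 2))    # coarse: resolution of 0.25
print(to_fixed(-500, 12, 2))   # the collision reward still fits in [12.2]
```

Under these assumptions the step size grows from 2^-14 to 2^-2 between the [24.14] and [12.2] formats, which is consistent with the larger resolution error reported for the shorter words.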
FIGURE 10: Example used for simulation and validation of
the hardware implementation.
The goal was for the agent to reach room 6 (s6) regardless
of the room in which it started. To define room 6 as the
goal, an immediate reinforcement $r^{n,z} = 100$ was associated
with the actions that lead directly to the desired region. These
actions are illustrated by blue arrows in Figure 10. If the
agent performed an action that resulted in a collision with the
edges of the arena, it received an immediate negative reward
($r^{n,z} = -500$). These actions are represented by red arrows.
In all other transitions, the reward received is zero ($r^{n,z} = 0$),
represented by white arrows. The r matrix
$$
r =
\begin{bmatrix}
-500 & 0 & -500 & 0 \\
-500 & 0 & 0 & 0 \\
-500 & 100 & 0 & -500 \\
0 & -500 & -500 & 0 \\
0 & -500 & 0 & 100 \\
0 & -500 & 0 & -500
\end{bmatrix}
\qquad (8)
$$
shows the rewards for all state-action pairs, where the states (s1,
s2, s3, s4, s5, s6) are represented by the rows of the array, while
the actions (a1, a2, a3, a4) are represented by the columns. Each
element of the r array is a word of [n.b] bits.
The numerical results of the value function $Q_k(s_k^n, a_k^z)$
obtained from the simulation of the developed architecture were compared
with the results of a Matlab floating-point
implementation (IEEE 754 standard). To simulate the example
described, a learning coefficient α = 0.8 and a discount
rate γ = 0.8 were used. These parameters
were used both in the Matlab floating-point simulation and
in the simulation of the parallel hardware architecture, performed
using System Generator.
The floating-point value function matrix in the IEEE 754
standard is shown below.
$$
Q_k(s_k^n, a_k^z) =
\begin{bmatrix}
-357.8 & 177.8 & -357.8 & 177.8 \\
-322.2 & 222.2 & 142.2 & 222.2 \\
-277.8 & 277.8 & 177.8 & -277.8 \\
142.2 & -322.2 & -322.2 & 222.2 \\
177.8 & -277.8 & 177.8 & 277.8 \\
222.2 & -322.2 & 222.2 & -322.2
\end{bmatrix}
\qquad (9)
$$
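The floating-point matrix in Equation 9 can be reproduced with a short software simulation. The sketch below rests on assumptions inferred from the reward matrix in Equation 8, not stated explicitly in the text: the arena is a 2 × 3 grid (s1, s2, s3 in the top row and s4, s5, s6 in the bottom row), a collision leaves the robot in place, actions are drawn uniformly at random, and the update is the standard Q-learning rule. All names are hypothetical.

```python
import random

ALPHA, GAMMA = 0.8, 0.8
ROWS, COLS = 2, 3                              # assumed arena layout
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right

# Reward matrix of Equation 8 (rows: s1..s6, columns: a1..a4).
R = [[-500, 0, -500, 0],
     [-500, 0, 0, 0],
     [-500, 100, 0, -500],
     [0, -500, -500, 0],
     [0, -500, 0, 100],
     [0, -500, 0, -500]]


def next_state(s: int, a: int) -> int:
    """Move on the grid; a collision keeps the robot in place (assumption)."""
    row, col = divmod(s, COLS)
    d_row, d_col = MOVES[a]
    n_row, n_col = row + d_row, col + d_col
    if 0 <= n_row < ROWS and 0 <= n_col < COLS:
        return n_row * COLS + n_col
    return s


def train(iterations: int = 50_000, seed: int = 0) -> list:
    rng = random.Random(seed)
    q = [[0.0] * 4 for _ in range(6)]
    s = 0
    for _ in range(iterations):
        a = rng.randrange(4)                   # random action draw
        s_next = next_state(s, a)
        q[s][a] += ALPHA * (R[s][a] + GAMMA * max(q[s_next]) - q[s][a])
        s = s_next                             # future state becomes current
    return q


Q = train()
for row in Q:
    print([round(v, 1) for v in row])
```

Under these assumptions the fixed point of the update satisfies x = 100 + 0.8 · 0.8 · x for the two rewarded actions, i.e., x = 100 / 0.36 ≈ 277.8, which matches the largest entries of Equation 9.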
The hardware architecture simulation results are shown in
Table 1. The hardware architecture was simulated with
fixed-point digital representation in four different scenarios. In the
first scenario, 24 bits were used, with 14 bits for the fractional part
(Table 1(a)). In the second, 20 bits, with 10 for the fractional
part (Table 1(b)). In the third, 16 bits were used, with 6 bits in the
fractional part (Table 1(c)). In the last scenario, 12
bits were used, with 2 bits in the fractional part (Table 1(d)).
The simulations show that the lower the number of
bits, the greater the resolution error (e) with respect to the
floating-point result. However, it is important to emphasize that the
resolution of the Q matrix is not critical as long as its
optimal policy is well defined. Figure 11 represents
the value function obtained in the floating-point simulation
as a color matrix in which the action values are coded
on a linear scale from −400 to 400. The smallest value of
the scale is represented by the darkest blue, while the highest
value is represented by the lightest shade of red. Figure 12
represents, as color matrices, the value functions
obtained from the fixed-point architecture simulation, using
the same scale as reference. It shows
that, despite the resolution error associated with the decrease
in the number of bits, the policy is well characterized. Even
in the worst simulated case, with only 12 bits of resolution,
where an error greater than 30% was obtained, the best and
worst actions, despite having colors different from those of the reference
matrix, are well defined and yield the same policy as
the floating-point case and the other
simulated resolutions.
 
 
[Figure 11: color matrix with vertical axis ticks 1 to 6 and horizontal axis ticks 0.5 to 4.5.]
FIGURE 11: Color matrix for function value obtained at
floating point representation.
The resolution error can always be improved by increasing
the number of bits. However, as will be detailed in the
following sections, this directly implies the increase of the
occupied area in the FPGA and increase in the processing
time.
[Figure 12: four color matrices with vertical axis ticks 1 to 6 and horizontal axis ticks 0.5 to 4.5. Panels: (a) [24.14] bits; (b) [20.10] bits; (c) [16.06] bits; (d) [10.00] bits.]
FIGURE 12: Color matrices for value functions obtained at
fixed-point representation.
B. SYNTHESIS RESULTS
For the synthesis analysis of the hardware architecture, in addition
to the scenario presented in Figure 10, nine other distinct
scenarios were analyzed, with different numbers of states
and actions. The scenarios again use the problem described in
Section IV-A of a robot that moves in
an arena. The greater the granularity of the arena, the greater
the number of states. Six scenarios were defined with four
possible actions (a1 - up, a2 - down, a3 - left, a4 - right), where
the agent moves one position in the direction indicated
by the action. The other four scenarios were defined with
eight possible actions (a1 - up, a2 - down, a3 - left, a4 -
right, a5 - up2, a6 - down2, a7 - left2, a8 - right2), where the
behavior of the agent for the first four actions is the same as in
the previous scenarios, while for the other actions it moves
two positions in the direction indicated by the action. The sizes
of all the examples implemented and simulated are presented in
Table 2, where N represents the number of states and Z the
number of actions. The parameters used in the tests of these
10 scenarios are shown in Table 3. All results were obtained
for the Xilinx Virtex-6 FPGA [15].
Figure 13 shows the hardware setup for the experiments.
The Virtex-6 FPGA ML605 Evaluation Kit by
Xilinx [15], [32] was used. The architecture was developed using
structural modeling with the System Generator for DSP™
[14]. System Generator is the architecture-level design
tool used to create high-performance algorithms on Xilinx
devices using Matlab/Simulink (license number 1080073)
[33] together with the Xilinx Vivado Design Suite (license
number 505318) [34].
FIGURE 13: Hardware setup for the experiments with the
Xilinx Virtex-6 FPGA ML605 Evaluation Kit.
TABLE 1: Value function for different binary representations.
(a) [24.14] bits and e = 0.01%
$$
Q =
\begin{bmatrix}
-357.8 & 177.8 & -357.8 & 177.8 \\
-322.2 & 222.2 & 142.2 & 222.2 \\
-277.8 & 277.8 & 177.8 & -277.8 \\
142.2 & -322.2 & -322.2 & 222.2 \\
177.8 & -277.8 & 177.8 & 277.8 \\
222.2 & -322.2 & 222.2 & -322.2
\end{bmatrix}
$$
(b) [20.10] bits and e = 0.17%
$$
Q =
\begin{bmatrix}
-358.0 & 177.5 & -358.0 & 177.5 \\
-322.5 & 222.0 & 142.0 & 222.0 \\
-277.0 & 277.5 & 177.5 & -278.0 \\
142.0 & -322.5 & -322.5 & 222.2 \\
177.5 & -278.0 & 177.5 & 277.5 \\
222.0 & -322.5 & 222.0 & -322.5
\end{bmatrix}
$$
(c) [16.6] bits and e = 2.61%
$$
Q =
\begin{bmatrix}
-361.5 & 173.9 & -361.5 & 173.9 \\
-326.1 & 218.2 & 138.5 & 218.2 \\
-281.8 & 273.9 & 173.9 & -281.2 \\
138.5 & -326.1 & -326.1 & 218.2 \\
173.9 & -281.8 & 173.9 & 273.9 \\
218.2 & -326.1 & 218.2 & -326.1
\end{bmatrix}
$$
(d) [12.2] bits and e = 33.20%
$$
Q =
\begin{bmatrix}
-405.0 & 127.2 & -405.0 & 127.2 \\
-372.7 & 170.0 & 95.0 & 170.0 \\
-330.0 & 227.2 & 127.2 & -330.0 \\
95.0 & -372.7 & -372.7 & 170.0 \\
127.2 & -330.0 & 127.2 & 227.2 \\
170.0 & -372.7 & 170.0 & -372.7
\end{bmatrix}
$$
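The definition of the resolution error e is not given in this section. One reading that approximately reproduces the reported figures is the largest element-wise relative error with respect to the floating-point matrix; the sketch below applies that assumed definition to Table 1(d). The function name is hypothetical.

```python
# Floating-point reference (Equation 9) and the [12.2] result of Table 1(d).
Q_FLOAT = [[-357.8, 177.8, -357.8, 177.8],
           [-322.2, 222.2, 142.2, 222.2],
           [-277.8, 277.8, 177.8, -277.8],
           [142.2, -322.2, -322.2, 222.2],
           [177.8, -277.8, 177.8, 277.8],
           [222.2, -322.2, 222.2, -322.2]]

Q_FIXED_12_2 = [[-405.0, 127.2, -405.0, 127.2],
                [-372.7, 170.0, 95.0, 170.0],
                [-330.0, 227.2, 127.2, -330.0],
                [95.0, -372.7, -372.7, 170.0],
                [127.2, -330.0, 127.2, 227.2],
                [170.0, -372.7, 170.0, -372.7]]


def max_relative_error(reference, approx) -> float:
    """Largest |ref - approx| / |ref| over all matrix elements (assumed metric)."""
    return max(abs(r - a) / abs(r)
               for ref_row, app_row in zip(reference, approx)
               for r, a in zip(ref_row, app_row))


e_percent = 100 * max_relative_error(Q_FLOAT, Q_FIXED_12_2)
print(f"e = {e_percent:.2f}%")
```

With the rounded table values this gives roughly 33.2%, driven by the entry 142.2 that becomes 95.0. The greedy action of every row is unchanged, which is the point made in the text about the policy being preserved.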
TABLE 2: Synthesized cases.
Case   I   II   III   IV   V    VI   VII   VIII   IX   X
N      6   12   12    20   20   30   30    56     56   132
Z      4   4    8     4    8    4    8     4      8    4
TABLE 3: Synthesis parameters.
Parameters                 Values
Number of States           N
Number of Actions          Z
Learning Coefficient (α)   0.8
Discount rate (γ)          0.8
Digital Representation     Fixed-point
Number of Bits             [n.b]
All scenarios were synthesized with all variables in fixed
point. Tables 4 - 13 show the results obtained in terms of both
occupancy rate and throughput, Fs, in Mega-Samples
per second (MSps) or Mega-Iterations per second
(MIps). The first column indicates the bit resolution
synthesized for the variables $r^n$ and $Q_k^n$; the other variables
have fixed resolution. For this synthesis analysis, four
different representations were implemented.
TABLE 4: Hardware synthesis - Scenario I (N = 6, Z = 4).
[n.b] bits   Multipliers   Registers (flip-flops)   LUTs            Fs (MSps)   Power (mW)
[24.14]      34 (4%)       788 (< 1%)               2597 (2%)       23.02       5
[20.10]      34 (4%)       669 (< 1%)               2159 (1%)       21.80       4
[16.06]      34 (4%)       548 (< 1%)               1734 (1%)       23.54       3
[10.00]      34 (4%)       367 (< 1%)               1086 (< 1%)     26.42       3
TABLE 5: Hardware synthesis - Scenario II (N = 12, Z = 4).
[n.b] bits   Multipliers   Registers (flip-flops)   LUTs            Fs (MSps)   Power (mW)
[24.14]      106 (13%)     1509 (< 1%)              5070 (3%)       20.83       8
[20.10]      58 (7%)       1270 (< 1%)              4222 (3%)       22.23       6
[16.06]      58 (7%)       1029 (< 1%)              3387 (2%)       22.27       6
[10.00]      58 (7%)       668 (< 1%)               2133 (1%)       24.76       6
TABLE 6: Hardware synthesis - Scenario III (N = 12, Z = 8).
[n.b] bits   Multipliers   Registers (flip-flops)   LUTs            Fs (MSps)   Power (mW)
[24.14]      202 (26%)     2661 (< 1%)              10379 (7%)      18.08       15
[20.10]      106 (13%)     2230 (< 1%)              8639 (5%)       20.71       15
[16.06]      106 (13%)     1797 (< 1%)              6960 (4%)       20.67       10
[10.00]      106 (13%)     1148 (< 1%)              4415 (3%)       23.45       10
TABLE 7: Hardware synthesis - Scenario IV (N = 20, Z = 4).
[n.b] bits   Multipliers   Registers (flip-flops)   LUTs            Fs (MSps)   Power (mW)
[24.14]      170 (22%)     2470 (< 1%)              8349 (5%)       17.82       14
[20.10]      90 (11%)      2071 (< 1%)              6966 (4%)       19.87       9
[16.06]      90 (11%)      1670 (< 1%)              5594 (3%)       20.16       10
[10.00]      90 (11%)      1069 (< 1%)              3541 (2%)       21.23       9
TABLE 8: Hardware synthesis - Scenario V (N = 20, Z = 8).
[n.b] bits   Multipliers   Registers (flip-flops)   LUTs            Fs (MSps)   Power (mW)
[24.14]      330 (42%)     4390 (1%)                17188 (11%)     16.74       23
[20.10]      170 (22%)     3671 (1%)                14373 (9%)      15.96       20
[16.06]      170 (22%)     2950 (< 1%)              11561 (7%)      17.66       21
[10.00]      170 (22%)     1869 (< 1%)              7321 (4%)       19.79       20
C. ANALYSIS OF HARDWARE OCCUPATION RESULTS
In Tables 4 - 13, the second column displays
the number of multipliers used. Multipliers are used
in the PRNG (GA module), in the S modules
where the value function is calculated, specifically for the
multiplication by the learning coefficient (α) (MULTnz), and
in the SEL module for the multiplication of the
highest future reinforcement value (maxQ) by the factor γ.
An increase in the number of multipliers is observed, in all
scenarios, for configurations wider than 20 bits. This is due
to the size of the hardware multipliers built into the
FPGA used, which is 48 bits. Therefore, an arithmetic operation
that previously used a single multiplier now needs more than one unit.
All the multiplications carried out in the first eight scenarios
(I-VIII) were implemented with the embedded multipliers
(DSP48E1) of the Virtex-6 FPGA. In scenarios IX and
X, owing to the greater complexity of the system, some multiplications
were implemented in logic cells (lookup
tables, LUTs), as the chosen FPGA did not have enough
embedded multipliers to implement these scenarios when
synthesized at the higher resolutions.
The third column displays the number of registers used in the
implementation. The area occupied by registers is due to
the storage of the value function $Q_k(s_k^n, a_k^z)$ (REGnz), which
is calculated for each state-action pair. Registers are
also used to store the value function of the highest-valued
action, maxQn, for each state,
and to hold the future state ($s_{k+1}$) during an iteration before
it is fed back to the beginning of the architecture and becomes
the current state ($s_k$) (REG1).
The fourth column shows the number of logic cells used.
The occupation of logic cells is related to the arithmetic
operations implemented in the N S blocks to
calculate the value function Q. In the more complex
scenarios, IX and X, this occupation is also directly related to
the use of LUTs to carry out multiplications
in place of the embedded multipliers.
TABLE 9: Hardware synthesis - Scenario VI (N = 30, Z = 4).
[n.b] bits   Multipliers   Registers (flip-flops)   LUTs            Fs (MSps)   Power (mW)
[24.14]      250 (32%)     3670 (1%)                12379 (8%)      16.57       19
[20.10]      130 (16%)     3071 (1%)                10299 (6%)      20.19       13
[16.06]      130 (16%)     2470 (< 1%)              8272 (5%)       19.67       13
[10.00]      130 (16%)     1569 (< 1%)              5206 (3%)       21.29       13
TABLE 10: Hardware synthesis - Scenario VII (N = 30, Z = 8).
[n.b] bits   Multipliers   Registers (flip-flops)   LUTs            Fs (MSps)   Power (mW)
[24.14]      490 (63%)     6550 (2%)                25627 (17%)     15.53       32
[20.10]      250 (32%)     5471 (2%)                21406 (14%)     16.48       25
[16.06]      250 (32%)     4390 (1%)                17208 (11%)     16.45       26
[10.00]      250 (32%)     2769 (< 1%)              10915 (7%)      16.29       28
Figures 14, 15 and 16 illustrate how the occupation varies
with the number of states and actions for the scenarios already
characterized. Figures 14(a), 15(a) and 16(a) illustrate the occupation
for the highest synthesized resolution, [24.14] bits. Figures
14(b), 15(b) and 16(b) were constructed from the
lowest-resolution data, [10.00] bits. The
occupied area increases almost linearly with the number of
state-action pairs. The discontinuities in Figures 14
and 16 appear when some embedded
multipliers are replaced by LUTs in the scenarios of greater complexity,
which increases the proportional number of LUTs
and decreases the number of multipliers. When comparing
the occupied area for the highest resolution with
that for the lowest resolution, the
behavior is practically the same for the area occupied
by multipliers (Figure 14) and by registers (Figure 15).
For LUTs, however, this is not the case:
the growth in LUTs is much higher when the number
of bits is increased and LUTs are used for multiplications
in place of the embedded multipliers.
The occupied area is determined both by
the number of bits and by the complexity of the problem,
i.e., the higher the resolution and the
more state-action pairs (n, z), the more space is occupied in the
FPGA. Where it is necessary to reduce the FPGA
occupancy, the resolution can be lowered because, as
demonstrated in Section IV-A (Figure 12), the optimal policy
can still be obtained despite the associated
resolution error.
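The near-linear growth of the occupation can be checked directly on the reported multiplier counts. The sketch below uses the [10.00]-bit rows of Tables 4 - 11 (scenarios I - VIII, where only embedded multipliers are used). The constant offset it finds is an observation about the reported numbers, not a figure stated by the authors.

```python
# (N, Z, multipliers) at [10.00] bits, copied from Tables 4 - 11.
LOW_RESOLUTION_ROWS = [
    (6, 4, 34), (12, 4, 58), (12, 8, 106), (20, 4, 90),
    (20, 8, 170), (30, 4, 130), (30, 8, 250), (56, 4, 234),
]

# Difference between the multiplier count and the number of state-action pairs.
offsets = {multipliers - n * z for n, z, multipliers in LOW_RESOLUTION_ROWS}
print(offsets)
```

For these eight scenarios the count equals N × Z plus a constant, which is consistent with one multiplier per state-action pair (MULTnz) plus a fixed number of multipliers elsewhere in the design.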
TABLE 11: Hardware synthesis - Scenario VIII (N = 56, Z = 4).
[n.b] bits   Multipliers   Registers (flip-flops)   LUTs            Fs (MSps)   Power (mW)
[24.14]      490 (59%)     6792 (2%)                23117 (15%)     15.90       55
[20.10]      234 (30%)     5673 (2%)                19252 (14%)     15.59       28
[16.06]      234 (30%)     4552 (1%)                15526 (10%)     15.15       29
[10.00]      234 (30%)     2875 (< 1%)              9808 (6%)       20.20       28
TABLE 12: Hardware synthesis - Scenario IX (N = 56, Z = 8).
[n.b] bits   Multipliers   Registers (flip-flops)   LUTs            Fs (MSps)   Power (mW)
[24.14]      746 (97%)     12248 (4%)               90917 (60%)     13.75       62
[20.10]      378 (49%)     10153 (3%)               55573 (36%)     14.89       54
[16.06]      378 (49%)     8136 (2%)                48536 (32%)     14.41       47
[10.00]      378 (49%)     5115 (2%)                28849 (19%)     17.88       47
TABLE 13: Hardware synthesis - Scenario X (N = 132, Z = 4).
[n.b] bits   Multipliers   Registers (flip-flops)   LUTs            Fs (MSps)   Power (mW)
[24.14]      730 (95%)     15958 (5%)               142778 (94%)    12.23       91
[20.10]      370 (48%)     13175 (4%)               77574 (51%)     13.40       78
[16.06]      370 (48%)     10533 (3%)               70311 (46%)     13.35       67
[10.00]      370 (48%)     6626 (2%)                40764 (27%)     14.00       60
[Figure 14: number of multipliers versus states (N) and actions (Z). Panels: (a) [24.14] bits; (b) [10.00] bits.]
FIGURE 14: Occupied area in multipliers for different scenarios.
[Figure 15: number of registers (flip-flops) versus states (N) and actions (Z). Panels: (a) [24.14] bits; (b) [10.00] bits.]
FIGURE 15: Occupied area in registers for different scenarios.
D. ANALYSIS OF SAMPLING RESULTS
In Tables 4 - 13, the throughput, Fs, is given in the
second-to-last column. Since the system was developed so
as to parallelize the data stream as much as possible, the
sampling rate does not vary significantly, and a very
high throughput, Fs, on the order of 10 MSps is maintained.
From the sampling data, the maximum
throughput decreases as the number of
states of the problem increases. The
throughput also decreases as the number of
possible actions per state increases. This is explained by the increase
in the complexity of the problem, which results in a greater
amount of data to be processed. Figures 17(a) and 17(b)
present the throughput for different numbers of states and actions
with [24.14] and [10.00] bits, respectively. As the system
was developed so as to parallelize the data flow as much as
possible, this variation is not very significant, since even
when the number of state-action pairs increases, the path traveled
during the processing of the information is not significantly
modified. A small reduction in the throughput, Fs, occurs
because, as the number of actions per state increases, more
value function values Q(s, a) must be compared
(COMPn) in order to determine the
maximum of the value function, maxQ($s_{t+1}$, a), for the best action
in the n-th state. As the number of states increases,
the number of multiplexer inputs that select the next
state (MUX1n) in the modules Sn also grows. The same
occurs in the multiplexers that select the next state ($s_{t+1}$)
(MUX2) and the maximum value function of the future state,
maxQ($s_{t+1}$, a) (MUX3). These are the main bottlenecks for
the complete parallelization of the system and the factors that
reduce the sampling rate as the complexity
of the problem increases. Despite these bottlenecks,
between the worst case (N = 132, Z = 4) and the best case
(N = 6, Z = 4) the variation of the sampling period is less
than 50 ns.
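The sampling-period comparison can be verified from the throughputs in Tables 4 and 13, since the period is the inverse of the throughput. The following sketch compares the two scenarios at the same resolution; the function name is hypothetical.

```python
def period_ns(fs_msps: float) -> float:
    """Sampling period in nanoseconds for a throughput given in MSps."""
    return 1e3 / fs_msps


# (best case N = 6, Z = 4; worst case N = 132, Z = 4) throughputs in MSps.
THROUGHPUTS = {
    "[24.14]": (23.02, 12.23),
    "[10.00]": (26.42, 14.00),
}

variation_ns = {bits: period_ns(worst) - period_ns(best)
                for bits, (best, worst) in THROUGHPUTS.items()}
for bits, delta in variation_ns.items():
    print(f"{bits}: sampling period grows by {delta:.1f} ns")
```

At both resolutions the period grows by a few tens of nanoseconds while the Q matrix grows from 24 to 528 elements.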
E. POWER CONSUMPTION ANALYSIS
Tables 4 - 13 show the dynamic power (last column) for
each scenario and for the four bit representations
(first column). Figure 18 shows the dynamic power as
a function of the number of bits for all scenarios (see Table
2); the power consumption
increases with the number of bits and with the size of the matrix Q
(N × Z) in all scenarios. However, the dynamic power
consumption is small, on the order of
90 mW for the scenario with the greatest complexity and the highest
resolution (N = 132, Z = 4).
Figure 19 shows the dynamic power consumption as a
function of the size of the matrix Q (N × Z), together with
the throughput (color bar) and the number of bits (circle
size). All scenarios were plotted at four different
resolutions (numbers of bits): [10.00],
[16.06], [20.10] and [24.14]. There are therefore always four circles
for each size of the matrix Q (there are ten different sizes
of the matrix Q).
Analyzing Figure 19, it can be inferred that increasing
the Q matrix size, N × Z, reduces the throughput, Fs, and
increases the power consumption. Another critical point
is the nonlinear power growth shown in Figure 19. The
number of bits has a minor impact on power consumption for
small Q matrix sizes (N × Z < 200); however, it has a strong
nonlinear impact for large Q matrix sizes (N × Z > 200).
An increase in the number of states (N), or in the
number of actions (Z) for the same
number of states, implies an increase in power consumption
and a reduction of the throughput (Fs). That is, the power
consumption and the throughput (Fs) are determined by the
complexity of the problem.
[Figure 16: number of LUTs (logical cells, ×10^4) versus states (N) and actions (Z). Panels: (a) [24.14] bits; (b) [10.00] bits.]
FIGURE 16: Occupied area in LUTs for different scenarios.
[Figure 17: throughput (Fs) in MSps versus states (N) and actions (Z). Panels: (a) [24.14] bits; (b) [10.00] bits.]
FIGURE 17: Throughput (Fs) in MSps for different scenarios.
F. REAL-WORLD EXPERIMENTS
Table 14 shows some practical applications of Q-learning
found in the literature. The first column contains the
reference. The second and third columns give the
dimensions of the problem (number of states and actions).
The fourth column gives the number of iterations
needed for a problem of those dimensions to
converge to the optimal value function if implemented in the
Q-learning hardware architecture proposed in this work.
The fifth column gives the sampling rate that the
problem could reach if Q-learning were implemented in the
proposed architecture. In the last column, the convergence time
for the optimal policy of each problem is estimated from the number
of iterations and the maximum throughput. For a problem like the one presented
in [30], which has the same order of complexity as scenario
VIII, Q-learning can be executed with a sampling
rate of approximately 15 MSps. As seen in Section II, the
Q matrix takes 8000 iterations to converge,
which means that the optimal policy would be available to
the system after 0.5 ms. The problem presented in [29],
which has the same number of states and actions as scenario
III and whose value function matrix converges in 5500
iterations, would need 0.2 ms. The problem presented
in [2], with 2500 iterations, would take 0.1 ms. If there
is a time restriction on acquiring the data needed for the
calculation of the policy, the system clock can also be
reduced so that, as demonstrated in [11], the system
consumption is reduced and the execution time is adjusted
for the data acquisition.
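The convergence times quoted above follow from dividing the number of iterations by the throughput. A minimal sketch of this arithmetic, using the approximate values listed in Table 14 (the function name is hypothetical):

```python
def convergence_time_ms(iterations: int, fs_msps: float) -> float:
    """Time in milliseconds to run the given iterations at fs_msps MSps."""
    return iterations / (fs_msps * 1e6) * 1e3


# reference: (iterations, throughput in MSps), as listed in Table 14.
CASES = {"[30]": (8000, 15.0), "[29]": (5500, 23.0), "[2]": (2500, 25.0)}

times_ms = {ref: convergence_time_ms(it, fs) for ref, (it, fs) in CASES.items()}
for ref, t in times_ms.items():
    print(f"{ref}: about {t:.2f} ms")
```

The results, roughly 0.53 ms, 0.24 ms and 0.10 ms, agree with the rounded values of 0.5 ms, 0.2 ms and 0.1 ms given in the text.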
G. COMPARISON WITH OTHER PLATFORMS
A parallel implementation for a PVM platform was proposed
in [25]. A two-dimensional maze problem with 135 states and
4 actions per state was studied. The maximum number of
iterations (or episodes) used was 32000, and the times for
1, 2, 4 and 8 processors were 4.32 s (throughput of about 7400
samples or iterations per second), 3.4 s (throughput of about
9411 samples or iterations per second), 1.85 s (throughput of
about 17297 samples or iterations per second) and 2.65 s
(throughput of about 12075 samples or iterations per second),
respectively. The hardware proposed in this work has a
throughput of about 12.23 MSps (or 12.23 Mega-iterations per
second) in the scenario with N = 132 states and Z = 4 actions,
which is equivalent to 2.61 ms for 32000 iterations. For this
case, the hardware implementation proposed here reaches a
speedup of about 1655×, 1302×, 708× and 1015× over the 1, 2, 4
and 8 processors used in [25], respectively.
An FPGA implementation of Q-learning is presented in [19]
and [22], where a semi-parallel approach is proposed. For a
scenario where the Q matrix has about 240 elements (N × Z),
the works [19] and [22] achieved a throughput of about
2.34 MSps (or 2.34 Mega-iterations per second). The
throughput achieved by the present implementation, on the other hand,
was 15.53 MSps (or 15.53 Mega-iterations per second), i.e.,
a speedup of about 6.63×. Another FPGA implementation
of Q-learning is presented in [21] for a scenario
with N = 27 states and Z = 5 actions (Q matrix with
135 elements). In this case, a throughput of
about 25000 Sps (or 25000 iterations per second) was obtained for the best
example. In the similar scenario with N = 20 and Z = 8
(Q matrix with 160 elements), the architecture proposed here
achieved a throughput of about 16.74 MSps (or 16.74
Mega-iterations per second), i.e., a speedup of about 669.6×.
Table 15 shows the speedups achieved by the approach
proposed in this work in comparison with the
schemes presented in [19], [21], [22], [25].
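The comparison figures can be recomputed from the quoted execution times and throughputs. The sketch below uses the 2.61 ms FPGA time stated in the text for 32000 iterations; variable names are hypothetical.

```python
ITERATIONS = 32_000
FPGA_TIME_S = 2.61e-3                      # reported time at 12.23 MSps

# Execution times reported for the PVM platform in [25], by processor count.
PVM_TIMES_S = {1: 4.32, 2: 3.4, 4: 1.85, 8: 2.65}

pvm_throughput = {p: ITERATIONS / t for p, t in PVM_TIMES_S.items()}
pvm_speedup = {p: t / FPGA_TIME_S for p, t in PVM_TIMES_S.items()}

# FPGA-to-FPGA comparisons are ratios of throughputs (in samples per second).
speedup_19_22 = 15.53e6 / 2.34e6
speedup_21 = 16.74e6 / 25_000

for p in PVM_TIMES_S:
    print(f"{p} processor(s): {pvm_throughput[p]:.0f} Sps, speedup {pvm_speedup[p]:.0f}x")
print(f"[19], [22]: {speedup_19_22:.2f}x   [21]: {speedup_21:.1f}x")
```

The 4-processor case gives about 17297 iterations per second (32000 / 1.85 s), which is consistent with the 708× speedup reported for that configuration.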
V. CONCLUSION
This work illustrated a Hardware parallel architecture pro-
posal of the Q-learning technique on FPGA. The state of
the art of implementations of machine learning techniques in
hardware was presented, giving emphasis to RL techniques.
Applications using use Q-learning as a technique were listed,
thus illustrating the main motivations for the development of
this work. The choice of FPGA was justified due to its high
performance, low operation frequency and to have several
processing cores.
Q-learning is a reinforcement learning technique that has
as its main advantage the possibility of obtaining an op-
timal policy interacting with the environment without any
knowledge about the system model needed. Developing this
technique in hardware allows shortening the system pro-
cessing time. The FPGA was used due to the possibility of
VOLUME 4, 2016 15
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS
[24.14] [20.10] [16.06] [10.00]
Number of bits
0
10
20
30
40
50
60
70
80
90
100
P
ow
er
(m
W
)
N = 6, Z = 4
N = 12, Z = 4
N = 12, Z = 8
N = 20, Z = 4
N = 20, Z = 8
N = 30, Z = 4
N = 30, Z = 8
N = 56, Z = 4
N = 56, Z = 8
N = 132, Z = 4
FIGURE 18: The dynamic power as a function of the number of bits for all scenarios (see Table 2).
TABLE 14: Convergence time and sampling Rate of literature applications using the Q-learning hardware architecture.
Reference
Number of
states
Number of
actions
Number of
iterations
Sampling
throughput (Fs)
Convergence
time
[30] N = 41 Z = 3 ∼ 8000 ∼ 15MSps ∼ 0.5ms
[29] N = 12 Z = 8 ∼ 5500 ∼ 23MSps ∼ 0.2ms
[2] N = 18 Z = 2 ∼ 2500 ∼ 25MSps ∼ 0.1ms
TABLE 15: Comparative results regarding the speedup with other platforms.
Reference  | Hardware           | Q matrix size (N × Z) | Reference throughput | Obtained throughput | Speedup
[25]       | CPU - 1 processor  | 528                   | 7400 Sps             | 12.23 MSps          | 1655×
[25]       | CPU - 2 processors | 528                   | 9411 Sps             | 12.23 MSps          | 1302×
[25]       | CPU - 4 processors | 528                   | 172971 Sps           | 12.23 MSps          | 708×
[25]       | CPU - 8 processors | 528                   | 12075 Sps            | 12.23 MSps          | 1015×
[19], [22] | FPGA               | 240                   | 2.34 MSps            | 15.53 MSps          | 6.63×
[21]       | FPGA               | 160                   | 25000 Sps            | 16.74 MSps          | 669.6×
rapid prototyping, flexibility, parallelism, and low power
consumption, which are some of the main advantages of FPGAs. Details
of the implementation of the hardware architecture were described,
including the individual system modules and the hardware mechanisms
used to implement the Q-learning algorithm. A generic architecture
of N states and Z actions was proposed; it processes the data in a
parallel and distributed manner so that the processing time of
Q-learning is optimized.
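For readers who want a concrete reference point, the following is a minimal sequential software sketch of tabular Q-learning with an N × Z table. It is not the parallel hardware architecture proposed in this work: the toy chain environment, the uniform random exploration, and all names and parameter values are assumptions chosen only to keep the example small and reproducible.

```python
import random


def make_chain_env(n_states):
    """Hypothetical toy environment: action 1 moves right, action 0 moves left.

    Reaching the last state pays a reward of 1 and ends the episode.
    """
    def step(state, action):
        if action == 1:
            next_state = min(state + 1, n_states - 1)
        else:
            next_state = max(state - 1, 0)
        done = next_state == n_states - 1
        return next_state, (1.0 if done else 0.0), done
    return step


def q_learning(n_states, n_actions, step, episodes=500, max_steps=100,
               alpha=0.5, gamma=0.9, seed=0):
    """Learn an N x Z table of Q values from observed transitions only."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Uniform random exploration; Q-learning is off-policy, so the
            # learned values still target the greedy policy.
            action = rng.randrange(n_actions)
            next_state, reward, done = step(state, action)
            # No model of the environment is used, only the sampled outcome.
            best_next = 0.0 if done else max(q[next_state])
            td_error = reward + gamma * best_next - q[state][action]
            q[state][action] += alpha * td_error
            state = next_state
            if done:
                break
    return q


def greedy_policy(q):
    """Pick the action with the largest Q value in every state."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


if __name__ == "__main__":
    N, Z = 6, 2  # illustrative sizes, not one of the paper's scenarios
    q_table = q_learning(N, Z, make_chain_env(N))
    # The last state is terminal, so its table row is never updated.
    print(greedy_policy(q_table)[:-1])
```

In this sketch the table entries are updated one at a time in a loop, which is the part a parallel implementation would distribute across hardware modules.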
A detailed analysis of the implementation was conducted, in addition
to the analysis of simulation and synthesis data. From the simulation
data, the architecture was validated. The impact of the resolution
error on obtaining the optimal policy was also investigated. It was
observed that the optimal policy can be obtained even at low bit
resolutions, which imply a smaller occupied area. The analysis of the
synthesis data made it possible to verify the behavior of the system
regarding important parameters, such as occupancy rate and sampling
time. The FPGA synthesis results obtained in this work show that
developing this algorithm directly in hardware makes it possible to
reach high performance, especially in terms of processing time and/or
low power consumption, when compared with software counterparts.
These characteristics allow its use in more complex practical
situations and in diverse types of applications, mainly
time-constrained ones such as real-time applications and
telecommunications applications where the data flow needs to be
handled very quickly, or in applications where there are power
restrictions and low power consumption is required.
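The effect of bit resolution on the policy can be illustrated with a small sketch that quantizes a Q table to each fixed-point format and checks whether the greedy action per state is unchanged. The reading of the labels as [T.W], with T total bits (signed) and W fractional bits, is an assumption here, and the Q table values are hypothetical and not taken from the experiments.

```python
# Assumed reading of the labels: (total bits including sign, fractional bits).
FORMATS = [(24, 14), (20, 10), (16, 6), (10, 0)]

# Hypothetical Q table (3 states x 4 actions), used only for illustration.
Q_FLOAT = [
    [12.40, 30.70, 5.10, 18.90],
    [-3.25, 7.80, 9.95, 2.00],
    [101.50, 99.20, 100.40, 87.30],
]


def quantize(value, total_bits, frac_bits):
    """Round to a signed fixed-point grid and saturate at the range limits."""
    scale = 1 << frac_bits
    max_int = (1 << (total_bits - 1)) - 1
    min_int = -(1 << (total_bits - 1))
    raw = max(min_int, min(max_int, round(value * scale)))
    return raw / scale


def quantize_table(q, total_bits, frac_bits):
    return [[quantize(v, total_bits, frac_bits) for v in row] for row in q]


def greedy_policy(q):
    """Pick the action with the largest Q value in every state."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


def max_abs_error(q, q_fixed):
    return max(abs(a - b) for row, row_fixed in zip(q, q_fixed)
               for a, b in zip(row, row_fixed))


if __name__ == "__main__":
    reference = greedy_policy(Q_FLOAT)
    for total_bits, frac_bits in FORMATS:
        q_fixed = quantize_table(Q_FLOAT, total_bits, frac_bits)
        same = greedy_policy(q_fixed) == reference
        error = max_abs_error(Q_FLOAT, q_fixed)
        print(f"[{total_bits}.{frac_bits:02d}] policy preserved: {same}, "
              f"max error: {error:.6f}")
```

Whether the policy survives quantization depends on the gap between the best and second-best Q values in each state relative to the quantization step, so a table with closer values could behave differently from this example.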
FUNDING
This study was financed in part by the Coordenação de
Aperfeiçoamento de Pessoal de Nível Superior (CAPES) -
Finance Code 001.
ACKNOWLEDGMENTS
The authors wish to acknowledge the financial support of the
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).
[Figure 19: scatter plot of power (mW, axis from 0 to 100) against the size of the Q matrix (N × Z, axis from 0 to 550); the circle size encodes the number of bits ([24.14], [20.10], [16.06], [10.00]) and the color bar encodes the throughput (Fs) in MSps, from 11.00 to 28.00.]
FIGURE 19: The power consumption as a function of the size of matrix Q (N × Z) and also according to the throughput (color bar) and the number of bits (size of circle).
REFERENCES
[1] R. S. Sutton, Reinforcement Learning: A Special Issue of Machine
Learning on Reinforcement Learning, 8th ed. Kluwer Academic Publishers,
1992.
[2] N. C. de Almeida, M. A. C. Fernandes, and A. D. D. Neto, “Beamforming
and power control in sensor arrays using reinforcement learning,” Sensors,
vol. 15, 2015.
[3] L. M. Reyneri, “Implementation issues of neuro-fuzzy hardware: going
toward HW/SW codesign,” IEEE Transactions on Neural Networks, vol. 14,
no. 1, pp. 176–194, 2003.
[4] A. C. D. de Souza and M. A. C. Fernandes, “Parallel fixed point
implementation of a radial basis function network in an FPGA,” Sensors,
vol. 14, pp. 18223–18243, 2014.
[5] B. J. Leiner, V. Q. Lorena, T. M. Cesar, and M. V. Lorenzo, “Hardware
architecture for FPGA implementation of a neural network and its
application in images processing,” in Electronics, Robotics and Automotive
Mechanics Conference (CERMA ’08), Morelos, 2008, pp. 405–410.
[6] F. Mengxu and T. Bin, “FPGA implementation of an adaptive genetic
algorithm,” in 12th International Conference on Service Systems and
Service Management (ICSSSM), Guangzhou, 2015, pp. 1–5.
[7] M. F. Torquato and M. A. C. Fernandes, “High-performance parallel
implementation of genetic algorithm on FPGA,” 2018.
[8] S. Usenmez, R. A. Dilan, M. Dolen, and A. B. Koku, “Real-time
hardware-in-the-loop simulation of electrical machine systems using
FPGAs,” in International Conference on Electrical Machines and Systems,
2009, pp. 1–6.
[9] H. Saldanha, E. Ribeiro, M. Holanda, A. Araújo, G. Rodrigues, M. Walter,
J. Setubal, and A. Dávila, “A cloud architecture for bioinformatics
workflows,” in 1st International Conference on Cloud Computing and
Services Science. Noordwijkerhout, Netherlands, 2011, pp. 477–483.
[10] L. A. V. de Carvalho, Datamining: A Mineração de Dados no Marketing,
Medicina, Economia, Engenharia e Administração. Editora Ciência
Moderna, 2005.
[11] L. Shang, A. S. Kaviani, and K. Bathala, “Dynamic power consumption
in Virtex-II FPGA family,” in Proceedings of the 2002 ACM/SIGDA Tenth
International Symposium on Field-programmable Gate Arrays, ser. FPGA
’02. New York, NY, USA: ACM, 2002, pp. 157–164.
[12] S. Asano, T. Maruyama, and Y. Yamaguchi, “Performance comparison of
FPGA, GPU and CPU in image processing,” in International Conference on
Field Programmable Logic and Applications, Prague, 2009, pp. 126–131.
[13] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, vol. 26, no. 2, pp. 203–215, 2007.
[14] Xilinx. (2018) System Generator for DSP. Accessed 2018-11-17.
[Online]. Available:
https://www.xilinx.com/products/design-tools/vivado/integration/sysgen.html
[15] Xilinx. (2018) Virtex-6 Family. Accessed 2018-11-17. [Online].
Available:
https://www.xilinx.com/support/documentation-navigation/silicon-devices/fpga/virtex-6.html
[16] H. Woehrle and F. Kirchner, “Reconfigurable hardware-based acceleration
for machine learning and signal processing,” in SyDe Summer School.
Wiesbaden: Springer Fachmedien Wiesbaden, 2015.
[17] Z. Liu and I. Elhanany, “Large-scale tabular-form hardware
architecture for Q-learning with delays,” in Midwest Symposium on Circuits
and Systems, Montreal, Que, 2007, pp. 827–830.
[18] V. L. Prabha and E. C. Monie, “Hardware architecture of reinforcement
learning scheme for dynamic power management in embedded systems,”
EURASIP Journal on Embedded Systems, no. 1, p. 065478, 2007.
[19] P. R. Gankidi and J. Thangavelautham, “FPGA architecture for deep
learning and its application to planetary robotics,” in 2017 IEEE
Aerospace Conference, March 2017, pp. 1–9.
[20] S. Shao, J. Tsai, M. Mysior, W. Luk, T. Chau, A. Warren, and B. Jeppesen,
“Towards hardware accelerated reinforcement learning for
application-specific robotic control,” in 2018 IEEE 29th International
Conference on Application-specific Systems, Architectures and Processors
(ASAP). IEEE, 2018, pp. 1–8.
[21] J. Su, J. Liu, D. B. Thomas, and P. Y. Cheung, “Neural network based
reinforcement learning acceleration on FPGA platforms,” ACM SIGARCH
Computer Architecture News, vol. 44, no. 4, pp. 68–73, 2017.
[22] P. R. Gankidi, “FPGA accelerator architecture for Q-learning and its
applications in space exploration rovers,” Ph.D. dissertation, Arizona
State University, 2016.
[23] R. Faraji and H. R. Naji, “An efficient crossover architecture for hardware
parallel implementation of genetic algorithm,” Neurocomputing, vol. 128,
pp. 316–327, 2014.
[24] J. T. Barron, D. S. Golland, and N. J. Hay, “Parallelizing reinforcement
learning,” Ph.D. dissertation, UC Berkeley, 2009.
[25] A. M. Printista, M. L. Errecalde, and C. I. Montoya, “A parallel
implementation of Q-learning based on communication with cache,” Journal
of Computer Science and Technology, vol. 1, no. 6, 2002.
[26] M. Kushida, K. Takahashi, H. Ueda, and T. Miyahara, “A comparative
study of parallel reinforcement learning methods with a PC cluster system,”
in 2006 IEEE/WIC/ACM International Conference on Intelligent Agent
Technology, Dec 2006, pp. 416–419.
[27] M. Camelo, J. Famaey, and S. Latré, “A scalable parallel Q-learning
algorithm for resource constrained decentralized computing environments,”
in 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC),
Nov 2016, pp. 27–35.
[28] C. J. C. H. Watkins and P. Dayan, “Technical note: Q-learning,” Machine
Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[29] A. Das, R. A. Shafik, and G. V. Merrett, “Reinforcement learning-based
inter- and intra-application thermal optimization for lifetime improvement
of multicore systems,” in ACM/EDAC/IEEE Design Automation Conference
(DAC), San Francisco, CA, 2014, pp. 1–6.
[30] A. A. R. Diniz, A. J. J. L. Filho, P. R. da Motta Pires, S. M. Kanazava, J. D.
de Melo, and A. D. D. Neto, “Application of Q-learning to define optimal
policy for triggering PID, neural and fuzzy controllers in a level control
process,” Brazilian Congress of Automation, 2010.
[31] P.-C. Wu, “Multiplicative, congruential random-number generators with
multiplier ± 2^k1 ± 2^k2 and modulus 2^p - 1,” ACM Transactions on
Mathematical Software, vol. 23, no. 2, pp. 255–265, 1997.
[32] Xilinx. (2018) Virtex-6 FPGA ML605 Evaluation Kit. Accessed 2018-11-17.
[Online]. Available:
https://www.xilinx.com/products/boards-and-kits/ek-v6-ml605-g.html
[33] MathWorks. (2018) Matlab/Simulink. Accessed 2018-11-17. [Online].
Available: http://www.mathworks.com
[34] Xilinx. (2018) Vivado Design Suite. Accessed 2018-11-17. [Online].
Available: https://www.xilinx.com/products/design-tools/vivado.html
LUCILEIDE M. D. DA SILVA was born in Natal, Brazil. She received the
BS degree in Electrical Engineering in 2012 and the MS degree in
Computer Engineering in 2016, both from the Federal University of Rio
Grande do Norte, Natal, Brazil. Currently, she is a Professor in the
Informatics Group at the Federal Institute of Rio Grande do Norte,
Santa Cruz - RN, Brazil. She is part of the Robotics Group at the
Institute, training teenagers for different competitions and
developing Assistive Technology (AT) for children with special needs.
She is also part of the Research Group on Embedded Systems and
Reconfigurable Hardware, where the main research topic is the
acceleration of artificial intelligence (AI) algorithms through
reconfigurable computing (RC) on FPGA, in the Department of Computer
Engineering and Automation, Federal University of Rio Grande do
Norte, Natal, Brazil. Her research interests include artificial
intelligence, embedded systems, reconfigurable hardware, educational
robotics for teenagers and special needs children, and tactile
internet.
MATHEUS F. TORQUATO was born in Natal, Brazil. He received the BSc
degree in Science and Technology in 2013, the B.E. degree in Computer
Engineering in 2015, and the M.Sc. degree in Computer Engineering in
2017 from the Federal University of Rio Grande do Norte, Natal,
Brazil. Currently, he is part of the Research Group on Embedded
Systems and Reconfigurable Hardware, where the main research topic is
the acceleration of artificial intelligence (AI) algorithms through
reconfigurable computing (RC) on FPGA. Apart from his main research
topic of AI and RC at his home university, he has research experience
in Human-Computer Interaction at the Future Interaction Technology
lab at Swansea University (Wales, UK) and in Computer Vision at the
Sensing and Machine Vision for Automation and Robotic Intelligence
lab at the University of Ottawa (Ontario, Canada). His research
interests include artificial intelligence, embedded systems,
reconfigurable hardware, human-computer interaction, and tactile
internet.
MARCELO A. C. FERNANDES was born in Natal, Brazil. He received the BS
degree in Electrical Engineering in 1997 and the MS degree in
Electrical Engineering in 1999 from the Federal University of Rio
Grande do Norte, Natal, Brazil, and the Ph.D. degree in Electrical
Engineering in 2010 from the University of Campinas, Campinas, SP,
Brazil. Currently, he is an Adjunct Professor in the Department of
Computer Engineering and Automation, Federal University of Rio Grande
do Norte, Natal, Brazil. From 2015 to 2016, he worked as a visiting
researcher at the Centre for Telecommunications Research (CTR) at
King’s College London, London, UK. He is the leader of the Research
Group on Embedded Systems and Reconfigurable Computing (RESRC) and
coordinator of the Laboratory of Machine Learning and Intelligent
System (LMLIS). His research interests include artificial
intelligence, digital signal processing, embedded systems,
reconfigurable hardware, and tactile internet. Dr. Fernandes is the
author and co-author of many scientific papers and practical studies
on the use of reconfigurable computing on FPGA to accelerate
artificial intelligence algorithms.
