Abstract-This paper introduces a novel architecture for performing the core computations required by dynamic programming (DP) techniques. The latter pertain to a vast range of applications that require an optimal sequence of decisions to be issued. An underlying assumption is that a complete model of the environment is provided, whereby the dynamics are governed by a Markov decision process (MDP). Existing DP implementations have traditionally been realized in software. Here, we present a method for exploiting the data parallelism associated with computing both the value function and optimal action set. An optimal policy is obtained four orders of magnitude faster than traditional software-based schemes, establishing the viability of the approach for real-time applications.
Dynamic programming applies the Markov decision process (MDP) formalism to determine the optimal policy, or control mechanism, in a dynamic stochastic environment. Existing research efforts [1] [2] have traditionally focused on software-based realizations of DP techniques to determine the mapping of states to actions, or policy, required to derive an optimal controller. The probability of transitioning to a next state, $s'$, given a current state/action pair, $s$ and $a$, is defined as

$$P^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s,\ a_t = a \}. \tag{1}$$

However, the transition probability alone does not provide information regarding the expected reward that is to be received upon arrival at a given state. For a given current state/action pair, the expected reward for each possible subsequent state, $s'$, is given by

$$R^a_{ss'} = E\{ r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \}. \tag{2}$$

The reward function embeds information that can be used to determine the value of the next state given that a specific action is taken. The long-term value associated with each state is provided by the value function. The latter is commonly defined as the expected sum of discounted rewards [3], given that the system starts in state $s$ and follows policy $\pi$, such that

$$V^\pi(s) = E_\pi\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s \Big\}, \tag{3}$$

where $\gamma$ is the discounting factor.
The common goal in DP applications is to derive an optimal policy that maximizes the value function. Utilizing (2) and (3), the optimal value function, in terms of the transition probabilities and reward function, can be expressed as

$$V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]. \tag{4}$$

Attainment of the optimal policy is often achieved by utilizing the generalized policy iteration (GPI) algorithm [3]. GPI involves an evaluation phase, whereby a policy, $\pi(s)$, is evaluated to determine its value function, and an improvement phase, during which the policy is updated in a greedy manner with respect to the value function. In the evaluation phase, the value function improves according to the policy, whereas in the improvement phase, the policy is improved according to the value function. Through this successive iterative improvement process, it has been shown that convergence to the optimal value function is guaranteed, from which an optimal policy can be derived [4].
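As an illustration of the evaluation and improvement phases described above, the following Python sketch runs policy iteration on a hypothetical two-state, two-action MDP. The toy model, the function names, the convergence tolerance, and the discount factor of 0.9 are assumptions made for this example only; they are not taken from the implementation discussed in this paper.

```python
def evaluate_policy(P, R, policy, gamma, tol=1e-9):
    """Evaluation phase: sweep the states until the value function settles."""
    n = len(P)
    V = [0.0] * n
    while True:
        delta = 0.0
        for s in range(n):
            a = policy[s]
            v_new = sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                        for s2 in range(n))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


def improve_policy(P, R, V, gamma):
    """Improvement phase: act greedily with respect to the current values."""
    n = len(P)
    new_policy = []
    for s in range(n):
        q = [sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in range(n))
             for a in range(len(P[s]))]
        new_policy.append(max(range(len(q)), key=q.__getitem__))
    return new_policy


def policy_iteration(P, R, gamma):
    """Alternate the two phases until the policy stops changing."""
    policy = [0] * len(P)
    while True:
        V = evaluate_policy(P, R, policy, gamma)
        new_policy = improve_policy(P, R, V, gamma)
        if new_policy == policy:
            return policy, V
        policy = new_policy


# Hypothetical toy MDP, indexed as P[s][a][s'] and R[s][a][s'].
# Action 0 stays in the current state, action 1 moves to the other state.
# Arriving in state 1 pays a reward of 1; arriving in state 0 pays nothing.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]

policy, V = policy_iteration(P, R, gamma=0.9)
print(policy, V)
```

In this toy model the greedy policy moves from state 0 to state 1 and then stays there, so both states converge to a value close to 1 / (1 - 0.9) = 10.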
OPTIMIZATION FOR HARDWARE REALIZATION
A principal limitation of hardware design for the policy iteration algorithm pertains to its prohibitive storage requirements. It is apparent that storage of both the transitional probabilities, $P^a_{ss'}$, and the reward function, $R^a_{ss'}$, mandates an $O(S^2 A)$ memory requirement. To illustrate the impact of this storage requirement, we will first consider an application that consists of a small state set, comprising 225 states, and restrict the total number of actions that can be taken at each state to 11. Furthermore, the transitional probabilities and rewards will be limited to 12-bit and 9-bit representations, respectively. The memory requirement, presented in Table I, needed to store the transitional probabilities and the reward function is 11.7 Mbits, which exceeds the on-chip memory resources available in current FPGA devices.
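The 11.7 Mbit figure can be checked with a few lines of arithmetic. The sketch below assumes dense tables with one entry per (s, a, s') triple and takes 1 Mbit to mean 10^6 bits; the helper name `table_bits` is introduced here for illustration.

```python
def table_bits(num_states, num_actions, bits_per_entry):
    """Bits needed for a dense table with one entry per (s, a, s') triple."""
    return num_states * num_states * num_actions * bits_per_entry


S, A = 225, 11                       # state and action counts from the text
prob_bits = table_bits(S, A, 12)     # 12-bit transitional probabilities
reward_bits = table_bits(S, A, 9)    # 9-bit rewards
total_bits = prob_bits + reward_bits

print(f"probabilities: {prob_bits} bits")
print(f"rewards:       {reward_bits} bits")
print(f"total:         {total_bits / 1e6:.1f} Mbits")
```

Because both tables grow with the square of the number of states, doubling the state count roughly quadruples the storage.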
In order to realize the policy iteration algorithm in hardware, it is imperative that the storage requirements be reduced. To facilitate such reduction, we first rewrite the policy improvement expression as

$$\pi(s) = \arg\max_a \Big[ G(s, a) + \sum_{s'} H^a_{ss'} V(s') \Big],$$

where $G(s, a)$ is given by

$$G(s, a) = \sum_{s'} P^a_{ss'} R^a_{ss'},$$

and the discounted probability, $H^a_{ss'}$, is represented as

$$H^a_{ss'} = \gamma P^a_{ss'}.$$

Since $G(s, a)$ is indexed only by state and action, the storage devoted to rewards drops from $O(S^2 A)$ to $O(SA)$. Furthermore, we can apply a similar technique in an effort to reduce the storage requirements characterizing the improvement phase. To that end, the value function may be expressed in the form

$$V^\pi(s) = G(s, \pi(s)) + \sum_{s'} H^{\pi(s)}_{ss'} V^\pi(s').$$

Unfortunately, it is infeasible to similarly reduce the number of unique transitional probabilities from $O(S^2 A)$. However, we can decrease the number of bits required to store each unique probability value. This reduction is achieved by mapping probabilities to a discrete set of 7-bit values. In representing the distribution as a discrete set of potential values, some error is introduced. However, this error constitutes less than one percent per calculation and is distributed uniformly over what is already an inherent approximation. Moreover, having reduced the storage requirements, we can now offer a greater dynamic range for the reward values than was previously possible. Applying the aforementioned observations, the total memory requirement, as presented in Table II, has been reduced from 11.7 Mbits to 3.89 Mbits, thus allowing representation of the policy iteration algorithm in many of the larger FPGA devices currently available.
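The storage reduction above can be sketched in Python as an offline step that folds the reward table into a per-(s, a) term and the discount factor into the probabilities. The names `precompute_g_h`, `quantize`, `q_direct`, and `q_condensed`, the toy numbers, and in particular the uniform rounding used for the 7-bit mapping are assumptions made for this example; the text does not specify how probabilities are assigned to the discrete 7-bit set.

```python
def precompute_g_h(P, R, gamma):
    """Offline step: G[s][a] = sum over s' of P*R, and H[s][a][s'] = gamma*P."""
    G = [[sum(p * r for p, r in zip(P[s][a], R[s][a]))
          for a in range(len(P[s]))] for s in range(len(P))]
    H = [[[gamma * p for p in P[s][a]]
          for a in range(len(P[s]))] for s in range(len(P))]
    return G, H


def quantize(x, bits=7):
    """Round x in [0, 1] to the nearest of 2**bits uniform levels (assumed mapping)."""
    levels = (1 << bits) - 1
    return round(x * levels) / levels


def q_direct(P, R, V, gamma, s, a):
    """Action value computed from the full probability and reward tables."""
    return sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
               for s2 in range(len(V)))


def q_condensed(G, H, V, s, a):
    """Same action value computed from the condensed G and H tables."""
    return G[s][a] + sum(h * v for h, v in zip(H[s][a], V))


# Hypothetical two-state, two-action model, indexed as [s][a][s'].
P = [[[0.7, 0.3], [0.2, 0.8]], [[0.5, 0.5], [0.9, 0.1]]]
R = [[[1.0, 0.0], [0.0, 2.0]], [[3.0, 1.0], [0.0, 5.0]]]
V = [4.0, 6.0]
gamma = 0.9

G, H = precompute_g_h(P, R, gamma)
Hq = [[[quantize(h) for h in row] for row in per_state] for per_state in H]

for s in range(2):
    for a in range(2):
        exact = q_direct(P, R, V, gamma, s, a)
        approx = q_condensed(G, Hq, V, s, a)
        print(s, a, exact, q_condensed(G, H, V, s, a), approx)
```

With unquantized `H`, the condensed form reproduces the direct computation up to floating-point rounding, since it is only a regrouping of the same sum. Under the uniform mapping assumed here, each stored value is off by at most half a step, 1/254 or about 0.4 percent of full scale. For the 225-state, 11-action example, a 7-bit table with $S^2 A$ entries needs 225 * 225 * 11 * 7 = 3,898,125 bits, which is close to the 3.89 Mbits quoted in the text.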
The final observation to be made does not relate to the storage properties, but rather to the reduction in computation requirements that results from the offline processing method used to condense the reward function. Recognizing that the product of the discount parameter, $\gamma$, and transitional
[Table II fragment: bit widths of 7 and 18 bits]
Altera Stratix II EP2S180 FPGA device. The implementation considered pertains to the car rental problem discussed in [3], utilizing 121 states with a total of 11 potential actions that can be issued at each state. The system architecture utilized $S$ parallel 18-bit floating-point multipliers and a tree structure of 18-bit floating-point adders to compute the optimal policy. The complete system, once implemented, achieved a post-fit operating frequency of 103 MHz (9.7 ns). The design required 47,453 adaptive logic modules (ALMs), or 66% of the ALMs available on the target device. Since the architecture requires a single RAM such that values for each of the 121 states can be accessed in parallel, the device consumes 493 (64%) of the M4K RAMs available. Additionally, a single M512 RAM unit was required for the $G(s, a)$ values. The total memory consumed was 1.772 Mbits (i.e., 18% of the RAM).
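The multiplier array and adder tree described above can be modeled in software to see how many adder stages a 121-term sum requires. The following Python sketch is a behavioral model only, written under the assumption of a balanced binary tree in which an unpaired operand passes through to the next stage; it does not model pipelining, the 18-bit floating-point format, or the actual hardware description used for the design. The function names are introduced here for illustration.

```python
def adder_tree_sum(values):
    """Sum a non-empty list pairwise, stage by stage, like a balanced adder tree.

    Returns (total, stages), where stages is the number of adder levels used.
    """
    level = list(values)
    stages = 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # unpaired operand passes through this stage
        level = nxt
        stages += 1
    return level[0], stages


def parallel_dot(h_row, v):
    """Model S side-by-side multipliers feeding the adder tree."""
    products = [h * x for h, x in zip(h_row, v)]
    return adder_tree_sum(products)


S = 121                      # number of states in the car rental example
h_row = [1] * S              # placeholder weights, integers to keep sums exact
v = list(range(S))           # placeholder state values

total, stages = parallel_dot(h_row, v)
print(total, stages)
```

For 121 products the operand count shrinks as 121, 61, 31, 16, 8, 4, 2, 1, so the model uses seven adder stages, compared with 120 dependent additions in a purely sequential accumulation.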
To establish the performance gain that can be achieved when implementing the hardware-based approach, a software simulation was used. When executed, the software-based policy iteration algorithm converged after an average of 33 evaluation phases and 4 improvement phases with a run time of 9.174 seconds. Similarly, an analysis of the hardware-based approach determined that the number of evaluation cycles required for convergence was 163 clock cycles, and the number of improvement iteration cycles was 1,373. Thus, the total time required for the hardware to obtain the optimal policy is 270.1 µs. Hence, the hardware-based solution yields an improvement of four orders of magnitude in speed when compared to the software-based system. Although the improvement observed for the particular example chosen does not necessarily reflect that which will characterize other problems, it provides insight into the potential gain of the proposed framework.

IV. SUMMARY AND FUTURE WORK

In this paper, a novel architecture was introduced for the real-time computation of an optimal control policy employing dynamic programming (DP). It has been demonstrated that the proposed architecture allows a policy to be obtained four orders of magnitude faster than is achieved by a traditional software-based system. The framework is suitable for a broad range of model-based problems requiring real-time decision making. FPGA implementation results were presented and discussed to emphasize the viability and scalability of the proposed architecture. Future work will focus on supporting learning algorithms that employ similar computation techniques, as well as studying further architectural trade-offs to increase operating frequencies in FPGA-based realizations.

probabilities, $P^a_{ss'}$, can be computed offline, we obtain a
