The succesful application of model predictive control (MPC) in fast embedded systems relies on faster and more energy efficient ways of solving complex optimization problems. A custom quadratic programming (QP) solver implementation on a field-programmable gate array (FPGA) can provide substantial acceleration by exploiting the parallelism inherent in some optimization algorithms, apart from providing novel computational opportunities arising from deep pipelining. This paper presents a new MPC algorithm based on multiplexed MPC that can take advantage of the full potential of an existing FPGA design by utilizing the provided 'free' parallel computational channels arising from such pipelining. The result is greater acceleration over a conventional MPC implementation and reduced silicon usage. The FPGA implementation is shown to be approximately 200x more energy efficient than a high performance general purpose processor (GPP) for large control problems.
INTRODUCTION
MPC is an advanced optimal control technology that has proven to be very succesful due to its capability of returning an optimal strategy without violating the physical limitations of the system. The need to solve a computationally intensive QP problem at every sampling instant has restricted its applicability to slow plants, such as those encountered in the chemical process industries (Maciejowski, 2001) , where sampling times can be on the order of seconds or minutes. As the computational power of new devices continues to rise, MPC is now being proposed for higher bandwidth applications, such as aerospace, robotics, electrical power and automotive. There is a growing demand for ways of accelerating the solution of QP problems so that the success of MPC can be extended to areas where the computational burden has so far been considered too great.
⋆ The authors would like to acknowledge the support of the EPSRC (Grant EP/G031576/1), discussions with Prof. Jan Maciejowski, and industrial support from Xilinx, the Mathworks and the European Space Agency.
One approach is through hardware acceleration. Recent advances in reconfigurable hardware technology have made the FPGA a suitable platform for accelerating scientific computations. FPGAs are a good alternative to application specific integrated circuits (ASICs) for high speed embedded MPC applications since they offer much reduced low-volume cost, greater flexibility, and a shorter design cycle, reducing the risk while still maintaining deterministic execution time and a high power efficiency.
A parametric FPGA implementation of an interior-point solver for linear MPC has been presented in (Jerez et al., 2011) . The design is able to efficiently process problems that are several orders of magnitude larger than previous implementations (Knagge et al., 2009; Ling et al., 2008; Vouzis et al., 2009 ) and achieves substantial acceleration over a general purpose microprocessor. In this paper we present a new MPC algorithm based on multiplexed MPC (Ling et al., 2010b ) that makes use of the slack in the FPGA implementation, in the form of unused parallel computational channels, to achieve lower closed-loop cost. We also demonstrate the large advantage in terms of power efficiency that the FPGA implementation brings over a GPP implementation.
A review of MPC and the primal-dual interior point algorithm (Wright, 1997 ) is presented in Sections 2 and 3. The main features of the FPGA implementation are highlighted in Section 4. Section 5 presents the new MPC algorithm and Section 6 shows the results of the parameterized FPGA design. The performance improvement is demonstrated with an example in Section 7.
REVIEW OF LINEAR MPC
Unlike conventional control techniques, MPC takes into account the physical limitations of the system by incorporating constraints into the problem formulation, delivering extra performance gains (Maciejowski, 2001 ). However, due to the presence of constraints it is not possible to obtain an analytic expression for the optimum solution and we have to solve an optimization problem at every sample instant, resulting in very high computational demands. In linear MPC, we have a linear model of the plant, linear constraints, and a positive definite quadratic cost function, hence the resulting optimization problem is a convex quadratic program (Boyd and Vandenberghe, 2004) . Without loss of generality, the time-invariant problem can be described by the following equations:
subject to:
contain all u k and x k respectively, x is the current estimate of the state of the plant, N is the horizon length,
m×m is symmetric positive definite (SPD) to guarantee uniqueness of the solution, S d ∈ R n×m is such that (1) is convex, Q ∈ R n×n is an approximation of the cost from k = N to infinity and is SPSD and ≤ denotes componentwise inequality. J ∈ R l×n , E ∈ R l×m and d ∈ R l describe the physical constraints of the system and l is the number of constraints.
At every sample instance a measurement of the system's output is taken, from which the current state of the plant is inferred (Maciejowski, 2001) . The optimization problem (1)-(2) is then solved but only the first part of the solution u * 0 ( x) is implemented. Due to disturbances, model uncertainties and measurement noise, there will be a mismatch between the next output measurement and what the controller had predicted, hence the whole process has to be repeated again at every sample instant to provide closed-loop stability and robustness.
PRIMAL-DUAL INTERIOR-POINT ALGORITHM
Following the non-condensed method, where the states are left as decision variables (Rao et al., 1998) , the optimal control problem (1)-(2) can be formulated as a sparse QP of the following form:
subject to
where
, and p is the number of outputs to the plant.
The primal-dual interior-point algorithm is a suitable algorithm, which uses Newton's method (Boyd and Vandenberghe, 2004) for solving a perturbed version of a nonlinear system of equations, known as the Karush-KuhnTucker (KKT) optimality conditions. At each iteration, three tasks need to be performed: linearization around the current point, solving the resulting linear system to obtain a search direction, and performing a line search to update the solution to a new point.
The algorithm can be described by the following pseudocode, where ν and λ are known as Lagrange multipliers, s are known as the slack variables, Λ k and S k are diagonal matrices containing the elements of λ k and s k respectively, and I IP is the number of iterations (refer to (Wright, 1997) for more details). Algorithm 1. QP pseudo-code
T > 0 All variables are arbitrarily set to 1. 
Linear Solver
Most of the computational complexity in each iteration of the interior-point method is associated with solving the system of linear equations A k z k = b k . After appropriate row re-ordering (interleaving elements of θ and ν), the indefinite symmetric matrix A k becomes banded. The size and half-bandwidth of A k in terms of the control problem parameters are given, respectively, by
Notice that the number of outputs p and the number of constraints l does not affect the size of A k , which determines the computation time.
The minimum residual (MINRES) method is a suitable iterative algorithm for solving linear systems with indefinite symmetric matrices (Fisher, 1996) . At each MINRES iteration, a matrix-vector multiplication accounts for the majority of the computations. This kind of operation is easy to parallelize and consists of multiply-accumulate instructions, which are known to map efficiently into hardware in terms of resources, in contrast with division operations abundant in factorization based methods. In addition, iterative methods allow one to trade off accuracy for computation time.
In (Boland and Constantinides, 2008 ) the authors propose an FPGA implementation for solving this type of linear system using the MINRES method, reporting speed-ups of approximately one order of magnitude over software implementations. Most of the acceleration is achieved through a deeply pipelined dedicated hardware block that parallelizes dot-product operations for computing the matrixvector multiplication in a row-by-row fashion. We use this architecture in our design with a few modifications to customize it to the special characteristics of the matrices that arise in MPC. Notice that the size of the dot-products that are computed in parallel is independent of the control horizon length N (refer to (5b)), thus computational resource usage does not scale with N .
Sequential Block
The remaining operations in the interior-point iteration are undertaken by a separate hardware block, which we call Stage 1. The resulting two-stage architecture is shown in Figure 1 . Since the linear solver will provide most of the acceleration by consuming most resources it is vital that it remains busy at all times. Hence, the parallelism in Stage 1 is chosen to be the smallest possible such that the linear solver is always active. The whole circuit can be seen as a two-stage high-level pipeline where the computational times have to be matched to achieve the highest hardware efficiency.
When computing the coefficient matrix A k , only the diagonal matrix W k , changes from one iteration to the next, thus the complexity of this calculation is relatively small. If the structure of the problem is taken into account, The input is the current state measurement x and the output is the next optimal control move u * 0 ( x). we find that the remaining calculations in an interiorpoint iteration are all sparse and very simple compared to solving the linear system. Comparing the computational count of all the operations to be carried out in Stage 1 with the latency of the linear solver implementation, we come to the conclusion that for most control problems, the optimum implementation of Stage 1 is sequential, as this will be enough to keep the linear solver busy at all times. This is a consequence of the latency of the linear solver being Θ(N 2 ) (Boland and Constantinides, 2008) , whereas the number of operations in Stage 1 is only Θ(N ). For cases where computing Stage 1 takes longer than solving the linear system, another instance of Stage 1 can be synthesized to operate in parallel and share the same common hardware control block.
Parallel Channels
Notice that if both blocks are to be doing useful work at all times, while the linear system for a specific problem is being solved, Stage 1 has to be updating the solution and linearizing for the next iteration for another independent problem. In addition, the dot-product architecture described in (Boland and Constantinides, 2008 ) is deeply pipelined, thus it can complete a matrix-vector multiplication in Z cycles (Z rows in A k ), but the latency of one MINRES iteration is longer than Z cycles due to other operations. In order to keep the dot-product hardware busy at all times, P independent problems can be multiplexed into the linear solver for simultaneous processing. Hence, our design can process 2P independent QP problems simultaneously at no extra cost in terms of computational resources.
The expression for P for the current implementation is given by
The linear terms result from the row by row processing for the matrix-vector multiplication (Z dot-products) and serial-to-parallel conversions, whereas the log term comes from the depth of the adder reduction tree in the dotproduct block. The constant term comes from the other operations in the MINRES iteration. Figure 2 combines (6) and (5) to show the number of parallel channels available in terms of the parameters of the control problem. For large problems, six parallel channels are available.
Latency and Throughput
Unlike a software implementation, the FPGA's deterministic execution time can provide the precise timing guarantees required for interfacing a controller to the physical system. The overall latency of the circuit will be given by
where I M IN RES is the number of iterations the MINRES method takes to solve the linear system to the required accuracy, I IP is the number of outer iterations in the interiorpoint method (Algorithm 1), FPGA f req is the FPGA's clock frequency, P N is the latency of one MINRES iteration (Boland and Constantinides, 2008) , and P is given by (6). In that time the controller will be able to output the result to 2P problems.
PARALLEL MULTIPLEXED MPC
Multiplexed MPC (MMPC) has been proposed elsewhere (Ling et al., 2010b) . This section extends the MMPC algorithm in a way such that the architecture proposed in Section 4, which is capable of solving 2P QP problems in parallel, can be exploited. We called this version of MMPC, Parallel MMPC.
The original formulation of MMPC was for implementation on a single core processor, solving one QP problem per sampling interval. The key idea in MMPC is that, for an m-input plant, instead of optimizing over all the m input channels in one large QP, the inputs are optimized one channel at a time, in a pre-planned periodic sequence, and the control moves updated as soon as the solution becomes available. This results in a smaller QP at each sampling instant, hence reduced online computational load which in turn enables faster sampling, leading to faster response to disturbances, despite finding a sub-optimal solution to the original optimization problem (Ling et al., 2010a) .
MMPC is closer to industrial practice in cases where there is a complex plant with network constraints, meaning that all control inputs cannot be updated simultaneously due to limitations in the communication channels between the actuators and the controller. Parallel MMPC helps to choose which inputs are best to update at any given sampling interval.
A detailed derivation of the Parallel MMPC is beyond the scope of this paper. Instead, we set out the following algorithm which outlines the key steps in Parallel MMPC: Algorithm 2. Parallel MMPC 1. Initialize by optimizing over all the control moves.
Stored planned moves (N moves for each input).
while (1) 3. Apply the first control move for all inputs and shift the plan.
4. Obtain new measurement x.
5. Solve m different copies of MMPC in parallel. For each copy, optimize with respect to different subsets of control moves.
6. Evaluate and select from these m copies of MMPC, the set of control moves that gives the smallest cost.
7. Update the plan for the set of control moves that gives the smallest cost.
end
As can be seen from Algorithm 2, Parallel MMPC uses MMPC as an elementary building block. In Parallel MMPC, for a plant with m inputs, at a given time, there can be up to m copies of MMPC. Each of these operates independently and in parallel, and when given the current plant state x, optimized with respect to different subsets of control moves. The set of control moves which produces the smallest cost is selected and applied to the plant. The process is repeated at the next updating instant. The resulting updating sequence does not follow a pre-planned sequence and is not necessarily periodic.
Note that Step 1 of Algorithm 2 involves solving for inputs across all input channels. This type of initialization requirement is common in distributed MPC. Subsequent optimizations use this initial solution, but optimize with respect to a subset of control moves. The stability property of MMPC does not depend on the optimality of this initial solution, only its feasibility (Ling et al., 2010b) . Proposition 1. Parallel MMPC, obtained by implementing Algorithm 2, gives closed-loop stability.
Proof. A detailed proof is beyond the scope of this paper and only an outline is provided. The proof follows standard arguments used by most MPC stability proofs. It depends on the constrained optimization being feasible at each step. In the proposed Parallel MMPC algorithm, the default MMPC is always evaluated at every iteration, among the x(t) ′ Q c x(t) + u(t) ′ R c u(t) dt. 
CONCLUSION
This paper has described the characteristics of an existing FPGA implementation of a QP solver for the optimization problems arising in MPC. A new MPC algorithm that makes use of the slack parallel computational channels offered by the FPGA architecture has been presented. Employing this new strategy allows to save resources and achieve greater acceleration, leading to better quality control. The large power efficiency advantage of the FPGA over a high performance GPP implementation has been established.
We are currently working on new MPC algorithms that can take advantage of the parallel computational channels offered by our design without requiring many inputs to the system, hence applicable to a more general class of systems.
