Abstract-Alternative and more efficient computational methods can extend the applicability of model predictive control (MPC) to systems with tight real-time requirements. This paper presents a system-on-a-chip MPC system, implemented on a field-programmable gate array (FPGA), consisting of a sparse structure-exploiting primal dual interior point (PDIP) quadratic program (QP) solver for MPC reference tracking and a fast gradient QP solver for steady-state target calculation. A parallel reduced precision iterative solver is used to accelerate the solution of the set of linear equations forming the computational bottleneck of the PDIP algorithm. A numerical study of the effect of reducing the number of iterations highlights the effectiveness of the approach. The system is demonstrated with an FPGA-inthe-loop testbench controlling a nonlinear simulation of a large airliner. This paper considers many more manipulated inputs than any previous FPGA-based MPC implementation to date, yet the implementation comfortably fits into a midrange FPGA, and the controller compares well in terms of solution quality and latency to state-of-the-art QP solvers running on a standard PC.
I. INTRODUCTION
A STRENGTH of model predictive control (MPC) is its ability to systematically handle constraints [1] - [4] . At each sampling instant, the solution of a constrained finitehorizon optimal control problem is used as the basis for the control action applied to the plant. With a linear prediction model, a convex quadratic cost function and linear inequality constraints on the plant states and inputs, this can be posed as a convex quadratic program (QP). For MPC with offsetfree reference tracking, it is commonplace to also calculate a steady-state equilibrium target to which the MPC regulates [5] - [7] . For a system with redundant actuators, this requires the solution of an additional (albeit smaller) constrained optimization problem.
For small predictive control problems, an attractive option is to precompute the explicit solution of the QP as a piecewise-affine function of the current state (and reference) using multiparametric programming [8] , [9] . For large predictive control problems, the computation and storage of the explicit solution is impractical, and the constrained optimization problem must be solved online. For a broad range of control applications that could benefit from employing predictive control, the cost and power requirements of general purpose computing platforms necessary to meet hard real-time requirements are unfavorable, and custom circuits designed specifically for the predictive control application are an attractive alternative. In this paper, a field-programmable gate array (FPGA)-based implementation of an MPC controller for a large airliner is demonstrated. Tracking of roll, pitch and airspeed are achieved through the use of a steady-state target calculator and MPC regulator [5] , [6] , each requiring the solution of a constrained QP. While arguments for the use of MPC in flight control can be found in [10] - [12] , the focus of this paper is on the FPGA-based methodology for implementation of the optimization scheme rather than tuning controller parameters or obtaining formal certificates of stability, for which a mature body of theory is already available [3] , [13] .
A. Background and Motivation
FPGAs are reconfigurable chips that can be customized for a specific application. This enables the implementation of complex algorithms exploiting wide parallelization. For predictive control applications, FPGAs are easily embedded as a system component, and offer cycle-accurate timing guarantees. The low clock frequencies in comparison with high-performance microprocessors can translate to lower power consumption. Computational speed is regained through parallelization and customizability. Transistor scaling is still improving the performance and reducing the power consumption and cost in each new technology generation.
There have been several previous FPGA implementations of QP solvers for predictive control. In [21] , speeds comparable with MATLAB implementations were achieved for an aircraft example with four states, by using a high-level synthesis tool to convert a C-like description into a hardware description. A soft-core (sequential) processor implemented on the FPGA fabric was used in [20] to execute a C implementation of the QP solver and demonstrate the performance on a twostate drive-by-wire system. In [22] and [23] , a mixed software/hardware implementation is used where the core matrix computations are carried out in parallel custom hardware, while the remaining operations are implemented in a general purpose microprocessor. The performance was evaluated on two-state systems. In contrast, in [17] the linear solvers were implemented in software while custom accelerators were used for the remaining operations. A motor servo system with two states was used as a case study. The use of nonstandard number representations was studied in [15] with a hybrid fixed-point floating-point architecture and a nonstandard MPC formulation tested on a satellite example with six states. While [14] presents a full fixed-point implementation, no analysis or guarantees are provided for handling the large dynamic range manifested in interior-point methods. Recently, activeset [18] and interior-point [16] , [19] architectures were proposed using (very) reduced precision floating-point arithmetic and solving a condensed QP. Feasibility was demonstrated on an experimental setup with a 14th-order SISO open-loop stable vibrating beam, with impressive computation times. Many online optimization-based FPGA controllers have been for low-dimensional systems-a domain in which explicit MPC could also potentially be an option. In [24] a summary of FPGA-based MPC implementations up until 2010 is presented. Table I in this paper shows a survey of more recent developments, albeit highlighting little progress. A common trend is the use of dense QP formulations in contrast with the current trends in research for structure-exploiting optimization algorithms for predictive control.
B. Summary of Contribution
This paper considers a plant model with significantly more manipulated inputs than prior FPGA-based implementations. In addition, the sparse structure of the uncondensed optimization problem is exploited and unlike prior FPGA-based implementations of MPC, the steady-state target calculator for offset free control is included on-chip as a separate constrained QP. The implementation is also capable of running reliably at higher clock rates than prior designs. Table II summarizes key characteristics of the design.
1) MPC Regulator:
The MPC regulator is based on a modification to a VHDL-based design, first presented in [24] , [25] and applied in [26] . A primal-dual interior-point (PDIP) algorithm is used to solve the constrained QP. A parallel implementation of the minimum residual (MINRES) algorithm is used to solve the system of linear equations occurring at each iteration of the PDIP algorithm. Unlike many embedded microprocessors, an FPGA does not preclude standard double precision arithmetic. However, single precision arithmetic is used to reduce the number of hardware resources required, since this reduces the size, cost and power consumption of the FPGA device needed to realize the design.
The MINRES algorithm is known to be sensitive to numerical precision and conditioning, and a preconditioner is often needed to accelerate convergence and to improve the quality of the converged solution. In [24] and [25] no preconditioning was used, while in [26] a scaling of the state space prediction matrices was demonstrated as an effective offline preconditioner.
To demonstrate the tradeoff between control performance and solution time, a numerical study is performed to investigate the compromise between the number of iterations of the inner MINRES algorithm, and the effect of offline model scaling and online matrix preconditioning. All of these directly influence the total solution time and the solution quality, both in terms of fidelity of the computed control input with respect to that obtained from a standard QP solver, and in terms of the resulting closed-loop control performance. Despite using single precision arithmetic, the proposed design using offline scaling and online preconditioning, running on an FPGA with the circuit clocked at 250 MHz, gives solution accuracies and times comparable with those of standard matrix factorization-based algorithms using double precision arithmetic on a midrange (circa 2012) laptop computer.
2) Target Calculator:
The design for the MPC regulator exploits the sparse structure of the optimization problem [27] , [28] . This structure is not present in the steady-state target calculation problem and therefore the design cannot be re-used without substantial modification. FPGA resource constraints make implementing a second floating point interior point solver an unrealistic option. Since the target calculation problem is dense and bound constrained, the steady-state target calculator is realized using the fast gradient method (FGM) [29] . This was carried out using the MathWorks' HDL coder [30] , using fixed point arithmetic.
3) System Integration: The two QP solvers are each implemented as peripherals to a Xilinx MicroBlaze soft-core processor [31] , which is used to handle data transfer between the two QP solvers and the outside world via UDP/IP over 100-Mbit/s Ethernet. In contrast to [17] , [22] , and [23] where the custom circuit is used to accelerate key parts of the QP solver with the remainder implemented in software on a conventional processor, the present design uses the MicroBlaze solely to bridge communication. This can be replaced with a custom interface layer to suit the demands of a given application.
The resulting system-on-a-chip is verified using a testbench system with the plant model (in this case a highfidelity nonlinear model of the rigid body dynamics of a large airliner [11] ), simulated faster-than-real-time in SIMULINK. The plant state estimates and output reference setpoints are communicated to the system running on the FPGA in the payload of a UDP packet at each sampling instant, with the FPGA returning control commands in the same manner.
C. Outline
The remainder of this paper is organized as follows. Section II summarizes the MPC control problem. Section III introduces the case study application. Section IV describes the hardware implementation. Section V presents a numerical investigation of the tradeoff between MINRES iterations, online and offline preconditioning and control performance. Section VI describes the test bench. Section VII presents numerical results and Section VIII concludes this paper.
II. PREDICTIVE CONTROL PROBLEM
In this section, the constrained, linear predictive tracking control problem with quadratic cost is briefly summarized. The forms of the two QPs to be solved on the FPGA at each sampling instant are described.
A. Predictive Regulation Control Problem
Let x ∈ R n x and u ∈ R n u denote the state and controlled input of a nonlinear plant at an equilibrium trim point, where n x is the number of states and n u is the number of inputs. Defining the nonlinear plant state at time k as x(k) ∈ R n x and input u(k) ∈ R n u , respectively, and letting δx(k) x(k) − x, and δu(k) u(k) − u, a linear prediction model of the form
can be obtained by linearization of a nonlinear model around this (x, u), where A ∈ R n x ×n x , B ∈ R n x ×n u and B d ∈ R n x ×n w , and w(k) ∈ R n w is an unmeasured disturbance. Letting δx s (r (k)) and δu s (r (k)) be state and input setpoints relative to the equilibrium trim point, N be the control horizon (equal to the prediction horizon), δx(k) be the current estimate of the linearized state,ŵ(k) be an estimate of the unmeasured disturbance, δx i and δu i denote predicted state and input deviations from the trim point at a time instant i steps into the future, θ
M be a shorthand for · T M · , and P ≥ 0, Q ≥ 0, R > 0 be appropriately sized weighting matrices, then the finite-horizon optimal control problem to solve at each time step is
For simplicity, state constraints are neglected and with only input constraints considered, no further measure is needed to ensure that a feasible solution to the QP exists, hence no method to detect infeasibility is required in the QP solver. Letting ⊗ denote the Kronecker product, ⊕ denote the matrix direct sum, I i ∈ R i×i be an identity matrix, and 1 i ∈ R i×1 be a vector of ones, let
The optimal control problem (OCP) (2) is posed as a parametric QP
where vectors h and f change between problem instances.
B. Steady-State Target Calculation
For nominal offset-free steady-state tracking of reference setpoints it is common to use a target calculator [5] - [7] to calculate δx s and δu s that satisfy
where C r ∈ R n r ×n x , and r (k) ∈ R n r is a vector of n r reference setpoints to be tracked without offset if, for some value r ∞ , r (k) → r ∞ as k → ∞. A feasible solution to (5) is not guaranteed to exist. Let
and W > 0 be a weighting matrix. The solution of
subject to (5c) will find a solution satisfying the equality constraints if one exists and return a least-squares approximation if one does not. This is a dense QP with no equality constraints; however, it is not guaranteed that A T s W A s > 0 and the solution of (6) 
subject to (5c). The optimal values δx * s (r (k)) and δu * s (r (k)) are then used as the setpoints in the regulation problem (2).
III. CASE STUDY-CONTROL OF A LARGE AIRLINER
For this implementation, the control of the roll, pitch and airspeed of a nonlinear Simulink-based model [11] , [32] of the rigid-body dynamics of a Boeing 747-200 with individually manipulable controls surfaces is considered.
A prediction model of the form (1) is obtained by linearization of the nonlinear model about an equilibrium trim point for straight and level flight at an altitude of 600 m and an airspeed of 133 ms −1 , discretized with a sample period of T s = 0.2 s. The linearized model considers 14 states (roll rate, pitch rate, yaw rate, airspeed, angle of attack, sideslip angle, roll, pitch, yaw, altitude, and four engine power states). Yaw angle and altitude are neglected in the prediction model used for the predictive controller, since they do not affect the roll, pitch and airspeed (leaving 12 remaining states). The 17 inputs considered consist of four individually manipulable ailerons, left spoiler panels, right spoiler panels, four individually manipulable elevators, a stabilizer, upper and lower rudder, and four engines. The effects of landing gear and flaps are not considered as these substantially change the local linearization. The disturbance input matrix B d is selected to   TABLE III  COST FUNCTION   TABLE IV  INPUT CONSTRAINTS describe an input disturbance in discrete time that is equivalent to a constant disturbance to the rate of each of the first 10 states in continuous time. The cost function (2a) is chosen with the weights in Table III and the constraints on the inputs are summarized in Table IV. The twelve states δx(k) of the model (1) are assumed measurable, along with the two variables that were neglected in the prediction mode: the altitude, and the yaw angle. The disturbance w(k) cannot be measured. As is standard practice in predictive control, an observer is therefore used to estimate w(k). The observer includes a one step ahead prediction to allow the combined target calculator and predictive regulator a deadline of one sampling period for computation.
The target calculator is configured with C r = [e 4 , e 7 , e 8 ] T , where e i is the i th column of the 29 × 29 element identity matrix, in order that the references to be tracked are airspeed, roll, and pitch angle. The weighting matrix W is selected as an appropriately sized identity matrix.
IV. HARDWARE IMPLEMENTATION
While for software running on desktop-oriented platforms the use of IEEE double precision (64-bit) floating-point arithmetic is generally unquestioned, for embedded platforms single precision (32-bit) arithmetic is often preferred due to the super-linear relationship between word-length and silicon resources [33] . In addition, memory storage and bandwidth requirements-key issues for resource-constrained embedded applications-decrease with the number of bits used. Many applications, including predictive control, do not require the accuracy provided by double precision arithmetic if sufficient care is taken to formulate the problem in a numerically favorable way.
Factors different to those important for a software implementation can motivate the choice of algorithm for a custom hardware design. In sequential software, a smaller floatingpoint operation count leads to shorter algorithm runtimes. In hardware it is the ratio of parallelizable work to sequential work that determines the potential speed of an implementation.
In addition, the proportion of different types of operations can also be an important factor. For example, multiplication and addition have lower latency and use fewer hardware resources than division or square root operations.
A. MPC QP Solver
This subsection describes the main architectural and algorithmic details for the design of the solver for problem (4).
1) PDIP Algorithm:
The present design uses a PDIP algorithm [28] with an iterative linear solver at its core (Fig. 1) . Iterative linear solvers can be preferable over direct (factorization-based) methods in this context, despite the problem sizes being small in comparison with the problems for which iterative methods have been used historically. Firstly, matrix-vector multiplication accounts for most of the computation at each iteration, an operation offering multiple parallelization and pipelining opportunities. Secondly, there are few division and square root operations compared with factorization-based methods. Finally, these methods allow one to trade off accuracy for computational time by varying the number of iterations. Despite iterative methods being more sensitive to poor conditioning than direct methods, for the present application, with suitable problem scaling, a relatively small number of iterations can be sufficient to obtain adequate accuracies, as observed in Section V. Conversely, direct methods are more difficult to parallelize and pipeline, with many points at which all subsequent operations depend upon the solution of a single division operation (with a comparatively long latency).
Because there is no matrix factorization to reuse, a simple PDIP algorithm [28] is employed instead of Mehrotra's predictor-corrector algorithm [34] , which is commonly found in software packages. Rather than checking a termination criterion, the number of interior-point iterations is fixed to 18 since this guarantees that a (possibly suboptimal) solution is available at a given time. A detailed investigation into the number of iterations needed by interior-point methods is not the subject of this paper, however, a posteriori testing indicates that 18 iterations is sufficient to achieve the accuracy of solution needed, and a crude bound on the duality measure at the final iteration indicates μ 18 ≥ 1.5 2 · 0.35 18 ≈ 1.4×10 −8 . Both this and its reciprocal are well within the dynamic range of single precision floating point arithmetic. In step 5 of the PDIP algorithm, a backtracking line search algorithm is used, reducing the step-length by a factor of 0.5 per iteration, over a maximum of 17 iterations before rounding to zero.
To accelerate the convergence of the iterative linear solver for Step 3 in the PDIP algorithm, a positive diagonal preconditioner [33] , [35] is used whose elements (indexed with braced subscripts) are given by
where Z is the number of columns of the matrix. The problem derivation of this preconditioner was to bound variables rather than to improve convergence [33] , [35] , it has a positive effect on the convergence of the iterative algorithms for this and other applications. The hardware implementation is outlined in Section IV-A3 and its effect on reducing the number of iterations required for convergence is illustrated in Sections V and VII.
2) Architecture of QP Solver: This design is based on the initial architecture described in detail in [24] and [25] , implemented using VHDL and Xilinx IP-cores for floating point arithmetic and RAM structures. The implementation is split into two distinct blocks: one block accelerates the computational bottleneck of the algorithm (solving the linear equations in Fig. 1, step 3) implementing a parallel MINRES solver; the other block computes all the remaining operations.
The block that prepares the linear systems consists of a custom sequential machine with custom complex instructions that reduce instruction storage requirements and is close to 100% efficient, i.e., there are no cache misses or pipeline stalls, so a useful result is produced at every clock cycle. The original sequence of instructions has been modified for problems with only input bound constraints, significantly reducing the amount of computation for calculating k . In addition, more instructions have been added to calculate the preconditioner M and to recover the search direction z k from the result for the preconditioned system given by the linear solver. Since very few elements in k are changing from iteration to iteration, the updating of the preconditioner M is not costly.
The parallel MINRES solver accelerates computations by performing dot-products in a parallel fashion. This block consists of a bank of multipliers that perform all the multiplications concurrently followed by an adder reduction tree that accumulates the results. Since matrix A k can be rearranged into a banded form by interleaving the primal and dual variables [27] , the size of the dot-products is equal to the size of the band 2V − 1, where V = 2n x + n u is the half-band of the matrix. A customized storage scheme is used to exploit the fact that many elements in A k are constant or zero and there is repetition due to the multistage problem structure and timeinvariance. This saves approximately 70% of memory storage requirements in comparison with a standard banded storage scheme [24] . Exploitation of this structure would not have been possible with direct methods, since the nonzero band in the factorized matrices becomes dense.
Another benefit of FPGA technology for real-time applications is the ability to provide cycle accurate computation time guarantees. Let I I P and I M R be the number of PDIP and MINRES iterations, respectively. Let f c denote the clock frequency (250 MHz in this implementation) and c relate the time spent by the sequential block to the time spent in a MINRES iteration (this varies with different implementations as shown in Table V ). Let Z = N(2n x + n u ) + 2n x denote the number of elements in b k , and P 2Z + V + 12 log 2 (2V − 1) + 230 /Z , then for the current design, computation time is given by
3) Online Preconditioning Implementation: Two options for implementing the online preconditioning can be considered. The first option is to compute the preconditioned matrix in the sequential block and store it in the linear solver. This requires no extra computational resources; however, it imposes an extra computational load on the sequential block, increasing overall latency. It also prohibits the use of the customized reduced storage scheme, since the nonzero elements that were previously constant between iterations are no longer constant in the preconditioned matrix.
The second option, adopted by the present implementation, only computes the preconditioner in the sequential block. The original matrix is stored in RAM in the linear solver component, and the preconditioner is applied by a bank of multipliers inserted at the output of the memory. This requires approximately three times as much computation; however, this is not on the critical path, i.e., memories storing the matrix can be read earlier, so there is no effect on execution speed. The reduced storage scheme is retained at the cost of a significant increase in the number of multipliers. There is a tradeoff between the extra resources needed to implement this procedure and the amount of acceleration gained through a reduction in iteration count, as investigated in Section VII.
B. Target Calculator
The QP solver described in Section IV-A2 is designed to directly exploit the sparse structure of the problem (4). The target calculation problem (7) does not exhibit such a structure, thus the solver of Section IV-A2 cannot be applied directly. A separate QP solver is therefore required.
Matrix H s ∈ R n s ×n s , where n s = n x +n u is relatively small, positive definite by construction, and the constraints (5c) are simple bounds. The FGM algorithm [29] is considered, being division free and well-suited to fixed-point implementation (which uses significantly fewer FPGA resources).
A diagonal scaling matrix M s is obtained in the same manner as (8 
The implementation is prototyped at a register transfer level using SIMULINK and the MATLAB Fixed Point Toolbox, with control logic specified using M-code function blocks and programmatically converted to VHDL using MathWorks' HDL coder R2012a. The target calculator circuit is implemented as four distinct subsystems. Subsystem #1 calculates f s F s b s through multiplication of an n s × n b element matrix by n b element vector (where n b = n d + n r ). Since this occurs only once per sample, this matrix-vector multiplication is carried out sequentially, with one multiplication and one addition happening simultaneously per clock cycle, to conserve FPGA resource usage. Subsystem #2 contains the implementation of the FGM (Fig. 2) .
As with MINRES, the majority of the effort in this algorithm occurs in a matrix vector multiplication, which happens once per iteration. The matrix H s is stored with each column in a separate RAM, so that the multiplication of the elements of each row of H s with the elements of y (k−1) can be performed in parallel, with the sum of the elementwise products calculated using a pipelined tree reduction structure. This latter being conveniently generated by HDL coder by using the sum of elements block configured to use a tree structure, with register balancing. Subsystem #3 reverses the preconditioning, element-by-element in series, and subsystem #4 calculates T Section IV-A2, but scaled by diagonal matrices T Q and T R to improve problem conditioning (Section V).
The fixed point data types (Table VI) are chosen to match the resources present in the particular FPGA device targeted. In particular, data types for multiplication operations are chosen with consideration of the DSP48E resources on the FPGA, each of which includes a 25×18 bit integer multiplier. Matrix values are represented with a 25 bit data type and vector variables with a 35 bit data type, meaning that two DSP48E units are required per vector element to enable a throughput of one element-wise vector product per clock cycle.
The length of the integer portions are chosen based on the maximum ranges expected based on the input constraints from the case study (Section III). The number of clock cycles needed is shown by stage in Table VII , where I F G denotes the number of FGM iterations. The time for each FGM iteration is insignificant compared with the interior point regulator, so I F G = 1000 is chosen as a conservative value. A detailed analysis of the convergence properties of FGM using fixed point arithmetic can be found in [36] .
V. OFFLINE PRESCALING FOR PDIP/MINRES
For each PDIP iteration, the convergence of the MINRES algorithm used to solve A k z k = b k , and the accuracy of the final estimate of z k are influenced by the eigenvalue distribution of A k . When no scaling is performed on the prediction model and cost matrices for this application, and no preconditioning is applied online, inaccuracy in the estimates of z k leads the PDIP algorithm to not converge to a satisfactory solution. Increasing the number of MINRES iterations fails to improve the solution, yet increases the computational burden.
Preconditioning applied online at each iteration of the PDIP algorithm can accelerate convergence and reduce the worstcase solution error of z k . In [26] , an offline prescaling was used in lieu of an online preconditioner, with the control performance demonstrated competitive with the use of conventional factorization-based algorithms on a general purpose platform. The rationale behind the prescaling is now restated and numerical results presented to demonstrate that combining systematic offline prescaling with online preconditioning k corresponding to active constraints becoming large, as long as only a handful of these exist at any point, the perturbation toÂ is of low rank, and will have a relatively minor effect on the convergence of MINRES. Hence, rescaling the control problem to improve the conditioning ofÂ should also improve the conditioning of A k in some sense.
Prior to scaling, for N = 12, cond(Â) = 1.77 × 10 7 . The objective of the following procedure is to obtain diagonal matrices T Q > 0 and T R > 0 to scale the linear state space prediction model and quadratic cost weighting matrices as follows:
Since T Q and T R are diagonal, the diagonal structure of k is retained. The transformation (11) is a function of both T Q and its inverse, and both of these appear quadratically. In [37] some guidelines are provided for desirable scaling properties. In particular, it is desirable to scale the rows and columns ofÂ so that they are all of similar magnitude. While not exactly the original purpose, it should be noted that if the preconditioner (8) is applied repeatedly (i.e., repreconditioning the same matrix multiple times) to a general square matrix of full rank, the one-norm of each of the rows converges asymptotically to unity. The method proposed here for normalizingÂ follows naturally but with the further caveat that the structure ofM is imposed to be of the form (11) . Therefore, it is not (in general) possible to scaleÂ such that all row norms are equal to an arbitrary value. Instead, the objective is to reduce the variation in row (and column) norms. Empirical testing suggests that normalizing the two-norm of the rows ofÂ (subject to [11] ) gives the most accurate solutions from the PDIP method for the present application.
Noting the structure ofÂ, define the following vectors: 
, and apply the algorithm in Fig. 3 . Table VIII shows properties ofÂ that influence solution quality, before and after application of the prescaling with = 10 −7 : the condition number ofÂ; the standard deviation of the row 1-and 2-norms; and the standard deviation of the magnitude of the eigenvalues, which are substantially reduced by the scaling. Fig. 4 shows three metrics for the quality of the solution from the MINRES-based PDIP solver over the duration of a closed-loop simulation with a prediction horizon N = 12. The number of MINRES iterations per PDIP iteration is varied for four different approaches to preconditioning (none, offline, online, and combined online and offline). These experiments were performed in software, but a theoretical computation time using (9) for the FPGA implementation is also shown.
With neither preconditioning nor offline scaling, the control performance is unacceptable. Even when the number of MINRES iterations is equal to 2Z = 2 × 516 = 1032, the mean stage cost over the simulation is high (the controller failed to stabilize the aircraft), and the worst-case control error in comparison with a conventional PDIP solver using double precision arithmetic and a factorization-based approach is of the same order as the range of the control inputs. Using solely online preconditioning, control performance (in terms of the cost function) does not start to deteriorate significantly until the number of MINRES iterations is reduced to 0.25Z = 129, although at this stage the worst case relative accuracy is still poor (but mean relative accuracy is tolerable). With only offline preconditioning, worst case Numerical performance for a closed-loop simulation with N = 12, using PC-based MINRES-PDIP implementation with online and offline preconditioning (missing markers for the mean error indicate that at least one control evaluation failed due to numerical errors). Offline prescaling appears to be more effective than the online preconditioning alone, but combined use yields significantly better numerical performance than using either alone. relative control error does not deteriorate until the number of MINRES iterations is reduced to 0.75Z = 387 and control performance does not deteriorate until this is reduced to 0.1Z = 51. When combined, control performance is maintained with 0.03Z = 15 iterations, and worst-case control accuracy is maintained with 0.08Z = 41 iterations. 
VI. FPGA-IN-THE-LOOP TESTBENCH
The hardware-in-the-loop experimental setup used to test the predictive controller design has two goals: providing a reliable real-time closed-loop simulation framework for controller design verification; and demonstrating that the controller could be plugged into a plant presenting an appropriate interface. Fig. 5 shows a schematic of the experimental setup. The QP solver, running on a Xilinx FPGA ML605 evaluation board [38] , controls the nonlinear model of the B747 aircraft running in SIMULINK on a PC. At every sampling instant k, the observer estimates the next state δx(k + 1|k) and disturbanceŵ(k + 1|k). For the testbench, the roll, pitch and airspeed setpoints comprising the reference signal r (k) in the target calculator (5) that the predictive controller is designed to track, are provided by simple linear control loops, with the roll and pitch setpoints as a function of a reference yaw angle and reference altitude, respectively. The airspeed setpoint is passed through a low-pass filter. The vectors δx(k + 1|k),ŵ(k + 1|k) and r (k) are represented as a sequence of single-precision floating point numbers in the payload of a UDP packet via an S-function and this is transmitted over 100 Mbit/s Ethernet. The FPGA returns the control action in another UDP packet. This is applied to the plant model at the next sampling instant.
On the FPGA the two custom hardware circuits implementing the QP solvers for target calculation and MPC regulation are connected to a Xilinx MicroBlaze soft core processor, upon which a software application, bridges the communication between the Ethernet interface and the two QP solvers. As well as being simpler to implement, this architecture provides some system flexibility in comparison with a dedicated custom interface, with a small increase in FPGA resource usage (Table IX) and communication delay, and allows easy portability to other standard interfaces, e.g., SpaceWire, CAN bus, and so on, as well as an option for direct monitoring of controller behavior. Table IX shows the FPGA resource usage of the different components in the system-on-a-chip testbench, as well as the proportion of the FPGA used for two midrange devices with approximately the same silicon area from the last two technology generations (the newer FPGA offers more resources per unit area, meaning a smaller, cheaper, lower power model can be chosen). The linear solver uses the majority of the resources in the MPC QP solver, while the MPC QP solver consumes substantially more resources than the target calculator, since it is solving a larger optimization problem (refer to Section II). Table IX also highlights the cost of using preconditioning.
Using Xilinx Power Analyzer it is estimated that the Virtex 6 LX240T FPGA-based system draws 9.75 W of power, of which 4.277 W is quiescent (e.g. leakage) and 5.469 W is dynamic. This is less than the 35 W thermal design power associated with the microprocessor in the laptop used for performance comparison in the following section. A smaller, low-voltage or more modern FPGA model can reduce power consumption (particularly leakage). Within the custom circuit, 3.00 W is used in the interior point QP solver and 0.12 W in the FGM target calculator.
VII. FPGA-IN-THE-LOOP CLOSED-LOOP PERFORMANCE
A closed-loop system with the FPGA in the loop, controlling the nonlinear model of the Boeing 747 from [32] is compared with running the complete control system on a conventional computer using factorization-based methods. The two QP solvers are first evaluated separately, and then trajectory plots of the closed-loop trajectories for the complete system are presented. The reference trajectory is continuous, piecewise continuous in its first derivative, and consists of a period of level flight, followed by a 90 • change in heading, then by a 200 m descent, followed by a 10 ms −1 deceleration.
A. MPC Regulator
Solution times and control quality metrics for the regulator QP solver are presented for a 360 s simulation, with N = 12 and N = 5 in Table X . Based on the results of Section V, for N = 12, the number of MINRES iterations per PDIP iteration I M R = 51. For N = 5, I M R = 30. This is higher than was empirically determined to be necessary; however, the architecture of the QP solver requires that the MINRES stage must run for at least as long as the sequential stage. The control accuracy metrics presented are
where u F (k) is the calculated control input, and u * (k) is the hypothetical true solution and the subscript · {i} indicates an elementwise index. Since the true solution is not possible to obtain analytically, the algorithm of [27] , implemented using MATLAB Coder, is used as a baseline. The metrics are presented alongside those for custom software QP solvers generated using current versions of CVXGEN [39] (for N = 5 only since for N = 12 the problem was too large to handle) and FORCES [40] (for N = 12 and N = 5). PC-based comparisons are made using double precision arithmetic on a laptop with a 2.4 GHz Intel Core 2 Duo processor. The code from CVXGEN and FORCES is modified to use single precision arithmetic and timed running directly on the 100 MHz MicroBlaze soft core on the FPGA for the number of iterations observed necessary on the PC. (Double precision floating point arithmetic would be emulated in software, and not provide a useful timing comparison.) While obtaining results useful for control from the single precision modification to the CVXGEN solver proved to be too challenging, the timing result is presented assuming random data for the number of iterations needed on the PC. The MicroBlaze used for the software solvers is configured with throughput optimizations, single precision floating point unit (including square root), maximum cache sizes (64 kB data and 64 kB instruction), and maximum cache line length (eight words).
For N = 12, the FPGA-based QP solver (at 250 MHz) is slightly faster than the PC-based QP solver generated using FORCES (at 2.4 GHz) based on wall-clock time but ≈ 10× faster on a cycle-by-cycle basis. It is also ≈ 65× faster than the FORCES solver on the MicroBlaze (at 100 MHz), which would fail to meet the real-time deadline of T s = 0.2 s by a factor of approximately 10. By contrast, the clock frequency for the FPGA-based QP solver could be reduced by a factor of ≈ 15 (reducing power requirements, and making a higher FPGA resource utilization factor possible), or the sampling rate increased by the same factor (improving disturbance rejection) while still meeting requirements. Worstcase and mean control error are competitive. A similar trend is visible for N = 5 with the FPGA-based solver only marginally slower than the CVXGEN solver on the PC in terms of wallclock time.
The maximum communication time over Ethernet, experimentally obtained by bypassing the interface with the QP solvers in the software component is 0.6990 ms. The values for FPGA-based implementation in Table X are normalized by subtracting this, since it is independent of the QP solver.
B. Target Calculator
The accuracy and computation time of the fixed-point FPGA-based target calculator is compared with double and single precision variants of the same algorithm generated using MATLAB Coder and with a solver generated using CVXGEN, in Table X . FORCES is not considered here since the target calculation problem does not have a multistage structure. While the accuracy of the FPGA-based implementation is slightly inferior to the other options (but as demonstrated by closed-loop simulation still adequate), the solution wall-clock time is faster than the fastest PC-based algorithm compared, and more than 200× faster on a cycle-per-cycle basis than running a single precision FGM directly on the MicroBlaze, and negligible in comparison with the MPC computation time.
C. Combined System
Trajectories from the closed-loop setup, with N = 12 for the PDIP-based MPC regulator, and the FGM-based target calculator running on the FPGA are shown in Fig. 6 . The reference trajectory is tracked, inputs constraints are enforced during transients, and the zero-value lower bound on the spoiler panels is not violated in steady state.
VIII. CONCLUSION
This paper has demonstrated the implementation of a system-on-a-chip MPC control system, including predictive control regulator and steady-state target calculation on an FPGA. A Xilinx MicroBlaze soft-core processor is used to bridge communication between the two custom QP solvers, and the outside world over Ethernet. The controller is tested in closed-loop with a nonlinear simulation of a large airliner-a plant with substantially more states and inputs than any previous FPGA-based predictive controller.
The MPC regulator employs a PDIP algorithm using single precision floating point arithmetic, with a preconditioned iterative method used to solve systems of linear equations. A numerical investigation shows that with preconditioning and the correct plant model scaling, a relatively small number of MINRES iterations is required to achieve sufficient control accuracy for this application. The steady-state target calculator uses the FGM implemented using fixed point arithmetic. This is economical in terms of FPGA resources, and is faster than an equivalent algorithm in floating point arithmetic on a laptop PC, while running at approximately 10% of the clock frequency. The whole system can fit onto a midrange FPGA, and uses less than 1/3 of the power of the laptop microprocessor with which it is compared. Lower clock frequencies could be used to further reduce power consumption further, while still meeting real-time control deadlines.
