Abstract: Model Predictive Control (MPC) is increasingly being proposed for application to miniaturized devices, fast and/or embedded systems. A major obstacle to this is its computation time requirement. Continuing our previous studies of implementing constrained MPC on Field Programmable Gate Arrays (FPGA), this paper begins to exploit the possibilities of parallel computation, with the aim of speeding up the MPC implementation. Simulation studies on a realistic example show that it is possible to implement constrained MPC on an FPGA chip with a 25MHz clock and achieve MPC implementation rates comparable to those achievable on a Pentium 3.0 GHz PC.
INTRODUCTION
Model Predictive Control (MPC) has become an established control technology owing to its capability of solving problems with physical constraints. Its applications were originally in the petrochemical industry, but its use has now been proposed for a variety of industries, including high bandwidth applications, such as ships (Perez et al., 2000) , aerospace (Richards et al., 2003) , road vehicles (Morari et al., 2003) and also in micro scale devices (Bleris et al., 2005) .
Basic MPC solves a quadratic programming (QP) or a linear programming (LP) problem. Its successful application depends on the ability to generate a feasible solution within one sampling period. One approach to achieving MPC for high speed embedded applications which are constrained by computational time is through hardware acceleration of the MPC implementation.
Reconfigurable hardware such as Field Programmable Gate Array (FPGA) is a promising platform for deploying embedded MPC solutions to applications which require flexibility and short design cycle. The encapsulation of the constrained MPC algorithms as suitable modules for embedded control has been investigated in (Ling et al., 2006) , where a Handel-C implementation of an MPC algorithm was described and realized on a modest Xilinx Spartan 3L(XC3S1500L-4-fg320) FPGA. The computation carried out in that FPGA design was sequential without exploiting the parallel computing capability of an FPGA. This paper focuses on improving on our earlier work by exploiting the opportunities for parallel computation.
The paper is organized as follows: A review of the constrained MPC formulation and the Interior Point algorithm for solving the resulting Quadratic Programming (QP) problem is presented in Section 2. The prototyping environment for this study, as well as an illustration of how the Handel-C language could be used to explore opportunities for parallel computations of the MPC algorithm through better use of the available FPGA resources, is briefly described in Section 3. The speed up improvement of the new implementation is verified through an aircraft simulation example in Section 4.
REVIEW OF CONSTRAINED MPC AND THE INTERIOR POINT METHOD

Review of Constrained MPC
Constrained MPC can be formulated as a QP problem. For simplicity, given a single input single output discrete linear time-invariant plant in the state space form,
and ( ) x k are the system output, input and internal states respectively. The constrained MPC problem is to minimize the cost function,
subject to linear inequality constraints on the system outputs, inputs and states. Here, ω is the set-point; P N and u N are the prediction and control horizons respectively.
We follow the standard approach (eg Rao et al. 1998 , Maciejowski 2002 ) of eliminating the equality constraints Here, Q is a symmetric Nu × Nu matrix and J is mc × Nu in size, where mc is the total number of inequality constraints. The infeasible Interior Point (IP) method (Rao et al, 1997) was adopted for solving the QP problem and implemented in a FPGA. The IP method can be derived from the wellknown Karush-Kuhn-Tucker (or KKT) conditions, which are necessary and sufficient conditions for characterising the global optimum, providing that Q ≥ 0:
Equations (5) and (6) are called feasibility conditions while (7) is called the complementarity condition.
The infeasible interior point method perturbs the complementarity condition in (7) with the following scalar
where k is the iteration counter and c m is the number of inequality constraints in (4). As the iteration proceeds, the infeasibility and k μ are gradually reduced to zero.
The Interior Point Method
According to the infeasible interior point framework introduced by (Wright, 1997) , the QP can be solved iteratively as follows:
Step 1: Choose an initial condition (z 0 ,λ 0 ,t 0 ) with (λ 0 ,t 0 )>0.
Step 2: At the k-th iteration, solve for the increments
Step 3: Increment the variables according to:
for some α k ∈ (0, 1], subject to (λ k+1 , t k+1 ) > 0.
Step 4: Judge the convergence. If the iterations have converged, stop the iterations; the optimal control sequence is obtained as
Step 2 with
It is clear that Step 2 of the interior point algorithm is where most of the computation occurs and hence optimizing the computational efficiency of step 2 should be the most fruitful. In most embedded MPC application, the number of constraints c m is usually much larger than the number of variables u N -note that the same constraint at different points in the prediction horizon appears as several constraints in (4). In this case, it can be computationally advantageous to solve (9) in two stages as follows:
Note that k Γ is a diagonal matrix.
Using (14), (10) can also be written as ( )
17th IFAC World Congress (IFAC'08) Seoul, Korea, July 6-11, 2008 
PROTOTYPING ENVIRONMENT FOR MPC-ON-A CHIP
The rapid prototyping environment for optimization of the "MPC on a Chip" design is shown in Fig. 1 . The plant simulation is implemented in MATLAB while MPC is implemented either in MATLAB or on a FPGA prototyping board. This prototype environment provides flexibility and allows convenience in experimentation and verification of the various FPGA implementations of the MPC algorithm.
Fig. 1 Prototyping of MPC on a Chip
The tools used in this study include a RC203 FPGA board, the DK Design Suite from Celoxica and Matlab/Simulink software from The Mathworks.
The RC203 is a platform for evaluation and development of FPGA-based applications. The platform include a Xilinx Virtex-II FPGA, external memory, programmable clocks, Ethernet, Audio, Parallel port, RS-232 etc. The FPGA used in this study is XC2V3000, which has three million logic gates. The XC2V3000 has 96 multiplier blocks, 96 block of 18K on-chip RAM and a maximum 720 I/O pads.
For hardware implementation on an FPGA, the constrained MPC algorithm is coded in Handel-C which is a C-like language for digital logic design. Programming in Handel-C allows one to focus on algorithm design without worrying about how the underlying computation engine works. As an illustration, Fig.4(a) and Fig.4(b) show, respectively, fragments of the Handel-C code for the sequential and parallel implementation of the computation of Fig. 4(a) , the computation would require 9 clock cycles, since in our implementation, one floating-point multiplication and addition would require 3 clock cycles each to complete.
The opportunity for parallelism depends very much on the data dependency of the computation tasks. Variables that do not have any data dependency could be computed simultaneously provided additional hardware resources are available. As illustrated in Fig. 4(b) , if an additional copy of the floating point multiplier hardware is created on a FPGA, then the multiplications can be performed in parallel and the clock cycles needed are reduced to 6. During the time when the multiplications are carried out, the adder hardware is left idle and can be deployed for other computational tasks (not shown here). In our current implementation, we achieved a degree of parallelism by ad hoc programming. In particular, the execution of some FOR loops were combined and parallelised when solving (14).
The Handel-C implementation of the MPC algorithm was compiled into a bitstream file which was then downloaded to the RC203 board to configure the FPGA chip to perform the constrained MPC calculations. The data exchange between the PC which runs MATLAB and the FPGA board which carries out the MPC calculations was achieved via the serial link.
RESULTS
Aircraft Model and Constraints
To verify the implementation of our FPGA implementation of the constrained MPC algorithm, the Cessna Citation 500 Aircraft example (Example 2.7) from Maciejowski (2002) 
The model has the elevator angle (rad) as its input, and the pitch angle (rad), altitude (m) and altitude rate (m/s) as outputs. The elevator angle is limited to ±15° (±0.262rad) and the elevator slew rate is limited to ±30°/s (±0.524rad/s). These are limits imposed by equipment design and cannot be exceeded. For passenger's comfort the pitch angle limit is limited to ±20° (±0.349rad)
A MPC controller was designed with a sampling interval of 0.5s, N p = 10, and N u = 3. The following constraints, We implemented our MPC on the Xilinx FPGA using the IEEE single precision arithmetic (8-bit exponent and 23-bit mantissa). Two FPGA implementations of the constrained MPC algorithms were created: one implemented the computations sequentially while the other exploited the opportunities for parallel computation. The aircraft model was simulated in MATLAB (Version 7.1) on a Windows XP PC (CPU P4 3.0GHz with 1GB RAM) and controlled by the FPGA implementation of MPC on the RC203 board. The controller and plant interacted through the serial link.
Simulation Results
Simulations were conducted based on two scenarios which corresponded to Case 2 and Case 3 of (Maciejowski, 2002) . These two scenarios were chosen because they resulted in QP problems with different numbers of inequality constraints: 32 for Case 2 and 52 for Case 3.
We tested several versions of the constrained MPC implementations using the prototype environment described in Section 3. In order to create a benchmark for our FPGA implementation of MPC, a version of the constrained MPC algorithm implemented in MATLAB using its standard quadprog function (which solves QP problems using the active set method) was created. Also, the interior point method described in Section 2 was coded as an m-function in MATLAB. The timing of these two implementations running in the interactive environment of MATLAB was recorded. We also used the "mcc -m" command available in MATLAB Compiler to obtain stand-alone executables of these two MATLAB implementations. We recorded their timings. Table 1 compares the results. It can be seen that the parallel implementation of MPC in the FPGA was approximately twice as fast as the sequential implementation. Although the computational time of the Interior Point (IP) implementation increases with the number of inequality constraints, the computational speed achieved by exploiting parallelism using an FPGA is comparable to that of the stand alone executable running on a Pentium4 3.0GHz CPU. Table 2 presents the resource requirements and the clock counts measured for the two FPGA implementations. The clock rate achieved was 25MHz. The clock counts needed to complete one iteration of the Interior Point (Step 2 only) was 7564 for the sequential implementation and 3088 for the parallel implementation; a saving of 4476 and 6920 in case 2 and 3 per iteration step. The reduction in the clock counts was achieved at the expense of a modest increase in the hardware resources needed for the parallel implementation: 2 additional Arithmetic Logic Units (ALU) --which are 18x18 multipliers, 20% more Look-Up Tables  (LUT) , and 2% more Flip-Flops (FF). Nevertheless, it should be noted that even with these increased resource requirements; there are still unused resources on the FPGA chip that can be further exploited. The response of the aircraft for each of the simulation scenarios is shown in Figure 5 (a) and 6(a), respectively. For Case 2, the 400m step change in altitude set-point resulted in the pitch angle constraint being active during the transient, while the rate of change of altitude was unconstrained. For Case 3, the rate of change of altitude was also constrained, and the simulation result showed that the altitude rate constraint is active during most of the transient while the pitch angle constraint is only active at the beginning of the transient.
Solving a constrained MPC problem is a repetitive process. At each MPC sampling instant, a new QP problem is created which requires several Interior Point iterations to obtain the solution. The number of IP iterations required varies and depends on factors such as the set of active constraints.
Figure 5(b) shows the plot of the number of IP iterations and clock counts recorded at every MPC sampling instants for † For step 2 in the IP algorithm, one iteration only 17th IFAC World Congress (IFAC'08) Seoul, Korea, July 6-11, 2008 Case 2. It can be seen that when the constraints become active from 4s to 9s, both the number of iterations and clock counts increase dramatically. After the transient they fall to a relatively low number. The most difficult QP problem occurred at about t=5s. Comparing the results of sequential and parallel implementations, it can be seen that although the trends are similar, the clock counts needed in the latter was reduced by nearly half. The corresponding plots of the number of iterations and clock counts for Case 3 are given in Fig. 6(b) . The trends of iterations and clock counts are similar to those in Figure 5 (b). For Case 3, the most difficult QP problem occurs at t=9s, which takes more than 30 IP iterations to solve. Table 3 lists the average number of Interior Point iterations (at each step) for Cases 2 and 3. Such information may be useful in determining an appropriate sample rate for an embedded MPC application. has a slope which is in between the sequential FPGA implementation and that of Pentium CPU. However, as shown in Fig. 8 , when the number of optimisation variables increases, the increase in the computation time varies faster than linear and the Pentium CPU outperformed the FPGA implementation. The reason could be that the current FPGA implementation uses a straightforward Gaussian elimination method to solve the resulting Ax=b problem in Step 2 and hence less efficient in handling larger Ax=b problems. This issue is a current subject of investigation.
CONCLUSION
In this paper, two FPGA implementations of constrained MPC were described and evaluated. The parallel implementation outperformed the sequential implementation in terms of computation speed, at the expense of a modest increased in the hardware resource requirement.
This paper continues our investigation of the potential benefits of using FPGAs to implement MPC. There is still much detailed analysis to be done, which should indicate further possibilities for achieving faster implementations. More study of the hardware trade-offs is needed. It is noted that there are still unused hardware resource available which can be further exploited. A more systematic method to exploit these unused resources is a subject of current research. This should allow MPC to be used in application areas where the computational load has been considered too great until now, such as UAVs, automobile control systems and gas turbine control.
