Abstract-A reconfigurable field-programmable gate array (FPGA)-based predictive controller based on Nesterov's fast gradient method is designed using Simulink and converted to VHDL using Mathworks' HDL Coder. The implementation is verified by application to a spacecraft rendezvous and capture scenario, with communication between the FPGA and a simulation of the relative dynamics occuring over Ethernet. For a problem with 120 decision variables and 240 constraints, computation times of 0.95 ms are achieved with a clock rate of 50 MHz, corresponding to a speed up of more than 2000 over running the algorithm directly on a MicroBlaze microprocessor implemented on the same FPGA.
I. INTRODUCTION
The archetypal predictive controller (e.g. [1] , [2] ) solves a constrained, finite but receding horizon optimal control problem online at each time step. In its most prevalent forms, this problem is cast as a constrained convex quadratic program (QP) or a linear program (LP). The solution of these has a higher computational burden than linear feedback and this must be accommodated in the overall system design.
FPGAs are programmable silicon chips that allow implementation of custom digital circuits. In contrast to conventional microprocessors, the designer can choose the levels of parallelism and numerical precision at each stage in a numerical algorithm and trade these against hardware resource usage. Parallellism can accelerate the solution of constrained optimisation problems to meet real-time requirements whilst running at comparatively low clock frequencies.
The present paper demonstrates the implementation of a field programmable gate array (FPGA)-based predictive controller applied to the terminal phase of a spacecraft rendezvous and capture mission, as outlined in Section II.
FPGA design can be carried out in a number of ways. These include: directly using a register transfer language (RTL) (e.g. VHDL, Verilog), writing a custom generator to create RTL source from a set of parameters (e.g. [3] , [4] ), using a high level C-like language (System C, Vivado High Level Synthesis), or graphical methods (e.g. Xilinx System Generator for DSP, Mathworks HDL Coder). Whilst the former allow the most fine-grained control over the design, the latter are more accessible to the control systems designer, who may be an expert with MATLAB and SIMULINK, but only marginally familiar with VHDL or Verilog. Graphical visualisation of the design also helps its documentation and communication, ease of maintenance and re-use. The main focus of this paper, forming Section III, is the modelling of the circuit design using SIMULINK, to be translated to VHDL using Mathworks' HDL Coder. In [5] , an interior point QP solver was implemented solely using builtin SIMULINK blocks to enable automatic code generation and to assist verification and validation processes. Elsewhere, SIMULINK Coder/MATLAB Coder (née Real Time Workshop/Embedded MATLAB (EML) ) have been used to compile M-code implementations of custom QP/LP solvers to C to accelerate simulation and simplify deployment (e.g. [6] - [8] ). For FPGA synthesis the process is rather more complex. It is insufficient to generate RTL from the same M-code used for simulation or C generation. Register level timing, numerical representation and parallelism of computation must also be considered to obtain a design suitable for an FPGA. The design from Section III is implemented on a Xilinx ML605 evaluation board, and results are presented in Section IV.
II. BACKGROUND

A. Control scenario
The terminal phase of a spacecraft rendezvous and capture scenario is considered, based on the Mars Sample Return mission [6] , [9] , [10] in a circular orbit with radius of 3893.4 km. The linearised Hill-Clohessy-Wiltshire equations (e.g. [11] , [12] ) are used for the prediction model, the controlled spacecraft has a nominal mass of 1575 kg.
As in [13] , the chaser spacecraft nominally starts at zero velocity, 200 m behind a passive target and 0.0767 m below the x-axis (positive z) in a cartesian reference frame centred on the target. The objective is to track a step increase in velocity in the x direction (towards the target) to 0.2 ms −1 until the separation has reduced to 100 m and then reduce to 0.1 ms −1 whilst holding 0.0767 m below (positive value) the z-axis, and null out-of-plane (y-axis) separation. At ≤ 3 m separation, the remainder of the manoeuvre must be passive and the chaser trajectory will naturally drift upwards to intercept the x axis. Navigation error is modelled as Gaussian and white, with 3σ values of [0.0247, 0.0247, 0.009, 0.007, 0.007] on [y, z,ẋ,ẏ,ż] respectively [6] . As well as a minimum impulse bit, below which differential thrust is used to deliver the net commanded value, the thrust on each axis is subject to a multiplicative uncertainty with σ = 0.01.
An ℓ 1 -regularised quadratic (ℓ asso ) cost function [13] - [16] is used to engender sparsity in the input trajectory, thus avoiding lengthy periods of continuous low-level thrust. At each time step k, the MPC minimises the cost function
subject to the predicted plant dynamics and any input or state constraints, where R > 0 is diagonal and Q ≥ 0. In the present design there are no state constraints, and inputs are bounded symmetrically: −u max ≤ u i ≤ u max , where u max = [8, ..., 8] T (in Newtons). The 1-norm of R λ u i is encoded by letting R λ > 0 be diagonal, and splitting u i into positive and negative components on each of 3 axes, so that u i = u
T and to ensure strong convexity, replace the input contribution to the stage cost with u
Due to diagonality of R and positivity ofũ i , the optimal u i is equal to that of the original problem. The resulting constrained optimisation can be posed as a dense parametric QP
, and the values of H, F x , F r , l, θ min , θ max follow from condensing out the state variables in the usual way (e.g. [1] ). The tunings are obtained as described in [13] , aiming for a capture tolerance of 7.5 cm (tighter than the 20 cm required by the MSR mission).
In the present scenario, N = 20, andũ i ∈ R 6 . Therefore, after condensing, the number of decision variables, n θ = 120. A sampling period of T s = 1 s is used. To avoid magnification of navigation error it is desirable to not allow a full unit delay for computation. A QP of this size is easy to solve in milliseconds on a contemporary desktop computer, but in space environments computer clock rates are between one and two orders of magnitude slower due to radiation hardening and power consumption requirements. Through parallelisation, an FPGA can achieve similar times whilst clocked at a few tens of MHz. The motivation for using the FPGA in this application is to get adequate computational speed at relatively low clock rates.
B. Optimisation algorithm
Nesterov's fast gradient method (FGM) (e.g. [17] , [18] ) is chosen, since only input constraints are considered in the presented control scenario and this class of algorithm is divisionfree, with a relatively straightforward data flow, offering natural opportunities for parallelism and pipelining. It is also demonstrably well-behaved with fixed-point arithmetic [4] , [19] . Thus it lends itself well to FPGA implementation using the design methodology presented. The algorithm is shown in Fig. 1 for the case of a bound constrained QP, where θ is the decision variable, the superscript • (k) indicates element of a sequence, and L is the largest eigenvalue of H. The value β is taken to be
and µ is the smallest eigenvalue of H.
III. DESIGN A. Scaling and data types
As highlighted in [17] , [18] , the QP can be scaled to minimise the condition number of H to improve convergence.
End for Alternatively [20] , [21] proposes a diagonal preconditioner, originally intended for an interior point QP solver with an iterative MINRES linear solver which puts the numerical values of a matrix into the range [−1, 1] and spec(H) ⊆ (0, 1]. The latter is used in the present implementation, since it simplifies the positioning of the fixed point, and also means that L = 1 (Fig. 1, Step 2), therefore eliminating a multiplication by 1/L from the final circuit design. To avoid additional notation, we henceforth use H to mean the Hessian term after the scaling has been applied. The choice of fixed point data type is driven by the solution accuracy required, the dynamic range of the numerical values and the hardware available. The Xilinx Virtex 6 FPGA which will be targeted has a number of dedicated 25 × 18-bit hardware multipliers (DSP48Es). The data types used (shown in Table I ) are chosen based on the FPGA resources, empirical testing and knowledge of the bounds of variables, rather than an aim to use the optimum number of resources for a specified accuracy. By application of [20] , [21] only one integer bit is needed for H. Similarly, β ∈ [0, 1]. During matrix multiplication, elements of H multiply elements of y (k) therefore it is convenient for one of these to be (a multiple of) 18 bits long and the other to be (a multiple of) 25 bits long to optimally use the DSP48E resources. y (k) has a greater dynamic range, so therefore takes the longer word length. Noting that for this scenario θ min is a vector of zeros, y (k) and θ (k) will be bounded above by max(θ max ). After scaling, max(θ max ) = 1.5453 so at least 1 integer bit is needed.
for β ∈ [0, 1], so y must have at least one additional integer bit totalling 2 integer bits. β takes the same data type as H, and we let θ take the same type as y with 2 integer bits and 22 fractional bits. The data type ofθ (k) follows from allowing SIMULINK to use its internal rules rather than forcing a data type or using backwards propagation. Since data types are chosen based on pre-determined bounds, all summations and products are set to round to "Floor" and to "Wrap" on overflow to minimise hardware resources used. Experiments using the MATLAB fixed-point toolbox verify sufficient accuracy for the application. Analysis of convergence with fixed point arithmetic is beyond the scope of this paper; for a formal theoretical framework the reader should consult the work of [4] .
B. Fast gradient method implementation
The data-flow through the circuit is modelled at a clockcycle level using SIMULINK. Counters and state machine logic are implemented using the "MATLAB Function Block", from which HDL Coder can also generate VHDL or Verilog (we use the former). It is designed so that rather than being hard-coded, the QP matrices can be loaded upon initialisation. (This is not repeated for each QP solution.)
1) External interface:
The QP solver presents five input ports and two output ports (Table II) .
2) Control logic: The main control logic consists of two counters and a switch. The first counter is the iteration number. On the final iteration (determined by the input ITERMAX IN) the boolean signal LASTITER is set high. The second is the element of the vector y and/or row of H currently being addressed. When an element is being addressed the output TRIG is set high. This pulse passes through a series of shift registers with a delay of the same length as the numerical component of the circuit, allowing determination of when the iteration has been completed when it returns as input we y (where "we" stands for "Write Enable"). The same happens with the read address -each subsystem outputs a delayed copy of this to enable memory instances in the next subsystem to be presenting the output from the correct memory location at the correct time. The delays take into account the latency of the RAMs later in the pipeline, avoiding extra delays in the main data path. Each of these blocks is implemented in a MATLAB function block, with persistent variables used to model registers storing data between time steps (Fig. 3) .
The switch simply determines whether data from outside the circuit, or estimate of y from the previous iterate is passed 3) Gradient calculation: The gradient calculation (Step 1 in Fig. 1 and Fig. 5 ) comprises a matrix-vector multiplication followed by a vector-vector addition. This is performed in a pipelined manner, with a (maximum) throughput of one dot product of a matrix row with the vector per clock cycle. The version of HDL Coder used (R2012b) does not support matrix valued signals. Instead, the (scaled) matrix H is explicitly stored as an array of single-port RAMs (generated using the template in hdldemolib) each containing a column. All of these (i.e. the matrix row) are indexed by the same signal. Since the size of H depends on the problem formulation, a self-modifying masked subsystem is created. To load data into H, switching logic (not shown) is included so that when "WE" is high, the address is taken from a pair of resettable counters (determining the row address, which is connected to all RAMs (columns), and determine which column to set Write Enable high for) rather than from RADDR before it is finished with.) The parallel element-by-element multiplication of each row of H with y (k) only requires a single SIMULINK block. The data type is allowed to propagate via internal rule at this point. The elements are then added together (as in the matrixvector multiplication in [3] ) by an adder reduction tree. The latter is generated using the SIMULINK "Sum of Elements" block, configured to use a "Tree" architecture (in HDL Block Properties) in series with a delay block. By placing this in a subsystem with HDL Coder's "Distributed Pipelining" enabled for that subsystem, the delay is then automatically distributed between the stages of the tree in the generated VHDL. Hy (k) is then converted back to the same data type as y (k) and will not overflow because spec(H) ⊆ (0, 1]. 4) Gradient step: The gradient step (Step 2 in Fig. 1 and Fig. 6 ) employs a second copy of y (k) stored in a simple dual port RAM (again with an additional counter inside the subsystem for the write address) and a subtractor. The dual 
5) Projection:
The projection stage (Step 3 in Fig. 1 and Fig. 7 ) employs two single port RAMs, two comparators in parallel, and a multiport switch indexed by word obtained by concatenating the bits from the comparators.
6) Calculate difference:
The next stage (Step 4 in Fig. 1  and Fig. 8 ) employs a further dual port RAM (with counter) which buffers y (k) and a subtractor. 7) Update y: The final stage (Step 5 in Fig. 1 and Fig. 9 ) contains a register (containing β), a multiplier and an adder.
C. Calculation of f
Vector f is a function of contant matrices F x , F r , vector l, and varying parameters x and r. LettingF = [F x , F r , l], and
T , f =Fx -another matrix multiplication. This is implemented as a separate block outside of the main Fig. 8 . Calculate difference Fig. 9 . Update y algorithm (not shown for space reasons). To simplify the external interface, the first three input ports of the FGM block are replicated (although DIN is here a 32-bit unsigned integer, extended to 41-bit before being re-output), and passed through modified or unmodified depending on the value of LOADMODE. The vectorx is input in serial as a 32-bit signed data type with 20-bit fraction length.F is stored as a sequence of its row vectors with a 25-bit signed data type with 20-bit fraction length. The matrix-vector multiplication is performed serially since the vector is only of length 11 (5 states, 5 references and a constant of unity) and this only happens once per QP. When complete, the "TRIG" output, connected to the FGM state machine, is raised. The 32 × 25 bit multiplication requires two DSP48E units. The 32-bit word is manually split into two parts and long multiplication carried out with a pipeline stage in-between. The latency of calculating f is (after loading) (n θ × 11) + 9 cycles, where n θ is the number of decision variables.
D. General design considerations
Addition, subtraction, comparison and switching are chained with a delay of one cycle. Each multiplier is chained with a pipeline delay of two cycles, since the DSP48E units contain two built in pipeline stages. Whilst HDL Coder generates generic VHDL, this construct is inferred by the Xilinx ISE 14.4 toolchain used to synthesise and implement the generated design. The total latency of a single FGM iteration is n θ + log 2 (n θ ) + 19. (For n θ = 120, the main path latency is 24, with an extra delay of 2 cycles in the iteration counter.) HDL Coder is configured so that any "reset" signals are synchronous and active high. All delay blocks and "MATLAB function" blocks are configured with reset type "none" to allow the Xilinx toolchain to efficiently synthesise the design.
The resulting VHDL synthesises with an estimated feasible operating clock frequency of 261 MHz on the Virtex 6 LX240T FPGA on the ML605 Evaluation Board targeted with default settings in the Xilinx ISE toolchain. Whilst for the proposed application this is inappropriately high, there is flexibility (due to pipeline registers) to route the design in a smaller FPGA, or around another circuit on the same FPGA.
IV. INTEGRATION AND TESTING
The designed circuit is connected as a peripheral core to a Xilinx MicroBlaze soft core processor in a similar way to [22] (although here, the generated VHDL code is imported into Xilinx System Generator for DSP, which is used to generate the memory-mapped interface logic). The MicroBlaze has two tasks. The first is as a convenient bridge to the outside world using UDP over 100 MBit/s Ethernet. The second is to initialise the QP matrices with their operational values, by loading them into the DATA IN port with an appropriate schedule of values of LOADMODE. The FGM circuit and the MicroBlaze are clocked at 50 MHz.
FPGA resource usage is presented in Table III , based on synthesis estimates. The design comfortably fits inside the smallest Virtex 6 FPGA (V6 LX75T). The register and lookup-table (LUT) usage of the custom FGM circuit is small compared with that used by the MicroBlaze, whilst the Block RAM and DSP48E usage is substantially greater.
For a nominal scenario with measurement and thrust uncertainty, but the simulation model otherwise matching the prediction model including an assumption of instantaneous control on sampling, a comparison of the FPGAbased solver with EML-based FGM implementations and an FGM solver synthesised using FiOrdOs (version R20121210) [23] is presented in Table IV (all FGM solvers carry out 300 iterations). The loss of accuracy for the fixed point implementations is small. The FGM/EML solver on the desktop is faster than FiOrdOs due to use of BLAS (without this enabled, they are comparable) but slightly slower on the MicroBlaze since BLAS is not used. The fixed point software implementations are slow due to use of multi-word arithmetic. The MicroBlaze has a single precision hardware floating point unit -if emulation is used instead, the QP takes almost 40 seconds. The computation time for the FPGA-based design is more than 2000 times faster than the fastest software implementation on the MicroBlaze. The FPGA clock frequency could be halved or even quartered, whilst keeping computation time insignificant with respect to T s . However, communication time is significant for the FPGA-based implementation and other interfacing technologies would be more appropriate if used with a real plant. To verify the control system, a Monte-Carlo simulation with the FPGA-based controller in the loop (communicating using UDP/IP over Ethernet) is run from a range of initial conditions surrounding the nominal condition. Mismatch is introduced into the simulation model parameters. The seed used for thrust and navigation error is also varied. The initial position, velocity, orbital radius and mass are sampled from uniform distributions of range (±10 m, ±0.3 ms −1 , ±50 km, ±235 kg) respectively. Figure 10 shows the position in the y − z plane at capture for each of the simulations and histograms of capture time and cumulative impulsive velocity change (∆V ). All are well within the 20 cm tolerance considered in [6] . All are within a 10 cm tolerance with 99.55% within 7.5 cm of the origin.
V. CONCLUSIONS
This paper has presented the design of a constrained MPC controller with a ℓ asso cost on an FPGA, using fixed point arithmetic and implemented using SIMULINK and HDL Coder. The design was tested on an FPGA evaluation board, connected to the outside world using a MicroBlaze soft core processor and UDP/IP over Ethernet. The system is demonstrated in a closed-loop with a simulated plant, and delivers the solution within 1.91 ms (of which 0.96 ms was communication) despite a clock frequency of only 50 MHz.
