Abstract: Model predictive control (MPC) is an optimization-based strategy for high-performance control that is attracting increasing interest. While MPC requires the online solution of an optimization problem, its ability to handle multivariable systems and constraints makes it a very powerful control strategy, especially for embedded systems, which have an ever-increasing amount of sensing and computation capabilities. We argue that the implementation of MPC on field programmable gate arrays (FPGAs) using automatic tools is nowadays possible, achieving cost-effective successful applications on fast or resource-constrained systems. The main burden for the implementation of MPC on FPGAs is the challenging design of the necessary algorithms. We outline an approach to achieve a software-supported optimized implementation of MPC on FPGAs using high-level synthesis tools and automatic code generation. The proposed strategy exploits the arithmetic operations necessary to solve optimization problems to tailor an FPGA design, which allows a tradeoff between energy, memory requirements, cost, and achievable speed. We show the capabilities and the simplicity of use of the proposed methodology on two different examples and illustrate its advantages over a microcontroller implementation.
I. INTRODUCTION
THE amount of required and available sensing and computation capabilities of existing and newly designed systems is growing rapidly. In this paper, we focus on model predictive control (MPC) [1], which is an optimization-based advanced control strategy that uses a mathematical model to predict the future behavior of a system. Fig. 1 shows the central idea of MPC: the predictions of the model states (x_k) at each sampling time k are used to obtain a sequence of control input vectors (u_k) that minimizes a desired performance criterion and satisfies the required constraints. The application of MPC demands the real-time solution of a numerical optimization problem. Therefore, its applications have traditionally been limited to slow systems, e.g., chemical plants or petrochemical systems. However, due to the significant advances in tailored algorithms and hardware, MPC is becoming increasingly interesting for very fast embedded and resource-limited systems. Some examples include industrial electronics applications [2], [3] such as inverters [4]-[6], rectifiers [7], matrix converters [8], or engine control [9].
New hardware architectures have been studied for the implementation of MPC [10], including programmable logic controllers [11], low-cost microcontrollers [12], [13], field programmable gate arrays (FPGAs) [14]-[16], and application-specific integrated circuits (ASICs) [17]. The choice of hardware architecture is often a tradeoff between cost, energy consumption, and required performance.
Advances in FPGA technology have led to inexpensive devices with increasing digital resources. This significantly widens the spectrum of potential applications. Moreover, FPGA technology enables the optimized use of parallel calculations as well as ad-hoc digital hardware development, increasing the performance to levels that are not achievable using any fixed-architecture implementation. As a result of this development, FPGA technology has become an alternative for the implementation of MPC controllers [18]-[20] due to its high performance and cost effectiveness, enabling applications at megahertz rates. For cost-efficient solutions, the system can be ported to an ASIC.
Although the theoretical issues of MPC have been studied in depth [21], there are still many challenges for the fast and economically viable implementation and use of MPC. Among those, the effort to design, optimize, and implement such a controller is one of the most relevant. In order to enable a rapid and economic FPGA implementation, this paper proposes a design and optimization procedure using high-level synthesis (HLS) [22], [23], which has proved to be an effective way to implement industrial controllers [24], [25]. In comparison to [26], which also presents the use of HLS tools for embedded optimization, we present explicitly how to tailor and optimize the FPGA implementation, paying special attention to the arithmetic operations of MPC controllers. This allows different tradeoffs between energy, speed, and memory requirements to be considered, and allows us to provide several guidelines for designers. The main benefit of the proposed design workflow is the fast testing of different designs, which enables a time-effective analysis of a large set of FPGA implementations that could not be carried out manually.
We focus here on the software-supported implementation of predictive controllers on FPGAs, mainly because of their high performance, cost effectiveness, and flexibility. Currently, there are tools for the implementation of MPC that generate simple code that can be used, e.g., on microcontrollers, see CVXGEN [27] or ECOS/QCML [28]. The use of FPGAs for efficient MPC implementations has been presented, e.g., in [29]. However, there the FPGA design is performed manually. While a toolbox for FPGA prototyping is presented in [26], there are currently no available tools to automatically design and optimize an FPGA implementation of MPC starting from a high-level description of the control problem, which would enable the application of MPC techniques by nonexperts in the field. This represents the main challenge for an optimized use of MPC on FPGAs. As shown in this paper, this challenge can be overcome efficiently.
This paper is organized as follows. Section II gives a brief overview of the MPC problem formulation and the basic algorithms that are studied for implementation. Section III details the proposed workflow for an efficient and cost-effective FPGA implementation of MPC using HLS as well as the optimization procedure used. Section IV presents the design examples, including their formulation, implementation, and optimization process, and the achieved results. Section V summarizes the main conclusions of this paper.
II. MPC PROBLEM

A. Formulation and Basic Solution Algorithms
MPC is based on the repeated solution of an optimization problem at each sampling time. A mathematical model is used to predict the future behavior of the system up to a given prediction horizon, and a sequence of optimal control inputs is obtained by minimizing the chosen cost function subject to given constraints (see also Fig. 1). For embedded systems, usually linear models are considered, which represent a linear system or a linearization of the actual nonlinear system, and can be represented as
x_{k+1} = A x_k + B u_k    (1)
where x ∈ R^{n_x} denotes the states, u ∈ R^{n_u} represents the control inputs, and A and B are the system matrices.
Usually, a quadratic cost function with positive-semidefinite weight matrices and affine constraints are considered so that the optimization problem that needs to be solved at each sampling time is convex and has the form minimize
Here, Q, R, P are tuning parameters of the cost function which penalize state and input deviations. F, G, c_min, c_max define the constraints that are considered for the control task, while N is the prediction horizon. Embedding the initial condition x_0 and the linear dynamics (1) in the formulation of the cost function and constraints, the optimization problem can be transformed into a so-called condensed formulation [30]. The condensed problem only has the control inputs as optimization variables, which is beneficial for embedded implementations. The resulting equivalent formulation of the optimization problem (2) can be expressed in condensed form as
Many efficient algorithms have been proposed in recent years to solve (3). First-order methods for solving the quadratic program (3) are presented, e.g., in [12] and [31]. Second-order approaches are provided in [32] and [33]. We focus on the fast gradient method (FGM) [34], which can be used when only input constraints are considered, combined with an augmented Lagrangian method (ALM), which allows state constraints to be handled as proposed in [31]. We review this approach in the following.
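The elimination of the states in the condensed formulation can be illustrated with a minimal Python sketch. It assumes linear dynamics of the form x_{k+1} = A x_k + B u_k and a cost made of quadratic stage terms weighted by Q and R plus a terminal term weighted by P; the exact scaling of the cost in (2) may differ. The function names `rollout` and `condensed_cost` are hypothetical and are not part of any tool mentioned in this paper. Constraints are omitted.

```python
def matvec(M, v):
    """Dense matrix-vector product using plain lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def quad(M, v):
    """Quadratic form v^T M v."""
    return sum(v_i * w_i for v_i, w_i in zip(v, matvec(M, v)))


def rollout(A, B, x0, u_seq):
    """Predict x_1..x_N from x_0 and the input sequence (assumed linear model)."""
    xs = [list(x0)]
    for u in u_seq:
        ax, bu = matvec(A, xs[-1]), matvec(B, u)
        xs.append([a + b for a, b in zip(ax, bu)])
    return xs


def condensed_cost(A, B, Q, R, P, x0, u_seq):
    """Cost written as a function of the inputs only; the states are eliminated."""
    xs = rollout(A, B, x0, u_seq)
    stage = sum(quad(Q, x) + quad(R, u) for x, u in zip(xs[:-1], u_seq))
    return stage + quad(P, xs[-1])
```

Because every predicted state is an affine function of x_0 and the inputs, the only free variables left in `condensed_cost` are the entries of `u_seq`, which is what makes the condensed problem attractive for small embedded targets.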
For simplicity of notation, in the remainder of this paper, we denote the cost function of the considered optimization problem (3) as f_0(u, x_0) and the equality constraints as f_c(u, x_0) = V u − v(x_0) (inequality constraints are transformed into equality constraints by introducing slack variables). As is common in the field of numerical optimization [30], the augmented Lagrangian L(·) is defined as
Here, λ_l is the Lagrange multiplier associated with the constraint l, μ is a tuning parameter of the regularization term, and n_c denotes the number of constraints. The idea of ALM is that solving an unconstrained optimization problem with L(·) as cost function and using the optimal value of the multipliers λ_l leads to the solution u of the original (constrained) optimization problem [30]. In the combined FGM+ALM method, we use the FGM to find the solution of the unconstrained problem given by the augmented Lagrangian, and we use the ALM to iteratively update the values of the Lagrange multipliers until they reach their optimal values.
The FGM is a first-order iterative method to efficiently solve input-constrained optimization problems. It is especially suited for embedded platforms due to its low memory requirements and high convergence rate. It consists of two main steps. In the first step, the next candidate of the optimal solution u^+ is computed
Here, L is a Lipschitz constant that can be calculated from the problem data, i.e., the system description and the control task description, and P_U denotes a projection onto the set U, which is defined by the input constraints. In most cases, the inputs are box constrained, and such a projection is a simple (and computationally cheap) saturation operation given the minimum and maximum possible values of the control inputs. The operator ∇_u(·) denotes the partial derivative of the augmented Lagrangian with respect to the vector of control inputs u.
In the second part of the FGM an extra step is computed [34] :
Here, φ > 0 is a strong convexity constant which can be computed offline [31]. The FGM for a number of iterations j_in and a Lagrange multiplier λ is summarized in Algorithm 1.
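The two FGM steps can be sketched in Python for the input-constrained case. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the cost is taken to be 0.5 u'Hu + g'u, the inputs are box constrained, and the extrapolation weight is the constant (√L − √φ)/(√L + √φ) that is commonly used for strongly convex problems. The names `fgm` and `saturate` and the argument order are hypothetical.

```python
import math


def matvec(M, v):
    """Dense matrix-vector product using plain lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def saturate(v, lo, hi):
    """Projection onto a box: element-wise clipping."""
    return [min(max(x, lo), hi) for x in v]


def fgm(H, g, lo, hi, L, phi, j_in, u0=None):
    """Fast gradient sketch for min 0.5*u'Hu + g'u subject to lo <= u <= hi."""
    n = len(g)
    u = list(u0) if u0 is not None else [0.0] * n
    w = list(u)
    # Constant extrapolation weight (assumed standard choice).
    beta = (math.sqrt(L) - math.sqrt(phi)) / (math.sqrt(L) + math.sqrt(phi))
    for _ in range(j_in):
        # Step 1: projected gradient step taken from the auxiliary point w.
        grad = [hw + gi for hw, gi in zip(matvec(H, w), g)]
        u_next = saturate([wi - gr / L for wi, gr in zip(w, grad)], lo, hi)
        # Step 2: extrapolation using the last two candidates.
        w = [un + beta * (un - ui) for un, ui in zip(u_next, u)]
        u = u_next
    return u
```

Each iteration needs one matrix-vector product, a few vector additions and scalings, and one saturation, so the loop uses only additions, multiplications, and comparisons.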
If a given optimization problem only has input constraints, Algorithm 1 is directly able to obtain a solution (the value of the multiplier λ is then irrelevant). In the case of state constraints, an optimal value of the multiplier has to be found to obtain the solution of the original (constrained) problem. This is achieved via an iterative update of the multiplier
where μ > 0 is a penalty parameter. When state constraints are present, a combination of the FGM presented in Algorithm 1 and the multiplier update in (7) is used. A summary of the ALM+FGM scheme with i_ex iterations is given in Algorithm 2.
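The nested structure of the combined scheme can be sketched as follows. The sketch assumes the standard textbook form of the augmented Lagrangian, f_0 + λ'f_c + (μ/2)‖f_c‖², with f_c(u) = V u − v, and the standard multiplier update λ ← λ + μ f_c(u); the sign and scaling conventions of (4) and (7) may differ. The function names, the warm start of the inner loop, and the toy problem in the usage note are illustrative assumptions.

```python
import math


def matvec(M, v):
    """Dense matrix-vector product using plain lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def residual(V, v, u):
    """Equality-constraint residual f_c(u) = V u - v."""
    return [r - vi for r, vi in zip(matvec(V, u), v)]


def grad_aug_lagrangian(H, g, V, v, u, lam, mu):
    """Gradient of 0.5*u'Hu + g'u + lam'f_c + (mu/2)*||f_c||^2 with respect to u."""
    s = [l + mu * r for l, r in zip(lam, residual(V, v, u))]
    vt_s = [sum(V[l][i] * s[l] for l in range(len(V))) for i in range(len(u))]
    return [hu + gi + t for hu, gi, t in zip(matvec(H, u), g, vt_s)]


def alm_fgm(H, g, V, v, lo, hi, L, phi, mu, i_ex, j_in):
    """Outer multiplier updates wrapped around an inner fast gradient loop."""
    u = [0.0] * len(g)
    lam = [0.0] * len(v)
    beta = (math.sqrt(L) - math.sqrt(phi)) / (math.sqrt(L) + math.sqrt(phi))
    for _ in range(i_ex):
        w = list(u)  # warm start the inner loop from the previous solution
        for _ in range(j_in):
            grad = grad_aug_lagrangian(H, g, V, v, w, lam, mu)
            u_next = [min(max(wi - gr / L, lo), hi) for wi, gr in zip(w, grad)]
            w = [un + beta * (un - ui) for un, ui in zip(u_next, u)]
            u = u_next
        # Multiplier update (assumed standard form).
        lam = [l + mu * r for l, r in zip(lam, residual(V, v, u))]
    return u, lam
```

For the toy problem of minimizing 0.5(u_1² + u_2²) − u_1 − u_2 subject to u_1 + u_2 = 1, calling `alm_fgm` with H equal to the identity, g = [-1, -1], V = [[1, 1]], v = [1], μ = 1, L = 3, and φ = 1 approaches u = (0.5, 0.5). Here L and φ are the largest and smallest eigenvalues of H + μ V'V.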
III. FPGA IMPLEMENTATION OF MPC USING HLS
A. Design Workflow
A two-step approach is proposed (see Fig. 2(a)) to enable an optimized yet simple FPGA design for MPC controllers. As a first step, we use the tailored MPC code generation tool μAOMPC [35], including novel extensions, to transform a simple description of the control problem into code that can be used in a second step by an HLS tool, such as Vivado HLS [36]. The required tailored code is automatically generated by the extended version of μAOMPC, which implements the ALM+FGM algorithm presented in Table II relying only on additions and multiplications. The proposed extended version modifies μAOMPC 0.4.0 by including definitions of each needed operation (see Table III) with fixed-size arrays. This is necessary because the Vivado HLS tool does not allow variable-size array optimization. Variable-size arrays were used in μAOMPC to obtain generated code of smaller size. However, the modified version of the code takes advantage of fixed-size arrays to optimize the FPGA implementation by unrolling and pipelining the digital design, as will be explained later.
This approach offers several advantages for computations on embedded platforms. The HLS tool processes the autogenerated code taking into account explicitly the particularities of the optimization algorithms presented in Tables I and II. The result of the proposed approach is an automatic and optimized FPGA design for MPC, which depends on the problem data and designer requirements. The proposed approach offers several design alternatives to achieve different tradeoffs between FPGA size and computation time. Alternatively, μAOMPC can be used to generate C-code that can be directly used on microcontrollers, without any additional libraries.
The HLS tool workflow to implement the MPC controller is summarized in Fig. 2(b). Inputs are the C-code containing the MPC algorithm as well as the required libraries and additional constraints to be considered by the HLS tool. As discussed later, these constraints must be designed carefully, taking the MPC problem into account, in order to optimize both the performance and the needed digital resources. The output of the HLS process is an optimized register transfer level implementation in a hardware description language, ready for simulation and/or synthesis. One of the main benefits of this design flow is the fast design process, which makes it possible to analyze a large set of implementations in a time-effective manner, something that cannot be addressed manually. The main novel contributions of the proposed design flow include the extension of the tool μAOMPC to generate C-code that can be directly used by HLS tools, as well as the optimized use of HLS constraints to take advantage of the arithmetic operations that are necessary to solve the typical optimization problems that arise within MPC.
B. Optimization of the FPGA Design
The optimal implementation of MPC with respect to FPGA resources and computational speed depends strongly on the matrix-vector operations that are necessary to solve the optimization problem at each sampling time. Table III summarizes all the necessary operations as a function of the number of states (nx), the number of control inputs (nu), and the prediction horizon (N). These operations and their relation to the FPGA implementation are key during the design process in order to optimize the implementation performance and digital resource usage. The saturate operation is the projection operator defined in the FGM algorithm for the case of box-constrained inputs, and the scale operation denotes the multiplication of all elements of a vector by a scalar. A similar analysis can be performed for the FGM+ALM algorithm. It is omitted here to simplify the presentation.
From Table III, it is clear that the matrix-vector multiplication inside the loop is the most resource-consuming operation. The optimization analysis will focus on this aspect.
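A rough operation count shows why the matrix-vector product dominates. The sketch below assumes a dense condensed matrix of dimension (nu·N) × (nu·N) and counts scalar operations for one FGM iteration; the function name `fgm_iteration_ops` is hypothetical and the counts are an illustrative estimate, not a reproduction of Table III.

```python
def fgm_iteration_ops(nu, N):
    """Scalar operation counts for one FGM iteration (dense case, illustrative)."""
    n = nu * N  # length of the stacked input vector
    return {
        "matvec_mul": n * n,        # matrix-vector product: n*n multiplications
        "matvec_add": n * (n - 1),  # and n*(n-1) additions
        "vector_ops": n,            # each scale, add, or saturate touches n elements
    }
```

With nu = 1 and N = 40 the product needs 1600 multiplications per iteration, while each element-wise vector operation touches only 40 values. With nu = 3 and N = 10 the corresponding numbers are 900 and 30.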
In order to explore and optimize the design space using HLS, several design constraints must be carefully chosen.
1) Unrolling replicates the body of a loop so that its arithmetic operations are computed in parallel. This significantly improves the implementation performance at the cost of additional digital resources. Fig. 3(a) shows the standard implementation of the arithmetic operation, which requires a single multiplier plus an accumulator that sweeps all matrix elements. This implementation, however, is the slowest, and it can limit real-time computing with complex problems or large prediction horizons. In order to improve the performance, the inner loop can be unrolled (see Fig. 3(b)), which requires additional multipliers plus an adder tree. Depending on the unroll level, a balance between performance and digital resource usage must be met. Additionally, several rows of the matrix multiplication can be computed at the same time. This implies the replication of the aforementioned structure as many times as the number of rows to be computed simultaneously (see Fig. 3(c)).
Usually, the unroll factor is chosen to be an even number to make the most of dual-port memories. Finally, if the performance needs to be further improved, the outer loop can also be unrolled (see Fig. 3(d)). This means that partial data from one iteration can be used to start computing subsequent iterations, enabling a significant gain in computing time. It is important to note that this enhancement, which is possible using HLS, is in general not feasible by hand coding due to the complex arithmetic data paths.
2) Pipelining is a common practice in digital design used to increase the maximum clock frequency when several arithmetic operations are performed sequentially. In this case, the constraint is used to set the target initiation interval as a parameter, enabling an increased throughput and clock frequency at the cost of additional digital resources. Considering modern control problems and FPGA technology, this is a strongly recommended strategy due to the high number of flip-flops commonly available even in inexpensive devices.
3) Inlining enables cross optimization among different C functions organized in a hierarchy. Unlike straightforward optimization, which treats each function as a separate black box, this directive enables further optimization and resource sharing to increase the performance and decrease the digital resource usage.
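The effect of the unrolling and pipelining constraints on a single row of the matrix-vector product can be mimicked in software. The Python sketch below is only a behavioral model: `dot_unrolled` consumes `factor` element pairs per loop trip and sums them with a balanced adder tree, and `pipelined_cycles` uses the usual idealized estimate of pipeline depth plus one new trip every initiation interval. All names are hypothetical, and the cycle estimate ignores memory-port limits and other effects that a real HLS tool reports.

```python
def adder_tree(values):
    """Sum values pairwise, as a balanced adder tree would."""
    vals = list(values)
    while len(vals) > 1:
        pairs = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            pairs.append(vals[-1])  # odd element passes through to the next level
        vals = pairs
    return vals[0] if vals else 0.0


def dot_unrolled(row, vec, factor):
    """Dot product that consumes `factor` element pairs per loop trip."""
    acc, trips = 0.0, 0
    for start in range(0, len(row), factor):
        products = [a * b for a, b in zip(row[start:start + factor],
                                          vec[start:start + factor])]
        acc += adder_tree(products)
        trips += 1
    return acc, trips


def pipelined_cycles(trips, depth, ii):
    """Idealized latency: fill the pipeline once, then start a trip every ii cycles."""
    return depth + (trips - 1) * ii if trips > 0 else 0
```

A row of length 40 needs 40 trips with factor 1 and 20 trips with factor 2, at the price of a second multiplier and one adder. This is the resource versus speed tradeoff that the unroll constraint controls.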
IV. DESIGN EXAMPLES
In order to demonstrate the advantages of the proposed HLS scheme and to discuss implementation details, this section presents representative design examples using HLS for the implementation of MPC on FPGAs. We discuss two examples, a small dc-motor problem and a larger chain-of-masses problem, to highlight the differences in arithmetic complexity and optimization.
A. Considered Example Problems
1) DC Motor:
A simple dc motor can be represented by the following discrete-time linear system:
where the time constant T = 0.06, the amplification factor K = 0.15, and the sampling period is equal to t_s = 4 ms. The states of the system represent the rotor position and the angular speed. The input is the pulse width modulation voltage, which is constrained to be between ±100% of its maximum amplitude. The considered MPC controller needs to solve at each sampling time the following optimization problem: where the tuning matrices in the cost function are
We consider a prediction horizon of N = 40. The same solution is obtained regardless of the hardware platform used. The closed-loop performance of the system is shown in Fig. 4. The sampling time of the controller is the same as the sampling period of the system (4 ms). As can be seen, the MPC controller drives the system to the desired equilibrium point while respecting the input constraints.
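For readers who want to experiment with a comparable model, the following Python sketch discretizes a generic first-order dc-motor model with time constant T and gain K. The continuous-time structure used here (position as the integral of speed, and speed as a first-order lag driven by the input) is an assumption made for illustration and is not taken from the model equations of this paper, so the resulting matrices may differ from those used in the example. The function name `dc_motor_discrete` is hypothetical.

```python
import math


def dc_motor_discrete(T, K, ts):
    """Exact zero-order-hold discretization of an assumed first-order motor model.

    Assumed continuous-time model (not taken from the paper):
        d(position)/dt = speed
        d(speed)/dt    = -speed / T + (K / T) * u
    """
    a = math.exp(-ts / T)
    A = [[1.0, T * (1.0 - a)],
         [0.0, a]]
    B = [[K * (ts - T * (1.0 - a))],
         [K * (1.0 - a)]]
    return A, B
```

Calling `dc_motor_discrete(0.06, 0.15, 0.004)` yields a two-state, single-input discrete-time model of the kind an MPC code generator takes as input, together with the weights and the prediction horizon.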
2) Chain of Masses:
The second example is a chain of masses that are linked by springs [32], representing, e.g., an oscillating system with no damping. We consider six masses (m_i = 1 kg) and spring constants of k = 1 N/m (see Fig. 5). The system can be represented by 12 dynamic states: the first six states describe the positions of the masses and the last six their velocities. The forces between the masses are the three control inputs of the system. The control task is to drive the system to the origin, where all masses are at their original positions with no velocity, starting from a disturbed state. The dynamics of the system with a sampling period of t_s = 0.5 s can be described by the following discrete-time linear system:
We consider constraints on the position and velocity of each mass and on the inputs of the system. The optimization problem to be solved at each sampling time is
where the tuning matrices in the cost function are chosen as identity matrices of suitable dimensions, Q = P = eye(12) and R = eye(3), with a prediction horizon N = 10. The MPC algorithm is triggered at each sampling time of the controller, which is chosen to be 10 ms. Fig. 6 shows that the MPC controller is also able to drive the system of oscillating masses to the origin.
B. FPGA Implementation and Optimization
The MPC controllers for both the dc motor and chain of masses examples have been designed and implemented. This section summarizes the main results of the exploration of the optimal implementation for the floating-point implementations, highlighting the main optimization aspects. The implementations have been made using the Vivado HLS tool from Xilinx. The target FPGA is a cost-effective XC7A200. Tables IV and V summarize the main implementation results for floating-point implementations, where the microprocessor (μP) implementation, using a standard STM32F407V μP, has been included for comparison. Data are obtained from the actual routed design. All implementations have been made with a target clock period of 10 ns, and power consumption is evaluated with a vector-less activity propagation methodology [37], which is an industry-standard probabilistic methodology for power consumption estimation. The proposed implementations use the three optimization methods explained in Section III-B, i.e., unrolling, pipelining, and inlining. Solutions 1, 2, and 3 implement different unrolling strategies at different loop levels, either internal or external. All the solutions implement pipelining, which can be seen in the loop factor or initiation intervals. Finally, all the solutions also implement inlining to enable cross-function optimization.
Solution 0 is the standard implementation with no optimization, which is a sequential one close to a microprocessor implementation. Solution 1 consists of unrolling the inner loop and pipelining accordingly, using different unrolling factors from 1 (only pipelining) to N (completely unrolled). Solution 2 consists of replicating the hardware to compute several rows simultaneously. For this solution, full unroll and unroll factor 2 have been considered assuming dual-port memory, since a higher unrolling factor would lead to unacceptable memory resource usage. The straightforward full unroll option (2.N in the tables) has been included to show that an overconstrained solution leads to a suboptimal result that does not fit into the selected device. Finally, Solution 3 consists of pipelining the outer loop. It is important to note that this optimization cannot be hand coded in a feasible way and that it provides further optimization at the cost of additional resources. Moreover, the design space exploration is mandatory, since the optimum solution cannot be found without performing the actual hardware implementation. Consequently, some implementations achieve high performance with balanced resource consumption, whereas other implementations achieve reduced performance even with infeasible resource usage.
C. Discussion
From the previous results, it is clear that the design space exploration using HLS for MPC problems enables optimization of the FPGA design in terms of performance, digital resources, and power consumption. Finding the optimal solution, however, is not trivial. Overly aggressive design constraints result in a solution with lower performance and higher usage of FPGA area. Whereas some strategies, such as inner loop unrolling, directly increase performance and resource usage, the effect of other more complex strategies, such as outer loop unrolling, cannot be easily predicted.
Also, it is important to note that the optimal solution depends highly on the problem and on the digital platform used for implementation. The proposed combination of automatic code generation and the use of HLS tools enables a fast optimization procedure that can be specifically performed for each problem at a reasonable cost. Once the set of possible results has been generated, the designer can choose the most suitable for each specific application.
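The final selection step can be supported by a small script once the HLS runs have produced their reports. The Python sketch below filters a list of candidate designs down to those that are not dominated in both latency and resource usage, and picks the fastest design that fits a given resource budget. The dictionary keys `latency` and `resources` and both function names are hypothetical; real reports contain several resource types (e.g., lookup tables, flip-flops, DSP blocks, memory) that would have to be combined or checked individually.

```python
def pareto_front(designs):
    """Keep designs that no other design beats in both latency and resources."""
    front = []
    for d in designs:
        dominated = any(
            o["latency"] <= d["latency"] and o["resources"] <= d["resources"]
            and (o["latency"] < d["latency"] or o["resources"] < d["resources"])
            for o in designs
        )
        if not dominated:
            front.append(d)
    return front


def fastest_that_fits(designs, resource_budget):
    """Pick the lowest-latency design within a resource budget, or None."""
    feasible = [d for d in designs if d["resources"] <= resource_budget]
    return min(feasible, key=lambda d: d["latency"]) if feasible else None
```

A design that is both slower and larger than another one drops out of the Pareto front, which mirrors the observation that overconstrained solutions can end up worse in every respect.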
It is important to note that with the proposed methodology optimized FPGA implementations of MPC for fast systems are now possible, opening the design space to new applications.
V. CONCLUSION
An optimized software-supported FPGA implementation of MPC has been proposed. In order to make the most of modern FPGA technology and provide control engineers with powerful and simple tools, a design methodology based on HLS has been presented. Special attention has been paid to the code and FPGA optimization, providing several guidelines for designers. Finally, the proposed methodology has been applied to two design examples, demonstrating the feasibility and possibilities of the proposed approach. In conclusion, the combination of FPGA technology and HLS is a promising technique for modern MPC controllers with high-performance, low-cost, and energy-effective implementations. Furthermore, it opens the door to strictly verifiable MPC implementations, as required, for example, in the aerospace industry.
Future work will include the consideration of model uncertainty in MPC, as done, e.g., in [38]. Additionally, combinations of software plus ad-hoc digital hardware via hard-core or soft-core microprocessors open new possibilities. In this sense, the so-called software-defined system on chip provides valuable tools for control designers.
