Abstract-Model Predictive Control (MPC) is a computationally demanding control technique that allows dealing with multiple-input and multiple-output systems, while handling constraints in a systematic way. The necessity of solving an optimization problem at every sampling instant often (i) limits the application scope to slow dynamical systems and/or (ii) results in expensive computational hardware implementations. Traditional MPC design is based on manual tuning of software and computational hardware design parameters, which leads to suboptimal implementations. This paper proposes a framework for automating the MPC software and computational hardware co-design, while achieving the optimal trade-off between computational resource usage and controller performance. The proposed approach is based on using a multiobjective optimization algorithm, namely BiMADS. Two test studies are considered: Central Processing Unit (CPU) and Field-Programmable Gate Array (FPGA) implementations of fast gradient-based MPC. Numerical experiments show that optimization-based design outperforms Latin Hypercube Sampling (LHS), a statistical sampling-based design exploration technique.
I. INTRODUCTION AND RELATED WORK
Model predictive controller design is a multidisciplinary problem that involves tuning several coupled design parameters. Traditionally MPC controllers were tuned manually, with a trial and error approach, which cannot be considered as a viable option for most presentday applications, considering the number of design parameters and design evaluation time [1] . Moreover, manual tuning often requires understanding the nature of the controlled dynamical system and MPC controller with the underlying optimization solver. Available tuning guidelines for model predictive control, including heuristic and systematic (but not automatic) approaches, are reviewed in [2] . Note that only high level optimal control problem parameters (e.g. horizon length, weights on states/inputs) are considered in the review paper, without regard to solving the underlying optimization problem.
The full design exploration approach, which can be considered as the simplest way of design automation, leads to unacceptable exponential complexity scaling with respect to the number of parameters. Statistical exploration methods, e.g. Monte Carlo methods [3] and Latin Hypercube Sampling (LHS) [4] attempt to accelerate exploration by randomising the sampling process. However, all above techniques explore the design space without exploiting the knowledge about evaluated designs. As a result, statistical algorithms often achieve uniform distribution in the parameters space, without giving priority to the most promising (in terms of performance criteria) areas and hence waste time evaluating unpromising implementations. However, Monte Carlo methods and LHS can be used for identifying an initial guess for other algorithms. An alternative approach for taking humans out of the design loop is based on systematic optimization [5] . The following applications of optimization techniques to MPC design can be found in the literature:
• [6] presents an application of multi-objective optimization to the Van de Vusse reaction considering two contradicting objectives: maximizing the desired product and minimizing the unwanted product. Several multi-objective optimization methods are employed in this work: normalized normal constraint [7] , normal boundary intersection [8] and weighted sum.
• Study [9] compares systematic multi-objective optimizationbased parameter tuning using enhanced normalized normal constraint method [10] against Monte Carlo simulations. The case study considers NMPC applied to a biochemical system.
• Another algorithm for tuning NMPC controllers with application to chemical processes is presented in [11] . Instead of exploring the trade-offs the algorithm attempts to find a single compromise design using the approach [12] . Other examples of automatic tuning of MPC controllers can be found in [13] and [14] . The studies outlined above share certain common features:
• Multi-objective nature of MPC design problem is acknowledged and hence dedicated multi-objective optimization algorithms are used.
• Contradicting control objectives are considered as design goals without explicit optimization of computational complexity.
• Optimization solver parameters are not systematically tuned, it is assumed that optimal control problems are solved to optimality. This might be a valid assumption for slow dynamical systems, e.g. chemical reactors, but not for fast applications, e.g. robotics.
• Underlying computational platform parameters are not involved in the tuning process. In contrast, in this work we automate MPC controller design considering computational resources as a design objective. The control performance can be traded off against resource usage by tuning both algorithmic and computational hardware-related parameters.
The paper is organised as follows. Following the introduction, the target computational platform is described in Section II. An approach for formulating predictive controller design as an optimization problem is presented in Section III. Possible ways of formalising design objectives and constraints are discussed within the section. Following that, Section IV reviews existing algorithms for solving the resulting optimization problem and justifies using the BiMADS algorithm [15] for solving MPC design problems. Two case studies are considered in Section V: design of CPU-based and FPGA-based implementations of a fast gradient-based predictive controller. Section VI concludes the paper.
II. TARGET COMPUTATIONAL PLATFORM
In this work we prototype predictive controller implementations using Xilinx Zynq-7000 XC7Z020 System-on-a-Chip (SoC), which incorporates a dual-core ARM Cortex-A9 processor and Artix-7 FPGA logic. The FPGA fabric contains 53200 Lookup Tables  (LUTs) , 106400 Flip-Flops (FFs), 220 DSP blocks and 140 block RAMs (BRAMs) with total capacity of 4.9 Mb. Using SoC allows prototyping both Central Processing Unit (CPU) and FPGA implementations using one single chip.
The following techniques can be used to accelerate optimizationbased controllers on FPGAs:
• Data-level parallelization refers to splitting computations across multiple processors, so that different sets of data are processed simultaneously. Usually applied on regular data structures, e.g. arrays.
• Data pipelining is based on connecting processing units in series, so that the output of one unit is the input to the next one. The elements of this chain, i.e. pipeline, process data simultaneously.
• Tailoring number representation for a given algorithm, instead of using standard single/double precision floating point arithmetic, allows achieving better time vs resource usage trade-offs. For example, for fixed-point number representation arithmetic operations complexity is identical to that of integers. In this work we synthesize FPGA circuits using vendor's high level synthesis tool, namely Vivado HLS. Vivado HLS allows describing FPGA circuit architecture with C code and additional optimization directives for implementing the above above-discussed acceleration techniques. The entire FPGA design flow involves multiple stages:
• High-level synthesis: Converting the C code with synthesis directives (e.g. loop unrolling or pipelining) into low level Hardware Description Language (HDL) code.
• Synthesis: The process of transforming HDL code into a netlist, a graph that defines the connection of all logic gates and registers.
• Place-and-route (P&R): Solving a set of optimization problems in order to fit the netlist to a particular FPGA platform. The outcome of P&R is a bitstream that can be uploaded onto the FPGA.
• Functional verification of the circuit. For optimization-based control applications, the FPGA circuit has to be verified in the loop with a plant model. In this work we automate the above design flow by using Protoip software tool [16] .
III. FORMULATING THE DESIGN PROBLEM AS AN OPTIMIZATION PROBLEM
A. Design objectives and constraints 1) Performance: Quantifying controller performance is not a trivial task: depending on the nature of the dynamic system and design requirements several performance criteria might be considered. Fortunately, optimization-based controllers often perform explicit performance optimization and hence can be evaluated using objective measures. Depending on the application different control objectives might be selected, e.g. energy consumption or settling time. It should be emphasized that the solution of a single optimal control problem cannot serve as a performance indicator, since the ultimate goal is achieving desired closed-loop behaviour [17] . Instead, a closed-loop cost function should be calculated based on a closed-loop simulation.
Considering the closed-loop cost function as an objective within an optimization framework, it is important to take into consideration the continuity and monotonicity properties of the cost function. According to [18] , even for a constrained LQR formulation, which is arguably the simplest MPC setup, neither continuity nor monotonicity with respect to horizon length and sampling time can be guaranteed in general. This significantly limits the range of optimization tools that can be used for MPC design optimization.
2) Computation time: In relation to CPU implementations, where several algorithms might share the same hardware platform, algorithm execution time becomes both a design objective and design constraint. On the one hand, minimizing algorithm execution time keeps processor load low and hence enables sharing processor time with other algorithms. On the other hand there is a fundamental constraint for MPC design problems: in order to implement the controller in realtime, the optimization algorithm execution time has to be smaller than the sampling time of the system. There are certain exceptions to this rule [19] , which are not considered in this work.
In contrast, for FPGA implementations, where a circuit is synthesised for one particular algorithm, minimizing computation time would not give any benefits, since the logic cannot be reused by other algorithms. Moreover, for certain cases it could be worthwhile to increase computation time by decreasing the circuit size and hence saving resources. However, computation time can only be increased up to the sampling time, which can be formalized as a design constraint.
Note that, for a given algorithm the CPU implementation execution time would not be fully predictable because of a complex memory hierarchy. For FPGA circuits, execution time (in terms of clock cycles) can often be determined based on the architecture and hence efficiently predicted before circuit synthesis.
3) Computational logic usage and energy consumption: An FPGA designer often aims to minimize the amount of silicon that is required for implementing a particular control algorithm to get a size-and costefficient solution. As discussed in Section II, a modern FPGA has the following resources: flip-flops, lookup tables, block RAMs and DSPs. We measure the silicon usage (or computational logic usage) as follows:
where RF F , RLUT , RDSP and RBRAM denote relative utilization of each resource type. Euclidean norm can be considered as a compromise between the L 1 and L ∞ norms, where the former would not take into account a possible imbalance between different types of resources and the latter would penalize only the most heavily used resource.
In contrast to taking the average (which in this case would be equivalent to the L 1 norm), the Euclidean norm automatically accounts for the fact that a certain implementation is constrained by the most heavily-used resource.
There is seldom a linear correspondence between circuit size and energy consumption. For instance, in some cases it could be energyefficient to create a large circuit by parallelizing all computations in order to quickly perform all calculations and switch to idle mode [17] . In such cases, energy consumption may be considered as a separate design objective, which is particularly important for energy-autonomous embedded platforms.
B. Design parameters 1) Horizon length and sampling time:
Horizon length and sampling time are fundamental design variables that have a crucial impact both on closed-loop performance of a system and computational hardware requirements. These two parameters, being tightly coupled with each other, define the quantity of predicted information in a predictive controller. Horizon length defines the "vision distance", while sampling rate sets the "quality of the picture" [18] . Sampling frequency also defines the response delay of the controller. An overview of techniques for manual tuning of these parameters can be found in [2] .
2) Problem formulation: condensed vs non-condensed: Eliminating the states from decision variables by expressing them via the current state and input sequence leads to a compact condensed formulation [20] , which results in a worst-case cubic scaling of the number of floating point operations in horizon length for the primal-dual interior point algorithm iteration. A sparse formulation implies the opposite approach: keeping the dynamics in constraints and treating both inputs and states as decision variables. Although the problem size becomes larger, exploiting the sparsity pattern allows linear scaling of computational effort in horizon length. The condensed and non-condensed formulations can be considered as two extreme points of a sparsity level design variable. Controlling the level of sparsity can be achieved by dividing the prediction horizon into sub-intervals and performing partial condensing. This provides the possibility of adjusting the block size of linear algebra problems in order to find the optimal level of sparsity in terms software and hardware resources utilization [21] .
3) Optimization algorithm parameters: Optimization algorithms that solve optimal control problems numerically often have many tuning parameters on the underlying levels. The first fundamental design choice is the algorithm type [17] :
• first vs second order methods • interior point vs active set algorithms • gradient-based vs derivative-free approaches In addition to making high level design decisions, it is important to tune low level parameters, which might include:
• number of iterations / termination criterion • barrier parameter update strategy for interior-point algorithms • globalization strategy (line search vs trust region) The above parameters must be tuned with respect to closedloop performance, which is not necessary correlated with a single optimization problem's optimality conditions [17] . For example, a sub-optimal controller with a longer prediction horizon might perform better than an optimal controller with a shorter horizon. More details on sub-optimal MPC can be found in [22] .
4) Number representation:
There are usually two types of choices that have to be made in relation to number representation:
• The conceptual choice of data representation type: e.g. floating point or fixed point.
• The numbers of bits to be allocated for different parts of a number, e.g. for mantissa and exponent in floating point arithmetic. 5) Data-level parallelism and pipelining: Data-level parallelism and pipelining were discussed in Section II. The main algorithmic choices in relation to these techniques are:
• The number of parallel processors.
• The number of pipeline stages, i.e. pipeline depth. It should be emphasized that parallelism and pipelining affect algorithm execution time and resource usage, which may or may not have an impact on the closed-loop performance.
C. The resulting optimization problem
The design parameters considered in Section III-B can be classified in two categories: software parameters (e.g. prediction horizon length) and hardware parameters (e.g. number representation). Conventional approaches propose sequential design: initially the software algorithm is designed at a high level of abstraction without regard to the intended hardware platform and, following this, the algorithm is implemented on a hardware platform by selecting appropriate hardware parameters. This decoupled approach usually leads to inefficient utilization of available resources [17] . In contrast, the co-design approach implies simultaneous design of both software and hardware components in order to achieve the best possible performance for a given set of available resources. However, improvement of the closed-loop performance cannot be considered as the only design objective. As can be seen from Section III-A there are often several contradicting design objectives, which might include performance and computational hardware resource usage. Instead of looking for one optimal design (which often does not exist due to conflicts between objectives), engineers might make a decision based on the whole series of Pareto optimal designs, i.e. designs that cannot be improved in terms of one objective without worsening at least one of the other objectives.
The problem of investigating design trade-offs is usually formalized as a multi-objective optimization (MOO) problem. The main bottleneck that prevents efficient solution of MOO design problems in relation to MPC are the properties of the objective functions:
• Long function evaluation time. Evaluation of the design objective functions requires time-consuming simulations to be performed.
• Absence of derivative information. There are no analytical expressions for accurate estimation of the design objectives derivatives.
• Mixed domain. Design variables can be both discrete (e.g. number of bits for data representation) and continuous (e.g. sampling rate).
• Noisiness. For example, the same HLS code may result in different resource usage values depending on a vendor's software version or other factors that are not taken into account by conventional models. The next section will review existing algorithms for solving multi-objective optimization problems with focus on BiMADs, a biobjective version of a mesh adaptive direct search algorithm.
IV. DERIVATIVE-FREE MULTI-OBJECTIVE OPTIMIZATION

A. Problem statement
We consider the following multi-objective optimization problem:
where S is q-dimensional decision space. Since the objectives of MOO are often contradicting, there is no single solution to the problem. Instead, a set of Pareto optimal solutions can be obtained. 
is the set of non-dominated points for a given set of evaluated points U , i.e. the current approximation of the Pareto frontier. The quality of a Pareto frontier approximation can be assessed by means of the hypervolume. • Trust-region interpolation algorithms. Trust-region algorithms propose building a local approximation of the objective function based on evaluated samples. Based on this approximation, the function is minimized inside the trust region.
• Line search algorithms for derivative-free optimization are conceptually similar to their derivative-based counterparts: they perform search along a particular direction. However, for derivativefree algorithms, the search direction is calculated without gradient information.
• Direct Search Methods (DSMs) do not attempt to approximate derivatives either explicitly or implicitly. Instead, optimization is based on evaluation of a finite set of points around the current solution guess, so that a point with smaller objective function value can be found. Among the considered classes of algorithms, only direct search methods of directional type have been extensively studied in relation to multi-objective optimization [24] , hence rest of the section will be focused on DSMs.
Each iteration of a DSM is split into two parts: a search step and poll step. The latter step must be rigidly defined to guarantee convergence, while the former is more flexible and can be tuned to improve numerical performance. Assuming there is a given feasible solution guess, the general structure of a DSM of directional type can be described as follows [23] :
1) Search step. Evaluate the objective function at a finite set of points. If any of these points provide a better objective value compared to the current guess, declare the iteration as successful and skip the poll step. 2) Poll step. Perform a local search around the current best point by evaluating a set of poll points, which are defined by poll directions and a step size. Depending on the result of this search, declare the iteration as successful or unsuccessful. 3) Mesh parameter update. Reduce step size for unsuccessful iterations, increase or maintain step size for successful ones.
Practical examples of DSM algorithms of directional type are the coordinate-search method [23] and the Mesh Adaptive Direct Search (MADS) algorithm [25] .
2) Derivative-free multi-objective optimization: There are two fundamentally different ways for tackling multi-objective optimization problems:
• Direct algorithms for multi-objective optimization attempt to approximate the Pareto frontier directly.
• Scalarization-based approaches propose converting the multiobjective problem into a sequence of single-objective problems. The Direct Multisearch (DMS) algorithm [26] , which belongs to the class of direct multi-objective optimization algorithms, is a generalization of DSM algorithms for single-objective optimization. The main components of a DMS are the same as DSM as discussed in the previous subsection: the search step, the poll step and mesh update. The algorithm keeps a list of non-dominated points, which represents the current Pareto frontier approximation. The local poll search is performed around several non-dominated points. Successfulness of an iteration is decided based on the changes in the Pareto frontier approximation. The search step, similarly to single-objective DSM algorithms, is flexible and is not required for convergence [24] . However, it can help to improve the distribution of the points along the Pareto frontier, although this is not a systematic way of ensuring uniformity of the frontier.
A classical example of scalarization-based algorithms is the weighted-sum method. The method scalarizes the vector of objectives into a single objective by taking an affine combination of all the objectives. Tuning the weights of the scalarized function allows one to move along the Pareto frontier. However, in this case movement along the frontier happens in a non-systematic way. As a result, some regions become over-represented and others may suffer from lack of information. In addition, this approach cannot deal with nonconvexities and discontinuities in the Pareto frontier and hence loses some Pareto optimal solutions.
The BiMADS algorithm [15] performs scalarization in a different way. For a problem with two objectives the scalarized function is given by
where r is a reference point in the objective space. The problem of minimizing φr(·) is solved by the MADS [25] algorithm. The formulation (3) with appropriate selection of a reference point allows the recovery of all Pareto optimal solutions, which is not the case for the weighted-sum reformulation. For achieving good distribution of non-dominated solutions, the reference point must be selected in a particular way. The BiMADS algorithm proposes a formal method for identifying the biggest gap in the Pareto frontier approximation and presents a procedure for selecting an appropriate reference point. Details can be found in [15] . In this work, the BiMADS algorithm will be used for solving the multi-objective co-design problem. The main advantages of the algorithm are:
• Both BiMADS and the underlying single-objective MADS solver are supplied with a mathematical proof of convergence [15] , [25] , which is not the case for other considered algorithms.
• BiMADS has proven to be efficient for solving real-world engineering problems [27] .
• There is a free software application that implements the algorithm, namely NOMAD [28] • NOMAD provides various interfaces, including a Matlab frontend, which simplifies integration with Protoip [16] . The application scope of BiMADS is limited to bi-objective problems, which can be considered as the main drawback of the algorithm.
V. CASE STUDIES
A. Fast gradient-based controller for a mass-spring-damper system -CPU-only implementation
The system under control represents a chain of ten masses that are connected via springs and dampers ( Figure 2 ). The first mass is also connected to a fixed wall. Each mass can be actuated with an input force that has input and output limits. It is assumed there is no gravitational force. The system can be modelled with a continuoustime linear state space model:
where Ac ∈ R n×n , Bc ∈ R n×m are continuous-time state and input matrices. For a system of ten masses n = 20 and m = 10.
The following optimal control formulation is considered:
where Qc ∈ S n + , Rc ∈ S m ++ , Wc ∈ R n×m and P d ∈ S n ++ are state, input, cross and terminal penalty matrices accordingly. S n ++ (S n + ) denotes a set of positive (semi-)definite matrices.
In this experiment the following prediction matrices were used:
where I and 0m×n denote identity and zeros matrices accordingly. Tuning q speed , which is a design parameter, allows the changing the ratio between penalising masses positions and velocities. For the purpose of digital control, the continuous-time state-space model (4) and optimal control problem (5) are discretized, assuming a zero-order hold:
Note that the discrete-time penalty matrices Q d , R d , W d and P d are model-dependent [18] . The optimal control problem (7) can be transformed into a quadratic programming problem by eliminating the states, which leads to the condensed formulation
Note that the gradient term h depends on the current state, while the Hessian H is fixed and hence can be precalculated offline. More details on condensed and sparse formulations for predictive control can be found in [20] . Since (8b) has the form of box constraints, calculating projection on the feasible set becomes computationally cheap. This facilitates using Nesterov's projected gradient algorithm [29] , also known as the Fast Gradient Method (FGM). The method proposes moving in the anti-gradient direction and performing projection P (·) on the feasible set after each iteration (Algorithm 1). The extra momentum step with a parameter β achieves an optimal convergence rate. The constant step scheme [29] implies
, where L is the largest eigenvalue of the Hessian H and µ is the convexity parameter, which is equal to minimum eigenvalue of the Hessian. 
zi+1 = P (θi+1) ⊲ projection on the feasible set 6:
The algorithm was implemented on an ARM Cortex A9 processor of the Xilinx Zynq-7000 XC7Z020 system-on-a-chip employing only a single processing core. Using the Protoip toolbox allowed for fast verification of the controller in the loop with the plant model as shown in Figure 3 .
The problem of interest is automatic design of a fast gradient-based controller. The following design objectives are considered:
• Controller performance is judged based on settling time. In this experiment, settling time is defined as the time elapsed from the beginning of closed-loop simulation to the time at which x(t) 2 ≤ ǫ, where ǫ = 0.01. Several simulations with different initial conditions are performed in order to calculate performance measure as a sum of settling times for different initial conditions; the list of initial conditions are presented in Appendix A. Since the performance criterion and MPC objective are different, it is essential to tune the prediction matrices in order to achieve the desired performance.
• Algorithm computational time. As discussed in Section III-A2, computational time is the main measure of algorithm complexity for CPU implementations. Design constraints:
• Algorithm computational time, in addition to being a design objective, appears in a constraint function: in order to implement the controller in real-time the algorithm execution time has to be smaller than the sampling time of the system (Section III-A2).
• Stability constraint captures whether the controller was able to stabilize the system. Unstable response might happen due to short horizon, numerical errors or other reasons. Note that the former constraint is quantifiable while the latter is non-quantifiable. A quantifiable constraint is a constraint for which the degree of feasibility and violation can be quantified [30] . In this work, non-quantifiable constraints are handled with the extreme barrier approach, which implies setting the objective to infinity for all infeasible points and therefore not allowing infeasible iterations. For quantifiable constraints the progressive barrier approach is adopted. Progressive barrier constraint handling allows exploiting knowledge of the violation degree by accepting infeasible iterations. Both extreme and progressive barrier approaches are implemented in NOMAD. More details can be found in [30] , [28] .
The design parameters are the following:
• Horizon length, N in (7); bounds: 1 ≤ N ≤ 12.
• Sampling time, Ts; bounds: 0.02 ≤ Ts ≤ 0.5.
• Number of fast gradient algorithm iterations, NF GM in Algorithm 1; bounds: 20 ≤ NF GM ≤ 200.
• State penalty matrix, particularly q speed parameter in (6); bounds: 0.2 ≤ q speed ≤ 5. The above parameters are tightly coupled with each other. For example, consider Figure 4 that illustrates the impact of horizon length and sampling time on the Hessian condition number. It can be observed that both parameters have a significant impact on the condition number, which in turn affects the convergence rate of the fast gradient algorithm [29] . As a result, the number of fast gradient algorithm iterations NF GM required for convergence will also change. However, NF GM must be selected with respect to closed-loop performance, rather than open-loop optimality conditions, which complicates the tuning process even more.
The above design problem was solved using the BiMADS algorithm. The results are compared to latin hypercube sampling, which is a statistical sampling method commonly used for design exploration and for design-of-experiments in particular [4] . For both experiments the number of evaluations was restricted to 200. Design evaluation involves compiling the source code and performing processor-in-theloop tests, which takes 1-4 minutes for the considered setup, depending on a design complexity and the sampling time. Observe that full design exploration is not a viable approach: even using a coarse grid with ten points for the continuous variables, full exploration will require 217200 evaluations. As can be seen from Figure 5 , the Pareto frontier returned by BiMADS dominates the front identified by LHS, i.e. any design sampled by LHS is dominated by at least one design identified by BiMADS (see Definition IV.2). Moreover, according to the hypervolume profile (Figure 6 ), BiMADS provides a satisfactory approximation of the Pareto frontier on the early stages, which opens the possibility of early termination, depending on the time and/or simulation resource availability.
Consider Table I , which lists designs identified by BiMADS. Designs are sorted according to their settling times in ascending order. It can be observed that most of the design parameters (Ts, N and NF GM ) are changing along the Pareto-frontier in a non-monotonic way. This can be explained by the complicated cross-coupling be- Hessian condition number tween parameters. As a consequence, manual and heuristic tuning approaches are unlikely to identify these trade offs in an efficient way. The only exception is the weight matrix parameter qratio, which has the same value for all designs. In this particular case qratio could be optimized separately from other design parameters. Given the Pareto frontier ( Figure 5 ) a designer will be able to select a particular implementation based on the available processing time. For example:
• For the computational budget of 100 ms, LHS-based exploration achieves a settling time of 10.31 s, while optimization-based design allows settling of the plant in 9.75 s.
• Tightening the computational budget to 2 ms leads to sampling 
B. Fast gradient-based controller for a mass-spring-damper system -FPGA implementation
This case study considers implementation of the Algorithm 1 with fixed point arithmetic on the FPGA logic of Xilinx Zynq-7000 XC7Z020 SoC. The testing setup is similar to that of Section V-A, both in terms of the optimal control problem formulation (5) and the plant model (Figure 2) . The algorithm was implemented with the Vivado HLS FPGA synthesis tool using Protoip for automatic deployment and verification in the loop with the plant model (Figure 7) . As can be seen from Algorithm 1, FGM relies only on addition and multiplication operators, while all divisions can be precalculated offline. The following techniques were used to accelerate vectorvector and matrix-vector operations (lines 4-7):
• Loop pipelining [31] . Data pipelining, as a general acceleration technique, was discussed in Section II. In relation to loops, pipelining implies overlapping iterations, i.e. starting a new iteration before finishing the previous.
• Loop flattening [31] is transforming nested loops into a single loop with multiple counters. Flattening allows efficient pipelin- ing of nested loops. This technique was applied to matrix-vector multiplication (line 4), where loop nests arise when iterating over matrix rows and columns. Fixed point arithmetic often introduces overflow and round off errors. The former issue can be addressed by precalculating the upper bound on the largest absolute value of algorithm iterates using interval arithmetic. Regarding overflow errors, the number of fraction bits has to be sufficiently large to maintain numerical stability of an iterative algorithm. A procedure for precalculating the number of integer and fraction bits for fixed point implementations of a fastgradient algorithm is presented in [32] .
The following objectives are considered in this design optimization experiment:
• Controller performance is measured similarly to the previous case study (Section V-A) with ǫ = 0.02.
• FPGA logic usage. As discussed in Section III-A FPGA designers often aim to minimize the amount of logic used for a particular algorithm. Logic usage is measured as the Euclidean norm of relative utilization of each resource type, see (1) in Section III-A. Design constraints:
• Algorithm execution time. Similarly to the previous case study, in order to implement the controller in real-time, the algorithm execution time has to be smaller than the sampling time of the system. This constraint is treated with the progressive barrier approach. Note that for FPGA setup (in contrast to CPU), computational time does not appear as an objective. This is explained by the fact that FPGA logic is synthesized for a particular algorithm and cannot be reused for other applications.
• Objective function convexity constraint. Due to assumptions on the weight matrices in formulation (5), the Hessian of the objective function (8a) is positive definite. However, a fixed point representation of the true Hessian may be non-convex because of truncation errors, which might affect convergence of the fast-gradient algorithm. To avoid non-convex formulations, a positivity constraint on the smallest eigenvalue of the fixed point representation of the Hessian must be set. Although this is a quantifiable constraint, which potentially can be treated with a progressive barrier approach, we will use the extreme barrier method that rejects all infeasible iterations. Since identifying Hessian convexity is significantly faster compared to the full design evaluation, which involves circuits synthesis and closed loop simulation, rejecting infeasible iterations allows saving design time. The following design parameters are considered:
• Number of fast gradient algorithm iterations, NF GM in Algorithm 1; bounds: 20 ≤ NF GM ≤ 200. • State penalty matrix, particularly q speed parameter in (6); bounds: 0.2 ≤ q speed ≤ 5.
• Number of fraction bits for fixed point number representation, N f rac ; bounds:
Similarly to the previous case study, the multi-objective optimization problem with the above design objectives and constraints was solved using BiMADS algorithm and results were compared to Latin hypercube sampling allowing 200 evaluations for both algorithms. In contrast to software compilation, FPGA circuit synthesis is a time-consuming process, which leads to the design evaluation time in the range of 20-35 minutes. For this case study full design exploration would require 4561200, assuming a grid of ten points for the continuous parameters.
It can be observed from Figure 8 that BiMADS outperforms LHS, i.e. the BiMADS Pareto frontier dominates frontier identified by LHS. However, unlike with the previous case study, LHS outperforms BiMADS in the early stages of design exploration, which can be visualized with the hypervolume profile in Figure 9 . This might happen due to bad selection of initial guesses for BiMADS. Moreover, LHS, being a statistical method, might occasionally identify Pareto optimal designs faster than deterministic algorithms. It can be observed from Figure 9 that, after achieving a certain hypervolume space in the beginning of exploration process, LHS does not improve the Pareto frontier approximation significantly further. In contrast, BiMADS, having a poor initial guess, improves the solution and outperforms LHS when reaching an evaluation limit. Consider Table II , which lists designs identified by BiMADS. Similarly to the software implementation of the fast gradient algorithm, the weight matrix parameter qratio is the same for all designs and equal to the lower bound for this parameter. Other design parameters, including the hardware parameter N f rac , are changing along the Pareto frontier, trading off resource usage against performance.
Given the Pareto frontier ( Figure 8 ) a designer will be able to select a particular implementation based on the available FPGA resources (RF P GA). For example:
• For RF P GA = 0.1, LHS-based exploration achieves a settling time of 10.87 s, while optimization-based design allows settling of the plant in 8.71 s.
• Tightening the resource limit to RF P GA = 0.05 leads to sampling times of 19.31 s and 10.56 s for LHS-and optimizationbased approaches accordingly.
VI. SUMMARY AND FUTURE WORK
This paper proposed automating predictive control design by employing systematic optimization. Several contradicting design objectives were discussed and a bi-objective optimization problem formulation was proposed. Decision variables for the resulting optimization problem include both software and hardware parameters, which allows simultaneous optimization of both software and hardware and hence exploiting coupling between these components. It was shown that the bi-objective optimization-based design outperforms conventional exploration techniques and allows systematic investigation of resource-performance trade offs.
Future work includes exploiting the coupling between the nature of the design problem and design optimization algorithm. For instance, closed-loop cost function properties, which were recently studied in [18] , can be used for building and updating a surrogate model and hence reducing the number of required simulations [28] . Another direction for further research is deriving analytical bounds for objective functions in order to eliminate unpromising designs without actual function evaluations. 
