We propose an implementation of an interior-point-based nonlinear predictive controller on a heterogeneous processor. The workload can be split between a general-purpose CPU and a fieldprogrammable gate array to trade off the contradicting design objectives of control performance and computational resource usage. A new way of exploiting the structure of the KKT matrix yields significant memory savings. We report an 18x memory saving, compared to existing approaches, and a 36x speedup over a software implementation with an ARM Cortex-A9 processor. We also introduce a new release of Protoip, which abstracts low-level details of heterogeneous programming and allows processor-in-the-loop verification.
Introduction
Explicit performance optimization, systematic constraint handling and the inherent capability of dealing with nonlinearities are the main features that explain the success of Model Predictive Control (MPC) in recent decades [26] . In the MPC framework a dynamic optimization problem is solved at each sampling instant, which might restrict the application scope to systems with slow dynamics and/or render expensive implementations limited to high-performance computers.
Conventionally these challenges were addressed on the algorithmic and software levels by developing new optimization problem formulations and generating hardware-efficient code [13] . However, in addition to improvements on the software side, recent developments in reconfigurable computing allowed acceleration of predictive control algorithms on Field-Programmable Gate Arrays (FPGAs), which resulted in low-cost and resource-efficient realizations of custom quadratic programming (QP) solvers for MPC [16, 11] . Extending a hardware acceleration approach from linear to nonlinear model predictive control (NMPC), which can be considered as the next logic step, requires mapping numerical integration algorithms on hardware, which is not a trivial task, since dynamic models that describe the physical world are problem-dependent. As a result, instead of using systematic dynamic optimization, existing FPGA implementations of NMPC either rely on stochastic optimization [32] or approximate the offline solution with machine learning techniques [3] . These approaches cannot guarantee scalability nor closed-loop performance. Accelerating deterministic algorithms on hardware might be achieved by employing heterogeneous computing platforms that involve both a general-purpose processor with a fixed architecture and FPGA logic. For example, [25] present a heterogeneous implementation of a multiple-shooting based NMPC algorithm. The authors propose implementing integration in software while accelerating a fast gradient-based quadratic programming solver on an FPGA. The reported speedup of the heterogeneous implementation over a software realization is 1.6x and further improvement is limited, since integration and optimization algorithms have comparable computational complexity. This is a consequence of Amdahl's law [1] , which states that an algorithm's speedup is limited by the part of the workload that cannot benefit from acceleration.
We present a new heterogeneous implementation of nonlinear interior-point algorithm for predictive control, that was first introduced in [19] . The main features of the proposed implementation are:
• A method for scheduling sparse matrix-vector multiplication within an iterative linear system solvers to enable significant improvements in terms of computation time vs resource usage. For the example considered, an 18x memory saving compared to existing approaches and a 36x speedup over a software implementation are reported.
• Flexible splitting of the algorithmic workload between software and hardware for trading off the computational resource usage against performance.
In addition to the initial results presented in [19] , this paper presents the following extensions:
• The whole family of implicit and explicit Runge-Kutta methods are supported for ordinary differential equations integration. In contrast, the initial implementation was limited to Euler method only.
• The optimal control objective is generalized from a quadratic function to nonlinear least squares.
• The proposed controller is experimentally validated in the loop with a gantry crane highfidelity Simscape [22] model. In [19] the controller was only validated in the loop with a nominal model.
Another contribution of the paper is a new release of the Protoip software tool [30] . Protoip allows quick prototyping and processor-in-the-loop verification of optimization algorithms on a Xilinx Zynq system-on-a-chip (SoC), which contains an ARM processor and FPGA fabric. In contrast to the previous releases, which were focused on pure FPGA implementations, the new version of Protoip allows the incorporation of both an ARM processor and FPGA. Protoip can be used both for quick testing of the proposed implementation from the MATLAB environment and for design and verification of other heterogeneous implementations.
The remainder of the paper is organised as follows: Section 2 describes the considered optimal control problem formulation; existing NMPC algorithms, with a focus on suitability for hardware implementation, are reviewed in Section 3; in Section 4 a heterogeneous computer-based implementation of NMPC is presented, followed by experimental setup description (Section 5) and experimental results (Section 6). Note, that Protoip is a part of the experimental setup and hence presented in the corresponding section. Section 7 concludes the paper.
Optimal Control Problem formulation
We consider nonlinear time-invariant systems that can be described as an ordinary differential equation (ODE) of the formẋ
where
We consider the nonlinear optimal control problem (OCP) with initial statex and prediction horizon T :
where h :
Slack trajectory s and slack variable s T are introduced alongside with equations (2d) and (2e), which is a common technique that allows for the handling of general nonlinear inequality constraints [31] . The presented formulation can be generalized for time-varying reference tracking, which, as will be shown later, requires only changing the software part of the algorithm.
Nonlinear predictive control algorithms
Direct solution of the continuous-time optimal control problem (2) involves two main stages: integration, i.e. solving the ordinary differential equation (ODE), and optimization. Implementing integration on an FPGA is not desirable because of the following reasons: • The ODE (2c) may involve mathematical expressions (e.g. sine and square root) that in contrast to standard addition and multiplication operations, require significant amounts of computational resources ( Figure 1 ) and might be unsuitable for pipelining.
• A vector function f c is composed of scalar functions that might have different underlying mathematical operations and different evaluation complexities, i.e. f c might have irregular structure. Irregularity potentially limits reusing computational logic and speedup by parallelization.
Optimization algorithms, on the other hand, can benefit from hardware acceleration due to (i) their iterative nature, which is beneficial for reusing computational logic, and (ii) the fact that underlying linear algebra algorithms can be efficiently mapped onto hardware [16] .
Taking the above into account we consider two classes of algorithms for solving (2): shootingbased and direct transcription algorithms [4] . The common feature of shooting methods is decoupling the ODE and optimization solvers. Accelerating only the latter does not result in significant improvements, due to Amdahl's law, since the workloads of the two operations are comparable. In contrast, direct transcription algorithms transform (2) directly to a discrete OCP by approximating the ODE with algebraic equations based on numerical integration equations, i.e. 
where N and T s , respectively, denote the horizon length and the sampling time, r k = r
nl is a vector of integrator intermediate stages and l is the number of integrator stages per sampling instant.
With a trapezoidal integrator, (3d) would be given by:
The OCP (3) can be transformed into an NLP of the form
Note that J is a matrix of zeros and positive/negative ones since (3) contains only bound inequalities.
NLP (5) incorporates both integration and optimization, which potentially opens the possibility of accelerating both subproblems. Primal-dual interior-point algorithms have proven their efficiency for the numerical solution of optimal control problems [28] . Moreover, with interior-point algorithms, the structure of the KKT matrix associated with the OCP remains fixed (unlike with active set methods), which is desirable for hardware implementations [16] .
The next section introduces a structure-exploiting heterogeneous implementation of an interiorpoint algorithm for solving (5) . The implementation supports the use of Runge-Kutta family integrators for temporal discretization, considering Butcher tableau as a design parameter. However, results for memory saving, algorithm acceleration and closed loop simulations are obtained with a trapezoidal integrator (4), which belongs to the family of explicit integrators and hence allows efficient handling of stiff ODEs.
Algorithm and implementation details 4.1 Primal-dual interior-point algorithm
Interior-point algorithms solve NLPs by incorporating inequality constraints into the objective function using a logarithmic barrier function scaled with a barrier parameter. The modified problem is solved by performing consecutive Newton steps with or without globalization. Algorithm 1 and A outline a primal-dual interior-point algorithm for solving (5) . The algorithm computes a barrier parameter based on the current complementary value and reduction parameter, which is one of the simplest and most common approaches [23] . Furthermore, there is no globalization strategy incorporated in the algorithm. This choice is justified by the fact that line search and trust region globalization algorithms involve repetitive evaluation of f (θ) and p(θ), which, as was previously mentioned, often have irregular structure and hence discourage efficient acceleration. An extensive comparative study of barrier update strategies and globalization techniques for nonlinear interior-point methods can be found in [23] .
Another algorithmic choice that requires justification is the Hessian approximation method. Most of existing interior-point algorithms for large scale optimization use either Broyden-FletcherGoldfarb-Shanno (BFGS) [7] or Gauss-Newton approximators. The BFGS algorithm, if applied blockwise, can achieve a block diagonal structure for the Hessian approximation matrix when used for optimal control applications [15] . The Gauss-Newton algorithm, in addition, preserves sparsity
7:
8: within each block, which is particularly beneficial taking into account scratchpad memory limitations of reconfigurable computers. However, memory savings come at the price of narrowing the application scope to least-squares types of problems. In this work we accept this limitation and use the Gauss-Newton approach.
The main workload of the algorithm is concentrated in solving the system of linear equations. The matrix associated with the problem is symmetric and can be reordered to achieve the structure presented in Figure 2 . A symmetric system of linear equations can be solved using direct methods, e.g. LDL decomposition, or iterative methods, e.g. Minimum Residual (MINRES) algorithm [24] . Decomposition algorithms often involve many division computations, which are more complicated from a hardware implementation point of view compared to addition and multiplication. Moreover, parallelizing computations is not straightforward with direct algorithms. In contrast, iterative methods mainly rely on matrix-vector multiplications, while having very few division and square root computations. In this work we use the MINRES algorithm, which can be efficiently mapped onto hardware [5] and is well studied in relation to optimal control applications [28] . The algorithm performs minimization of the residual ||b − Az|| 2 over a Krylov subspace, which is constructed iter- atively by the Lanczos kernel [24] . The Lanczos kernel is based on a three-term recurrence, hence there is no need to store the entire Krylov subspace (unlike with the generalized minimal residual method [27] ). From an implementation point of view, the MINRES algorithm is mostly based on addition and multiplication operations, requiring only two scalar divisions and two scalar square root computations per iteration.
Proposed implementation
Algorithm 1 with the MINRES linear solver can be visualized with the block diagram in Figure 3 . We consider four different ways of splitting the workload between software and hardware:
• The entire algorithm is implemented in software, i.e. ARM Cortex-A9 processor. We denote this implementation by SW.
• Only the matrix-vector multiplication is accelerated in hardware and the rest is implemented in software (HG 1 ).
• The whole Lanczos kernel is accelerated in hardware and the rest is implemented in software (HG 2 ).
• The whole linear system solver is accelerated in hardware and the rest is implemented in software (HG 3 ).
Note that for all considered options, the Jacobians evaluations are implemented in software to avoid synthesising resource-consuming nonlinear operators. Operations that are implemented in hardware can be classified into:
• Scalar operations, which do not require acceleration.
• Vector-vector operations, which can be efficiently pipelined in hardware.
• Matrix-vector multiplication, which is the most resource-consuming part. Efficient implementation requires exploiting sparsity.
One possible way to exploit the structure of A k is based on considering the matrix as banded [16] . In this case the number of parallel computations is defined by the bandwidth and cannot be changed with respect to resource availability. In this work we treat the matrix as block sparse and, in addition, exploit sparsity within each block, which can result in an 18x saving in memory (Figure 4 ). Matrix-vector multiplication can potentially be parallelized by simultaneous processing of the solid line blocks in Figure 2 , which correspond to different sampling instants. However, consecutive blocks are coupled, i.e. each block accesses areas in the input vector associated with its neighbours, which results in restrictions on memory partitioning. It can be noted that blocks are coupled only via negative identity matrices. For a matrix-vector multiplication, handling negative identity matrices is Figure 2 . ODE model and OCP parameters are presented in Sections 5 and 6, trapezoidal method is used for ODE integration. Details on dense band implementation can be found in [6] . Actual memory savings may vary depending on the memory block size of a particular computing platform.
reduced to data transfer operations (with changing the sign bit) and does not require any arithmetic operators. After negative identity matrices are processed, the remaining parts (grey area) can be parallelized, since there is no overlap in accessing input vector partitions. The efficiency of multiplication also depends on how sparsity within each block is handled. For example, consider the sparse matrix in Figure 5 . Each non-zero element in the matrix corresponds to a multiply-accumulate (MAC) operation, i.e. each non-zero element is multiplied by a corresponding element of the input vector and added to the corresponding element of the output vector. Although the order of processing non-zero elements with floating point arithmetic might affect the final result, in practice this effect is assumed to be negligible. More importantly, this order has an impact on output vector read-write dependencies and hence pipelining possibilities. For the considered example, processing of a 12 cannot be started before a 11 is fully processed, since both elements require writing data to the same element of the output vector. Hence, MAC operations have to be scheduled in such a way that the distance (in terms of computer clock cycles) between processing non-zero elements from the same row is maximized. This scheduling problem can be formulated as an optimization problem max
subject to:
where d ∈ Z is a slack variable, M = {a 11 , a 12 , a 21 , a 23 , a 32 , a 44 } is the set of non-zero elements, N nz is the number of non-zero elements, t(a ij ) is the number of time instant, at which processing of a ij starts, and S sched = [t(a 11 ), t(a 12 ), t(a 21 ), t(a 23 ), t(a 32 ), t(a 44 )] T ∈ Z Nnz . Since all blocks in Figure 2 have the same sparsity pattern, the scheduling problem is solved only once during design. The resulting MAC circuit for the considered example is shown in Figure 5 . Note that due to symmetry of A k only the lower triangular part is stored. The impact of scheduling on the time taken for sparse matrix-vector multiplication is presented in Figure 6 . For this case study, problem (6) was solved using the YALMIP built-in branch and bound solver [21] . If pipelining is implemented without scheduling, the initiation interval is equal to the adder latency, which limits potential improvement. Initiation interval is defined as the number of clock cycles that elapses between starting processing two successive non-zeros. With a systematic scheduling approach, read-write dependencies are handled optimally, which allows one to start processing a new non-zero element after the previous as quickly as possible, i.e. achieving the minimal initiation interval. For the ODE model presented in Section 5 it was possible to achieve initiation interval of one clock cycle.
Although matrix-vector multiplication can be parallelized by allocating a MAC unit to each grey block (Figure 2 ), we propose trading off computation time against resource usage by varying the number of MAC units, as shown in Figure 7 . The number of MAC units is denoted by P .
Limitations of the proposed implementation
We highlight certain limitations of the proposed implementation:
• For matrix-vector multiplication, input and output vectors have to be partitioned in accordance with the matrix Figure 2 , which, depending on the parallelism level P , might increase FPGA block memory usage. This issue can be partially addressed by placing corresponding partitions of the matrix and input vector in the same memory block, which will not affect performance, provided the computing platforms supports dual-port memory blocks. Another possible solution is implementing vector partitions using lookup tables (provided partitions are sufficiently small) instead of using block memory. Figure 2 ). It is assumed that the latency for the addition and multiplication units are 6 and 5 clock cycles, respectively. The number of nonzero elements is 161, which is the case for the example described in Sections 5 and 6.
• In contrast to [6] , our implementation requires control logic, which might limit FPGA clock frequency. However, matrix-vector multiplication, being a part of a larger algorithm, is often not the main bottleneck for increasing clock frequency [5] .
• Applying a preconditioner directly to original linear system may destroy in-block sparsity and hence eliminate memory savings. To overcome this limitation, preconditioners can be applied indirectly by performing two matrix-vector multiplications per MINRES iteration so that sparsity patterns are preserved, both for preconditioner and the original matrix [10] . In this work we use a sparsity-preserving prescaler described in [17] . Although preserving sparsity, this prescaler changes negative identity matrices into general diagonal matrices, which slightly complicates the algorithm.
Experimental setup
Experimental setup represents a closed-loop system that consists of a gantry crane model and a heterogeneous computer running the predictive controller. The closed-loop simulation is managed by Protoip, a software tool for rapid prototyping of online optimization algorithms on reconfigurable computing platforms. This section will provide detailed information on the experimental setup, including a new release of Protoip, heterogeneous computer and the gantry crane model.
Protoip -a new release
The initial release of Protoip [30] represents a software tool for rapid deploying and verification of online optimization algorithms on FPGAs using hardware vendor's tools on the underlying level (Xilinx Vivado, Vivado HLS, SDK). Protoip is written in Tcl, an interpreted programming language that was designed with an emphasis on computer aided design. Matlab interface is provided to allow working in conjunction with state-of-the-art optimization and numerical integration toolboxes. The previous version of Protoip was dealing only with pure FPGA implementations, i.e. CPU was not involved in computations. The new release adds possibility of splitting the workload between CPU and FPGA within a single chip to exploit strengths of each subsystem and trade off FPGA logic usage against algorithm performance. The design flow for heterogeneous implementations consists of two parallel branches which correspond to software and hardware parts of a given algorithm accordingly (Figure 8) . Initially, the user provides two pieces of C/C++ code: software code for CPU and hardware code for FPGA. This step can be performed either manually or automated by using synthesizable code generation tools [29] . On the next stage, Protoip performs software verification of the hardware code using automatically generated testbench files, which allows detecting errors on the early stages of the design flow avoiding time consuming circuit synthesis process. Following this, FPGA circuit is synthesised and software code is compiled. On the final stage of the design flow implementation is deployed on a heterogeneous platform and Processor-in-the-loop (PIL) test is performed.
Processor-in-the-loop test implies simulating the controlled plant in Matlab environment on PC, while performing computations on an embedded platform ( Figure 9 ). Desktop PC to embedded platform communication speed and reliability might be critical for processor-in-the-loop simulations. The previous release of Protoip was relying on UPD/Ethernet interface, which is a common and relatively fast protocol, however, having no transmission guarantees. Lack of reliability might be crucial for design space space exploration, where processor-in-the-loop simulations may run autonomously for several weeks. A possible solution to this problem is employing PCI interface, which has proven to be reliable solution for FPGA to PC communication [18] . However, PCI interface is rarely available on low-cost prototyping boards. For that reason it was decided to keep using UDP/Ethernet so that no additional hardware is introduced. To improve reliability another communication layer with error checking and retransmission mechanism was implemented on the top of UDP. As a result communication reliability was improved without changing hardware requirements.
Another feature of the release is a new way of managing Vivado, Vivado HLS and Xilinx SDK projects. In particular:
• Users get flexibility of managing projects (e.g. adding new libraries, project parameters). With the previous version of Protoip only predefined set of project parameters was supported.
• A possibility of handling multiple file C/C++ projects is added. This make Protoip more friendly in relation to code generation tools. Consider [29] , a C code generation tool for model predictive control based on operator splitting methods, which uses Protoip on the underlying level and is capable generating code for FPGAs.
Summary of the new Protoip release features, i.e. contributions of this work:
• Possibility of prototyping heterogeneous implementations.
• A new communication layer on the top of UDP/Ethernet for reliable processor-in-the-loop simulations.
• A new way of managing application projects to allow working in conjunction with code generation tools.
Target heterogeneous computer
Protoip supports the entire family of Xilinx Zynq-7000 devices. The proposed implementation was tested on a Xilinx Zynq-7000 XC7Z020 SoC with dual-core ARM Cortex-A9 processor (only one core was used in this work) and Artix-7 FPGA logic that contains 53200 six-input Lookup Tables  (LUTs) and 106400 Flip-Flops (FFs). In addition to general purpose logic the Artix-7 provides special purpose units: 220 DSP blocks and 140 block RAMs with total capacity 4.9 Mb, which is a relatively small amount compared to an average modern CPU RAM. Communication between software and hardware was performed via Advanced eXtensible Interface (AXI) with burst mode support that allows reading/writing of words every clock cycle. We used the Zedboard development kit [14] , which supplements the Zynq platform with auxilary circuits to provide an easy access to the communication interfaces, including JTAG for programming FPGA and Ethernet for data exchange.
Gantry crane
For the purpose of closed-loop simulations, Simulink Multibody library-based [22] gantry crane model was used, which allowed incorporation of various types of friction including viscous, breakaway and Coulomb frictions. The setup consists of a payload (m = 0.47 kg) hanging on a rope attached to a moving cart (Figure 10 ). The system is actuated by two Pulse Width Modulation (PWM) controlled motors, which move the cart along the rail and lift/lower the load. Motors are equipped with internal velocity controllers, hence the inputs to the system are velocity setpoints. Cart position x c , rope length x l and deflection angle θ l are measured by encoders with a resolution equal to 4096 pulses per rotation. Corresponding velocities are estimated using a second order low-pass filter cascaded with a differentiator with Laplace transform
Figure 10: A schematic drawing of the gantry crane setup.
In this experiment ω = 20π rad/s, ζ = 0.7. For the purpose of digital implementation (7) is discretized using a Tustin transformation with sampling time T est = 0.01s. Note that estimator and controller can be sampled at different rates.
The motion of the crane with a variable cable length can be described by [9] θ l x l + 2ẋ lθl +ẍ c cos(
where g is the acceleration of gravity. It is assumed that the dynamics between the motor velocity setpoints and corresponding velocities are described by the first-order lag relation with Laplace transform
where τ is the time constant. The following state-space model can be obtained:
where v c and v l denote corresponding velocities. Time constants τ c = 0.13s and τ l = 0.07s were identified using sum-of-sinusoids input signal postprocessing data with the nonlinear least squares algorithm from [22] . In this work controller performance was initially verified by using the same model for predictive control and plant simulation adopting the CasADi [2] front-end of the CVODES numerical integrator [12] . Following initial verification a heterogeneous MPC controller was tested in the loop with Simscape model within the Simulink environment. Incorporating the Zynq platform into the Simulink environment was performed using the Protoip Application Programming Interface (API), which is implemented in C, and the Simulink Coder package, which allows calling external C/C++ functions. 
Experimental results
Four implementations (software only and three heterogeneous) of Algorithm 1 were deployed and tested with Protoip. Single precision floating point data arithmetic was used, clocking the CPU and FPGA at 667 MHz and 167 MHz, respectively. The following control objectives were selected:
Note that this formulation has proven to be a tuning-free solution for gantry crane control, in contrast to the standard quadratic objective approach, which often relies on a time consuming tuning process [8] . Actuation limits dictated by the motor characteristics were incorporated as input constraints:
For all designs considered T s = 100 ms, the number of interior-point iterations n iter = 15 and the number of MINRES iterations was set to be equal to corresponding linear system size. Suitability of these parameters was experimentally confirmed based on closed-loop simulation results. With the proposed implementation computation time can be traded off against resource usage either by changing the level of parallelism (i.e. the number of MAC units for matrix-vector multiplication) in the hardware accelerator or by shifting MINRES algorithm layers between software or hardware. These tradeoffs are shown in Figure 11 . Resource usage is calculated as a Euclidian norm of a vector that contains the relative utilization of different FPGA resources (LUTs, FFs, DSPs and BRAMs). For each heterogeneous implementation (HG 1 , HG 2 and HG 3 ) the parallelism factor P was varied from 1 to N . Note that although N = 10, there are only 6 designs for each implementation. This happens because some values of P are identified as inefficient at code generation stage. For this example any design with P = 6 to 9 will lead to the same performance as P = 5. We highlight some other conclusions that can be drawn based on Figure 11: • Considering execution time and resource usage as two contradicting design objectives within a multi-objective optimization problem, it can be seen that Pareto-optimal designs are achieved by splitting the workload between software and hardware in different ways. This justifies the flexibility of the software-hardware splitting of the proposed implementation.
• For HG 1 and HG 2 implementations increasing matrix-vector multiplication parallelism does not result in any speed up. This happens due to the fact that the Lanczos kernel (implemented on the FPGA) and the remaining part of the MINRES algorithm (implemented on the CPU) run in parallel and the overall execution time is defined by the slowest branch, which is the remaining part of MINRES.
• For the test case considered the maximum speedup of hardware-accelerated implementations over a pure software realization is 36x. Since this speedup comes at the price of significantly increased resource usage, it might be reasonable to compare the fastest design against other Pareto-optimal designs before making the final design choice.
Increasing the prediction horizon N , while improving closed-loop performance, normally results in implementations with high resource consumption. Consider Figure 12 that visualizes scaling of FPGA resource usage with the prediction horizon for two extreme cases: P = 1 and P = N . It can be observed that changing the degree of parallelism allows taking control over resource scaling.
Note that for sequential processing of KKT matrix blocks (P = 1) resource scaling can hardly be noticed. This happens due to the fact that the memory footprint is so small that fixed-size FPGA memory blocks are not fully filled for N = 1 nor for N = 10. Further synthesis details can be found in B. T the controller drives the system to the desired point according to the objective (11) while satisfying constraints (12).
Conclusions
This paper presented the following contributions:
• A heterogeneous implementation of an interior-point-based nonlinear predictive controller and experimental validation of the controller in the loop with a gantry crane model.
• A new release of Protoip, a software tool for quick prototyping of optimization-based controllers on heterogeneous computers. The tool can be used both for testing the proposed controller and for prototyping new algorithms.
The following conclusions can be drawn out of this work:
• Accelerating the linear algebra routines in hardware within a heterogeneous computer implementation can result in a significant speedup over a software-only implementation.
• Performance of a heterogeneous computer-based implementation can be efficiently traded off against resource usage by shifting the computational workload between the CPU and the FPGA, while varying the amount of parallelism for a given part of an algorithm might be less efficient or even completely gainless.
• Offline scheduling for sparse matrix vector multiplication allows avoiding data dependencies and hence building efficient data pipelines, which result in faster implementations. Further work will be focused on automating the design process by formulating the NMPC design problem as multi-objective optimization problem in order to identify design trade offs in a systematic way as in [20] . The list of possible design variables can include both hardware parameters (e.g. parallelization level) and software parameters (e.g. horizon length or number of iterations) parameters. Table 1 : Algorithm execution times with heterogeneous implementations. For all considered implementations the CPU clock rate is 667 MHz and the FPGA clock frequency is 166 MHz. T s = 100 ms, n iter = 15, t is the algorithm execution time.
