Nonlinear Predictive Control on a Heterogeneous Computing Platform by Khusainov, Bulat et al.
ar
X
iv
:1
71
0.
08
73
7v
1 
 [c
s.S
Y]
  2
4 O
ct 
20
17
Nonlinear Predictive Control on a Heterogeneous Computing
Platform
Bulat Khusainov∗, Eric Kerrigan∗†, Andrea Suardi∗, George Constantinides∗
∗Department of Electrical & Electronic Engineering
†Department of Aeronautics
Imperial College London
London, SW7 2AZ, UK
†corresponding author: e.kerrigan@imperial.ac.uk
April 10, 2018
Abstract
We propose an implementation of an interior-point-based nonlinear predictive controller on a
heterogeneous processor. The workload can be split between a general-purpose CPU and a field-
programmable gate array to trade off the contradicting design objectives of control performance and
computational resource usage. A new way of exploiting the structure of the KKT matrix yields
significant memory savings. We report an 18x memory saving, compared to existing approaches,
and a 36x speedup over a software implementation with an ARM Cortex-A9 processor. We also
introduce a new release of Protoip, which abstracts low-level details of heterogeneous programming
and allows processor-in-the-loop verification.
Keywords: Predictive control, Hardware-software co-design, Scheduling; FPGA; Optimization-
based control
1 Introduction
Explicit performance optimization, systematic constraint handling and the inherent capability of
dealing with nonlinearities are the main features that explain the success of Model Predictive Control
(MPC) in recent decades [26]. In the MPC framework a dynamic optimization problem is solved
at each sampling instant, which might restrict the application scope to systems with slow dynamics
and/or render expensive implementations limited to high-performance computers.
Conventionally these challenges were addressed on the algorithmic and software levels by devel-
oping new optimization problem formulations and generating hardware-efficient code [13]. However,
in addition to improvements on the software side, recent developments in reconfigurable computing
allowed acceleration of predictive control algorithms on Field-Programmable Gate Arrays (FPGAs),
which resulted in low-cost and resource-efficient realizations of custom quadratic programming (QP)
solvers for MPC [16, 11]. Extending a hardware acceleration approach from linear to nonlinear model
predictive control (NMPC), which can be considered as the next logic step, requires mapping nu-
merical integration algorithms on hardware, which is not a trivial task, since dynamic models that
describe the physical world are problem-dependent. As a result, instead of using systematic dynamic
optimization, existing FPGA implementations of NMPC either rely on stochastic optimization [32]
or approximate the offline solution with machine learning techniques [3]. These approaches cannot
guarantee scalability nor closed-loop performance.
Accelerating deterministic algorithms on hardware might be achieved by employing heteroge-
neous computing platforms that involve both a general-purpose processor with a fixed architecture
and FPGA logic. For example, [25] present a heterogeneous implementation of a multiple-shooting
based NMPC algorithm. The authors propose implementing integration in software while acceler-
ating a fast gradient-based quadratic programming solver on an FPGA. The reported speedup of
the heterogeneous implementation over a software realization is 1.6x and further improvement is
limited, since integration and optimization algorithms have comparable computational complexity.
This is a consequence of Amdahl’s law [1], which states that an algorithm’s speedup is limited by
the part of the workload that cannot benefit from acceleration.
We present a new heterogeneous implementation of nonlinear interior-point algorithm for pre-
dictive control, that was first introduced in [19]. The main features of the proposed implementation
are:
• A method for scheduling sparse matrix-vector multiplication within an iterative linear system
solvers to enable significant improvements in terms of computation time vs resource usage.
For the example considered, an 18x memory saving compared to existing approaches and a
36x speedup over a software implementation are reported.
• Flexible splitting of the algorithmic workload between software and hardware for trading off
the computational resource usage against performance.
In addition to the initial results presented in [19], this paper presents the following extensions:
• The whole family of implicit and explicit Runge-Kutta methods are supported for ordinary
differential equations integration. In contrast, the initial implementation was limited to Euler
method only.
• The optimal control objective is generalized from a quadratic function to nonlinear least
squares.
• The proposed controller is experimentally validated in the loop with a gantry crane high-
fidelity Simscape [22] model. In [19] the controller was only validated in the loop with a
nominal model.
1
Another contribution of the paper is a new release of the Protoip software tool [30]. Protoip allows
quick prototyping and processor-in-the-loop verification of optimization algorithms on a Xilinx Zynq
system-on-a-chip (SoC), which contains an ARM processor and FPGA fabric. In contrast to the
previous releases, which were focused on pure FPGA implementations, the new version of Protoip
allows the incorporation of both an ARM processor and FPGA. Protoip can be used both for
quick testing of the proposed implementation from the MATLAB environment and for design and
verification of other heterogeneous implementations.
The remainder of the paper is organised as follows: Section 2 describes the considered optimal
control problem formulation; existing NMPC algorithms, with a focus on suitability for hardware
implementation, are reviewed in Section 3; in Section 4 a heterogeneous computer-based implementa-
tion of NMPC is presented, followed by experimental setup description (Section 5) and experimental
results (Section 6). Note, that Protoip is a part of the experimental setup and hence presented in
the corresponding section. Section 7 concludes the paper.
2 Optimal Control Problem formulation
We consider nonlinear time-invariant systems that can be described as an ordinary differential equa-
tion (ODE) of the form
x˙(t) = fc(x(t), u(t)), (1)
where fc : R
n × Rm → Rn. We consider the nonlinear optimal control problem (OCP) with initial
state xˆ and prediction horizon T :
min
x,u,s
∫ T
0
||h(x(t), u(t), s(t))||22dt+ ||hT (x(T ), sT )||
2
2 (2a)
subject to: x(0) = xˆ, (2b)
x˙(t) = fc(x(t), u(t)), ∀t ∈ [0, T ] (2c)
q(x(t), u(t), s(t)) = 0, ∀t ∈ [0, T ] (2d)
qT (x(T ), sT ) = 0, (2e)
xl ≤ x(t) ≤ xu, ∀t ∈ [0, T ] (2f)
ul ≤ u(t) ≤ uu, ∀t ∈ [0, T ] (2g)
sl ≤ s(t) ≤ su, ∀t ∈ [0, T ] (2h)
xTl ≤ x(T ) ≤ xTu, (2i)
sTl ≤ sT ≤ sTu, (2j)
(2k)
where h : Rn×Rm×RnS → Rnh and hT : R
n×RnST → RnhT . Slack trajectory s and slack variable
sT are introduced alongside with equations (2d) and (2e), which is a common technique that allows
for the handling of general nonlinear inequality constraints [31]. The presented formulation can be
generalized for time-varying reference tracking, which, as will be shown later, requires only changing
the software part of the algorithm.
3 Nonlinear predictive control algorithms
Direct solution of the continuous-time optimal control problem (2) involves two main stages: in-
tegration, i.e. solving the ordinary differential equation (ODE), and optimization. Implementing
integration on an FPGA is not desirable because of the following reasons:
2
Add Mult Sin
0
500
1000
1500
2000
Lookup
tables
Add Mult Sin
0
500
1000
1500
Flip-flops
(registers)
Add Mult Sin
0
5
10
DSPs
Figure 1: FPGA resource usage for different computations with single precision floating point arith-
metic. Data is obtained from the vendor’s high-level synthesis tool, Vivado HLS.
• The ODE (2c) may involve mathematical expressions (e.g. sine and square root) that in contrast
to standard addition and multiplication operations, require significant amounts of computa-
tional resources (Figure 1) and might be unsuitable for pipelining.
• A vector function fc is composed of scalar functions that might have different underlying
mathematical operations and different evaluation complexities, i.e. fc might have irregular
structure. Irregularity potentially limits reusing computational logic and speedup by paral-
lelization.
Optimization algorithms, on the other hand, can benefit from hardware acceleration due to
(i) their iterative nature, which is beneficial for reusing computational logic, and (ii) the fact that
underlying linear algebra algorithms can be efficiently mapped onto hardware [16].
Taking the above into account we consider two classes of algorithms for solving (2): shooting-
based and direct transcription algorithms [4]. The common feature of shooting methods is decoupling
the ODE and optimization solvers. Accelerating only the latter does not result in significant im-
provements, due to Amdahl’s law, since the workloads of the two operations are comparable. In
contrast, direct transcription algorithms transform (2) directly to a discrete OCP by approximating
the ODE with algebraic equations based on numerical integration equations, i.e.
min
x0,...,xN
u0,...,uN−1
r0,...,rN−1
s0,...,sN−1,sT
N−1∑
k=0
(
Ts||h(uk, xk, sk)||
2
2
)
+ ||hT (xN , sT )||
2
2 (3a)
subject to: (3b)
x0 = xˆ, (3c)
c(xk, uk, rk) =
[
xk+1
0
]
, k = 0, . . . , N − 1 (3d)
q(xk, uk, sk) = 0, k = 0, . . . , N − 1 (3e)
qT (xN , sT ) = 0, (3f)
xl ≤ xk ≤ xu, k = 0, . . . , N − 1 (3g)
ul ≤ uk ≤ uu, k = 0, . . . , N − 1 (3h)
sl ≤ sk ≤ su, k = 0, . . . , N − 1 (3i)
xTl ≤ xN ≤ xTu, (3j)
sTl ≤ sT ≤ sTu, (3k)
(3l)
whereN and Ts, respectively, denote the horizon length and the sampling time, rk =
[
r
(1)T
k · · · r
(l)T
k
]T
∈
3
R
nl is a vector of integrator intermediate stages and l is the number of integrator stages per sampling
instant.
With a trapezoidal integrator, (3d) would be given by:
xk + Ts
(
r
(1)
k + r
(2)
k
)
/2 = xk+1 (4a)
r
(1)
k − fc(xk, uk) = 0 (4b)
r
(2)
k − fc
(
xk + Ts
(
r
(1)
k + r
(1)
k
)
/2, uk
)
= 0 (4c)
The OCP (3) can be transformed into an NLP of the form
min
θ
1
2
||f(θ)||22 (5a)
subject to p(θ) = 0, (5b)
Jθ − d ≤ 0, (5c)
where θ = [xT0 u
T
0 r
T
0 s
T
0 . . . x
T
N−1 u
T
N−1 r
T
N−1 s
T
N−1 x
T
N s
T
T ]
T . Note that J is a matrix of zeros and
positive/negative ones since (3) contains only bound inequalities.
NLP (5) incorporates both integration and optimization, which potentially opens the possibility
of accelerating both subproblems. Primal-dual interior-point algorithms have proven their efficiency
for the numerical solution of optimal control problems [28]. Moreover, with interior-point algorithms,
the structure of the KKT matrix associated with the OCP remains fixed (unlike with active set
methods), which is desirable for hardware implementations [16].
The next section introduces a structure-exploiting heterogeneous implementation of an interior-
point algorithm for solving (5). The implementation supports the use of Runge-Kutta family inte-
grators for temporal discretization, considering Butcher tableau as a design parameter. However,
results for memory saving, algorithm acceleration and closed loop simulations are obtained with
a trapezoidal integrator (4), which belongs to the family of explicit integrators and hence allows
efficient handling of stiff ODEs.
4 Algorithm and implementation details
4.1 Primal-dual interior-point algorithm
Interior-point algorithms solve NLPs by incorporating inequality constraints into the objective func-
tion using a logarithmic barrier function scaled with a barrier parameter. The modified problem
is solved by performing consecutive Newton steps with or without globalization. Algorithm 1 and
A outline a primal-dual interior-point algorithm for solving (5). The algorithm computes a bar-
rier parameter based on the current complementary value and reduction parameter, which is one
of the simplest and most common approaches [23]. Furthermore, there is no globalization strat-
egy incorporated in the algorithm. This choice is justified by the fact that line search and trust
region globalization algorithms involve repetitive evaluation of f(θ) and p(θ), which, as was pre-
viously mentioned, often have irregular structure and hence discourage efficient acceleration. An
extensive comparative study of barrier update strategies and globalization techniques for nonlinear
interior-point methods can be found in [23].
Another algorithmic choice that requires justification is the Hessian approximation method.
Most of existing interior-point algorithms for large scale optimization use either Broyden-Fletcher-
Goldfarb-Shanno (BFGS) [7] or Gauss-Newton approximators. The BFGS algorithm, if applied
blockwise, can achieve a block diagonal structure for the Hessian approximation matrix when used
for optimal control applications [15]. The Gauss-Newton algorithm, in addition, preserves sparsity
4
Algorithm 1 Primal-dual interior-point method for NLP
1: Initial point [θT[0], ν
T
[0], λ
T
[0]]
T : λ[0] > 0, Gθ[k] − d < 0
2: Reduction parameter 0 < σ ≤ 1
3: for k = 0 to niter do
4: A[k] =
[
H[k] +GW[k]G
T ∇θp
T (θ[k])
∇θp(θ[k]) 0
]
5: b[k] =
[
rdual
req
]
6: Solve A[k]z[k] = b[k] for z[k] =
[
∆θ[k]
∆ν[k]
]
7: ∆λ[k] = −Λ[k]e −G
−1(θ[k])µe−G
−1(θ[k])Λ[k]∇θg(θ[k])∆θ[k]
8: α[k] = max(0,1]α : λ[k] + α[k]∆λ[k] > 0, G(θ[k] +∆θ[k])− d < 0
9: [θT[k+1], ν
T
[k+1], λ
T
[k+1]]
T = α[k][∆θ
T
[k],∆ν
T
[k],∆λ
T
[k]]
T + [θT[k], ν
T
[k], λ
T
[k]]
T
10: end for
I −I I I I I I I I I I I I I I I I I I
−I I I I I I I I I I I I I I I I I I I
I I I I I I I I I I I I I I I I I I I
I I I I I I I I I I I I I I I I I I I
I I I I I I −I I I I I I I I I I I I I
I I I I I I I I I I I I I I I I I I I
I I I I −I I I I I I I I I I I I I I I
I I I I I I I I I I I I I I I I I I I
I I I I I I I I I I I I I I I I I I I
I I I I I I I I I I I −I I I I I I I I
I I I I I I I I I I I I I I I I I I I
I I I I I I I I I −I I I I I I I I I I
I I I I I I I I I I I I I I I I I I I
I I I I I I I I I I I I I I I I I I I
I I I I I I I I I I I I I I I I −I I I
I I I I I I I I I I I I I I I I I I I
I I I I I I I I I I I I I I −I I I I I
I I I I I I I I I I I I I I I I I I I
I I I I I I I I I I I I I I I I I I I




H0
∇F0
∇F
T
0
H1
∇F1
∇F
T
1
H2
∇F2
∇F
T
2
H3
∇F3
∇
F
T 3
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I




I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I
I • I




∗=
Figure 2: Structure of the KKT matrix assosiated with the problem (5). Hi and∇Fi denote Hessians
and Jacobians of the corresponding sampling instant in the prediction horizon.
within each block, which is particularly beneficial taking into account scratchpad memory limita-
tions of reconfigurable computers. However, memory savings come at the price of narrowing the
application scope to least-squares types of problems. In this work we accept this limitation and use
the Gauss-Newton approach.
The main workload of the algorithm is concentrated in solving the system of linear equations.
The matrix associated with the problem is symmetric and can be reordered to achieve the structure
presented in Figure 2. A symmetric system of linear equations can be solved using direct methods,
e.g. LDL decomposition, or iterative methods, e.g. Minimum Residual (MINRES) algorithm [24].
Decomposition algorithms often involve many division computations, which are more complicated
from a hardware implementation point of view compared to addition and multiplication. More-
over, parallelizing computations is not straightforward with direct algorithms. In contrast, iterative
methods mainly rely on matrix-vector multiplications, while having very few division and square
root computations. In this work we use the MINRES algorithm, which can be efficiently mapped
onto hardware [5] and is well studied in relation to optimal control applications [28]. The algorithm
performs minimization of the residual ||b−Az||2 over a Krylov subspace, which is constructed iter-
5
Figure 3: Algorithm 1 flow diagram with MINRES solver.
atively by the Lanczos kernel [24]. The Lanczos kernel is based on a three-term recurrence, hence
there is no need to store the entire Krylov subspace (unlike with the generalized minimal residual
method [27]). From an implementation point of view, the MINRES algorithm is mostly based on
addition and multiplication operations, requiring only two scalar divisions and two scalar square
root computations per iteration.
4.2 Proposed implementation
Algorithm 1 with the MINRES linear solver can be visualized with the block diagram in Figure 3.
We consider four different ways of splitting the workload between software and hardware:
• The entire algorithm is implemented in software, i.e. ARM Cortex-A9 processor. We denote
this implementation by SW.
• Only the matrix-vector multiplication is accelerated in hardware and the rest is implemented
in software (HG1).
• The whole Lanczos kernel is accelerated in hardware and the rest is implemented in software
(HG2).
• The whole linear system solver is accelerated in hardware and the rest is implemented in
software (HG3).
Note that for all considered options, the Jacobians evaluations are implemented in software to
avoid synthesising resource-consuming nonlinear operators. Operations that are implemented in
hardware can be classified into:
• Scalar operations, which do not require acceleration.
• Vector-vector operations, which can be efficiently pipelined in hardware.
• Matrix-vector multiplication, which is the most resource-consuming part. Efficient implemen-
tation requires exploiting sparsity.
One possible way to exploit the structure of Ak is based on considering the matrix as banded [16].
In this case the number of parallel computations is defined by the bandwidth and cannot be changed
with respect to resource availability. In this work we treat the matrix as block sparse and, in addition,
exploit sparsity within each block, which can result in an 18x saving in memory (Figure 4).
Matrix-vector multiplication can potentially be parallelized by simultaneous processing of the
solid line blocks in Figure 2, which correspond to different sampling instants. However, consecutive
blocks are coupled, i.e. each block accesses areas in the input vector associated with its neighbours,
which results in restrictions on memory partitioning. It can be noted that blocks are coupled only via
negative identity matrices. For a matrix-vector multiplication, handling negative identity matrices is
6
Horizon length, N
0 2 4 6 8 10 12 14 16 18 20
M
em
or
y,
 b
its
10 3
10 4
10 5
10 6
10 7
symmetric dense band
proposed approach
Figure 4: Memory requirements for the storing matrix on Figure 2. ODE model and OCP parameters
are presented in Sections 5 and 6, trapezoidal method is used for ODE integration. Details on dense
band implementation can be found in [6]. Actual memory savings may vary depending on the
memory block size of a particular computing platform.
reduced to data transfer operations (with changing the sign bit) and does not require any arithmetic
operators. After negative identity matrices are processed, the remaining parts (grey area) can be
parallelized, since there is no overlap in accessing input vector partitions.
The efficiency of multiplication also depends on how sparsity within each block is handled. For
example, consider the sparse matrix in Figure 5. Each non-zero element in the matrix corresponds to
a multiply-accumulate (MAC) operation, i.e. each non-zero element is multiplied by a corresponding
element of the input vector and added to the corresponding element of the output vector. Although
the order of processing non-zero elements with floating point arithmetic might affect the final result,
in practice this effect is assumed to be negligible. More importantly, this order has an impact on
output vector read-write dependencies and hence pipelining possibilities. For the considered example,
processing of a12 cannot be started before a11 is fully processed, since both elements require writing
data to the same element of the output vector. Hence, MAC operations have to be scheduled in such
a way that the distance (in terms of computer clock cycles) between processing non-zero elements
from the same row is maximized. This scheduling problem can be formulated as an optimization
problem
max
d,Ssched
d (6a)
subject to:
|t(aij)− t(akl)| > d ∀aij ∈M, ∀akl ∈M : i = k, j 6= l (6b)
1 ≤ t(aij) ≤ Nnz ∀aij ∈M (6c)
t(aij) 6= t(akl) ⇐⇒ i 6= k, j 6= l (6d)
where d ∈ Z is a slack variable, M = {a11, a12, a21, a23, a32, a44} is the set of non-zero elements,
Nnz is the number of non-zero elements, t(aij) is the number of time instant, at which processing
of aij starts, and Ssched = [t(a11), t(a12), t(a21), t(a23), t(a32), t(a44)]
T ∈ ZNnz . Since all blocks in
Figure 2 have the same sparsity pattern, the scheduling problem is solved only once during design.
7


•
•
•
•


︸ ︷︷ ︸
output vector
=


a11 a12 0 0
a21 0 a23 0
0 a32 0 0
0 0 0 a44


︸ ︷︷ ︸
sparse matrix


•
•
•
•


︸ ︷︷ ︸
input vector
(a)
(b)
Figure 5: An example of a sparse matrix (a) and the corresponding circuit (b) for matrix vector
multiplication.
The resulting MAC circuit for the considered example is shown in Figure 5. Note that due to
symmetry of Ak only the lower triangular part is stored.
The impact of scheduling on the time taken for sparse matrix-vector multiplication is presented
in Figure 6. For this case study, problem (6) was solved using the YALMIP built-in branch and
bound solver [21]. If pipelining is implemented without scheduling, the initiation interval is equal to
the adder latency, which limits potential improvement. Initiation interval is defined as the number
of clock cycles that elapses between starting processing two successive non-zeros. With a systematic
scheduling approach, read-write dependencies are handled optimally, which allows one to start pro-
cessing a new non-zero element after the previous as quickly as possible, i.e. achieving the minimal
initiation interval. For the ODE model presented in Section 5 it was possible to achieve initiation
interval of one clock cycle.
Although matrix-vector multiplication can be parallelized by allocating a MAC unit to each grey
block (Figure 2), we propose trading off computation time against resource usage by varying the
number of MAC units, as shown in Figure 7. The number of MAC units is denoted by P .
4.3 Limitations of the proposed implementation
We highlight certain limitations of the proposed implementation:
• For matrix-vector multiplication, input and output vectors have to be partitioned in accor-
dance with the matrix Figure 2, which, depending on the parallelism level P , might increase
FPGA block memory usage. This issue can be partially addressed by placing corresponding
partitions of the matrix and input vector in the same memory block, which will not affect
performance, provided the computing platforms supports dual-port memory blocks. Another
possible solution is implementing vector partitions using lookup tables (provided partitions are
sufficiently small) instead of using block memory.
8
Co
m
pu
ta
tio
n 
tim
e,
 c
lo
ck
 c
yc
le
s
0
500
1000
1500
2000
No pipelining
Pipelining without scheduling
Pipelining with scheduling
Figure 6: The impact of scheduling on computation time of a sparse matrix-vector multiplication
(for each gray block in Figure 2). It is assumed that the latency for the addition and multiplication
units are 6 and 5 clock cycles, respectively. The number of nonzero elements is 161, which is the
case for the example described in Sections 5 and 6.
• In contrast to [6], our implementation requires control logic, which might limit FPGA clock
frequency. However, matrix-vector multiplication, being a part of a larger algorithm, is often
not the main bottleneck for increasing clock frequency [5].
• Applying a preconditioner directly to original linear system may destroy in-block sparsity
and hence eliminate memory savings. To overcome this limitation, preconditioners can be
applied indirectly by performing two matrix-vector multiplications per MINRES iteration so
that sparsity patterns are preserved, both for preconditioner and the original matrix [10]. In
this work we use a sparsity-preserving prescaler described in [17]. Although preserving sparsity,
this prescaler changes negative identity matrices into general diagonal matrices, which slightly
complicates the algorithm.
5 Experimental setup
Experimental setup represents a closed-loop system that consists of a gantry crane model and a
heterogeneous computer running the predictive controller. The closed-loop simulation is managed
by Protoip, a software tool for rapid prototyping of online optimization algorithms on reconfigurable
computing platforms. This section will provide detailed information on the experimental setup,
including a new release of Protoip, heterogeneous computer and the gantry crane model.
5.1 Protoip - a new release
The initial release of Protoip [30] represents a software tool for rapid deploying and verification of
online optimization algorithms on FPGAs using hardware vendor’s tools on the underlying level
(Xilinx Vivado, Vivado HLS, SDK). Protoip is written in Tcl, an interpreted programming language
that was designed with an emphasis on computer aided design. Matlab interface is provided to allow
working in conjunction with state-of-the-art optimization and numerical integration toolboxes. The
previous version of Protoip was dealing only with pure FPGA implementations, i.e. CPU was not
involved in computations. The new release adds possibility of splitting the workload between CPU
and FPGA within a single chip to exploit strengths of each subsystem and trade off FPGA logic usage
against algorithm performance. The design flow for heterogeneous implementations consists of two
9
Number of floating point units
0 50 100 150 200
Co
m
pu
ta
tio
n 
tim
e,
 c
lo
ck
 c
yc
le
s
0
1000
2000
3000
4000
5000
← P = 1
← P = 2
← P = N = 20
Proposed approach
Existing implementation
Figure 7: Trading off computational time against processor resource usage for matrix-vector mul-
tiplication (Figure 2). Each circle represents a design with a certain degree of parallelism. The
floating point unit is either floating point addition or multiplication. For the considered test case,
N = 20 and the ODE model is given in Section 5. It is assumed that the latency of the addition and
multiplication units are 6 and 5 clock cycles, respectively. Details on the existing implementation
can be found in Boland and Constantinides [6].
Figure 8: SoC design flow with Protoip.
Figure 9: Processor-in-the-loop test with Protoip.
parallel branches which correspond to software and hardware parts of a given algorithm accordingly
(Figure 8). Initially, the user provides two pieces of C/C++ code: software code for CPU and
hardware code for FPGA. This step can be performed either manually or automated by using
synthesizable code generation tools [29]. On the next stage, Protoip performs software verification
of the hardware code using automatically generated testbench files, which allows detecting errors
on the early stages of the design flow avoiding time consuming circuit synthesis process. Following
this, FPGA circuit is synthesised and software code is compiled. On the final stage of the design
flow implementation is deployed on a heterogeneous platform and Processor-in-the-loop (PIL) test
is performed.
Processor-in-the-loop test implies simulating the controlled plant in Matlab environment on PC,
while performing computations on an embedded platform (Figure 9). Desktop PC to embedded
platform communication speed and reliability might be critical for processor- in-the-loop simula-
tions. The previous release of Protoip was relying on UPD/Ethernet interface, which is a common
and relatively fast protocol, however, having no transmission guarantees. Lack of reliability might
10
be crucial for design space space exploration, where processor-in- the-loop simulations may run au-
tonomously for several weeks. A possible solution to this problem is employing PCI interface, which
has proven to be reliable solution for FPGA to PC communication [18]. However, PCI interface
is rarely available on low-cost prototyping boards. For that reason it was decided to keep using
UDP/Ethernet so that no additional hardware is introduced. To improve reliability another com-
munication layer with error checking and retransmission mechanism was implemented on the top of
UDP. As a result communication reliability was improved without changing hardware requirements.
Another feature of the release is a new way of managing Vivado, Vivado HLS and Xilinx SDK
projects. In particular:
• Users get flexibility of managing projects (e.g. adding new libraries, project parameters). With
the previous version of Protoip only predefined set of project parameters was supported.
• A possibility of handling multiple file C/C++ projects is added. This make Protoip more
friendly in relation to code generation tools. Consider [29], a C code generation tool for model
predictive control based on operator splitting methods, which uses Protoip on the underlying
level and is capable generating code for FPGAs.
Summary of the new Protoip release features, i.e. contributions of this work:
• Possibility of prototyping heterogeneous implementations.
• A new communication layer on the top of UDP/Ethernet for reliable processor-in-the-loop
simulations.
• A new way of managing application projects to allow working in conjunction with code gen-
eration tools.
5.2 Target heterogeneous computer
Protoip supports the entire family of Xilinx Zynq-7000 devices. The proposed implementation was
tested on a Xilinx Zynq-7000 XC7Z020 SoC with dual-core ARM Cortex-A9 processor (only one
core was used in this work) and Artix-7 FPGA logic that contains 53200 six-input Lookup Tables
(LUTs) and 106400 Flip-Flops (FFs). In addition to general purpose logic the Artix-7 provides
special purpose units: 220 DSP blocks and 140 block RAMs with total capacity 4.9Mb, which is
a relatively small amount compared to an average modern CPU RAM. Communication between
software and hardware was performed via Advanced eXtensible Interface (AXI) with burst mode
support that allows reading/writing of words every clock cycle. We used the Zedboard development
kit [14], which supplements the Zynq platform with auxilary circuits to provide an easy access to the
communication interfaces, including JTAG for programming FPGA and Ethernet for data exchange.
5.3 Gantry crane
For the purpose of closed-loop simulations, Simulink Multibody library-based [22] gantry crane model
was used, which allowed incorporation of various types of friction including viscous, breakaway and
Coulomb frictions. The setup consists of a payload (m = 0.47 kg) hanging on a rope attached to a
moving cart (Figure 10). The system is actuated by two Pulse Width Modulation (PWM) controlled
motors, which move the cart along the rail and lift/lower the load. Motors are equipped with internal
velocity controllers, hence the inputs to the system are velocity setpoints. Cart position xc, rope
length xl and deflection angle θl are measured by encoders with a resolution equal to 4096 pulses
per rotation. Corresponding velocities are estimated using a second order low-pass filter cascaded
with a differentiator with Laplace transform
Yest(s) =
sω2
s2 + 2ωζ + ω2
. (7)
11
Figure 10: A schematic drawing of the gantry crane setup.
In this experiment ω = 20pi rad/s, ζ = 0.7. For the purpose of digital implementation (7) is
discretized using a Tustin transformation with sampling time Test = 0.01s. Note that estimator and
controller can be sampled at different rates.
The motion of the crane with a variable cable length can be described by [9]
θ¨lxl + 2x˙lθ˙l + x¨c cos(θl) + g sin(θl) = 0, (8)
where g is the acceleration of gravity.
It is assumed that the dynamics between the motor velocity setpoints and corresponding velocities
are described by the first-order lag relation with Laplace transform
Ymotor(s) =
1
τs+ 1
, (9)
where τ is the time constant. The following state-space model can be obtained:
x˙c = vc (10a)
v˙c =
−vc + uc
τc
(10b)
x˙l = vl (10c)
v˙l =
−vl + ul
τl
(10d)
θ˙l = θl (10e)
θ¨l =
1
xl
(
−vc + uc
τc
cos(θl) + g sin(θl) + 2vlθ˙l
)
(10f)
where vc and vl denote corresponding velocities. Time constants τc = 0.13s and τl = 0.07s were
identified using sum-of-sinusoids input signal postprocessing data with the nonlinear least squares
algorithm from [22].
In this work controller performance was initially verified by using the same model for predictive
control and plant simulation adopting the CasADi [2] front-end of the CVODES numerical inte-
grator [12]. Following initial verification a heterogeneous MPC controller was tested in the loop
with Simscape model within the Simulink environment. Incorporating the Zynq platform into the
Simulink environment was performed using the Protoip Application Programming Interface (API),
which is implemented in C, and the Simulink Coder package, which allows calling external C/C++
functions.
12
FPGA Resource usage
0 0.05 0.1 0.15 0.2 0.25 0.3
Al
go
rit
hm
 e
xe
cu
tio
n 
tim
e,
 m
s
10 1
10 2
10 3
10 4
ւP=1 ւP=N=10
P=1 ր P=N=10 ր
P=1 ր
P=N=10 ր
SW
HG1
HG2
HG3
Figure 11: Trading off algorithm execution
time against resource usage by splitting the
workload between software and hardware.
For all implementations N = 10.
Horizon length, N
0 5 10
F
P
G
A
re
so
u
rc
e
u
sa
ge
0.25
0.3
0.35
0.4
0.45
0.5
0.55
P = 1
P = N
Figure 12: The impact of horizon length on
FPGA resource usage with HG3 implementa-
tion.
6 Experimental results
Four implementations (software only and three heterogeneous) of Algorithm 1 were deployed and
tested with Protoip. Single precision floating point data arithmetic was used, clocking the CPU and
FPGA at 667MHz and 167MHz, respectively. The following control objectives were selected:
h(u, x, s) =


xc + xl sin(θl)
xl cos(θl)− 0.5
θ˙l
10−4uc
10−4ul

 , hT (xN , sT ) =

 xc + xl sin(θl)xl cos(θl)− 0.5
θ˙l

 . (11)
Note that this formulation has proven to be a tuning-free solution for gantry crane control, in
contrast to the standard quadratic objective approach, which often relies on a time consuming
tuning process [8]. Actuation limits dictated by the motor characteristics were incorporated as
input constraints:
− 0.15 ≤ uc ≤ 0.15 (12a)
− 0.15 ≤ ul ≤ 0.15 (12b)
For all designs considered Ts = 100 ms, the number of interior-point iterations niter = 15 and the
number of MINRES iterations was set to be equal to corresponding linear system size. Suitability
of these parameters was experimentally confirmed based on closed-loop simulation results.
With the proposed implementation computation time can be traded off against resource usage
either by changing the level of parallelism (i.e. the number of MAC units for matrix-vector multi-
plication) in the hardware accelerator or by shifting MINRES algorithm layers between software or
hardware. These tradeoffs are shown in Figure 11. Resource usage is calculated as a Euclidian norm
of a vector that contains the relative utilization of different FPGA resources (LUTs, FFs, DSPs and
BRAMs). For each heterogeneous implementation (HG1, HG2 and HG3) the parallelism factor P
was varied from 1 to N . Note that although N = 10, there are only 6 designs for each implementa-
tion. This happens because some values of P are identified as inefficient at code generation stage.
For this example any design with P = 6 to 9 will lead to the same performance as P = 5. We
highlight some other conclusions that can be drawn based on Figure 11:
13
• Considering execution time and resource usage as two contradicting design objectives within a
multi-objective optimization problem, it can be seen that Pareto-optimal designs are achieved
by splitting the workload between software and hardware in different ways. This justifies the
flexibility of the software-hardware splitting of the proposed implementation.
• For HG1 and HG2 implementations increasing matrix-vector multiplication parallelism does
not result in any speed up. This happens due to the fact that the Lanczos kernel (implemented
on the FPGA) and the remaining part of the MINRES algorithm (implemented on the CPU)
run in parallel and the overall execution time is defined by the slowest branch, which is the
remaining part of MINRES.
• For the test case considered the maximum speedup of hardware-accelerated implementations
over a pure software realization is 36x. Since this speedup comes at the price of significantly
increased resource usage, it might be reasonable to compare the fastest design against other
Pareto-optimal designs before making the final design choice.
Increasing the prediction horizon N , while improving closed-loop performance, normally results in
implementations with high resource consumption. Consider Figure 12 that visualizes scaling of
FPGA resource usage with the prediction horizon for two extreme cases: P = 1 and P = N . It
can be observed that changing the degree of parallelism allows taking control over resource scaling.
Note that for sequential processing of KKT matrix blocks (P = 1) resource scaling can hardly be
noticed. This happens due to the fact that the memory footprint is so small that fixed-size FPGA
memory blocks are not fully filled for N = 1 nor for N = 10. Further synthesis details can be found
in B.
Finally we present closed-loop simulation results obtained with a HG3 implementation in the loop
with Simscape model (Figure 13). Starting from the initial condition xˆ = [0.5 0 0.7 0 − 0.2 − 0.5 −
0.15 − 0.15]T the controller drives the system to the desired point according to the objective (11)
while satisfying constraints (12).
7 Conclusions
This paper presented the following contributions:
• A heterogeneous implementation of an interior-point-based nonlinear predictive controller and
experimental validation of the controller in the loop with a gantry crane model.
• A new release of Protoip, a software tool for quick prototyping of optimization-based controllers
on heterogeneous computers. The tool can be used both for testing the proposed controller
and for prototyping new algorithms.
The following conclusions can be drawn out of this work:
• Accelerating the linear algebra routines in hardware within a heterogeneous computer imple-
mentation can result in a significant speedup over a software-only implementation.
• Performance of a heterogeneous computer-based implementation can be efficiently traded off
against resource usage by shifting the computational workload between the CPU and the
FPGA, while varying the amount of parallelism for a given part of an algorithm might be less
efficient or even completely gainless.
• Offline scheduling for sparse matrix vector multiplication allows avoiding data dependencies
and hence building efficient data pipelines, which result in faster implementations.
14
0 2 4 6 8 10
x
c
,
m
0
0.5
1
0 2 4 6 8 10
v
c
,
m
/
s
-0.2
0
0.2
0 2 4 6 8 10
x
l
,
m
0.4
0.5
0.6
t, s
0 2 4 6 8 10
v
l
,
m
/
s
-0.2
0
0.2
0 2 4 6 8 10
θ
l
,
ra
d
-0.5
0
0.5
0 2 4 6 8 10
θ˙
l
,
ra
d
/
s
-2
0
2
0 2 4 6 8 10
u
c
,
m
/
s
-0.4
-0.2
0
0.2
t, s
0 2 4 6 8 10
u
l
,
m
/
s
-0.4
-0.2
0
0.2
Figure 13: Processor-in-the-loop simulation results with HG3 implementation. N = 10, P = 10.
Further work will be focused on automating the design process by formulating the NMPC de-
sign problem as multi-objective optimization problem in order to identify design trade offs in a
systematic way as in [20]. The list of possible design variables can include both hardware parame-
ters (e.g. parallelization level) and software parameters (e.g. horizon length or number of iterations)
parameters.
8 Acknowledgements
This work was funded from the People Programme (Marie Curie Actions) of the European Union’s
Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no 607957 (TEMPO).
The authors also would like to acknowledge the discussions with Mr Junyi Liu, Mr Andrea Zanelli,
Dr Rien Quirynen and Dr Milan Vukov, and industrial support from Mathworks.
Appendix A Remaining details for Algorithm 1
g(θ[k]) := Jθ[k] − d
W[k] := Λ[k]G
−1(θ[k]),
where Λ[k] and G(θ[k]) are diagonal matrices containing the elements of λ[k] and g(θ[k]) respectively.
15
H[k] := ∇θf
T (θ[k])∇θf(θ[k]), µ = −σλ
T
[k]g(θ[k])
rdual := ∇θg
T (θ[k])G
−1(θ[k])µe−∇θf
T (θ[k])∇θf(θ[k])−∇θp
T (θ[k])ν, req := −p(θ[k])
Appendix B Synthesis results
The synthesis results are presented in Table 1.
References
[1] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing
capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, AFIPS
’67 (Spring), pages 483–485, New York, NY, USA, 1967. ACM. doi: 10.1145/1465482.1465560.
URL http://doi.acm.org/10.1145/1465482.1465560.
[2] Joel Andersson. A General-Purpose Software Framework for Dynamic Optimization. PhD the-
sis, Arenberg Doctoral School, KU Leuven, Department of Electrical Engineering (ESAT/SCD)
and Optimization in Engineering Center, Kasteelpark Arenberg 10, 3001-Heverlee, Belgium,
October 2013.
[3] H. Ayala, R. Sampaio, D. M. Mun˜oz, C. Llanos, L. Coelho, and R. Jacobi. Nonlinear model
predictive control hardware implementation with custom-precision floating point operations. In
2016 24th Mediterranean Conference on Control and Automation (MED), pages 135–140, June
2016. doi: 10.1109/MED.2016.7535908.
[4] J.T. Betts. Practical Methods for Optimal Control Using Nonlinear Programming. Society for
Industrial and Applied Mathematics, second edition, 2010.
[5] David Boland and George A Constantinides. An FPGA-based implementation of the minres
algorithm. In 2008 International Conference on Field Programmable Logic and Applications,
pages 379–384. IEEE, 2008.
[6] David Boland and George A. Constantinides. Optimizing memory bandwidth use and per-
formance for matrix-vector multiplication in iterative methods. ACM Trans. Reconfigurable
Technol. Syst., 4(3):22:1–22:14, August 2011. ISSN 1936-7406. doi: 10.1145/2000832.2000834.
URL http://doi.acm.org/10.1145/2000832.2000834.
[7] William C. Davidon. Variable metric method for minimization. SIAM Journal on Optimization,
1(1):1–17, 1991. doi: 10.1137/0801001. URL http://dx.doi.org/10.1137/0801001.
[8] Frederik Debrouwere, Milan Vukov, Rien Quirynen, Moritz Diehl, and Jan Swevers. Ex-
perimental validation of combined nonlinear optimal control and estimation of an over-
head crane. IFAC Proceedings Volumes, 47(3):9617 – 9622, 2014. ISSN 1474-6670. doi:
10.3182/20140824-6-ZA-1003.01674. 19th IFAC World Congress.
[9] M. Fliess, J. Levine, and P. Rouchon. A simplified approach of crane control via a generalized
state-space model. In [1991] Proceedings of the 30th IEEE Conference on Decision and Control,
pages 736–741 vol.1, Dec 1991. doi: 10.1109/CDC.1991.261409.
16
T
a
b
le
1
:
A
lg
o
rith
m
ex
ecu
tio
n
tim
es
w
ith
h
etero
g
en
eo
u
s
im
p
lem
en
ta
tio
n
s.
F
o
r
a
ll
co
n
sid
ered
im
p
le-
m
en
ta
tio
n
s
th
e
C
P
U
clo
ck
ra
te
is
6
6
7
M
H
z
a
n
d
th
e
F
P
G
A
clo
ck
freq
u
en
cy
is
1
6
6
M
H
z.
T
s
=
1
0
0
m
s,
n
ite
r
=
1
5
,
t
is
th
e
a
lg
o
rith
m
ex
ecu
tio
n
tim
e.
N P SW HG1 HG2 HG3
t, ms t, ms LUT FF DSP
block
RAM
t, ms LUT FF DSP
block
RAM
t, ms LUT FF DSP
block
RAM
1 1 72 34 2242 3242 8 11 20 6236 8429 20 26 9 8469 11723 43 33
2 1 201 94 2653 3726 10 12 49 6684 8793 20 26 19 8868 12071 43 33
2 2 - 94 2727 3859 10 15 49 6745 8972 23 33 16 8960 12136 43 40
3 1 406 180 2646 3836 10 11 92 6502 8758 20 26 31 8910 12084 43 33
3 3 - 180 3249 4485 15 19 91 7080 9202 26 40 24 9277 12264 43 47
4 1 654 288 2637 3868 10 11 140 6542 8817 20 26 47 8786 12143 43 33
4 4 - 290 3689 5059 20 25 141 7617 9852 31 47 32 9636 12472 43 54
5 1 967 421 2576 3817 10 12 202 6432 8762 20 27 65 8976 12062 43 34
5 5 - 421 4006 5714 25 31 201 8203 10706 36 55 42 10037 12851 43 62
6 1 1350 587 2547 3821 10 12 276 6439 8759 20 27 87 8671 12064 43 34
6 6 - 594 4424 6266 30 36 276 8705 1135 41 62 51 10542 13154 46 69
7 1 1812 781 2582 3849 10 13 362 6503 8814 20 28 112 9064 12172 43 35
7 7 - 783 4841 6931 35 41 360 9324 12100 46 69 62 11077 13914 51 76
8 1 2323 994 2564 3850 10 13 460 6499 8820 20 28 141 8820 12172 43 35
8 8 - 998 5200 7488 40 46 459 9820 12759 51 76 73 11489 14573 56 83
9 1 2877 1275 2590 3863 10 13 571 6503 8825 20 28 172 9093 12179 43 35
9 9 - 1275 5651 8045 45 51 571 10461 13401 56 83 84 12182 15222 61 90
10 1 3523 1524 2616 3865 10 13 694 6499 8828 20 28 205 8787 12174 43 35
10 10 - 1524 6005 8593 50 56 694 10906 14044 61 90 96 12693 15864 66 97
[1
0
]
A
.
G
reen
b
a
u
m
.
Itera
tive
M
eth
od
s
fo
r
S
o
lvin
g
L
in
ea
r
S
ystem
s.
S
o
ciety
fo
r
In
-
d
u
stria
l
a
n
d
A
p
p
lied
M
a
th
em
a
tics,
1
9
9
7
.
d
o
i:
1
0
.1
1
3
7
/
1
.9
7
8
1
6
11
97
09
37
.
U
R
L
h
t
t
p
:
/
/
e
p
u
b
s
.
s
i
a
m
.
o
r
g
/
d
o
i
/
a
b
s
/
1
0
.
1
1
3
7
/
1
.
9
7
8
1
6
1
1
9
7
0
9
3
7.
1
7
[11] E.N. Hartley, J.L. Jerez, A. Suardi, Jan M. Maciejowski, E.C. Kerrigan, and G. Constantinides.
Predictive Control using an FPGA with Application to Aircraft Control. IEEE Transactions
on Control Systems Technology, 22(3):1006–1017, May 2014.
[12] Alan C Hindmarsh, Peter N Brown, Keith E Grant, Steven L Lee, Radu Serban, Dan E Shu-
maker, and Carol S Woodward. SUNDIALS: Suite of nonlinear and differential/algebraic equa-
tion solvers. ACM Transactions on Mathematical Software (TOMS), 31(3):363–396, 2005.
[13] B. Houska, H.J. Ferreau, and M. Diehl. An Auto-Generated Real-Time Iteration Algorithm
for Nonlinear MPC in the Microsecond Range. Automatica, 47(10):2279–2285, 2011. doi:
10.1016/j.automatica.2011.08.020.
[14] Inc. ZedBoard (Zynq evaluation and development) hardware user’s guide, September 2012.
[15] Dennis Janka, Christian Kirches, Sebastian Sager, and Andreas Wa¨chter. An
sr1/bfgs sqp algorithm for nonconvex nonlinear programs with block-diagonal hes-
sian matrix. Mathematical Programming Computation, 8(4):435–459, 2016. URL
http://dx.doi.org/10.1007/s12532-016-0101-2.
[16] J. L. Jerez, K. V. Ling, G. A. Constantinides, and E. C. Kerrigan. Model predictive control for
deeply pipelined field-programmable gate array implementation: algorithms and circuitry. IET
Control Theory & Applications, 6:1029–1041(12), May 2012. ISSN 1751-8644.
[17] J. L. Jerez, G. A. Constantinides, and E. C. Kerrigan. A low complexity scaling method for
the lanczos kernel in fixed-point arithmetic. IEEE Transactions on Computers, 64(2):303–315,
Feb 2015. ISSN 0018-9340. doi: 10.1109/TC.2013.162.
[18] F. A. Khan, Z. Hafeez, A. Mirza, and Q. u. Ain. Design of fpga based daq card using pci express
protocol. In 2011 IEEE 14th International Multitopic Conference, pages 211–216, Dec 2011.
doi: 10.1109/INMIC.2011.6151475.
[19] B. Khusainov, E.C. Kerrigan, A. Suardi, and G.A. Constantinides. Nonlinear predic-
tive control on a heterogeneous computing platform. IFAC World Congress 2017. URL
http://hdl.handle.net/10044/1/45094.
[20] B Khusainov, EC Kerrigan, and GA Constantinides. Multi-objective co-design for model pre-
dictive control with an FPGA. In European Control Conference 2016. IEEE, 2016. URL
http://hdl.handle.net/10044/1/30637.
[21] J. Lofberg. Yalmip : a toolbox for modeling and optimization in matlab. In 2004 IEEE
International Conference on Robotics and Automation (IEEE Cat. No.04CH37508), pages 284–
289, Sept 2004. doi: 10.1109/CACSD.2004.1393890.
[22] MATLAB version 9.2 (R2017a). The Mathworks, Inc., Natick, Massachusetts, 2015.
[23] J. Nocedal, A. Wa¨chter, and R. A. Waltz. Adaptive barrier strategies for nonlinear interior
methods. Technical Report RC 23563, IBM Watson Research Center, Yorktown Heights, NY,
USA, 2005.
[24] C. C. Paige and M. A. Saunders. Solution of sparse indefinite systems of linear equations.
SIAM Journal on Numerical Analysis, 12(4):617–629, 1975. doi: 10.1137/0712047. URL
http://dx.doi.org/10.1137/0712047.
[25] Helfried Peyrl, Hans Joachim Ferreau, and Dimitris Kouzoupis. A hybrid hardware implemen-
tation for nonlinear model predictive control. IFAC-PapersOnLine, 48(23):87 – 93, 2015. ISSN
2405-8963. doi: http://dx.doi.org/10.1016/j.ifacol.2015.11.266.
18
[26] J.B. Rawlings and D.Q. Mayne. Model Predictive Control: Theory and Design. Nob Hill Pub.,
2009. ISBN 9780975937709.
[27] Youcef Saad and Martin H. Schultz. Gmres: A generalized minimal residual algorithm for
solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 7
(3):856–869, 1986. doi: 10.1137/0907058. URL http://dx.doi.org/10.1137/0907058.
[28] Amir Shahzad, Eric C. Kerrigan, and George A. Constantinides. A stable and effi-
cient method for solving a convex quadratic program with application to optimal control.
SIAM Journal on Optimization, 22(4):1369–1393, 2012. doi: 10.1137/11082960X. URL
http://dx.doi.org/10.1137/11082960X.
[29] H. Shukla, B. Khusainov, E.C. Kerrigan, and C.N. Jones. Software and hardware code gen-
eration for predictive control using splitting methods. IFAC World Congress 2017. URL
http://hdl.handle.net/10044/1/45093.
[30] A. Suardi, G. A. Constantinides, and E. C. Kerrigan. Software development kit for FPGA: A
fast FPGA prototyping tool for embedded optimization. In European Control Conference, 2015.
[31] Andreas Wa¨chter and Lorenz T. Biegler. On the implementation of an interior-point fil-
ter line-search algorithm for large-scale nonlinear programming. Mathematical Program-
ming, 106(1):25–57, 2006. ISSN 1436-4646. doi: 10.1007/s10107-004-0559-y. URL
http://dx.doi.org/10.1007/s10107-004-0559-y.
[32] F. Xu, H. Chen, X. Gong, and Q. Mei. Fast nonlinear model predictive control on FPGA using
particle swarm optimization. IEEE Transactions on Industrial Electronics, 63(1):310–321, Jan
2016. ISSN 0278-0046. doi: 10.1109/TIE.2015.2464171.
19
