Algorithms and Methods for High-Performance Model Predictive Control by Frison, Gianluca
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
Algorithms and Methods for High-Performance Model Predictive Control
Frison, Gianluca; Jørgensen, John Bagterp
Publication date: 2016
Document version: Publisher's PDF, also known as Version of record
Citation (APA):
Frison, G., & Jørgensen, J. B. (2016). Algorithms and Methods for High-Performance Model Predictive Control.
Kgs. Lyngby: Technical University of Denmark (DTU).  (DTU Compute PHD-2015; No. 402).
Algorithms and Methods
for High-Performance
Model Predictive Control
Gianluca Frison
Kongens Lyngby 2015
Technical University of Denmark
Department of Applied Mathematics and Computer Science
Richard Petersens Plads, building 324,
2800 Kongens Lyngby, Denmark
Phone +45 4525 3031
compute@compute.dtu.dk
www.compute.dtu.dk
PhD-2015-402
ISSN-0909-3192
Summary (English)
The goal of this thesis is to investigate algorithms and methods to reduce the
solution time of solvers for Model Predictive Control (MPC). The thesis is
accompanied by an open-source toolbox for High-Performance implementation
of solvers for MPC (HPMPC), which contains the source code of all routines
employed in the numerical tests. The main focus of this thesis is on linear MPC
problems.
In this thesis, the algorithms and their implementation are equally important.
Regarding the implementation, a novel implementation strategy for the dense
linear algebra routines in embedded optimization is proposed, aimed at
improving the computational performance for small matrices. The algorithms
are built on top of the proposed linear algebra routines, and they are tailored
to exploit the high-level structure of the MPC problems, with special care
taken to reduce the computational complexity.
Summary (Danish)
The goal of this thesis is to investigate algorithms and methods to reduce the
computation time of Model Predictive Control (MPC). The thesis also includes
an open-source toolbox for High-Performance implementation of solvers for
MPC (HPMPC). The toolbox contains the source code of the algorithms and the
numerical tests. The main focus of the thesis is on linear MPC.
In this thesis, the algorithms and their implementation are equally important.
On the implementation side, an implementation strategy is explored that uses
compact linear algebra routines in embedded optimization. This method
improves the performance for problems of smaller size. On the algorithm side,
the algorithms are based on the aforementioned compact linear algebra. The
algorithms are also tailored to exploit the structure of the MPC problems,
with special focus on reducing the computational complexity.
Preface
This thesis was prepared at the Department of Applied Mathematics and
Computer Science (DTU Compute, formerly known as DTU Informatics) at the
Technical University of Denmark, in partial fulfillment of the requirements for
acquiring a PhD degree in Engineering. The PhD project was funded entirely
by DTU Compute; three months as Research Assistant at DTU Compute
were funded by EUDP 64013-0558 Energy Efficient Process Control, and
The Danish Council for Strategic Research in the project CITIES - Centre
for IT-Intelligent Energy Systems in Cities (1305-00027B). All are gratefully
acknowledged.
The thesis deals with algorithms and methods for the implementation of fast
solvers for model predictive control. The focus of the thesis is on both the
optimization algorithms (tailored to exploit the special structure of the model
predictive control problem) and the implementation (thanks to a novel
implementation strategy for the dense linear algebra routines in embedded
optimization). All solvers and routines are gathered in the open-source toolbox HPMPC.
The thesis is in the form of a monograph. The main reason for opting for a
monograph instead of a collection of papers is the desire to present the matter
in much more detail and in a more systematic way than is possible in a paper. In
particular, the first part of the thesis proposes a novel implementation strategy
for the linear algebra routines in embedded optimization, and deep insight and
extensive numerical tests are necessary to convince the reader of the effectiveness
of the approach.
Kgs. Lyngby, 30-December-2015
Gianluca Frison
Acknowledgements
First and foremost, I would like to thank my supervisors: John Bagterp
Jørgensen for allowing me to pursue what interested me the most, and for his
precious advice, and Niels Kjølstad Poulsen for his presence and support
throughout the project.
I would also like to thank Moritz Diehl for opening the doors of IMTEK to me,
both during a research stay there, and now for a new adventure.
I would like to thank Milan Vukov and D. Kwame Minde Kufoalor (aka Giorgio)
for the endless hours of enthusiastic work together. This project owes you a lot.
A big acknowledgement also goes to the defence opponents, Allan Peter
Engsig-Karup, Daniel Axehill and Hans Joachim Ferreau, for their challenging
questions and valuable feedback on the thesis.
Finally, I would like to thank my girlfriend Marie, my family and my friends
for their love and their constant support in the good as well as in the difficult
moments. I am glad I lived this adventure with you.
List of abbreviations
ADMM Alternating Direction Method of Multipliers
AS Active-Set
CPU Central Processing Unit
flop floating-point operation
FMA Fused Multiply-Add
FP Floating-Point
Gflops Billions of floating-point operations per second
IPM Interior-Point Method
ISA Instruction Set Architecture
LLC Last Level of Cache
LP Linear Programming
memop memory operation
MHE Moving Horizon Estimation
MMU Memory Management Unit
MPC Model Predictive Control
NMHE Nonlinear Moving Horizon Estimation
NMPC Nonlinear Model Predictive Control
OCP Optimal Control Problem
QP Quadratic Programming
SIMD Single-Instruction Multiple-Data
SQP Sequential Quadratic Programming
TLB Translation Lookaside Buffer
Contents
Summary (English)
Summary (Danish)
Preface
Acknowledgements
List of abbreviations
1 Introduction
1.1 Background
1.2 Thesis approach
1.3 Thesis outline
1.4 Publications list
I Dense Linear Algebra Routines for Embedded Optimization
2 Review of dense linear algebra implementation techniques
2.1 Assumptions about the computer architecture
2.2 Linear algebra routines in high-performance computing: optimized libraries
2.2.1 Reference BLAS and LAPACK
2.2.2 ATLAS
2.2.3 GotoBLAS / OpenBLAS
2.2.4 BLIS
2.2.5 Intel's MKL
2.3 Linear algebra routines in control: code generation
2.3.1 CVXGEN
2.3.2 FORCES
2.4 Comparison of existing dense linear algebra implementations
2.5 Conclusion
3 Level 3 BLAS and LAPACK for embedded optimization
3.1 General framework: embedded optimization
3.2 Optimizing the gemm routine
3.2.1 Optimizing the gemm kernel
3.2.2 Use of contiguous memory and panel-major matrix format
3.2.3 Order of outer loops
3.2.4 Transposition, edges and corners handling
3.2.5 Low rank updates handling
3.3 Optimizing other level 3 BLAS and LAPACK routines
3.3.1 Triangles, factorizations, substitutions and inversions handling
3.3.2 Merging of linear algebra routines
3.3.3 Notable routines
3.4 Comparison of implementation techniques for dsyrk + dpotrf
3.5 Performance of level 3 BLAS and LAPACK routines
3.5.1 Performance on Intel Ivy-Bridge micro-architecture
3.5.2 Performance on Intel Haswell micro-architecture
3.5.3 Performance in case of low rank updates
3.6 Conclusion
4 Level 2 BLAS for embedded optimization
4.1 Optimizing the gemv routine
4.1.1 Optimizing the gemv kernel
4.1.2 Use of contiguous memory and panel-major matrix format
4.1.3 Edges handling
4.2 Optimizing the symv routine
4.2.1 Optimizing the symv kernel
4.3 Optimizing other level 2 BLAS routines
4.3.1 Triangles and substitutions handling
4.3.2 Merging of linear algebra routines
4.3.3 Notable routines
4.4 Performance of level 2 BLAS routines
4.4.1 Performance on Intel Ivy-Bridge micro-architecture
4.4.2 Performance on Intel Haswell micro-architecture
4.5 Conclusion
5 Optimizing gemm kernels on different architectures
5.1 x86
5.1.1 Intel Bonnell (Atom)
5.2 x86_64
5.2.1 Intel Core
5.2.2 Intel Nehalem
5.2.3 Intel Sandy-Bridge
5.2.4 Intel Haswell
5.2.5 Intel Skylake
5.2.6 AMD K10
5.3 ARMv7A
5.3.1 ARM Cortex A9
5.3.2 ARM Cortex A15
5.3.3 ARM Cortex A7
5.4 ARMv8A
5.4.1 Cortex A57
5.5 PowerPC
5.5.1 PowerPC 603e
5.6 Conclusion
6 Summary and considerations about code generation
6.1 Comparison with existing dense linear algebra implementations
II Algorithms for Unconstrained MPC and MHE Problems
7 Unconstrained MPC and MHE problem formulations
7.1 Unconstrained MPC problem
7.1.1 Matrix formulation
7.1.2 Optimality conditions
7.2 Unconstrained MHE problem
7.2.1 Matrix formulation
7.2.2 Optimality conditions
8 Structure-exploiting recursive factorizations of the KKT matrix
8.1 Backward Riccati recursion
8.1.1 Derivation
8.1.2 Implementation
8.2 Forward Schur-complement recursion
8.2.1 Derivation
8.2.2 Implementation
8.3 Comparison of structure-exploiting factorizations
8.3.1 Comparison on Intel Ivy-Bridge micro-architecture
8.3.2 Comparison on Intel Haswell micro-architecture
8.4 Conclusion
9 Condensing methods
9.1 Condensing methods for MPC
9.1.1 Condensing algorithms for MPC
9.1.2 Factorization algorithms for MPC
9.1.3 Condensing and factorization algorithms for MPC
9.1.4 Solution algorithms for MPC
9.2 Condensing methods for MHE
9.2.1 Condensing algorithms for MHE
9.2.2 Factorization algorithms for MHE
9.2.3 Condensing and factorization algorithms for MHE
9.2.4 Solution algorithms for MHE
9.3 Conclusion
10 Partial condensing
10.1 Partial condensing algorithms
10.2 Choice of Np
10.3 Influence of linear algebra routines performance
10.4 Conclusion
11 Unconstrained MPC problems with time-invariant matrices
11.1 Problem formulation
11.2 Motivation
11.2.1 Linear time-invariant control problems
11.2.2 Sub-problem in splitting methods for linear MPC
11.2.3 Sub-problem in splitting methods for constrained LQR
11.3 Sparse formulation
11.4 Condensed formulation
11.5 Implementation aspects
11.6 Conclusion
III Algorithms for Constrained and Non-Linear MPC
12 Constrained MPC problem formulations
12.1 Linear MPC problem
12.1.1 Matrix formulation
12.1.2 Optimality conditions
13 Solution of sub-problems in linear MPC and MHE problems
13.1 Interior-point methods
13.1.1 Basics about interior-point methods
13.1.2 Interior-point methods for the linear MPC problem
13.1.3 Interior-point methods implementation choices
13.1.4 Partial condensing for linear MPC problems
13.1.5 Comparison of solvers for linear MPC problems
13.2 Alternating direction method of multipliers
13.2.1 Notation and basics about ADMM
13.2.2 Box constraints
13.2.3 Soft constraints
13.2.4 ADMM implementation choices
13.2.5 Numerical results for the linear MPC problem
13.3 Conclusion
14 Summary and considerations about solution of sub-problems in nonlinear MPC and MHE problems
14.1 Interface with existing solvers for NMPC
A Custom gcc Compiler
Bibliography
Chapter 1
Introduction
The aim of this thesis is to investigate algorithms and methods to reduce the
solution time of solvers for Model Predictive Control (MPC). The thesis is
accompanied by an open-source toolbox for High-Performance implementation
of solvers for MPC (HPMPC), which contains the source code of all routines
employed in the numerical tests [30]. The main focus of this thesis is on linear MPC
problems, since they arise as sub-problems in some Nonlinear MPC (NMPC)
formulations.
1.1 Background
MPC is an advanced control technique that has gained much attention in both
academia and industry in the last few decades, for both the linear and nonlinear
MPC cases [69]. Good introductions to MPC can be found in [61, 72, 71]. In
a few sentences, MPC makes use of a model of the controlled plant to predict
its future state and to compute a sequence of inputs (also known as controls or
manipulated variables) that is optimal with respect to both performance metrics
and plant constraints. MPC can deal with complex plants with several inputs and
outputs (also known as controlled variables) and the related constraints, and it
incorporates in a predictive way information about set-point changes or known
disturbances. This is achieved by solving at each sampling instant an optimization problem,
that is parametrized with respect to the current state of the plant. The need to
efficiently and reliably solve this optimization problem at each sampling instant,
as soon as a new estimate of the plant state is available, has traditionally limited
the application of MPC to the control of plants with slow dynamics. There has
been much research effort in reducing the solution time of the optimization
problems in MPC, which is a necessary condition for the applicability of MPC to
the control of systems characterized by faster dynamics.
In the last few decades, MPC has been successfully applied to the control of
systems characterized by increasingly fast dynamics, from sampling times
in the order of minutes all the way down to sampling rates in the MHz range [49].
Many of these improvements have been achieved thanks to the development of
optimization algorithms tailored to the special structure of the MPC problem.
The two main research directions in this field are off-line and on-line solution
methods for the MPC problem.
Explicit MPC [21, 20] exploits the fact that the solution of the constrained
linear MPC problem is a continuous piecewise affine function of the state over
a polyhedral partition, and that it can be precomputed off-line for all initial
states. The on-line part of the algorithm reduces to a lookup table, and therefore
extremely high control frequencies can be achieved. Since in the worst case the
number of regions can be as large as the number of possible combinations of
active constraints (which grows exponentially with the number of constraints),
the use of explicit MPC is generally limited to very small MPC problems.
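As an illustration of the on-line lookup, the following minimal sketch evaluates a piecewise affine control law by sequential point location. The three-region, one-state law used here (a saturated linear feedback) is a hypothetical example chosen by hand, not the solution of a problem from this thesis, and the names `in_region` and `explicit_mpc` are assumptions introduced for the sketch.

```python
# Each region is a tuple (H, h, F, g): the polyhedron {x : H x <= h}
# and the affine law u = F x + g valid inside it.
# The data below are hypothetical and only serve as an illustration.

def in_region(H, h, x, tol=1e-12):
    """Return True if every row of H x <= h holds (up to a tolerance)."""
    return all(sum(Hi[j] * x[j] for j in range(len(x))) <= hi + tol
               for Hi, hi in zip(H, h))

def explicit_mpc(regions, x):
    """Sequential point location, then evaluation of the affine law."""
    for H, h, F, g in regions:
        if in_region(H, h, x):
            return [sum(Fi[j] * x[j] for j in range(len(x))) + gi
                    for Fi, gi in zip(F, g)]
    raise ValueError("state outside the precomputed partition")

# Partition of the interval [-5, 5] into three regions.
regions = [
    ([[1.0], [-1.0]], [-2.0, 5.0], [[0.0]], [1.0]),    # -5 <= x <= -2: u = 1
    ([[1.0], [-1.0]], [2.0, 2.0], [[-0.5]], [0.0]),    # -2 <= x <=  2: u = -0.5 x
    ([[1.0], [-1.0]], [5.0, -2.0], [[0.0]], [-1.0]),   #  2 <= x <=  5: u = -1
]

u = explicit_mpc(regions, [1.0])
```

The cost of this naive search grows linearly with the number of regions, which is one way to see why a large partition quickly becomes impractical.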
Alternatively, the solution of the optimization problems can be computed
on-line, between two sampling instants. This imposes tight real-time requirements
on the execution time of the solvers, requiring the development of reliable
solution methods for optimization problems with the structure of MPC problems.
Suitable solvers must be certified to return the solution within the available
time, or should at least return a reasonable approximation of the solution if
stopped early. On-line methods comprise first-order and second-order
optimization methods.
First-order optimization methods (such as gradient methods [73, 50, 58] and
splitting methods [82, 68]) are generally straightforward to implement and can
easily exploit the sparsity pattern and special structure of problems. They perform
many but cheap iterations, where the cost per iteration is quadratic in the input
and state size (requiring level 2 BLAS operations, e.g. a matrix-vector
multiplication or the solution of a system of linear equations whose matrix is
factorized off-line). In general, the number of iterations (and therefore the solution
time) can vary significantly with the number of active constraints and the problem
conditioning. Therefore, there has been much effort in finding certified bounds on
the solution time of these methods [74].
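To make the "many but cheap iterations" concrete, the following minimal sketch applies a projected gradient method to a small box-constrained QP. It is not one of the solvers discussed in this thesis: the problem data, the fixed iteration count and the step size 1/L (with L taken as the infinity norm of the Hessian, an upper bound on its largest eigenvalue) are assumptions made for the illustration.

```python
def matvec(H, u):
    """Dense matrix-vector product: the only level 2 operation per iteration."""
    return [sum(hij * uj for hij, uj in zip(row, u)) for row in H]

def projected_gradient(H, g, lb, ub, iters=200):
    """Minimize 0.5 u'Hu + g'u subject to lb <= u <= ub (H symmetric positive definite)."""
    n = len(g)
    # The infinity norm of H bounds its largest eigenvalue, so 1/L is a safe step.
    L = max(sum(abs(v) for v in row) for row in H)
    u = [min(max(0.0, lb[i]), ub[i]) for i in range(n)]
    for _ in range(iters):
        grad = [hi + gi for hi, gi in zip(matvec(H, u), g)]
        # Gradient step followed by projection onto the box.
        u = [min(max(u[i] - grad[i] / L, lb[i]), ub[i]) for i in range(n)]
    return u

H = [[2.0, 1.0], [1.0, 2.0]]
g = [-6.0, -1.0]
u = projected_gradient(H, g, [-1.0, -1.0], [1.0, 1.0])
```

For these data the first input saturates at its upper bound and the second converges to zero. Each iteration costs one matrix-vector product, and the number of iterations needed depends on the conditioning of H and on which bounds are active.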
Second-order optimization methods make use of second-order information to
converge to a solution in fewer iterations. Interior-Point Methods (IPM) and
Active-Set methods (AS) belong to this class. In the IPM case, there exists a
polynomial bound on the number of iterations required for convergence,
but it is very loose and therefore of no practical value. In practice, in IPMs the
number of iterations is largely unaffected by the problem instance: if well
initialized, these methods can typically converge to the solution in 8-15 iterations.
This justifies the wide use of IPMs in MPC [70, 89, 62, 26]. Each iteration is
computationally heavier than in the first-order methods case, requiring the
factorization and solution of a system of linear equations in the computation of
the search direction. The factorization makes use of level 3 BLAS and therefore
requires a number of flops that is cubic in the input and state size.
There is no polynomial bound for the AS methods, which in the worst
case can require an exponential number of iterations to converge. However,
in practice AS methods perform well and can return a solution quickly, and
therefore they find wide applicability in MPC [54, 66]. Particularly interesting
is the approach employed in the open-source AS solver qpOASES [28, 29], which,
in case of an early stop, returns the solution of a QP that lies between the one
solved at the previous sampling instant and the current one. AS methods typically
require more iterations than IPMs to converge, but the iterations are generally
cheaper, since the factorization of the KKT matrix can be updated without the
need to re-factorize it when there are changes in the active set.
The other factor contributing to the reduction in solution times has been the
improvement in computer hardware. In particular, processor frequencies
increased exponentially for about two decades, going from a few MHz at
the beginning of the '80s to about 3 GHz, reached by the Intel Pentium 4
in 2002. In general, an increase in the CPU frequency translates directly into
a similar increase in performance. Therefore, there was little need to
invest much research effort on the implementation side. However, since then
CPU frequencies have stalled, and further improvements in computing power
can come only from the use of vector execution units or multiple CPU cores,
both requiring additional programming effort [84].
Probably because matrices in MPC problems are generally of small
size, the use of optimized BLAS libraries has not been considered of much help in
implementing fast solvers for MPC or MHE [46]. An implementation technique
that has recently gained much traction in the field of MPC is code generation,
originally proposed in this field in [62]. The idea of code generation is to exploit
the fact that, in MPC, problems of fixed structure are repeatedly solved at each
sampling instant. This can be done, for example, by choosing the best solution
strategy for the problem at hand, or by exploiting sparsity and knowledge about the
size of matrices. In the implementation of linear algebra routines, all loops can
be fully unrolled (as in [62]), or the size of the loops can be fixed at code
generation time (as in [26]).
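The idea of full unrolling can be illustrated with a toy generator. The sketch below emits the source of a matrix-vector product for one fixed matrix, with every loop unrolled and the zero entries skipped. It is only a sketch of the principle: the function name `generate_matvec` is hypothetical, it does not reproduce the generators of [62] or [26], and it emits Python rather than C purely for brevity.

```python
def generate_matvec(name, A):
    """Emit source code for y = A x with all loops unrolled and zeros skipped."""
    rows = []
    for row in A:
        terms = [f"{a!r} * x[{j}]" for j, a in enumerate(row) if a != 0.0]
        rows.append(" + ".join(terms) if terms else "0.0")
    return f"def {name}(x):\n    return [" + ", ".join(rows) + "]"

# Matrix with a known, fixed sparsity pattern.
A = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 0.0]]

source = generate_matvec("matvec_2x3", A)

# Compile the generated source and fetch the specialized function.
namespace = {}
exec(source, namespace)
matvec_2x3 = namespace["matvec_2x3"]
y = matvec_2x3([1.0, 1.0, 1.0])
```

The generated function contains no loops and no multiplications by zero, but its length grows with the number of nonzero entries, so the code size increases quickly with the problem size.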
1.2 Thesis approach
As stated in the title, the thesis deals with algorithms and methods for fast
model predictive control. Methods mainly refers to Part I of the thesis (dealing
with linear algebra implementation methods), while algorithms mainly refers to
Part II and Part III of the thesis (dealing with tailored algorithms for MPC).
Both the algorithms and their implementation are considered equally important in
the development of fast solvers.
The structure of the thesis follows the structure of the HPMPC toolbox, as
shown on the right hand side of Figure 1.1. Part I deals with the efficient
implementation of linear algebra routines for embedded optimization, which form
the basis for the developed solvers. Part II deals with solvers for unconstrained
MPC and MHE problems. Part III deals with solvers for linear (constrained)
MPC and MHE problems.
The approach used in the solvers development is to implement a toolbox of
efficient dense linear algebra routines, and to explicitly exploit the structure
and sparsity pattern of the MPC problems in the algorithms implemented using
these routines.
Part I of the thesis proposes a novel implementation strategy for dense linear
algebra routines, specially tailored for embedded optimization. The matrices of
interest in embedded optimization are typically of small to medium size, and
they can generally fit in cache. Therefore, some of the implementation
techniques employed in optimized BLAS libraries (such as blocking for cache) give
no advantage; on the contrary, they decrease performance due to the overhead
of performing useless operations (this is particularly true for small
matrices). For this reason, a subset of BLAS and LAPACK has been re-implemented
using only techniques beneficial to the embedded optimization case, such as blocking
for registers and explicit use of vectorization through SIMD instructions. Only
single-threaded code has been considered in this thesis, as parallel computation is
mostly beneficial for larger problems, and in any case it should be employed
only once the single-thread performance has been well optimized.
Furthermore, a special matrix format is proposed, called panel-major matrix
format. It consists of horizontal panels (i.e. sub-matrices with few rows and
many columns) of fixed height b_s stored one after the other, with the elements
within each panel stored in column-major order. This matrix format roughly
corresponds to the innermost level of packing employed in optimized BLAS
libraries, giving optimal performance for matrices fitting in cache (as is typical
in embedded optimization). The fundamental difference is that in optimized
BLAS libraries the packing of matrices into this format is done at each call to a
linear algebra routine, while in the proposed approach the panel-major matrix
format is the format expected by the linear algebra routines, as shown in Figure 1.1.
This moves the overhead of packing matrices from the standard column-major
or row-major formats much higher in the routines hierarchy. In particular, in
embedded optimization the packing overhead can be well amortized over the
iterations of the optimization methods.

[Figure 1.1: block diagram. Box labels: optimized BLAS based solver; HPMPC
based solver; linear algebra kernels; linear algebra routines; Riccati solver for
unconstrained MPC; IPM solver for linear MPC; high-level wrapper; packing of
matrices; Part I; Part II; Part III.]
Figure 1.1: Structure of a Riccati-based IPM for linear MPC problems when
implemented using linear algebra in either optimized BLAS or
HPMPC. Routines in the orange boxes use matrices in
column-major format, routines in the green boxes use matrices in
panel-major format (or equivalent internal format in optimized BLAS).
The thesis follows the structure of the HPMPC toolbox, with Part
I dealing with linear algebra, Part II with solvers for unconstrained
MPC problems and Part III with solvers for linear (constrained)
MPC problems.

[Figure 1.2: block diagram, drawn once for the double precision solvers and once
for the single precision solvers. Box labels: x86_64 Intel Core; x86 Intel Atom;
x86_64 Intel Haswell; ARMv7A; b_s = 2 linear algebra and b_s = 4 linear algebra
(double precision); b_s = 4 linear algebra and b_s = 8 linear algebra (single
precision); Riccati recursion; Schur compl. recursion; condensing. Layer labels:
linear algebra kernels; linear algebra routines; solvers for unconstr. MPC.]
Figure 1.2: Structure of the linear algebra routines in HPMPC. The linear
algebra kernels are tailored to each computer architecture. The
linear algebra routines depend only on the panel height b_s (that
may be different for single and double precision). The routines at
higher levels in the routines hierarchy are completely
architecture-independent.
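The panel-major layout described above can be sketched as an index mapping. The sketch below is a minimal illustration under stated assumptions: the last panel is zero-padded to a full height of b_s rows, the panel stride equals b_s times the number of columns, and the function names are hypothetical. The actual storage details in HPMPC may differ.

```python
def panel_major_index(i, j, n, bs):
    """Position of element (i, j) of a matrix with n columns in panel-major storage.

    (i // bs) selects the panel, j selects the column inside the panel,
    and (i % bs) selects the row inside that column.
    """
    return (i // bs) * bs * n + j * bs + (i % bs)

def pack_panel_major(A, bs):
    """Pack a matrix given as a list of rows into a flat panel-major buffer."""
    m, n = len(A), len(A[0])
    num_panels = (m + bs - 1) // bs          # the last panel is padded with zeros
    buf = [0.0] * (num_panels * bs * n)
    for i in range(m):
        for j in range(n):
            buf[panel_major_index(i, j, n, bs)] = A[i][j]
    return buf

A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
packed = pack_panel_major(A, bs=2)
```

With b_s = 2, the first panel holds rows 0 and 1 column by column, and the second panel holds row 2 together with one row of padding. The b_s elements of a column of a panel are contiguous in memory, which is the property that a kernel using vector loads relies on.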
The innermost loop of each linear algebra routine is coded in a separate function,
the kernel. The linear algebra kernels are typically coded in assembly or using
intrinsics, and are tailored to a number of computer architectures, as shown
in Figure 1.2. The two outermost loops of the linear algebra routines are almost
architecture-independent, in the sense that they are coded in C but
depend on the panel height b_s. All routines at higher levels in the routines
hierarchy are completely architecture-independent.
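The split between kernel and outer loops can be sketched as follows for a matrix-matrix multiplication on panel-major data. This is an illustrative sketch and not the HPMPC code: it computes C += A * B^T (choosing this variant is an assumption), it uses b_s = 2 and a b_s x b_s block of C per kernel call, it assumes that the matrix sizes are multiples of b_s, and all names are hypothetical. Real kernels are written in assembly or intrinsics and keep the accumulator in SIMD registers.

```python
BS = 2  # panel height b_s (value assumed for the illustration)

def pack(A, bs=BS):
    """Pack a list-of-rows matrix into panel-major storage (zero-padded)."""
    m, n = len(A), len(A[0])
    buf = [0.0] * (((m + bs - 1) // bs) * bs * n)
    for i in range(m):
        for j in range(n):
            buf[(i // bs) * bs * n + j * bs + i % bs] = A[i][j]
    return buf

def kernel_nt(k, pA, a0, pB, b0, pC, c0, bs=BS):
    """Innermost loop: C_block (bs x bs) += A_panel (bs x k) * B_panel (bs x k)^T.

    a0, b0 and c0 are the offsets of the panels of A and B and of the block of C.
    """
    acc = [0.0] * (bs * bs)  # stands in for the accumulation registers
    for l in range(k):
        for jj in range(bs):
            b = pB[b0 + l * bs + jj]
            for ii in range(bs):
                acc[jj * bs + ii] += pA[a0 + l * bs + ii] * b
    for t in range(bs * bs):
        pC[c0 + t] += acc[t]

def gemm_nt(m, n, k, pA, pB, pC, bs=BS):
    """Two outer loops over the bs x bs blocks of C; they depend only on bs."""
    for i in range(0, m, bs):
        for j in range(0, n, bs):
            kernel_nt(k, pA, i * k, pB, j * k, pC, i * n + j * bs, bs)

A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]                                   # 2 x 3
B = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [2.0, 2.0, 2.0], [1.0, 1.0, 1.0]]  # 4 x 3
pC = [0.0] * (2 * 4)
gemm_nt(2, 4, 3, pack(A), pack(B), pC)
```

Only `kernel_nt` would need to be rewritten for a different architecture, while `gemm_nt` would change only if the panel height changed. In panel-major storage each b_s x b_s block of C is contiguous, which is why the kernel can write it back with a single linear loop.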
Part II of the thesis deals with algorithms for unconstrained MPC and MHE
problems. The focus of these algorithms is on exploiting the structure and the
sparsity pattern of the unconstrained MPC and MHE problems at the algorithm
level, since the underlying linear algebra routines are dense.
Regarding the choice of the algorithms, two structure-exploiting factorizations of
the KKT matrix of the unconstrained MPC and MHE problems are presented.
Namely, the backward Riccati recursion and the forward Schur-complement
method are reviewed, and their efficient implementation in HPMPC is presented
in detail. Neither of these methods is novel in the field of MPC but, to the best
of my knowledge, the level of performance obtained by the routines in HPMPC
has not been obtained before. Furthermore, a collection of algorithms for
condensing of MPC and MHE problems is presented. Some of the algorithms are
well known, some are novel (to the best of my knowledge). Structure-exploiting
factorizations and condensing methods meet in partial condensing, a technique
recently proposed to trade off the horizon length against the input size in MPC
and MHE problems [18].
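As a reminder of what the backward Riccati recursion computes, the following minimal sketch solves an unconstrained MPC problem with a single state and a single input. The scalar setting, the cost 0.5 * sum(q x_n^2 + r u_n^2) + 0.5 * qN x_N^2, the dynamics x_{n+1} = a x_n + b u_n and all numerical values are assumptions made to keep the example short. The matrix-valued recursion and its implementation are treated in Part II.

```python
def riccati_scalar(a, b, q, r, qN, N):
    """Backward Riccati recursion for a scalar unconstrained MPC problem.

    Returns P_0 (the optimal cost is 0.5 * P_0 * x_0^2) and the
    feedback gains K_n such that u_n = K_n * x_n.
    """
    P = qN
    gains = [0.0] * N
    for n in reversed(range(N)):
        s = r + b * b * P                    # scalar counterpart of R + B'PB
        gains[n] = -(a * b * P) / s          # gain computed from P_{n+1}
        P = q + a * a * P - (a * b * P) ** 2 / s
    return P, gains

def simulate(a, b, q, r, qN, gains, x0):
    """Forward simulation of the closed loop, accumulating the cost."""
    x, cost = x0, 0.0
    for K in gains:
        u = K * x
        cost += 0.5 * (q * x * x + r * u * u)
        x = a * x + b * u
    return cost + 0.5 * qN * x * x

P0, gains = riccati_scalar(1.2, 0.5, 1.0, 0.1, 1.0, 10)
cost = simulate(1.2, 0.5, 1.0, 0.1, 1.0, gains, 2.0)
```

The cost accumulated by the forward simulation agrees with 0.5 * P_0 * x_0^2, which is a convenient consistency check. The work of the recursion grows linearly with the horizon length N, in contrast with a dense factorization of the full KKT matrix.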
Part III of the thesis briefly deals with algorithms for linear (constrained) MPC
and MHE problems. In particular, two algorithms are presented: an IPM and
an ADMM (Alternating Direction Method of Multipliers), both employing the
backward Riccati recursion as a routine to solve tailored systems of linear
equations. The focus is on showing that the special structure of the constraints can
be exploited to handle them efficiently.
1.3 Thesis outline
Chapter 1 contains the introduction, comprising background, thesis approach,
thesis outline and publications list. Afterwards, the thesis is divided into three
parts.
Part I deals with efficient implementation methods for dense linear algebra
routines, tailored for embedded optimization applications.
Chapter 2 states assumptions about the structure of the computer
architectures considered in this thesis. Afterwards, it contains a brief review of the main
approaches employed in the implementation of optimized BLAS libraries and
in the implementation of linear algebra routines in embedded optimization.
Finally, it compares the computational performance of the most promising options
currently available for the implementation of the dense Cholesky factorization
for small to medium matrices.
Chapter 3 proposes a novel implementation method for level 3 BLAS and
LAPACK routines (that are the backbone of algorithms for KKT matrix factor-
ization and Hessian condensing), specially tailored for embedded optimization
applications. The chapter begins with some assumptions about the nature of
the embedded optimization problems, and their consequences for the
implementation of linear algebra routines. Afterwards, it presents a set of optimization
techniques that provide good performance for small matrices, and it proposes
a novel matrix format, that guarantees optimal performance of level 3 BLAS
routines for matrices roughly ﬁtting in last level cache. The step-by-step opti-
mization of a key routine shows the performance impact of each single imple-
mentation technique. Finally, a selection of common level 3 BLAS and LAPACK
routines is compared on two recent computer architectures, and the proposed
approach is compared with the best proprietary and open-source BLAS libraries.
Chapter 4 is analogous to Chapter 3, but focuses on level 2 BLAS routines
(that are the backbone of algorithms for KKT matrix solution, gradient
condensing and residual computation).
Chapter 5 introduces a number of common computer architectures, and
describes in detail the optimization of the general matrix-matrix multiplication
gemm kernel on the different architectures, in both single and double precision.
Chapter 6 contains a summary of Part I, and adds the Cholesky factorization
routine implemented using the proposed methods to the initial performance
tests performed in Chapter 2. Furthermore, it investigates whether the use
of code generation can further improve performance.
Part II contains a collection of structure-exploiting methods for the solution
of unconstrained (linear) MPC and MHE problems. These methods are imple-
mented using the eﬃcient linear algebra routines proposed in Part I.
Chapter 7 introduces the unconstrained MPC and MHE problem formula-
tions considered in the remainder of the part, and shows the structure of the
KKT matrices for these problems.
Chapter 8 presents two structure-exploiting factorizations for the KKT ma-
trix of the unconstrained MPC and MHE problems: a backward Riccati re-
cursion and a forward Schur-complement recursion. Implementation of these
routines using both optimized BLAS libraries and the custom linear algebra
routines proposed in Part I is presented in detail. Finally, exhaustive tests
compare the performance of the two recursions, when implemented using opti-
mized BLAS libraries or the proposed custom linear algebra routines.
Chapter 9 presents a wide collection of condensing methods for the MPC
and MHE problems. All algorithms are initially presented for the MPC prob-
lem, and afterwards adapted to the MHE problem by considering the free initial
state as an extra input variable at stage -1. In particular, three Hessian con-
densing methods (with one novel to the best of my knowledge), two Hessian
factorization methods (with one novel to the best of my knowledge) and two
Hessian solution methods (with one novel to the best of my knowledge) are pre-
sented. Furthermore, two combined methods for the simultaneous condensing
and factorization of the Hessian matrix are presented by combining two Hes-
sian condensing methods with the novel Hessian factorization method (one of
the resulting algorithms is novel to the best of my knowledge). The asymptotic
complexity of all methods is analyzed both as number of ﬂops and as execution
time (with linear algebra routines implemented as described in Part I).
Chapter 10 reviews the idea of partial condensing and proposes eﬃcient al-
gorithms for the computation of the state space equations and cost function
matrices and vector in the partially condensed MPC and MHE problems. The
algorithms are based on the best algorithm for the Hessian condensing of the
MHE problem, as investigated in Chapter 10. Furthermore, the backward Ric-
cati recursion of Chapter 8 is used as an example to derive theoretical guidelines
on the best value for the horizon length in the partially condensed MPC and
MHE problems. Finally, the inﬂuence of the performance of linear algebra
routines is investigated, ﬁnding that partial condensing gives additional perfor-
mance advantages in case of very small matrices (to the best of my knowledge,
this has not been investigated in literature yet).
Chapter 11 tailors the backward Riccati recursion and the novel Hessian con-
densing and solution methods to the special case of MPC problems with time-
invariant matrices and time-variant vectors, in both the state space equations
and the cost function expression, where the matrix of the terminal cost is
initialized to the solution of the discrete algebraic Riccati equation.
Situations where problems in this form arise are described; in particular, problems in
this form arise as sub-problems in splitting methods for the constrained linear
quadratic regulator. Problems in this form can be solved extremely eﬃciently,
since the backward structure-exploiting factorization gives matrices constant
over the stages. Therefore, there is no need to perform the factorization over
all the stages, and only the matrices of a single stage need to be stored, greatly
decreasing the memory requirements for the algorithms (these considerations
are, to the best of my knowledge, novel).
Part III deals with algorithms for constrained (linear) MPC and MHE prob-
lems, that are implemented using the solvers from Part II as routines. Only
algorithms for MPC are explicitly considered, the algorithms for MHE being
analogous.
Chapter 12 introduces the linear (constrained) MPC formulations considered
in the remainder of the part. In particular, hard and soft box constraints and
general polytopic constraints are considered.
Chapter 13 presents two Riccati-based optimization methods for the solution
of linear MPC problems: an IPM and an ADMM. These optimization methods
have been widely used in the literature, and have been chosen here because they
can beneﬁt from an eﬃcient implementation of the backward Riccati recursion.
Furthermore, techniques to exploit the special structure of the soft constraints
are presented, and numerical tests conﬁrm that linear MPC problems with soft
constraints can be solved in only slightly more time than the hard-constrained
counterparts. The resulting IPM solvers are found to be more than an or-
der of magnitude faster than a successful state-of-the-art solver for embedded
optimization on a widely used benchmark, for medium to large problems. Fur-
thermore, the overhead of the high-level wrapper is well amortized over the IPM
iterations.
Chapter 14 contains a summary of Part III, and it shows the performance
gains obtained using the eﬃcient Riccati-based IPM solver and the forward
Schur-complement recursion as routines in the real-time solution of challenging
NMPC and NMHE problems using ACADO.
Appendix A proposes a modiﬁcation to the open-source compiler gcc to im-
prove the performance of the intrinsics for the various fused multiply-subtraction
instructions (employed e.g. in the implementation of the Cholesky factorization)
in the x86_64 FMA ISA.
1.4 Publications list
The following journal articles ('article') and conference proceedings ('paper' or
'abstract') were published during the project period.
Paper [33] deals with algorithms for the solution of the unconstrained MPC
problem, that make use of optimized BLAS and LAPACK routines in Open-
BLAS [94] for the linear algebra. The ﬁrst part of the paper reports some
results originally found in the MSc thesis [31]: the backward Riccati recursion is
found to be the solution method performing best for the widest range of
problem sizes, in case of dense matrices in the state space equation and cost function
formulation. The second part of the paper proposes techniques to improve the
speed of a Riccati-based solver, such as the use of the Cholesky factorization
to reduce the ﬂop count (as shown in Chapter 8), or the use of mixed precision
computation [23] (that is orthogonal to the techniques presented in this thesis).
Paper [77] presents an eﬃcient IPM for the LPs arising in economic MPC of
linear systems. The IPM combines a homogeneous and self-dual model with a
specialized Riccati recursion, and it is tested on a power system management
test problem. The subject of the paper is not considered further in this thesis.
Paper [35] compares techniques to parallelize the Cholesky factorization rou-
tine in the implementation of the backward Riccati recursion. In particular,
the performance of the parallel version of OpenBLAS [94] is compared to the
performance of PLASMA [13], that is a library providing LAPACK-like routines
where multiple threads are explicitly handled in the linear algebra algorithms
instead of in the level 3 BLAS routines. Furthermore, the asynchronous version
of PLASMA allows the threads of linear algebra routines to be scheduled
asynchronously, while explicit barriers are employed to ensure correctness of the
results. The subject of the paper is not considered further in this thesis, since
the focus is on single-thread code.
Paper [34] proposes structure-exploiting condensing methods for the solution
of the unconstrained MPC problem. In particular, two Hessian condensing
algorithms are considered, together with a novel structure-exploiting Hessian
factorization and Hessian solution algorithms. If the novel Hessian solution
algorithm is employed, there is no need to explicitly build the Hessian matrix, if
the aim is the solution of the unconstrained MPC problem. The combination of
one of the Hessian condensing algorithms with the structure-exploiting Hessian
factorization algorithm and the explicit construction of the Hessian matrix results in
the Riccati-based algorithm originally proposed in [19]. The paper forms the
basis of Chapter 9.
Paper [79] investigates the use of warm-starting to reduce the number of
iterations of the Riccati-based homogeneous and self-dual IPM proposed in [77]
for the solution of LPs in economic MPC of linear systems. The subject of the
paper is not considered further in this thesis.
Paper [32] deals with a fast Riccati-based IPM for the solution of the linear
MPC problem. The paper proposes to merge the backward Riccati factorization
and the backward Riccati substitution in a single recursion, such that the ma-
trices of the factorization are merged with the vectors of the substitution. This
improves performance for small-scale problems. Furthermore, the paper pro-
poses the use of custom linear algebra routines to improve the performance for
small-scale problems, implemented using techniques such as blocking for cache
and explicit vectorization. This represents the first iteration of the
implementation strategy of Part I, while the Riccati-based IPM is considered in
Chapter 13.
Paper [76] presents a Riccati-based ADMM for MPC with input and input-rate
constraints, embedding an IPM for the efficient handling of the rate
constraints at each ADMM iteration. The subject of the paper is not considered
further in this thesis.
Paper [39] deals with a fast Riccati-based IPM for the solution of the linear
MPC problem. In particular, it investigates the performance improvements
that can be obtained exploiting the higher FP throughput of computations in
single-precision, in conjunction with inexact search direction in the IPM and
mixed precision computation. These techniques are not further considered in
this thesis, mainly because the results in the paper are obtained with an early
version of the linear algebra routines, that has not been updated yet in the
single-precision case.
Paper [38] reviews computer architectures commonly employed in embedded
optimization, and describes in detail the optimization of the gemm kernel on
these architectures. The Riccati-based IPM implemented using these optimized
linear algebra kernels is compared with a successful state-of-the-art solver
for embedded optimization, and found to be several times faster. The review of
computer architectures and the detailed description of the gemm kernel
optimization form the basis of Chapter 5, where many more computer architectures
are added.
Abstract [36] deals with IPM and ADMM for MPC problems, tailored to
eﬃciently handle soft constraints. In particular, the cost per iteration is only
slightly larger than in the hard-constrained counterparts. The material for this
abstract is now part of Chapter 13.
Paper [37] investigates the use of ARMv7A in MPC. The ﬁrst part of the
paper reviews the steps that lead to the implementation techniques proposed in
Part I of this thesis. In particular, the paper introduces the panel-major matrix
format. Afterwards, the computational capabilities of ARMv7A processors are
investigated in detail, and the performance of the gemm and gemv kernels is
evaluated. Finally, the performance of both a Riccati-based IPM and a Riccati-
based ADMM are evaluated with respect to a successful state-of-the-art solver
for embedded optimization. The material of the paper contributed mainly to
Part I of the thesis.
Article [88] aims to assess the computational performance of NMPC and
NMHE of a large-scale mechatronic application, namely the rotational startup
of Airborne Wind Energy systems. The NMPC and NMHE problems are handled
by ACADO, that employs either a condensed formulation of the QPs (solved
by the active-set (AS) solver qpOASES) or a sparse formulation of the QPs (solved by the IPM in
HPMPC) arising as sub-problems in the NMPC problem. The material of the
paper contributed to Chapter 14.
Paper [40] deals with the use of the forward Schur-complement recursion as
a structure-exploiting solver for unconstrained MHE problems with additional
equality constraints on the last stage. The implementation of the solver using the
custom linear algebra routines in HPMPC is presented in detail. Furthermore,
the solver is tested in both a performance scalability test and as a routine in
the solution of a challenging NMHE problem using ACADO (that is the same
application as in [88]). The material of the paper contributed mainly to Chapters
8 and 14.
Article [78] is the journal version of paper [77] and paper [79]. The subject
of the article is not considered further in this thesis.
Article [57] investigates reformulations of step-response models as state-space
models to solve them using the eﬃcient block-factorization algorithms commonly
employed in MPC. Furthermore, it proposes an IPM based on a backward Ric-
cati recursion and a condensing method specially tailored to the shape of the
step-response models formulated as state-space models. The subject of the
article is not considered further in this thesis.
Part I
Dense Linear Algebra
Routines for Embedded
Optimization

Chapter 2
Review of dense linear
algebra implementation
techniques
This chapter presents a brief review of dense linear algebra implementation
techniques. The review is not meant to be exhaustive, but simply to present
some of the most widespread solutions for linear algebra implementation in the
ﬁeld of embedded optimization and control.
2.1 Assumptions about the computer architec-
ture
In this thesis, only CPUs (Central Processing Units) are considered. Other
processor types such as GPUs (Graphics Processing Units, and in particular
GPGPUs, standing for General Purpose GPUs) and FPGAs (Field Programmable
Gate Arrays) are not considered, even though their use in MPC is reported in the
literature [41, 24]. Nevertheless, some of the implementation techniques presented in
later chapters may be applied to these processor types too.
The structure of modern CPU architectures can be better understood by putting
them into historical perspective [84]. The performance of CPUs improved
exponentially for two decades, starting from the early 1980s. Until the early
2000s, most of this performance growth came from increases in processor
frequency. With the reduction in transistor size at each new lithography
generation, transistors could be operated at higher frequencies, and more
transistors could be obtained from the same piece of silicon, driving the cost
per transistor down. Code written and compiled for an older architecture could
automatically run faster on more recent ones.
However, around 2004 CPUs hit the wall of power dissipation: transistors had
become too densely packed to dissipate the heat generated when operated at high
frequencies, as the power consumption grows approximately with the square
of the frequency. The most famous example is the Intel NetBurst
architecture: designed to reach 10 GHz, it could not increase frequencies beyond
4 GHz. Nonetheless, the empirical law about the doubling of transistor density
every 18-24 months (known as Moore's law) has held true to date.
The solution to keep scaling performance without increasing frequencies is to
have more work done per clock cycle. This has been achieved by exploiting
the (still holding) reduction in transistor size and cost at each new lithography
generation. New CPU architectures could therefore use the additional transistors
to keep increasing performance by making cores wider (increasing single-thread
performance thanks to e.g. superscalar designs able to issue multiple
instructions per cycle, or vector execution units) and by increasing the
number of cores in the CPU. In both cases, code written and compiled for older
architectures does not necessarily run faster on more recent ones. Furthermore,
even recompilation helps only to a limited extent, since the increased complexity
of CPUs makes it difficult for the compiler to fully exploit hardware capabilities.
In this thesis, code is written to run on a single CPU core. General
implementation techniques that can be used to write high-performance linear algebra
routines on modern architectures are presented. In the design of linear algebra
routines, the following architectural characteristics are assumed:
• the processor has a FP (Floating-Point) unit, or in general the cost (in
clock cycles) of a memop (moving data from main memory to registers or
vice versa) is higher than the cost of a flop (a floating-point operation).
• the FP unit is pipelined, meaning that the instruction throughput (here
expressed as the number of clock cycles between the beginning of the execution
of an instruction and the beginning of the execution of the following (equal)
instruction) is smaller than the latency (the number of clock cycles needed
to have the result of an instruction execution available for another
operation).
• the FP unit may be able to operate simultaneously on small vectors of FP
numbers thanks to SIMD (single-instruction multiple-data).
• there is at least one cache, with a LLC (last level cache) large enough to
individually contain each data matrix within the size of interest.
• a cache line is large enough to contain several FP numbers.
• there is a MMU (memory management unit), with a TLB (translation
lookaside buffer) used to cache virtual memory translations.
These assumptions hold true for virtually all reasonably recent desktop and
mobile CPUs, and for many embedded CPUs.
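As a rough illustration of the pipelining assumption, the following Python sketch estimates how many independent FP instructions must be in flight to keep a pipelined FP unit busy, and the resulting peak number of flops per clock cycle. The function names and all numerical values (latency, issue interval, vector width) are hypothetical and chosen only for illustration; they do not describe any specific processor.

```python
import math


def min_independent_instructions(latency_cycles, issue_interval_cycles):
    """Number of independent instructions needed to hide the latency.

    A new instruction can start every `issue_interval_cycles`, but its result
    is available only after `latency_cycles`. Instructions that depend on each
    other would stall, so at least latency / interval independent ones are
    needed to fill the pipeline.
    """
    return math.ceil(latency_cycles / issue_interval_cycles)


def peak_flops_per_cycle(vector_width, flops_per_element, instructions_per_cycle):
    """Peak flops per cycle for a SIMD FP unit (all arguments are assumptions)."""
    return vector_width * flops_per_element * instructions_per_cycle


# Hypothetical core: fused multiply-add with 5 cycles of latency, two
# instructions issued per cycle (interval 0.5), vectors of 4 doubles.
chains = min_independent_instructions(latency_cycles=5, issue_interval_cycles=0.5)
peak = peak_flops_per_cycle(vector_width=4, flops_per_element=2,
                            instructions_per_cycle=2)
print(chains, peak)
```

In a gemm kernel, these independent instructions are typically obtained by accumulating several elements of the result matrix at the same time.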
Low-end embedded CPUs may lack cache or FP units: in this case, not all of
the presented implementation techniques may be proﬁtable. In particular, if
there is no FP unit, FP operations can be implemented in software (slow),
or fixed-point operations have to be employed: this raises rather different issues
and is not investigated in this thesis.
2.2 Linear algebra routines in high-performance
computing: optimized libraries
This section contains a brief review of widespread libraries for linear algebra
routines developed in the field of high-performance computing.
2.2.1 Reference BLAS and LAPACK
BLAS (Basic Linear Algebra Subprograms) provides the de facto standard building
blocks for dense linear algebra routines. The original version was written in
Fortran in 1979. The Netlib [5] version of the library is used as a reference, and
is not optimized for speed.
BLAS is divided into 3 levels [42]:
• Level 1 BLAS contains routines for vector-vector operations, that perform
O(n) flops on O(n) elements. Each vector element is reused O(1) times,
and typically exactly once (so there is no reuse of data). Since the time
needed to move a vector element from main memory to registers is generally
much larger than the time needed to perform a flop, level 1 BLAS routines
are typically memory-bound, with little room for optimization.
• Level 2 BLAS contains routines for matrix-vector operations, that perform
O(n^2) flops on O(n^2) elements. Each matrix element is reused O(1) times,
and typically exactly once (so there is no reuse of matrix data). On the
other hand, each vector element is reused O(n) times. Therefore level 2 BLAS
routines are also typically memory-bound, with little room for optimization:
an optimized implementation can only partially reduce the memory bandwidth
requirements by reusing vector elements once they are in registers or cache.
In Netlib BLAS, level 2 BLAS routines are implemented as simple nested
double-loops.
• Level 3 BLAS contains routines for matrix-matrix operations, that perform
O(n^3) flops on O(n^2) elements. Since each matrix element is reused O(n)
times, well optimized versions of these routines can typically attain a large
fraction of the full FP throughput by carefully reusing data once loaded
into registers and caches. In Netlib BLAS, level 3 BLAS routines are
implemented as simple nested triple-loops, without taking architectural
parameters such as cache size into consideration.
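The difference between the three levels can be made concrete by counting flops and data elements for one representative routine per level. The following Python sketch does so for square operands of size n, and also shows a plain triple-loop matrix-matrix multiplication in the spirit of the reference implementation. The function names are hypothetical, and the chosen routines (dot product, matrix-vector and matrix-matrix multiplication with accumulation) are only examples of each level.

```python
def flops_per_element(level, n):
    """Ratio between flops performed and data elements touched (size n)."""
    if level == 1:      # dot product: 2n flops on two vectors of n elements
        flops, elements = 2 * n, 2 * n
    elif level == 2:    # y = A*x + y: 2n^2 flops on an n x n matrix, two vectors
        flops, elements = 2 * n * n, n * n + 2 * n
    elif level == 3:    # C = A*B + C: 2n^3 flops on three n x n matrices
        flops, elements = 2 * n ** 3, 3 * n * n
    else:
        raise ValueError("level must be 1, 2 or 3")
    return flops / elements


def gemm_triple_loop(A, B, C):
    """C += A*B on nested lists, as a plain triple loop without any blocking."""
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C


for level in (1, 2, 3):
    print(level, flops_per_element(level, 300))
```

Only for level 3 does the ratio grow with n, which is why only these routines can amortize the cost of memory operations over many flops.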
In code making use of all BLAS levels, level 3 BLAS routines account for most
of the computational cost. Therefore, most effort is spent on the optimization of
these routines, also because in level 1 and 2 BLAS there is little room to improve
routine performance. There exist several optimized implementations of BLAS,
both commercial (e.g. Intel's MKL [12], AMD's ACML [3]) and open-source
(e.g. ATLAS [4], GotoBLAS/OpenBLAS [10, 94], BLIS [6]).
LAPACK (Linear Algebra PACKage) is a library for numerical linear algebra
[15]. It contains routines for more complex linear algebra operations such as
factorizations, matrix inversions, solution of systems of linear equations, and
least-squares, eigenvalue and singular value problems. It is built on top of
BLAS, as BLAS routines are used as much as possible in the implementation of
LAPACK routines. LAPACK does not explicitly handle parallelization, and
therefore it relies on an efficient BLAS implementation to attain high
performance on SMT (Simultaneous Multi-Thread) processors. LAPACK is written in
Fortran 90 and can be found on the Netlib website [11], where the latest version
is currently 3.5.0. Optimized BLAS implementations often provide optimized
versions of key LAPACK routines such as factorizations and matrix inversions.
2.2.2 ATLAS
The ATLAS (Automatically Tuned Linear Algebra Software) project [91] is an
instantiation of the AEOS (Automated Empirical Optimization of Software)
paradigm. It provides an optimized implementation of BLAS that typically is
much faster than the reference BLAS from Netlib.
The main idea is that new hardware architectures are released on a regular
basis, and that it is too difficult or time consuming for a human operator to
keep BLAS updated and optimized for each architecture. Therefore, ATLAS
provides a framework to automatically generate an optimized BLAS library. The
library depends on a number of parameters to adapt to the diﬀerent architecture
features, such as number of registers and cache size. During installation, the
performance is automatically and empirically tuned on the speciﬁc machine
by performing an optimization over the parameter space. The optimization
often makes use of heuristics to reduce the size of the parameter space, possibly
resulting in a sub-optimal choice of parameters.
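The idea of empirical tuning can be sketched in a few lines of Python: a blocked matrix-matrix multiplication is timed for a set of candidate block sizes, and the fastest candidate is retained. This is only a toy illustration of the principle and not ATLAS code; the function names, the candidate block sizes and the test matrices are assumptions made for the example.

```python
import time


def blocked_gemm(A, B, n, nb):
    """Return C = A*B for n x n nested lists, using square blocks of size nb."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for jj in range(0, n, nb):
            for kk in range(0, n, nb):
                # multiply one block of A with one block of B
                for i in range(ii, min(ii + nb, n)):
                    for k in range(kk, min(kk + nb, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + nb, n)):
                            C[i][j] += a * B[k][j]
    return C


def tune_block_size(n, candidates, repeats=3):
    """Time blocked_gemm for each candidate block size and pick the fastest."""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for nb in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_gemm(A, B, n, nb)
            best = min(best, time.perf_counter() - start)
        timings[nb] = best   # keep the best of several runs to reduce noise
    return min(timings, key=timings.get), timings


best_nb, timings = tune_block_size(24, [4, 8, 12])
print(best_nb)
```

The selected block size depends on the machine and on the run, which is precisely the point of an empirical search; restricting the candidates to a short list plays the role of the heuristics mentioned above.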
The code of the original library is written entirely in C and depends on com-
pilers to exploit diﬀerent ISAs (Instruction Set Architectures). Recent versions
often employ hand-optimized assembly kernels for performance critical routines
in order to improve performance and explicitly target diﬀerent ISAs. ATLAS
employs implementation techniques such as blocking for registers and for
different levels of cache, copying of data into aligned memory, and a block-wise
matrix format (if matrices are large enough to justify the copy). The original
gemm kernel is used to multiply square sub-matrices fitting in L1 cache, where the left
operand is transposed and the right is not transposed. This scheme optimizes
the memory access in case of scalar instructions, but it is not eﬀective in case
of SIMD instructions (present nowadays on most architectures).
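The memory access pattern of a kernel with transposed left operand can be sketched as follows, assuming column-major storage in flat lists. Each element of the result is a dot product of two columns that are both contiguous in memory, so the innermost loop runs with unit stride on both operands. The function name and the storage layout are assumptions for illustration, and the code is not taken from ATLAS.

```python
def gemm_tn_colmajor(At, B, m, n, k):
    """Return C = At^T * B as a column-major flat list of size m*n.

    At is k x m and B is k x n, both stored column-major in flat lists,
    so column i of At starts at index i*k and column j of B at index j*k.
    """
    C = [0.0] * (m * n)
    for j in range(n):
        for i in range(m):
            acc = 0.0
            for p in range(k):
                # both operands are read with unit stride in p
                acc += At[p + i * k] * B[p + j * k]
            C[i + j * m] = acc
    return C


# A = [[1, 2], [3, 4]], so At (its transpose) stored column-major is [1, 2, 3, 4];
# B = [[5, 6], [7, 8]] stored column-major is [5, 7, 6, 8].
print(gemm_tn_colmajor([1.0, 2.0, 3.0, 4.0], [5.0, 7.0, 6.0, 8.0], 2, 2, 2))
```

The accumulation into a single scalar is a reduction: with SIMD instructions the partial sums held in a vector register would have to be added together at the end of each dot product, which is one way to see why this scheme suits scalar instructions better.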
ATLAS performance shows a large improvement with respect to reference BLAS,
even if it is often not competitive with respect to hand-optimized BLAS libraries.
On the other hand, ATLAS can quickly give reasonably good results on new
hardware architectures.
2.2.3 GotoBLAS / OpenBLAS
A rather different approach is used in the GotoBLAS library [44]. In this
implementation, the focus is on using analytical insight about architecture details
to choose relevant architecture parameters.
One of the key differences in the implementation framework with respect to
ATLAS is the focus on minimizing TLB (Translation Lookaside Buffer) misses
[43], something ignored in previous BLAS implementations, which focused
solely on caches. Furthermore, instead of blocking for L1 cache such that square
sub-matrices of A, B and C can fit in L1 cache at once (as in ATLAS), the
GotoBLAS approach employs a multi-layered scheme where the sub-matrices are
not necessarily square, nor do they have the same size. A sub-matrix of the B
matrix is kept in L1 cache while a bigger sub-matrix of A is streamed from L2
cache, one panel (i.e. a sub-matrix where one dimension is big and the other is
small) at a time. Registers are used to hold a small sub-matrix of C, so that
elements from A and B are reused once in registers and the memory
bandwidth between L2 cache and registers is large enough to sustain the stream
of data. Furthermore, registers and software prefetch are employed to hide the
higher latency of memory access from L2 cache compared to L1 cache. TLB
misses are minimized by carefully rearranging data in memory such that
elements are stored contiguously in the same order as they are accessed by the
gemm kernel, and by also considering the TLB size when choosing the blocking
size for L2 cache. The computationally most expensive part of the code (the
'inner-kernel') is hand-written in optimized assembly, explicitly targeting the
ISA of the different architectures. The GotoBLAS inner-kernel consists of the three
innermost loops of a layered approach, and it is therefore relatively complex.
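The rearrangement of data into the order of access can be sketched as a packing routine. The following Python example copies a column-major matrix into horizontal panels of mr rows, storing each panel column by column, so that a kernel sweeping a panel reads memory strictly contiguously. It is a simplified illustration under these assumptions and not GotoBLAS code; the function name and the panel height mr are hypothetical.

```python
def pack_row_panels(A, m, k, mr):
    """Pack a column-major m x k matrix (flat list) into panels of mr rows.

    Within each panel the elements are stored column by column, i.e. in the
    same order in which a kernel sweeping the panel from left to right would
    access them. The last panel may have fewer than mr rows.
    """
    packed = []
    for i0 in range(0, m, mr):
        rows = min(mr, m - i0)
        for p in range(k):
            for i in range(i0, i0 + rows):
                packed.append(A[i + p * m])
    return packed


# 4 x 2 matrix with element (i, p) equal to 10*i + p, stored column-major.
A = [0, 10, 20, 30, 1, 11, 21, 31]
print(pack_row_panels(A, m=4, k=2, mr=2))
```

In the original column-major layout, consecutive columns of a panel are m elements apart, so for large matrices they may fall on different memory pages; after packing they are adjacent, which reduces the number of pages (and hence TLB entries) touched by the kernel.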
GotoBLAS is typically faster than ATLAS and competitive with vendor's im-
plementations: its performance is usually very close to the full FP throughput.
The approach proposed in GotoBLAS is currently the best publicly known
approach to optimize BLAS for large-scale performance. GotoBLAS is no longer
under development, but a fork, OpenBLAS [94], provides optimized BLAS for
recent architectures. In this thesis, the latest available version of OpenBLAS is
employed (0.2.15).
2.2.4 BLIS
A recent eﬀort to simplify the development of high-performance BLAS imple-
mentations is BLIS (BLAS-like Library Instantiation Software) [86]. It aims at
providing a framework to quickly develop BLAS libraries for new architectures
by focusing on code-reuse and portability [85].
BLIS simpliﬁes GotoBLAS's approach by splitting the inner-kernel in two: the
micro-kernel (i.e. the innermost loop) and a portable macro-kernel (consisting of
two loops around the micro-kernel). The micro-kernel computes a sub-matrix
of C by using two panels from A and B, and it is the only part of the code
that needs to be carefully hand-optimized. Only one gemm variant is covered
by the micro-kernel, namely 'NT' (A not-transposed and B transposed): this
is the optimal variant using SIMD instructions, since it avoids reductions and
duplication operations in the innermost loop. This gemm micro-kernel is used
to implement the entire level 3 BLAS by properly copying and transposing
data matrices, and by using small routines for the corner cases. High-level
algorithmic choices are taken from GotoBLAS as well, such as blocking for TLB
and streaming from L2 cache.
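The structure of an 'NT' micro-kernel can be sketched in Python as a sequence of rank-1 updates of a small mr x nr sub-matrix of C. The sketch assumes that the panel of A is packed column by column and that the panel of B is packed such that the nr elements needed at each iteration are contiguous; the function name, the packing convention and the sizes are assumptions made for illustration, and the code is not taken from BLIS.

```python
def micro_kernel_nt(Ap, Bp, C, mr, nr, k):
    """C (mr x nr nested list) += A_panel * B_panel^T.

    Ap holds an mr x k panel of A and Bp an nr x k panel of B^T, both packed
    column by column in flat lists. At iteration p, one column of each panel
    is loaded and their outer product is accumulated into C.
    """
    for p in range(k):
        a = Ap[p * mr:(p + 1) * mr]   # mr contiguous elements of A
        b = Bp[p * nr:(p + 1) * nr]   # nr contiguous elements of B
        for i in range(mr):
            for j in range(nr):
                # every element of C has its own accumulator: no reduction
                C[i][j] += a[i] * b[j]
    return C


# A = [[1, 2], [3, 4]] packed by columns; B = [[5, 6], [7, 8]] packed by rows.
C = micro_kernel_nt([1, 3, 2, 4], [5, 6, 7, 8], [[0, 0], [0, 0]], 2, 2, 2)
print(C)
```

Since each element of C is accumulated independently, a SIMD implementation can keep a whole column of the sub-matrix in a vector register and update it with element-wise multiply-add operations, without summing across vector lanes in the innermost loop.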
All parameters in the BLIS implementation can be chosen by means of analytical
insight [60]. Further information can be found in [9]. The BLIS approach makes
the implementation much clearer than in the GotoBLAS approach, and therefore it
also has an important pedagogical value. As a drawback, the current BLIS version
focuses only on large-scale performance, and therefore the small-scale
performance is particularly poor.
2.2.5 Intel's MKL
Intel Math Kernel Library (MKL) is a library of optimized mathematical rou-
tines, developed by Intel. It contains optimized versions of BLAS, LAPACK,
ScaLAPACK, sparse solvers, fast Fourier transforms and vector mathematical
routines. It is proprietary software, that under certain conditions can be
redistributed as freeware. Each developer employing MKL requires a license; free
academic licenses exist.
The BLAS version in MKL is generally considered to be the best option for Intel
machines. In this thesis, the latest available version of MKL is employed (11.3).
2.3 Linear algebra routines in control: code gen-
eration
In optimized BLAS implementations, the focus is usually on large-scale perfor-
mance, and small-scale performance can be poor due to the overhead of memory
copy and unnecessary blocking. Therefore, BLAS is not often used in embedded
optimization, since most problems in this ﬁeld are of relatively small size.
Instead, an approach that has been widely employed in software for embedded
optimization is code generation of linear-algebra routines. It exploits knowledge
from the speciﬁc problem instance to generate a solver tailored to the special
problem structure and size. Since the structure and size of each matrix is known,
this knowledge can be employed to perform optimizations at both generation
and compilation time.
In the following, two widely employed approaches are reviewed. For the
purposes of this thesis, code generation is solely considered as a linear algebra
implementation technique.
2.3.1 CVXGEN
Code generation for embedded optimization gained widespread attention thanks
to CVXGEN [62]. CVXGEN is a code generator for convex optimization prob-
lems, and it has been widely employed in model predictive control. The opti-
mization algorithm is a predictor-corrector IPM.
The search direction is computed by means of a sparse LDL factorization.
Knowledge about problem structure (e.g. sparsity pattern) and size is used to
fully unroll all loops in a Netlib-style triple-loop implementation of linear algebra
routines and remove all unnecessary operations (e.g. multiplications by zero).
All indexes are precomputed at generation time, and there are no branches in
the code. CVXGEN then relies on the compiler to optimize the generated code
for the target hardware. The main disadvantage of this approach is that the
code size grows with the problem size (cubically if the data matrices
are dense, since all triple loops are fully unrolled), and therefore it is feasible
only for very small and sparse problems. As a further drawback, this approach
makes poor use of the instruction cache (since there is no code reuse), further
penalizing performance.
In general, this approach avoids overhead of branches and reduces the number
of ﬂops by removing unnecessary operations, but it makes an extremely bad use
of the hardware resources, attaining a very low performance when measured in
Gﬂops.
2.3.2 FORCES
FORCES [26] is a code generator for numerical optimization, designed with
special focus on optimal control problems. The optimization algorithm is a
predictor-corrector interior point method. For the solution of the unconstrained
sub-problems, it makes use of a block-wise Cholesky factorization of a block
tridiagonal matrix, where generally all blocks have equal size nx × nx.
It distinguishes between dense and sparse operations. Sparse operations are
handled similarly to the CVXGEN solver. However, FORCES makes use of a
rather diﬀerent approach to code generation of dense linear algebra operations.
Namely, dense linear algebra routines are implemented using Netlib-style triple-
loops, but loop sizes are ﬁxed and hard-coded at code-generation time. In
this way, the compiler can decide to unroll loops where proﬁtable. The main
drawback of this approach is that it completely relies on the compiler for the
code optimization: even if loop sizes are ﬁxed, compilers are usually not able
to properly optimize the code (as shown in Chapter 3), and thus this approach
can typically attain only a small fraction of the full FP throughput. Another
drawback is that, in case of diﬀerent variable sizes at diﬀerent stages of the
optimal control problem, a diﬀerent linear algebra routine has to be generated
for each variable size, limiting code reuse to the case of identical operand sizes.
In general, this approach can outperform CVXGEN, but the performance in
Gflops is only slightly better than the reference Netlib BLAS implementation,
and far off the full FP throughput.
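The idea of hard-coding loop sizes can be illustrated with a minimal sketch. It is not code produced by FORCES: the routine name `gemm_nn_fixed` and the size `NX` are hypothetical, and a plain matrix-matrix product is used in place of the solver's actual routines.

```c
/* Hypothetical size: a code generator would emit it as a literal. */
#define NX 3

/* C <- C + A * B for NX x NX column-major matrices: a Netlib-style
 * triple loop whose trip counts are compile-time constants, so the
 * compiler may unroll the loops where it considers this profitable. */
void gemm_nn_fixed(const double A[NX * NX], const double B[NX * NX],
                   double C[NX * NX])
{
    for (int j = 0; j < NX; j++)
        for (int k = 0; k < NX; k++) {
            const double b_kj = B[k + NX * j];
            for (int i = 0; i < NX; i++)
                C[i + NX * j] += A[i + NX * k] * b_kj;
        }
}
```

A different value of `NX` requires a different copy of the routine, which is the code-reuse limitation mentioned above.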
2.4 Comparison of existing dense linear algebra
implementations
This section contains a brief comparison of linear algebra implementations of
interest in embedded optimization. Not all existing implementations are tested,
since results in literature can be used to select the most promising ones.
The test problem is the computation of the lower triangular Cholesky factor
of a positive deﬁnite dense matrix. In double precision, this operation can be
computed by means of the dpotrf routine in LAPACK. LAPACK makes use of
level 3 BLAS for the most computationally intensive operations, and therefore
this test also indirectly evaluates the performance of some BLAS routines. The
dpotrf routine is important in embedded optimization since it is often used in
structure-exploiting algorithms.
Reference BLAS and LAPACK version 3.5.0 are tested. The performance of
the reference BLAS is found to be rather sensitive to the choice of compiler
ﬂags. The best performance with the gcc 4.8.4 compiler is obtained using the
compiler ﬂags -O3 -funroll-loops, plus the ﬂags to enable all supported in-
struction sets on the test machines. Among the optimized open-source libraries,
OpenBLAS is chosen for tests. ATLAS is disregarded since the GotoBLAS's
approach (employed in OpenBLAS) is reported to give better performance [44].
BLIS is not considered since it is a reformulation of the GotoBLAS's approach
and it is still under active development and not completely optimized: therefore
it provides lower performance, especially for small matrices. Additionally, MKL
is considered, as it is the best proprietary library on Intel machines.
Regarding the code generation approaches, only FORCES's approach is considered.
It consists in fixing at compile time the sizes of the triple-loop based linear
algebra, such that the compiler can perform additional optimizations such as
branch removal and loop unrolling. The code has been prepared by merging in a
single routine and translating to C the needed parts (e.g. only lower triangular
Cholesky factorization is considered) of the reference routines dpotf2, dgemv,
ddot, dscal in BLAS and LAPACK. The code is compiled with gcc 4.8.4 with
ﬂags -O3 -funroll-loops, plus the ﬂags to enable all supported instruction
sets on the test machines (same conﬁguration as the reference BLAS). CVX-
GEN's approach has not been considered, since it aims at sparse routines, and
since it is clearly sub-optimal in relation to code size and instruction cache use
in case of dense matrices.
The test machines are two laptops equipped with the processors Intel Core i7
3520M @ 3.6 GHz max turbo frequency (Ivy-Bridge micro-architecture) and In-
tel Core i7 4800MQ@ 3.7 GHz max turbo frequency (Haswell micro-architecture).
Only single-thread code is considered in this test (and more generally in this the-
sis). The Ivy-Bridge micro-architecture supports the AVX ISA, that is enabled
in gcc with the -mavx ﬂag. The Haswell micro-architecture supports the AVX2
and FMA ISAs (that are enabled in gcc with the -mavx2 -mfma ﬂags), and
has double the full FP throughput when these new ISAs are exploited. These
machines will be extensively employed for tests in this thesis, and more details
about them will be given in later sections.
The results of the tests are in Figure 2.1. The bottom plots report the com-
putation time in seconds, in logarithmic scale, averaged over a large number of
operations to increase the accuracy. The top plots report the performance of
the routines in Gflops (that is, billions of FP operations per second), which is
computed by dividing the flop count (computed considering the leading term
in the computational complexity, i.e. $\frac{1}{3}n^3$ flops for the dpotrf routine) by the
time in seconds. Performance plots are scaled such that the full FP throughput
(28.8 Gflops on the Ivy-Bridge processor and 52.8 Gflops on the Haswell processor) is
at the top of each picture. As a note, the full FP throughput for the Haswell
processor assumes a frequency of 3.3 GHz, which is the maximum frequency the
processor can operate at when executing code employing 256-bit wide instructions.
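The arithmetic behind these figures can be sketched as follows. The function names are hypothetical, and the flops-per-cycle values used in the note below (8 and 16) are inferred from the quoted peaks and frequencies rather than stated in the text.

```c
/* Leading-term flop count of dpotrf for an n x n matrix: n^3 / 3. */
double dpotrf_flops(int n)
{
    return (double)n * n * n / 3.0;
}

/* Performance in Gflops, given the flop count and the time in seconds. */
double gflops(double flops, double seconds)
{
    return flops / seconds * 1e-9;
}

/* Full FP throughput in Gflops: clock frequency in GHz times the
 * number of FP operations completed per clock cycle. */
double peak_gflops(double ghz, int flops_per_cycle)
{
    return ghz * flops_per_cycle;
}
```

Under these assumptions, `peak_gflops(3.6, 8)` and `peak_gflops(3.3, 16)` reproduce the 28.8 and 52.8 Gflops quoted above, and a purely illustrative timing of 1 ms for n = 300 would correspond to about 9 Gflops.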
The reference BLAS version (x) steadily gives rather low performance, many
times lower than the full FP throughput. The code-generated version of the
reference BLAS improves performance in case of very small matrices, but as the
matrix size increases, the performance gap with respect to the reference BLAS
version vanishes. This hints at the fact that the compiler can produce similar
code in both cases, and that the performance advantage of the code-generated
version comes from a lower number of branches and smaller overhead in function
calls, rather than from structural differences in the code. Despite the fact that
the Haswell processor has double the full FP throughput, the performance is
only slightly higher than on the Ivy-Bridge processor.
[Figure 2.1: four plots of the dpotrf routine against matrix size n, each comparing
Netlib, MKL, OpenBLAS and CodeGen. Top row, performance in Gflops: (a) Ivy-Bridge,
(b) Haswell. Bottom row, computation time in seconds on a logarithmic scale:
(c) Ivy-Bridge, (d) Haswell.]
Figure 2.1: Performance test for the LAPACK dpotrf routine on an Intel
Core i7 3520M processor (Ivy Bridge micro-architecture, supporting
the AVX ISA) and an Intel Core i7 4800MQ (Haswell micro-architecture,
supporting the AVX2 and FMA ISAs).
Both optimized BLAS implementations perform slightly worse than the code-generated
reference version for small matrices (with a cross-over point for matrix
size around 20), but considerably better for larger ones, and the performance gap
increases with the matrix size. It is interesting to notice that the OpenBLAS
(∗) version does not give better performance on the Haswell processor, hinting
at the fact that the code has not been optimized yet for this architecture. On
the contrary, the performance of the MKL version (3) nearly doubles on the
Haswell processor.
This test is a good example of the fact that nowadays FP throughput improves
mainly due to more and wider execution units, since CPU frequencies have
almost stalled. However, it requires a considerable amount of work to exploit
the computational capabilities of the new hardware, since compilers
(such as gcc version 4.8.4 in these tests) are generally not able to compile generic code
(e.g. the reference BLAS code) into high-performing routines, not even if
code-generated. Additionally, code optimized for an older architecture does not
necessarily perform better on a new one (e.g. dpotrf in OpenBLAS 0.2.15).
2.5 Conclusion
In this chapter, some dense linear algebra implementations are reviewed. The
most promising ones are compared using the Cholesky factorization as a test
case.
The main results of the tests are:
• Compilers are often unable to compile generic triple-loop based linear algebra
code into high-performance routines. This was shown for the gcc
compiler, but it is true more generally. The performance of the resulting
routines is very sensitive to the choice of compiler flags.
• Optimized BLAS libraries employ advanced implementation techniques
that target specific architectural parameters. Code optimized for one
architecture does not necessarily run faster on a more recent one (e.g.
dpotrf in OpenBLAS 0.2.15).
• Code generation of triple-loop based linear algebra can improve the performance
only for very small matrices, but the resulting routines are structurally
analogous to the original reference ones. The performance advantages
come from fewer branches and lower function-call overhead. As such,
code generation could be combined with other implementation techniques
to improve performance for very small matrices, at the cost of requiring the
generation of code for each problem instance. Therefore, code generation
is an implementation technique orthogonal to the ones presented in the
remainder of the thesis.
Chapter 3
Level 3 BLAS and LAPACK
for embedded optimization
Level 3 BLAS contains routines for basic matrix-matrix operations. LAPACK
contains routines for more complex linear algebra operations such as factoriza-
tions and matrix inversions, and it makes use of level 3 BLAS routines. In the
embedded optimization framework, level 3 BLAS and LAPACK routines can be
used in the implementation of second order optimization methods.
If n is the size (meaning the number of rows or columns) of all matrices involved
in a generic level 3 BLAS or LAPACK operation, then the computational cost
of the routines is about $O(n^3)$ flops, that is, cubic in the matrix size. On
the other hand, the data size in memory is $O(n^2)$ elements, that is, quadratic
in the matrix size. This means that each matrix element is used $O(n)$ times.
In modern computers, the time needed to move a matrix element from main
memory into registers is much larger than the time needed to perform a ﬂoating-
point operation on that element once in registers. Therefore, the reuse of matrix
elements in registers and caches is a key requirement to obtain high-performance
level 3 BLAS and LAPACK routines.
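As a worked instance of this count, consider the update C ← C + A · B with square n × n operands: it performs n^3 multiplications and n^3 additions on the 3n^2 elements of A, B and C. The value n = 30 below is only illustrative.

```latex
\[
\frac{\text{flops}}{\text{matrix elements}}
  = \frac{2n^3}{3n^2} = \frac{2}{3}\,n ,
\qquad
n = 30:\quad \frac{2 \cdot 27000}{3 \cdot 900} = \frac{54000}{2700} = 20 .
\]
```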
In the implementation of all level 3 BLAS routines, most of the computation
can be cast in terms of the gemm routine, that is the general matrix-matrix
multiplication routine. In the optimization of level 3 BLAS routines for large
scale matrices, a carefully optimized gemm routine can be used to obtain high-
performance implementations of all other level 3 BLAS routines [45]. Therefore,
the gemm routine is often used as a benchmark for BLAS implementations.
This chapter is organized as follows. Section 3.1 presents the characteristic fea-
tures of embedded optimization problems, that are exploited in the remainder
of the chapter to design level 3 BLAS routines specially tailored to this class
of problems. Section 3.2 presents in detail the optimization of the key routine
in level 3 BLAS: gemm. Section 3.3 presents the optimization of other
level 3 BLAS and LAPACK routines. Section 3.4 presents the comparison of
different implementation techniques for selected level 3 BLAS routines. Finally,
Section 3.5 contains the performance plots for some widely used level 3 BLAS
and LAPACK routines.
3.1 General framework: embedded optimization
Problems in embedded optimization often must be solved in real-time on
resource-constrained hardware. This poses challenges for the development of
sufficiently fast solvers.
Linear algebra routines are a key aspect in the implementation of these solvers,
since they perform the most computationally expensive operations. A set of
linear algebra routines specially tailored for embedded optimization problems
can take advantage of the special features of this class of problems in order to
reduce the computational time. The following features are considered:
1. Embedded optimization problems must often be solved in real-time on
resource-constrained hardware. The computational speed is a key factor.
2. The size of the matrices is generally relatively small, i.e. with size n in
the order of tens or a few hundreds.
3. Each data matrix is often reused several times, e.g. to solve similar opti-
mal control problems at each sampling instant, or to solve similar uncon-
strained optimization problems at each iteration of an IPM.
4. Structure-exploiting optimization algorithms can exploit the high-level
sparsity pattern of the problem, and therefore the data matrices are gen-
erally dense.
These features can be exploited in the design of the linear algebra routines as
follows:
1. Linear algebra routines must make an eﬃcient use of available hardware
resources. Compilers are generally unable to convert generic triple-loop
based linear algebra source code into eﬃcient routines fully exploiting
hardware capabilities. This is due to the fact that modern computer ar-
chitectures are increasingly complex, and compilers lack additional in-
formation about the aim of the routines and the kind of data they are
designed to handle. Therefore the programmer should incorporate high-level
information into the routine design, as well as explicitly target advanced
hardware features such as pipelines and vector execution units.
This excludes the use of triple-loop based linear algebra routines, that are
commonly employed in MPC.
2. Matrices with size n in the order of tens or a few hundreds are assumed
to ﬁt in some cache level. As a consequence, implementation techniques
like blocking for diﬀerent cache levels are not considered, simplifying the
design of the linear algebra routines. Furthermore, for small matrices the
cost of copying or scaling matrices is not negligible with respect to the cost
of performing level 3 BLAS operations. Therefore, linear algebra routines
should be designed to reduce as much as possible the need for copying or
scaling matrices.
This excludes the use of existing BLAS implementations, since they are
optimized for large scale matrices.
3. Since data matrices are often reused several times, it makes sense to store
them in a matrix format that is particularly favorable for the linear algebra
routines. The cost to convert matrices into this format can be amortized
over several matrix reuses, or the conversion may even be performed oﬀ-
line in some cases.
This excludes as well the use of existing BLAS implementations, since
they assume matrices to be stored in column-major (or Fortran-like) or
row-major (or C-like) orders, while internally using optimized formats.
4. Sparse linear algebra requires the use of a special matrix storage and the
eﬃcient handling of the matrix element indexes. The use of sparse linear
algebra makes sense only if the matrix elements are scattered over the
matrix, and not gathered into dense sub-matrices, otherwise dense linear
algebra on these sub-matrices is preferable. Sparse linear algebra can not
make use of processor features as e.g. vectorization, and therefore it has
a low performance with respect to the full FP throughput, and it makes
sense only in case of really sparse problems. Therefore, only dense linear
algebra routines are considered, with the exception of very special and
common sparse matrices with ﬁxed structure (i.e. diagonal, triangular).
3.2 Optimizing the gemm routine
The gemm routine is the general matrix-matrix multiplication routine. In the
BLAS standard, it has the interface (considering the double precision version,
and using C notation)
void dgemm_(char *transA, char *transB, int *m, int *n, int *k, \
double *alpha, double *A, int *lda, double *B, int *ldb, \
double *beta, double *C, int *ldc);
and it computes
C ← α · op(A) · op(B) + β · C
where α and β are scalars, and op(A) can be either A or A^T (similarly for B)
depending on the ﬂags transA and transB. The matrices op(A), op(B) and C
have size m × k, k × n and m × n. The matrices A, B and C are stored in
column-major (or Fortran-like) order, where consecutive elements on the same
row are stored lda (standing for leading dimension of matrix A), ldb and ldc
positions away in memory: therefore these quantities represent the number of
rows in the matrices as allocated in memory. This gives great ﬂexibility in the
computation with sub-matrices.
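The role of the leading dimension can be sketched with two small helpers. The names `cm_get` and `cm_sub` are hypothetical and not part of BLAS.

```c
/* Element (i, j) of a column-major matrix with leading dimension ld:
 * moving one column to the right skips ld positions in memory. */
double cm_get(const double *A, int ld, int i, int j)
{
    return A[i + ld * j];
}

/* Pointer to the sub-matrix whose top-left element is (i0, j0).
 * The sub-matrix is addressed with the same leading dimension ld,
 * so no data needs to be copied. */
const double *cm_sub(const double *A, int ld, int i0, int j0)
{
    return A + i0 + ld * j0;
}
```

For a 3 × 3 matrix M allocated with lda = 3, the trailing 2 × 2 block can then be passed to a BLAS routine as the pointer `cm_sub(M, 3, 1, 1)` together with lda = 3.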
The following alternative interfaces are considered in this thesis:
void dgemm_nn_lib(int m, int n, int k, double *A, int sda, \
double *B, int sdb, int alg, double *C, int sdc, \
double *D, int sdd, int tc, int td);
computing
op(D)← α ·A ·B + β · op(C)
and
void dgemm_nt_lib(int m, int n, int k, double *A, int sda, \
double *B, int sdb, int alg, double *C, int sdc, \
double *D, int sdd, int tc, int td);
op(D)← α ·A ·BT + β · op(C)
where op(C) and op(D) are controlled by the ﬂags tc and td, and α and β are
controlled by the alg ﬂag as
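\[
\texttt{alg} =
\begin{cases}
0 & \Rightarrow\ \alpha \leftarrow 1,\ \beta \leftarrow 0 \\
1 & \Rightarrow\ \alpha \leftarrow 1,\ \beta \leftarrow 1 \\
-1 & \Rightarrow\ \alpha \leftarrow -1,\ \beta \leftarrow 1
\end{cases}
\]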
These are the only cases (corresponding to computation, upgrade and downgrade
of the result matrix) that need to be implemented; they are implemented
explicitly, which avoids scaling the result sub-matrices. In the 'NN'
('NT') version, the matrices A, B, op(C) and op(D) have size m × k, k × n
(n × k), m × n and m × n. The matrices A, B, C and D are assumed to be
stored in the panel-major format proposed in Section 3.2.2. The quantities sda,
sdb, sdc and sdd represent the 'secondary' dimension of matrices in panel-major
format, i.e. the number of columns in the matrices as allocated in memory.
Notice that these routines take 4 matrix operands, meaning that the result
matrix D does not necessarily overwrite the matrix C: it can do so if C and D
correspond to the same memory location. This feature is useful in many cases
to avoid an explicit matrix copy.
Compared to the standard gemm interface, these alternative formulations are
somewhat lower-level interfaces, closer to the interface of the underlying gemm
kernels. This has the advantage of reducing the overhead of the routine, in-
creasing performance for small matrices. As a drawback, they are less general,
covering only the cases commonly encountered in embedded optimization.
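The semantics of the alg flag and of the separate C and D operands can be sketched with an unoptimized reference routine. This is not the thesis implementation: the name `ref_dgemm_nn_alg` is hypothetical, plain column-major storage with leading dimensions is assumed in place of the panel-major format, and the tc and td flags are omitted.

```c
/* Reference semantics of D <- alpha * A * B + beta * C, with (alpha, beta)
 * selected by alg:  0 -> D = A*B,  1 -> D = C + A*B,  -1 -> D = C - A*B.
 * Assumption: column-major storage (NOT panel-major), no transposition. */
void ref_dgemm_nn_alg(int m, int n, int k, const double *A, int lda,
                      const double *B, int ldb, int alg,
                      const double *C, int ldc, double *D, int ldd)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++) {
            double acc = 0.0;
            for (int l = 0; l < k; l++)
                acc += A[i + lda * l] * B[l + ldb * j];
            if (alg == 0)
                D[i + ldd * j] = acc;                   /* C is never read */
            else if (alg == 1)
                D[i + ldd * j] = C[i + ldc * j] + acc;  /* no scaling of C */
            else
                D[i + ldd * j] = C[i + ldc * j] - acc;
        }
}
```

Passing the same pointer for C and D updates the matrix in place, while passing different pointers leaves C untouched without requiring an explicit copy.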
3.2.1 Optimizing the gemm kernel
The gemm kernels are the routines accounting for the innermost loop in the
implementation of the diﬀerent gemm variants. These kernels compute a ﬁxed-
size sub-matrix of the result matrix D by adding a fixed-size sub-matrix of C
to the product of two panels (a panel being a matrix where one of the two
dimensions is much larger than the other) from A and B, where one of the two
panel dimensions is fixed. Each iteration of the kernel loop computes a
rank-1 update of the result sub-matrix. In order to achieve the best performance,
the gemm kernels are optimized for diﬀerent computer architectures. Therefore
features as e.g. the height of the panel and the size of the ﬁxed-size sub-matrix
of D computed by the kernels are architecture-dependent. As an example, the
'NT' variant of the gemm kernel computing a 8× 4 sub-matrix of D, where the
matrices are assumed to be stored in the panel-major format with panel height
bs = 4 (see Section 3.2.2 for more details), is
void kernel_dgemm_nt_8x4_lib4(int kmax, double *A, int sda, \
double *B, double *C, int sdc, double *D, int sdd, \
int alg, int tc, int td);
Notice that the interface of the gemm routines closely resembles the interface of
the corresponding underlying gemm kernel.
The gemm kernels are the key kernels in all level 3 BLAS routines, since the
computationally most expensive parts of all level 3 BLAS routines can be cast
in terms of these kernels. In turn, LAPACK routines are built on top of level 3
BLAS routines, and therefore the gemm kernels account for most of the computations
in LAPACK routines too. This section presents the generic techniques
used in the implementation of all gemm kernels, and shows how to optimize them
for a hypothetical computer architecture. A collection of gemm kernels optimized
for a number of computer architectures and the specific implementation
details are in Chapter 5.
3.2.1.1 Blocking for registers
The most important technique is probably blocking for registers, meaning with
that the simultaneous computation of all elements of a sub-matrix of the result
matrix small enough to ﬁt into registers. It has the twofold aim of hiding latency
of instructions, and reducing the number of memops.
Regarding instruction latency, most FP instructions are pipelined, and their
latency is larger than their throughput. A pipelined instruction is performed in
a number of steps, each one taking a certain number of clock cycles (often 1) and
being performed by different circuitry. The operands of the instruction have
to go through all the steps, one after the other, and the result is available after
a number of clock cycles equal to the latency of the instruction. The idea of
pipelining is that, while an instruction is at a certain stage of the pipeline, other
equal but independent instructions (meaning a sequence of the same instruction
operating on independent operands, such that the result of an instruction is not
the operand of another) can be processed at the same time, at other stages of
the pipeline. If all instructions are independent, after an initial delay needed for
the ﬁrst instruction to go through the entire pipeline (and equal to the latency of
the instruction), every number of clock cycles equal to the throughput another
instruction is completed, and at any time all stages of the pipeline are busy,
working on diﬀerent operands. If there is dependency between the output of
an instruction and the input of another instruction, then the second instruction
can not be processed until the result of the ﬁrst instruction is available: this
stalls the pipeline.
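The effect of dependencies on the cycle count can be sketched with a simple model of a single pipelined unit. The function names are hypothetical, and the model ignores every other source of delay; as in the text, the throughput is expressed in clock cycles per instruction.

```c
/* Clock cycles to complete n independent instructions: the first result
 * appears after `latency` cycles, then one more every `throughput` cycles. */
long cycles_independent(long n, long latency, long throughput)
{
    return n > 0 ? latency + (n - 1) * throughput : 0;
}

/* Clock cycles to complete n instructions when each one needs the result
 * of the previous one: the pipeline stalls for a full latency every time. */
long cycles_dependent(long n, long latency)
{
    return n * latency;
}
```

With a latency of 4 and a throughput of 1, for example, 100 independent instructions take 103 cycles in this model, while 100 dependent ones take 400.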
Blocking for registers can be used to hide the latency of instructions and obtain full
throughput, since the computation of several result elements at the same time
can provide enough independent instructions to keep the pipeline full.
Blocking for registers can also be used to reduce the number of memops, since
each matrix element can be reused several times once loaded into registers: this
means that fewer loads are necessary to perform the same number of ﬂops. This
is useful to reduce the memory bandwidth requirements below the maximum
memory bandwidth available in the system, and therefore to prevent the
kernel from becoming memory-bound.
The idea is explained with an example. Let us consider a hypothetical processor
that can perform an FMA (fused multiply-add) every clock cycle (throughput=1),
while the result is available after 4 clock cycles (latency=4). The FMA is a
3-operand instruction defined as
\[
z \leftarrow \mathrm{FMA}(x, y, z) \doteq z + x \cdot y.
\]
Furthermore, this hypothetical processor can load one FP register from L1 cache
every clock cycle.
Without loss of generality, the following gemm operation is considered: two
square matrices A and B of size n × n are multiplied, and the result is used to
update the content of a square matrix C of size n × n, as
C ← C + A · B.
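Using the definition of matrix-matrix product, each element $c_{ij}$ of C can be
computed as the inner product
\[
c_{ij} = c_{ij} + \sum_{k=0}^{n-1} a_{ik} \cdot b_{kj},
\]
that is performed using the sequence of n FMA instructions
\[
\begin{array}{c|c}
 & b_{kj} \\ \hline
a_{ik} & c_{ij} \leftarrow c_{ij} + a_{ik} \cdot b_{kj}
\end{array}
\qquad k = 0, 1, \dots, n-1, \tag{3.1}
\]
as
\[
\begin{aligned}
c_{ij} &\leftarrow c_{ij} + a_{i0} \cdot b_{0j} \\
c_{ij} &\leftarrow c_{ij} + a_{i1} \cdot b_{1j} \\
c_{ij} &\leftarrow c_{ij} + a_{i2} \cdot b_{2j} \\
&\ \ \vdots
\end{aligned}
\]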
These FMA instructions are dependent, because cij is the result of one
instruction and an operand of the following one. Therefore, an FMA can be
performed only every 4 clock cycles. In any case, 2 FP numbers (aik and bkj) need
to be loaded from memory to perform each FMA. Therefore, even ignoring the latency
constraint for a moment, the memory bandwidth constraint alone would make it
impossible to perform 1 FMA every clock cycle and keep the pipeline full.
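The dependent sequence (3.1) corresponds to the usual scalar inner-product loop, sketched below. The function name `dot_update` is hypothetical, and the n × n matrices are assumed to be stored in column-major order with leading dimension n.

```c
/* c_ij <- c_ij + sum_k a_ik * b_kj for column-major n x n matrices.
 * Each multiply-add reads the c_ij written by the previous one, so the
 * loop forms a single dependency chain, and every iteration loads two
 * matrix elements to perform one multiply-add. */
double dot_update(int n, const double *A, const double *B,
                  int i, int j, double c_ij)
{
    for (int k = 0; k < n; k++)
        c_ij += A[i + n * k] * B[k + n * j];
    return c_ij;
}
```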
One way to keep the pipeline full despite these constraints is to compute several
elements of the matrix C at the same time. If 4 FP registers can be used to hold
a 2×2 sub-matrix of C, then the 4 elements of the sub-matrix can be computed
simultaneously (using 0 and 1 as indexes in place of i + 0, i + 1, j + 0, j + 1 to
keep the notation lighter)
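\[
\begin{array}{c|cc}
 & b_{k0} & b_{k1} \\ \hline
a_{0k} & c_{00} \leftarrow c_{00} + a_{0k} \cdot b_{k0} & c_{01} \leftarrow c_{01} + a_{0k} \cdot b_{k1} \\
a_{1k} & c_{10} \leftarrow c_{10} + a_{1k} \cdot b_{k0} & c_{11} \leftarrow c_{11} + a_{1k} \cdot b_{k1}
\end{array}
\qquad k = 0, \dots, n-1, \tag{3.2}
\]
as
\[
\begin{aligned}
c_{00} &\leftarrow c_{00} + a_{00} \cdot b_{00} \\
c_{10} &\leftarrow c_{10} + a_{10} \cdot b_{00} \\
c_{01} &\leftarrow c_{01} + a_{00} \cdot b_{01} \\
c_{11} &\leftarrow c_{11} + a_{10} \cdot b_{01} \\
c_{00} &\leftarrow c_{00} + a_{01} \cdot b_{10} \\
c_{10} &\leftarrow c_{10} + a_{11} \cdot b_{10} \\
c_{01} &\leftarrow c_{01} + a_{01} \cdot b_{11} \\
c_{11} &\leftarrow c_{11} + a_{11} \cdot b_{11} \\
c_{00} &\leftarrow c_{00} + a_{02} \cdot b_{20} \\
&\ \ \vdots
\end{aligned}
\]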
Regarding the instruction latency issue, this time in the 3 idle clock cycles
between consecutive updates of the c00 element, other 3 elements are updated,
ideally keeping the pipeline full if the elements from A and B can be loaded fast
enough.
Regarding the memory bandwidth issue, each A and B element is reused 2 times
once in registers, since e.g. in the update of c10 only a1k needs to be loaded,
since bk0 has already been loaded to compute c00 at the previous clock cycle,
and so on. Therefore, only 1 element from either A or B needs to be loaded
at each clock cycle. Generally speaking, in level 3 BLAS operations, a cubic
number of ﬂops is performed on a quadratic number of matrix elements, so the
larger the sub-matrix held in registers, the higher the reuse-factor.
Thanks to the beneﬁts of hiding instruction latency and decreasing memory
bandwidth requirements, it is possible to keep the FMA pipeline full and get
full throughput while satisfying the latency and bandwidth constraints. The
achieved speed-up with respect to the case (3.1) is of a factor 4.
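The scheme (3.2) can be sketched in scalar C as follows. The kernel name `kernel_2x2` and the column-major layout with leading dimensions are assumptions made for illustration; whether the four accumulators actually stay in registers is left to the compiler.

```c
/* 2 x 2 block of C <- C + A * B, column-major storage.
 * A points to a 2 x kmax panel, B to a kmax x 2 panel. */
void kernel_2x2(int kmax, const double *A, int lda,
                const double *B, int ldb, double *C, int ldc)
{
    /* four independent accumulators, one per element of the block */
    double c00 = C[0], c10 = C[1], c01 = C[ldc], c11 = C[1 + ldc];
    for (int k = 0; k < kmax; k++) {
        /* 4 loads feed 4 multiply-adds: each loaded element is used twice */
        const double a0 = A[0 + lda * k], a1 = A[1 + lda * k];
        const double b0 = B[k], b1 = B[k + ldb];
        c00 += a0 * b0;
        c10 += a1 * b0;
        c01 += a0 * b1;
        c11 += a1 * b1;
    }
    C[0] = c00; C[1] = c10; C[ldc] = c01; C[1 + ldc] = c11;
}
```

Consecutive multiply-adds within one iteration write different accumulators, so in this sketch none of them has to wait for the result of the one issued just before it.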
The blocking idea can be applied to other memory levels (as for example blocking
for level 2 or 3 cache) to take into account the fact that the available memory
bandwidth decreases at lower levels in the memory hierarchy. However, since
our targets are relatively small-scale matrices that are assumed to fit in some
cache level, blocking for cache is not further considered in this thesis.
3.2.1.2 Use of SIMD instructions
SIMD (Single-Instruction Multiple-Data) are instructions that perform the same
operation in parallel on all elements of small vectors of data. In theory, instruc-
tions operating on vectors of size nv can improve the performance up to a factor
nv.
Many modern architectures have SIMD instructions, since they are a relatively
easy and eﬃcient way to increase single-thread performance, especially in sci-
entiﬁc computing. In particular, x86 and x86_64 architectures have several
versions of SSE instructions (operating on 128-bit-wide vectors, each holding 2
double or 4 single precision FP numbers) and AVX instructions (operating on
256-bit-wide vectors, each holding 4 double or 8 single precision FP numbers),
while ARM architecture has NEON instructions (operating on 128-bit-wide vec-
tors).
Compilers can attempt to automatically vectorize scalar code, emitting SIMD
instructions. However, this is not a simple task, since the use of SIMD may
require deep changes to the code structure (e.g. to ensure proper alignment of
data) that are better left to the programmer, who understands the overall
algorithm. The use of SIMD can be ensured by explicitly coding the instructions in assembly
or inline assembly (low-level solution, which gives full control also over instruction
scheduling and register allocation) or by means of intrinsics (higher-level
solution, where intrinsics are special functions called from C code and directly
mapped to SIMD instructions, leaving instruction scheduling
and register allocation to the compiler).
Continuing the example, let us assume that the hypothetical processor has 2-
wide vector units (i.e. each holding 2 FP numbers), and vector FMA and
load instructions (operating on the 2-wide vectors) with the same latency and
throughput of the scalar instructions used in Section 3.2.1.1. The kernel op-
erating on the 2 × 2 sub-matrix in (3.2) can be implemented using the vector
instructions as
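\[
\begin{array}{c|cc}
 & b_{k0} & b_{k1} \\ \hline
\begin{matrix} a_{0k} \\ a_{1k} \end{matrix} &
\begin{bmatrix} c_{00} \\ c_{10} \end{bmatrix} \leftarrow
\begin{bmatrix} c_{00} \\ c_{10} \end{bmatrix} +
\begin{bmatrix} a_{0k} \\ a_{1k} \end{bmatrix} \cdot
\begin{bmatrix} b_{k0} \\ b_{k0} \end{bmatrix} &
\begin{bmatrix} c_{01} \\ c_{11} \end{bmatrix} \leftarrow
\begin{bmatrix} c_{01} \\ c_{11} \end{bmatrix} +
\begin{bmatrix} a_{0k} \\ a_{1k} \end{bmatrix} \cdot
\begin{bmatrix} b_{k1} \\ b_{k1} \end{bmatrix}
\end{array}
\]
(with k = 0, . . . , n − 1), where the square brackets indicate the small vectors.
As a result, the number of instructions is halved. This means that there are no
longer enough independent instructions to keep the vector FMA pipeline full, and
the throughput is equal to the one in (3.2), since a 2-wide FMA every 2 clock
cycles is equivalent to 1 scalar FMA per clock cycle. It is important to note
that the vectors containing elements from B have the same element repeated:
depending on the ISA, this has to be done explicitly (e.g. using shuffle or
broadcast instructions in SSE and AVX ISAs), or it is handled implicitly by the
FMA instruction (e.g. NEON ISA).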
In order to have again enough independent instructions to keep the pipeline full,
it is possible to compute an additional 2 × 2 sub-matrix of C. Therefore, the
optimal gemm kernel making use of vector instructions computes e.g. a 4× 2 (or
possibly a 2× 4, even if this has the disadvantage of requiring more shues and
broadcast instructions, due to the fact that the elements from B are repeated)
sub-matrix of C as
bk0 bk1
a0k
a1k
[
c00
c10
]
←
[
c00
c10
]
+
[
a0k
a1k
]
·
[
bk0
bk0
] [
c01
c11
]
←
[
c01
c11
]
+
[
a0k
a1k
]
·
[
bk1
bk1
]
a2k
a3k
[
c20
c30
]
←
[
c20
c30
]
+
[
a2k
a3k
]
·
[
bk0
bk0
] [
c21
c31
]
←
[
c21
c31
]
+
[
a2k
a3k
]
·
[
bk1
bk1
] (3.3)
(with k = 1, ..., n − 1). As a side effect, each B element is now used 4 times,
further reducing memory bandwidth requirements. Overall, this kernel gives a
speed-up of a factor 2 with respect to (3.2), and therefore of a factor 8 with
respect to (3.1).
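The 4 × 2 kernel in (3.3) can be sketched in portable C as follows. This is only an illustration under stated assumptions: the 2-wide vector type `vec2` and its helper functions are hypothetical stand-ins for real SIMD registers and instructions, the function name is made up, and the storage choices (A and C column-major with 4 rows, B row-major with 2 columns) are assumptions chosen for brevity, not the format used by the thesis.

```c
/* Hypothetical 2-wide vector type, emulated in plain C. */
typedef struct { double v[2]; } vec2;

static vec2 vec2_load(const double *p) { vec2 r = {{p[0], p[1]}}; return r; }
static vec2 vec2_broadcast(double x) { vec2 r = {{x, x}}; return r; }
static void vec2_store(double *p, vec2 x) { p[0] = x.v[0]; p[1] = x.v[1]; }

/* Element-wise fused multiply-add: acc + a * b. */
static vec2 vec2_fma(vec2 acc, vec2 a, vec2 b)
{
    acc.v[0] += a.v[0] * b.v[0];
    acc.v[1] += a.v[1] * b.v[1];
    return acc;
}

/* C (4 x 2, column-major) += A (4 x n, column-major) * B (n x 2, row-major).
 * Four independent accumulators are updated at every iteration over k. */
void kernel_4x2_sketch(int n, const double *A, const double *B, double *C)
{
    vec2 c00 = vec2_load(C + 0), c20 = vec2_load(C + 2);
    vec2 c01 = vec2_load(C + 4), c21 = vec2_load(C + 6);
    for (int k = 0; k < n; k++) {
        vec2 a0 = vec2_load(A + 4 * k);          /* [a0k, a1k] */
        vec2 a2 = vec2_load(A + 4 * k + 2);      /* [a2k, a3k] */
        vec2 b0 = vec2_broadcast(B[2 * k + 0]);  /* [bk0, bk0] */
        vec2 b1 = vec2_broadcast(B[2 * k + 1]);  /* [bk1, bk1] */
        c00 = vec2_fma(c00, a0, b0);
        c01 = vec2_fma(c01, a0, b1);
        c20 = vec2_fma(c20, a2, b0);
        c21 = vec2_fma(c21, a2, b1);
    }
    vec2_store(C + 0, c00); vec2_store(C + 2, c20);
    vec2_store(C + 4, c01); vec2_store(C + 6, c21);
}
```

Each broadcast B vector is reused by two FMAs, so the loop body holds four independent FMAs for four loads and broadcasts.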
A drawback of vectorization is that a kernel operating on larger sub-matrices
of C is used. This means that, as the vector size increases, the minimum size
of the sub-matrix required to get full throughput also increases, and therefore
the performance gains get smaller for small matrices. Generally speaking,
large latencies and low memory bandwidth have a negative effect on performance
for small-scale matrices.
As mentioned, SIMD instructions often have alignment requirements. In our
implementation, alignment is automatically ensured by the choice of the matrix
format used by the linear algebra routines.
3.2.2 Use of contiguous memory and panel-major matrix format
The use of contiguous memory is important for several reasons: it helps to fully
exploit the available memory bandwidth, it improves cache reuse and it reduces
the TLB misses.
When an element is fetched from memory, data is moved into cache in chunks
(called cache lines) of typically 32 or 64 bytes. This means that the access to
elements belonging to the same cache line is faster, since only one cache line
needs to be moved into cache. On the contrary, random access of elements
often requires a diﬀerent cache line for each element. Therefore the access of
contiguous elements maximizes the eﬀective memory bandwidth.
In order to speed up cache access and reduce its complexity and cost, a given
cache line can be mapped to only a limited number n of locations in cache: this
kind of cache is called n-way associative. As a consequence, it may happen that
cache lines are evicted from cache even if the cache is not fully utilized. As an
example, if a matrix is stored in column-major (or Fortran-like) order, for
certain column lengths it can happen that contiguous elements on the same row are
mapped into the same cache location, evicting each other. This effectively acts
as a reduction in cache size. Use of contiguous memory can mitigate this, since
consecutive cache lines are mapped to different cache locations.
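The column-major conflict described above can be illustrated with a small model. The cache geometry below (32 KiB, 8-way, 64-byte lines) is an assumption chosen for illustration, the function names are hypothetical, and the model ignores the distinction between virtual and physical addresses and assumes the matrix starts at a cache-line boundary.

```c
#include <stddef.h>

/* Assumed cache geometry: 32 KiB, 8-way, 64-byte lines, hence 64 sets. */
enum { LINE_BYTES = 64, WAYS = 8, CACHE_BYTES = 32 * 1024,
       NUM_SETS = CACHE_BYTES / (LINE_BYTES * WAYS) };

/* Cache set that a given byte offset maps to. */
size_t cache_set(size_t byte_offset)
{
    return (byte_offset / LINE_BYTES) % NUM_SETS;
}

/* Number of distinct cache sets touched by the first `count` elements of
 * row 0 of a column-major matrix of doubles with leading dimension ld. */
size_t distinct_sets_in_row(size_t ld, size_t count)
{
    unsigned char seen[NUM_SETS] = {0};
    size_t distinct = 0;
    for (size_t j = 0; j < count; j++) {
        size_t s = cache_set(j * ld * sizeof(double));
        if (!seen[s]) { seen[s] = 1; distinct++; }
    }
    return distinct;
}
```

In this model, a leading dimension of 512 doubles (4096 bytes) maps every element of a row to the same set, so 16 row elements compete for 8 ways. A leading dimension of 520 doubles spreads the same 16 elements over 16 different sets.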
Finally, memory is seen from a program as virtual memory, that is mapped
into physical memory locations by means of a translation table in the MMU
(Memory Management Unit), the page table. The TLB (Translation Lookaside
Buﬀer) is a cache for the page table, containing the physical address of the most
recently used memory pages (each usually of size 4 KB). If memory is accessed in
a non-contiguous way, it may happen that TLB is not large enough to translate
the entire content of cache, increasing the number of expensive TLB misses.
In [43], a gemm design based on the minimization of the TLB misses is proposed.
In this approach, the needed sub-matrices from the A and B matrices are packed
into memory buﬀers before each call to the gemm kernel. These sub-matrices are
carefully packed (and possibly transposed) using a multi-layered matrix format.
Matrix elements are stored in the exact same order as accessed by the gemm
kernel, and taking into account cache and TLB sizes. This approach gives near
full FP throughput for large matrices, but it incurs a notable overhead for
small matrices, since in this case the (quadratic) cost of packing data dominates
the (cubic) cost of performing FP operations.
Taking into account the fact that matrices in embedded optimization are rela-
tively small, and therefore assumed to ﬁt in cache, it is possible to modify the
above approach to reduce the overhead due to the packing of data. The key idea
is that data matrices in embedded optimization are often reused several times.
Therefore it makes sense to convert the data matrices only once into an optimal
format (used as the default matrix format by all linear algebra routines) and
to reuse the converted matrices several times, thereby amortizing the conversion
cost. Furthermore, linear algebra routines can be designed such that the
output matrix is automatically stored in this optimal format at no extra cost,
meaning that only the original data matrices possibly need to be converted.
Since blocking for cache is not employed, the optimal matrix format is rather
simple: the complex matrix format proposed in [43] simplifies to a single layer.
Namely, in the gemm routine, the A and B matrices are packed into horizontal
panels of contiguous data, as shown in Figure 3.1. The panel height (in the
following bs, for block size) has to be the same for all operand matrices. As a
consequence, the generic kernel size mr × nr has the constraint that both mr
and nr have to be multiples of bs. The values of mr and nr are
architecture-dependent and a function of the number of registers as well as the
SIMD width. The value of bs is usually chosen as the smaller of mr and nr, such
that every time a cache line is accessed, it is fully utilized.
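A minimal sketch of the panel-major addressing described above, assuming that each panel stores its bs rows column by column and that consecutive panels are separated by bs · sda elements. The names `pm_index` and `cm_to_pm` are hypothetical, and interpreting `sda` as the number of columns stored per panel is an assumption made for this illustration.

```c
#include <stddef.h>

/* Offset of element (i, j) in a panel-major array with panel height bs.
 * sda is assumed to be the number of columns stored in each panel. */
size_t pm_index(int i, int j, int bs, int sda)
{
    return (size_t)(i / bs) * bs * sda   /* start of the panel holding row i */
         + (size_t)j * bs                /* start of column j in that panel  */
         + (size_t)(i % bs);             /* row offset inside the column     */
}

/* Convert an m x n column-major matrix A (leading dimension lda) into the
 * panel-major array pA, which must hold ((m + bs - 1) / bs) * bs * sda
 * doubles with sda >= n. Rows added as padding are set to zero. */
void cm_to_pm(int m, int n, const double *A, int lda,
              double *pA, int bs, int sda)
{
    int mp = (m + bs - 1) / bs * bs;     /* m rounded up to a multiple of bs */
    for (int j = 0; j < n; j++)
        for (int i = 0; i < mp; i++)
            pA[pm_index(i, j, bs, sda)] =
                (i < m) ? A[i + (size_t)j * lda] : 0.0;
}
```

Within a panel, moving from the last row of one column to the first row of the next column advances by exactly one element, so a kernel streaming along a panel reads contiguous memory.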
Fig. 3.1 shows the panel-major matrix layout and the behavior of the 'NT'
variant of the gemm micro-kernel, that computes
$D \leftarrow \alpha \cdot A \cdot B^T + \beta \cdot C$, where
the left factor A is not transposed and the right factor B is transposed. This is
the optimal variant, since both A and B are accessed panel-wise (i.e. data is read
along panels). Furthermore, the regular access pattern of data in memory (i.e.
access of contiguous memory locations) can be easily detected by the hardware
prefetcher (if present in the architecture). On the contrary, in the 'NN' variant
the A matrix is optimally accessed panel-wise, but the B matrix is accessed
across panels (i.e. only a few columns of each B panel are used, before moving
to the following panel), therefore making a worse use of caches and TLBs. This
complex access pattern is generally not detected by the hardware prefetcher, and
therefore software prefetch has to be explicitly used to move B elements into
cache before they are needed. In summary, embedded optimization algorithms
making use of the proposed linear algebra routines should be designed to use
the 'NT' gemm variant whenever possible.
Continuing the example of the hypothetical processor, since the SIMD width is
2 and the optimal gemm kernel is 4 × 2, a good choice for bs is 2. Fig. 3.1 shows
the behavior of the 4 × 2 kernel operating on matrices packed with bs = 2:
two panels of A and one panel of B are streamed to compute 4 × 2 elements of
C. Notice that, in this case, following the approach proposed in [43], A would
be packed into a buffer with bs = 4 while B would be packed into a buffer with
bs = 2: this is the optimal choice since it further improves cache use and reduces
TLB misses, but it is not compatible with the constraint that bs has to be the
same for all data matrices. This constraint is necessary to allow the use of the
panel-major matrix format as the common matrix format for all linear-algebra
routines.
Figure 3.1: Matrix layout in memory (called panel-major matrix format): matrix
elements are stored in the same order in which the gemm micro-kernel accesses
them. This micro-kernel implements the optimal 'NT' variant (namely, A
not transposed, B transposed). The panel height bs is the same for the left
and the right matrix operand, as well as for the result matrix. Each arrow
represents the bs elements that are on the same column within a panel. The
diagonal lines indicate that, once the last element of a column is accessed,
the following element to be accessed is the first of the next column within
the same panel.
The proposed approach gives steady and near to full FP throughput for matrices
ﬁtting in LLC, minimizing the issues related to cache associativity and TLB
misses. The performance is high also for small matrices, since the cost of packing
data into the optimal matrix format is well amortized.
3.2.3 Order of outer loops
A gemm routine optimized for small matrices can be implemented by means of
two loops around the carefully optimized gemm kernel. In case of a gemm kernel
where mr and nr are not equal, the order of these two loops has a big impact on
the performance of the gemm routine as the size of the factor matrices increases.
Let us assume that mr > nr (i.e. the gemm kernel computes a sub-matrix of
C with more rows than columns): this is the usual case in architectures with
SIMD instructions, since it reduces the cost of shuffling the vectors containing
the B elements (see Chapter 5 for more details). Therefore, in the gemm kernel,
the number of streamed panels from A (i.e. mr/bs) is larger than the number
of streamed panels from B (i.e. nr/bs). Under this assumption, it is convenient
that the outermost loop is over the rows, and the intermediate loop over the
columns of the result matrix. In this way, the B matrix is swept by loading nr/bs
panels at a time from L2 cache or LLC while the mr/bs panels from A are kept
in L1 cache. This reduces the memory bandwidth requirements from L2 cache
or LLC, since a lower number of non-L1-resident panels needs to be loaded to
compute the same amount of flops.
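The loop order discussed above can be sketched as two loops around a micro-kernel. This is a simplified illustration, not the thesis implementation: the function names are hypothetical, the matrices are stored column-major rather than panel-major, the kernel size (MR = 4, NR = 2) is the one of the hypothetical processor, and m and n are assumed to be multiples of MR and NR.

```c
enum { MR = 4, NR = 2 };

/* C[0:MR, 0:NR] += A[0:MR, 0:k] * B[0:NR, 0:k]^T, all column-major. */
static void micro_kernel(int k, const double *A, int lda,
                         const double *B, int ldb, double *C, int ldc)
{
    double acc[MR][NR] = {{0.0}};
    for (int l = 0; l < k; l++)
        for (int j = 0; j < NR; j++)
            for (int i = 0; i < MR; i++)
                acc[i][j] += A[i + l * lda] * B[j + l * ldb];
    for (int j = 0; j < NR; j++)
        for (int i = 0; i < MR; i++)
            C[i + j * ldc] += acc[i][j];
}

/* C (m x n) += A (m x k) * B (n x k)^T, with m % MR == 0 and n % NR == 0. */
void gemm_nt_sketch(int m, int n, int k, const double *A, int lda,
                    const double *B, int ldb, double *C, int ldc)
{
    /* Outer loop over rows: the MR rows of A are reused for every j. */
    for (int i = 0; i < m; i += MR)
        /* Inner loop over columns: NR rows of B are streamed each time. */
        for (int j = 0; j < n; j += NR)
            micro_kernel(k, A + i, lda, B + j, ldb, C + i + j * ldc, ldc);
}
```

With this order, the larger operand of the kernel (MR rows of A) is loaded once per outer iteration and can stay in L1 cache, while only the smaller operand (NR rows of B) has to be fetched again at each inner iteration.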
Ignoring cache associativity, as a rule of thumb this approach gives close to full
performance in the computation of matrices with k up to the value such that
mr ·k+nr ·k elements can ﬁt in L1 cache at once. In practice, this k value is often
in the range 200 to 400, that is large enough for most embedded optimization
applications. For larger values of k, optimal performance can be recovered by
adding blocking for diﬀerent cache levels. However, this is not of interest for
this thesis.
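As a rough worked example of this rule of thumb, assume (these figures are illustrative and not taken from the text) a 32 KiB L1 data cache, double-precision elements of 8 bytes, and a kernel size of mr = 8 and nr = 4:

```latex
\[
(m_r + n_r) \cdot k \cdot 8\,\text{B} \le 32768\,\text{B}
\quad\Longrightarrow\quad
k \le \frac{32768}{(8 + 4) \cdot 8} \approx 341 ,
\]
```

which falls within the range of 200 to 400 quoted above.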
3.2.4 Transposition, edges and corners handling
The interface of the gemm kernel has two flags to control the transposition of
the C and D matrices. This is rather different from the standard gemm
interface in BLAS, that allows for the matrices A and B to be transposed.
This different choice is justified by the fact that the proposed
linear algebra routines are specially tailored for small-scale performance, and
therefore a lower level interface (i.e. closer to the interface of the underlying
gemm kernel) has been chosen in order to reduce overhead. In the gemm kernel,
the transposition of the C and D matrices can be performed at little or no extra
cost (depending on the ISA), and it adds extra flexibility. For example, the
product $D \leftarrow A^T \cdot B^T$ can be rewritten as
$D \leftarrow (B \cdot A)^T$ and therefore cast in
terms of the 'NN' variant of the gemm routine.
About edges and corners handling, let us assume that the optimal gemm kernel
has size mr × nr. Since in general mr > 1 and nr > 1, there exists the issue
of computing the result matrix C when m is not a multiple of mr or n is not
a multiple of nr.
In optimized BLAS libraries where packing is employed, this is handled in the
packing routine [45, 86]: the sub-matrices are padded with zeros while packed
in the buﬀers, in order to ensure the correctness of the result. Then the result
sub-matrix of the exact size is copied from a buﬀer into the correct memory
location, again ensuring correctness of the result. However, this approach cannot
be employed here, since in the proposed gemm implementation matrices are not
packed into buffers, and in any case it introduces overhead due to the additional
matrix copy.
A simple solution that does not require packing is to add the padding to each
matrix C already when it is stored in memory in the panel-major format, such that
the padded matrix $C^+$ has size $m^+ \times n^+$, where $m^+$ is the smallest
multiple of mr that is not smaller than m, and similarly for $n^+$. In this way
it is possible to simply repeat the kernel until the result matrix is fully
covered. However, this approach has the severe drawback of preventing work on
sub-matrices, since the data surrounding the result sub-matrix may be corrupted.
Another solution would be the use of smaller kernels to exactly cover the edges
and the corners of the result matrix. This requires a trade-off between
performance and code reuse. In fact, it is possible to handle all cases with just 3
additional kernels (in the following called 'unitary' kernels), of size mr × 1,
1 × nr and 1 × 1, and repeat them until the edges and the corners are exactly
covered. However, these kernels generally have very low performance: it is
possible to partially alleviate the instruction latency constraint by using
several registers to store partial accumulations of each result element, but it
is not possible to overcome the memory bandwidth constraint. This particularly
affects the performance for small matrices, that is the main focus in embedded
optimization.
On the opposite side, it is possible to write a kernel to handle each edge and
corner with just one kernel call. These kernels give better performance than the
repetition of unitary kernels. However, the number of required kernels grows
quadratically with the size of the optimal gemm kernel, making this approach
unappealing due to the need of writing and testing a large number of kernels
(lack of code reuse).
In the proposed linear algebra implementation framework, the most efficient
solution is found to be the following. A few small kernels are designed: the
number of rows is a multiple of the effective SIMD width (in fact, the time
required to compute 1 element or an entire SIMD vector of elements is the
same), or a power of two in case of scalar ISAs; the number of columns is a
power of two. This greatly reduces the number of possible row and column size
combinations. Each kernel is designed to compute in registers a sub-matrix of
the result matrix equal to the kernel size, but to store in memory sub-matrices of
different (equal or smaller) sizes. The overall set of kernels is able to store
sub-matrices of any size equal to or smaller than the optimal gemm kernel. An
optimal kernel without the variable-storing-size logic is used for the
sub-matrices entirely in the inside of the result matrix C: this avoids the
overhead of the storing logic.
The variable-storing-size kernels take care of the corners, and at a computational
cost lower than the repetition of unitary kernels.
As an example, the optimal 'NT' variant of the gemm kernel for the AVX in-
struction set computes a 8×4 sub-matrix of D, where the matrices are assumed
to be stored in the panel-major format with panel height bs = 4. The interface
is:
void kernel_dgemm_nt_8x4_lib4(int kmax, double *A, int sda, \
double *B, double *C, int sdc, double *D, int sdd, \
int alg, int tc, int td);
The analogous kernel allowing for variable-storing-size has the interface
void kernel_dgemm_nt_8x4_vs_lib4(int mr, int nr, int kmax, \
double *A, int sda, double *B, double *C, int sdc, \
double *D, int sdd, int alg, int tc, int td);
where mr is in the range 5 to 8, and nr is in the range 3 to 4. Other kernels
take care of the remaining cases.
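The idea of a variable-storing-size kernel can be sketched in scalar C as follows. This is a simplified illustration and not one of the HPMPC kernels: the function name and the 4 × 4 size are hypothetical, A, B and D are assumed to be single panels with panel height 4 (element (i, l) stored at offset i + 4·l), and the panels are assumed to be padded so that reading all 4 rows is always safe.

```c
/* Computes the full 4x4 block A * B^T in local accumulators, but stores
 * only its top-left km x kn part into D (1 <= km, kn <= 4). */
void kernel_nt_4x4_vs_sketch(int km, int kn, int kmax,
                             const double *A, const double *B, double *D)
{
    double acc[4][4] = {{0.0}};

    /* Main loop: identical to the one of a fixed-size 4x4 kernel. */
    for (int l = 0; l < kmax; l++)
        for (int j = 0; j < 4; j++)
            for (int i = 0; i < 4; i++)
                acc[i][j] += A[i + 4 * l] * B[j + 4 * l];

    /* Variable-size store: elements outside km x kn are left untouched. */
    for (int j = 0; j < kn; j++)
        for (int i = 0; i < km; i++)
            D[i + 4 * j] = acc[i][j];
}
```

The computation is the same as in the fixed-size kernel, and only the final store depends on km and kn, so the data surrounding the result sub-matrix is not overwritten.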
3.2.5 Low rank updates handling
Low rank updates are matrix-matrix products where m and n are much larger
than k, and therefore the product computes a large matrix with small rank.
The implementation techniques described in this section are highly effective
in case the rank k is not too small. In fact, the gemm kernel is a loop over
k, and therefore if k is very small, the overhead of calling the kernel, zeroing
accumulation registers before the loop, shuffling accumulation registers after
the loop and storing results can easily be much higher than the time spent in the
loop itself. Therefore, the performance of the gemm routine can be poor in case
of low rank updates, that are rather frequent operations (e.g. in the
structure-exploiting factorization of the condensed Hessian matrix presented in
Chapter 9).
A possible solution to this is the implementation of specialized kernels for low
rank updates, where the order of the two innermost loops is switched (i.e. the
innermost loop goes over n and the middle loop goes over k). Furthermore, the
kernel performs a fixed-rank update of the result matrix, e.g. the cases of rank
equal to 1, 2, 3 and 4 can be explicitly coded, and higher rank updates can be
computed looping over the rank 4 kernel plus a final clean-up.
The advantage of this implementation is that the cases of rank 1, 2, 3 and 4 are
explicitly coded, and therefore the gemm routine reduces to only two loops, while
the loop over k is totally unrolled. This greatly improves the performance, at
the expense of requiring a specialized kernel for each explicitly coded rank size.
The low rank implementation can therefore be employed if k is smaller than a
certain threshold, and the standard implementation otherwise. The same
technique can be employed in the implementation of e.g. the syrk routine.
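A fixed-rank update with the loop over k fully unrolled can be sketched as follows for rank 2. The function name is hypothetical and column-major storage is assumed for brevity. The routine reduces to two loops over the rows and the columns of the result.

```c
/* C (m x n, column-major) += A (m x 2) * B (n x 2)^T.
 * The loop over k = 2 is fully unrolled. */
void rank2_update_sketch(int m, int n, const double *A, int lda,
                         const double *B, int ldb, double *C, int ldc)
{
    for (int i = 0; i < m; i++) {
        /* The two A elements of row i are loaded once and reused for all j. */
        double a0 = A[i], a1 = A[i + lda];
        for (int j = 0; j < n; j++)
            C[i + j * ldc] += a0 * B[j] + a1 * B[j + ldb];
    }
}
```

There are no accumulation registers to zero before a loop over k and no reduction after it, which removes most of the fixed overhead that dominates for very small k.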
3.3 Optimizing other level 3 BLAS and LAPACK routines
The computationally most expensive parts of level 3 BLAS and LAPACK rou-
tines can be cast in terms of the gemm kernel [45]. In fact, the gemm kernel can be
used to compute, upgrade or downgrade a rectangular sub-matrix of the result
matrix with the product of two rectangular matrices.
The remainder of this section presents techniques to obtain high-performance
level 3 BLAS and LAPACK routines based on the optimized gemm kernel, with
special focus on small-scale performance.
3.3.1 Triangles, factorizations, substitutions and inversions handling
In the implementation of level 3 BLAS and LAPACK routines, the gemm kernel
cannot take care of triangular factor matrices, triangular result matrices,
factorizations, substitutions (i.e. solution of triangular systems of equations)
and inversions. These specialized operations require specialized routines. Several
approaches can be used in the implementation of these routines and in their use
of the gemm kernel.
In standard BLAS and LAPACK, routines to handle triangular matrices and
substitutions are part of level 3 BLAS, while factorizations and inversions are
part of LAPACK, that is built on top of BLAS.
In optimized level 3 BLAS libraries, when packing is employed it is possible to
implement all level 3 BLAS routines (with the exception of trsm, implementing
substitutions) using the sole gemm kernel and proper packing/padding routines
[45]. The trsm routine is an exception, since the downgrade part of the rou-
tine can be cast in terms of gemm kernel, while the substitution part can not.
In [86], two trsm approaches are compared. In one approach, the gemm kernel
is explicitly used for the downgrade, while another specialized routine (not a
kernel, since there are no loops) takes care of the substitution part. This ap-
proach has the advantage of requiring the design only of the gemm kernel, but it
has the drawback of larger overhead since there are two function calls and the
result sub-matrix needs to be loaded and stored in memory twice. In the other
approach, the gemm kernel and the specialized substitution routines are merged
into a single trsm kernel. This requires the design of a specialized trsm kernel,
but it has lower overhead and therefore it gives better performance for small
matrices.
In the proposed implementation of linear algebra for embedded optimization,
the second approach for the implementation of all level 3 BLAS routines is em-
ployed, since it gives the best performance for small matrices. Therefore, a
specialized kernel (or better, a set of specialized kernels to cover all sizes of the
result matrix, as done for the gemm kernel in Section 3.2.4) is designed. In such
kernels, the main loop is literally copied-and-pasted from the gemm kernel, while
specialized procedures before and after this loop take care of triangular matri-
ces and substitutions. This approach requires the design of several specialized
kernels, but once the gemm kernel is available, it can be easily edited to get all
other level 3 BLAS kernels.
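The structure of such a merged kernel can be sketched in scalar C as follows: the gemm main loop performs the downgrade, and the substitution is then applied directly to the accumulators before storing. The function name, the 4 × 4 size and the storage conventions (4-row panels, element (i, l) at offset i + 4·l) are assumptions made for this illustration, not the actual HPMPC kernel.

```c
/* Computes D = (C - A * B^T) * L^{-T} on a 4x4 block.
 * A, B: 4 x kmax panels; C, D, L: 4x4 blocks; L is lower triangular and
 * inv_diag_L[j] holds 1 / L(j,j). */
void kernel_gemm_trsm_nt_4x4_sketch(int kmax, const double *A, const double *B,
                                    const double *C, double *D,
                                    const double *L, const double *inv_diag_L)
{
    double acc[4][4];

    /* Before the main loop: initialize the accumulators with C. */
    for (int j = 0; j < 4; j++)
        for (int i = 0; i < 4; i++)
            acc[i][j] = C[i + 4 * j];

    /* Main loop, as in the gemm kernel (here performing a downgrade). */
    for (int l = 0; l < kmax; l++)
        for (int j = 0; j < 4; j++)
            for (int i = 0; i < 4; i++)
                acc[i][j] -= A[i + 4 * l] * B[j + 4 * l];

    /* After the main loop: solve X * L^T = acc, one column at a time. */
    for (int j = 0; j < 4; j++) {
        for (int p = 0; p < j; p++)
            for (int i = 0; i < 4; i++)
                acc[i][j] -= acc[i][p] * L[j + 4 * p];
        for (int i = 0; i < 4; i++)
            acc[i][j] *= inv_diag_L[j];   /* multiply instead of divide */
    }

    for (int j = 0; j < 4; j++)
        for (int i = 0; i < 4; i++)
            D[i + 4 * j] = acc[i][j];
}
```

Compared with calling a gemm kernel followed by a separate substitution routine, the result block is loaded and stored only once and a single function call is needed.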
LAPACK routines make use of BLAS routines, but in general not of BLAS
kernels, since their interfaces are not standardized and therefore not exposed
(the BLIS project is an exception, exposing also its lower level interface).
LAPACK contains both unblocked and blocked versions of all routines. Unblocked
versions make use of level 2 BLAS and elementary operations such as square
roots and divisions. They compute the result matrix one row or column at a
time, and are usually employed for small matrices and as routines in blocked
versions. Blocked versions make use of level 3 BLAS and unblocked LAPACK
routines for factorizations and substitutions (that are the matrix equivalent of
square roots and divisions). They compute the result matrix one sub-matrix at
a time, and they rely on the underlying optimized BLAS routines to provide
high performance for large matrices. In the context of embedded optimization,
the main drawback of this approach is that it suffers from a considerable
overhead (due to the many levels of routines), and the small-scale performance is
particularly poor.
Some optimized BLAS libraries (e.g. OpenBLAS) contain an optimized version
of some of the key LAPACK routines (such as Cholesky and LU factorization,
triangular matrix inversion, multiplication of two triangular matrices).
These routines are written making use of the optimized level 3 BLAS kernels (and
not routines), and therefore exhibit a much better performance for small
matrices. In particular, this allows the choice of a much smaller threshold to
switch to the blocked version of the algorithms, therefore casting more
computations in terms of the optimized BLAS kernels.
In the proposed implementation of linear algebra routines for embedded opti-
mization, there is no distinction between level 3 BLAS and LAPACK routines.
Namely, special kernels are written for the LAPACK routines as well, and they
are implemented using the same approach used for all level 3 BLAS routines.
Therefore, there are no unblocked LAPACK routines, and the optimized kernels
are used for all matrix sizes. Said in another way, the block size of the blocked
version of LAPACK routines is chosen equal to the gemm kernel size, and the
unblocked version of LAPACK routines is merged with level 3 BLAS kernels to
build specialized kernels. In case of small matrices, numerical tests show that
this approach gives the best performance.
3.3.2 Merging of linear algebra routines
In the case of level 3 BLAS and LAPACK routines, the best performance for
small matrices is given by the use of tailored kernels where both the main loop of
the gemm kernel (accounting for the upgrade and downgrade) and the specialized
procedures (handling triangular sub-matrices, and factorizations, substitutions
and inversion of ﬁxed-size matrices) are merged into a single kernel. The ap-
proach can be generalized to more complex operations, that are not part of
BLAS or LAPACK but that can be computed using two or more BLAS and
LAPACK routines. In some cases, specialized kernels can be written for these
complex operations, that reduce the number of function calls and avoid the
overhead of repeatedly loading and storing the same sub-matrix. In other cases,
it may be possible to merge routines that internally make use of the same
kernel (e.g. potrf and trsm), such that the merged routine operates on larger
matrices, reducing the overhead due to the handling of edges and corners.
Some examples of complex operations that are commonly found in embedded
optimization and that can be easily merged are:
• (Symmetric) matrix upgrade followed by Cholesky factorization. This is a
very common operation, that computes $D \leftarrow (C + A \cdot A^T)^{1/2}$.
The upgrade of the C matrix (i.e. the computation of $A \cdot A^T$) and the
downgrade in the Cholesky factorization are both computed using the main loop
in the gemm kernel, and therefore they can be naturally merged. Two kernels are
required, one for the computation of the diagonal blocks (syrk_potrf kernel,
giving symmetric upgrade of C, symmetric downgrade and Cholesky
factorization) and one for the off-diagonal blocks (gemm_trsm kernel, giving
upgrade of C, downgrade and substitution). This routine can be used
to efficiently implement the backward Riccati recursion in Chapter 8.
A variant of this operation is the case where the matrix A is triangular.
The required kernels are lauum_potrf and trmm_trsm for the diagonal
and off-diagonal blocks respectively. This routine can be employed in the
implementation of the forward Schur-complement recursion in Chapter 8.
• Cholesky factorization followed by matrix substitution. This is another
very common operation, that is e.g. used in the implementation of the
Schur complement as
$A \cdot B^{-1} \cdot A^T = A \cdot (L \cdot L^T)^{-1} A^T = (A L^{-T}) \cdot (A L^{-T})^T$,
where B is a positive definite matrix and L is its lower triangular Cholesky
factor. The factorization of B and the computation of $A L^{-T}$ can be
merged in a single routine, that can be implemented as a Cholesky
factorization routine that operates on rectangular matrices (i.e. in the case of
the lower triangular Cholesky factorization, on matrices with more rows
than columns). This routine can be employed in the implementation of
the forward Schur-complement recursion in Chapter 8.
• The explicit expression for the inverse of a lower triangular matrix can
be obtained by considering the inversion of a 2 × 2 block partition of the
matrix, as
\[
\begin{bmatrix} A & \\ B & C \end{bmatrix}^{-1}
=
\begin{bmatrix} A^{-1} & \\ -C^{-1} B A^{-1} & C^{-1} \end{bmatrix} .
\]
This algorithm can be implemented in several fashions, either recursively
or iteratively. The triangular matrix inversion is implemented in the
LAPACK routine trtri, that makes use of an iterative algorithm. At a
generic iteration, the sub-matrix $A^{-1}$ contains the part of the matrix that
has already been inverted, while $C^{-1}$ has not been computed yet.
Therefore, the new block $-C^{-1} B A^{-1}$ is computed using trmm for the
operation $B \cdot A^{-1}$ (since $A^{-1}$ is available explicitly), and trsm
for the operation $C^{-1} (B A^{-1})$ (since C is available but $C^{-1}$ is not).
The proposed implementation computes the transposed inverse instead of
the inverse: the transposed inverse of a lower triangular matrix is therefore
an upper triangular matrix. This choice is justified by the fact that most
of the computation in this operation can be cast in terms of the optimal
'NT' version of the gemm kernel. The explicit expression for the transposed
inverse of a lower triangular matrix can be obtained by considering the
inversion of a 2 × 2 block partition of the matrix, as
\[
\begin{bmatrix} A & \\ B & C \end{bmatrix}^{-T}
=
\begin{bmatrix} A^{-T} & -A^{-T} B^T C^{-T} \\ & C^{-T} \end{bmatrix} .
\]
The routine is implemented in a fashion similar to the other BLAS and
LAPACK routines, namely the result matrix is computed in sub-matrices
of the size of the optimal gemm kernel, one panel at a time. In the
computation of each new panel, a specialized routine computes the fixed-size
sub-matrix of $A^{-T}$. In case of vector instructions available in the
architecture, an effective way to compute $A^{-T}$ is as $I A^{-T}$, i.e. by
using a procedure for trsm on the identity matrix. Afterwards, the element on
the top-left corner is computed using a specialized kernel trtri, that can
be seen as a merge of a kernel for trmm (for the computation of $-A^{-T} B^T$,
that internally makes use of the 'NT' version of the gemm kernel, since the
matrix $A^{-T}$ is directly available, and the matrix B is transposed by the
kernel) and a kernel for trsm (that computes $(A^{-T} B^T) C^{-T}$, where the
sub-matrix C has fixed size).
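As a reference for what the transposed triangular inverse computes, the following unblocked sketch forms $C = A^{-T}$ for a lower triangular matrix by forward substitution. It is only a plain reference under stated assumptions (hypothetical function name, column-major storage, no use of the gemm kernel), not the blocked algorithm described above.

```c
/* C = A^{-T} for an n x n lower triangular matrix A (column-major).
 * C is upper triangular; its strictly lower part is set to zero.
 * With X = A^{-1}, the element X(i,j) is stored at C(j,i). */
void trtri_t_ref_sketch(int n, const double *A, int lda, double *C, int ldc)
{
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < j; i++)
            C[j + i * ldc] = 0.0;                    /* below the diagonal */
        C[j + j * ldc] = 1.0 / A[j + j * lda];       /* X(j,j) */
        for (int i = j + 1; i < n; i++) {
            double s = 0.0;
            for (int p = j; p < i; p++)
                s += A[i + p * lda] * C[j + p * ldc];   /* A(i,p) * X(p,j) */
            C[j + i * ldc] = -s / A[i + i * lda];       /* X(i,j) */
        }
    }
}
```

The result can be checked against the block expression above by taking scalar blocks: for A = [2 0; 1 4], the sketch returns C = [0.5 −0.125; 0 0.25].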
In practice, these specialized routines are found to boost performance for small
matrices, while somewhat lowering performance for larger ones. This is due to
the fact that these complex operations have more operands, and therefore the
overall size of the data matrices exceeds the cache size for smaller sizes of the
operand matrices.
3.3.3 Notable routines
This section reports the interface of notable routines as implemented in HPMPC,
together with the rationale behind the choice of the interfaces. All matrices are
assumed to be in panel-major format.
3.3.3.1 syrk
The syrk routine has the interface
void dsyrk_nt_lib(int m, int n, int k, double *pA, int sda, \
double *pB, int sdb, int alg, double *pC, int sdc, \
double *pD, int sdd);
and it computes the lower triangular part of the matrix
$D \leftarrow \alpha \cdot A \cdot B^T + \beta \cdot C$. The
interface is similar to the one of the gemm routine, with the exception of the lack
of flags for transposition (that have not been implemented yet). The possibility
of having two distinct matrices for the left and the right factor can speed
up the computation of the lower triangular part of the matrix
$(X \cdot Y) \cdot X^T$, in case it is not favorable to compute the Cholesky
factorization of Y to ensure symmetry.
3.3.3.2 potrf
The potrf routine has the interface
void dpotrf_lib(int m, int n, double *pC, int sdc, \
double *pD, int sdd, double *inv_diag_D);
where the top n × n sub-matrix $C_0$ of the matrix C is supposed to be symmetric
positive semi-definite. If m = n, the routine computes the lower triangular
Cholesky factor D of the n × n matrix C. If m > n, the routine computes the
lower triangular Cholesky factor $D_0$ of the upper n × n matrix $C_0$, and uses
it to compute the substitution $D_1$ of the lower (m − n) × n matrix $C_1$, as
\[
\begin{bmatrix} D_0 \\ D_1 \end{bmatrix}
= \texttt{dpotrf\_lib}\left( \begin{bmatrix} C_0 \\ C_1 \end{bmatrix} \right)
= \begin{bmatrix} C_0^{1/2} \\ C_1 D_0^{-T} \end{bmatrix}
\]
where the exponent 1/2 denotes the lower triangular Cholesky factorization. The
n × 1 vector inv_diag_D returns the inverse of the diagonal of the lower
triangular Cholesky factor D, and it can be used to speed up the computation of
subsequent substitutions, avoiding the need to compute divisions.
If the matrix is singular, when one of the diagonal elements is found to be zero
during factorization, that element and all elements below it in the same column
are set to zero, as well as the corresponding element in the inverse diagonal
vector. Therefore singularity can be detected by checking for zero elements
in inv_diag_D. In this way, the routine can be used for the factorization of a
(square) symmetric positive semi-definite matrix, that is an operation useful
to speed up the implementation of the backward Riccati recursion in case of
a singular recursion matrix.
3.3.3.3 syrk_potrf
The syrk_potrf routine is a merger of the syrk and potrf routines. Namely
the routine has the interface
void dsyrk_dpotrf_lib(int m, int n, int k, double *pA, int sda, \
double *pB, int sdb, int alg, double *pC, int sdc, \
double *pD, int sdd, double *inv_diag_D);
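It computes
\[
\begin{bmatrix} D_0 \\ D_1 \end{bmatrix}
= \texttt{dpotrf\_lib}\left(
\alpha \cdot \begin{bmatrix} A_0 \\ A_1 \end{bmatrix} \cdot B^T
+ \beta \cdot \begin{bmatrix} C_0 \\ C_1 \end{bmatrix} \right)
= \begin{bmatrix}
(\alpha \cdot A_0 \cdot B^T + \beta \cdot C_0)^{1/2} \\
(\alpha \cdot A_1 \cdot B^T + \beta \cdot C_1) D_0^{-T}
\end{bmatrix}
\]
where the matrix $A_0$ has size n × k, the matrix $A_1$ has size (m − n) × k, the
matrix B has size n × k (so that $B^T$ has size k × n), the matrices $C_0$ and
$D_0$ have size n × n and the matrices $C_1$ and $D_1$ have size (m − n) × n.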
3.3.3.4 lauum_potrf
The lauum_potrf routine is a merger of the LAPACK lauum and potrf routines.
Namely the routine has the interface
void dlauum_dpotrf_lib(int m, int n, int k, double *pA, int sda, \
double *pB, int sdb, int alg, double *pC, int sdc, \
double *pD, int sdd, double *inv_diag_D);
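It computes
\[
\begin{bmatrix} D_0 \\ D_1 \end{bmatrix}
= \texttt{dpotrf\_lib}\left(
\alpha \cdot A \cdot B^T
+ \beta \cdot \begin{bmatrix} C_0 \\ C_1 \end{bmatrix} \right)
= \begin{bmatrix}
(\alpha \cdot A \cdot B^T + \beta \cdot C_0)^{1/2} \\
(\beta \cdot C_1) D_0^{-T}
\end{bmatrix}
\]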
where the matrix A has size k× k, the matrix B has size k× k, the matrices C0
and D0 have size n× n and the matrices C1 and D1 have size (m− n)× n.
3.3.3.5 trtri
The trtri routine has the interface
void dtrtri_lib(int m, double *pA, int sda, int use_inv_diag_A, \
double *inv_diag_A, double *pC, int sdc);
It computes the transposed inverse $C = A^{-T}$ of the matrix A. The choice to
compute the transposed inverse instead of the inverse is justiﬁed by the fact
that the former internally makes use of the 'NT' version of the gemm kernel
(that is the optimal version), while the latter makes use of the 'NN' version.
Furthermore, if the trtri routine is used in the inversion of a positive deﬁnite
matrix, the choice of computing the transposed inverse implies that the result
matrix is already in the correct format to be used as an input for a lauum routine
implemented using the 'NT' version of gemm.
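To make the definition of the result concrete, the following is a minimal sketch of the transposed inverse of a lower triangular matrix. The routine name ref_trtri_transposed and the column-major storage are assumptions made for this example; the sketch does not use the gemm kernel and does not handle the inv_diag_A argument of the library routine.

```c
#include <assert.h>

/* Hypothetical reference routine: C = A^{-T} for a lower triangular,
 * non-singular n x n matrix A. Both matrices are in column-major order;
 * only the upper triangle of C (including the diagonal) is written. */
void ref_trtri_transposed(int n, const double *A, int lda, double *C, int ldc)
{
    assert(n > 0);
    for (int j = 0; j < n; j++) {
        /* X = A^{-1} is lower triangular; its element (i,j) is stored at
         * position (j,i) of C, so that C holds the transpose of X. */
        C[j + j * ldc] = 1.0 / A[j + j * lda];
        for (int i = j + 1; i < n; i++) {
            double s = 0.0;
            for (int l = j; l < i; l++)
                s += A[i + l * lda] * C[j + l * ldc]; /* A(i,l) * X(l,j) */
            C[j + i * ldc] = -s / A[i + i * lda];
        }
    }
}
```

Since the result is upper triangular, the inverse of a positive definite matrix can then be obtained as the product of C and its transpose, which is the lauum-type operation mentioned above.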
3.4 Comparison of implementation techniques for
dsyrk + dpotrf
This section contains an exhaustive comparison of implementation techniques
for the dsyrk + dpotrf routine. Namely, the operation of interest is
\[
(A + B \cdot B^T)^{1/2}
\]
where A and B are square matrices of size n, and A is symmetric positive
definite. The exponent 1/2 indicates the lower triangular Cholesky factorization.
The test processor is an Intel Core i7 3520M, that during all tests runs at the
maximum turbo frequency of 3.6 GHz. The processor implements the Ivy-Bridge
micro-architecture and supports the SSE4.2 (that operates on 128-bit vectors,
each containing 2 doubles) and AVX (that operates on 256-bit vectors, each
containing 4 doubles) instruction sets. In particular, the AVX ISA is employed,
since it gives twice the full FP throughput compared to the SSE4.2 ISA. The
Ivy-Bridge core can perform a vector multiply and a vector add every clock
cycle: this gives a full FP throughput of 2 · 4 · 3.6 = 28.8 Gﬂops in double
precision.
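The throughput figure above can be reproduced with a short helper. This is only a sketch of the arithmetic; the function name peak_gflops and its parameters are hypothetical.

```c
#include <assert.h>

/* Peak double-precision throughput in Gflops: flops per cycle times the
 * clock frequency in GHz. The function name is hypothetical. */
double peak_gflops(int doubles_per_vector, int vector_ops_per_cycle, double ghz)
{
    assert(doubles_per_vector > 0 && vector_ops_per_cycle > 0 && ghz > 0.0);
    return (double)(doubles_per_vector * vector_ops_per_cycle) * ghz;
}
```

For the test processor, peak_gflops(4, 2, 3.6) gives the AVX value of 28.8 Gflops, peak_gflops(2, 2, 3.6) gives 14.4 Gflops for SSE4.2, and peak_gflops(1, 2, 3.6) gives 7.2 Gflops for scalar instructions.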
The operating system is an Ubuntu 14.04 distribution, and the Linux kernel
version is 3.19. The code is compiled with gcc 4.8.4, and the use of the AVX
instruction set is enabled by means of the -mavx ﬂag. The ﬂags -O3 and
-funroll-loops are used for the reference and code-generated reference im-
plementations, where the latter ﬂag is found to slightly improve performance.
The ﬂag -O2 is used for OpenBLAS and for the proposed linear algebra for em-
bedded optimization (HPMPC), since these implementations are already hand
optimized, and perform loop unrolling explicitly.
Figure 3.2: Test dsyrk + dpotrf, Gflops versus matrix size n. (a) Large-scale test. (b) Small-scale test.
The comparison is made as comments to a series of performance plots. For each
ﬁgure, there are two sub-ﬁgures: one for matrix size n up to 300 in steps of 4
(left sub-ﬁgure), and one for matrix size up to 24 in steps of 1 (right sub-ﬁgure).
The left sub-ﬁgure is scaled such that the top of the pictures corresponds to the
full FP throughput (28.8 Gﬂops). The right sub-ﬁgure can be seen as a zoom
into the black square in the left sub-ﬁgure, and it is scaled such that the top of
the picture corresponds to 50% of the full FP throughput (14.4 Gﬂops).
Figure 3.2 shows the performance plot of the dsyrk routine from the reference
Netlib BLAS and the dpotrf routine from LAPACK using the Netlib BLAS
(). Both routines are coded in Fortran. This approach gives a maximum
performance of about 5 Gﬂops, that is about 18% of full FP throughput. The
performance graph is steadily low for all matrix sizes: therefore there are likely
no memory bandwidth limitations, but the processor sits idle most of the time
due to instruction dependencies, or it performs unnecessary operations. The gcc
compiler can be used to produce the assembly version of the compiled code. In
this case, the inspection of the assembly code shows that the compiler is able to
auto-vectorize the code (that therefore makes use of the AVX ISA), but that the
structure of the code is not deeply changed (e.g. there is no sign of blocking for
registers or cache). As a comparison, if the reference code is compiled without
the -mavx ﬂag or with the -O2 ﬂag, scalar instructions are emitted, and the
maximum attained performance is slightly lower than 3 Gﬂops (about 10%).
Therefore, even if vector instructions give a full FP throughput 4 times as high
as scalar instructions, the performance given by employing vectorization alone
is lower than twice as high, and still many times lower than full FP throughput.
Figure 3.3: Test dsyrk + dpotrf, Gflops versus matrix size n. (a) Large-scale test. (b) Small-scale test.
Figure 3.3 adds the performance plot of the C translation of the reference
routines presented in Figure 3.2 (o). Furthermore, the code generation ap-
proach used e.g. in the implementation of FORCES is employed to improve
performance: the size of all loops is fixed at compile time. Therefore the
compiler can perform additional optimizations such as removing branches and
unrolling loops when profitable. The performance graph is steady for all matrix
sizes, and shows a noticeable improvement only for very small matrices. As the
matrix size increases, the performance plots of the reference and code-generated
reference implementations become almost indistinguishable. Also this time the assembly code
inspection shows that the compiler is able to auto-vectorize the code, but not
to deeply change the structure of the code. As a consequence, the performance
improvements for small matrices come from the reduction of overhead due to
branches and function calls, rather than from structural improvements in the
code.
Figure 3.4 adds the performance plot of the dsyrk and dpotrf routines as provided
by OpenBLAS version 0.2.15 (∗). OpenBLAS is a highly optimized BLAS
implementation, that also provides optimized versions of key LAPACK routines
such as dpotrf. For large matrices, the performance of this implementation
approaches the full FP throughput, arriving at 20 Gflops (or 70%) for the largest
tested matrix size. Due to blocking for cache, the performance remains close
to full FP throughput for even larger matrix sizes. For small matrix sizes, the
performance is worse than both reference BLAS (break-even point around n = 5)
and code-generated reference BLAS (break-even point around n = 12).
The asymptotic performance of OpenBLAS is so much better than the previous
implementations due to the fact that its code is tailored for the specific hardware.
In particular, the tested version is optimized for the Sandy-Bridge architecture.
Figure 3.4: Test dsyrk + dpotrf, Gflops versus matrix size n. (a) Large-scale test. (b) Small-scale test.
Figure 3.5 removes the performance plot of the reference dsyrk and dpotrf
routines, and adds the performance plot of the first iteration of the proposed
linear algebra for embedded optimization (x). This first iteration uses only C
code (no assembly nor intrinsics), but it exploits knowledge about the number
of available registers to perform blocking for registers. The matrices are stored in
column-major (or Fortran) order. The performance is similar to the reference
and the code-generated reference versions, arriving at a maximum of 5.8 Gflops
(20% of full FP throughput). The assembly code inspection shows that in
this case the compiler is not able to auto-vectorize the code, and therefore the
attained performance is actually 80% of the scalar full throughput. This shows
that the use of blocking for registers alone gives as much speed-up as the use of
4-wide vector instructions without blocking for registers. For large matrices,
or for matrix sizes such as 96, 128, 160, 192, 256, the performance decreases
due to memory bandwidth constraints or finite associativity of cache. The left
sub-figure shows a stair-wise performance, with steps equal to 4: this is due to
the fact that the optimal gemm kernel size is 4 × 2.
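Blocking for registers in plain C can be sketched as follows. The kernel name, its interface and the column-major layout are assumptions made for this illustration; it is not the kernel used in the tests, only an example of a 4 × 2 'NT' kernel whose result block is held in local scalar variables.

```c
#include <assert.h>

/* Hypothetical 4x2 'NT' gemm micro-kernel in plain C: D = C + A * B^T,
 * with A of size 4 x k, B of size 2 x k, C and D of size 4 x 2, all in
 * column-major order. The 8 accumulators are local scalars, so that the
 * compiler can keep them in registers for the whole loop over k. */
void kernel_gemm_nt_4x2(int k, const double *A, int lda,
                        const double *B, int ldb,
                        const double *C, int ldc, double *D, int ldd)
{
    assert(k >= 0);
    double c00 = 0, c10 = 0, c20 = 0, c30 = 0;
    double c01 = 0, c11 = 0, c21 = 0, c31 = 0;
    for (int l = 0; l < k; l++) {
        /* 6 loads feed 8 multiplications and 8 additions */
        double a0 = A[0 + l * lda], a1 = A[1 + l * lda];
        double a2 = A[2 + l * lda], a3 = A[3 + l * lda];
        double b0 = B[0 + l * ldb], b1 = B[1 + l * ldb];
        c00 += a0 * b0; c10 += a1 * b0; c20 += a2 * b0; c30 += a3 * b0;
        c01 += a0 * b1; c11 += a1 * b1; c21 += a2 * b1; c31 += a3 * b1;
    }
    /* the result block is written to memory only once, after the loop */
    D[0 + 0 * ldd] = C[0 + 0 * ldc] + c00;
    D[1 + 0 * ldd] = C[1 + 0 * ldc] + c10;
    D[2 + 0 * ldd] = C[2 + 0 * ldc] + c20;
    D[3 + 0 * ldd] = C[3 + 0 * ldc] + c30;
    D[0 + 1 * ldd] = C[0 + 1 * ldc] + c01;
    D[1 + 1 * ldd] = C[1 + 1 * ldc] + c11;
    D[2 + 1 * ldd] = C[2 + 1 * ldc] + c21;
    D[3 + 1 * ldd] = C[3 + 1 * ldc] + c31;
}
```

The 8 accumulators and 6 operands of one iteration add up to 14 values, which fits in the 16 FP registers of x86_64; this is presumably why a block of this size works well for scalar code, although the sketch makes no claim about the register allocation actually chosen by the compiler.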
Figure 3.5: Test dsyrk + dpotrf, Gflops versus matrix size n. (a) Large-scale test. (b) Small-scale test.
Figure 3.6 shows the performance plot of the second iteration of the proposed
linear algebra for embedded optimization (x). This second iteration explicitly
targets SIMD instructions by means of intrinsics, while the matrices are still
stored in column-wise format. The optimal gemm kernel size is now 8 × 4. Since
the AVX instruction set can operate on 4 doubles at a time, the performance
shows a big improvement, arriving at a maximum of 20 Gflops (or 68% of full
FP throughput) for a matrix size of n = 168. However, the performance is
rather erratic, exacerbating the behavior already seen in the first iteration. For
matrix size larger than about n = 150, OpenBLAS gives the best performance.
Figure 3.7 shows the performance plot of the third iteration of the proposed
linear algebra for embedded optimization (x). This third iteration makes use
of the panel-major matrix format. As a result, the performance is much more
regular, since this matrix format improves cache usage, and therefore also for
relatively large matrices data can be streamed from cache fast enough to keep
execution units busy. In the large-scale test (Figure 3.7a), the performance
keeps increasing with the matrix size, and the maximum performance is of 26
Gﬂops (90% of full FP throughput). For even larger matrices, the performance
would decrease since blocking for cache is not employed; however, the performance
is close to full FP throughput for the matrix sizes of interest in most embedded
optimization applications. In the small-scale test (Figure 3.7b), the performance
is almost identical to the second iteration case in Figure 3.6b: in fact, the panel-
major matrix format gives no improvements if the matrices are small enough to
entirely ﬁt in L1 cache even in column-major format.
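The addressing of the panel-major format can be sketched as follows. The layout is assumed here to be the one described in Section 3.2.2 with panel height bs = 4: the matrix is divided into panels of 4 consecutive rows, each panel is stored column by column, and consecutive panels are 4 · sda elements apart. The function names and the exact offset formula are assumptions made for this example.

```c
#include <assert.h>

#define BS 4 /* panel height, assumed equal to 4 */

/* Offset of element (i,j) in a panel-major matrix whose panels are stored
 * BS * sda doubles apart (sda is the number of columns of a panel). */
int panel_major_index(int i, int j, int sda)
{
    return (i / BS) * BS * sda + j * BS + i % BS;
}

/* Hypothetical conversion routine: copies the m x n column-major matrix A
 * into the panel-major buffer pA, that must hold at least
 * ((m + BS - 1) / BS) * BS * sda doubles. */
void pack_panel_major(int m, int n, const double *A, int lda,
                      double *pA, int sda)
{
    assert(sda >= n);
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            pA[panel_major_index(i, j, sda)] = A[i + j * lda];
}
```

Under this assumption, the 4 elements of a panel column are contiguous and consecutive columns of the same panel follow each other in memory, so a kernel sweeping a panel reads memory with unit stride regardless of the matrix size.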
Figure 3.6: Test dsyrk + dpotrf, Gflops versus matrix size n. (a) Large-scale test. (b) Small-scale test.
Figure 3.7: Test dsyrk + dpotrf, Gflops versus matrix size n. (a) Large-scale test. (b) Small-scale test.
In the implementation of the dsyrk routine, the result matrix is built panel-wise,
one block at a time: therefore in the innermost loop around the gemm kernel,
two panels from the A matrix are reused, while the B matrix is streamed one
panel at a time. On the contrary, in the implementation of the dpotrf routine,
the result matrix is built across panels, one block at a time. This was the
default choice in early implementations of the dpotrf routine in HPMPC, due
to the possibility of storing in a favorable format and reusing the triangular
matrix needed in the triangular substitutions. However, this means that in the
innermost loop around the gemm kernel, only one panel from the A matrix is
reused, while the B matrix needs to be streamed two panels at a time, requiring
twice the memory bandwidth if the B matrix is not already in L1 cache.
Figure 3.8: Test dsyrk + dpotrf, Gflops versus matrix size n. (a) Large-scale test. (b) Small-scale test.
Figure 3.8 shows the performance plot of the fourth iteration of the proposed
linear algebra for embedded optimization (3). This fourth iteration implements
merging of linear algebra routines. Namely, the dsyrk and the dpotrf routines
are merged into the single dsyrk_dpotrf routine. In Figure 3.8 it is chosen
to keep the loop order of the dpotrf routine: therefore the result matrix is
computed across panels. The performance for small matrices gets a good im-
provement compared to the use of un-merged routines, as shown in Figure 3.8b.
However, the performance for large scale matrices gets slightly worse (3.8a).
This is due to the fact that the dsyrk part also employs the sub-optimal outer
loop order.
Figure 3.9 shows the performance plot of the fifth and last iteration of the
proposed linear algebra for embedded optimization (∆). In this case, the
favorable matrix format for the triangular matrix used in the triangular substitution
is dropped. This slightly lowers performance for small matrices (Figure 3.9b).
However, this gives the possibility to implement dsyrk_dpotrf (and therefore
also dpotrf alone) using the optimal outer loop order around the kernel.
Therefore, the result matrix is built panel-wise, and only one panel from the B matrix
of the gemm kernel needs to be streamed, while two panels from the A matrix
of the gemm kernel are reused between iterations and are therefore likely already
present in L1 cache. This gives a performance for larger matrices (Figure 3.9a)
only slightly lower than the un-merged routines.
Figure 3.9: Test dsyrk + dpotrf, Gflops versus matrix size n. (a) Large-scale test. (b) Small-scale test.
The resulting routine therefore shows a good performance in both the small scale
and the large scale tests. It outperforms reference BLAS and OpenBLAS for all
matrix sizes of interest (for very large matrices, OpenBLAS performs better due
to the use of blocking for cache). The code-generated version of the reference
BLAS can outperform the proposed routine only for extremely tiny matrices,
of size smaller than n = 5. This is due to the fact that the proposed routine is
implemented in library format, and therefore a number of branches have to be
taken in order to select the correct loops sizes. If extremely fast routines are
needed in case of these extremely tiny matrices, the best option is to hand code
these special cases using the proposed techniques, and hand remove all loops
and branches.
3.5 Performance of level 3 BLAS and LAPACK
routines
This section contains performance plots for the level 3 BLAS and LAPACK
routines that constitute the backbone of the optimization algorithms presented
in Part II and Part III of this thesis. For each routine, four versions are com-
pared: the version implemented using the techniques presented in this thesis
(HPMPC), the reference BLAS version (Netlib BLAS and LAPACK 3.5.0), an
optimized vendor BLAS (Intel's MKL 11.3) and an optimized open-source BLAS
(OpenBLAS 0.2.15). Both MKL and OpenBLAS provide an optimized version
of the tested LAPACK routines.
All routines are tested on two processors, implementing Intel Ivy-Bridge and
Intel Haswell micro-architectures. The former supports the AVX ISA, while the
latter supports the AVX2 and FMA ISAs, that are the best ISAs supported even
in more recent micro-architectures such as Intel Broadwell and Intel Skylake.
The two test machines have the same memory conﬁguration, namely 8 GB of
DDR3/DDR3L memory in dual-channel conﬁguration (for a total data width
of 128 bits), running at 1600 MHz, that gives a maximum bandwidth of 25.6
GB/s. Therefore, the diﬀerence in performance is solely due to the processors.
These tests also show the current support state of the latest and previous latest
x86_64 ISAs regarding FP.
The tests are performed in double precision and, unless stated otherwise, for
square matrices of size n between 4 and 300, in steps of 4 (and therefore they
are meant to evaluate the performance for matrix sizes of interest for embedded
optimization applications). In all tests, only one thread is employed: therefore,
the single-thread version of optimized BLAS libraries is considered.
3.5.1 Performance on Intel Ivy-Bridge micro-architecture
The test processor is the same as in Section 3.4, namely an Intel Core i7 3520M
running at the maximum turbo frequency of 3.6 GHz. The processor implements
the Intel Ivy-Bridge micro-architecture, and it supports the AVX ISA. It can
perform a 256-bit-wide FP multiplication and a 256-bit-wide FP addition every
clock cycle, and therefore in double precision the full FP throughput for single-
threaded code is of 8 ﬂops per cycle (28.8 Gﬂops at 3.6 GHz). The AVX ISA has
been employed in several micro-architectures, and is generally well supported in
optimized BLAS version.
The performance plots are in Figure 3.10. The overall result of the tests is that
the routines in HPMPC give (sometimes much) better performance for small
matrices than both optimized BLAS versions. In the case of the dgemm routines,
the diﬀerence in performance is relatively small. However, the performance gap
increases for specialized BLAS routines (such as dtrmm and dsyrk), and even
more for LAPACK routines (such as dpotrf and dtrtri). Generally speak-
ing, OpenBLAS seems to have better performance than MKL on level 3 BLAS
routines, but worse on LAPACK routines. The reference BLAS and LAPACK
versions give poor performance on all tests.
Figure 3.10: Performance test for key level 3 BLAS and LAPACK routines on
an Intel Core i7 3520M processor (Ivy Bridge micro-architecture,
supporting the AVX ISA). Panels: (a) dgemm_nt, (b) dgemm_nn, (c) dtrmm,
(d) dsyrk, (e) dpotrf, (f) dtrtri. Each panel plots Gflops versus matrix
size n for Netlib, HPMPC, MKL and OpenBLAS.
3.5.2 Performance on Intel Haswell micro-architecture
The test processor is an Intel Core i7 4800MQ. This processor has a base fre-
quency of 2.7 GHz and a maximum turbo frequency of 3.7 GHz. However, if the
256-bit wide SIMD units are employed, the turbo frequency is lowered to 3.3
GHz, that is the frequency observed in these tests. The processor implements
the Intel Haswell micro-architecture, and it supports the AVX2 and FMA ISA.
It can perform two 256-bit-wide FP fused multiply-add operations every clock
cycle, and therefore in double precision the full FP throughput for single-threaded
code is of 16 flops per cycle (52.8 Gflops at 3.3 GHz). The AVX2 and FMA
ISAs have been employed only in the most recent micro-architectures, and often
they are not yet fully employed in optimized BLAS versions.
For the test on the Intel Haswell micro-architecture, a custom gcc compiler has
been employed. See Appendix A for more details.
The performance plots are in Figure 3.11. Also in this case, the routines in
HPMPC give better performance for small matrices. In general, MKL routines
perform better than OpenBLAS ones on all tests, showing that these recent
ISAs are not yet completely exploited in the open-source BLAS version. In
particular, the dtrmm routine has been optimized for the AVX2 and FMA ISAs
only in the latest OpenBLAS version (0.2.15), and the performance of the
LAPACK routines is still very poor. On the contrary, the vendor MKL version gives
reasonably good performance on all tests.
Generally speaking, the routines in HPMPC give excellent performance. It is
interesting to notice that for this micro-architecture there is a noticeable
difference in performance between the optimal 'NT' version and the sub-optimal 'NN'
version of the dgemm kernel. This is due to the fact that the Haswell
micro-architecture doubles the full FP throughput with respect to the Ivy-Bridge
micro-architecture, and therefore more care is needed in the data streaming from
memory in order to keep the execution units busy. The performance advantages of
HPMPC routines over optimized BLAS routines are particularly big for LAPACK
routines. This is especially true for the dtrtri routine, where the huge difference in
performance for small matrices is partially due to the fact that the custom routine
in HPMPC can avoid re-computing the reciprocal of the diagonal elements, if
these have already been computed e.g. by a previous call to the dpotrf
routine. The reference BLAS and LAPACK versions give very poor performance,
that shows very little improvement with respect to the tests on the Ivy-Bridge
micro-architecture: this shows that a more recent processor does not necessarily
give better performance, if the code does not exploit the new features.
Figure 3.11: Performance test for key level 3 BLAS and LAPACK routines on
an Intel Core i7 4800MQ processor (Haswell micro-architecture,
supporting the AVX2 and FMA ISAs). Panels: (a) dgemm_nt, (b) dgemm_nn,
(c) dtrmm, (d) dsyrk, (e) dpotrf, (f) dtrtri. Each panel plots Gflops versus
matrix size n for Netlib, HPMPC, MKL and OpenBLAS.
3.5.3 Performance in case of low rank updates
In this section, the performance of some linear algebra routines is tested in the
case of low-rank updates. Namely, the 'NT' version of the dgemm routine and
the dsyrk routine are tested for matrices with fixed m = n = 100 and variable
and small k ∈ [1, 24]. Routines tailored to the low-rank case are compared with
the standard routines and with several BLAS implementations. The cases of rank
equal to 1, 2, 3 and 4 are explicitly coded in the routines tailored to low-rank
updates; higher ranks are obtained by looping over the rank 4 kernel. Results are
in Figure 3.12.
As a general ﬁgure, the performance improves rather regularly as the rank of
the update increases. For very small ranks, the performance is particularly low.
For ranks equal to 1, 2, 3 and 4 (i.e. the explicitly coded cases) the routines
tailored to low rank updates show a great performance improvement. However,
the performance does not improve for larger ranks, showing a clear period-
4 pattern (since the rank 4 kernel has the highest performance). The explicit
coding of higher ranks (that is possible as long as there are enough FP registers)
would improve performance for these cases, at the expense of requiring a larger
number of kernels. For the considered case, a safe choice is to prefer the low
rank routines version for k ≤ 4, and the standard routines version for k > 4.
Interestingly, for both the Ivy-Bridge and the Haswell architectures, the perfor-
mance of the dgemm routine in MKL is higher for rank 1 than for rank 2, hinting
at the fact that the rank 1 case is explicitly coded in MKL too. Furthermore,
the reference Netlib implementation has better performance in case of very low
ranks than optimized BLAS routines, due to the diﬀerent order of the loops.
3.6 Conclusion
This section presented some level 3 BLAS linear algebra implementation tech-
niques specially tailored for embedded optimization, and proposed a matrix
format giving optimal performance for the proposed linear algebra routines, in
case of matrices roughly ﬁtting in LLC. Even if each single implementation tech-
nique is not new, their combination provides a novel implementation strategy
for the linear algebra in embedded optimization.
In the case of level 3 BLAS and LAPACK routines, the bulk of the computation
is cast in terms of the loop in the gemm kernel. LAPACK routines are
implemented in the same fashion as level 3 BLAS routines, namely using tailored
kernels based on the optimal gemm kernel.
Figure 3.12: Performance test on low rank updates. The matrices size m and
n are fixed to 100, while the rank update k ∈ [1, 24]. Panels: (a) dgemm_nt
Ivy-Bridge, (b) dsyrk Ivy-Bridge, (c) dgemm_nt Haswell, (d) dsyrk Haswell.
Each panel plots Gflops versus rank k for HPMPC, HPMPC low rank, Netlib,
MKL and OpenBLAS.
All steps in the optimization process are described in detail, in a pedagogical
fashion, and numerical tests show the performance advantages arising from each
single step.
Numerous numerical tests conﬁrm that the routines in HPMPC give a nice per-
formance advantage over the corresponding level 3 BLAS and LAPACK routines
in optimized BLAS libraries or code-generators for embedded optimization, for
the matrix size of interest for embedded optimization applications.
As a ﬁnal note, reference BLAS and LAPACK routines (that are the model of
current code generators for embedded optimization) perform poorly, and their
performance shows very little improvement in the Haswell micro-architecture
compared with the Ivy-Bridge micro-architecture, despite the fact that the for-
mer supports the AVX2 and FMA ISAs and has double the full FP throughput.
This is a good example of the fact that a more recent processor does not neces-
sarily give better performance, if the code does not exploit the new features.
Chapter 4
Level 2 BLAS for embedded
optimization
Level 2 BLAS contains routines for basic matrix-vector operations. In the em-
bedded optimization framework, level 2 BLAS routines are the backbone of ﬁrst
order optimization methods, and are used together with level 3 BLAS routines
in the implementation of second order optimization methods.
If n is the size (number of rows or columns) of the matrix involved in a generic
level 2 BLAS operation, then the computational cost of the operation is of about
O(n^2) flops, that is of the same order as the matrix size in memory. Therefore, each
matrix element is reused O(1) times (and often exactly once) in a level 2 BLAS
operation. In modern architectures the time needed to move a matrix element
from memory into registers is much larger than the time needed to perform a
ﬂoating-point operation on that element once in registers. Therefore, the cost of
level 2 BLAS routines is typically dominated by the cost of streaming the matrix
from memory into registers. As a consequence, these routines can typically
attain a low fraction of the full FP throughput, and the performance decreases
as the matrix size increases, exceeding cache size. This is a key diﬀerence with
respect to the level 3 BLAS and LAPACK routines.
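A rough bound makes this argument concrete. The estimate below is a sketch under stated assumptions: a gemv operation on an n × n matrix in double precision (about 2n^2 flops on n^2 elements of 8 bytes each), a matrix streamed from main memory, the 25.6 GB/s memory bandwidth of the test machines of Section 3.5, and the traffic due to the vectors neglected.

\[
\frac{2 n^2 \ \text{flops}}{8 n^2 \ \text{bytes}} = 0.25 \ \text{flops/byte},
\qquad
0.25 \ \text{flops/byte} \cdot 25.6 \ \text{GB/s} = 6.4 \ \text{Gflops}
\]

Under these assumptions the attainable performance is at most about 22% of the 28.8 Gflops full FP throughput of the Ivy-Bridge test processor, whatever the quality of the kernel. When the matrix fits in cache the bound is set by the cache bandwidth instead, which is higher.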
For large matrices, the computational cost of level 2 BLAS routines is negligible
with respect to the cost of level 3 BLAS and LAPACK routines. However, for the
small matrix sizes typical of embedded optimization the cost of level 2 BLAS
routines is often not negligible. Furthermore, in the implementation of ﬁrst
order optimization methods, there are no level 3 BLAS and LAPACK routines.
Therefore, in the framework of embedded optimization, the performance of level
2 BLAS routines plays an important role too.
In the implementation of level 2 BLAS routines, most of the computation can
be cast in terms of two routines: the gemv routine (that is the general
matrix-vector multiplication routine), and to a smaller extent the symv routine (that
is the symmetric matrix-vector multiplication routine). The role of the gemv
routine is analogous to the role of the gemm routine in level 3 BLAS and LAPACK
routines, as most of level 2 BLAS routines can be implemented using a kernel
for gemv. A notable exception is the symv routine: even if it is possible to
implement it by means of two triangular matrix-vector multiplications (and
therefore making use of the gemv kernel), since it has a reuse factor of 2 it
makes sense to write a specialized kernel exploiting this to reduce bandwidth
requirements. Furthermore, in optimization symmetric matrices are relatively
widespread (e.g. the Hessian of the cost function in optimization problems),
therefore many applications can beneﬁt from a specialized symv kernel.
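The reuse factor of 2 in symv can be illustrated with a short sketch. The routine name ref_symv_lower and the column-major storage are assumptions made for this example; it is a plain reference loop and not the specialized kernel discussed in this chapter.

```c
#include <assert.h>

/* Hypothetical reference routine: z = A * x for a symmetric n x n matrix A,
 * reading only the lower triangle (column-major order). Each off-diagonal
 * element is loaded once and used twice: for row i of the result and, by
 * symmetry, for row j. The vectors x and z must not overlap. */
void ref_symv_lower(int n, const double *A, int lda,
                    const double *x, double *z)
{
    assert(n > 0);
    for (int i = 0; i < n; i++)
        z[i] = 0.0;
    for (int j = 0; j < n; j++) {
        double xj = x[j];
        double zj = A[j + j * lda] * xj; /* diagonal element, used once */
        for (int i = j + 1; i < n; i++) {
            double a = A[i + j * lda];   /* single load of A(i,j) = A(j,i) */
            z[i] += a * xj;              /* contribution of A(i,j) */
            zj += a * x[i];              /* contribution of A(j,i) */
        }
        z[j] += zj;
    }
}
```

Compared with a gemv on the full matrix, this loop reads roughly half as many matrix elements for the same number of flops, which is where the reduction in bandwidth requirements comes from.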
The assumptions in Section 3.1 about the nature of the embedded optimiza-
tion problems still hold. In particular, level 2 BLAS routines are designed with
special care on reducing overhead and enhancing small scale performance. All
matrices are assumed to be stored in the panel-major format proposed in Section
3.2.2. This is the optimal matrix format for the 'NT' version of gemm optimized
for small matrices. It provides advantages over row-major or column-major ma-
trix orders also in the case of the gemv and symv kernels. In the case of level 2
BLAS routines, since there is O(1) reuse of matrix elements, the explicit pack-
ing or transposition of matrices is never advantageous, and therefore it is not
employed in optimized BLAS libraries, that assume row-major or column-major
matrix formats for level 2 BLAS routines. In the proposed implementation ap-
proach for embedded optimization, however, the conversion of matrices into the
panel-major format is assumed to be performed oﬀ-line or to be well amortized
e.g. over the several iterations of an optimization method. Therefore, the use of
the panel-major matrix format can enhance performance with respect to level
2 BLAS routines in optimized BLAS libraries.
In the proposed linear algebra routines for embedded optimization, vectors are
assumed to be stored contiguously in memory (this is equivalent to incx and
incy equal to 1 in standard level 2 BLAS routines). This implies that it is not
possible to use the proposed level 2 BLAS routines to operate on a single row
or column of matrices stored in the panel-major matrix format. This simpliﬁes
the implementation of level 2 BLAS routines, at the cost of reducing ﬂexibility.
However, this has not been found to be an issue in the implementation of op-
timization algorithms. In the rare cases where operations with matrix rows or
columns are necessary, explicit copy into a contiguous vector is employed. This
can also be seen as the packing of a vector into a favorable format, since each
vector element is reused O(n) times in level 2 BLAS routines, and unit stride
optimizes the reuse of cache lines and allows vectorization.
This chapter is organized as follows. Sections 4.1 and 4.2 present in detail the
optimization of the key routines in level 2 BLAS: gemv and symv respectively.
Section 4.3 presents the optimization of other level 2 BLAS routines. Finally,
Section 4.4 contains the performance plot for some widely used level 2 BLAS
routines.
4.1 Optimizing the gemv routine
The gemv routine is the general matrix-vector multiplication routine. In the
BLAS standard, it has the interface (considering the double precision version,
and using C notation)
void dgemv_(char *trans, int *m, int *n, double *alpha, double *A, \
int *lda, double *x, int *incx, double *beta, double *y, \
int *incy);
and it computes
y ← α · op(A) · x+ β · y
where α and β are scalars, and op(A) can be either A or A^T depending on the
flag trans. The matrix A has size m × n. The matrix A is stored in column-
major (or Fortran-like) order, where consecutive elements on the same row are
stored lda elements away. The elements in the vectors x and y are stored incx
and incy elements away. This can be used to operate on single rows or columns
of matrices, giving great ﬂexibility.
The following alternative interfaces are considered in this thesis:
void dgemv_n_lib(int m, int n, double *A, int sda, \
double *x, int alg, double *y, double *z);
computing
z ← α ·A · x+ β · y
and
void dgemv_t_lib(int m, int n, double *A, int sda, \
double *x, int alg, double *y, double *z);
computing
z ← α ·AT · x+ β · y
where α and β are controlled by the ﬂag alg in the same way as in the gemm
routine, as
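$$
\mathrm{alg} =
\begin{cases}
\phantom{-}0 & \Rightarrow\ \alpha \leftarrow 1,\ \beta \leftarrow 0 \\
\phantom{-}1 & \Rightarrow\ \alpha \leftarrow 1,\ \beta \leftarrow 1 \\
-1 & \Rightarrow\ \alpha \leftarrow -1,\ \beta \leftarrow 1
\end{cases}
$$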
In both versions, the matrix A has size m× n.
Notice that these routines take 1 matrix operand and 3 vector operands, meaning
that the result vector z does not necessarily overwrite the vector y: it can do
so if y and z correspond to the same memory location. This feature is useful in
many cases to avoid an explicit vector copy.
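The effect of the alg flag and of the separate output vector z can be sketched with a reference loop. The name ref_gemv_n_alg is hypothetical, and the matrix is stored column-major with leading dimension lda purely for readability (an assumption: the routines described here operate on the panel-major format instead).

```c
/* Hypothetical reference for z <- alpha*A*x + beta*y, with (alpha, beta)
 * selected by alg as described in the text. A is m x n, column-major
 * with leading dimension lda (an assumption made for readability). */
void ref_gemv_n_alg(int m, int n, const double *A, int lda,
                    const double *x, int alg, const double *y, double *z)
{
    double alpha = (alg == -1) ? -1.0 : 1.0;
    int use_y = (alg != 0);              /* beta is 0 only for alg == 0 */
    for (int i = 0; i < m; i++) {
        double acc = 0.0;
        for (int k = 0; k < n; k++)
            acc += A[i + k * lda] * x[k];
        /* y[i] is read before z[i] is written, so z may alias y */
        z[i] = (use_y ? y[i] : 0.0) + alpha * acc;
    }
}
```

Calling the sketch with the same pointer for y and z updates y in place, while distinct pointers leave y untouched, so no explicit vector copy is needed in either case.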
Compared to the standard gemv interface, these alternative formulations are
somewhat lower-level interfaces, closer to the interface of the underlying gemv
kernels. This has the advantage of reducing the overhead of the routine,
increasing performance for small matrices. As a drawback, they are less general,
covering just the cases commonly encountered in embedded optimization.
4.1.1 Optimizing the gemv kernel
The gemv kernels are the routines accounting for the innermost loop in the
implementation of the different gemv variants. These kernels compute a
fixed-size sub-vector of the result vector z by adding a fixed-size sub-vector
of y to the product of a panel from A and the vector x, where one of the two
panel dimensions is fixed. Each iteration of the kernel loop computes a rank-1
update of the result sub-vector. In order to achieve the best performance, the
gemv kernels are optimized for different computer architectures. Therefore,
features such as the height of the panel and the size of the fixed-size
sub-vector of z computed by the kernels are architecture-dependent. As an
example, the 'N' variant of the gemv kernel computing an 8×1 sub-vector of z,
where the A matrix is assumed to be stored in the panel-major format with panel
height bs = 4 (see Section 3.2.2 for more details), is
void kernel_dgemv_n_8_lib4(int kmax, double *A, int sda, \
double *x, int alg, double *y, double *z);
Notice that the interface of the gemv routines closely resembles the interface
of the corresponding underlying gemv kernel.
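What such an 8×1 kernel computes can be illustrated with a scalar reference. This is a sketch under a stated assumption about the panel-major layout of Section 3.2.2, which is not reproduced here: element (i, k) of a panel is taken to be at offset i + 4·k, and consecutive panels are taken to be 4·sda doubles apart. The name ref_kernel_dgemv_n_8 is hypothetical, and the real kernels are architecture-specific.

```c
enum { BS = 4 };  /* panel height assumed in the text (bs = 4) */

/* Scalar reference for an 8x1 'N' gemv kernel on a panel-major matrix.
 * ASSUMPTION: element (i, k) of a panel is at A[i + BS*k], and the next
 * panel (rows 4..7) starts BS*sda doubles after the current one. */
void ref_kernel_dgemv_n_8(int kmax, const double *A, int sda,
                          const double *x, int alg,
                          const double *y, double *z)
{
    const double *A0 = A;             /* panel holding rows 0..3 */
    const double *A1 = A + BS * sda;  /* panel holding rows 4..7 */
    double acc[8] = {0};
    for (int k = 0; k < kmax; k++) {  /* one rank-1 update per iteration */
        for (int i = 0; i < BS; i++) {
            acc[i]      += A0[i + BS * k] * x[k];
            acc[i + BS] += A1[i + BS * k] * x[k];
        }
    }
    for (int i = 0; i < 8; i++) {     /* apply the alg convention */
        if (alg == 0)      z[i] = acc[i];
        else if (alg == 1) z[i] = y[i] + acc[i];
        else               z[i] = y[i] - acc[i];
    }
}
```

Under this layout the 4 elements of a panel column are contiguous, so the loop over k walks through memory with unit stride within each panel.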
The gemv kernels are the key kernels in all level 2 BLAS routines, since the
computationally most expensive parts of all level 2 BLAS routines can be cast
in terms of these kernels. This section presents the generic techniques used in
the implementation of all gemv kernels, and shows how to optimize them for a
hypothetical computer architecture.
4.1.1.1 Blocking for registers
In the gemv routine, and more generally in all level 2 BLAS routines, the reuse
factor of matrix elements is O(1) (and usually exactly 1, with the exception
of the symv routine, where it is 2). Therefore the gemv routine is generally
memory-bound, meaning that the factor limiting the attained performance is
the memory bandwidth and not the computational throughput. Well-optimized
level 2 BLAS routines can attain a good performance if the matrix data is
already in L1 cache, and the performance decreases as the data needs to be
loaded from lower levels of cache or from main memory.
Blocking for registers can be employed also in the implementation of gemv, where
it can provide enough independent instructions to hide instruction latency, and
to a limited extent it can reduce memory bandwidth. In fact, matrix elements
have a reuse factor of O(1), but vector elements have a reuse factor of O(n),
and therefore the memory bandwidth requirements can be almost halved.
The same hypothetical processor introduced in Section 3.2.1.1 is considered in
the optimization choices. This processor can perform one FMA (fused-multiply-
add) every clock cycle (throughput=1), while the result is available for further
computations after 4 cycles (latency=4). Furthermore, this processor can load
one ﬂoating point register from L1 cache every clock cycle.
In the case of the gemv routine, the kernels for the 'N' and for the 'T'
versions are analogous when only scalar instructions are employed, but they are
fundamentally different when vector instructions are employed.
In the 'N' version case, without loss of generality the following gemv operation
is considered: a square matrix A of size n × n is multiplied by a vector x of
size n, and the result is used to update the content of a vector y of size n, as
y ← y + A · x.
Using the definition of matrix-vector product, each element y_i of y can be
computed as the inner product
$$
y_i = y_i + \sum_{k=0}^{n-1} a_{ik} \cdot x_k
$$
that is performed using the sequence of n FMA instructions
$$
\begin{array}{c|c}
 & x_k \\ \hline
a_{ik} & y_i \leftarrow y_i + a_{ik} \cdot x_k
\end{array}
\qquad k = 0, 1, \dots, n-1. \tag{4.1}
$$
The dependency pattern is analogous to the one in (3.1): each FMA depends on
the result of the previous FMA. Therefore, even if the data is present in L1
cache, an FMA can be performed only every 4 clock cycles. In any case, 2 FP
numbers (a_ik and x_k) need to be loaded from memory to perform each FMA.
Therefore, even ignoring for a moment the latency constraint, it would not be
possible to perform 1 FMA every clock cycle and keep the pipeline full, due to
the memory bandwidth constraint from L1 cache.
Blocking for registers can mitigate these latency and bandwidth constraints
also in the case of the gemv kernel. If 4 FP registers are used to hold a 4×1
sub-vector of y, the 4 elements of the sub-vector can be computed simultaneously
(using 0, 1, 2, 3 in place of i+0, i+1, i+2, i+3 to keep the notation lighter) as
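$$
\begin{array}{c|c}
 & x_k \\ \hline
a_{0k} & y_0 \leftarrow y_0 + a_{0k} \cdot x_k \\
a_{1k} & y_1 \leftarrow y_1 + a_{1k} \cdot x_k \\
a_{2k} & y_2 \leftarrow y_2 + a_{2k} \cdot x_k \\
a_{3k} & y_3 \leftarrow y_3 + a_{3k} \cdot x_k
\end{array}
\qquad k = 0, 1, \dots, n-1. \tag{4.2}
$$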
Again, the dependency pattern is analogous to the one in (3.2), with an FMA
instruction depending on the result of the 4th-last FMA instruction. Regarding
the instruction latency issue, the use of 4 accumulation registers can totally
hide FMA instruction latency. Regarding the memory bandwidth issue, even
assuming that the required data is present in L1 cache, the gemv kernel is
memory-bound. In fact, each x element is now reused 4 times once in registers,
but the A elements are not, and therefore 5 FP loads need to be performed for
every 4 FMAs (4 elements from A and 1 element from x), while the processor can
perform at most 4 FP loads in the 4 cycles needed to issue them. Therefore,
even assuming that the data is already present in L1 cache, this gemv kernel
can attain at most 80% of the scalar FP throughput of the processor.
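The register-blocked scheme (4.2) can be sketched in plain C as follows. The function name gemv_n_4x1 is hypothetical, and the matrix is column-major with leading dimension lda for readability (an assumption; the actual kernels read a panel-major matrix). Whether the four accumulators really stay in registers is left to the compiler.

```c
/* 4x1 register-blocked 'N' gemv sketch: y[0..3] += A[0..3][0..n-1] * x.
 * A is column-major with leading dimension lda (an assumption made for
 * readability; the kernels in the text read a panel-major matrix). */
void gemv_n_4x1(int n, const double *A, int lda, const double *x, double *y)
{
    /* four independent accumulators, one per cycle of FMA latency */
    double y0 = y[0], y1 = y[1], y2 = y[2], y3 = y[3];
    for (int k = 0; k < n; k++) {
        const double *a = A + k * lda;
        double xk = x[k];   /* 1 load, reused by the 4 FMAs below */
        y0 += a[0] * xk;    /* 4 loads from A, each used once     */
        y1 += a[1] * xk;
        y2 += a[2] * xk;
        y3 += a[3] * xk;
    }
    y[0] = y0; y[1] = y1; y[2] = y2; y[3] = y3;
}
```

Each loop iteration performs 4 FMAs and 5 loads, which is where the bound of 4/5 = 80% of the scalar FP throughput comes from on the hypothetical processor.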
In practice, if the required data is not already present in L1 cache, the perfor-
mance is further limited by the main memory bandwidth, that is generally not
high enough to stream the matrix fast enough and keep execution units busy.
A similar analysis can be done for the 'T' version of the gemv kernel. Let us
consider without loss of generality the operation
y ← y + A^T · x.
Using the deﬁnition of matrix-vector product when the matrix is transposed,
each element yi of y can be computed as the inner product
$$
y_i = y_i + \sum_{k=0}^{n-1} a_{ki} \cdot x_k
$$
where the only difference with respect to the 'N' version is the fact that the
indexes of the A elements are swapped. The equivalent of (4.2) for the 'T'
variant is
$$
\begin{array}{c|cccc}
 & a_{k0} & a_{k1} & a_{k2} & a_{k3} \\ \hline
x_k & y_0 \Leftarrow a_{k0} \cdot x_k & y_1 \Leftarrow a_{k1} \cdot x_k
    & y_2 \Leftarrow a_{k2} \cdot x_k & y_3 \Leftarrow a_{k3} \cdot x_k
\end{array}
\tag{4.3}
$$
for k = 0, 1, ..., n−1, where the symbol ⇐ is used as a shorthand for the
accumulation operator, so that y_0 ⇐ a_k0 · x_k is equivalent to
y_0 ← y_0 + a_k0 · x_k. The analysis is analogous to the 'N' case.
4.1.1.2 Use of SIMD instructions
If available on the processor, SIMD instructions (Single-Instruction
Multiple-Data, i.e. instructions that perform the same operation in parallel on
all elements of a small vector of data) can be used to improve the performance
of the gemv kernel in case the limiting factor is not the memory bandwidth from
main memory.
Continuing the example, let us assume that the hypothetical processor has
2-wide vector units (i.e. each holding 2 FP numbers), and vector FMA and
load instructions (operating on the 2-wide vectors) with the same latency and
throughput as the scalar instructions used in Section 4.1.1.1.
The 'N' kernel operating on the 4 × 1 sub-vector in (4.2) can be implemented
using vector instructions as
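$$
\begin{array}{c|c}
 & x_k \\ \hline
\begin{bmatrix} a_{0k} \\ a_{1k} \end{bmatrix} &
\begin{bmatrix} y_0 \\ y_1 \end{bmatrix} \leftarrow
\begin{bmatrix} y_0 \\ y_1 \end{bmatrix} +
\begin{bmatrix} a_{0k} \\ a_{1k} \end{bmatrix} \cdot
\begin{bmatrix} x_k \\ x_k \end{bmatrix} \\[2ex]
\begin{bmatrix} a_{2k} \\ a_{3k} \end{bmatrix} &
\begin{bmatrix} y_2 \\ y_3 \end{bmatrix} \leftarrow
\begin{bmatrix} y_2 \\ y_3 \end{bmatrix} +
\begin{bmatrix} a_{2k} \\ a_{3k} \end{bmatrix} \cdot
\begin{bmatrix} x_k \\ x_k \end{bmatrix}
\end{array}
\qquad k = 0, 1, \dots, n-1
$$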
where the square brackets indicate the small vectors. Also in the gemv case,
as a result of vectorization the number of instructions is halved. Therefore
there are not enough independent instructions to hide the FMA latency, and the
attained throughput is only slightly larger than in the scalar case: the vector
kernel attains 50% of the vector throughput, while the scalar kernel attains
40% (since 80% of the scalar throughput is equivalent to 40% of the
2-wide-vector throughput).
In order to increase the number of independent instructions, it is possible to
compute an additional 4×1 sub-vector of y. Therefore, the gemv kernel computes
an 8×1 sub-vector of y as
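$$
\begin{array}{c|c}
 & x_k \\ \hline
\begin{bmatrix} a_{0k} \\ a_{1k} \end{bmatrix} &
\begin{bmatrix} y_0 \\ y_1 \end{bmatrix} \leftarrow
\begin{bmatrix} y_0 \\ y_1 \end{bmatrix} +
\begin{bmatrix} a_{0k} \\ a_{1k} \end{bmatrix} \cdot
\begin{bmatrix} x_k \\ x_k \end{bmatrix} \\[2ex]
\begin{bmatrix} a_{2k} \\ a_{3k} \end{bmatrix} &
\begin{bmatrix} y_2 \\ y_3 \end{bmatrix} \leftarrow
\begin{bmatrix} y_2 \\ y_3 \end{bmatrix} +
\begin{bmatrix} a_{2k} \\ a_{3k} \end{bmatrix} \cdot
\begin{bmatrix} x_k \\ x_k \end{bmatrix} \\[2ex]
\begin{bmatrix} a_{4k} \\ a_{5k} \end{bmatrix} &
\begin{bmatrix} y_4 \\ y_5 \end{bmatrix} \leftarrow
\begin{bmatrix} y_4 \\ y_5 \end{bmatrix} +
\begin{bmatrix} a_{4k} \\ a_{5k} \end{bmatrix} \cdot
\begin{bmatrix} x_k \\ x_k \end{bmatrix} \\[2ex]
\begin{bmatrix} a_{6k} \\ a_{7k} \end{bmatrix} &
\begin{bmatrix} y_6 \\ y_7 \end{bmatrix} \leftarrow
\begin{bmatrix} y_6 \\ y_7 \end{bmatrix} +
\begin{bmatrix} a_{6k} \\ a_{7k} \end{bmatrix} \cdot
\begin{bmatrix} x_k \\ x_k \end{bmatrix}
\end{array}
\qquad k = 0, 1, \dots, n-1.
$$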
Notice that this time each x element is reused 8 times. This kernel has enough
independent FMA instructions to totally hide FMA instruction latency. How-
ever, even if the data is already present in L1 cache, the L1 cache bandwidth
limits the maximum performance. In fact, depending on the ISA, this kernel
can attain at most 80.0% of the full FP throughput (if the xk element is loaded
and broadcast to all vector components with a single instruction), or 88.9% (if
the elements xk and xk+1 for two consecutive loop iterations are loaded every
two loop iterations with a single load instruction). Notice that an entire column
of 8 elements from A needs to be processed at the same time in order to obtain a
well-optimized kernel. In case of wider vector units or longer FMA latency, this
number will be even larger. Therefore, in case of small matrices, this aﬀects the
performance of the gemv kernel more than the performance of the gemm kernel.
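The 8×1 vectorized 'N' kernel can be sketched in portable C by emulating the 2-wide vector registers with a small struct. The type vec2 and the helpers vec2_load, vec2_broadcast, vec2_fma and vec2_store are hypothetical stand-ins for ISA-specific vector instructions, and the matrix is column-major with leading dimension lda for readability (an assumption).

```c
/* Emulated 2-wide vector register (stands in for a real SIMD type). */
typedef struct { double v[2]; } vec2;

static vec2 vec2_load(const double *p) { vec2 r = {{p[0], p[1]}}; return r; }
static vec2 vec2_broadcast(double s)   { vec2 r = {{s, s}};       return r; }
static vec2 vec2_fma(vec2 acc, vec2 a, vec2 b)
{
    acc.v[0] += a.v[0] * b.v[0];
    acc.v[1] += a.v[1] * b.v[1];
    return acc;
}
static void vec2_store(double *p, vec2 a) { p[0] = a.v[0]; p[1] = a.v[1]; }

/* 8x1 'N' gemv sketch: y[0..7] += A[0..7][0..n-1] * x, A column-major. */
void gemv_n_8x1_vec2(int n, const double *A, int lda,
                     const double *x, double *y)
{
    vec2 y01 = vec2_load(y),     y23 = vec2_load(y + 2);
    vec2 y45 = vec2_load(y + 4), y67 = vec2_load(y + 6);
    for (int k = 0; k < n; k++) {
        const double *a = A + k * lda;
        vec2 xk = vec2_broadcast(x[k]);            /* 1 load + broadcast */
        y01 = vec2_fma(y01, vec2_load(a),     xk); /* 4 vector loads and */
        y23 = vec2_fma(y23, vec2_load(a + 2), xk); /* 4 independent      */
        y45 = vec2_fma(y45, vec2_load(a + 4), xk); /* vector FMAs        */
        y67 = vec2_fma(y67, vec2_load(a + 6), xk);
    }
    vec2_store(y, y01);     vec2_store(y + 2, y23);
    vec2_store(y + 4, y45); vec2_store(y + 6, y67);
}
```

In this form each iteration issues 5 loads for 4 vector FMAs, giving the 4/5 = 80% bound; loading x_k and x_{k+1} together every second iteration would give 9 loads for 8 vector FMAs, i.e. 8/9 ≈ 88.9%.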
In the vector version of the 'T' gemv kernel, vectorization is used to compute
multiple loop iterations with a single instruction, rather than to compute
several y elements with a single instruction as in the 'N' gemv kernel. In
order to make the notation easier, the elements of the first iteration are
denoted with h and the elements of the second iteration with k. The 'T' gemv
kernel operating on the 4 × 1 sub-vector of y computes two consecutive loop
iterations and it can be
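implemented using vector instructions as
$$
\begin{array}{c|cccc}
 & \begin{bmatrix} a_{h0} \\ a_{k0} \end{bmatrix}
 & \begin{bmatrix} a_{h1} \\ a_{k1} \end{bmatrix}
 & \begin{bmatrix} a_{h2} \\ a_{k2} \end{bmatrix}
 & \begin{bmatrix} a_{h3} \\ a_{k3} \end{bmatrix} \\ \hline
\begin{bmatrix} x_h \\ x_k \end{bmatrix}
 & \begin{bmatrix} y_0^0 \\ y_0^1 \end{bmatrix} \Leftarrow
   \begin{bmatrix} a_{h0} \\ a_{k0} \end{bmatrix} \cdot
   \begin{bmatrix} x_h \\ x_k \end{bmatrix}
 & \begin{bmatrix} y_1^0 \\ y_1^1 \end{bmatrix} \Leftarrow
   \begin{bmatrix} a_{h1} \\ a_{k1} \end{bmatrix} \cdot
   \begin{bmatrix} x_h \\ x_k \end{bmatrix}
 & \begin{bmatrix} y_2^0 \\ y_2^1 \end{bmatrix} \Leftarrow
   \begin{bmatrix} a_{h2} \\ a_{k2} \end{bmatrix} \cdot
   \begin{bmatrix} x_h \\ x_k \end{bmatrix}
 & \begin{bmatrix} y_3^0 \\ y_3^1 \end{bmatrix} \Leftarrow
   \begin{bmatrix} a_{h3} \\ a_{k3} \end{bmatrix} \cdot
   \begin{bmatrix} x_h \\ x_k \end{bmatrix}
\end{array}
$$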
for h = 0, 2, ..., n−2 and k = 1, 3, ..., n−1 (assuming n even), where ⇐ is a
shorthand for the accumulation operator, as in (4.3). The superscripts 0 and 1
are used to represent the two partial sums of each y_i element. At the end of
the main loop, the partial sums in the registers need to be reduced (i.e. all
components of a single vector have to be summed together to compute the result
element y_i). Depending on the ISA, this can be a rather expensive operation,
but it is amortized over n/2 effective loop iterations. Furthermore, at the end
of the main gemv loop, a scalar clean-up loop iteration is required in case n
is odd. Due to the L1 cache bandwidth, this kernel can attain at most 80% of
the full FP throughput. However, in practice the performance is reduced by the
vector reduction overhead. Furthermore notice that, in case of wider vectors of
size nv, even more loop iterations are computed at once, implying that the
reduction overhead is amortized over an effectively smaller number of
iterations, n/nv.
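The structure of the vectorized 'T' kernel (two partial sums per result element, a final reduction, and a scalar clean-up for odd n) can be sketched in portable C. The two lanes of each 2-wide accumulator are emulated with the arrays p0 and p1; the function name gemv_t_4x1_2wide is hypothetical, and the matrix is column-major with leading dimension lda (an assumption made for readability).

```c
/* 'T' gemv sketch for a 4x1 sub-vector: y[i] += sum_k A[k][i] * x[k].
 * Two loop iterations are processed together, as a 2-wide vector unit
 * would do; p0[i] and p1[i] emulate the two lanes of the accumulator
 * for y[i]. A is column-major with leading dimension lda. */
void gemv_t_4x1_2wide(int n, const double *A, int lda,
                      const double *x, double *y)
{
    double p0[4] = {0}, p1[4] = {0};
    int k = 0;
    for (; k + 1 < n; k += 2) {
        for (int i = 0; i < 4; i++) {
            const double *col = A + i * lda;
            p0[i] += col[k]     * x[k];      /* lane 0: first iteration  */
            p1[i] += col[k + 1] * x[k + 1];  /* lane 1: second iteration */
        }
    }
    for (int i = 0; i < 4; i++)
        y[i] += p0[i] + p1[i];               /* reduction of the lanes   */
    if (k < n)                               /* scalar clean-up, odd n   */
        for (int i = 0; i < 4; i++)
            y[i] += A[k + i * lda] * x[k];
}
```

The reduction is executed once per call regardless of n, which is why its relative cost grows for small matrices and for wider vectors.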
Generally speaking, the use of SIMD can improve the eﬀective memory band-
width from L1 cache (if an entire vector can be loaded with a single instruction),
but it does not aﬀect the memory bandwidth from main memory. Therefore, in
general the use of SIMD can improve the performance of the gemv kernel when
the data is already present in cache, but not if data has to be loaded from main
memory.
4.1.2 Use of contiguous memory and panel-major matrix format
In the implementation of the gemv kernel (and more generally of level 2 BLAS
routines) the use of contiguous memory plays a less crucial role than in the
implementation of the gemm kernel (and more generally of level 3 BLAS and
LAPACK routines). This is due to the fact that in level 2 BLAS routines there
is a reuse factor of matrix elements of O(1) (and generally exactly 1). Unless the
matrix A is reused in immediately subsequent linear algebra routines without
being evicted from cache, the use of a matrix format that optimizes cache reuse
is of limited benefit.
Nevertheless, the panel-major matrix format still has some useful features. The
matrix panels are aligned to cache boundaries, and therefore once a cache line
is moved from main memory to cache, it is fully utilized by both variants of
the gemv kernel. Furthermore, the memory access pattern has some regularity.
The regular access pattern of the 'N' version of the gemv kernel (i.e. the A
matrix is read along panels) can be easily detected by the hardware prefetcher
(if present in the architecture). The less regular access pattern of the 'T'
version of the gemv kernel (i.e. the A matrix is read across panels, with only
a few columns from a panel accessed before moving to the following panel) can
generally not be detected by the hardware prefetcher, but nevertheless it can
be efficiently exploited by means of software prefetch.
As a result, even if of limited beneﬁt on its own, the panel-major matrix format
provides a regular enough access pattern of matrix elements. This access pattern
can be directly detected by the hardware prefetcher ('N' version of gemv), or
explicitly exploited by means of software prefetch ('T' version of gemv), and this
may help in moving data into cache before it is needed, enhancing performance.
4.1.3 Edges handling
In the gemv case, edges are handled similarly to what is done in Section 3.2.4
for the gemm case. The main difference is that the gemv routines are
implemented as a single loop around the gemv kernels. Therefore, the edges in
one matrix dimension are already handled by the loop within the kernels. In the
other matrix dimension, a set of fixed-size and variable-store-size kernels is
implemented: this set has to handle edge sizes between 1 and the optimal kernel
size in only one matrix dimension, greatly reducing the number of kernels needed.
One optimal gemv kernel for each of the 'N' and 'T' versions takes care of most
of the computational load in case of large matrices. These kernels store
vectors of fixed size, and they do not have the overhead of the
variable-store-size logic.
A set of variable-store-size kernels takes care of the edges for each of the 'N'
and 'T' gemv versions. In case of the 'N' version, the kernel size is a multiple
of the SIMD width, since it does not make any difference in computational time
whether the vectors are fully utilized or not. In case of the 'T' version, it is
possible to take advantage of a finer granularity, since the kernel size affects
the number of instructions in the kernel loop, and not the degree of
utilization of vectors as in the 'N' case.
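The single loop around the kernels, with a variable-store-size kernel for the edge, can be sketched for the 'N' case as follows. All names (kernel_n_4, kernel_n_4_vs, gemv_n_driver) are hypothetical, the kernel size 4 is chosen only for illustration, and the matrix is column-major with its rows padded up to a multiple of 4, so that the edge kernel can compute at full width (an assumption that mimics the padding of the panel-major format).

```c
/* Fixed-size kernel: z[0..3] = y[0..3] + A[0..3][0..n-1] * x. */
static void kernel_n_4(int n, const double *A, int lda, const double *x,
                       const double *y, double *z)
{
    double acc[4] = {y[0], y[1], y[2], y[3]};
    for (int k = 0; k < n; k++)
        for (int i = 0; i < 4; i++)
            acc[i] += A[i + k * lda] * x[k];
    for (int i = 0; i < 4; i++) z[i] = acc[i];
}

/* Variable-store-size kernel: same full-width computation, but only the
 * first km (1 <= km <= 4) elements of y are read and of z are stored.
 * ASSUMPTION: the padding rows of A are allocated and readable. */
static void kernel_n_4_vs(int km, int n, const double *A, int lda,
                          const double *x, const double *y, double *z)
{
    double acc[4] = {0};
    for (int i = 0; i < km; i++) acc[i] = y[i];
    for (int k = 0; k < n; k++)
        for (int i = 0; i < 4; i++)
            acc[i] += A[i + k * lda] * x[k];
    for (int i = 0; i < km; i++) z[i] = acc[i];   /* masked store */
}

/* Driver: z = y + A*x for an m x n matrix, with lda >= m rounded up to 4. */
void gemv_n_driver(int m, int n, const double *A, int lda,
                   const double *x, const double *y, double *z)
{
    int i = 0;
    for (; i + 4 <= m; i += 4)                     /* bulk of the work */
        kernel_n_4(n, A + i, lda, x, y + i, z + i);
    if (i < m)                                     /* edge in m        */
        kernel_n_4_vs(m - i, n, A + i, lda, x, y + i, z + i);
}
```

The edge in the n dimension needs no dedicated kernel, since it is absorbed by the loop over k inside each kernel.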
4.2 Optimizing the symv routine
The symv routine is the symmetric matrix-vector multiplication routine. In the
BLAS standard, it has the interface (considering the double precision version,
and using C notation)
dsymv_(char *uplo, int *n, double *alpha, double *A, int *lda, \
double *x, int *incx, double *beta, double *y, int *incy);
and it computes
y ← α ·A · x+ β · y
where α and β are scalars. The matrix A has size n×n and it is symmetric; the
flag uplo controls whether the lower or the upper triangular part of A is
accessed during computation. The matrix A is stored in column-major (or
Fortran-like) order, where consecutive elements on the same row are stored lda
elements apart. The elements in the vectors x and y are stored incx and incy
elements apart. This can be used to operate on single rows or columns of
matrices, giving great flexibility.
The following alternative interface is considered in this thesis:
void dsymv_lib(int m, int n, double *A, int sda, \
double *x, int alg, double *y, double *z);
computing
z ← α ·A · x+ β · y
where α and β are controlled by the ﬂag alg in the same way as in the gemm
routine, as
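$$
\mathrm{alg} =
\begin{cases}
\phantom{-}0 & \Rightarrow\ \alpha \leftarrow 1,\ \beta \leftarrow 0 \\
\phantom{-}1 & \Rightarrow\ \alpha \leftarrow 1,\ \beta \leftarrow 1 \\
-1 & \Rightarrow\ \alpha \leftarrow -1,\ \beta \leftarrow 1
\end{cases}
$$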
The matrix A has size m×m, where m ≥ n. If m = n, A is a dense symmetric
matrix. If m > n, A is a symmetric matrix in the form
$$
A =
\begin{bmatrix}
A_0 & A_1^T \\
A_1 &
\end{bmatrix}
$$
where A_0 is a dense symmetric n×n matrix and A_1 is a dense generic (m−n)×n
matrix. Matrices in this form are common in optimization (e.g. the KKT matrix
of an equality constrained QP), and many applications can benefit from the
possibility of computing also the matrix-vector multiplications involving A_1 by
means of a routine with reuse factor 2.
4.2.1 Optimizing the symv kernel
The symv kernel is the routine accounting for the innermost loop in the
implementation of the symv routine. This kernel combines the 'N' and the 'T'
versions of the gemv kernel, computing simultaneously
zn ← α · A · xn + β · yn and zt ← α · A^T · xt + β · yt
where A is a matrix panel, such that each A element is reused twice once in
registers. In Section 4.1.1.2, it is shown that the vector implementation of
the 'T' version of the gemv kernel accesses the matrix across panels, since
this allows the vector reduction to be amortized over n loop iterations. On the
contrary, accessing the matrix along panels would imply that the vector
reduction has to be computed at each loop iteration, greatly affecting
performance. Therefore, also the symv kernel is required to access the matrix
across panels: the 'T' part of the symv kernel is thus identical to the 'T'
version of the gemv kernel, while the 'N' part has a different matrix access
pattern with respect to the 'N' version of the gemv kernel. As a consequence,
the symv kernel computes a fixed-size sub-vector of the result vector zt, and a
low-rank update of the result vector zn. Each iteration of the kernel loop
computes a rank-1 update of the sub-vector of the result vector zt, and a
low-rank update of an element of the result vector zn.
In order to achieve the best performance, the symv kernel is optimized for
different computer architectures. Therefore, features such as the height of the
panel, the degree of the low-rank update of the result vector zn and the size
of the fixed-size sub-vector of the result vector zt computed by the kernel are
architecture-dependent. As an example, the symv kernel computing a rank-4
update of the result vector zn and a 4 × 1 sub-vector of the result vector zt,
where the A matrix is assumed to be stored in the panel-major format with
panel height bs = 4 (see Section 3.2.2 for more details), is
void kernel_dsymv_4_lib4(int kmax, int tri_A, double *A, int sda, \
double *x_n, double *x_t, int alg, double *y_n, double *y_t, \
double *z_n, double *z_t);
Notice that the interface of the symv routine does not closely resemble the
interface of the underlying symv kernel, as the latter has many more arguments.
The flag tri_A is used to indicate whether the beginning of the A panel is a
triangle or not.
The symv kernel can be used to compute the symv routine, and more generally to
compute simultaneously the 'N' and 'T' versions of gemv. The routine computing
this latter operation will be called gemv_nt, since it simultaneously accounts for
the 'N' and 'T' versions of gemv. This section presents the generic techniques
used in the implementation of all symv kernels, and shows how to optimize them
for a hypothetical computer architecture.
4.2.1.1 Blocking for registers
In the symv routine, the reuse factor of the matrix elements is 2: this compares
favorably with the reuse factor of the other level 2 BLAS routines, which is
equal to 1. Therefore the symv routine is still expected to be memory-bound,
but since the memory bandwidth requirements are lower, the performance is
expected to be better than that of the gemv routine when the data has to be
loaded from main memory. On the other hand, the fact that the 'N' part of the
symv kernel computes a low-rank update of the entire result vector zn can
introduce some issues regarding the possibility of hiding instruction latency.
Blocking for registers can be employed also in the implementation of symv, where
it can provide enough independent instructions to hide instruction latency, and
to a limited extent it can reduce memory bandwidth.
The same hypothetical processor introduced in Sections 3.2.1.1 and 4.1.1.1 is
considered in the optimization choices. This processor can perform one FMA
(fused-multiply-add) every clock cycle (throughput=1), while the result is avail-
able for further computations after 4 cycles (latency=4). Furthermore, this
processor can load one ﬂoating point register from L1 cache every clock cycle.
Let us consider the 'T' version of the gemv kernel computing a sub-vector of yt
of size 4 as basis for the symv kernel. This gemv kernel uses 4 registers to hold
the 4 elements of the sub-vector of yt, that are loaded before the main loop. At
each loop iteration, one element from xt and 4 elements from A are loaded and
multiplied by means of 4 FMAs.
The symv kernel expands the 'T' version of the gemv kernel by using 4 additional
registers to hold the 4 elements of the sub-vector of xn. At each loop
iteration, one additional element from yn is loaded, and 4 additional FMAs are
performed. The resulting 8 FMAs computed at each loop iteration are
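$$
\begin{array}{c|cccc}
 & a_{k0} & a_{k1} & a_{k2} & a_{k3} \\ \hline
y^n_k & y^n_k \Leftarrow a_{k0} \cdot x^n_0 & y^n_k \Leftarrow a_{k1} \cdot x^n_1
      & y^n_k \Leftarrow a_{k2} \cdot x^n_2 & y^n_k \Leftarrow a_{k3} \cdot x^n_3 \\
x^t_k & y^t_0 \Leftarrow a_{k0} \cdot x^t_k & y^t_1 \Leftarrow a_{k1} \cdot x^t_k
      & y^t_2 \Leftarrow a_{k2} \cdot x^t_k & y^t_3 \Leftarrow a_{k3} \cdot x^t_k
\end{array}
\tag{4.4}
$$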
for k = 0, 1, ..., n−1, where ⇐ is a shorthand for the accumulation operator,
as in (4.3). It is immediately evident that the 4 FMAs coming from the 'T' part
are independent, but the 4 FMAs coming from the 'N' part are not, since they
all accumulate into the same element of yn. Even issuing the FMAs from the 'T'
part between the dependent FMAs from the 'N' part, there are not enough
independent instructions to keep the FMA pipeline busy, as on average it is
possible to issue an FMA every 2 cycles, for a maximum performance
of 50% of the scalar full FP throughput.
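The 8 FMAs of (4.4) can be sketched in plain C to show how each matrix element is loaded once and used twice. The function name gemv_nt_4 is hypothetical, and A is taken as an n × 4 block stored column-major with leading dimension lda (an assumption made for readability).

```c
/* Scalar sketch of (4.4): every A element is loaded once and used twice.
 *   'N' part: yn[k] += sum_i A[k][i] * xn[i]   (k = 0..n-1, i = 0..3)
 *   'T' part: yt[i] += sum_k A[k][i] * xt[k]
 * A is an n x 4 block, column-major with leading dimension lda. */
void gemv_nt_4(int n, const double *A, int lda,
               const double *xn, const double *xt, double *yn, double *yt)
{
    /* 4 registers for the xn sub-vector, 4 for the yt accumulators */
    double xn0 = xn[0], xn1 = xn[1], xn2 = xn[2], xn3 = xn[3];
    double yt0 = yt[0], yt1 = yt[1], yt2 = yt[2], yt3 = yt[3];
    for (int k = 0; k < n; k++) {
        double ynk = yn[k];   /* accumulator of the 'N' part            */
        double xtk = xt[k];   /* reused by the 4 FMAs of the 'T' part   */
        double a;
        a = A[k + 0 * lda]; ynk += a * xn0; yt0 += a * xtk;
        a = A[k + 1 * lda]; ynk += a * xn1; yt1 += a * xtk;
        a = A[k + 2 * lda]; ynk += a * xn2; yt2 += a * xtk;
        a = A[k + 3 * lda]; ynk += a * xn3; yt3 += a * xtk;
        yn[k] = ynk;          /* the 4 'N' FMAs form a dependent chain  */
    }
    yt[0] = yt0; yt[1] = yt1; yt[2] = yt2; yt[3] = yt3;
}
```

The four updates of ynk within one iteration depend on each other, which is the latency problem discussed above, while the four yt accumulators are mutually independent.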
It is possible to have enough independent FMA instructions by considering two
consecutive loop iterations. In order to make the notation easier, the elements
of the first iteration are denoted with h and the elements of the second
iteration with k. The resulting 16 FMAs computed in two consecutive loop
iterations are
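$$
\begin{array}{c|cccc}
 & a_{h0} & a_{h1} & a_{h2} & a_{h3} \\ \hline
y^n_h & y^n_h \Leftarrow a_{h0} \cdot x^n_0 & y^n_h \Leftarrow a_{h1} \cdot x^n_1
      & y^n_h \Leftarrow a_{h2} \cdot x^n_2 & y^n_h \Leftarrow a_{h3} \cdot x^n_3 \\
x^t_h & y^t_0 \Leftarrow a_{h0} \cdot x^t_h & y^t_1 \Leftarrow a_{h1} \cdot x^t_h
      & y^t_2 \Leftarrow a_{h2} \cdot x^t_h & y^t_3 \Leftarrow a_{h3} \cdot x^t_h \\ \hline
 & a_{k0} & a_{k1} & a_{k2} & a_{k3} \\ \hline
y^n_k & y^n_k \Leftarrow a_{k0} \cdot x^n_0 & y^n_k \Leftarrow a_{k1} \cdot x^n_1
      & y^n_k \Leftarrow a_{k2} \cdot x^n_2 & y^n_k \Leftarrow a_{k3} \cdot x^n_3 \\
x^t_k & y^t_0 \Leftarrow a_{k0} \cdot x^t_k & y^t_1 \Leftarrow a_{k1} \cdot x^t_k
      & y^t_2 \Leftarrow a_{k2} \cdot x^t_k & y^t_3 \Leftarrow a_{k3} \cdot x^t_k
\end{array}
\tag{4.5}
$$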
Let us assume that the pairs of FMAs involving the same A element are issued in
sequence. This means that in the above scheme there are 8 pairs of FMAs (2 rows
of 4 columns each), and that full throughput can be achieved if each pair is
followed by an independent pair. Each pair depends on all pairs on the same row
(due to the dependency between the yn elements), and each pair depends on the
pair immediately above or below (due to the dependency between the yt
elements). A possible sequence of pairs that can be issued without stalling the
FMA pipeline due to dependencies is
$$
\begin{bmatrix}
1 & 5 & 3 & 7 \\
6 & 2 & 8 & 4
\end{bmatrix}
\tag{4.6}
$$
where each number indicates the order of issue of the pair in the corresponding
position.
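The claim that the order (4.6) does not stall the pipeline can be checked with a small model of the hypothetical processor (one FMA issued per cycle, latency of 4 cycles). The sketch below rests on an assumed convention: the pair in issue slot s issues its 'N' FMA at cycle 2(s−1) and its 'T' FMA at cycle 2(s−1)+1, and the same order is repeated every 16 cycles for the following loop iterations. The function name schedule_is_stall_free is hypothetical.

```c
enum { ROWS = 2, COLS = 4, LATENCY = 4, PERIOD = 2 * ROWS * COLS };

/* Returns 1 if the issue order never places two FMAs that update the
 * same accumulator fewer than LATENCY cycles apart, 0 otherwise.
 * order[r][c] is the 1-based issue slot of the pair in row r, column c;
 * two pairs whose slots differ by d issue their matching FMAs 2*d
 * cycles apart (assumed convention, see the lead-in). */
int schedule_is_stall_free(int order[ROWS][COLS])
{
    /* 'N' part: the pairs of one row share the accumulator yn[row] */
    for (int r = 0; r < ROWS; r++)
        for (int c1 = 0; c1 < COLS; c1++)
            for (int c2 = c1 + 1; c2 < COLS; c2++) {
                int d = 2 * (order[r][c1] - order[r][c2]);
                if (d < 0) d = -d;
                if (d < LATENCY) return 0;
            }
    /* 'T' part: the pairs of one column share the accumulator yt[col];
     * the same column is updated again when the loop body repeats,
     * PERIOD cycles later, so the wrap-around distance is checked too */
    for (int c = 0; c < COLS; c++) {
        int d = 2 * (order[0][c] - order[1][c]);
        if (d < 0) d = -d;
        if (d < LATENCY || PERIOD - d < LATENCY) return 0;
    }
    return 1;
}
```

Under this model the order (4.6) passes the check, while the naive row-by-row order 1, 2, ..., 8 fails, because consecutive pairs on the same row update the same 'N' accumulator only 2 cycles apart.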
During the 16 clock cycles needed to issue all the 16 independent FMAs, 8 ele-
ments are loaded from memory. Therefore, if the matrix and vector elements are
in L1 cache, it is theoretically possible to achieve the scalar full FP throughput.
Notice that this scheme requires a considerable number of registers: 4 to hold
the x^n_i elements, 4 to hold the y^t_i elements, and 4 to hold y^n_h, y^n_k,
x^t_h, x^t_k, plus at least 1 for the A elements and possibly 1 for the
intermediate results (depending on the ISA), for a total of 13 or 14 FP
registers.
4.2.1.2 Use of SIMD instructions
Similarly to the case of the gemv kernel, if available on the processor, SIMD
instructions (Single-Instruction Multiple-Data, i.e. instructions that perform
the same operation in parallel on all elements of a small vector of data) can
be used to improve the performance of the symv kernel in case the limiting
factor is not the memory bandwidth from main memory.
Continuing the example, let us assume that the hypothetical processor has
2-wide vector units (i.e. each holding 2 FP numbers), and vector FMA and
load instructions (operating on the 2-wide vectors) with the same latency and
throughput as the scalar instructions used in Section 4.1.1.1.
As in the vector version of the 'T' gemv kernel, vectorization is used to compute
multiple loop iterations with a single instruction, rather than to compute
several elements of the result vector with a single instruction as in the 'N'
gemv kernel. The symv kernel computing a 4×1 sub-vector of yt and a rank-4
update of yn can be implemented using vector instructions as
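$$
\begin{array}{c|cccc}
 & \begin{bmatrix} a_{h0} \\ a_{k0} \end{bmatrix}
 & \begin{bmatrix} a_{h1} \\ a_{k1} \end{bmatrix}
 & \begin{bmatrix} a_{h2} \\ a_{k2} \end{bmatrix}
 & \begin{bmatrix} a_{h3} \\ a_{k3} \end{bmatrix} \\ \hline
\begin{bmatrix} y^n_h \\ y^n_k \end{bmatrix}
 & \begin{bmatrix} y^n_h \Leftarrow a_{h0} \cdot x^n_0 \\ y^n_k \Leftarrow a_{k0} \cdot x^n_0 \end{bmatrix}
 & \begin{bmatrix} y^n_h \Leftarrow a_{h1} \cdot x^n_1 \\ y^n_k \Leftarrow a_{k1} \cdot x^n_1 \end{bmatrix}
 & \begin{bmatrix} y^n_h \Leftarrow a_{h2} \cdot x^n_2 \\ y^n_k \Leftarrow a_{k2} \cdot x^n_2 \end{bmatrix}
 & \begin{bmatrix} y^n_h \Leftarrow a_{h3} \cdot x^n_3 \\ y^n_k \Leftarrow a_{k3} \cdot x^n_3 \end{bmatrix} \\[2ex]
\begin{bmatrix} x^t_h \\ x^t_k \end{bmatrix}
 & \begin{bmatrix} y^{t,0}_0 \Leftarrow a_{h0} \cdot x^t_h \\ y^{t,1}_0 \Leftarrow a_{k0} \cdot x^t_k \end{bmatrix}
 & \begin{bmatrix} y^{t,0}_1 \Leftarrow a_{h1} \cdot x^t_h \\ y^{t,1}_1 \Leftarrow a_{k1} \cdot x^t_k \end{bmatrix}
 & \begin{bmatrix} y^{t,0}_2 \Leftarrow a_{h2} \cdot x^t_h \\ y^{t,1}_2 \Leftarrow a_{k2} \cdot x^t_k \end{bmatrix}
 & \begin{bmatrix} y^{t,0}_3 \Leftarrow a_{h3} \cdot x^t_h \\ y^{t,1}_3 \Leftarrow a_{k3} \cdot x^t_k \end{bmatrix}
\end{array}
\tag{4.7}
$$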
(where ⇐ has the usual meaning of shorthand for the accumulation operator, and
some of the square brackets have been removed to make the notation shorter: it
is intended that all operations inside each pair of square brackets are
vector operations on 2-wide vectors) that accounts for 8 vector FMAs, and is
analogous to (4.4) implemented using scalar instructions. In fact, it is
immediately evident that the 4 FMAs coming from the 'T' part are independent,
but the 4 FMAs coming from the 'N' part are not. Even issuing the FMAs from the
'T' part between the dependent FMAs from the 'N' part, there are not enough
independent instructions to keep the FMA pipeline busy, as on average it is
possible to issue an FMA every 2 cycles, for a maximum performance of 50% of
the vector full FP throughput.
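In the vector case, it is possible to have enough independent FMA instructions by
considering four consecutive loop iterations instead of two. In order to make
the notation easier, the elements of the first, second, third and fourth
iteration are denoted with h, k, l and p. The resulting 16 vector FMAs computed
in four consecutive loop iterations are
$$
\begin{array}{c|cccc}
 & \begin{bmatrix} a_{h0} \\ a_{k0} \end{bmatrix}
 & \begin{bmatrix} a_{h1} \\ a_{k1} \end{bmatrix}
 & \begin{bmatrix} a_{h2} \\ a_{k2} \end{bmatrix}
 & \begin{bmatrix} a_{h3} \\ a_{k3} \end{bmatrix} \\ \hline
\begin{bmatrix} y^n_h \\ y^n_k \end{bmatrix}
 & \begin{bmatrix} y^n_h \Leftarrow a_{h0} \cdot x^n_0 \\ y^n_k \Leftarrow a_{k0} \cdot x^n_0 \end{bmatrix}
 & \begin{bmatrix} y^n_h \Leftarrow a_{h1} \cdot x^n_1 \\ y^n_k \Leftarrow a_{k1} \cdot x^n_1 \end{bmatrix}
 & \begin{bmatrix} y^n_h \Leftarrow a_{h2} \cdot x^n_2 \\ y^n_k \Leftarrow a_{k2} \cdot x^n_2 \end{bmatrix}
 & \begin{bmatrix} y^n_h \Leftarrow a_{h3} \cdot x^n_3 \\ y^n_k \Leftarrow a_{k3} \cdot x^n_3 \end{bmatrix} \\[2ex]
\begin{bmatrix} x^t_h \\ x^t_k \end{bmatrix}
 & \begin{bmatrix} y^{t,0}_0 \Leftarrow a_{h0} \cdot x^t_h \\ y^{t,1}_0 \Leftarrow a_{k0} \cdot x^t_k \end{bmatrix}
 & \begin{bmatrix} y^{t,0}_1 \Leftarrow a_{h1} \cdot x^t_h \\ y^{t,1}_1 \Leftarrow a_{k1} \cdot x^t_k \end{bmatrix}
 & \begin{bmatrix} y^{t,0}_2 \Leftarrow a_{h2} \cdot x^t_h \\ y^{t,1}_2 \Leftarrow a_{k2} \cdot x^t_k \end{bmatrix}
 & \begin{bmatrix} y^{t,0}_3 \Leftarrow a_{h3} \cdot x^t_h \\ y^{t,1}_3 \Leftarrow a_{k3} \cdot x^t_k \end{bmatrix} \\ \hline
 & \begin{bmatrix} a_{l0} \\ a_{p0} \end{bmatrix}
 & \begin{bmatrix} a_{l1} \\ a_{p1} \end{bmatrix}
 & \begin{bmatrix} a_{l2} \\ a_{p2} \end{bmatrix}
 & \begin{bmatrix} a_{l3} \\ a_{p3} \end{bmatrix} \\ \hline
\begin{bmatrix} y^n_l \\ y^n_p \end{bmatrix}
 & \begin{bmatrix} y^n_l \Leftarrow a_{l0} \cdot x^n_0 \\ y^n_p \Leftarrow a_{p0} \cdot x^n_0 \end{bmatrix}
 & \begin{bmatrix} y^n_l \Leftarrow a_{l1} \cdot x^n_1 \\ y^n_p \Leftarrow a_{p1} \cdot x^n_1 \end{bmatrix}
 & \begin{bmatrix} y^n_l \Leftarrow a_{l2} \cdot x^n_2 \\ y^n_p \Leftarrow a_{p2} \cdot x^n_2 \end{bmatrix}
 & \begin{bmatrix} y^n_l \Leftarrow a_{l3} \cdot x^n_3 \\ y^n_p \Leftarrow a_{p3} \cdot x^n_3 \end{bmatrix} \\[2ex]
\begin{bmatrix} x^t_l \\ x^t_p \end{bmatrix}
 & \begin{bmatrix} y^{t,0}_0 \Leftarrow a_{l0} \cdot x^t_l \\ y^{t,1}_0 \Leftarrow a_{p0} \cdot x^t_p \end{bmatrix}
 & \begin{bmatrix} y^{t,0}_1 \Leftarrow a_{l1} \cdot x^t_l \\ y^{t,1}_1 \Leftarrow a_{p1} \cdot x^t_p \end{bmatrix}
 & \begin{bmatrix} y^{t,0}_2 \Leftarrow a_{l2} \cdot x^t_l \\ y^{t,1}_2 \Leftarrow a_{p2} \cdot x^t_p \end{bmatrix}
 & \begin{bmatrix} y^{t,0}_3 \Leftarrow a_{l3} \cdot x^t_l \\ y^{t,1}_3 \Leftarrow a_{p3} \cdot x^t_p \end{bmatrix}
\end{array}
\tag{4.8}
$$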
It is possible to issue an independent vector FMA every clock cycle by issuing
them using the same scheme (4.6) as in the scalar case for two following loop
cycles. During the 16 clock cycles needed to issue all the 16 independent vector
FMAs, 8 registers are loaded from memory. Therefore, if the matrix and vector
elements are in L1 cache, it is theoretically possible to achieve the vector full
FP throughput.
This scheme requires the same amount of (vector) FP registers as the scalar case
requires scalar registers, i.e. 13-14 depending on the ISA. As a drawback, all
instructions of 4 consecutive loop indexes need to be issued in a special order.
In case of wider vectors, even more consecutive loops need to be considered at
once, and this may aﬀect the performance in case of small matrices.
The analysis of the use of contiguous memory and the panel-major matrix
format, and of the edge handling, is analogous to the gemv case, found in
Sections 4.1.2 and 4.1.3, and therefore it is not repeated.
4.3 Optimizing other level 2 BLAS routines
The computationally most expensive part of level 2 BLAS routines can be cast
in terms of the gemv kernel, just as the computationally most expensive
parts of level 3 BLAS and LAPACK routines can be cast in terms of the gemm
kernel. In fact, the gemv kernel can be used to compute, update or downdate
a sub-vector of the result vector with the product of one rectangular matrix and
a vector. In the special case of level 2 BLAS routines with reuse factor of matrix
elements equal to 2, it is convenient to use the symv kernel instead, even if these
routines could be implemented using the gemv kernel.
The remainder of the section presents techniques to obtain high-performance
level 2 BLAS routines based on the optimized gemv kernels, with special focus
on small-scale performance.
4.3.1 Triangles and substitutions handling
The proposed level 2 BLAS routines for embedded optimization handle triangles
and substitutions in the same way as the proposed level 3 BLAS and LAPACK
routines handle the matrix counterpart of triangles and substitutions. Namely,
specialized kernels are designed. In these kernels, the main loop is copied
verbatim from the gemv and symv kernels, while specialized procedures
before and after this loop take care of triangular matrices and substitutions.
This approach requires the design of several specialized kernels, but once the
gemv and symv kernels are available, they can be easily edited to get all other
level 2 BLAS kernels.
4.3.2 Merging of linear algebra routines
Also in the case of level 2 BLAS routines, it is possible to employ merging
of linear algebra routines in order to reduce overhead, and therefore improve
performance.
Some examples of complex operations that are commonly found in embedded
optimization and that can be easily merged are:
• The 'N' version of the trsv routine followed by the 'N' version of the gemv
routine. It computes
$$
\begin{bmatrix} y_0 \\ y_1 \end{bmatrix}
\leftarrow
\begin{bmatrix} A_0^{-1} x_0 \\ x_1 - A_1 \cdot A_0^{-1} x_0 \end{bmatrix}
\quad \text{where} \quad
A = \begin{bmatrix} A_0 \\ A_1 \end{bmatrix}
\tag{4.9}
$$
and A_0 is a lower triangular invertible matrix and A_1 is a generic matrix.
The matrix A could be the output of the potrf routine for m > n, as
e.g. in the efficient implementation of the backward Riccati recursion in
Chapter 8.
• The 'T' version of the gemv routine followed by the 'T' version of the trsv
routine. It computes
$$
\begin{bmatrix} y_0 \\ y_1 \end{bmatrix}
\leftarrow
\begin{bmatrix} A_0^{-T} \cdot (x_0 - A_1^T \cdot x_1) \\ x_1 \end{bmatrix}
\quad \text{where} \quad
A = \begin{bmatrix} A_0 \\ A_1 \end{bmatrix}
\tag{4.10}
$$
and A_0 is a lower triangular invertible matrix and A_1 is a generic matrix.
The matrix A could be the output of the potrf routine for m > n, as
e.g. in the efficient implementation of the backward Riccati recursion in
Chapter 8.
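The two merged operations (4.9) and (4.10) can be sketched as single loops over the whole m × n matrix. The function names trsv_gemv_n and gemv_trsv_t are hypothetical, and the matrix is column-major with leading dimension lda for readability (an assumption; the routines described here use the panel-major format and optimized kernels, and may use the precomputed inverse diagonal instead of divisions).

```c
/* Merged 'N' trsv + gemv sketch for (4.9).
 * A is m x n (m >= n), column-major with leading dimension lda; its top
 * n x n block A0 is lower triangular and invertible, A1 is the rest.
 * With x = [x0; x1], computes y0 = A0^{-1} x0 and y1 = x1 - A1 * y0. */
void trsv_gemv_n(int m, int n, const double *A, int lda,
                 const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double acc = x[i];
        int kmax = (i < n) ? i : n;  /* part left of the diagonal, or a full row of A1 */
        for (int k = 0; k < kmax; k++)
            acc -= A[i + k * lda] * y[k];
        y[i] = (i < n) ? acc / A[i + i * lda] : acc;
    }
}

/* Merged 'T' gemv + trsv sketch for (4.10).
 * Computes y1 = x1 and y0 = A0^{-T} (x0 - A1^T x1). */
void gemv_trsv_t(int m, int n, const double *A, int lda,
                 const double *x, double *y)
{
    for (int i = n; i < m; i++)
        y[i] = x[i];                     /* y1 = x1 */
    for (int i = n - 1; i >= 0; i--) {   /* backward substitution */
        double acc = x[i];
        for (int k = i + 1; k < m; k++)  /* rest of column i: A0 part, then A1 */
            acc -= A[k + i * lda] * y[k];
        y[i] = acc / A[i + i * lda];
    }
}
```

In both sketches the substitution and the gemv update share the same inner loop, which is what makes merging the two routines natural.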
4.3.3 Notable routines
This section reports the interface of notable routines as implemented for embed-
ded optimization, together with the rationale behind the choice of the interfaces.
All matrices are assumed to be in panel-major format.
4.3.3.1 trsv
There are two versions of the trsv routine, namely 'N' (where the matrix is
considered as not-transposed) and 'T' (where the matrix is considered as trans-
posed). The 'N' version of the routine is deﬁned as
void dtrsv_n_lib(int m, int n, double *pA, int sda, \
int use_inv_diag_A, double *inv_diag_A, double *x, double *y);
where m ≥ n and the top n×n sub-matrix A0 of A is assumed to be invertible.
If m = n, the routine is analogous to the dtrsv routine in standard BLAS. If
m > n, the routine implements the merging of the gemv and trsv routines as
defined in Section 4.3.2. The 'N' version computes the operation (4.9), while
the 'T' version computes the operation (4.10). The matrix A is in the format
returned e.g. by the potrf routine proposed in Section 3.3.3. If used, the
vector inv_diag_A has to contain the inverse of the diagonal elements of the
matrix A0: this vector is returned e.g. by the proposed potrf routine, and it
allows the divisions in the substitution to be replaced by multiplications.
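As an illustration of the 'T' case and of the role of the inverted diagonal, the following minimal sketch computes operation (4.10). It is not the dtrsv_t implementation: the name trsv_gemv_t_ref is hypothetical, the storage is assumed column-major with leading dimension lda (not panel-major), and the vector is updated in place.

```c
#include <stddef.h>

/* Hypothetical reference routine (not the HPMPC interface).
 * A is m x n, column-major, m >= n; A0 (top n x n) is lower
 * triangular, A1 is the bottom (m-n) x n block.
 * inv_diag[j] is assumed to hold 1 / A0(j,j).
 * On entry y holds [x0; x1]; on exit y0 = A0^{-T} (x0 - A1^T x1)
 * and y1 = x1 is left untouched, i.e. operation (4.10). */
void trsv_gemv_t_ref(int m, int n, const double *A, int lda,
                     const double *inv_diag, double *y)
{
    for (int j = n - 1; j >= 0; j--) {
        double s = y[j];
        /* Dot product with column j below the diagonal: rows
         * j+1..n-1 use already computed entries of y0 (trsv part),
         * rows n..m-1 use x1 (gemv part). */
        for (int i = j + 1; i < m; i++)
            s -= A[i + (size_t)lda * j] * y[i];
        /* Multiplication by the stored inverse replaces a division. */
        y[j] = s * inv_diag[j];
    }
}
```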
4.3.3.2 gemv_nt
The symv kernel can be used to implement the 'N' and 'T' versions of gemv in a
single routine, when the matrix in the 'N' and 'T' versions of gemv is the same.
This has the advantage that each matrix element is used twice once it has been
moved into registers (reuse factor equal to 2). The routine is defined as
void dgemv_nt_lib(int m, int n, double *pA, int sda, double *x_n, \
double *x_t, int alg, double *y_n, double *y_t, double *z_n, \
double *z_t);
The interface is analogous to the one of the gemv routines, with the difference
that all vectors for both versions have to be provided at once.
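The reuse argument can be illustrated with a minimal sketch. The function gemv_nt_ref below is hypothetical: it assumes column-major storage with leading dimension lda and an accumulate-in-place convention (z_n += A x_n, z_t += A^T x_t), which is only one of the variants that the alg argument of the real interface might select. The point is that each matrix element is loaded once and used in two multiply-add operations.

```c
#include <stddef.h>

/* Hypothetical reference routine (not the HPMPC interface).
 * A is m x n, column-major with leading dimension lda.
 * Computes z_n += A * x_n (z_n has m entries, x_n has n entries)
 * and z_t += A^T * x_t (z_t has n entries, x_t has m entries)
 * in a single pass over the matrix. */
void gemv_nt_ref(int m, int n, const double *A, int lda,
                 const double *x_n, const double *x_t,
                 double *z_n, double *z_t)
{
    for (int j = 0; j < n; j++) {
        double xn_j = x_n[j];
        double acc_t = 0.0;
        for (int i = 0; i < m; i++) {
            double a = A[i + (size_t)lda * j]; /* loaded once */
            z_n[i] += a * xn_j;                /* 'N' product */
            acc_t += a * x_t[i];               /* 'T' product */
        }
        z_t[j] += acc_t;
    }
}
```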
4.4 Performance of level 2 BLAS routines
This section contains performance plots for the level 2 BLAS routines that
constitute the backbone of the optimization algorithms presented in Part II
and Part III of this thesis. For each routine, four versions are compared: the
version implemented using the techniques presented in this thesis (HPMPC),
the reference BLAS (Netlib BLAS 3.5.0), an optimized vendor BLAS (Intel's
MKL 11.3) and an optimized open-source BLAS (OpenBLAS 0.2.15).
All routines are tested on the same two processors employed in Section 3.5,
implementing the Intel Ivy-Bridge and Intel Haswell micro-architectures
respectively. The former supports the AVX ISA, while the latter supports the
AVX2 and FMA ISAs, which remain the most advanced ISAs supported even by more
recent micro-architectures such as Intel Broadwell and Intel Skylake. The two
test machines have the same memory configuration, namely 8 GB of DDR3/DDR3L
memory in dual-channel configuration (for a total data width of 128 bits),
running at 1600 MHz, which gives a maximum bandwidth of 25.6 GB/s. Therefore,
the difference in performance is solely due to the processors. These tests also
show how well the latest and the previous x86_64 FP ISAs are currently
supported by the different libraries.
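The quoted bandwidth follows from the data width and the transfer rate. The second line is an added back-of-the-envelope estimate, not a result from the tests: it assumes that every matrix element is an 8-byte double streamed once from main memory and used in one multiply-add (2 flops), and it neglects the traffic for the vectors.

```latex
\begin{align*}
\text{bandwidth} &= \frac{128\ \text{bit}}{8\ \text{bit/byte}} \times 1600 \cdot 10^{6}\ \text{transfers/s} = 25.6\ \text{GB/s} \\
\text{gemv rate from main memory} &\lesssim \frac{25.6\ \text{GB/s}}{8\ \text{byte/element}} \times 2\ \text{flops/element} = 6.4\ \text{Gflops}
\end{align*}
```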
The tests are performed in double precision, for square matrices of size n
between 4 and 300, in steps of 4, and are meant to evaluate the performance
for matrix sizes of interest for embedded optimization applications. In all
tests, only one thread is employed: therefore, the single-thread version of the
optimized BLAS libraries is considered.
4.4.1 Performance on Intel Ivy-Bridge micro-architecture
The performance plots are in Figure 4.1. The overall result is that in level 2
BLAS routines, if the matrix data is not already in cache, the cost of moving
the matrix data is the factor limiting performance. Most routines in HPMPC give
very high performance if the matrix data is already in L1 cache (for square
matrices of size up to 64). The performance however decreases to the level
of the other optimized BLAS versions if the matrix data is in L2 cache (for
square matrices of size up to about 180), in L3 cache or in main memory (the
latter case is not shown in the figure). It is interesting to notice that there
is very little drop in performance when streaming data from L3 cache rather
than from L2 cache.
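The matrix sizes quoted above can be related to cache capacities with a small helper. This is a sketch under assumptions that are not stated in this section: a 32 KB L1 data cache and a 256 KB L2 cache per core, 8-byte doubles, and only the matrix (not the vectors) counted. The function name max_square_size is hypothetical.

```c
#include <stddef.h>

/* Largest n such that an n x n matrix of doubles fits in
 * cache_bytes bytes (vectors and other data are ignored). */
int max_square_size(size_t cache_bytes)
{
    int n = 0;
    while ((size_t)(n + 1) * (size_t)(n + 1) * sizeof(double) <= cache_bytes)
        n++;
    return n;
}
```

Under these assumptions, max_square_size(32 * 1024) gives 64 and max_square_size(256 * 1024) gives 181, in line with the sizes 64 and about 180 mentioned in the text.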
The dgemv_nt routine is a custom one and not part of BLAS. It performs two
general matrix-vector products, one with the matrix considered normal and
one with the matrix considered transposed. Therefore, this operation can also
be performed by means of two calls to the dgemv routine in standard BLAS. The
advantage of having a custom routine for this operation (that is employed in the
computation of the residuals of the KKT system) is that every matrix element
is used twice once it is in registers. This gives the custom routine in HPMPC
a good performance advantage over the use of two calls to the dgemv routine in
optimized BLAS. Reference BLAS performs rather poorly.
[Figure 4.1: Performance test for key level 2 BLAS routines on an Intel Core i7
3520M processor (Ivy Bridge micro-architecture, supporting the AVX ISA).
Panels: (a) dgemv_n, (b) dgemv_t, (c) dtrmv, (d) dtrsv, (e) dsymv,
(f) dgemv_nt. Each panel plots the performance in Gflops (y-axis, 0 to 25)
against the matrix size n (x-axis, 0 to 300) for Netlib, HPMPC, MKL and
OpenBLAS.]
4.4.2 Performance on Intel Haswell micro-architecture
The performance plots are in Figure 4.2. Even more than in the case of
level 3 BLAS and LAPACK routines, the recent AVX2 and FMA ISAs do not
seem to be fully exploited in level 2 BLAS routines from optimized BLAS
libraries. Therefore, the routines in HPMPC generally give some performance
advantage over the corresponding routines in optimized BLAS libraries. In this
micro-architecture, it is interesting to notice that, besides the performance
drop when data has to be streamed from L2 cache, there is another performance
drop when data has to be streamed from L3 cache. This performance drop was much
smaller in the Ivy-Bridge micro-architecture, hinting at the fact that, with
respect to the Ivy-Bridge micro-architecture, the L1 and L2 cache bandwidth has
doubled in the Haswell micro-architecture, while the L3 cache bandwidth looks
unchanged. Reference BLAS performs rather well on routines where the matrix is
not transposed, and extremely poorly on routines where the matrix is transposed
(compare e.g. the performance of the 'N' and 'T' versions of the dgemv routine
in Figures 4.2a and 4.2b). This is due to the fact that the 'N' version of the
level 2 BLAS routines is based on the axpy level 1 BLAS routine, which can be
trivially vectorized and which streams the result vector, and therefore has
independent consecutive FMAs. The 'T' version of the level 2 BLAS routines is
instead based on the dot level 1 BLAS routine, which is harder to vectorize
since it requires a final reduction, and which uses a single accumulation
variable in consecutive iterations of the inner loop. Therefore, the 'N'
versions can take advantage of the higher throughput given by the FMA ISA,
while the 'T' versions suffer from the higher latency (the FMA instruction has
a latency of 5 cycles while, when employing unfused instructions, the latency
of the multiplication instruction can be hidden by register renaming and the
addition instruction has a latency of 3 cycles), and therefore give lower
performance with the FMA ISA than with the AVX ISA.
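The dependency-chain argument above can be made concrete with a short sketch. The three functions below are illustrative only (their names are hypothetical and they are not taken from any of the tested libraries): an axpy-style update whose iterations are independent, a dot product with a single accumulator whose iterations form one chain of dependent multiply-adds, and a variant with four independent accumulators, which is a common way to shorten that chain.

```c
/* 'N'-style inner loop: every y[i] is updated independently,
 * so consecutive multiply-adds do not wait for each other. */
void axpy_ref(int n, double alpha, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* 'T'-style inner loop with a single accumulator: each
 * multiply-add depends on the result of the previous one,
 * so the latency of the operation is fully exposed. */
double dot_1acc(int n, const double *x, const double *y)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

/* Same dot product with four independent accumulators:
 * four dependency chains can overlap in the pipeline,
 * at the price of a final reduction. */
double dot_4acc(int n, const double *x, const double *y)
{
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += x[i + 0] * y[i + 0];
        acc1 += x[i + 1] * y[i + 1];
        acc2 += x[i + 2] * y[i + 2];
        acc3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++) /* leftover elements */
        acc0 += x[i] * y[i];
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Note that the two dot variants sum the products in a different order, so in general their results can differ by rounding.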
4.5 Conclusion
This chapter extended to the level 2 BLAS linear algebra routines the
implementation strategy proposed in Chapter 3. In the case of level 2 BLAS
routines, there is less room for optimization, since matrix elements have a
reuse factor of O(1) (and typically exactly 1) and the streaming of data from
main memory is often the bottleneck. However, if the matrices are already in
cache (as may be the case in embedded optimization algorithms), the proposed
implementation strategy gives a noticeable performance improvement.
In the case of level 2 BLAS routines, the bulk of the computation is cast in
terms of the gemv kernel, with the notable exception of the symv-like routines,
that can take advantage of a tailored kernel due to the reuse factor equal to 2.
Numerical tests confirm that, if the matrix data has to be streamed from L2
cache, L3 cache or main memory, there is little difference in performance
between different implementations of level 2 BLAS routines, since the cost of
moving the matrix data is the factor limiting performance. However, if the
matrix data is already present in L1 cache (or L2 for the Haswell
micro-architecture), the routines in HPMPC give a clear performance advantage
over the corresponding routines in optimized BLAS and LAPACK libraries.
Reference BLAS routines (especially the 'T' versions) perform rather poorly
also in the case of level 2 BLAS routines.
[Figure 4.2: Performance test for key level 2 BLAS routines on an Intel Core
i7 4800MQ processor (Haswell micro-architecture, supporting the AVX2 and FMA
ISAs). Panels: (a) dgemv_n, (b) dgemv_t, (c) dtrmv, (d) dtrsv, (e) dsymv,
(f) dgemv_nt. Each panel plots the performance in Gflops (y-axis, 0 to 50)
against the matrix size n (x-axis, 0 to 300) for Netlib, HPMPC, MKL and
OpenBLAS.]
Chapter 5
Optimizing gemm kernels
on diﬀerent architectures
This chapter describes the practical implementation of the gemm kernel on a
number of architectures. Code snippets are included, showing instructions and
implementation methods characteristic of the different architectures.
Depending on the architecture, the code is implemented in either intrinsics or
assembly. The choice between the two is a trade-off between ease of
implementation and control over the resulting object code. Both of them
generally give good control over the instructions that are present in the
actual object code, and over the choice between scalar and vector code.
Assembly code gives full control over instruction scheduling and register
allocation. These features are particularly useful in the case of in-order
processors that do not implement register renaming. In fact, in this case the
order of the instructions in the code corresponds exactly to the order in which
they are executed in the processor. Therefore, if there are dependencies
between two instructions, the processor stalls until the operands of the second
instruction are available. Furthermore, register names have to be used
carefully to avoid the introduction of dependencies.
Generally speaking, intrinsics (or intrinsic functions) are functions whose
implementation is handled specifically by the compiler. In the considered
framework, the icc, gcc and clang compilers (among others) provide C intrinsics
that are mapped directly into processor instructions (scalar and vector), and
that operate on quantities that are mapped into processor registers. This gives
an easy way to guide the compiler in the use of efficient instructions from C,
without the need to write assembly code. As an example, this enables the
explicit use of SIMD instructions, and of operations that are not directly
available in the C language such as fused-multiply-add and shuffle. When
intrinsics are employed, the compiler takes care of instruction scheduling and
register allocation. This generally works well with out-of-order processors
that implement register renaming, but in the case of in-order processors it may
be preferable to have full control over these aspects. Therefore, in this
chapter intrinsics are used only for architectures supporting out-of-order
execution and register renaming.
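As a small illustration of what C intrinsics look like, the sketch below performs the update y ← y + αx two doubles at a time using SSE2 intrinsics from emmintrin.h. It is not taken from the thesis code: the function name daxpy_sketch is hypothetical, and a plain scalar loop is used as fallback (and for the leftover element) when SSE2 is not available at compile time.

```c
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* y[i] += alpha * x[i] for i in [0, n). */
void daxpy_sketch(int n, double alpha, const double *x, double *y)
{
    int i = 0;
#if defined(__SSE2__)
    /* Broadcast alpha to both lanes of a 128-bit register. */
    __m128d va = _mm_set1_pd(alpha);
    for (; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i); /* unaligned load of 2 doubles */
        __m128d vy = _mm_loadu_pd(y + i);
        /* Separate multiply and add: SSE2 has no fused multiply-add. */
        vy = _mm_add_pd(vy, _mm_mul_pd(va, vx));
        _mm_storeu_pd(y + i, vy);
    }
#endif
    /* Scalar fallback and leftover elements. */
    for (; i < n; i++)
        y[i] += alpha * x[i];
}
```

The compiler decides which registers hold va, vx and vy and in which order the instructions are emitted, which is the loss of control discussed above.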
The material contained in this chapter has diverse origins. Detailed
descriptions of many micro-architectures can be found in [1]. Useful
instruction tables (containing e.g. latency, throughput and execution unit) for
x86 and x86_64 architectures can be found in [2]. Examples of highly-optimized
kernels for linear algebra can be found in the open-source OpenBLAS library
[94] or in the open-source BLIS library [6].
Nonetheless, all kernels presented in this chapter have been originally
tailored for the proposed implementation strategy for embedded optimization
(e.g. they assume the panel-major matrix format). Some kernels (e.g. dgemm and
sgemm for the x86 Bonnell, or sgemm for the ARMv7A processors) significantly
outperform the corresponding versions in OpenBLAS.
All performance plots in this chapter are scaled on the y-axis such that the top
of the figure corresponds to the full FP performance. This makes the evaluation
of the performance intuitive: a good routine has a performance plot steadily
close to the top of the figure.
5.1 x86
The x86 is a family of backward compatible instruction sets. Over the years,
many extensions have been added.
In the following, x86 denotes a 32-bit CISC (Complex Instruction Set
Computing) architecture. It features complex addressing modes and instructions
where one of the source operands can be in memory. There are 8 GP (General
Purpose) registers. On the FP side, the SSEx instruction sets up to SSE3
are considered in this thesis. These instruction sets operate on 8 128-bit wide
registers. The instructions take two operands, and thus one of the two source
operands has to be overwritten with the result.
Only natively 32-bit processors are considered. Therefore 64-bit processors
operating in 32-bit mode are not considered. In 32-bit mode, the maximum amount
of byte-addressable memory is limited to 4 GB, and the number of register
names is limited to 8 GP and 8 FP. The memory limit does not affect performance
in embedded optimization, but the smaller number of FP registers does.
In particular, it is generally not possible to reach near full FP throughput if
only 8 FP registers are employed, since the latency of instructions cannot be
completely hidden.
5.1.1 Intel Bonnell (Atom)
The micro-architecture of the ﬁrst generation of Atom processors is called Bon-
nell. Both 32-bit and 64-bit processors are based on this micro-architecture.
Since the test machine is 32-bit, only the 32-bit version is
considered in this thesis.
The focus is on low cost and low power consumption. The Bonnell
micro-architecture has an in-order dual-issue pipeline, and there
is neither register renaming nor speculative execution. As in many modern x86
and x86-64 micro-architectures, the instructions are translated into simpler
internal micro-operations. However, in the case of the Bonnell
micro-architecture, the micro-operations are more CISC-like than RISC-like, as
they can combine an ALU (Arithmetic Logic Unit) operation with a load or a
store. Therefore the Bonnell micro-architecture has many similarities with the
P5 micro-architecture of the original Pentium processor. There are 24 KB of L1
data cache and 32 KB of L1 instruction cache, plus a unified 512 KB L2 cache.
On the FP side, the SSE, SSE2 and SSE3 instruction sets are supported. The
SSE instruction set contains mainly instructions for single-precision computa-
tion on 128-bit wide registers (each holding 4 single-precision FP numbers). The
SSE2 instruction set contains mainly instructions for double-precision compu-
tation on 128-bit wide registers (each holding 2 double-precision FP numbers).
The SSE3 instruction set contains instructions that work horizontally
within a register, e.g. adding or subtracting the elements of a vector register,
or duplicating the lower element across the register. However, the SSE2
instruction set is implemented in a low-power fashion, with some vector
instructions split into two scalar ones.
The small number of registers, the fact that instructions have 2 operands, the
lack of a fused-multiply-add instruction and the lack of advanced features (such
as out-of-order execution and register renaming) make the optimization on
this processor difficult: therefore, the use of inline assembly code is
necessary to obtain good performance. Out of the 8 available FP registers, 4 are
employed as accumulation registers, and the other 4 are used to load A and B
elements, and to hold intermediate results.
Both the dgemm and the sgemm kernels for x86 Bonnell are novel and an
improvement over Goto's original kernels in OpenBLAS.
5.1.1.1 dgemm
The SSE2 instruction set provides support for 2-wide SIMD in double precision.
However, the double-precision vector multiplication has a latency and a
throughput of 9 cycles, making the vector version of dgemm effectively slower
than the scalar version. Therefore, a scalar version of the code is considered
in the following.
The double precision scalar multiplication has a throughput of 2 (i.e. an instruc-
tion can be issued every 2 clock cycles) and the scalar addition a throughput
of 1. Since in the matrix-matrix multiplication there is an equal number of
multiplications and additions, the multiplication is the limiting factor, and on
average also the addition is issued every 2 clock cycles.
Out of the 8 FP registers, 4 can be used to hold a 2×2 sub-matrix of C, while
the other 4 registers are used to hold elements from A and B, and intermediate
results from the multiplications. A 2×2 kernel is employed, where 2 is the
height of the panels in the panel-major matrix format.
The optimized code of an iteration over k is
1: addsd %%xmm4, %%xmm2
2: movsd 16(%%edx), %%xmm4
3: mulsd 16(%%edx), %%xmm6
4: addsd %%xmm7, %%xmm3
5: movsd 24(%%eax), %%xmm7
6: mulsd 24(%%eax), %%xmm4
7: addsd %%xmm5, %%xmm1
8: movsd 24(%%edx), %%xmm5
9: mulsd 24(%%edx), %%xmm7
10: addsd %%xmm6, %%xmm0
11: movsd 32(%%eax), %%xmm6
12: mulsd 16(%%eax), %%xmm5
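where the right operand is overwritten with the result of the instruction.
Registers xmm0 to xmm3 are used to hold a sub-matrix of D,
\[
\begin{bmatrix} \texttt{xmm0} & \texttt{xmm1} \\ \texttt{xmm2} & \texttt{xmm3} \end{bmatrix}
\leftarrow
\begin{bmatrix} d_{00} & d_{01} \\ d_{10} & d_{11} \end{bmatrix}
\]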
The instructions addsd and mulsd perform respectively the scalar
double-precision FP addition and multiplication. The instruction movsd can
load, store or move between registers a scalar double-precision FP number.
Since there is neither a fused-multiply-accumulate instruction nor out-of-order
execution, the general idea behind the optimization is to hide the
multiplication latency by having enough independent instructions between each
multiplication and the corresponding addition, with the constraint of the
limited number of registers. In this case, the result of the multiplication at
line 3 is held in the register xmm6, and the corresponding addition takes place
at line 10: in this way a latency of 5 cycles can be completely hidden.
At line 3, the choice of taking one of the operands from memory instead of from
the register xmm4 loaded at the previous line is justified by the fact that in
this way there are fewer dependent instructions, and that the maximum capacity
of one load per cycle is not exceeded.
A performance test shows that the maximum performance is attained when the
data ﬁts the L1 cache size, hinting at the fact that hardware data prefetch is not
implemented. If software prefetch is employed, the high performance is attained
also for data ﬁtting into L2 cache, since the memory bandwidth between L2 and
L1 cache is big enough to feed the processor, and the latency is hidden by moving
the data into L1 cache before it is needed.
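The structure of the 2×2 kernel can be sketched in portable C. This is a hedged reference version, not the assembly kernel: the name kernel_dgemm_nt_2x2_ref and the prefetch distance PF_DIST are hypothetical, the operation is assumed to be D = C + A·Bᵀ, and the panels of A and B are assumed to store the two entries for each index k contiguously (A[i + 2*k], B[j + 2*k]). The four scalar accumulators play the role of registers xmm0 to xmm3, and the optional prefetch is a stand-in for the software prefetch mentioned above.

```c
#define PF_DIST 8 /* hypothetical prefetch distance, in k iterations */

/* Hypothetical reference 2x2 kernel: D = C + A * B^T, where A and B
 * are 2 x kmax panels with element (i,k) at index i + 2*k, and
 * C, D are 2 x 2 with element (i,j) at index i + 2*j. */
void kernel_dgemm_nt_2x2_ref(int kmax, const double *A, const double *B,
                             const double *C, double *D)
{
    /* Accumulators for the 2 x 2 sub-matrix (xmm0..xmm3 in the text). */
    double d00 = 0.0, d01 = 0.0, d10 = 0.0, d11 = 0.0;

    for (int k = 0; k < kmax; k++) {
#if defined(__GNUC__)
        /* Software prefetch of data needed a few iterations ahead. */
        if (k + PF_DIST < kmax) {
            __builtin_prefetch(A + 2 * (k + PF_DIST));
            __builtin_prefetch(B + 2 * (k + PF_DIST));
        }
#endif
        double a0 = A[0 + 2 * k], a1 = A[1 + 2 * k];
        double b0 = B[0 + 2 * k], b1 = B[1 + 2 * k];
        /* Four independent multiply-add pairs per iteration. */
        d00 += a0 * b0;
        d10 += a1 * b0;
        d01 += a0 * b1;
        d11 += a1 * b1;
    }

    D[0] = C[0] + d00;
    D[1] = C[1] + d10;
    D[2] = C[2] + d01;
    D[3] = C[3] + d11;
}
```

In the assembly version the same four updates are interleaved by hand so that each multiplication result is consumed several instructions later; here that scheduling is left to the compiler.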
5.1.1.2 sgemm
The SSE instruction set provides support for 4-wide SIMD in single precision.
Both the 4-wide vector multiplication and addition have a throughput of 2 (i.e.
one instruction can be issued every 2 clock cycles), while the scalar versions
have a throughput of 1. Therefore in single precision it is advantageous to use
the vector version, which has twice the full FP throughput of the scalar one.
Again, out of the 8 FP registers, 4 can be used to hold a sub-matrix of D and 4
for intermediate results. Since each register can hold 4 floats, a 4×4
sub-matrix of D can be held in 4 registers. This means that a 4×4 kernel is
employed, and the panel height is 4.
The optimized code of an iteration over k is
1: addps %%xmm1, %%xmm5
2: movaps %%xmm0, %%xmm1
3: shufps $0, %%xmm0, %%xmm0
4: mulps 16(%%eax), %%xmm0
5: addps %%xmm2, %%xmm6
6: movaps %%xmm1, %%xmm2
7: shufps $85, %%xmm1, %%xmm1
8: mulps 16(%%eax), %%xmm1
9: addps %%xmm3, %%xmm7
10: movaps %%xmm2, %%xmm3
11: shufps $170, %%xmm2, %%xmm2
12: mulps 16(%%eax), %%xmm2
13: addps %%xmm0, %%xmm4
14: movaps 32(%%edx), %%xmm0
15: shufps $255, %%xmm3, %%xmm3
16: mulps 16(%%eax), %%xmm3
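Registers xmm4, xmm5, xmm6, xmm7 hold respectively the first, second, third
and fourth column of the 4×4 sub-matrix of D, each consisting of 4 elements:
\[
\texttt{xmm4} \leftarrow \begin{bmatrix} d_{00} \\ d_{10} \\ d_{20} \\ d_{30} \end{bmatrix}, \quad
\texttt{xmm5} \leftarrow \begin{bmatrix} d_{01} \\ d_{11} \\ d_{21} \\ d_{31} \end{bmatrix}, \quad
\texttt{xmm6} \leftarrow \begin{bmatrix} d_{02} \\ d_{12} \\ d_{22} \\ d_{32} \end{bmatrix}, \quad
\texttt{xmm7} \leftarrow \begin{bmatrix} d_{03} \\ d_{13} \\ d_{23} \\ d_{33} \end{bmatrix}.
\]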
Instructions addps and mulps perform respectively the vector (packed)
single-precision FP addition and multiplication. Instruction movaps can load,
store or move between registers a vector of 4 packed single-precision FP
numbers. Instruction shufps shuffles the content of vectors of 4 packed
single-precision FP numbers.
In the x86 (and in the following x86_64) architecture, a scalar instruction can
operate only on the lower element of a register, while a vector instruction can
operate only on all elements of a register at the same time (and not on
sub-vectors). Thus shuffle instructions are needed to reorder the elements
within a register. In particular, in the above code, assuming that before line
1 the xmm0 register has been loaded with the vector
\[
\texttt{xmm0} \leftarrow \begin{bmatrix} b_{0k} \\ b_{1k} \\ b_{2k} \\ b_{3k} \end{bmatrix}
\]
then the shuffle instruction at line 3 broadcasts the element b0k to all
elements of the vector xmm0, and similarly the shuffle instructions at lines 7,
11 and 15 broadcast respectively the elements b1k, b2k and b3k to all elements
of vectors xmm1, xmm2 and xmm3. The move instructions at lines 2, 6 and 10 save
the original content of register xmm0 (before the shuffle instructions destroy
its value) into registers xmm1, xmm2 and xmm3 as soon as they are free after
the immediately preceding addition. The move instruction at line 14 loads xmm0
with a new vector from matrix B.
The vector of elements from the A matrix used in each multiplication
instruction has to be fetched from memory even if it is the same for all 4
multiplications, since the small number of registers (8) prevents the use of an
extra register to hold its value.
Also in single precision software prefetch has to be employed to obtain the best
performance for data ﬁtting L2 cache.
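The broadcast-and-multiply scheme described above can be emulated in scalar C to make the data flow explicit. The sketch below is only illustrative: the name kernel_sgemm_nt_4x4_ref is hypothetical, the operation is assumed to be D ← D + A·Bᵀ, and A, B are assumed to be 4 × kmax panels with element (i,k) at index i + 4*k. Each pass of the j loop corresponds to one shufps broadcast of b_jk followed by a mulps with the column of A and an addps into the j-th accumulation register.

```c
/* Hypothetical reference 4x4 kernel: D += A * B^T, where A and B are
 * 4 x kmax panels with element (i,k) at index i + 4*k, and D is 4 x 4
 * with element (i,j) at index i + 4*j (one column per "register"). */
void kernel_sgemm_nt_4x4_ref(int kmax, const float *A, const float *B,
                             float *D)
{
    for (int k = 0; k < kmax; k++) {
        for (int j = 0; j < 4; j++) {
            /* Broadcast of b_jk to all lanes (shufps in the text). */
            float b = B[j + 4 * k];
            /* Column of A times the broadcast value, accumulated
             * into column j of D (mulps followed by addps). */
            for (int i = 0; i < 4; i++)
                D[i + 4 * j] += A[i + 4 * k] * b;
        }
    }
}
```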
5.1.1.3 Results
The test processor is the Intel Atom N270, with one core at 1.6 GHz. As
shown in Figure 5.1, the dgemm and sgemm kernels attain a best performance
of respectively 1.34 Gﬂops (83% of full FP throughput) and 4.63 Gﬂops (72%
of full FP throughput). There is a big performance boost over the OpenBLAS
implementation, that has a best performance respectively of 0.92 Gﬂops (57%)
and 2.63 Gﬂops (41%). This is due to the better instruction scheduling of the
proposed implementation schemes, compared to the kernels in OpenBLAS.
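The quoted percentages follow from the peak values of 1.6 Gflops (double precision) and 6.4 Gflops (single precision) stated in the caption of Figure 5.1. As a quick check, with the text apparently truncating rather than rounding the results:

```latex
\begin{align*}
\text{HPMPC dgemm:} \quad & 1.34 / 1.6 \approx 0.837, &
\text{HPMPC sgemm:} \quad & 4.63 / 6.4 \approx 0.723, \\
\text{OpenBLAS dgemm:} \quad & 0.92 / 1.6 = 0.575, &
\text{OpenBLAS sgemm:} \quad & 2.63 / 6.4 \approx 0.411.
\end{align*}
```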
5.2 x86_64
The x86_64 is the 64-bit version of the x86 instruction set. The original speci-
ﬁcation has been created by AMD and released in 2000.
[Figure 5.1: Performance test of different implementations of gemm for square
matrices of size n × n, n ∈ [4, 300], on an Intel Atom N270, code
compiled with gcc 4.6.3. Peak performance in double (single) precision is
1 · 1.6 = 1.6 Gflops (4 · 1.6 = 6.4 Gflops). Panels: (a) Atom dgemm
(HPMPC_kernel_2x2 and OpenBLAS, y-axis 0 to 1.6 Gflops), (b) Atom sgemm
(HPMPC_kernel_4x4 and OpenBLAS, y-axis 0 to 6 Gflops); x-axis: n from 0 to
300.]
In the following, x86-64 denotes a 64-bit CISC architecture. The amount
of byte-addressable memory is now equal to 2^64 bytes, which is well beyond the
amount of memory available on today's computers. There are 16 GP registers. On
the FP side, there are several instruction sets such as the SSEx ISAs up to
SSE4.2, the AVX and AVX2 ISAs (all supporting 16 FP registers), and the
forthcoming AVX512 ISA (supporting 32 FP registers). Generally speaking, the
larger number of registers makes optimization easier. In particular, in most
architectures, near full FP throughput can be achieved only in 64-bit mode.
5.2.1 Intel Core
The Core architecture (codename Merom) arrived on the market in 2006 as a "tock"
(a new micro-architecture). It is implemented on the Intel 65 nm process. The
following "tick" (codename Penryn) is a die shrink implemented on the Intel
45 nm process.
The Core micro-architecture is an evolution of the older P6 micro-architecture
(used in the Pentium Pro, Pentium II, Pentium III), after the false step
represented by the NetBurst micro-architecture (used in the Pentium IV). While
the NetBurst micro-architecture focused on improving performance by achieving
high frequencies (e.g. by employing very long pipelines) at the expense
of efficiency and power consumption, the Core micro-architecture focuses on
efficiency and on increasing the work done per clock cycle.
On the ISA side, there are not many new features compared to NetBurst. Merom
supports SSEx ISAs up to SSSE3. Penryn also supports SSE4.1, which contains
specialized instructions and therefore has limited impact on FP performance:
in this framework the only notable new instruction is blend, which however in
Penryn has the same latency and throughput as the more general shuffle
instruction that is part of the SSE and SSE2 ISAs. All SSEx ISAs have
two-operand instructions.
However, the Core micro-architecture considerably differs from the NetBurst
micro-architecture. The Core micro-architecture is a 64-bit superscalar
processor that can issue 4 instructions per clock cycle. It is a dual-core
design that drops the SMT (simultaneous multi-threading) introduced in the
NetBurst micro-architecture and re-introduced in the following Nehalem
micro-architecture. It supports speculative and out-of-order execution, and
register renaming on both GP and FP registers. There are 32 KB of 8-way
associative L1 instruction cache and 32 KB of 8-way associative L1 data cache
per core. There is also an L2 cache shared between CPU cores.
On the FP side, the Core micro-architecture has 128-bit wide FP execution
units, and therefore it can issue 128-bit wide SIMD instructions in one clock
cycle (that are therefore fully pipelined). As a comparison, the NetBurst
micro-architecture has 64-bit wide FP execution units, and therefore 128-bit
wide SIMD instructions are internally split into two instructions. The Core
micro-architecture can sustain a 128-bit FP multiplication and a 128-bit FP
addition every clock cycle. However, FP shuffles are issued on the same
execution ports as the multiplication (shufpd) and addition (shufps), and this
seriously affects the performance of the gemm kernel. Therefore, the universal
shuffle instruction pshufd is preferred, since it is issued on a different
execution port, and can therefore be co-issued with a FP multiplication and a
FP addition at each clock cycle. The Core micro-architecture has one 128-bit
load unit and one 128-bit store unit.
Multiplication has a latency of 5 clock cycles in double precision and 4 in
single precision, and a throughput of 1 in both precisions, while addition has
a latency of 3 clock cycles and a throughput of 1 in both precisions. Since the
processor can perform register renaming, at least 3 registers have to be used
as accumulation registers, to hide the addition latency. However, in practice
it is convenient to use more registers in order to increase the reuse factor
and therefore reduce the number of memory operations. Since there are 16 FP
registers in both single and double precision, in the gemm implementation scheme
8 registers are used to hold a sub-matrix of C, while the other 8 registers are
used to hold elements from A and B, and intermediate results.
5.2.1.1 dgemm
Each 128-bit register can hold 2 double-precision FP numbers. In double preci-
sion, both the multiplication and the addition instructions have a throughput of
1 instruction per clock cycle, and at each clock cycle both a multiplication and
an addition can be issued. Therefore in double precision the full FP throughput
of the Core architecture is 4 ﬂops per clock cycle.
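The flop count per cycle can be written out explicitly. The clock frequency of 2.4 GHz used below is the nominal frequency of the test processor reported in the results of this section; the single-precision line anticipates the analogous count for 4 lanes per register.

```latex
\begin{align*}
\text{double precision:} \quad & 2\ \text{lanes} \times (1\ \text{mul} + 1\ \text{add}) = 4\ \text{flops/cycle}, &
4 \times 2.4\ \text{GHz} &= 9.6\ \text{Gflops}, \\
\text{single precision:} \quad & 4\ \text{lanes} \times (1\ \text{mul} + 1\ \text{add}) = 8\ \text{flops/cycle}, &
8 \times 2.4\ \text{GHz} &= 19.2\ \text{Gflops}.
\end{align*}
```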
In the dgemm optimization, 8 out of 16 registers can be used to hold a 4 × 4
sub-matrix of D. Thus a 4× 4 kernel is employed, with panel height of bs = 4.
The Core architecture is out of order, which means that instructions can be
reordered on the fly by the processor. However, the Reorder Buffer (ROB) size is
rather limited, and therefore the instruction order cannot differ too much from
the optimal one. This is an argument in favor of the use of assembly instead
of C intrinsics. Furthermore, for performance reasons it is convenient to use
the universal shuffle instruction pshufd instead of the FP version shufpd. The
intrinsic function making use of the pshufd instruction is _mm_shuffle_epi32,
which operates on 32-bit integers (therefore casts would be needed) and has a
single register operand, while the actual pshufd instruction has two register
operands.
Therefore, the code is better written in assembly. The optimized code of an
iteration over k is:
addpd %%xmm6, %%xmm10
movaps 16(%%rbx), %%xmm6
addpd %%xmm3, %%xmm14
movaps %%xmm2, %%xmm3
pshufd $0x4e, %%xmm2, %%xmm7
mulpd %%xmm0, %%xmm2
mulpd %%xmm1, %%xmm3
addpd %%xmm4, %%xmm11
addpd %%xmm5, %%xmm15
movaps %%xmm7, %%xmm5
mulpd %%xmm0, %%xmm7
mulpd %%xmm1, %%xmm5
addpd %%xmm2, %%xmm8
movaps 32(%%rbx), %%xmm2
addpd %%xmm3, %%xmm12
movaps %%xmm6, %%xmm3
pshufd $0x4e, %%xmm6, %%xmm4
mulpd %%xmm0, %%xmm6
mulpd %%xmm1, %%xmm3
addpd %%xmm7, %%xmm9
addpd %%xmm5, %%xmm13
movaps %%xmm4, %%xmm5
mulpd %%xmm0, %%xmm4
movaps 32(%%rax), %%xmm0
mulpd %%xmm1, %%xmm5
movaps 48(%%rax), %%xmm1
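The registers %%xmm8 to %%xmm15 are used as accumulation registers, holding
a permutation of a 4 × 4 sub-matrix of D. Namely, the registers %%xmm8 to
%%xmm11 hold the top 2 × 4 sub-matrix, as
\[
\texttt{xmm8} \leftarrow \begin{bmatrix} d_{00} \\ d_{11} \end{bmatrix}, \quad
\texttt{xmm9} \leftarrow \begin{bmatrix} d_{01} \\ d_{10} \end{bmatrix}, \quad
\texttt{xmm10} \leftarrow \begin{bmatrix} d_{02} \\ d_{13} \end{bmatrix}, \quad
\texttt{xmm11} \leftarrow \begin{bmatrix} d_{03} \\ d_{12} \end{bmatrix}
\]
and similarly for registers %%xmm12 to %%xmm15, holding the bottom 2 × 4
sub-matrix.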
The other registers are used to hold elements from A and B, and to hold inter-
mediate results.
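The permuted layout of the accumulators can be understood by emulating the 2-lane registers with small arrays. The sketch below is illustrative only and covers just the 2×2 corner held in xmm8 and xmm9: the name kernel_2x2_swap_ref is hypothetical, the operation is assumed to be D ← D + A·Bᵀ, and panels of height 2 are assumed for brevity (the actual kernel uses bs = 4). Multiplying a pair of A elements by a pair of B elements gives the diagonal entries, while multiplying by the swapped pair (the effect of pshufd with immediate 0x4e, which exchanges the two 64-bit halves) gives the off-diagonal entries, so no broadcast is needed.

```c
/* Hypothetical emulation of the swap scheme on a 2 x 2 block:
 * D += A * B^T, with A and B stored as 2 x kmax panels
 * (element (i,k) at index i + 2*k) and D as 2 x 2
 * (element (i,j) at index i + 2*j). */
void kernel_2x2_swap_ref(int kmax, const double *A, const double *B,
                         double *D)
{
    double acc_diag[2] = {0.0, 0.0}; /* [d00, d11], like xmm8 */
    double acc_anti[2] = {0.0, 0.0}; /* [d01, d10], like xmm9 */

    for (int k = 0; k < kmax; k++) {
        const double a[2] = {A[0 + 2 * k], A[1 + 2 * k]};
        const double b[2] = {B[0 + 2 * k], B[1 + 2 * k]};
        /* Effect of pshufd $0x4e: swap the two 64-bit lanes. */
        const double b_swap[2] = {b[1], b[0]};

        for (int l = 0; l < 2; l++) {
            acc_diag[l] += a[l] * b[l];      /* a0*b0, a1*b1 */
            acc_anti[l] += a[l] * b_swap[l]; /* a0*b1, a1*b0 */
        }
    }

    /* Undo the permutation when writing the result back. */
    D[0 + 2 * 0] += acc_diag[0]; /* d00 */
    D[1 + 2 * 1] += acc_diag[1]; /* d11 */
    D[0 + 2 * 1] += acc_anti[0]; /* d01 */
    D[1 + 2 * 0] += acc_anti[1]; /* d10 */
}
```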
5.2.1.2 sgemm
Each 128-bit register can hold 4 single-precision FP numbers. In single precision,
both the multiplication and the addition instructions have a throughput of 1
instruction per clock cycle, and at each clock cycle both a multiplication and
an addition can be issued. Therefore in single precision the full FP throughput
of the Core architecture is 8 ﬂops per clock cycle.
In the sgemm optimization, 8 out of 16 registers can be used to hold a 8 × 4
sub-matrix of D. Thus a 8× 4 kernel is employed, with panel height of bs = 4.
The arguments in favor of the choice of assembly over C intrinsics in the double
precision case still hold in the single precision case. The optimized code of an
iteration over k is:
addps %%xmm6, %%xmm10
addps %%xmm3, %%xmm14
movaps %%xmm2, %%xmm3
pshufd $0x39, %%xmm2, %%xmm7
mulps %%xmm0, %%xmm2
mulps %%xmm1, %%xmm3
addps %%xmm4, %%xmm11
addps %%xmm5, %%xmm15
movaps %%xmm7, %%xmm5
pshufd $0x39, %%xmm7, %%xmm6
mulps %%xmm0, %%xmm7
mulps %%xmm1, %%xmm5
addps %%xmm2, %%xmm8
movaps 16(%%rbx), %%xmm2
addps %%xmm3, %%xmm12
movaps %%xmm6, %%xmm3
pshufd $0x39, %%xmm6, %%xmm4
mulps %%xmm0, %%xmm6
mulps %%xmm1, %%xmm3
addps %%xmm7, %%xmm9
addps %%xmm5, %%xmm13
movaps %%xmm4, %%xmm5
mulps %%xmm0, %%xmm4
movaps 16(%%rax), %%xmm0
mulps %%xmm1, %%xmm5
movaps 16(%%rcx), %%xmm1
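The implementation scheme is analogous to the double precision one, just ported
to single precision. The registers %%xmm8 to %%xmm15 are used as accumulation
registers, holding a permutation of an 8 × 4 sub-matrix of D. Namely, the registers
%%xmm8 to %%xmm11 hold the top 4 × 4 sub-matrix, as
\[
\texttt{xmm8} \leftarrow \begin{bmatrix} d_{00} \\ d_{11} \\ d_{22} \\ d_{33} \end{bmatrix}, \quad
\texttt{xmm9} \leftarrow \begin{bmatrix} d_{01} \\ d_{12} \\ d_{23} \\ d_{30} \end{bmatrix}, \quad
\texttt{xmm10} \leftarrow \begin{bmatrix} d_{02} \\ d_{13} \\ d_{20} \\ d_{31} \end{bmatrix}, \quad
\texttt{xmm11} \leftarrow \begin{bmatrix} d_{03} \\ d_{10} \\ d_{21} \\ d_{32} \end{bmatrix}
\]
and similarly for registers %%xmm12 to %%xmm15, holding the bottom 4 × 4
sub-matrix.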
The other registers are used to hold elements from A and B, and to hold inter-
mediate results.
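The rotated accumulation layout can be checked with a scalar model. The sketch below is not the thesis kernel: it emulates `pshufd $0x39` on 4-element float arrays and performs one k-iteration for the top 4 × 4 block only. The function names `rotate_0x39`, `rank1_update_rotated` and `unpermute` are hypothetical, introduced for illustration.

```c
#include <assert.h>

/* Scalar model of `pshufd $0x39` on four 32-bit lanes:
   dst[i] = src[(0x39 >> (2*i)) & 3], i.e. dst = {src[1], src[2], src[3], src[0]}. */
void rotate_0x39(const float src[4], float dst[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = src[(0x39 >> (2 * i)) & 3];
}

/* One k-iteration for the top 4 x 4 block.
   acc[r] models register xmm(8+r); a holds 4 elements of a column of A and
   b holds 4 elements of B. After the update, lane i of acc[r] has
   accumulated a[i] * b[(i + r) % 4], that is element d(i, (i + r) % 4). */
void rank1_update_rotated(float acc[4][4], const float a[4], const float b[4])
{
    float cur[4], next[4];
    for (int i = 0; i < 4; i++)
        cur[i] = b[i];
    for (int r = 0; r < 4; r++) {
        for (int i = 0; i < 4; i++)
            acc[r][i] += a[i] * cur[i];
        rotate_0x39(cur, next);
        for (int i = 0; i < 4; i++)
            cur[i] = next[i];
    }
}

/* Undo the permutation: d[i][j] is stored in lane i of acc[(j - i) mod 4]. */
void unpermute(float acc[4][4], float d[4][4])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            d[i][j] = acc[(j - i + 4) % 4][i];
}
```

With this indexing, `acc[1]` holds d01, d12, d23, d30 and `acc[3]` holds d03, d10, d21, d32, matching the layouts of xmm9 and xmm11 given above.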
[Plots: Gflops versus matrix size n, HPMPC_kernel_4x4 (dgemm) and HPMPC_kernel_8x4 (sgemm) compared with OpenBLAS. (a) Intel Core dgemm. (b) Intel Core sgemm.]
Figure 5.2: Performance test of gemm kernel for square matrices of size n × n,
n ∈ [4, 300], on an Intel Core, code compiled with gcc 4.8.4. Peak
performance in double (single) precision is 4 · 2.4 = 9.6 Gflops
(8 · 2.4 = 19.2 Gflops).
5.2.1.3 Results
The test machine is a laptop equipped with the Intel Core 2 Duo P8600 processor
running at a nominal maximum frequency of 2.4 GHz, and equipped with
3 MB of L2 cache. This processor is an implementation of the Penryn
micro-architecture. Figure 5.2 contains the performance plot of the gemm routine. The
dgemm kernel attains a maximum performance of 9.68 Gflops, which is 100.8%
of full FP throughput: it therefore looks like the processor is running slightly
faster than the nominal frequency. The sgemm kernel attains a maximum
performance of 18.8 Gflops, which is 98% of the full FP throughput. As a reference,
OpenBLAS attains a maximum performance of 9.09 Gflops (94.6%) in double
precision and 18.03 Gflops (93.9%) in single precision. More important than
the best absolute performance, the proposed implementation attains a very high
performance already for small matrices, being around 50% for matrices of size
8 and very close to the full throughput for matrices of size 16.
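The percentages quoted in this section follow from a simple relation: peak Gflops equals flops per clock cycle times the clock frequency in GHz, and the efficiency is the measured value divided by that peak. A minimal sketch of this arithmetic follows; the function names `peak_gflops` and `efficiency_percent` are hypothetical.

```c
#include <assert.h>

/* Peak throughput in Gflops = flops issued per clock cycle * clock in GHz. */
double peak_gflops(int flops_per_cycle, double freq_ghz)
{
    assert(flops_per_cycle > 0 && freq_ghz > 0.0);
    return flops_per_cycle * freq_ghz;
}

/* Fraction of the peak attained by a measurement, as a percentage. */
double efficiency_percent(double measured_gflops, double peak)
{
    assert(peak > 0.0);
    return 100.0 * measured_gflops / peak;
}
```

For the Intel Core test machine, `peak_gflops(4, 2.4)` gives the 9.6 Gflops double precision peak, and `efficiency_percent(9.68, 9.6)` reproduces the 100.8% figure reported for the dgemm kernel.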
5.2.2 Intel Nehalem
The Nehalem architecture arrived on the market in 2008 as a "tock" (a new
micro-architecture). It is implemented on the same Intel 45 nm planar process as the
Penryn architecture, which is a "tick" (shrink to a new process technology) of the
Merom micro-architecture. The following Westmere architecture is a "tick" of
Nehalem, making use of the Intel 32 nm planar process.
On the ISA side, Nehalem does not introduce many improvements. It introduces
the SSE4.2 ISA, that is of no interest in linear algebra routines implementation.
The Nehalem micro-architecture shares with its predecessor (Core) the fact that
it is a 64-bit superscalar processor that can issue 4 instructions per clock cycle.
It re-introduces SMT with two threads per physical core. It supports speculative
and out-of-order execution, and register renaming on both GP and FP registers.
There are 32 KB of 4-way associative L1 instruction cache and 32 KB of 8-way
associative L1 data cache per core, plus 256 KB of 8-way associative L2 cache
that is now per core. It also introduces an L3 cache shared between CPU cores.
The Nehalem micro-architecture introduces big improvements over the Core
micro-architecture in all aspects except the execution units. Therefore, the full FP
throughput is the same as for the Core micro-architecture. Furthermore, since the
proposed linear algebra implementation for embedded optimization does not
explicitly consider caches, TLBs or memory features, the implementation scheme is
the same as the one developed for the Core micro-architecture. There are only two
small differences. First, a kernel specially designed for the Nehalem
micro-architecture can be slightly simplified by using the FP shuffle instructions
shufpd and shufps (now issued on a different port compared to FP multiplication
and FP addition) instead of the universal shuffle instruction pshufd. This, together
with the bigger reorder windows, should allow an intrinsics version of the code to
obtain reasonably good performance, without the need for inline assembly. Second,
the blend instruction can be issued on two ports, and therefore long sequences of
blend instructions can be processed faster. However, the differences are small, and
the implementation scheme for the Core micro-architecture gives excellent
performance.
5.2.3 Intel Sandy-Bridge
The Sandy-Bridge architecture arrived on the market in 2011 as a "tock" (a new
micro-architecture). It is implemented on the same Intel 32 nm planar process as
the Westmere architecture, which is a "tick" (shrink to a new process technology)
of the Nehalem micro-architecture. Sandy-Bridge is found in 2nd generation
Intel Core processors. The following Ivy-Bridge architecture (found in 3rd
generation Intel Core processors) is a "tick" of Sandy-Bridge, making use of the
Intel 22 nm FinFET process.
On the ISA side, Sandy-Bridge presents many improvements. The change most
useful to numerical code is the introduction of the AVX ISA (providing 256-bit
wide FP SIMD). This ISA doubles the width of the various SSE (i.e. SSE, SSE2,
SSE3, SSE4.1, SSE4.2) ISAs, and introduces 3- and 4-operand instructions
(while the various SSE ISAs have 2-operand instructions).
The Sandy-Bridge micro-architecture shares with its predecessor (Nehalem) the
fact that it is a 64-bit superscalar processor that can issue 4 instructions per
clock cycle. It supports two threads per physical core, speculative and out-of-
order execution, and register renaming on both GP and FP registers. There are
32 KB of 8-way associative L1 instruction cache and 32 KB of 8-way associative
L1 data cache per core, plus 256 KB of 8-way associative L2 cache per core.
There is also an L3 cache shared between CPU cores.
On the FP side, the Sandy-Bridge architecture introduces the new AVX instruction
set, which extends the FP registers from 128-bit to 256-bit, and introduces
new 3- and 4-operand instructions. There are 6 execution ports. In particular,
there are separate ports for SIMD multiplication, SIMD addition, and SIMD
shuffle, allowing a 256-bit wide multiplication, a 256-bit wide addition and a
256-bit wide shuffle to be co-issued at each clock cycle: this doubles the full
FP throughput compared to the previous micro-architecture. In order to feed
the wider 256-bit AVX units, the Sandy-Bridge micro-architecture has two 128-bit
load units and one 128-bit store unit. However, it has only two address
generation units, and therefore it can sustain two 128-bit loads or one 128-bit load
and one 128-bit store per clock cycle. This means that the Sandy-Bridge
micro-architecture can sustain the load of a 256-bit wide register per clock cycle, but
it cannot sustain the store of a 256-bit wide register per clock cycle. As a
comparison, the Nehalem micro-architecture has one 128-bit load unit and one
128-bit store unit, and it can sustain a 128-bit load and a 128-bit store per clock
cycle.
In both single and double precision, multiplication has a latency of 5 clock
cycles and a throughput of 1 instruction per clock cycle, while addition has a
latency of 3 clock cycles and a throughput of 1 instruction per clock cycle. Since
the processor can perform register renaming, at least 3 registers have to be used
as accumulation registers, to hide the addition latency. However, in practice it is
convenient to use more registers in order to increase the reuse factor and
therefore reduce the number of memory operations.
Since there are 16 FP registers in both single and double precision, in the gemm
implementation scheme 8 registers are used to hold a sub-matrix of C, while
the other 8 registers are used to hold elements from A and B, and intermediate
results.
5.2.3.1 dgemm
Each 256-bit register can hold 4 double-precision FP numbers. In double preci-
sion, both the multiplication and the addition instructions have a throughput of
1 instruction per clock cycle, and at each clock cycle both a multiplication and
an addition can be issued. Therefore, in double precision the full FP throughput
of the Sandy-Bridge architecture is 8 flops per clock cycle.
In the dgemm optimization, 8 out of 16 registers can be used to hold an 8 × 4
sub-matrix of D. Thus an 8 × 4 kernel is employed, with a panel height of bs = 4.
The Sandy-Bridge architecture is out of order, which means that instructions can
be reordered on the fly by the processor. Therefore the order of instructions is not
critical, and the compiler can take care of instruction scheduling. Furthermore,
the Sandy-Bridge architecture implements register renaming. This means that
a single register name can be used for all intermediate results, since it is mapped
to different physical registers.
Therefore, the code can be safely written using intrinsics. The optimized code
of an iteration over k is:
tmp = _mm256_mul_pd( a_0123, b_0 );
b_1 = _mm256_shuffle_pd( b_0, b_0, 0x5 ); // b_1032
A_0 = _mm256_load_pd( &A0[4] ); // prefetch
d_0 = _mm256_add_pd( d_0, tmp );
tmp = _mm256_mul_pd( a_4567, b_0 );
b_0 = _mm256_load_pd( &B[4] ); // prefetch
d_4 = _mm256_add_pd( d_4, tmp );
tmp = _mm256_mul_pd( a_0123, b_1 );
b_2 = _mm256_permute2f128_pd( b_1, b_1, 0x1 ); // b_3210
A_4 = _mm256_load_pd( &A1[4] ); // prefetch
d_1 = _mm256_add_pd( d_1, tmp );
tmp = _mm256_mul_pd( a_4567, b_1 );
d_5 = _mm256_add_pd( d_5, tmp );
tmp = _mm256_mul_pd( a_0123, b_2 );
b_1 = _mm256_shuffle_pd( b_2, b_2, 0x5 ); // b_2301
d_3 = _mm256_add_pd( d_3, tmp );
tmp = _mm256_mul_pd( a_4567, b_2 );
d_7 = _mm256_add_pd( d_7, tmp );
tmp = _mm256_mul_pd( a_0123, b_1 );
d_2 = _mm256_add_pd( d_2, tmp );
tmp = _mm256_mul_pd( a_4567, b_1 );
d_6 = _mm256_add_pd( d_6, tmp );
The registers d_0 to d_7 are the accumulation registers, holding a permutation
of an 8 × 4 sub-matrix of D. Namely, the registers d_0 to d_3 contain the
permutation of the upper 4 × 4 block, as
\[
\texttt{d\_0} \leftarrow \begin{bmatrix} d_{00} \\ d_{11} \\ d_{22} \\ d_{33} \end{bmatrix}, \quad
\texttt{d\_1} \leftarrow \begin{bmatrix} d_{01} \\ d_{10} \\ d_{23} \\ d_{32} \end{bmatrix}, \quad
\texttt{d\_2} \leftarrow \begin{bmatrix} d_{02} \\ d_{13} \\ d_{20} \\ d_{31} \end{bmatrix}, \quad
\texttt{d\_3} \leftarrow \begin{bmatrix} d_{03} \\ d_{12} \\ d_{21} \\ d_{30} \end{bmatrix}
\]
and similarly for the registers d_4 to d_7, holding the lower 4 × 4 block. The
use of this permutation avoids the saturation of the load unit. In fact, it
reduces the number of load instructions compared to the scheme that employs
the broadcast instruction to fill the b_0 vector with 4 copies of the b0k element
(similarly for the other elements). If the scheme employing broadcast
instructions is considered, one broadcast instruction is needed at each clock cycle to
load B elements, plus one load instruction every 4 clock cycles to load A
elements. This exceeds the load issue capability of the micro-architecture, which is
one 256-bit load per clock cycle.
On the contrary, in the employed scheme, the extra load instructions are replaced
by shuffle and permute2f128 instructions, which can be issued in parallel with
multiplication, addition and load instructions. At the end, the correct
permutation can be recovered by means of two layers of blend instructions before
storing the result.
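The shuffle, permute and blend steps can be traced with a scalar model. The sketch below is an illustration under stated assumptions, not the thesis code: it emulates `_mm256_shuffle_pd(x, x, 0x5)`, `_mm256_permute2f128_pd(x, x, 0x1)` and `_mm256_blend_pd` on arrays of 4 doubles, follows the register assignments of the code listing (d_3 accumulates the product with b_3210, d_2 the product with b_2301), and recovers the columns of the upper 4 × 4 block. The blend masks 0xA and 0xC are one choice that works; the text does not state which masks are actually used, and all function names are hypothetical.

```c
#include <assert.h>

/* Scalar model of _mm256_shuffle_pd(x, x, 0x5): swap within each 128-bit half. */
void shuffle_pd_0x5(const double x[4], double r[4])
{
    r[0] = x[1]; r[1] = x[0]; r[2] = x[3]; r[3] = x[2];
}

/* Scalar model of _mm256_permute2f128_pd(x, x, 0x1): swap the two halves. */
void swap_halves(const double x[4], double r[4])
{
    r[0] = x[2]; r[1] = x[3]; r[2] = x[0]; r[3] = x[1];
}

/* Scalar model of _mm256_blend_pd(a, b, mask): lane i comes from b if bit i is set. */
void blend_pd(const double a[4], const double b[4], int mask, double r[4])
{
    for (int i = 0; i < 4; i++)
        r[i] = ((mask >> i) & 1) ? b[i] : a[i];
}

/* d += a * b, lane by lane (a multiplication followed by an addition). */
void mul_acc(double d[4], const double a[4], const double b[4])
{
    for (int i = 0; i < 4; i++)
        d[i] += a[i] * b[i];
}

/* One k-iteration for the upper 4 x 4 block, in the order of the listing. */
void update_upper_block(double d_0[4], double d_1[4], double d_2[4], double d_3[4],
                        const double a[4], const double b_0[4])
{
    double b_1[4], b_2[4], b_3[4];
    mul_acc(d_0, a, b_0);        /* d00 d11 d22 d33 */
    shuffle_pd_0x5(b_0, b_1);    /* b_1032 */
    mul_acc(d_1, a, b_1);        /* d01 d10 d23 d32 */
    swap_halves(b_1, b_2);       /* b_3210 */
    mul_acc(d_3, a, b_2);        /* d03 d12 d21 d30 */
    shuffle_pd_0x5(b_2, b_3);    /* b_2301 */
    mul_acc(d_2, a, b_3);        /* d02 d13 d20 d31 */
}

/* Two layers of blends recover the columns c0..c3 of the 4 x 4 block. */
void recover_columns(const double d_0[4], const double d_1[4],
                     const double d_2[4], const double d_3[4],
                     double c0[4], double c1[4], double c2[4], double c3[4])
{
    double t0[4], t1[4], t2[4], t3[4];
    blend_pd(d_0, d_1, 0xA, t0); /* d00 d10 d22 d32 */
    blend_pd(d_1, d_0, 0xA, t1); /* d01 d11 d23 d33 */
    blend_pd(d_2, d_3, 0xA, t2); /* d02 d12 d20 d30 */
    blend_pd(d_3, d_2, 0xA, t3); /* d03 d13 d21 d31 */
    blend_pd(t0, t2, 0xC, c0);   /* d00 d10 d20 d30 */
    blend_pd(t1, t3, 0xC, c1);   /* d01 d11 d21 d31 */
    blend_pd(t2, t0, 0xC, c2);   /* d02 d12 d22 d32 */
    blend_pd(t3, t1, 0xC, c3);   /* d03 d13 d23 d33 */
}
```

The first blend layer fixes the odd lanes and the second fixes the upper half, so each output vector ends up holding one column of the block.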
At the beginning of the k-th iteration, the registers a_0123, a_4567 and b_0
contain the elements
\[
\texttt{a\_0123} \leftarrow \begin{bmatrix} a_{0k} \\ a_{1k} \\ a_{2k} \\ a_{3k} \end{bmatrix}, \quad
\texttt{a\_4567} \leftarrow \begin{bmatrix} a_{4k} \\ a_{5k} \\ a_{6k} \\ a_{7k} \end{bmatrix}, \quad
\texttt{b\_0} \leftarrow \begin{bmatrix} b_{0k} \\ b_{1k} \\ b_{2k} \\ b_{3k} \end{bmatrix}
\]
prefetched during the previous iteration. During the k-th iteration, the registers
A_0, A_4 and b_0 are prefetched with the A and B elements used in the following
iteration k + 1.
5.2.3.2 sgemm
Each 256-bit register can hold 8 single-precision FP numbers. In single precision,
both the multiplication and the addition have a throughput of 1 instruction per
clock cycle, and at each clock cycle both a multiplication and an addition can
be issued. Therefore, in single precision the full FP throughput is 16 flops per
clock cycle.
The panel height is chosen as bs = 8, which is equal to the FP register size in
floats. In the sgemm optimization, 8 out of 16 registers can be used to hold a
sub-matrix of D.
A natural choice would be to hold an 8 × 8 sub-matrix of D, such that each
element from A and B is reused 8 times once in registers. However, in practice
this implementation scheme is found to be sub-optimal, and the reason seems
to lie in the saturation of the shuffle unit, since this scheme requires shuffling
the B vector at each clock cycle.
An alternative scheme holds a 16 × 4 sub-matrix of D. This allows the shuffled
element of B to be reused in two consecutive clock cycles, reducing pressure
on the shuffle unit. As a drawback, this scheme uses only the upper 4 rows of
each panel of B (which has a height of 8 elements). This is sub-optimal
since a cache line is not fully utilized once moved into L1 cache. The
optimal solution (i.e. employing different panel heights for the A and B matrices)
cannot be employed in the proposed linear algebra implementation scheme for
embedded optimization, since all matrices need to have the same panel height
value. In practice, however, this does not seem to affect performance.
As discussed in the dgemm case, the code can be safely written using intrinsics.
The optimized code of an iteration over k is:
tmp = _mm256_mul_ps( a_0, b_t );
B_0 = _mm256_broadcast_ps( (__m128 *) &B[8] );
d_0 = _mm256_add_ps( d_0, tmp );
tmp = _mm256_mul_ps( a_8, b_t );
b_t = _mm256_shuffle_ps( b_0, b_0, 0x55 );
d_4 = _mm256_add_ps( d_4, tmp );
tmp = _mm256_mul_ps( a_0, b_t );
A_0 = _mm256_load_ps( &A0[8] );
d_1 = _mm256_add_ps( d_1, tmp );
tmp = _mm256_mul_ps( a_8, b_t );
b_t = _mm256_shuffle_ps( b_0, b_0, 0xaa );
d_5 = _mm256_add_ps( d_5, tmp );
tmp = _mm256_mul_ps( a_0, b_t );
A_8 = _mm256_load_ps( &A1[8] );
d_2 = _mm256_add_ps( d_2, tmp );
tmp = _mm256_mul_ps( a_8, b_t );
b_t = _mm256_shuffle_ps( b_0, b_0, 0xff );
d_6 = _mm256_add_ps( d_6, tmp );
tmp = _mm256_mul_ps( a_0, b_t );
d_3 = _mm256_add_ps( d_3, tmp );
tmp = _mm256_mul_ps( a_8, b_t );
b_t = _mm256_shuffle_ps( B_0, B_0, 0x00 );
d_7 = _mm256_add_ps( d_7, tmp );
The broadcast_ps instruction loads 128 bits (4 floats) of B and repeats them
in the high and low halves of a 256-bit register. The shuffle instruction
can shuffle a vector in any combination within the high and low halves of a
256-bit register (but in the identical way in the high and low halves). Therefore,
the combination of the two instructions can be employed to broadcast one B
element to all components of a vector without the need to employ a broadcast
instruction every clock cycle.
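The way the two instructions combine can be seen in a scalar model. The sketch below emulates `_mm256_broadcast_ps` and `_mm256_shuffle_ps` on float arrays, using the same immediates as the listing (0x00, 0x55, 0xaa, 0xff); the helper `splat_b` is hypothetical and only illustrates the idea.

```c
#include <assert.h>

/* Scalar model of _mm256_broadcast_ps: copy 4 floats to both 128-bit halves. */
void broadcast_ps(const float src[4], float r[8])
{
    for (int i = 0; i < 8; i++)
        r[i] = src[i % 4];
}

/* Scalar model of _mm256_shuffle_ps(a, b, imm): within each 128-bit half,
   lanes 0-1 select from a and lanes 2-3 select from b, each selection
   being driven by one 2-bit field of imm. */
void shuffle_ps(const float a[8], const float b[8], int imm, float r[8])
{
    for (int h = 0; h < 8; h += 4) {
        r[h + 0] = a[h + ((imm >> 0) & 3)];
        r[h + 1] = a[h + ((imm >> 2) & 3)];
        r[h + 2] = b[h + ((imm >> 4) & 3)];
        r[h + 3] = b[h + ((imm >> 6) & 3)];
    }
}

/* Replicate element j (0..3) of four B values to all 8 lanes:
   one broadcast load, then one shuffle per element. */
void splat_b(const float b_row[4], int j, float r[8])
{
    static const int imm[4] = {0x00, 0x55, 0xaa, 0xff};
    float b_0[8];
    assert(j >= 0 && j < 4);
    broadcast_ps(b_row, b_0);
    shuffle_ps(b_0, b_0, imm[j], r);
}
```

One load therefore serves four shuffles, so the load port sees one B access every four elements instead of one per element.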
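The accumulation registers d_0 to d_7 contain the 16 × 4 sub-matrix of D
\begin{align*}
\texttt{d\_0} &\leftarrow \begin{bmatrix} d_{00} & d_{10} & d_{20} & d_{30} & d_{40} & d_{50} & d_{60} & d_{70} \end{bmatrix}^T, &
\texttt{d\_1} &\leftarrow \begin{bmatrix} d_{01} & d_{11} & d_{21} & d_{31} & d_{41} & d_{51} & d_{61} & d_{71} \end{bmatrix}^T, \\
\texttt{d\_2} &\leftarrow \begin{bmatrix} d_{02} & d_{12} & d_{22} & d_{32} & d_{42} & d_{52} & d_{62} & d_{72} \end{bmatrix}^T, &
\texttt{d\_3} &\leftarrow \begin{bmatrix} d_{03} & d_{13} & d_{23} & d_{33} & d_{43} & d_{53} & d_{63} & d_{73} \end{bmatrix}^T, \\
\texttt{d\_4} &\leftarrow \begin{bmatrix} d_{80} & d_{90} & d_{a0} & d_{b0} & d_{c0} & d_{d0} & d_{e0} & d_{f0} \end{bmatrix}^T, &
\texttt{d\_5} &\leftarrow \begin{bmatrix} d_{81} & d_{91} & d_{a1} & d_{b1} & d_{c1} & d_{d1} & d_{e1} & d_{f1} \end{bmatrix}^T, \\
\texttt{d\_6} &\leftarrow \begin{bmatrix} d_{82} & d_{92} & d_{a2} & d_{b2} & d_{c2} & d_{d2} & d_{e2} & d_{f2} \end{bmatrix}^T, &
\texttt{d\_7} &\leftarrow \begin{bmatrix} d_{83} & d_{93} & d_{a3} & d_{b3} & d_{c3} & d_{d3} & d_{e3} & d_{f3} \end{bmatrix}^T
\end{align*}
where hexadecimal indexes are employed to use a single character in the notation.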
[Plots: Gflops versus matrix size n, HPMPC_kernel_8x4 (dgemm) and HPMPC_kernel_16x4 (sgemm) compared with OpenBLAS. (a) Intel Sandy-Bridge dgemm. (b) Intel Sandy-Bridge sgemm.]
Figure 5.3: Performance test of gemm kernel for square matrices of size n × n,
n ∈ [4, 300], on an Intel Sandy-Bridge, code compiled with gcc
4.8.4. Peak performance in double (single) precision is 8 · 2.9 = 23.2
Gflops (16 · 2.9 = 46.4 Gflops).
As in the dgemm case, the remaining 8 registers are used for intermediate results,
and to prefetch elements from A and B.
5.2.3.3 Results
The test machine is a laptop equipped with the Intel Core i5 2410M processor.
The processor has a base frequency of 2.3 GHz and a maximum turbo frequency
of 2.9 GHz. During all tests, the processor is found running at the maximum
turbo frequency. The processor incorporates 3 MB of L3 cache. In double
precision (Figure 5.3a), the proposed dgemm routine attains a maximum performance
of 21.7 Gflops (93%), which compares with the 20.4 Gflops (88%) of OpenBLAS.
In this case, the proposed dgemm routine shows good performance also for
small matrix sizes, but the OpenBLAS dgemm routine also performs rather well. In
single precision (Figure 5.3b), the proposed sgemm routine attains a maximum
performance of 44.9 Gflops (97%), which compares with the 34.5 Gflops (74%) of
OpenBLAS. In this case, there is a large performance gap between the proposed
routine and the OpenBLAS counterpart, especially for small matrix sizes. This
is partly due to the fact that the sgemm kernel in OpenBLAS makes use
of an 8 × 8 scheme, which in theory optimizes the reuse of A and B elements but in
practice seems to saturate the shuffle execution unit.
5.2.4 Intel Haswell
The Haswell architecture arrived on the market in 2013 as a "tock" (a new
micro-architecture). It is implemented on the same 22 nm FinFET process as the
Ivy-Bridge architecture, which was a "tick" (shrink to a new process technology) of
the Sandy-Bridge micro-architecture. Haswell is found in 4th generation Intel
Core processors. The following Broadwell architecture (found in 5th generation
Intel Core processors) is a "tick" of Haswell, implementing basically the same
micro-architecture on the Intel 14 nm FinFET process.
On the ISA side, Haswell presents many improvements. The changes most useful
to numerical code are the introduction of the AVX2 ISA (providing 256-bit wide
integer SIMD and introducing new FP instructions such as gather and
broadcast) and the FMA ISA (containing fused-multiply-accumulate instructions).
The Haswell micro-architecture shares with its predecessor (Sandy-Bridge) the
fact that it is a 64-bit superscalar processor that can issue 4 instructions per
cycle. It supports two threads per physical core, speculative and out-of-order
execution, and register renaming on both GP and FP registers. There are 32
KB of both data and instruction 8-way associative L1 cache per core, and 256
KB of 8-way associative L2 cache per core. There is also an L3 cache shared
between CPU cores and GPU.
Among the new micro-architecture features there is the addition of a new
execution port, used to execute ALU and branch instructions. The multiplication
port in Sandy-Bridge is extended to execute SIMD fused-multiply-accumulate
instructions. The addition port present in Sandy-Bridge is also extended to
execute SIMD fused-multiply-accumulate instructions. This means that two
256-bit wide fused-multiply-accumulate instructions can be issued at each clock
cycle, doubling the theoretical FP performance with respect to the Sandy-Bridge
micro-architecture. In order to sustain the improved FP performance, the bandwidth
between registers and L1 cache is doubled, and the Haswell micro-architecture
can sustain 2 256-bit loads and 1 256-bit store per clock cycle. The bandwidth
between L1 and L2 cache is doubled too, and is now equal to 64 bytes (512 bits)
per clock cycle. Some models may include a 4th level of cache.
In both single and double precision, the FMA instruction has a latency of 5
clock cycles (identical to the latency of multiplication) and a reciprocal
throughput of 0.5 clock cycles, i.e. 2 instructions can be issued per clock cycle.
This means that, in order to fully hide the latency of FMA instructions, at
least 5 · 2 = 10 registers have to be used to hold elements of the result sub-matrix
C. As a consequence, the gemm implementation scheme used in both the Core and
Sandy-Bridge micro-architectures (where 8 registers are used to hold a sub-matrix
of C) cannot give more than 8/10 = 80% of full FP throughput.
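The register count follows from a simple latency-hiding argument: each accumulation register can receive a new dependent instruction only once per latency period, so the number of independent accumulators needed is the latency times the number of instructions issued per clock cycle. A minimal sketch of this idealized model follows (the function names are hypothetical, and the model ignores every other bottleneck, so measured results can differ):

```c
#include <assert.h>

/* Independent accumulators needed to keep the FP units busy:
   latency in clock cycles * instructions issued per clock cycle. */
int min_accumulators(int latency_cycles, int issue_per_cycle)
{
    assert(latency_cycles > 0 && issue_per_cycle > 0);
    return latency_cycles * issue_per_cycle;
}

/* Upper bound on the fraction of full FP throughput reachable with
   n_acc independent accumulators (1.0 means full throughput). */
double max_throughput_fraction(int n_acc, int latency_cycles, int issue_per_cycle)
{
    int needed = min_accumulators(latency_cycles, issue_per_cycle);
    assert(n_acc > 0);
    return n_acc >= needed ? 1.0 : (double)n_acc / needed;
}
```

For Haswell, `min_accumulators(5, 2)` gives 10 and `max_throughput_fraction(8, 5, 2)` gives 0.8, the 80% bound stated above; for the Sandy-Bridge addition, `min_accumulators(3, 1)` gives 3.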
In the proposed implementation scheme for linear algebra in embedded
optimization, the panels from all matrices need to have the same size (in this case
equal to the SIMD width): therefore, it is not possible to use exactly 10 registers.
Instead, a new gemm implementation scheme is employed, where 12 registers are
used to hold elements from the result sub-matrix C, and the remaining 4
registers are used to hold elements from the A and B matrices. In the gemm kernel, 3
panels from A and 1 panel from B are streamed at once. The 4 remaining
registers are enough to hold 3 elements from the A matrix and 1 element from the
B matrix. No additional registers are necessary for intermediate results, thanks
to the use of the FMA instruction. There are no registers left to prefetch elements
from A and B: therefore the 16 FP registers are barely enough to implement
this gemm scheme.
5.2.4.1 dgemm
Each 256-bit register can hold 4 double-precision FP numbers. In double
precision, the fused-multiply-add instruction has a throughput of 2 instructions per
clock cycle. Therefore, in double precision the full FP throughput of the Haswell
architecture is 16 flops per clock cycle.
In the dgemm optimization, in the case of the Haswell architecture, at least 10
registers are needed to hide the latency of the fused-multiply-add instruction.
Given the constraint that the panel size has to be the same for all matrices,
this means that 12 out of 16 registers must be used to hold a 12 × 4 sub-matrix
of D. The panel height is bs = 4, identical to the Sandy-Bridge case.
Similarly to the Sandy-Bridge case, the code can be safely written using
intrinsics. The optimized code of an iteration over k is:
d_0 = _mm256_fmadd_pd( a_0, b_0, d_0 );
d_4 = _mm256_fmadd_pd( a_4, b_0, d_4 );
d_8 = _mm256_fmadd_pd( a_8, b_0, d_8 );
b_0 = _mm256_shuffle_pd( b_0, b_0, 0x5 );
d_1 = _mm256_fmadd_pd( a_0, b_0, d_1 );
d_5 = _mm256_fmadd_pd( a_4, b_0, d_5 );
d_9 = _mm256_fmadd_pd( a_8, b_0, d_9 );
b_0 = _mm256_permute2f128_pd( b_0, b_0, 0x1 );
d_3 = _mm256_fmadd_pd( a_0, b_0, d_3 );
d_7 = _mm256_fmadd_pd( a_4, b_0, d_7 );
d_b = _mm256_fmadd_pd( a_8, b_0, d_b );
b_0 = _mm256_shuffle_pd( b_0, b_0, 0x5 );
d_2 = _mm256_fmadd_pd( a_0, b_0, d_2 );
a_0 = _mm256_load_pd( &A0[4] );
d_6 = _mm256_fmadd_pd( a_4, b_0, d_6 );
a_4 = _mm256_load_pd( &A1[4] );
d_a = _mm256_fmadd_pd( a_8, b_0, d_a );
b_0 = _mm256_load_pd( &B[4] );
a_8 = _mm256_load_pd( &A2[4] );
This scheme is identical to the one employed in the Sandy-Bridge case, with
two key differences. Firstly, the multiplication and the addition are fused, and
therefore there are no intermediate results. Furthermore, since 12 registers must
be employed as accumulation registers, there are only 4 registers left to hold
elements from A and B (there is no need for additional registers for intermediate
results). These 4 registers can barely hold the 3 elements of A and the element of
B that are reused as arguments of the fused-multiply-add. There are no extra
registers available to aggressively prefetch A and B elements. Therefore this
implementation scheme must rely heavily on advanced core features such as
out-of-order computation, register renaming and hardware prefetch.
5.2.4.2 sgemm
Each 256-bit register can hold 8 single-precision FP numbers. In single
precision, the fused-multiply-add instruction has a throughput of 2 instructions per
clock cycle. Therefore, in single precision the full FP throughput of the Haswell
architecture is 32 flops per clock cycle.
Also in the sgemm optimization, in the case of the Haswell architecture, at least
10 registers are needed to hide the latency of the fused-multiply-add instruction.
Given the constraint that the panel size has to be the same for all matrices,
this means that 12 out of 16 registers must be used to hold a 24 × 4 sub-matrix
of D. The panel height is bs = 8, identical to the Sandy-Bridge case.
Similarly to the Sandy-Bridge case, the code can be safely written using
intrinsics. The optimized code of an iteration over k is:
b_0 = _mm256_broadcast_ps( (__m128 *) &B[0] );
d_0 = _mm256_fmadd_ps( a_0, b_0, d_0 );
d_4 = _mm256_fmadd_ps( a_8, b_0, d_4 );
d_8 = _mm256_fmadd_ps( a_g, b_0, d_8 );
b_0 = _mm256_permute_ps( b_0, 0xb1 );
d_1 = _mm256_fmadd_ps( a_0, b_0, d_1 );
d_5 = _mm256_fmadd_ps( a_8, b_0, d_5 );
d_9 = _mm256_fmadd_ps( a_g, b_0, d_9 );
b_0 = _mm256_permute_ps( b_0, 0x4e );
d_2 = _mm256_fmadd_ps( a_0, b_0, d_2 );
d_6 = _mm256_fmadd_ps( a_8, b_0, d_6 );
d_a = _mm256_fmadd_ps( a_g, b_0, d_a );
b_0 = _mm256_permute_ps( b_0, 0xb1 );
d_3 = _mm256_fmadd_ps( a_0, b_0, d_3 );
a_0 = _mm256_load_ps( &A0[8] );
d_7 = _mm256_fmadd_ps( a_8, b_0, d_7 );
a_8 = _mm256_load_ps( &A1[8] );
d_b = _mm256_fmadd_ps( a_g, b_0, d_b );
a_g = _mm256_load_ps( &A2[8] );
Again, this scheme is identical to the one employed in the Sandy-Bridge case,
with the same two key differences as in the double precision case. There are no
extra registers available to aggressively prefetch A and B elements. Therefore
this implementation scheme must rely heavily on advanced core features
such as out-of-order computation, register renaming and hardware prefetch.
5.2.4.3 Results
The test machine is a laptop equipped with the Intel Core i7 4800MQ processor,
which has a base frequency of 2.7 GHz and a maximum turbo frequency of 3.7
GHz, and is equipped with 6 MB of L3 cache. If AVX or AVX2 code is employed,
the turbo frequency is lowered to 3.3 GHz, as it is in these tests. At 3.3 GHz,
the full FP throughput per core is 52.8 Gflops in double precision and 105.6
Gflops in single precision. In double precision (Figure 5.4a) the proposed dgemm
routine reaches 49.4 Gflops (93.6%), while in single precision (Figure 5.4b) the
proposed sgemm routine reaches 99.4 Gflops (94.1%). As a comparison, the
OpenBLAS counterparts reach 40.9 Gflops (77.5%) and 76.6 Gflops (72.5%).
Even more important than absolute performance, the proposed implementation
scheme gives much better performance for small matrices, especially in single
precision.
As a note on the implementation of the potrf and trtri kernels, there the
fnmadd intrinsic is used in place of the fmadd intrinsic in the kernel loop. The
gcc compiler emits a combination of fmadd and xor instructions in place of the
fnmadd instructions, decreasing performance by about 20% compared to the
gemm kernel. This is solved by using a customized version of the gcc compiler,
as described in Appendix A.
[Plots: Gflops versus matrix size n, HPMPC_kernel_12x4 (dgemm) and HPMPC_kernel_24x4 (sgemm) compared with OpenBLAS. (a) Intel Haswell dgemm. (b) Intel Haswell sgemm.]
Figure 5.4: Performance test of gemm kernel for square matrices of size n × n,
n ∈ [4, 300], on an Intel Haswell processor, code compiled with gcc
4.9.2. Peak performance in double (single) precision is 16 · 3.3 =
52.8 Gflops (32 · 3.3 = 105.6 Gflops).
5.2.5 Intel Skylake
The Skylake architecture arrived on the market in late 2015 as a "tock" (a
new micro-architecture). It is implemented on the same 14 nm FinFET process as
the Broadwell architecture, which was a "tick" (shrink to a new process technology)
of the Haswell micro-architecture. Skylake is found in 6th generation Intel Core
processors. Skylake will be followed by another "tock", called Kaby Lake (which
will be found in the 7th generation Intel Core processors), expected in 2016, which
will be implemented in the same 14 nm FinFET process. At the time of writing,
there are no details about the new architecture, although some speculation suggests
the possible widespread introduction of the AVX512 ISA, which is present only on
high-end Skylake Xeon processors. The reason for this double "tock" is probably
the difficulty of developing the 10 nm process technology. The "tick"
of Skylake should be called Cannonlake, and is due sometime in 2017.
On the ISA side, Skylake does not present many improvements, and in
particular there are no new ISAs improving FP code.
At the time of writing, there are not many details about the Skylake micro-
architecture. It is a 64-bit superscalar processor, and it appears that it can
issue 5 or 6 instructions per cycle. It supports two threads per physical core,
speculative and out-of-order execution, and register renaming on both GP and
FP registers. There are 32 KB of both data and instruction 8-way associative
L1 cache per core, and 256 KB of 4-way associative L2 cache per core. There
is also an L3 cache shared between CPU cores and GPU. Some models may
include a 4th level of cache.
A feature affecting the gemm implementation scheme is that in both single and
double precision, the FMA instruction has a latency of 4 clock cycles (down from
5 cycles in the Haswell micro-architecture) and a reciprocal throughput of 0.5
clock cycles, i.e. 2 instructions per clock cycle. This means that, in order to
fully hide the latency of FMA instructions, 4 · 2 = 8 accumulation registers
(instead of 10) should be enough to obtain full throughput. As a consequence, a
gemm implementation scheme where 8 registers are used to hold a sub-matrix of
C should give full FP throughput in the case of the Skylake micro-architecture.
This should slightly boost performance for small matrices. Compared to the
implementation scheme for Haswell, this should leave 4 registers free for more
aggressive prefetch of data, likely slightly improving peak performance too.
Furthermore, since Broadwell there is a radix-1024 divider: this reduces the latency
of division and square-root instructions, further improving performance e.g.
in the factorization of small matrices.
The above speculations about the performance of a gemm implementation scheme
and of factorization routines could not be tested on a Skylake processor yet.
Note: at the time of printing the final version of the thesis, a Skylake processor
has been quickly evaluated. A gemm kernel using 8 accumulation registers is
found to deliver about 80% of full throughput: even if this value is slightly
higher than in the Haswell case, the Haswell scheme using 12 accumulation
registers has to be employed also in the Skylake case in order to obtain full
throughput.
5.2.6 AMD K10
The K10, or 10h, is a micro-architecture designed by AMD. The first products
based on this micro-architecture were released in 2007.
The K10 micro-architecture is an evolutionary step of the K8 micro-architecture.
It is a 64-bit superscalar processor that can issue 3 instructions per cycle. It
supports speculative and out-of-order execution, and register renaming. There
are 64 KB 2-way associative L1 instruction cache and data cache, plus 512 KB
16-way associative exclusive L2 cache per core. Additionally, there are 2 MB of
L3 cache shared among cores. The vector unit width has been increased from
64-bit in the K8 to 128-bit in the K10 micro-architecture.
On the ISA side, it supports SSEx ISAs up to SSE3, plus the SSE4a ISA (not
supported by Intel architectures). For the purpose of linear algebra routines
implementation, the SSE, SSE2 and SSE3 ISAs are considered, which are common
to Intel architectures. Therefore, the gemm kernels developed for the Intel Core
micro-architecture also work for the AMD K10 micro-architecture.
However, an important micro-architectural difference between Intel Core and
AMD K10 is that in the latter the shuffles (both FP and universal) are performed
on the multiplication unit. Therefore, shuffle instructions compete with
multiplications for the same execution unit, reducing throughput. As a solution,
in double precision the movddup instruction can be used to broadcast a
double-precision FP number to both the lower and the upper half of a 128-bit
register, without the need to use shuffle instructions. Since loads are performed
on a different execution unit, this allows the multiplication instructions to reach
full throughput.
In single precision, this scheme cannot be employed, since there is no equivalent
of the movddup instruction until the introduction of the AVX instruction
set. In the optimized sgemm in OpenBLAS, this is solved by storing a sub-matrix
of B in a memory buffer such that each B element is repeated 4 times, so that
the 4 copies of the same element can be loaded into all vector components by
means of a simple load instruction. The buffer is reused in the product with a
large sub-matrix of A, which hides well the cost of building the B buffer.
However, this scheme cannot be employed in the proposed implementation scheme
for linear algebra in embedded optimization, since the matrices are not copied
into buffers on-line. Therefore, the 8 × 4 sgemm kernel developed for the Intel
Core micro-architecture is employed (even if sub-optimal) also for the AMD K10
architecture.
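The replicated-B packing described above can be sketched in portable C. This is only an illustration of the idea, not the OpenBLAS code: the function name pack_b_replicated and the flat buffer layout are assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical helper (not an OpenBLAS routine): copy n elements of B into
 * a buffer in which every element is repeated 4 times, so that one 128-bit
 * load at packed + 4*j yields 4 copies of b[j]. */
void pack_b_replicated(const float *b, size_t n, float *packed)
{
    for (size_t j = 0; j < n; j++)
        for (size_t r = 0; r < 4; r++)
            packed[4 * j + r] = b[j];
}
```

The packed buffer is 4 times larger than the original data, which is why the scheme pays off only when the buffer is reused many times.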
5.2.6.1 dgemm
The 4×4 dgemm kernel tailored to the AMD K10 micro-architecture is analogous
to the corresponding kernel developed for the Intel Core, with the difference that
shuffle instructions are replaced by movddup instructions. Furthermore, the code
is implemented in C with intrinsics, which gives good performance thanks
to the out-of-order and register-renaming capabilities of the micro-architecture.
a_0 = _mm_load_pd(&A[0]);
a_2 = _mm_load_pd(&A[2]);
b_0 = _mm_loaddup_pd(&B[0]);
b_1 = b_0;
b_0 = _mm_mul_pd( a_0, b_0 );
d_0 = _mm_add_pd( d_0, b_0 );
b_1 = _mm_mul_pd( a_2, b_1 );
d_4 = _mm_add_pd( d_4, b_1 );
b_0 = _mm_loaddup_pd(&B[1]);
b_1 = b_0;
b_0 = _mm_mul_pd( a_0, b_0 );
d_1 = _mm_add_pd( d_1, b_0 );
b_1 = _mm_mul_pd( a_2, b_1 );
d_5 = _mm_add_pd( d_5, b_1 );
b_0 = _mm_loaddup_pd(&B[2]);
b_1 = b_0;
b_0 = _mm_mul_pd( a_0, b_0 );
d_2 = _mm_add_pd( d_2, b_0 );
b_1 = _mm_mul_pd( a_2, b_1 );
d_6 = _mm_add_pd( d_6, b_1 );
b_0 = _mm_loaddup_pd(&B[3]);
b_1 = b_0;
b_0 = _mm_mul_pd( a_0, b_0 );
d_3 = _mm_add_pd( d_3, b_0 );
b_1 = _mm_mul_pd( a_2, b_1 );
d_7 = _mm_add_pd( d_7, b_1 );
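To make the register layout of the listing explicit, the following portable C sketch emulates one iteration over k, with a two-element struct standing in for a 128-bit SSE register. The type v2d and the helper names loaddup, muladd and dgemm_4x4_step are assumptions made for the illustration; the mapping of the accumulators d_0 to d_7 follows the listing.

```c
/* Stand-in for a 128-bit register holding 2 doubles (illustrative type). */
typedef struct { double lo, hi; } v2d;

/* Emulates movddup: broadcast one double to both halves. */
static v2d loaddup(const double *p) { v2d r = { *p, *p }; return r; }

/* Emulates a packed multiply followed by a packed add: d += a * b. */
static v2d muladd(v2d d, v2d a, v2d b)
{
    d.lo += a.lo * b.lo;
    d.hi += a.hi * b.hi;
    return d;
}

/* One iteration over k of the 4x4 update. A points to 4 elements of a
 * column of A, B to the 4 matching elements of B; d[j] accumulates
 * rows 0-1 and d[4 + j] rows 2-3 of column j of the result. */
void dgemm_4x4_step(const double *A, const double *B, v2d d[8])
{
    v2d a_0 = { A[0], A[1] };
    v2d a_2 = { A[2], A[3] };
    for (int j = 0; j < 4; j++) {
        v2d b = loaddup(&B[j]);
        d[j]     = muladd(d[j], a_0, b);
        d[4 + j] = muladd(d[4 + j], a_2, b);
    }
}
```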
5.3 ARMv7A
The ARMv7A is a 32-bit RISC architecture. Like most RISC architectures, it is a
load/store architecture (i.e. only load and store instructions can access memory,
all other instructions operate on registers) and it features simple addressing
modes. The ARMv7 architecture is divided into 3 proﬁles: Application (A),
Real-time (R) and Micro-controller (M) proﬁles. The A proﬁle is intended for
the highest performance applications, it employs a Memory Management Unit
(MMU), and it is inﬂuenced by multi-tasking OS requirements.
The architecture defines 15 32-bit GP registers. In ARMv7A, there are two
ISAs: ARM (32-bit encoding) and Thumb-2 (a mix of 16- and 32-bit encodings,
for higher code density). Regarding FP computation, there are two ISAs: VFP
(VFPv3 or VFPv4 in ARMv7A) and NEON (NEON or NEONv2 in ARMv7A).
The VFP ISA is essentially a scalar ISA, processing both single- and
double-precision FP numbers, while NEON is a SIMD one that can process many
types of vector integers and single-precision FP numbers. In most
implementations of VFP and in all implementations of NEON, there are 32 64-bit
FP registers. Instructions usually take 3 operands, and thus none of the input
operands needs to be overwritten.
Since version 0.2.9, OpenBLAS supports ARMv7A, but using only the scalar
VFP unit. The NEON code presented in the following is the product of original
research, and it outperforms the implementation in OpenBLAS.
5.3.1 ARM Cortex A9
The Cortex A9 is a 32-bit processor designed by ARM, on the market since 2010.
At that time, it was the highest-performing processor by ARM. Compared to its
predecessor, the Cortex A8, the Cortex A9 adds out-of-order capabilities and
a multi-core design. On the FP side, Cortex A9 greatly improves performance
compared to Cortex A8: in fact, the VFP unit in the Cortex A8 is not pipelined,
and therefore many instructions are about an order of magnitude slower than
in the Cortex A9. The Cortex A9 can be found in the SoC equipping many
smartphones and tablets of a few years ago.
The Cortex A9 is a multicore design with up to 4 coherent cores. The Cortex A9
micro-architecture is superscalar with an issue capability of two instructions per
cycle (even if not all combinations of instructions can be co-issued, e.g. numerical
experiments show that it cannot co-issue floating-point instructions and
load instructions). It performs speculative and partially out-of-order execution,
with register renaming on the GP registers (but not on the FP registers). Each
core can have 16, 32 or 64 KB of 4-way associative L1 cache for both data and
instructions. There may be an external L2 cache shared among cores, with size
ranging from 128 KB to 8 MB. The cache line size is 32 bytes.
The Cortex A9 can have two floating-point units: VFPv3 and NEON. The
former is a scalar unit that can process both single- and double-precision FP
numbers, and it is fully IEEE compliant. The latter is a vector unit that can
process small vectors of 4 single-precision FP numbers, but it neither supports
double-precision FP numbers nor is fully IEEE compliant, since it
only supports the round-to-nearest mode. In the Cortex A9, both FP units are
optional. However, in most cases both of them are present, since they are used
in processing multimedia. The FP datapath is 64-bit wide.
5.3.1.1 dgemm
The vector NEON instruction set does not support double-precision ﬂoating-
point numbers. As a consequence, the scalar VFP instruction set has to be
used.
The Cortex A9 implements the VFPv3 version, which has a multiply-accumulate
instruction (MLA) but not a fused-multiply-accumulate instruction (FMA). The
difference between the two instructions is that in the multiply-accumulate
instruction the result is rounded twice, once after the multiplication and once
after the addition, while in the fused-multiply-accumulate the result is rounded
only once, after the final addition. Therefore, the FMA is more accurate than
the MLA and than the sequence of separate multiplication and addition
instructions.
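The effect of rounding twice versus once can be reproduced in portable C with single-precision inputs. This is a sketch under stated assumptions: the function names are hypothetical, and the single-rounding variant is emulated by carrying the product in double precision (where the product of two floats is exact) rather than by issuing a hardware FMA.

```c
/* Multiply-accumulate with two roundings, as in MLA: the product is
 * rounded to single precision before the addition. The volatile store
 * forces that intermediate rounding and keeps the compiler from fusing
 * the two operations. */
float mla_two_roundings(float a, float b, float c)
{
    volatile float p = a * b;
    return p + c;
}

/* Single-rounding reference for float inputs: the product of two floats
 * is exact in double, so only the final conversion rounds (provided the
 * double addition is itself exact, as it is for the inputs used in the
 * note that follows). */
float mla_one_rounding(float a, float b, float c)
{
    double exact = (double)a * (double)b + (double)c;
    return (float)exact;
}
```

With a = 1 + 2^-13, b = 1 - 2^-13 and c = -1, the exact product is 1 - 2^-26, which rounds to 1 in single precision: the twice-rounded result is therefore 0, while the once-rounded result is -2^-26.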
The VFP unit has 32 scalar double-precision registers. Out of them, 16 can be
used as accumulation registers, to hold a 4 × 4 sub-matrix of C. Thus a 4 × 4
dgemm kernel is used, with panel height bs = 4. Of the remaining 16 registers, 8
are enough to hold 4 elements from the A matrix and 4 elements from the B
matrix. This means that there are still 8 registers available to more aggressively
prefetch the following 4 elements from the A and B matrices.
The optimized code of an iteration over k is
fmacd d0, d16, d20
fmacd d1, d17, d20
fmacd d2, d18, d20
fmacd d3, d19, d20
fldd d20, [r1, #64]
fmacd d4, d16, d21
fmacd d5, d17, d21
fmacd d6, d18, d21
fmacd d7, d19, d21
fldd d21, [r1, #72]
fmacd d8, d16, d22
fmacd d9, d17, d22
fmacd d10, d18, d22
fmacd d11, d19, d22
fldd d22, [r1, #80]
fmacd d12, d16, d23
fldd d16, [r0, #64]
fmacd d13, d17, d23
fldd d17, [r0, #72]
fmacd d14, d18, d23
fldd d18, [r0, #80]
fmacd d15, d19, d23
fldd d23, [r1, #88]
fldd d19, [r0, #88]
In the GCC inline assembly, fmacd is the FP multiply-accumulate instruc-
tion (ﬁrst argument is the accumulation register, the second and the third are
the multiplication factors), and fldd the FP load, both operating on double-
precision FP numbers. The Cortex A9 can issue a multiply-accumulate instruc-
tion every other cycle: thus by interleaving multiply-accumulate instructions
with load instructions, the load instruction can be performed in the idle cycle.
On average less than one instruction is issued every cycle, which is below the
maximum issue capability of 2 instructions per cycle.
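In the above code, 64-bit registers d0-d15 are used to hold a 4 × 4 sub-matrix
of C:
\[
\begin{bmatrix}
\mathtt{d0} & \mathtt{d4} & \mathtt{d8} & \mathtt{d12} \\
\mathtt{d1} & \mathtt{d5} & \mathtt{d9} & \mathtt{d13} \\
\mathtt{d2} & \mathtt{d6} & \mathtt{d10} & \mathtt{d14} \\
\mathtt{d3} & \mathtt{d7} & \mathtt{d11} & \mathtt{d15}
\end{bmatrix}
\leftarrow
\begin{bmatrix}
c_{00} & c_{01} & c_{02} & c_{03} \\
c_{10} & c_{11} & c_{12} & c_{13} \\
c_{20} & c_{21} & c_{22} & c_{23} \\
c_{30} & c_{31} & c_{32} & c_{33}
\end{bmatrix}
\]
while registers d16-d23 are used to hold vectors from A and B matrices:
\[
\begin{bmatrix} \mathtt{d16} \\ \mathtt{d17} \\ \mathtt{d18} \\ \mathtt{d19} \end{bmatrix}
\leftarrow
\begin{bmatrix} a_{0k} \\ a_{1k} \\ a_{2k} \\ a_{3k} \end{bmatrix},
\qquad
\begin{bmatrix} \mathtt{d20} \\ \mathtt{d21} \\ \mathtt{d22} \\ \mathtt{d23} \end{bmatrix}
\leftarrow
\begin{bmatrix} b_{0k} \\ b_{1k} \\ b_{2k} \\ b_{3k} \end{bmatrix},
\]
while at the following iteration over k the registers d24-d31 are used instead.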
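The alternation between two register sets can be sketched in portable C as follows. This is a reference version written for illustration, not the HPMPC kernel: the function name dgemm_4x4_ref and the storage convention (the 4 elements of A and of B needed at iteration k are contiguous, C is stored column by column) are assumptions chosen to match the register layout described above.

```c
/* Reference 4x4 dgemm micro-kernel, C += A * B, with panel height bs = 4.
 * Assumed layout: A[4*k + i] and B[4*k + j] are the elements needed at
 * iteration k, and C[i + 4*j] is element (i, j) of the result. */
void dgemm_4x4_ref(int kmax, const double *A, const double *B, double *C)
{
    double a[2][4], b[2][4];   /* two register sets, used alternately */
    int cur = 0;

    if (kmax <= 0)
        return;
    for (int i = 0; i < 4; i++) { a[0][i] = A[i]; b[0][i] = B[i]; }

    for (int k = 0; k < kmax; k++) {
        int nxt = 1 - cur;
        /* load the operands of the next iteration ahead of their use */
        if (k + 1 < kmax)
            for (int i = 0; i < 4; i++) {
                a[nxt][i] = A[4 * (k + 1) + i];
                b[nxt][i] = B[4 * (k + 1) + i];
            }
        /* rank-1 update of the 4x4 accumulator with the current set */
        for (int j = 0; j < 4; j++)
            for (int i = 0; i < 4; i++)
                C[i + 4 * j] += a[cur][i] * b[cur][j];
        cur = nxt;
    }
}
```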
A performance test shows that the best performance is obtained for data fitting
in L1 cache: this hints at the fact that hardware prefetch is not implemented. If
software prefetch is employed, high performance is attained also for data
fitting in L2 cache.
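As an illustration of software prefetch at the C level, the sketch below issues prefetch hints a fixed distance ahead of the data being used. It assumes a GCC- or Clang-compatible compiler, since __builtin_prefetch is a compiler builtin rather than standard C; the function name dot_prefetch and the prefetch distance of 8 doubles are arbitrary choices for the example, not values taken from the kernels in this chapter.

```c
#include <stddef.h>

/* Dot product with software prefetch (GCC/Clang builtin assumed).
 * The prefetch is only a hint and does not change the result. */
double dot_prefetch(const double *x, const double *y, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* once every 4 doubles (32 bytes, the Cortex A9 cache line size),
         * request the data needed 8 elements ahead */
        if ((i & 3) == 0 && i + 8 < n) {
            __builtin_prefetch(&x[i + 8], 0, 3);
            __builtin_prefetch(&y[i + 8], 0, 3);
        }
        s += x[i] * y[i];
    }
    return s;
}
```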
The large number of registers and the multiply-accumulate instruction make
the optimization of this kernel easy.
5.3.1.2 sgemm
In single precision, it is possible to choose between two FP units: VFP and
NEON. The former is a scalar unit (despite the name) and is fully IEEE compliant;
the latter is a vector unit that can perform some integer and floating-point
instructions (but e.g. no divisions nor square roots) and is not fully IEEE
compliant (e.g. it only supports round-to-nearest and always flushes denormals to
zero). This is not an issue in the case of MPC, thus the higher-performing vector
NEON unit is preferred.
The NEON instruction set supports a set of 32 double-word (or 64-bit) registers
(d0-d31), that can be seen as 16 quad-word (or 128-bit) registers (q0-q15). The
lower 16 double-word registers can be seen as 32 single-word (or 32-bit) registers
(s0-s31) and accessed individually. All instructions are ﬂexible and can operate
on vectors of 1, 2 or 4 packed single-precision FP numbers, corresponding to
the s, d, or q registers. This means e.g. that it is possible to operate on the
high double-word of a quad-word register, or on all single words of a quad-word
register in the set q0-q7. This flexibility makes the code optimization much
easier than in the x86 architecture, where only the lower element or the whole
vector can be accessed and shuffle and permute instructions must be heavily
employed.
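The aliasing between the s, d and q register views described above reduces to simple index arithmetic, sketched here in C. The helper names are hypothetical and serve only to make the mapping explicit.

```c
/* ARMv7 NEON register aliasing: q<n> overlaps d<2n> and d<2n+1>, and
 * d<n> (for n < 16) overlaps s<2n> and s<2n+1>. */

/* Index of the quad-word register that contains double-word register d. */
int q_of_d(int d) { return d / 2; }

/* 0 if d is the low half of its quad-word register, 1 if the high half. */
int half_of_d(int d) { return d % 2; }

/* Index of the double-word register containing single-word register s
 * (valid for s0-s31, which alias d0-d15 only). */
int d_of_s(int s) { return s / 2; }
```

For instance, d4 and d5 are the low and high halves of q2, and s31 lies in d15, that is, in q7: this is why single words can be addressed only in the set q0-q7.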
In the implementation of the sgemm kernel, out of the 16 quad-word registers,
8 can be used as accumulation registers to hold an 8 × 4 sub-matrix of C. Thus
an 8 × 4 kernel is used, with panel height bs = 4. Of the remaining 8 registers,
3 are enough to hold the two 4-wide vectors from matrix A and the one 4-wide
vector from matrix B needed at an iteration over k, and thus 6 registers can be
used to aggressively prefetch the vectors from A and B needed for two
consecutive k iterations.
The NEON version implemented in the Cortex A9 supports a multiply-accumulate
instruction, but not a fused-multiply-accumulate instruction. The
multiply-accumulate instruction is given in two variants: a vector-times-vector
one (where all elements of two registers are multiplied element-wise, and then
added element-wise to the accumulation register), and a vector-times-scalar one
(where all elements of a register are multiplied by a single element of a scalar
register, and then added element-wise to all elements of the accumulation
register). The latter instruction is rather powerful, and can be seen as the
combination of a shuffle, a multiplication and an addition in the x86 SSE
instruction set.
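The semantics of the vector-times-scalar variant can be written out in plain C. The sketch below emulates the operation on arrays of 4 floats and uses it to build one iteration over k of an 8 × 4 update; the function names and the array-based representation of the registers are assumptions made for the illustration.

```c
/* Emulates the NEON vector-times-scalar multiply-accumulate on arrays
 * of 4 floats: every element of a is multiplied by the same scalar and
 * added to the corresponding element of acc. */
void vmla_scalar_f32(float acc[4], const float a[4], float scalar)
{
    for (int i = 0; i < 4; i++)
        acc[i] += a[i] * scalar;
}

/* One iteration over k of an 8x4 update built from 8 such operations:
 * a holds 8 elements of a column of A, b holds 4 elements of B, and
 * acc[j] / acc[4 + j] hold rows 0-3 / rows 4-7 of column j. */
void sgemm_8x4_step(const float a[8], const float b[4], float acc[8][4])
{
    for (int j = 0; j < 4; j++) {
        vmla_scalar_f32(acc[j], &a[0], b[j]);
        vmla_scalar_f32(acc[4 + j], &a[4], b[j]);
    }
}
```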
In the Cortex A9, the multiply-accumulate instruction has a throughput of
2 (one instruction every other cycle), and thus the theoretical peak performance
is 4 flops per cycle (a 4-wide multiply-accumulate every other cycle). However,
in practice the attained performance in the sgemm kernel is considerably lower.
In fact, despite the instruction issue capability of 2 instructions per cycle,
the Cortex A9 seems to have a single port for FP and load/store instructions,
and therefore these instructions cannot be co-issued. Even worse, it is well
known that, due to the structure of the FP pipeline, there is a performance
penalty when mixing VFP and NEON instructions. The numerical tests performed
during the optimization of the sgemm kernel show that there is a similar
performance penalty in mixing multiply-accumulate instructions with load
instructions. This further limits the attainable performance, since load
instructions cannot be issued in the idle cycle between two multiply-accumulate
instructions. The best performing code keeps the multiply-accumulate
instructions and the load instructions as separated as possible.
The optimized code of two consecutive iterations over k is
vmla.f32 q8, q2, d0[0]
vmla.f32 q9, q2, d0[1]
vmla.f32 q10, q2, d1[0]
vmla.f32 q11, q2, d1[1]
vmla.f32 q12, q4, d0[0]
vmla.f32 q13, q4, d0[1]
vmla.f32 q14, q4, d1[0]
vmla.f32 q15, q4, d1[1]
vmla.f32 q8, q3, d2[0]
vmla.f32 q9, q3, d2[1]
vmla.f32 q10, q3, d3[0]
vmla.f32 q11, q3, d3[1]
vmla.f32 q12, q5, d2[0]
vmla.f32 q13, q5, d2[1]
vmla.f32 q14, q5, d3[0]
vmla.f32 q15, q5, d3[1]
vld1.64 {d0, d1, d2, d3}, [r2:128]!
vld1.64 {d4, d5, d6, d7}, [r0:128]!
vld1.64 {d8, d9, d10, d11}, [r1:128]!
The vmla.f32 is the NEON multiply-accumulate operating on 32-bit FP numbers.
The first argument is the accumulation register, the second and the third
are the multiplication factors. The first 2 arguments are quad-word registers, the
third is the scalar value broadcast in the multiplication. The vld1.64 instruction
is the NEON load of contiguous 64-bit values: it can load up to 4 64-bit
registers (specified in the curly brackets), from the memory location starting
at the value in the GP register in square brackets together with an optional
alignment hint (in this case 128-bit alignment); the exclamation mark is used
to increment the value of the GP register by the number of bytes loaded.
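The quad-word registers q8 to q15 are used to hold an 8 × 4 sub-matrix of D, as
\[
\mathtt{q8} \leftarrow \begin{bmatrix} d_{00} \\ d_{10} \\ d_{20} \\ d_{30} \end{bmatrix}, \quad
\mathtt{q9} \leftarrow \begin{bmatrix} d_{01} \\ d_{11} \\ d_{21} \\ d_{31} \end{bmatrix}, \quad
\mathtt{q10} \leftarrow \begin{bmatrix} d_{02} \\ d_{12} \\ d_{22} \\ d_{32} \end{bmatrix}, \quad
\mathtt{q11} \leftarrow \begin{bmatrix} d_{03} \\ d_{13} \\ d_{23} \\ d_{33} \end{bmatrix}
\]
\[
\mathtt{q12} \leftarrow \begin{bmatrix} d_{40} \\ d_{50} \\ d_{60} \\ d_{70} \end{bmatrix}, \quad
\mathtt{q13} \leftarrow \begin{bmatrix} d_{41} \\ d_{51} \\ d_{61} \\ d_{71} \end{bmatrix}, \quad
\mathtt{q14} \leftarrow \begin{bmatrix} d_{42} \\ d_{52} \\ d_{62} \\ d_{72} \end{bmatrix}, \quad
\mathtt{q15} \leftarrow \begin{bmatrix} d_{43} \\ d_{53} \\ d_{63} \\ d_{73} \end{bmatrix}
\]
The quad-word registers q0 to q5 are used to hold elements of A and B for two
iterations, as
\[
\mathtt{q0} \leftarrow \begin{bmatrix} b_{0,k+0} \\ b_{1,k+0} \\ b_{2,k+0} \\ b_{3,k+0} \end{bmatrix}, \quad
\mathtt{q1} \leftarrow \begin{bmatrix} b_{0,k+1} \\ b_{1,k+1} \\ b_{2,k+1} \\ b_{3,k+1} \end{bmatrix},
\]
\[
\mathtt{q2} \leftarrow \begin{bmatrix} a_{0,k+0} \\ a_{1,k+0} \\ a_{2,k+0} \\ a_{3,k+0} \end{bmatrix}, \quad
\mathtt{q3} \leftarrow \begin{bmatrix} a_{0,k+1} \\ a_{1,k+1} \\ a_{2,k+1} \\ a_{3,k+1} \end{bmatrix},
\]
\[
\mathtt{q4} \leftarrow \begin{bmatrix} a_{4,k+0} \\ a_{5,k+0} \\ a_{6,k+0} \\ a_{7,k+0} \end{bmatrix}, \quad
\mathtt{q5} \leftarrow \begin{bmatrix} a_{4,k+1} \\ a_{5,k+1} \\ a_{6,k+1} \\ a_{7,k+1} \end{bmatrix}.
\]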
The quad-word registers q6 and q7 are not employed.
5.3.1.3 Results
The test machine is a development board called Wandboard Quad. The proces-
sor is the Freescale i.MX6 SoC, that has a quad-core Cortex A9 running at 1
GHz and 1 MB L2 cache. In this thesis, only single-thread code is considered.
As shown in Figure 5.5, the dgemm and sgemm kernels attain a best performance
respectively of 0.95 Gflops (95% of full FP throughput) and 2.74 Gflops (68%).
As a comparison, OpenBLAS attains a best performance respectively of 0.92
Gflops (92%) and 1.51 Gflops (38%). In single precision, the much lower
performance of OpenBLAS is mainly due to the fact that it does not make use of
the vector NEON unit (since in Cortex A9 it is not fully IEEE compliant) but
rather of the scalar VFP unit.
Figure 5.5: Performance test of gemm kernel for square matrices of size n×n,
n ∈ [4, 300], on an ARM Cortex A9, code compiled with gcc 4.6.3.
Peak performance in double (single) precision is 1 · 1 = 1 Gflops
(4 · 1 = 4 Gflops). Panels (Gflops against n): (a) Cortex A9 dgemm,
HPMPC_kernel_4x4 and OpenBLAS; (b) Cortex A9 sgemm, HPMPC_kernel_8x4 and
OpenBLAS.
5.3.2 ARM Cortex A15
The Cortex A15 is the highest performing 32-bit processor designed by ARM.
Originally designed for use in servers, it has been on the market since 2012.
It can use up to 1 TB of memory thanks to the 40-bit Large Physical Address
Extensions. It can be found in high-end smartphones and tablets or in laptops,
either alone or in big.LITTLE configuration with the lower-power Cortex A7.
The Cortex A15 is a multicore design, with up to 4 coherent cores per cluster,
and up to 4 clusters per physical chip. The Cortex A15 micro-architecture is
superscalar with an issue capability of 3 instructions per clock cycle (and, as an
improvement over the Cortex A9, the Cortex A15 can co-issue FP instructions
and load instructions). It supports speculative and out-of-order execution, and
register renaming on both GP and FP registers. Each core has 32 KB 2-way
associative L1 data and instruction caches, and there can be up to 4 MB of
integrated L2 cache per cluster. The cache line size is 64 bytes, twice
as much as in the Cortex A9.
The Cortex A15 has two FP units per core: VFPv4 and NEONv2. The main
difference with respect to the VFPv3 and NEON units present in the older Cortex
A9 is the presence of fused-multiply-accumulate instructions.
5.3.2.1 dgemm
In double precision, the scalar VFPv4 unit has to be used. The proposed
code is the same as for the Cortex A9 (with the difference that only half the
prefetch instructions are needed, since the cache line size is twice that of the
Cortex A9), even if a version using the fused-multiply-accumulate may be written.
Compared to Cortex A9, in the Cortex A15 the multiply-accumulate instruction
has a throughput of 1. Furthermore, a FP instruction and a load instruction can
be co-issued in the same clock cycle: thus the full (and actual) FP throughput is
doubled compared to the Cortex A9. The FP datapath is 128-bit wide, doubled
compared to Cortex A9.
5.3.2.2 sgemm
Also in the case of Cortex A15, in single precision it is possible to choose between
VFP and NEON units. Since the latter is higher performing, a version of the
sgemm kernel using NEON instructions will be presented.
In Cortex A15, the NEON instruction set is implemented in the NEONv2 version,
which also has the more accurate fused-multiply-accumulate instruction.
However, this is implemented only in the vector-times-vector form, and tuning
experiments show that it has a lower throughput compared to the
multiply-accumulate instruction. Therefore the latter is chosen, since it has a
higher throughput (1 instruction per clock cycle), and is implemented also in the
vector-times-scalar form.
In the Cortex A15, there is a set of 16 quad-word registers (q0-q15), that can be
seen as a set of 32 double-word registers (d0-d31). As an improvement over Cortex
A9, in the Cortex A15 there is no penalty in mixing VFP and NEON
instructions, nor in mixing NEON and load instructions. Furthermore, tuning
experiments show that it can co-issue a NEON multiply-accumulate and a VFP
load, but it cannot co-issue a NEON multiply-accumulate and a NEON load.
The best performing code is thus obtained by interleaving NEON
multiply-accumulate instructions and VFP loads.
The fact that there is no penalty in mixing multiply-accumulate and
load instructions implies that it is possible to use even more registers to hold
a sub-matrix of C, reducing in this way the number of memory operations. In
particular, a 12×4 kernel is found to have better performance than an 8×4 kernel.
In the 12×4 kernel, 12 quad-word registers (q4-q15) are used as accumulation
registers to hold a 12×4 sub-matrix of C, while the 4 remaining registers (q0-q3)
are used to hold three 4-wide vectors from the A matrix and one from the B
matrix.
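The reduction in memory operations can be quantified with a simple count, sketched below in C. It assumes that one iteration over k of an m × n kernel loads m elements of A and n elements of B and performs 2·m·n flops; the function name loads_per_flop is hypothetical.

```c
/* Number of loaded elements per flop for an m x n kernel: one iteration
 * over k loads m elements of A and n elements of B, and performs one
 * multiplication and one addition for each of the m*n accumulators. */
double loads_per_flop(int m, int n)
{
    return (double)(m + n) / (2.0 * m * n);
}
```

Under this count, the 8×4 kernel loads 12 elements every 64 flops (0.1875 loads per flop), while the 12×4 kernel loads 16 elements every 96 flops (about 0.167 loads per flop).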
The optimized code of two consecutive iterations over k is
1: vmla.f32 q4, q1, d0[0]
2: vldr d6, [r2, #0]
3: vmla.f32 q5, q1, d0[1]
4: vldr d7, [r2, #8]
5: vmla.f32 q6, q1, d1[0]
6: vmla.f32 q7, q1, d1[1]
7: vldr d2, [r0, #0]
8: vmla.f32 q8, q2, d0[0]
9: vldr d3, [r0, #8]
10: vmla.f32 q9, q2, d0[1]
11: vmla.f32 q10, q2, d1[0]
12: vmla.f32 q11, q2, d1[1]
13: vldr d4, [r3, #0]
14: vmla.f32 q12, q3, d0[0]
15: vldr d5, [r3, #8]
16: vmla.f32 q13, q3, d0[1]
17: vldr d0, [r1, #0]
18: vmla.f32 q14, q3, d1[0]
19: vmla.f32 q15, q3, d1[1]
20: vldr d1, [r1, #8]
21: vmla.f32 q4, q1, d4[0]
22: vldr d6, [r2, #16]
23: vmla.f32 q5, q1, d4[1]
24: vldr d7, [r2, #24]
25: vmla.f32 q6, q1, d5[0]
26: vmla.f32 q7, q1, d5[1]
27: vldr d2, [r0, #16]
28: vmla.f32 q8, q0, d4[0]
29: vldr d3, [r0, #24]
30: vmla.f32 q9, q0, d4[1]
31: vmla.f32 q10, q0, d5[0]
32: vmla.f32 q11, q0, d5[1]
33: vldr d0, [r3, #16]
34: vmla.f32 q12, q3, d4[0]
35: vldr d1, [r3, #24]
36: vmla.f32 q13, q3, d4[1]
37: vldr d4, [r1, #16]
38: vmla.f32 q14, q3, d5[0]
39: vmla.f32 q15, q3, d5[1]
40: vldr d5, [r1, #24]
with lines 1-20 corresponding to the ﬁrst iteration over k and lines 21-40 corre-
sponding to the second iteration over k. The instruction vmla.f32 is the NEON
4-wide vector multiply-accumulate of single-precision FP numbers, and the in-
struction vldr is the VFP load. Notice that the role of registers q0 (i.e. d0-d1)
and q2 (i.e. d4-d5) is inverted in the second iteration, compared to the ﬁrst one:
this choice is made to separate as much as possible the loading of registers and
their following use.
5.3.2.3 Results
The test processor is one core of the NVIDIA Tegra K1 SoC, running at 2.3
GHz. As shown in Figure 5.6, the dgemm and sgemm kernels attain a best
performance respectively of 4.41 Gﬂops (95.8% of full FP throughput) and 16.33
Gﬂops (88.7%). Similarly to the Cortex A9 case, in double precision OpenBLAS
gives a similar performance, but in single precision it is much slower, since it
does not make use of the vector unit NEON.
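The percentages quoted above follow from the ratio between measured and peak performance, where the peak is the product of flops per cycle and clock frequency. A minimal C sketch of this arithmetic is given below; the function names are hypothetical.

```c
/* Peak performance in Gflops: flops per cycle times clock in GHz. */
double peak_gflops(double flops_per_cycle, double ghz)
{
    return flops_per_cycle * ghz;
}

/* Fraction of the peak attained by a measured performance in Gflops. */
double efficiency(double measured_gflops, double flops_per_cycle, double ghz)
{
    return measured_gflops / peak_gflops(flops_per_cycle, ghz);
}
```

With 2 flops per cycle in double precision and 8 in single precision at 2.3 GHz, the peaks are 4.6 and 18.4 Gflops, so that 4.41 and 16.33 Gflops correspond to about 95.9% and 88.8% of full throughput.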
Figure 5.6: Performance test of gemm kernel for square matrices of size n×n,
n ∈ [4, 300], on an ARM Cortex A15, code compiled with gcc 4.?.
Peak performance in double (single) precision is 2 · 2.3 = 4.6 Gflops
(8 · 2.3 = 18.4 Gflops). Panels (Gflops against n): (a) Cortex A15 dgemm,
HPMPC_kernel_4x4 and OpenBLAS; (b) Cortex A15 sgemm, HPMPC_kernel_12x4 and
OpenBLAS.
5.3.3 ARM Cortex A7
The Cortex A7 is a low-power 32-bit processor designed by ARM. On the
market since 2013, it can be found alone in low-end smartphones and tablets,
or in combination with the Cortex A15 (big.LITTLE technology) in high-end
devices. It is fully feature compatible with the Cortex A15, but the design focus
is on low-power consumption instead of high-performance.
The Cortex A7 micro-architecture is partially superscalar, being able to
double-issue only a few combinations of instructions. It supports in-order
execution, without any register renaming. There can be up to 8 cache-coherent
cores per cluster, with 32 KB of both instruction and data L1 cache per core, and
up to 1 MB integrated L2 cache. The cache line size is 64 bytes for L1 data and L2
caches, and 32 bytes for L1 instruction cache. It has the VFPv4 and NEONv2
FP units, and the FP datapath is 64-bit wide (same as Cortex A9).
5.3.3.1 dgemm
In double precision, the dgemm code is exactly the same as for Cortex
A15. However, Cortex A7 can only perform a double-precision MLA every 4
cycles: as a consequence the performance-per-cycle is half that of Cortex A9 and a
quarter that of Cortex A15, with a full FP throughput of 0.5 flops per cycle.
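The full FP throughput figures quoted for the three ARMv7A cores follow from one relation: each multiply-accumulate counts as 2 flops per vector lane, divided by the number of cycles between two consecutive issues. A small C sketch (the function name flops_per_cycle is an assumption made for the example):

```c
/* Full FP throughput in flops per cycle of a multiply-accumulate unit:
 * each MLA performs 2 flops per vector lane, and one MLA can be issued
 * every cycles_per_mla cycles. */
double flops_per_cycle(int lanes, int cycles_per_mla)
{
    return 2.0 * lanes / cycles_per_mla;
}
```

In double precision (1 lane) this gives 0.5 flops per cycle for the Cortex A7 (one MLA every 4 cycles), 1 for the Cortex A9 (every 2 cycles) and 2 for the Cortex A15 (every cycle).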
5.3.3.2 sgemm
In single precision, the Cortex A7 can perform a VFP MLA every cycle, or a
NEON 4-wide MLA every 4 cycles (NEON can effectively process 32 bits per
cycle): so the full FP throughput is the same for VFP and NEON, namely 2
flops per cycle. However, the Cortex A7 shows the same performance penalties
as the Cortex A9, and the penalty in mixing FP loads and MLA applies to both
VFP and NEON. In practice, using NEON it is possible to have a slightly better
performance, so the sgemm kernel for Cortex A7 is the same as for Cortex A9,
with the difference that it uses half the prefetch instructions (the cache
line being twice as long).
Figure 5.7: Performance test of gemm kernel for square matrices of size n×n,
n ∈ [4, 300], on an ARM Cortex A7, code compiled with gcc 4.?.
Peak performance in double (single) precision is 0.5 · 1 = 0.5 Gflops
(2 · 1 = 2 Gflops). Panels (Gflops against n): (a) Cortex A7 dgemm,
HPMPC_kernel_4x4 and OpenBLAS; (b) Cortex A7 sgemm, HPMPC_kernel_8x4 and
OpenBLAS.
5.3.3.3 Results
Our test machine is the Cubieboard 2, a development board equipped with the
Allwinner A20 SoC: the CPU is a dual-core Cortex A7 @ 1.0 GHz, with 512 KB
of L2 cache. The full FP throughput is equal to 0.5 Gflops in double precision
and 2 Gflops in single precision. The proposed dgemm routine (Figure 5.7a)
reaches 92% of the full FP throughput, while the sgemm routine (Figure 5.7b)
reaches 72%. The dgemm routine in OpenBLAS has similar performance,
while the sgemm routine performs slightly worse: again, this is mainly due to
the use of the VFP unit in place of the NEON one.
5.4 ARMv8A
The ARMv8A is a 64-bit RISC architecture. It is the 64-bit version of the
ARMv7A ISA, and it introduces several enhancements.
There are two main execution states: AArch64 and AArch32. The former supports
the ARMv8A ISA, while the latter supports the old ARM and Thumb
ISAs of ARMv7A. Therefore, an ARMv8A processor can execute ARMv7A code in
AArch32 mode.
The ARMv8A ISA supports 31 64-bit GP registers, plus an additional register
that acts as zero register. The registers are called 'X' registers, and the lower
half of them is accessed as the 'W' registers.
On the FP side, both VFPv4 and NEONv2 now operate on a set of 32 128-bit
FP registers (called 'V' registers). Differently from the ARMv7A
architecture, the 64-bit 'D' registers are now the lower half of the 'V' registers,
and the 32-bit 'S' registers are the lower half of the 'D' registers. Therefore, 32
FP registers are available, regardless of the width of the instructions. This is
meant to simplify the implementation of processors, but removes the possibility
to operate on all single elements of the wide 128-bit registers. NEON becomes
fully IEEE compliant, and adds support for double-precision FP numbers.
5.4.1 Cortex A57
The Cortex A57 is essentially the Cortex A15 with the ARMv8A ISA. Most of
the architectural features are unchanged: the core is out-of-order, superscalar
(that can decode 3 instructions per clock cycle) with speculative execution and
register renaming. The execution ports and the pipelines are mostly unchanged
too. The most relevant change is that the L1 instruction cache grows to 48 KB and
becomes 3-way associative.
The processor is still able to perform a 128-bit wide FMA every clock cycle,
with a latency of 10 clock cycles. However, NEON now supports also double-
precision FP numbers, meaning that 4 single-precision or 2 double-precision
FMA can be performed every clock cycle. Therefore, in double precision the
full FP throughput is doubled with respect to the Cortex A15, while in single
precision it is unchanged.
Even if the processor is out-of-order, the code is written in assembly. This is
due to the fact that the compiler gcc 4.9 still does not produce high enough
quality code (even if the situation should improve with subsequent versions).
5.4.1.1 dgemm
As an improvement over Cortex A15, the vector NEON instructions can be used
also with double-precision FP numbers. Since the FMA latency is 10 clock
cycles and one FMA can be issued every cycle, at least 10 V registers must be
employed as independent accumulators to keep the pipeline full. Therefore, a
natural dgemm kernel is 8×4 (with bs = 4), which makes use of 16 V registers to
hold an 8×4 sub-matrix of the result matrix. The remaining 16 registers are more
than enough to hold the A and B elements for two consecutive iterations over k,
allowing for an aggressive prefetch. Additionally, the use of the prefetch
instruction prfm (with hint for persistent data in L1) is found to slightly
improve performance.
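The register count behind this choice can be checked with two small helpers, sketched in C under the assumption that the FMA unit is fully pipelined; the function names are hypothetical.

```c
/* Minimum number of independent accumulation registers needed to keep a
 * pipelined FMA unit busy: with `latency` cycles per FMA and `per_cycle`
 * FMAs issued each cycle, latency * per_cycle results are in flight. */
int min_accumulators(int latency, int per_cycle)
{
    return latency * per_cycle;
}

/* Accumulation registers used by an m x n kernel whose vectors hold
 * `lanes` elements (m is assumed to be a multiple of lanes). */
int kernel_accumulators(int m, int n, int lanes)
{
    return (m / lanes) * n;
}
```

With a latency of 10 cycles and one FMA per cycle, at least 10 accumulators are needed: a 4×4 kernel on 2-wide vectors would provide only 8, while the 8×4 kernel provides 16.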
The code for one iteration over k looks like
ld1 {v6.2d, v7.2d}, [x1], #32
fmla v16.2d, v0.2d, v4.2d[0]
ld1 {v10.2d, v11.2d}, [x3], #32
fmla v17.2d, v1.2d, v4.2d[0]
ld1 {v8.2d, v9.2d}, [x11], #32
fmla v24.2d, v2.2d, v4.2d[0]
prfm PLDL1KEEP, [x3, #64]
fmla v25.2d, v3.2d, v4.2d[0]
prfm PLDL1KEEP, [x1, #64]
fmla v18.2d, v0.2d, v4.2d[1]
prfm PLDL1KEEP, [x11, #64]
fmla v19.2d, v1.2d, v4.2d[1]
fmla v26.2d, v2.2d, v4.2d[1]
fmla v27.2d, v3.2d, v4.2d[1]
fmla v20.2d, v0.2d, v5.2d[0]
fmla v21.2d, v1.2d, v5.2d[0]
fmla v28.2d, v2.2d, v5.2d[0]
fmla v29.2d, v3.2d, v5.2d[0]
fmla v22.2d, v0.2d, v5.2d[1]
fmla v23.2d, v1.2d, v5.2d[1]
fmla v30.2d, v2.2d, v5.2d[1]
fmla v31.2d, v3.2d, v5.2d[1]
where fmla is the FMA instruction, ld1 is the register load instruction and prfm
is the prefetch hint instruction. The suﬃx .2d after the vector name speciﬁes
the number and size of the vector elements (2 double-precision FP numbers per
vector).
5.4.1.2 sgemm
In single precision, the code is essentially the same as the Cortex A15, translated
to the new ISA and exploiting the availability of more registers. Therefore, the
sgemm kernel is 12× 4 (with bs = 4), meaning that 12 V registers are employed
to hold a 12 × 4 sub-matrix of the result matrix. The remaining 20 registers
are more than enough to hold the A and B elements for 4 consecutive iterations
over k, allowing for an aggressive prefetch.
Also in the single-precision case the use of prefetch instruction prfm (with hint
for persistent data in L1) is found to slightly improve performance.
The code for one iteration over k looks like
fmla v16.4s, v0.4s, v6.4s[0]
ld1 {v8.4s, v9.4s}, [x1], #32
fmla v17.4s, v0.4s, v6.4s[1]
ld1 {v14.4s, v15.4s}, [x3], #32
fmla v18.4s, v0.4s, v6.4s[2]
ld1 {v10.4s, v11.4s}, [x11], #32
fmla v19.4s, v0.4s, v6.4s[3]
ld1 {v12.4s, v13.4s}, [x14], #32
fmla v20.4s, v2.4s, v6.4s[0]
prfm PLDL1KEEP, [x1, #64]
fmla v21.4s, v2.4s, v6.4s[1]
prfm PLDL1KEEP, [x3, #64]
fmla v22.4s, v2.4s, v6.4s[2]
prfm PLDL1KEEP, [x11, #64]
fmla v23.4s, v2.4s, v6.4s[3]
prfm PLDL1KEEP, [x14, #64]
fmla v24.4s, v4.4s, v6.4s[0]
fmla v25.4s, v4.4s, v6.4s[1]
fmla v26.4s, v4.4s, v6.4s[2]
fmla v27.4s, v4.4s, v6.4s[3]
The instruction names are the same as in double precision. The suffix .4s after
the vector name specifies the number and size of the vector elements (4
single-precision FP numbers per vector).
Given the large number of registers, other sgemm kernel configurations are
possible, such as 8×8 (with both bs = 4 and bs = 8); however, the 12×4
configuration is preferred to ease the reuse of code with the Cortex A15.
Figure 5.8: Performance test of gemm kernel for square matrices of size n×n,
n ∈ [4, 300], on an ARM Cortex A57, code compiled with gcc
4.?. Peak performance in double (single) precision is 4 · 2.15 = 8.6
Gflops (8 · 2.15 = 17.2 Gflops). Panels (Gflops against n): (a) Cortex A57
dgemm, HPMPC_kernel_8x4; (b) Cortex A57 sgemm, HPMPC_kernel_12x4.
5.4.1.3 Results
The test processor is one core of the NVIDIA Tegra X1 SoC (running at 2.15
GHz) in the Shield TV. The memory is 3 GB of LPDDR4-3200 RAM giving
25.6 GB of bandwidth. As shown in Figure 5.8, both in double and in single
precision the best performance is of about 87% of full FP throughput. The
ﬁgure is single precision is essentially identical to the Cortex A15 one.
At the time of writing, the latest stable release of OpenBLAS still does not
support the Cortex A57. Support is being added to the latest code version on
GitHub, but the code is buggy and does not compile successfully.
5.5 PowerPC
PowerPC (standing for Performance Optimization With Enhanced RISC - Performance Computing) is a RISC ISA created in 1991 by the Apple-IBM-Motorola
alliance. The ISA has evolved over time, and in 2006 it was renamed Power
ISA. PowerPC is largely based on the earlier POWER ISA developed by IBM.
5.5.1 PowerPC 603e
The PowerPC 603e is part of the second generation of processors implementing
the PowerPC ISA. The PowerPC 603 is the first processor implementing the
complete 32-bit PowerPC architecture. It is designed for low cost and low power
consumption. It features 8 KB of L1 data and instruction caches. The PowerPC
603e is an enhanced version, featuring 16 KB of L1 data and instruction caches.
The PowerPC 603e core is superscalar, and it can issue and retire up to 3 instructions per clock cycle, while executing 5 instructions on 5 execution units. It
supports out-of-order execution and register renaming. There is an FPU operating on 32 64-bit registers and implementing a fused multiply-add (FMA)
instruction. The FPU is fully IEEE 754-compliant for both single- and double-precision operations. The PowerPC 603e does not have a SIMD unit.
Each FP register can hold either a single- or a double-precision FP number. Therefore,
the implementation scheme is analogous for single and double precision: out of
the 32 FP registers, 16 are used as accumulation registers to hold a 4 × 4 sub-matrix of the result matrix D. The remaining registers are used to hold and to
aggressively prefetch elements from A and B.
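The register blocking described above can be sketched in Python as follows. This is an illustrative model and not the HPMPC code: the function name, the flat-list layout (the 4 entries for a given k are contiguous, i.e. a panel of height 4) and the nested-list return value are assumptions made for the example. The 16 accumulators stand for the 16 accumulation registers, and each iteration over k loads 4 elements of A and 4 of B and performs 16 multiply-add operations.

```python
def gemm_nt_kernel_4x4(kmax, A, B):
    """Return the 4x4 block D = A * B^T.

    A and B are flat lists holding 4 x kmax blocks, stored so that the
    4 entries for a given k are contiguous: A[4*k + i] is element (i, k).
    """
    acc = [[0.0] * 4 for _ in range(4)]      # 16 accumulators (a 4x4 block of D)
    for k in range(kmax):
        a = A[4 * k: 4 * k + 4]              # 4 loads from A
        b = B[4 * k: 4 * k + 4]              # 4 loads from B
        for j in range(4):
            for i in range(4):
                acc[i][j] += a[i] * b[j]     # each a[i] and b[j] is reused 4 times
    return acc


# Example with kmax = 2
A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
B = [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
D = gemm_nt_kernel_4x4(2, A, B)
```

Per iteration over k, the 8 loaded values feed 16 multiply-adds, which is the register-level reuse that lowers the memory bandwidth requirement.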
5.5.1.1 dgemm
In double precision, an FMA instruction can be issued every other clock cycle.
Since an FMA counts as 2 flops, the full FP throughput is 1 flop per cycle.
Generic C code compiled with the gcc compiler seems to work rather well with
the PowerPC 603e processor. However, an optimized assembly version has also
been developed, since it may be useful in case of limited optimization options
available on embedded devices.
The optimized code for an iteration over k is
fmadd 0,16,20,0
fmadd 1,17,20,1
fmadd 2,18,20,2
fmadd 3,19,20,3
lfd 20,64(%3)
fmadd 4,16,21,4
fmadd 5,17,21,5
fmadd 6,18,21,6
fmadd 7,19,21,7
lfd 21,72(%3)
fmadd 8,16,22,8
fmadd 9,17,22,9
fmadd 10,18,22,10
fmadd 11,19,22,11
lfd 22,80(%3)
fmadd 12,16,23,12
lfd 16,64(%2)
fmadd 13,17,23,13
lfd 17,72(%2)
fmadd 14,18,23,14
lfd 18,80(%2)
fmadd 15,19,23,15
lfd 23,88(%3)
lfd 19,88(%2)
This code closely resembles the code for other RISC ISAs with a scalar FP unit,
such as the code for double-precision ARMv7A processors. The main difference
(apart from the instruction name) is the fact that PowerPC FMA instructions
take 4 operands.
5.5.1.2 sgemm
In single precision, an FMA instruction can be issued every clock cycle, twice as
often as in double precision. Therefore the full FP throughput is 2 flops per
cycle.
Generic C code compiled with the gcc compiler seems to work rather well with
the PowerPC 603e processor. However, an optimized assembly version has also
been developed, since it may be useful in case of limited optimization options
available on embedded devices.
The optimized code for an iteration over k is
fmadds 0,16,20,0
fmadds 1,17,20,1
fmadds 2,18,20,2
fmadds 3,19,20,3
lfs 20,32(%3)
fmadds 4,16,21,4
fmadds 5,17,21,5
fmadds 6,18,21,6
fmadds 7,19,21,7
lfs 21,36(%3)
fmadds 8,16,22,8
fmadds 9,17,22,9
fmadds 10,18,22,10
fmadds 11,19,22,11
lfs 22,40(%3)
fmadds 12,16,23,12
lfs 16,32(%2)
fmadds 13,17,23,13
lfs 17,36(%2)
fmadds 14,18,23,14
lfs 18,40(%2)
fmadds 15,19,23,15
lfs 23,44(%3)
lfs 19,44(%2)
The code is basically identical to the double precision case.
5.5.1.3 Results
The PowerPC target platform is the ABB AC500 PM592-ETH programmable
logic controller (PLC), which has a Freescale MPC8247CVRTIEA microcontroller (SoC). The core is the G2_LE implementation of the MPC603e microprocessor. The test PLC is equipped with 4 MB of RAM for user program memory
and 4 MB of integrated user data memory. In the tested PLC configuration, it is
not possible to link a library, and therefore it is not possible to test the OpenBLAS library. The proposed implementation scheme for linear algebra is much
simpler, and therefore it can be embedded in the main program without
the need for library linking. The test range is limited to matrix sizes n between 4
and 64 (instead of up to 300 as with the other test machines), because
performing tests on the given PLC is much more time consuming.
The G2_LE core is a low-power (1.5 W) 32-bit RISC processor running at 400
MHz. It is equipped with independent on-chip, 16 KB, 4-way set-associative,
physically addressed L1 caches for instructions and data, and also on-chip instruction and data memory management units (MMUs). The cache line size is
32 bytes.
In double precision (Figure 5.9a), the full FP throughput is 0.4 Gflops. A
triple-loop version can attain good performance only for very small matrices,
and performance drops significantly when the data has to be fetched from main
memory (especially for matrix sizes that are multiples of 32, due to the 4-way associativity of the cache). The use of a 4 × 4 kernel gives a slight performance boost for
matrices fitting into cache but, more importantly, it helps considerably when
the memory footprint exceeds the cache size, since every element of A and B is used
4 times once in registers. Interestingly, the assembly-coded kernel does not improve performance: FMA and the large number of registers make optimization
easy, so gcc with -O2 already produces good code. Nevertheless, optimized
assembly code helps in case the overall code cannot be compiled with optimization flags. There are no advantages in using prefetch. The maximum performance
is 0.28 Gflops (70% of the full FP throughput).
In single precision (Figure 5.9b), the full FP throughput is 0.8 Gflops. The
results of the test are rather different from those in double precision. The first
impression is that the performance graphs are much flatter, without the typical
performance peak for data fitting in cache: the best attained performance is
0.349 Gflops (43.6% of the full FP throughput). Performance tests point toward
instruction fetching as the bottleneck: in fact, for n = 32, leaving only
FMAs in the kernel loop coded in assembly, the performance ramps up to 0.45
Gflops, but leaving only memory operations, the kernel execution time halves
again, so memory movement is not the bottleneck either. The G2_LE core
reference manual reports that the core can sustain 2 instruction fetches per
clock cycle, and that a memory load and an FMA can execute in parallel every clock
cycle. In practice, however, the fact that the combination of loads and FMAs is
slower than each of them alone is a strong argument that the core cannot co-issue
a load and an FMA. In this light, the performance gain of kernels compared
to the triple-loop version is due to the lower number of memory instructions, rather than
to reduced memory movement.
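The peak and efficiency figures quoted in these results follow from simple arithmetic, sketched here in Python. The helper names are assumptions made for the illustration; the numbers are those reported in the text (400 MHz clock, 1 flop per cycle in double precision and 2 flops per cycle in single precision).

```python
def peak_gflops(flops_per_cycle, clock_ghz):
    """Full FP throughput in Gflops."""
    return flops_per_cycle * clock_ghz


def efficiency_percent(measured_gflops, peak):
    """Measured performance as a percentage of the full FP throughput."""
    return 100.0 * measured_gflops / peak


peak_double = peak_gflops(1, 0.4)                     # 0.4 Gflops
peak_single = peak_gflops(2, 0.4)                     # 0.8 Gflops
eff_double = efficiency_percent(0.28, peak_double)    # about 70 %
eff_single = efficiency_percent(0.349, peak_single)   # about 43.6 %
```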
Figure 5.9: Performance test of the gemm kernel for square matrices of size n×n, n ∈ [4, 64], on a PowerPC 603e, code compiled with gcc 4.?. Panels: (a) PowerPC 603e dgemm and (b) PowerPC 603e sgemm, both showing Gflops versus n for HPMPC_kernel_4x4. Peak performance in double (single) precision is 1 · 0.4 = 0.4 Gflops (2 · 0.4 = 0.8 Gflops).
5.6 Conclusion
This chapter contains a collection of gemm kernels optimized for several widespread
micro-architectures, in both single and double precision. Many relevant details
about the micro-architectures are given, and the implementation choices are
justified using architectural details whenever possible. Table 5.1 summarizes
key features of the considered micro-architectures as well as gemm implementation parameters such as optimal kernel size and panel height in the panel-major
matrix format.
The performance of the gemm kernel on the test machines spans more than
two orders of magnitude, from roughly 0.3 Gflops for the dgemm kernel on
the PowerPC PLC to almost 100 Gflops for the sgemm kernel on the Haswell
laptop. The performance of the gemm kernel is a good indicator of (and an upper
bound on) the performance of all level 3 BLAS and LAPACK routines, and of
optimization algorithms implemented on top of them.
Table 5.1: Micro-architecture features related to the gemm implementation. (*) If no FMA instruction is available, the throughput is the higher of the multiplication instruction and addition instruction throughputs. ($) If no FMA instruction is available, the latency is the sum of the multiplication instruction and addition instruction latencies.

architecture                  precision  ISA        SIMD   (*) FMA     ($) FMA  flops   FP         kernel  bs
                                                    width  throughput  latency  /cycle  registers  size
Intel Bonnell (x86)           single     SSE        4      2           5+5      4       8          4×4     4
                              double     SSE2       1      2           5+5      1       8          2×2     2
Intel Core (x86_64)           single     SSE        4      1           4+3      8       16         8×4     4
                              double     SSE3       2      1           5+3      4       16         4×4     4
Intel Nehalem (x86_64)        single     SSE        4      1           4+3      8       16         8×4     4
                              double     SSE3       2      1           5+3      4       16         4×4     4
Intel Sandy-Bridge (x86_64)   single     AVX        8      1           5+3      16      16         16×4    8
                              double     AVX        4      1           5+3      8       16         8×4     4
Intel Haswell (x86_64)        single     AVX2-FMA3  8      1/2         5        32      16         24×4    8
                              double     AVX2-FMA3  4      1/2         5        16      16         12×4    4
Intel Skylake (x86_64)        single     AVX2-FMA3  8      1/2         4        32      16         24×4    8
                              double     AVX2-FMA3  4      1/2         4        16      16         12×4    4
AMD K10 (x86_64)              single     SSE        4      1           4+4      8       16         8×4     4
                              double     SSE3       2      1           4+4      4       16         4×4     4
ARM Cortex A9 (ARMv7A)        single     NEON       4      2           11(10)   4       16         8×4     4
                              double     VFPv3      1      2           9        1       32         4×4     4
ARM Cortex A15 (ARMv7A)       single     NEONv2     4      1           10?      8       16         12×4    4
                              double     VFPv4      1      1           9?       2       32         4×4     4
ARM Cortex A7 (ARMv7A)        single     NEONv2     4      4           ?        2       16         8×4     4
                              double     VFPv4      1      4           ?        0.5     32         4×4     4
ARM Cortex A57 (ARMv8A)       single     NEONv2     4      1           10       8       32         12×4    4
                              double     NEONv2     2      1           10       4       32         8×4     4
PowerPC 603e (PowerPC)        single     PowerPC    1      1           ?        2       32         4×4     4
                              double     PowerPC    1      2           ?        1       32         4×4     4
Chapter 6
Summary and considerations about code generation
The first part of the thesis presents dense linear algebra implementation techniques specially tailored to embedded optimization (and therefore to small- and
medium-size matrices).
The innermost loop in each linear algebra routine is implemented as a separate
kernel, hand-optimized for the specific architecture (ISA and micro-architecture).
In the implementation of the kernel, blocking for registers is employed to hide
the latency of FP instructions and to reuse matrix elements once in registers (decreasing the memory bandwidth requirements). Blocking for the different cache
levels is not employed, given the focus on small-scale performance. The kernel
code explicitly targets the specific machine instructions by means of inline assembly (for in-order architectures) or intrinsics (for out-of-order architectures).
Both give rather fine control over the instruction choice. Inline assembly
also gives fine control over instruction scheduling and register allocation,
while intrinsics leave these aspects to the compiler (as they are not
critical in out-of-order architectures). In the case of the gemm routine, this kernel
closely resembles the inner kernel proposed by BLIS [86] as a simplification of
GotoBLAS's kernel [44].
The key difference with respect to optimized BLAS libraries is the use
of a special matrix format (named panel-major matrix format) that resembles
the innermost packing structure of the data buffer employed in optimized BLAS
routines. Namely, the data is stored in the same order as it is streamed by the
gemm kernel. This matrix format is assumed as the default matrix format for all
operands, and therefore no conversion of matrices from standard column-major or row-major order needs to be done on-line. This greatly improves
the performance for small matrices. However, this approach also has a few
drawbacks. Since the panel-major matrix format is assumed as the default
matrix format, all matrices need to have the same panel height bs. This poses
limitations on the choice of the optimal gemm kernel size: the number of rows
and columns of the result sub-matrix processed by the kernel must be an integer
multiple of the panel height. If the processed sub-matrix is not square, e.g. if it has
size 8 × 4 with panel height 4, two panels from A and one from B
are streamed to compute C ← C + A · B^T, while in optimized BLAS libraries
this would be optimally handled by employing 8 as panel height for A and 4
as panel height for B. Furthermore, there are difficulties when operating on
sub-matrices, which in the current implementation in HPMPC can be done only to a
very small extent.
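As an illustration of the idea, the sketch below packs a matrix into a panel-major layout and computes the storage offset of an element. It assumes the layout in which the matrix is cut into horizontal panels of bs rows, each panel is stored column by column, and the number of rows is padded with zeros up to a multiple of bs. The function names and the nested-list input are choices made for this example and are not taken from HPMPC.

```python
def panel_major_index(i, j, bs, ncols):
    """Offset of element (i, j) when panels of bs rows are stored one after another."""
    return (i // bs) * bs * ncols + j * bs + (i % bs)


def pack_panel_major(M, bs):
    """Pack a matrix given as a list of rows into a flat panel-major list."""
    nrows, ncols = len(M), len(M[0])
    npanels = (nrows + bs - 1) // bs          # rows are padded up to a multiple of bs
    out = [0.0] * (npanels * bs * ncols)
    for i in range(nrows):
        for j in range(ncols):
            out[panel_major_index(i, j, bs, ncols)] = M[i][j]
    return out


# 3x2 matrix, panel height 2: two panels, the second one padded with a zero row
M = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
packed = pack_panel_major(M, 2)
```

With this layout, a kernel that walks along the columns of a panel reads bs consecutive entries at each step, which is the streaming order referred to above.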
A specialized kernel is employed in the implementation of all required BLAS
and LAPACK routines. This differs from the implementation of the LAPACK
library, where small matrices are handled by unblocked routines (implemented
employing level 2 BLAS, as e.g. for the potf2 routine for the Cholesky factorization) and large matrices are handled by blocked routines (implemented
employing level 3 BLAS and unblocked LAPACK routines, as e.g. for the potrf
routine for the Cholesky factorization). In the proposed approach, there is no
distinction between blocked and unblocked routines, and only a single routine
is implemented for each LAPACK operation, employing specialized kernels having the gemm kernel as their main loop. This means that LAPACK routines for
e.g. Cholesky factorization and triangular matrix inversion are implemented as if
they were level 3 BLAS routines. The main advantage of the proposed approach
is that it gives high performance also for small matrices, whereas standard LAPACK routines give good performance only for much larger matrices, since they
employ level 3 BLAS for the operations on sub-matrices.
When dealing with embedded devices, an important aspect is the code memory
size. This has not been discussed yet in this thesis: the reason for this is that
the code size in HPMPC is fixed, since it is not code-generated. Furthermore,
if needed, it is possible to trade off code size against performance, e.g. by employing
a smaller number of tailored kernels at the expense of a small reduction in
computational performance. However, this is not an automated process and
requires some hand tuning.
Code generation has not been considered, being orthogonal to the implemen-
tation techniques presented in this thesis. However, if considered beneﬁcial it
could be easily added, as brieﬂy discussed in the following tests.
6.1 Comparison with existing dense linear algebra implementations
This section adds HPMPC (the library containing the linear algebra and optimization routines implemented using the proposed techniques) to the comparisons made in Section 2.4. The test problem is still the computation of the lower
triangular Cholesky factor in double precision (given by the dpotrf LAPACK
routine).
More precisely, two additional tests are added, one for the Cholesky factorization
routine as implemented in the HPMPC library (i.e., in library form, working
for matrices of any size), and one for a code-generated version of the routine
in HPMPC. This code-generated version employs C preprocessing to remove
unnecessary branches and fix the size of the two outer loops. The linear algebra
kernels have not been changed, thus keeping them suitable for code reuse. Besides the flags to enable architecture-specific instructions, the resulting code is
compiled with the flags -O3 -funroll-loops (found to be beneficial in the case of
code-generated code), whereas the library version in HPMPC is compiled with
the flag -O2. Therefore, this routine employs code generation à la FORCES, in
addition to all implementation techniques proposed in this thesis and employed
in HPMPC.
Figure 6.1 contains the final comparisons of this first part of the thesis. The main
result is that the Cholesky factorization routine in HPMPC gives consistently
better performance than all alternatives, on both the tested Ivy-Bridge and
Haswell architectures. For very small matrices, the code-generated triple-loop
version gives slightly better performance than the library version in HPMPC,
with a break-even point around n = 8. However, the code-generated version of the
routine in HPMPC gives the best performance overall. For matrices of size
in the range 10-100, code generation does not add any noticeable performance
to the routine in HPMPC, which clearly stands out, being 2-6 times faster than
all other routines. For larger matrices, the performance of optimized BLAS
libraries approaches the performance of the routine in HPMPC (and eventually
exceeds it for even larger matrices), while triple-loop based routines (irrespective
of code generation) perform clearly worse.
As a conclusion, the implementation techniques presented in the first part of
this thesis can be employed to implement dense linear algebra routines giving
excellent performance in the range of matrix sizes of interest in embedded optimization. Their performance exceeds both the performance of optimized BLAS
libraries (especially for very small matrices) and the performance of the code-generated linear algebra routines currently employed in embedded optimization
(especially for matrices with size larger than a few tens).
Figure 6.1: Performance test for the LAPACK dpotrf routine on an Intel Core i7 3520M processor (Ivy Bridge micro-architecture, supporting the AVX ISA) and an Intel Core i7 4800MQ (Haswell micro-architecture, supporting the AVX2 and FMA ISAs). Panels (a) Ivy-Bridge and (b) Haswell show Gflops versus matrix size n; panels (c) Ivy-Bridge and (d) Haswell show time [s] versus matrix size n on logarithmic axes. Curves: HPMPC, HPMPC CodeGen, Netlib, Netlib CodeGen, MKL, OpenBLAS.
Code generation could be combined with the implementation techniques pre-
sented in the ﬁrst part of this thesis, but the performance improvements are
much smaller than in the case of triple-loop based linear algebra and generally
not worthwhile. Therefore, code generation is not necessary to achieve high-
performance for matrix sizes of interest in embedded optimization.
Part II
Algorithms for Unconstrained
MPC and MHE Problems

Chapter 7
Unconstrained MPC and MHE problem formulations
Part II deals with efficient solution methods for the unconstrained (linear) MPC
and MHE problems. These problems play a key role in optimal control. In
fact, many sub-problems in optimization algorithms for constrained and non-linear MPC and MHE problems can be written as instances of unconstrained
MPC and MHE problems. Therefore, the availability of fast routines for these
unconstrained MPC and MHE problems allows the development of a whole class
of efficient solvers for more complex optimization problems.
All solvers presented in later chapters in this part are based on the unconstrained
MPC and MHE problem formulations introduced in the remainder of this chapter.
7.1 Unconstrained MPC problem
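In this thesis, the following formulation of the unconstrained MPC problem
(also known in the literature as the extended linear-quadratic control problem [51, 52])
is considered
\begin{align}
\min_{u,x} \quad & \sum_{n=0}^{N-1} \frac{1}{2}
\begin{bmatrix} u_n \\ x_n \\ 1 \end{bmatrix}^T
\begin{bmatrix} R_n & S_n & r_n \\ S_n^T & Q_n & q_n \\ r_n^T & q_n^T & 0 \end{bmatrix}
\begin{bmatrix} u_n \\ x_n \\ 1 \end{bmatrix}
+ \frac{1}{2}
\begin{bmatrix} x_N \\ 1 \end{bmatrix}^T
\begin{bmatrix} Q_N & q_N \\ q_N^T & 0 \end{bmatrix}
\begin{bmatrix} x_N \\ 1 \end{bmatrix} \tag{7.1a} \\
\text{s.t.} \quad & x_{n+1} = A_n x_n + B_n u_n + b_n, \quad n = 0, \dots, N-1 \tag{7.1b} \\
& x_0 = \hat{x}_0 \tag{7.1c} \\
& 0 = D_N x_N + d_N \tag{7.1d}
\end{align}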
This is an instance of an equality-constrained quadratic program with a special
structure. The cost function (7.1a) contains N quadratic and linear terms in the
states and controls (stages 0 to N−1), plus a quadratic and linear terminal cost
on the states at stage N. The system dynamics are described by the linear state-space
system (7.1b). In the control problem, the state at stage 0 is fixed (7.1c).
Sufficient conditions for the existence and uniqueness of the solution of (7.1)
are that the matrices $\begin{bmatrix} R_n & S_n \\ S_n^T & Q_n \end{bmatrix}$ and $Q_N$ are positive semi-definite, and the
matrices $R_n$ are positive definite [31].
This formulation differs from the usual unconstrained MPC formulation in
the additional state equality constraint on the last stage (7.1d). This constraint can be
useful to enforce stability of the MPC formulation [63]. Some solution methods can deal with this terminal constraint at a very small increase in computational
cost.
The size of problem (7.1) is deﬁned by the quantities: nx (state vector size),
nu (input vector size), nd (number of equality constraints on the last stage), N
(horizon length).
7.1.1 Matrix formulation
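The unconstrained MPC problem can be reformulated in the matrix form
\begin{align}
\min_{\bar{y}} \quad & \frac{1}{2} \bar{y}^T \bar{H} \bar{y} + \bar{g}^T \bar{y} + \gamma \tag{7.2a} \\
\text{s.t.} \quad & \bar{\mathcal{A}} \bar{y} = \bar{b}_d \tag{7.2b}
\end{align}
where (choosing N = 3 to simplify the notation)
\begin{equation}
\begin{aligned}
\bar{y} &= \begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
 = \begin{bmatrix} \bar{u} \\ \bar{x} \end{bmatrix}, \qquad
\bar{H} = \begin{bmatrix}
R_0 & & & & & \\
& R_1 & & S_1 & & \\
& & R_2 & & S_2 & \\
& S_1^T & & Q_1 & & \\
& & S_2^T & & Q_2 & \\
& & & & & Q_3
\end{bmatrix}
 = \begin{bmatrix} \bar{R} & \bar{S} \\ \bar{S}^T & \bar{Q} \end{bmatrix}, \\
\bar{g} &= \begin{bmatrix} S_0 \hat{x}_0 + r_0 \\ r_1 \\ r_2 \\ q_1 \\ q_2 \\ q_3 \end{bmatrix}
 = \begin{bmatrix} \bar{r} \\ \bar{q} \end{bmatrix}, \qquad
\bar{b}_d = \begin{bmatrix} A_0 \hat{x}_0 + b_0 \\ b_1 \\ b_2 \\ d_3 \end{bmatrix}
 = \begin{bmatrix} \bar{b} \\ \bar{d} \end{bmatrix}, \\
\bar{\mathcal{A}} &= \begin{bmatrix}
-B_0 & & & I & & \\
& -B_1 & & -A_1 & I & \\
& & -B_2 & & -A_2 & I \\
& & & & & -D_3
\end{bmatrix}
 = \begin{bmatrix} -\bar{B} & \bar{A} \\ & \bar{D} \end{bmatrix}.
\end{aligned}
\tag{7.3}
\end{equation}
Notice that in the MPC case the matrix $\bar{A}$ (the state block of the constraint matrix) is clearly invertible, since it is lower
triangular with 1 on all diagonal elements.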
7.1.2 Optimality conditions
The KKT optimality conditions for the unconstrained MPC problem can be
found e.g. in [31].
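The KKT system associated with the unconstrained MPC problem can be written
in the band-diagonal form (for N = 3)
\begin{equation}
\left[\begin{array}{cccccccccc}
R_0 & B_0^T & & & & & & & & \\
B_0 & & -I & & & & & & & \\
& -I & Q_1 & S_1^T & A_1^T & & & & & \\
& & S_1 & R_1 & B_1^T & & & & & \\
& & A_1 & B_1 & & -I & & & & \\
& & & & -I & Q_2 & S_2^T & A_2^T & & \\
& & & & & S_2 & R_2 & B_2^T & & \\
& & & & & A_2 & B_2 & & -I & \\
& & & & & & & -I & Q_3 & D_3^T \\
& & & & & & & & D_3 &
\end{array}\right]
\begin{bmatrix} u_0 \\ \lambda_0 \\ x_1 \\ u_1 \\ \lambda_1 \\ x_2 \\ u_2 \\ \lambda_2 \\ x_3 \\ \lambda_3 \end{bmatrix}
=
\begin{bmatrix} -S_0 \hat{x}_0 - r_0 \\ -A_0 \hat{x}_0 - b_0 \\ -q_1 \\ -r_1 \\ -b_1 \\ -q_2 \\ -r_2 \\ -b_2 \\ -q_3 \\ -d_3 \end{bmatrix}
\tag{7.4}
\end{equation}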
7.2 Unconstrained MHE problem
The aim of the MHE problem is the reconstruction of the state vectors xn,
process noise vectors wn and measurement noise vectors vn, given the plant
model, the measurement vectors yn for a window of past time instants n =
0, 1, . . . , N, and an initial estimate $\bar{x}_0$ of the state vector at time 0 with the corresponding
covariance matrix $\tilde{P}_0$, which summarizes the contribution of the measurements
prior to time 0.
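The unconstrained MHE problem is traditionally written as the equality-constrained
quadratic program
\begin{equation}
\begin{aligned}
\min_{x_n, w_n, v_n} \quad & \sum_{n=0}^{N} \frac{1}{2} (v_n - \bar{v}_n)^T \tilde{R}_n^{-1} (v_n - \bar{v}_n)
 + \sum_{n=0}^{N-1} \frac{1}{2} (w_n - \bar{w}_n)^T \tilde{Q}_n^{-1} (w_n - \bar{w}_n) \\
 & + \frac{1}{2} (x_0 - \bar{x}_0)^T \tilde{P}_0^{-1} (x_0 - \bar{x}_0) \\
\text{s.t.} \quad & x_{n+1} = A_n x_n + G_n w_n + f_n, \quad n = 0, \dots, N-1 \\
 & y_n = C_n x_n + v_n
\end{aligned}
\tag{7.5}
\end{equation}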
In this formulation, the matrices whose inverses appear in the cost function have a precise
statistical interpretation: $\tilde{R}_n$ is the covariance matrix of the measurement noise
vector vn, and $\tilde{Q}_n$ is the covariance matrix of the process noise vector wn. The
vectors $\bar{v}_n$ and $\bar{w}_n$ are the expected values of the measurement and process
noises.
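In this thesis, the following more general formulation of the MHE problem is
considered
\begin{align}
\min_{x_n, u_n} \quad & \sum_{n=0}^{N-1} \frac{1}{2}
\begin{bmatrix} u_n \\ x_n \\ 1 \end{bmatrix}^T
\begin{bmatrix} R_n & S_n & r_n \\ S_n^T & Q_n & q_n \\ r_n^T & q_n^T & 0 \end{bmatrix}
\begin{bmatrix} u_n \\ x_n \\ 1 \end{bmatrix}
+ \frac{1}{2}
\begin{bmatrix} x_N \\ 1 \end{bmatrix}^T
\begin{bmatrix} Q_N & q_N \\ q_N^T & 0 \end{bmatrix}
\begin{bmatrix} x_N \\ 1 \end{bmatrix} \tag{7.6a} \\
\text{s.t.} \quad & x_{n+1} = A_n x_n + B_n u_n + b_n, \quad n = 0, \dots, N-1 \tag{7.6b} \\
& 0 = D_N x_N + d_N \tag{7.6c}
\end{align}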
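The MHE problem (7.5) can be rewritten in this formulation by eliminating $v_n = y_n - C_n x_n$ and setting
\begin{align*}
& u_n = w_n, \quad B_n = G_n, \quad b_n = f_n, \quad D_N = 0, \quad d_N = 0, \\
& Q_0 = C_0^T \tilde{R}_0^{-1} C_0 + \tilde{P}_0^{-1}, \quad Q_n = C_n^T \tilde{R}_n^{-1} C_n, \quad S_n = 0, \quad R_n = \tilde{Q}_n^{-1}, \\
& q_0 = C_0^T \tilde{R}_0^{-1} (\bar{v}_0 - y_0) - \tilde{P}_0^{-1} \bar{x}_0, \quad q_n = C_n^T \tilde{R}_n^{-1} (\bar{v}_n - y_n), \quad r_n = -\tilde{Q}_n^{-1} \bar{w}_n.
\end{align*}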
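The origin of these expressions can be checked by substituting the measurement equation into the cost function. The following worked expansion is a sketch for a single stage, assuming symmetric weighting matrices; constant terms are dropped since they do not affect the minimizer.

\begin{align*}
\tfrac{1}{2}(v_n - \bar{v}_n)^T \tilde{R}_n^{-1} (v_n - \bar{v}_n)
 &= \tfrac{1}{2}(y_n - C_n x_n - \bar{v}_n)^T \tilde{R}_n^{-1} (y_n - C_n x_n - \bar{v}_n) \\
 &= \tfrac{1}{2} x_n^T \big( C_n^T \tilde{R}_n^{-1} C_n \big) x_n
  + \big( C_n^T \tilde{R}_n^{-1} (\bar{v}_n - y_n) \big)^T x_n + \text{const.}, \\
\tfrac{1}{2}(w_n - \bar{w}_n)^T \tilde{Q}_n^{-1} (w_n - \bar{w}_n)
 &= \tfrac{1}{2} w_n^T \tilde{Q}_n^{-1} w_n
  - \big( \tilde{Q}_n^{-1} \bar{w}_n \big)^T w_n + \text{const.}, \\
\tfrac{1}{2}(x_0 - \bar{x}_0)^T \tilde{P}_0^{-1} (x_0 - \bar{x}_0)
 &= \tfrac{1}{2} x_0^T \tilde{P}_0^{-1} x_0
  - \big( \tilde{P}_0^{-1} \bar{x}_0 \big)^T x_0 + \text{const.}
\end{align*}

Matching the quadratic terms with $Q_n$ and $R_n$ and the linear terms with $q_n$ and $r_n$ in (7.6a) gives the correspondence between the two formulations; the prior term contributes only at stage 0.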
Formulation (7.6) reﬂects the deterministic view of the MHE as the problem of
ﬁnding the optimal xn, wn and vn sequences in a least-square sense, with respect
to some cost function. This formulation looks like the MPC formulation (7.1),
with one key diﬀerence: in the MHE problem the initial state x0 is retained as
an optimization variable, while in the MPC problem it is ﬁxed to some initial
value xˆ0.
It is possible to consider additional state equality constraints on the last stage
(7.6c) and still maintain a problem structure that can be easily exploited in
some tailored solvers at a very small increase in computational cost. These equality
constraints are used to provide a consistent feedback signal to the controller. They
are enforced only at the last stage to avoid Linear Independence Constraint
Qualification (LICQ) problems.
The penalization of xn in place of vn in the cost function (7.6a) is useful to
account for QPs in non-linear MHE. The fact that the matrices in the cost
function appear non-inverted makes it straightforward to use a solver for
this MHE formulation as a routine for constrained MHE (e.g., in an IPM these
matrices are updated to take constraints into account). In fact, the inversion
does not need to be performed explicitly, as it can be embedded in the solution
algorithms and therefore performed implicitly.
The size of problem (7.6) is deﬁned by the quantities: nx (state vector size), nu
(process noise vector size), nd (number of state equality constraints on the last
stage), N (horizon length).
7.2.1 Matrix formulation
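The unconstrained MHE problem can be reformulated in the matrix form
\begin{align}
\min_{\bar{y}} \quad & \frac{1}{2} \bar{y}^T \bar{H} \bar{y} + \bar{g}^T \bar{y} + \gamma \tag{7.7a} \\
\text{s.t.} \quad & \bar{\mathcal{A}} \bar{y} = \bar{b}_d \tag{7.7b}
\end{align}
where (choosing N = 3 to simplify the notation)
\begin{equation}
\begin{aligned}
\bar{y} &= \begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
 = \begin{bmatrix} \bar{u} \\ \bar{x} \end{bmatrix}, \qquad
\bar{H} = \begin{bmatrix}
R_0 & & & S_0 & & & \\
& R_1 & & & S_1 & & \\
& & R_2 & & & S_2 & \\
S_0^T & & & Q_0 & & & \\
& S_1^T & & & Q_1 & & \\
& & S_2^T & & & Q_2 & \\
& & & & & & Q_3
\end{bmatrix}
 = \begin{bmatrix} \bar{R} & \bar{S} \\ \bar{S}^T & \bar{Q} \end{bmatrix}, \\
\bar{g} &= \begin{bmatrix} r_0 \\ r_1 \\ r_2 \\ q_0 \\ q_1 \\ q_2 \\ q_3 \end{bmatrix}
 = \begin{bmatrix} \bar{r} \\ \bar{q} \end{bmatrix}, \qquad
\bar{b}_d = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ d_3 \end{bmatrix}
 = \begin{bmatrix} \bar{b} \\ \bar{d} \end{bmatrix}, \\
\bar{\mathcal{A}} &= \begin{bmatrix}
-B_0 & & & -A_0 & I & & \\
& -B_1 & & & -A_1 & I & \\
& & -B_2 & & & -A_2 & I \\
& & & & & & -D_3
\end{bmatrix}
 = \begin{bmatrix} -\bar{B} & \bar{A} \\ & \bar{D} \end{bmatrix}.
\end{aligned}
\tag{7.8}
\end{equation}
Notice that in the MHE case the matrix $\bar{A}$ is clearly not invertible, since the
first block-row is zero.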
7.2.2 Optimality conditions
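The KKT system associated with the unconstrained MHE problem can be written
in the band-diagonal form (for N = 3)
\begin{equation}
\left[\begin{array}{ccccccccccc}
Q_0 & S_0^T & A_0^T & & & & & & & & \\
S_0 & R_0 & B_0^T & & & & & & & & \\
A_0 & B_0 & & -I & & & & & & & \\
& & -I & Q_1 & S_1^T & A_1^T & & & & & \\
& & & S_1 & R_1 & B_1^T & & & & & \\
& & & A_1 & B_1 & & -I & & & & \\
& & & & & -I & Q_2 & S_2^T & A_2^T & & \\
& & & & & & S_2 & R_2 & B_2^T & & \\
& & & & & & A_2 & B_2 & & -I & \\
& & & & & & & & -I & Q_3 & D_3^T \\
& & & & & & & & & D_3 &
\end{array}\right]
\begin{bmatrix} x_0 \\ u_0 \\ \lambda_0 \\ x_1 \\ u_1 \\ \lambda_1 \\ x_2 \\ u_2 \\ \lambda_2 \\ x_3 \\ \lambda_3 \end{bmatrix}
=
\begin{bmatrix} -q_0 \\ -r_0 \\ -b_0 \\ -q_1 \\ -r_1 \\ -b_1 \\ -q_2 \\ -r_2 \\ -b_2 \\ -q_3 \\ -d_3 \end{bmatrix}
\tag{7.9}
\end{equation}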
Chapter 8
Structure-exploiting recursive factorizations of the KKT matrix
In this chapter, two recursive methods for the solution of the MPC and MHE
problems are presented. Namely, a backward and a forward recursive factorization of the KKT matrix are considered, together with the appropriate backward
and forward solutions. Both algorithms use a recursion to efficiently factorize
the KKT matrix stage-wise, and have a complexity linear in the horizon length
N.
The factorizations are derived for a generic linear optimal control problem
(OCP), where the dynamical system is described by a state-space model, and
the number of states and inputs can vary stage-wise. This makes it possible to handle both
MPC and MHE using the same algorithms, reducing development and implementation efforts. Nevertheless, each factorization is better suited to a specific
OCP.
The backward Riccati recursion can naturally handle the MPC problem, and it
has been widely used in the literature [70, 66]. This recursion begins the factorization at the last stage. It distinguishes between state and input variables, but
it cannot exploit a diagonal Hessian of the cost function to reduce the computational cost. This factorization cannot directly handle additional equality
constraints at the last stage: therefore, if nd > 0, the backward Riccati recursion
has to be embedded into an Interior Point (IPM) or Active Set (AS) method,
or a Schur-complement approach has to be employed [70]. In the special case
of an LTI problem, if the recursion matrix is initialized with the solution of the
discrete algebraic Riccati equation (DARE), then the factorized KKT matrix
has a constant structure stage-wise, and therefore it can be computed in time
constant in N (see Chapter 11).
The forward Schur-complement recursion can naturally handle the MHE problem. The recursion begins the factorization at the first stage. This recursion
does not distinguish between state and input variables (and therefore can be
seen as more general), but it can exploit a diagonal Hessian of the cost function
to reduce the computational cost (which in general is higher than in the backward Riccati recursion case). This factorization has the advantage of directly
handling additional equality constraints on the last stage, which can be useful to
enforce stability in MPC formulations [63], or to enforce consistent state estimation in the MHE problem. In the MHE problem, this recursion has important
relations with the Kalman filter, and in particular may be used to compute the
information matrix of the estimate. In the MPC problem, this recursion requires
regularization at the first stage (as shown in Section 8.2.1.1), and therefore gives
a solution with lower accuracy.
8.1 Backward Riccati recursion
The backward Riccati recursion is a well-known method for the solution of the
unconstrained MPC problem [70].
In its classical formulation, it reads
\begin{equation}
P_n = Q_n + A_n^T P_{n+1} A_n - (S_n^T + A_n^T P_{n+1} B_n)(R_n + B_n^T P_{n+1} B_n)^{-1}(S_n + B_n^T P_{n+1} A_n). \tag{8.1}
\end{equation}
This recursion is initialized at the last stage with $P_N = Q_N$. It can be interpreted as a stage-wise recursive factorization of the KKT matrix [70]. If
the recursion is implemented using standard BLAS and LAPACK routines, the
computational cost to factorize the KKT matrix is about
\begin{equation}
N \left( 4 n_x^3 + 6 n_x^2 n_u + 3 n_x n_u^2 + \tfrac{1}{3} n_u^3 \right) \tag{8.2}
\end{equation}
flops [31], which is linear in N and cubic in the number of stage variables (nx + nu).
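The scaling stated by (8.2) can be checked numerically with a small helper. This is only an evaluation of the formula; the function name and the example sizes are assumptions chosen for the illustration.

```python
def riccati_factorization_flops(N, nx, nu):
    """Approximate flop count (8.2) for the classical backward Riccati recursion."""
    per_stage = 4 * nx**3 + 6 * nx**2 * nu + 3 * nx * nu**2 + nu**3 / 3
    return N * per_stage


flops = riccati_factorization_flops(N=10, nx=8, nu=3)
flops_double_horizon = riccati_factorization_flops(N=20, nx=8, nu=3)   # twice the horizon
flops_double_size = riccati_factorization_flops(N=10, nx=16, nu=6)     # twice the stage size
```

Doubling the horizon doubles the count, while doubling both nx and nu multiplies it by 8, in line with the linear and cubic dependencies mentioned above.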
There exists a similar recursion for the linear terms in the cost function
\begin{equation}
p_n = q_n + A_n^T (P_{n+1} b_n + p_{n+1}) - (S_n^T + A_n^T P_{n+1} B_n)(R_n + B_n^T P_{n+1} B_n)^{-1}(r_n + B_n^T (P_{n+1} b_n + p_{n+1})) \tag{8.3}
\end{equation}
that can be interpreted as a stage-wise backward solution of the KKT system.
The solution procedure is completed by the forward recursion, consisting of the
computation of the optimal input $u_n$ as a time-varying affine state feedback
\begin{equation}
u_n = K_n x_n + k_n \tag{8.4}
\end{equation}
where
\begin{align*}
K_n &= -(R_n + B_n^T P_{n+1} B_n)^{-1}(S_n + B_n^T P_{n+1} A_n) \\
k_n &= -(R_n + B_n^T P_{n+1} B_n)^{-1}(r_n + B_n^T (P_{n+1} b_n + p_{n+1})),
\end{align*}
and the computation of the next state $x_{n+1}$ by means of simulation
\begin{equation*}
x_{n+1} = A_n x_n + B_n u_n + b_n.
\end{equation*}
This forward recursion can be interpreted as a stage-wise forward substitution
of the KKT system.
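The complete procedure (backward recursion for the quadratic and linear terms, followed by the forward sweep) can be illustrated on a scalar, time-invariant problem with nx = nu = 1 and no terminal equality constraint. The sketch below is written under these simplifying assumptions and is not the HPMPC implementation: the function names are hypothetical, the linear input cost term is called r as in (7.1a), and the linear recursion is assumed to start from p_N = q_N.

```python
def riccati_scalar(N, A, B, b, Q, S, R, q, r, QN, qN, x0):
    """Backward Riccati recursion and forward sweep for nx = nu = 1.

    Returns the optimal inputs u_0..u_{N-1} and states x_0..x_N.
    """
    P, p = QN, qN                                # P_N = Q_N, p_N = q_N (assumed)
    gains = []
    for _ in range(N):                           # stages n = N-1, ..., 0
        Re = R + B * P * B                       # R_n + B_n^T P_{n+1} B_n
        Le = S + B * P * A                       # S_n + B_n^T P_{n+1} A_n
        le = r + B * (P * b + p)                 # r_n + B_n^T (P_{n+1} b_n + p_{n+1})
        gains.append((-Le / Re, -le / Re))       # (K_n, k_n)
        p = q + A * (P * b + p) - Le * le / Re   # linear-term recursion, uses P_{n+1}
        P = Q + A * P * A - Le * Le / Re         # quadratic-term recursion
    gains.reverse()
    xs, us = [x0], []
    for K, k in gains:                           # forward sweep
        us.append(K * xs[-1] + k)                # affine state feedback
        xs.append(A * xs[-1] + B * us[-1] + b)   # simulation of the dynamics
    return us, xs


def total_cost(us, A, B, b, Q, S, R, q, r, QN, qN, x0):
    """Cost of an input sequence, obtained by simulating the dynamics."""
    x, c = x0, 0.0
    for u in us:
        c += 0.5 * (R * u * u + 2 * S * u * x + Q * x * x) + r * u + q * x
        x = A * x + B * u + b
    return c + 0.5 * QN * x * x + qN * x


data = dict(A=0.9, B=0.5, b=0.1, Q=1.0, S=0.2, R=0.5, q=0.3, r=-0.1,
            QN=2.0, qN=0.0, x0=1.0)
us, xs = riccati_scalar(3, **data)
best = total_cost(us, **data)
```

A simple sanity check is that perturbing any single input away from the computed sequence increases the value returned by total_cost, as expected for the minimizer of a strictly convex problem.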
The aim of the remainder of the section is the derivation of an eﬃcient for-
mulation for the backward Riccati recursion (that has a lower computational
complexity with respect to the classical formulation in (8.1)), and the presenta-
tion of implementations using either the standard BLAS interface or the linear
algebra routines in HPMPC.
8.1.1 Derivation
In this section, the backward Riccati recursion as a solution strategy for a linear
OCP is derived using two methods: recursive factorization of the KKT matrix
and dynamic programming. These two methods give diﬀerent interpretations
to the Riccati recursion. The interpretation as factorization of the KKT ma-
trix justiﬁes the use of techniques such as mixed precision computation [23] in
the context of MPC [33]. The derivation using dynamic programming gives a
formulation that naturally leads to an eﬃcient implementation.
8.1.1.1 Backward Riccati recursion as factorization of the KKT matrix
The backward Riccati recursion can be seen as a block-wise recursive factoriza-
tion of the KKT matrix [70]. Here the factorization procedure is repeated for
pedagogical purposes.
The recursive factorization starts at the last stage. The recursion comprises
a recursive step (operating on the general stages) and an end-of-recursion step
(operating on the first stage).
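Recursive step The backward Riccati recursion cannot naturally handle
equality constraints on the last stage. Therefore, it is assumed that $n_d = 0$ and
that the last two stages are in the form
$$
\begin{bmatrix}
-I & Q_n & S_n^T & A_n^T & \\
 & S_n & R_n & B_n^T & \\
 & A_n & B_n & & -I \\
 & & & -I & P_{n+1}
\end{bmatrix}
\begin{bmatrix} \lambda_{n-1} \\ x_n \\ u_n \\ \lambda_n \\ x_{n+1} \end{bmatrix}
= -\begin{bmatrix} q_n \\ r_n \\ b_n \\ p_{n+1} \end{bmatrix}. \tag{8.5}
$$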
Notice that the identity matrix in the top-left corner links the factorization of
this block to the previous ones. The variable $x_{n+1}$ and the fourth equation can
be eliminated by adding the fourth equation to the third equation multiplied by
$P_{n+1}$,
$$
\begin{bmatrix}
-I & Q_n & S_n^T & A_n^T \\
 & S_n & R_n & B_n^T \\
 & P_{n+1} A_n & P_{n+1} B_n & -I
\end{bmatrix}
\begin{bmatrix} \lambda_{n-1} \\ x_n \\ u_n \\ \lambda_n \end{bmatrix}
= -\begin{bmatrix} q_n \\ r_n \\ P_{n+1} b_n + p_{n+1} \end{bmatrix}.
$$
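The variable $\lambda_n$ and the third equation can be eliminated by adding the third
equation multiplied by $A_n^T$ to the first equation and by adding the third equation
multiplied by $B_n^T$ to the second equation,
$$
\begin{bmatrix}
-I & Q_n + A_n^T P_{n+1} A_n & S_n^T + A_n^T P_{n+1} B_n \\
 & S_n + B_n^T P_{n+1} A_n & R_n + B_n^T P_{n+1} B_n
\end{bmatrix}
\begin{bmatrix} \lambda_{n-1} \\ x_n \\ u_n \end{bmatrix}
= -\begin{bmatrix} q_n + A_n^T (P_{n+1} b_n + p_{n+1}) \\ r_n + B_n^T (P_{n+1} b_n + p_{n+1}) \end{bmatrix}.
$$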
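Finally, under the hypothesis that $R_n$ is positive definite (and hence invertible), the
variable $u_n$ and the second equation can be eliminated by means of the Schur
complement of the bottom-right block, as
$$
\begin{bmatrix} -I & P_n \end{bmatrix}
\begin{bmatrix} \lambda_{n-1} \\ x_n \end{bmatrix}
= -\begin{bmatrix} p_n \end{bmatrix}
$$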
where Pn and pn have the same expressions as for the backward Riccati recursion
(8.1) and (8.3). This closes the recursion, since this equation is identical to the
last equation in (8.5).
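End of recursion At the end of the recursion, the MPC and the MHE problems
are handled in different ways.
In the MPC case, the first stage is in the form
$$
\begin{bmatrix}
R_0 & B_0^T & \\
B_0 & & -I \\
 & -I & P_1
\end{bmatrix}
\begin{bmatrix} u_0 \\ \lambda_0 \\ x_1 \end{bmatrix}
= -\begin{bmatrix} r_0 \\ b_0 \\ p_1 \end{bmatrix}
$$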
Using arguments analogous to those of the recursive step, this can be rewritten in the
form
$$
\begin{bmatrix} R_0 + B_0^T P_1 B_0 \end{bmatrix}
\begin{bmatrix} u_0 \end{bmatrix}
= -\begin{bmatrix} r_0 + B_0^T (P_1 b_0 + p_1) \end{bmatrix} \tag{8.6}
$$
Under the hypothesis that $R_0$ is positive definite (and hence invertible), the matrix in
(8.6) can be factorized using e.g. Cholesky factorization, thereby completing
the KKT matrix factorization.
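In the MHE case, the first stage is in the form
$$
\begin{bmatrix}
Q_0 & S_0^T & A_0^T & \\
S_0 & R_0 & B_0^T & \\
A_0 & B_0 & & -I \\
 & & -I & P_1
\end{bmatrix}
\begin{bmatrix} x_0 \\ u_0 \\ \lambda_0 \\ x_1 \end{bmatrix}
= -\begin{bmatrix} q_0 \\ r_0 \\ b_0 \\ p_1 \end{bmatrix}
$$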
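Using arguments analogous to those of the recursive step, this can be rewritten in the
form
$$
\begin{bmatrix}
Q_0 + A_0^T P_1 A_0 & S_0^T + A_0^T P_1 B_0 \\
S_0 + B_0^T P_1 A_0 & R_0 + B_0^T P_1 B_0
\end{bmatrix}
\begin{bmatrix} x_0 \\ u_0 \end{bmatrix}
= -\begin{bmatrix} q_0 + A_0^T (P_1 b_0 + p_1) \\ r_0 + B_0^T (P_1 b_0 + p_1) \end{bmatrix} \tag{8.7}
$$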
Under the hypothesis that the matrix $\begin{bmatrix} Q_0 & S_0^T \\ S_0 & R_0 \end{bmatrix}$ is positive definite (and hence
invertible), the matrix in (8.7) can be factorized using e.g. Cholesky factorization,
thereby completing the KKT matrix factorization.
8.1.1.2 Backward Riccati as dynamic programming algorithm
In this section, the backward Riccati recursion is derived using a diﬀerent ap-
proach: dynamic programming. This derivation naturally gives an eﬃcient for-
mulation of the backward Riccati recursion. This formulation is characterized
by lower asymptotic complexity and better computational performance than the
classical backward Riccati recursion. This is obtained by exploiting symmetry
and positive semi-deﬁniteness of the recursion matrix Pn+1, and by propagat-
ing the Cholesky factor Ln+1 of the recursion matrix. The propagation of the
Cholesky factor of the recursion matrix justiﬁes the name of the routine: square-
root backward Riccati recursion. Furthermore, in this formulation the backward
factorization and the backward solution are merged into a single recursion, fur-
ther improving computational performance.
In the following, the concept of dynamic programming and its use to solve
the linear MPC problem is assumed known: see [22] for the use of dynamic
programming in optimal control.
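Let us assume that the optimal stage cost at the generic stage $n+1$ is
$$
V^*_{n+1}(x_{n+1}) =
\begin{bmatrix} x_{n+1}^T & 1 \end{bmatrix}
\begin{bmatrix} P_{n+1} & p_{n+1} \\ p_{n+1}^T & \pi_{n+1} \end{bmatrix}
\begin{bmatrix} x_{n+1} \\ 1 \end{bmatrix}.
$$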
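Inserting the expression of the state at stage $n+1$ as a function of the state and
the input at stage $n$,
$$
x_{n+1} = \begin{bmatrix} B_n & A_n & b_n \end{bmatrix}
\begin{bmatrix} u_n \\ x_n \\ 1 \end{bmatrix},
$$
the optimal stage cost can be written as a function of the state and the input at
stage $n$,
$$
\begin{aligned}
V^*_{n+1}(x_n, u_n) &= \mathcal{X}_n^T \mathcal{A}_n^T \mathcal{P}_{n+1} \mathcal{A}_n \mathcal{X}_n \\
&= \begin{bmatrix} u_n \\ x_n \\ 1 \end{bmatrix}^T
\begin{bmatrix} B_n^T & 0 \\ A_n^T & 0 \\ b_n^T & 1 \end{bmatrix}
\begin{bmatrix} P_{n+1} & p_{n+1} \\ p_{n+1}^T & \pi_{n+1} \end{bmatrix}
\begin{bmatrix} B_n & A_n & b_n \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} u_n \\ x_n \\ 1 \end{bmatrix}
\end{aligned}
$$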
In this expression, the state xn is assumed given, while the input un can be
freely chosen.
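Note that the matrix in this expression can be computed efficiently by exploiting
symmetry and positive definiteness of $\mathcal{P}_{n+1}$. In fact, if the matrix $\mathcal{P}_{n+1}$ is
positive definite, it can be factorized using Cholesky factorization, as
$$
\mathcal{P}_{n+1} = \mathcal{L}_{n+1} \mathcal{L}_{n+1}^T =
\begin{bmatrix} L_{n+1,22} & \\ L_{n+1,32} & L_{n+1,33} \end{bmatrix}
\begin{bmatrix} L_{n+1,22}^T & L_{n+1,32}^T \\ & L_{n+1,33}^T \end{bmatrix}
$$
and then the optimal stage cost becomes
$$
\begin{aligned}
V^*_{n+1}(x_n, u_n) &= \mathcal{X}_n^T \mathcal{A}_n^T \mathcal{L}_{n+1} \mathcal{L}_{n+1}^T \mathcal{A}_n \mathcal{X}_n \\
&= \mathcal{X}_n^T (\mathcal{A}_n^T \mathcal{L}_{n+1})(\mathcal{A}_n^T \mathcal{L}_{n+1})^T \mathcal{X}_n
\end{aligned}
$$
that can be built efficiently by using specialized linear algebra routines to exploit
the symmetry of the formulation and the fact that $\mathcal{L}_{n+1}$ is a lower triangular
matrix.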
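The stage cost at stage $n$ (dropping in the last equation the index $n$ in $x$,
$u$, $A$, $B$, $b$, $Q$, $S$, $R$, $q$, $s$, $\rho$ and the index $n+1$ in $P$, $p$, $\pi$)
$$
\begin{aligned}
V_n(x_n, u_n) &= \varphi_n(x_n, u_n) + V^*_{n+1}(x_n, u_n) \\
&= \mathcal{X}_n^T (\mathcal{Q}_n + (\mathcal{A}_n^T \mathcal{L}_{n+1})(\mathcal{A}_n^T \mathcal{L}_{n+1})^T) \mathcal{X}_n \\
&= \begin{bmatrix} u \\ x \\ 1 \end{bmatrix}^T
\begin{bmatrix}
R + B^T P B & S + B^T P A & s + B^T (P b + p) \\
S^T + A^T P B & Q + A^T P A & q + A^T (P b + p) \\
s^T + (P b + p)^T B & q^T + (P b + p)^T A & \rho + b^T P b + 2 b^T p + \pi
\end{bmatrix}
\begin{bmatrix} u \\ x \\ 1 \end{bmatrix}
\end{aligned}
$$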
is a function of $x_n$ (fixed) and $u_n$ (free), and it can be easily minimized with
respect to $u_n$ in the following way. The matrix is positive definite (since it is
the sum of a positive definite matrix and a positive semi-definite matrix), and
then the stage cost can be factorized by using the Cholesky factorization of the
matrix,
$$
M_n = \mathcal{Q}_n + \mathcal{A}_n^T \mathcal{P}_{n+1} \mathcal{A}_n =
\begin{bmatrix} L_{n,11} & & \\ L_{n,21} & L_{n,22} & \\ L_{n,31} & L_{n,32} & L_{n,33} \end{bmatrix}
\begin{bmatrix} L_{n,11}^T & L_{n,21}^T & L_{n,31}^T \\ & L_{n,22}^T & L_{n,32}^T \\ & & L_{n,33}^T \end{bmatrix}
$$
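obtaining the expression for the stage cost $V_n(x_n, u_n)$
$$
\begin{aligned}
V_n(x_n, u_n) &=
\begin{bmatrix} L_{n,11}^T u_n + L_{n,21}^T x_n + L_{n,31}^T \\ L_{n,22}^T x_n + L_{n,32}^T \\ L_{n,33}^T \end{bmatrix}^T
\begin{bmatrix} L_{n,11}^T u_n + L_{n,21}^T x_n + L_{n,31}^T \\ L_{n,22}^T x_n + L_{n,32}^T \\ L_{n,33}^T \end{bmatrix} \\
&= (L_{n,11}^T u_n + L_{n,21}^T x_n + L_{n,31}^T)^T (L_{n,11}^T u_n + L_{n,21}^T x_n + L_{n,31}^T) \\
&\quad + (L_{n,22}^T x_n + L_{n,32}^T)^T (L_{n,22}^T x_n + L_{n,32}^T) + L_{n,33} L_{n,33}^T.
\end{aligned}
$$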
Notice that $u_n$ is present only in the first term of the sum: this term is a square,
and then its minimum is zero, attained for the value of $u_n$
$$
u_n = -(L_{n,11}^T)^{-1}(L_{n,21}^T x_n + L_{n,31}^T). \tag{8.8}
$$
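The corresponding optimal value $V^*_n(x_n)$ of the stage cost is given by the
remaining two terms of the sum:
$$
\begin{aligned}
V^*_n(x_n) &= (L_{n,22}^T x_n + L_{n,32}^T)^T (L_{n,22}^T x_n + L_{n,32}^T) + L_{n,33} L_{n,33}^T \\
&= \begin{bmatrix} x_n^T & 1 \end{bmatrix}
\begin{bmatrix} P_n & p_n \\ p_n^T & \pi_n \end{bmatrix}
\begin{bmatrix} x_n \\ 1 \end{bmatrix}
\end{aligned}
$$
as in the classical formulation of dynamic programming. Notice that the
procedure gives a factorization of the matrix $\mathcal{P}_n$ that can be used at the following
stage to efficiently compute $\mathcal{A}_{n-1}^T \mathcal{P}_n \mathcal{A}_{n-1}$.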
The value of $u_n$ in (8.8) can be rewritten as
$$
\begin{aligned}
u_n &= -(R_n + B_n^T P_{n+1} B_n)^{-1} \left( (S_n + B_n^T P_{n+1} A_n) x_n + s_n + B_n^T (P_{n+1} b_n + p_{n+1}) \right) \\
&= K_n x_n + k_n
\end{aligned}
$$
that is the expression of $u_n$ as a time-varying affine state feedback given by the
classical Riccati recursion in (8.4). However, the procedure to compute $u_n$ as in
(8.8) is more efficient from a computational point of view. Also notice that the
recursion matrix $P_n$ of the Riccati recursion is never computed explicitly in the
above solution procedure.
8.1.2 Implementation
In this section, the implementation of the square-root backward Riccati recur-
sion is presented. Firstly the algorithm is presented using standard BLAS and
LAPACK routines, and the computational complexity as number of ﬂops is
derived for the algorithm. Then the algorithm is presented using the custom
linear algebra routines proposed in Part I of the thesis. Finally, analogies of the
algorithm with respect to array algorithms are discussed.
8.1.2.1 Implementation using BLAS and LAPACK
The square-root backward Riccati recursion can be implemented using the stan-
dard linear algebra routines in BLAS and LAPACK. The algorithm is summa-
rized in Algorithm 1, where the name of the BLAS and LAPACK routines
employed in the implementation appears as a comment.
The number of states $n_x$ and the number of inputs $n_u$ are assumed to be
time-variant, and are therefore represented in the following as C arrays of size $N + 1$. In
this framework, the difference between MPC and MHE problems is that MPC
problems have nx[0] = 0 and MHE problems have nx[0] ≠ 0. All matrices in
Algorithm 1 are assumed to be of the proper size. As an example, in the case
of an MPC problem, the matrices $S_0$ and $L_{0,21}$ have size $0 \times$ nu[0], the matrices
$Q_0$ and $L_{0,22}$ have size $0 \times 0$ and the vectors $q_0$ and $L_{0,32}$ have size 0.
Algorithm 1 Square-root backward Riccati recursion - factorization and solution
1: $\begin{bmatrix} L_{N+1,22} & \\ L_{N+1,32} & L_{N+1,33} \end{bmatrix} \leftarrow \mathcal{P}^{1/2}$ ▷ potrf
2: for $n \leftarrow N-1, \dots, 0$ do
3:   $\mathcal{A}_n^T \mathcal{L}_{n+1} \leftarrow \begin{bmatrix} B_n^T \\ A_n^T \\ b_n^T \end{bmatrix} \cdot L_{n+1,22} + \begin{bmatrix} 0 \\ 0 \\ L_{n+1,32} \end{bmatrix}$ ▷ trmm
4:   $M_n \leftarrow \mathcal{Q}_n + (\mathcal{A}_n^T \mathcal{L}_{n+1}) \cdot (\mathcal{A}_n^T \mathcal{L}_{n+1})^T$ ▷ syrk
5:   $\begin{bmatrix} L_{n,11} & & \\ L_{n,21} & L_{n,22} & \\ L_{n,31} & L_{n,32} & L_{n,33} \end{bmatrix} \leftarrow M_n^{1/2}$ ▷ potrf
6: end for
7: if nx[0] > 0 then
8:   $x_0 \leftarrow -L_{0,22}^{-T} \cdot L_{0,32}^T$ ▷ trsv
9: end if
10: for $n \leftarrow 0, \dots, N-1$ do
11:   $u_n \leftarrow -L_{n,11}^{-T} \cdot (L_{n,21}^T \cdot x_n + L_{n,31}^T)$ ▷ gemv & trsv
12:   $x_{n+1} \leftarrow \begin{bmatrix} B_n & A_n \end{bmatrix} \cdot \begin{bmatrix} u_n \\ x_n \end{bmatrix} + b_n$ ▷ gemv
13:   $\lambda_{n+1} \leftarrow L_{n+1,22} \cdot (L_{n+1,22}^T \cdot x_{n+1} + L_{n+1,32}^T)$ ▷ trmv & trmv
14: end for

The cost of this algorithm is (showing only terms cubic in the stage variable
sizes, and assuming that the number of states and inputs is constant stage-wise,
and equal to $n_x$ and $n_u$ respectively)
$$
N\left(\tfrac{7}{3} n_x^3 + 4 n_x^2 n_u + 2 n_x n_u^2 + \tfrac{1}{3} n_u^3\right) + \tfrac{1}{3} n_x^3 \tag{8.9}
$$
flops in the MHE case, and $\tfrac{7}{3} n_x^3 + 3 n_x^2 n_u + n_x n_u^2$ flops less in the MPC case. This
is lower than the cost (8.2) of the classical backward Riccati recursion. A diagonal
Hessian of the cost function could be exploited only at the last stage, in the
computation of the Cholesky factor $L_{N+1,22}$ at line 1 of Algorithm 1 and in the
following trmm call at line 3, reducing the cost by $\tfrac{4}{3} n_x^3 + n_x^2 n_u$ flops. However,
this case is unlikely to happen in practice, since the backward Riccati recursion
is preferably employed in MPC problems, where $P$ is often initialized to the
solution of the Discrete Algebraic Riccati Equation (DARE), which is generally
a dense matrix (unless a change in state-space representation that diagonalizes
$P$ is employed).
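As an illustration of Algorithm 1, the following NumPy sketch propagates the Cholesky factor instead of the recursion matrix and merges factorization and solution. It is a sketch under several assumptions that are not in the text: the function name and data layout are hypothetical, the initial state x0 is treated as given data, multipliers are not computed, general solves stand in for the triangular routines (trsv), and the constant corner entries ($\rho$, $\pi$, $L_{n,33}$) are skipped by factorizing only the leading block of $M_n$ and obtaining the last row $[L_{n,31}\ L_{n,32}]$ by a forward substitution.

```python
import numpy as np

def sqrt_riccati(A, B, b, Q, S, R, q, s, P_N, p_N, x0):
    """Square-root backward Riccati recursion (sketch of Algorithm 1)."""
    N = len(A)
    # Line 1: factor of the terminal cost; L32 satisfies L22 @ L32 = p_N.
    L22 = np.linalg.cholesky(P_N)
    L32 = np.linalg.solve(L22, p_N)
    L11s, L21s, L31s = [None] * N, [None] * N, [None] * N
    for n in reversed(range(N)):
        nu = B[n].shape[1]
        # Line 3 (trmm): first block column of calA_n^T calL_{n+1}.
        AL = np.vstack([B[n].T @ L22,
                        A[n].T @ L22,
                        (b[n] @ L22 + L32)[None, :]])
        # Line 4 (syrk): M_n = calQ_n + (calA^T calL)(calA^T calL)^T.
        calQ = np.block([[R[n], S[n], s[n][:, None]],
                         [S[n].T, Q[n], q[n][:, None]],
                         [s[n][None, :], q[n][None, :], np.zeros((1, 1))]])
        M = calQ + AL @ AL.T
        # Line 5 (potrf): leading block, then last row by forward substitution.
        Llead = np.linalg.cholesky(M[:-1, :-1])
        last = np.linalg.solve(Llead, M[:-1, -1])   # [L31, L32] as a vector
        L11s[n], L21s[n], L31s[n] = Llead[:nu, :nu], Llead[nu:, :nu], last[:nu]
        L22, L32 = Llead[nu:, nu:], last[nu:]
    # Forward sweep: feedback (8.8) and simulation.
    x, u = [x0], []
    for n in range(N):
        rhs = L21s[n].T @ x[n] + L31s[n]
        u.append(-np.linalg.solve(L11s[n].T, rhs))
        x.append(A[n] @ x[n] + B[n] @ u[n] + b[n])
    return x, u
```

Note that the recursion matrix $P_n$ is never formed: only its factor `L22` and the vector `L32` are carried from one stage to the next.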
Since typically in control $n_x > n_u$, the most expensive part of the classical backward
Riccati recursion algorithm (8.1) is the computation of the term $A_n^T P_{n+1} A_n$,
where all matrices are square of size $n_x \times n_x$. If neither symmetry nor
positive-definiteness is exploited (as in the classical backward Riccati recursion), this
term is computed using two calls to the BLAS routine gemm, for a total cost of
$4 n_x^3$ flops.
On the other hand, if both symmetry and positive-definiteness are exploited as
in the square-root version, the matrix $P_{n+1}$ can be factorized using the Cholesky
factorization routine potrf at cost $\tfrac{1}{3} n_x^3$, obtaining the lower triangular factor
$L_{n+1,22}$. The term $A_n^T P_{n+1} A_n$ is then computed as $(A_n^T L_{n+1,22})(A_n^T L_{n+1,22})^T$,
where the computation of $(A_n^T L_{n+1,22})$ requires $n_x^3$ flops using the specialized
routine trmm (i.e. exploiting the fact that the matrix $L_{n+1,22}$ is lower triangular),
while the multiplication of a matrix by its own transpose
requires $n_x^3$ flops using the specialized routine syrk (i.e. exploiting the fact that
the result matrix is symmetric, so that only the lower triangular part needs
to be computed). The total is therefore $\tfrac{7}{3} n_x^3$ flops instead of $4 n_x^3$.
In Algorithm 1, this idea has been extended to the computation of all other terms
in the Riccati recursion, fully exploiting symmetry and positive-deﬁniteness to
reduce the ﬂops count.
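The flop counts (8.2) and (8.9) can be compared directly with a small helper. The sketch below only evaluates the formulas stated in the text; the per-routine split in `sqrt_stage_breakdown` (trmm, syrk and potrf applied to blocks of size $n_x + n_u$) is an assumption inferred here from the $n_x \times n_x$ case discussed above, and the function names are hypothetical. Exact fractions are used to avoid rounding in the thirds.

```python
from fractions import Fraction

THIRD = Fraction(1, 3)

def flops_classical(N, nx, nu):
    """Cubic-term flop count (8.2) of the classical Riccati recursion."""
    return N * (4 * nx**3 + 6 * nx**2 * nu + 3 * nx * nu**2 + THIRD * nu**3)

def flops_sqrt_mhe(N, nx, nu):
    """Cubic-term flop count (8.9) of the square-root recursion, MHE case."""
    per_stage = 7 * THIRD * nx**3 + 4 * nx**2 * nu + 2 * nx * nu**2 + THIRD * nu**3
    return N * per_stage + THIRD * nx**3

def flops_sqrt_mpc(N, nx, nu):
    """MPC case: the first stage has no state, which saves the stated terms."""
    return flops_sqrt_mhe(N, nx, nu) - (7 * THIRD * nx**3 + 3 * nx**2 * nu + nx * nu**2)

def sqrt_stage_breakdown(nx, nu):
    """Assumed split of one stage of Algorithm 1 into its three routines."""
    return {
        "trmm": (nx + nu) * nx**2,           # (nx+nu) x nx times triangular nx x nx
        "syrk": (nx + nu) ** 2 * nx,         # symmetric rank-nx update
        "potrf": THIRD * (nx + nu) ** 3,     # Cholesky of size nx+nu
    }

if __name__ == "__main__":
    N, nx, nu = 10, 8, 3
    ratio = flops_sqrt_mhe(N, nx, nu) / flops_classical(N, nx, nu)
    print(f"square-root / classical flop ratio: {float(ratio):.3f}")
```

For $n_x \gg n_u$ the ratio of the two counts approaches $(7/3)/4 = 7/12$, which is the asymptotic saving implied by the discussion of the term $A_n^T P_{n+1} A_n$.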
The use of a few big matrices such as $\mathcal{A}_n$ or $\mathcal{Q}_n$, packing together all data matrices,
reduces the number of function calls to linear-algebra routines (and the related
overhead), and improves performance (since optimized BLAS routines give higher
performance on larger matrices), even leaving the number of flops unchanged.
Notice that the Cholesky factorization routine potrf provided by LAPACK
requires the input matrix to be symmetric and strictly positive-definite. A
sufficient condition for this to hold in Algorithm 1 is that the matrices $\mathcal{Q}_n$ and
$\mathcal{P}$ are symmetric and strictly positive-definite. This is a stricter requirement
than in the classical Riccati recursion algorithm, where the matrices $\mathcal{Q}_n$ and $\mathcal{P}$
need to be symmetric but only positive semi-definite, and only the matrices $R_n$
are required to be strictly positive-definite to guarantee the invertibility of the
matrices $R_n + B_n^T P_{n+1} B_n$ at all stages. The use of custom linear algebra can
relax the positive-definiteness requirement also for Algorithm 1, if the backward
factorization and the backward substitutions are not merged, i.e. if Algorithms
2 and 3 are employed in place of Algorithm 1.
If several linear OCPs having the same KKT matrix and different RHS vectors
need to be solved, then it is convenient to factorize the KKT matrix only once
by means of Algorithm 1 or 2 and to reuse the factorized KKT matrix for all
subsequent solutions, by means of Algorithm 3.
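Algorithm 2 Square-root backward Riccati recursion - factorization
1: $L_{N+1,22} \leftarrow P_N^{1/2}$ ▷ potrf
2: for $n \leftarrow N-1, \dots, 0$ do
3:   $\mathcal{A}_n^T \mathcal{L}_{n+1} \leftarrow \begin{bmatrix} B_n^T \\ A_n^T \end{bmatrix} \cdot L_{n+1,22}$ ▷ trmm
4:   $M_n \leftarrow \mathcal{Q}_n + (\mathcal{A}_n^T \mathcal{L}_{n+1}) \cdot (\mathcal{A}_n^T \mathcal{L}_{n+1})^T$ ▷ syrk
5:   $\begin{bmatrix} L_{n,11} & \\ L_{n,21} & L_{n,22} \end{bmatrix} \leftarrow M_n^{1/2}$ ▷ potrf
6: end for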
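Algorithm 3 Square-root backward Riccati recursion - solution
1: $p_N \leftarrow p$
2: for $n \leftarrow N-1, \dots, 0$ do
3:   $P_{n+1}b_n \leftarrow L_{n+1,22} \cdot L_{n+1,22}^T \cdot b_n + p_{n+1}$ ▷ trmv & trmv
4:   $\begin{bmatrix} l_n \\ p_n \end{bmatrix} \leftarrow \begin{bmatrix} r_n \\ q_n \end{bmatrix} + \begin{bmatrix} B_n^T \\ A_n^T \end{bmatrix} \cdot (P_{n+1}b_n)$ ▷ gemv
5:   $\begin{bmatrix} l_n \\ p_n \end{bmatrix} \leftarrow \begin{bmatrix} L_{n,11}^{-1} l_n \\ p_n - L_{n,21} \cdot L_{n,11}^{-1} l_n \end{bmatrix}$ ▷ trsv & gemv
6: end for
7: if nx[0] > 0 then
8:   $x_0 \leftarrow -L_{0,22}^{-T} \cdot L_{0,22}^{-1} \cdot p_0$ ▷ trsv
9: end if
10: for $n \leftarrow 0, \dots, N$ do
11:   $u_n \leftarrow -(L_{n,11}^T)^{-1}(L_{n,21}^T \cdot x_n + l_n)$ ▷ gemv & trsv
12:   $x_{n+1} \leftarrow \begin{bmatrix} B_n & A_n \end{bmatrix} \cdot \begin{bmatrix} u_n \\ x_n \end{bmatrix} + b_n$ ▷ gemv
13:   $\lambda_{n+1} \leftarrow L_{n+1,22} \cdot L_{n+1,22}^T \cdot x_{n+1} + p_{n+1}$ ▷ trmv & trmv
14: end for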
Considering only terms cubic in the stage variable sizes, the cost of Algorithm
2 is identical to the cost of Algorithm 1.
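The split between factorization and solution can be sketched as two functions, so that one factorization is reused for several right-hand sides. As before, this is an illustrative NumPy sketch and not the HPMPC code: the function names and data layout are hypothetical, the initial state x0 is treated as given, the forward loop runs over the $N$ stages that have an input, and general solves replace the triangular routines.

```python
import numpy as np

def riccati_factorize(A, B, Q, S, R, P_N):
    """Sketch of Algorithm 2: returns the Cholesky blocks of every stage."""
    N = len(A)
    L11, L21, L22 = [None] * N, [None] * N, [None] * (N + 1)
    L22[N] = np.linalg.cholesky(P_N)                         # potrf
    for n in reversed(range(N)):
        nu = B[n].shape[1]
        AL = np.vstack([B[n].T, A[n].T]) @ L22[n + 1]        # trmm
        M = np.block([[R[n], S[n]], [S[n].T, Q[n]]]) + AL @ AL.T   # syrk
        L = np.linalg.cholesky(M)                            # potrf
        L11[n], L21[n], L22[n] = L[:nu, :nu], L[nu:, :nu], L[nu:, nu:]
    return L11, L21, L22

def riccati_solve(fact, A, B, b, q, r, p_N, x0):
    """Sketch of Algorithm 3: backward and forward substitution."""
    L11, L21, L22 = fact
    N = len(A)
    p, l = [None] * (N + 1), [None] * N
    p[N] = p_N
    for n in reversed(range(N)):
        Pb = L22[n + 1] @ (L22[n + 1].T @ b[n]) + p[n + 1]   # P b + p
        l[n] = np.linalg.solve(L11[n], r[n] + B[n].T @ Pb)
        p[n] = q[n] + A[n].T @ Pb - L21[n] @ l[n]
    x, u, lam = [x0], [], []
    for n in range(N):
        u.append(-np.linalg.solve(L11[n].T, L21[n].T @ x[n] + l[n]))
        x.append(A[n] @ x[n] + B[n] @ u[n] + b[n])
        # lam[n] holds the multiplier lambda_{n+1} associated with x_{n+1}.
        lam.append(L22[n + 1] @ (L22[n + 1].T @ x[n + 1]) + p[n + 1])
    return x, u, lam
```

A typical use is to call `riccati_factorize` once and then `riccati_solve` for each new set of vectors `b`, `q`, `r`, `p_N`, which costs only terms quadratic in the stage variable sizes.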
Considering only terms quadratic in the stage variable sizes, and assuming that
the number of states and inputs is constant stage-wise, and equal to $n_x$ and $n_u$
respectively, the cost of Algorithm 3 is
$$
N\left(8 n_x^2 + 8 n_x n_u + 2 n_u^2\right)
$$
flops in the MPC case, and $2 n_x^2$ more in the MHE case. In both the MPC and
MHE case, if the Lagrangian multipliers $\lambda_n$ are not needed, then line 13 can be
omitted, saving $N(2 n_x^2)$ flops.
8.1.2.2 Implementation using custom linear algebra
The use of custom linear algebra routines can improve the implementation of
the square-root backward Riccati recursion algorithm.
Panel-major matrix format The use of the panel-major matrix format as
default matrix format in the square-root backward Riccati recursion enables the
use of the linear algebra routines proposed in Part I of the thesis. In particular,
all matrices passed to the Riccati routine are assumed to be in the panel-major
matrix format, and the results of the internal operations are in this matrix
format as well. Therefore, the eﬃcient linear algebra routines for embedded
optimization can be used without the need to convert the matrices from row-
major or column-major format into the panel-major format. For small-scale
problems, this notably increases the performance with respect to standard BLAS
and LAPACK routines.
Positive semi-definite Hessian The classical backward Riccati recursion
requires the matrices $\mathcal{P}$ and $\mathcal{Q}_n$ to be positive semi-definite, while the
implementation of the square-root backward Riccati recursion in Section 8.1.2 requires
the matrices $\mathcal{P}$ and $\mathcal{Q}_n$ to be strictly positive definite. It is well known that
it is possible to compute the Cholesky factor of a positive semi-definite matrix,
even if the result is not unique in the case of a singular matrix. The potrf routine
proposed in the first part of the thesis can factorize singular matrices.
The square-root backward Riccati recursion as implemented in Algorithm 1
requires the $L_{n,22}$ matrices to be invertible (and therefore it requires the matrices
$\mathcal{P}$ and $\mathcal{Q}_n$ to be strictly positive definite). In fact, the matrix $L_{n,22}$ is implicitly
used in the computation of $L_{n,32}$ in the Cholesky factorization in line 5, where
the operation $L_{n,32} = L_{n,22}^{-1} p_n$ is implicitly performed.
However, the square-root backward Riccati recursion as implemented in Algorithms
2 and 3 also works if the matrix $L_{n,22}$ is singular. In fact, in this
case the Cholesky factor $L_{n+1,22}$ of the recursion matrix $P_{n+1}$ is only
employed as a numerical trick to speed up the computation of the matrix
$$
\begin{bmatrix}
Q_n + A_n^T P_{n+1} A_n & S_n^T + A_n^T P_{n+1} B_n \\
S_n + B_n^T P_{n+1} A_n & R_n + B_n^T P_{n+1} B_n
\end{bmatrix}
$$
and it is not employed in the solution of linear systems.
Merging of linear algebra routines Custom routines merging two or more
standard BLAS linear algebra routines can be employed in the implementation
of Algorithms 1, 2 and 3. In particular, lines 4 and 5 of Algorithms 1 and 2
can be implemented using the single routine syrk_potrf proposed in Section
3.3.3. Similarly, line 11 of Algorithm 1, and lines 5 and 11 of Algorithm 3 can
be implemented using the single routine trsv proposed in Section 4.3.3. The
use of merged linear algebra routines gives better computational performance,
especially in case of small problems, by reducing the number of calls to linear
algebra kernels and increasing the size of the processed matrices and vectors.
8.1.2.3 Relation to array algorithms
The square-root backward Riccati recursion presented above is analogous to the
so-called array algorithms, but with some important differences.
Array algorithms are widely used for their excellent numerical properties, being
employed in the implementation of the Kalman filter [55] and of Moving Horizon
Estimation (MHE) [46]. They propagate a square root of the recursion matrix $P_n$,
preserving its symmetry and positive-definiteness. The square root is generally
computed using QR factorization. This gives them excellent numerical stability,
since the worse-conditioned normal matrix is never formed explicitly.
The QR factorization of a full-rank square matrix $A$ is $A = Q \cdot R$, with $Q$ an
orthogonal matrix and $R$ an upper-triangular matrix. If all diagonal elements of $R$
are required to be positive, then the factorization is unique. The $R$ matrix can
be computed using Householder reflections at a cost of about $\tfrac{4}{3} n^3$ flops, where
$n$ is the size of the $A$ matrix.
The uniqueness of the factorization implies that $R$ is equal to the upper Cholesky
factor of the matrix $A^T \cdot A$. The cost of computing $R$ in this way is again $\tfrac{4}{3} n^3$ flops ($n^3$ due to
the symmetric rank-$n$ update and $\tfrac{1}{3} n^3$ due to the Cholesky factorization), but
it is numerically less stable, since the normal matrix $A^T \cdot A$ is explicitly formed.
If the Cholesky factor $R$ of the matrix $B + A^T \cdot A = R^T \cdot R$ is needed, the cost
of the two algorithms is no longer the same. The algorithm based on symmetric
rank-$n$ update and Cholesky factorization requires $\tfrac{4}{3} n^3$ flops also in this case.
However, the algorithm based on the QR factorization is more expensive. In
fact, $R$ is obtained from the QR factorization of the 'array' matrix
$$
\begin{bmatrix} A \\ B^{1/2} \end{bmatrix} = Q \cdot R
$$
at the cost of $2(2n)n^2 - \tfrac{2}{3} n^3 = \tfrac{10}{3} n^3$ flops. This cost can be reduced to $2 n^3$ using
a custom QR factorization to exploit the fact that $B^{1/2}$ is upper-triangular. If
the Cholesky factor $B^{1/2}$ has to be computed, there are a further $\tfrac{1}{3} n^3$ flops.
The proposed square-root backward Riccati recursion algorithm shares with
array algorithms the fact that the recursion matrix is the Cholesky factor of
$P_n$, thus automatically enforcing symmetry. The flop count of the algorithm is
lower than that of the equivalent array algorithm (and of the classical backward Riccati
recursion as well). However, this comes at the cost of explicitly forming the
normal matrix, characterized by a larger condition number.
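The equivalence between the two ways of computing the square root $R$ of $B + A^T \cdot A$ discussed in this section can be checked numerically. The sketch below is only a small NumPy illustration with hypothetical function names: it uses the general-purpose `numpy.linalg.qr` rather than a structure-exploiting QR, and it flips row signs so that the diagonal of $R$ is positive, which is the normalization that makes the factor unique.

```python
import numpy as np

def sqrt_factor_cholesky(A, B):
    """Upper factor R with R^T R = B + A^T A, forming the normal matrix."""
    return np.linalg.cholesky(B + A.T @ A).T

def sqrt_factor_array(A, B):
    """Same factor from the QR factorization of the 'array' [A; B^{1/2}]."""
    B_half = np.linalg.cholesky(B).T            # upper factor, B = B_half^T B_half
    _, R = np.linalg.qr(np.vstack([A, B_half]))
    signs = np.sign(np.diag(R))                 # make the diagonal positive
    return signs[:, None] * R

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 4
    A = rng.standard_normal((n, n))
    W = rng.standard_normal((n, n))
    B = W @ W.T + n * np.eye(n)                 # symmetric positive definite
    diff = sqrt_factor_cholesky(A, B) - sqrt_factor_array(A, B)
    print("max difference between the two factors:", np.abs(diff).max())
```

The array route never forms $A^T \cdot A$, which is the source of its better numerical behaviour; the Cholesky route is the one with the lower flop count.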
8.2 Forward Schur-complement recursion
The forward Schur-complement recursion exploits the stage-wise structure of
the KKT matrix of the linear MPC (7.4) or of the MHE problem (7.9). This
recursion can be seen as a recursive stage-wise factorization of the KKT matrix,
where the factorization begins at the top-left corner of the KKT matrix.
Therefore this recursion is naturally suited to solve the linear MHE problem,
since it can propagate information starting from the first stage. Traditionally,
the linear MHE problem has been solved using a forward Riccati recursion,
where each stage of the recursion is analogous to the Covariance Kalman Filter.
On the contrary, each stage of the forward Schur-complement recursion is
analogous to the Information Filter formulation of the Kalman filter [14, 65].
The main advantage of the use of the forward Schur-complement recursion over
the forward Riccati recursion is its generality (it can deal with the more general
MHE problem formulation (7.6), whereas the classical forward Riccati recursion
deals with the traditional MHE problem formulation (7.5)). Therefore, the
forward Schur-complement recursion can be used to solve the unconstrained
QPs arising in constrained and non-linear MHE in a straightforward way.
Furthermore, since the forward Schur-complement recursion is a general recursion
of the KKT matrix of the linear OCP, it can also be employed to solve
MPC problems, even if it requires some care in this case. The fact that the
factorization starts at the first stage distinguishes the forward Schur-complement
recursion from the backward Riccati recursion, which begins the factorization at
the last stage, and which is therefore naturally suited to solve the linear MPC
problem. The forward Schur-complement recursion does not distinguish between
state and input variables, and therefore it can be seen as more general
than the backward Riccati recursion.
In general, the forward Schur complement recursion requires a slightly larger
number of ﬂops compared to the backward Riccati recursion. However, in
case of diagonal Hessian, the computational complexity of the forward Schur-
complement recursion can be slightly reduced, and made linear in the number
of inputs nu. Furthermore, the recursion can directly and cheaply handle addi-
tional equality constraints on the last stage.
8.2.1 Derivation
This section contains the derivation of the forward Schur-complement recursion
as a recursive stage-wise factorization of the KKT matrix. As any recursive
algorithm, there are two key ingredients: the recursive step and the handling of
the end-of-recursion case (plus a starting step in the MPC case). Furthermore,
two cases are distinguished: dense and diagonal Hessian.
8.2.1.1 Starting step
The ﬁrst step is diﬀerent in the MPC and in the MHE cases. In particular,
in the MHE case the ﬁrst step is identical to the general recursive step, and
therefore it is not presented here. On the contrary, the MPC case requires some
care, since the Schur complement leads to a singular matrix.
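MPC case In the MPC case, the top-left corner of the KKT matrix looks like
$$
\begin{bmatrix}
R_0 & B_0^T & & \\
B_0 & & -I & \\
 & -I & Q_1 & \cdots \\
 & & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix} u_0 \\ \lambda_0 \\ x_1 \\ \vdots \end{bmatrix}
= -\begin{bmatrix} \tilde r_0 \\ \tilde b_0 \\ q_1 \\ \vdots \end{bmatrix}.
$$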
Since the matrix $R_0$ is assumed to be positive definite and therefore invertible,
it is possible to eliminate the variable $u_0$ and the first equation by means of the
Schur complement, obtaining
$$
\begin{bmatrix}
-B_0 R_0^{-1} B_0^T & -I & \\
-I & Q_1 & \cdots \\
 & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix} \lambda_0 \\ x_1 \\ \vdots \end{bmatrix}
= -\begin{bmatrix} \tilde b_0 - B_0 R_0^{-1} \tilde r_0 \\ q_1 \\ \vdots \end{bmatrix}.
$$
Since the matrix $R_0$ is assumed to be invertible, the Schur-complement matrix
$B_0 R_0^{-1} B_0^T$ is invertible if and only if the matrix $B_0$ has full row rank. In the
case $n_x > n_u$ (common in control), the Schur-complement matrix cannot be
invertible.
One way to proceed is the use of regularization, that has the side eﬀect that
the accuracy of the solution is in general lower than using a backward Riccati
recursion. Numerical evidence shows that a good value for the regularization
parameter ε is often the square root of machine epsilon. The accuracy of the
solution is then on the order of the square root of machine epsilon.
An alternative approach is the use of partial condensing (see Chapter 10), that
may be employed to increase the input vector size by condensing together several
stages. In that case, the matrix B0 is the controllability matrix, that has full
rank if the system is controllable.
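In any case, having defined
$$
P_1^{-1} = B_0 R_0^{-1} B_0^T + \varepsilon I
$$
for some $\varepsilon \geq 0$ such that the matrix $P_1^{-1}$ is invertible, it is possible to eliminate
the variable $\lambda_0$ and the first equation by explicitly computing the inverse $P_1$ of
the matrix $P_1^{-1}$,
$$
\begin{bmatrix} Q_1 + P_1 & \cdots \\ \vdots & \ddots \end{bmatrix}
\begin{bmatrix} x_1 \\ \vdots \end{bmatrix}
= -\begin{bmatrix} q_1 - P_1 (\tilde b_0 - B_0 R_0^{-1} \tilde r_0) \\ \vdots \end{bmatrix}
$$
that can be rewritten in a more compact form as
$$
\begin{bmatrix} \Sigma_1 & \cdots \\ \vdots & \ddots \end{bmatrix}
\begin{bmatrix} x_1 \\ \vdots \end{bmatrix}
= -\begin{bmatrix} \sigma_1 \\ \vdots \end{bmatrix}.
$$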
The factorization can continue at the following stage, that has the same form
as the general stage, provided that the matrices Q1 and q1 are replaced by the
updated matrices Σ1 and σ1.
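The MPC starting step, including the optional regularization, can be sketched as follows. This is a minimal NumPy illustration of the formulas above, not the HPMPC routine; the function name and argument order are hypothetical, and the explicit inverse mirrors the text rather than being a recommendation for numerical practice.

```python
import numpy as np

def mpc_start_step(B0, R0, r0, b0, Q1, q1, eps=0.0):
    """Eliminate u_0 and lambda_0 and return the updated (Sigma_1, sigma_1).

    eps >= 0 is the regularization parameter; eps = 0 requires B0 to have
    full row rank, otherwise P_1^{-1} is singular.
    """
    nx = B0.shape[0]
    P1_inv = B0 @ np.linalg.solve(R0, B0.T) + eps * np.eye(nx)
    P1 = np.linalg.inv(P1_inv)                  # explicit inverse, as in the text
    c = b0 - B0 @ np.linalg.solve(R0, r0)       # b_0 - B_0 R_0^{-1} r_0
    Sigma1 = Q1 + P1
    sigma1 = q1 - P1 @ c
    return Sigma1, sigma1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    nx, nu = 3, 1                               # nx > nu: regularization needed
    B0 = rng.standard_normal((nx, nu))
    R0, Q1 = np.eye(nu), np.eye(nx)
    r0, b0, q1 = (rng.standard_normal(k) for k in (nu, nx, nx))
    eps = np.sqrt(np.finfo(float).eps)          # value suggested in the text
    Sigma1, sigma1 = mpc_start_step(B0, R0, r0, b0, Q1, q1, eps)
    print("x1 of the one-stage problem:", -np.linalg.solve(Sigma1, sigma1))
```

With `eps = 0` and a matrix `B0` of full row rank the step is exact; with `eps > 0` the entries of `P1` along the directions not reachable through `B0` are of order `1/eps`, which is what limits the attainable accuracy.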
8.2.1.2 Recursive step
The recursive step is analogous for MPC and MHE problems. However, two cases
will be distinguished, for diagonal and dense Hessian of the cost function.
Diagonal Hessian case The following derivation will assume that the ma-
trices Qn and Rn are diagonal and that the matrices Sn are zero. This can
be exploited to reduce the computational cost of the algorithm. The fact that
the matrices Sn, n = 0, . . . , N − 1 are zero allows for some analogy with the
Covariance Kalman ﬁlter algorithm for the MHE case.
The matrices $Q_n$ and $R_n$ are further assumed to be positive definite, and the
matrices $\begin{bmatrix} A_n & B_n \end{bmatrix}$ are assumed to have full row rank. These are sufficient
conditions to guarantee the invertibility of all matrices in the recursive step. In
general, however, these are not necessary conditions.
In the MHE case, recalling the conversion between the traditional MHE
formulation (7.5) and the more general MHE formulation (7.6), the matrix $Q_0$ can be
decomposed as
$$
Q_0 = \hat Q_0 + P_0 = C_0^T \tilde R_0^{-1} C_0 + \tilde P_0^{-1}
$$
where $\hat Q_0$ is in the same form as the other matrices $Q_n$, $n > 0$, in the cost
function, and the matrix $P_0$ can be interpreted as the information matrix relative
to the initial state prediction $\bar x_0$. Similarly, the vector $q_0$ can be decomposed as
$$
q_0 = \hat q_0 + p_0 = C_0^T \tilde R_0^{-1} (\bar v_0 - y_0) - \tilde P_0^{-1} \bar x_0.
$$
These expressions are useful to highlight the contribution of the initial state
prediction and of the relative information matrix to the cost function expression, and
they will be retained through the recursion.
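By introducing the definitions
$$
\Sigma_0 = Q_0, \quad \sigma_0 = q_0
$$
the top-left corner of the KKT matrix is
$$
\begin{bmatrix}
\Sigma_0 & & A_0^T & & \\
 & R_0 & B_0^T & & \\
A_0 & B_0 & & -I & \\
 & & -I & Q_1 & \cdots \\
 & & & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix} x_0 \\ u_0 \\ \lambda_0 \\ x_1 \\ \vdots \end{bmatrix}
= -\begin{bmatrix} \sigma_0 \\ r_0 \\ b_0 \\ q_1 \\ \vdots \end{bmatrix}. \tag{8.10}
$$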
In general, the matrices Σn are assumed to be dense (since at the general stage
they are the sum of a diagonal matrix Qn with a generally dense matrix Pn).
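Since the matrix $\Sigma_0$ is invertible by hypothesis, the variable $x_0$ can be eliminated
using the Schur complement of $\Sigma_0$, obtaining
$$
\begin{bmatrix}
R_0 & B_0^T & & \\
B_0 & -A_0 \Sigma_0^{-1} A_0^T & -I & \\
 & -I & Q_1 & \cdots \\
 & & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix} u_0 \\ \lambda_0 \\ x_1 \\ \vdots \end{bmatrix}
= -\begin{bmatrix} r_0 \\ b_0 - A_0 \Sigma_0^{-1} \sigma_0 \\ q_1 \\ \vdots \end{bmatrix}.
$$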
In the MHE case, as a comparison with the Covariance Kalman Filter, notice
that by using the matrix inversion lemma, the vector
$$
\begin{aligned}
\hat x_0 &= -\Sigma_0^{-1} \sigma_0 \\
&= (C_0^T \tilde R_0^{-1} C_0 + \tilde P_0^{-1})^{-1} (C_0^T \tilde R_0^{-1} (y_0 - \bar v_0) + \tilde P_0^{-1} \bar x_0) \\
&= (\tilde P_0 - \tilde P_0 C_0^T (\tilde R_0 + C_0 \tilde P_0 C_0^T)^{-1} C_0 \tilde P_0)(C_0^T \tilde R_0^{-1} (y_0 - \bar v_0) + \tilde P_0^{-1} \bar x_0) \\
&= \bar x_0 - \tilde P_0 C_0^T (\tilde R_0 + C_0 \tilde P_0 C_0^T)^{-1} (C_0 \bar x_0 + \bar v_0 - y_0)
\end{aligned}
$$
can be interpreted as the state estimate using measurements up to stage 0 in
the Kalman filter, and $\Sigma_0$ as the relative information matrix. However, $\hat x_0$ is
not computed explicitly in the information-filter recursion.
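The equality between the information form and the covariance form of the estimate $\hat x_0$ can be verified numerically. The following NumPy sketch is only a check of the identity above; the function names are hypothetical and the explicit inverses are used for clarity, not efficiency.

```python
import numpy as np

def estimate_information_form(C, R_t, P_t, x_bar, y, v_bar):
    """x_hat = -Sigma_0^{-1} sigma_0 with Sigma_0 = C^T R~^{-1} C + P~^{-1}."""
    Sigma0 = C.T @ np.linalg.solve(R_t, C) + np.linalg.inv(P_t)
    sigma0 = C.T @ np.linalg.solve(R_t, v_bar - y) - np.linalg.solve(P_t, x_bar)
    return -np.linalg.solve(Sigma0, sigma0), Sigma0

def estimate_covariance_form(C, R_t, P_t, x_bar, y, v_bar):
    """Kalman measurement update obtained via the matrix inversion lemma."""
    gain = P_t @ C.T @ np.linalg.inv(R_t + C @ P_t @ C.T)
    x_hat = x_bar - gain @ (C @ x_bar + v_bar - y)
    P_est = P_t - gain @ C @ P_t            # covariance of the estimate
    return x_hat, P_est

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    nx, ny = 3, 2
    C = rng.standard_normal((ny, nx))
    R_t, P_t = np.eye(ny), 2.0 * np.eye(nx)
    x_bar, y, v_bar = rng.standard_normal(nx), rng.standard_normal(ny), np.zeros(ny)
    xi, Sigma0 = estimate_information_form(C, R_t, P_t, x_bar, y, v_bar)
    xc, P_est = estimate_covariance_form(C, R_t, P_t, x_bar, y, v_bar)
    print("estimates agree:", np.allclose(xi, xc))
    print("Sigma_0 is the inverse covariance:", np.allclose(Sigma0 @ P_est, np.eye(nx)))
```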
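Similarly, since the matrix $R_0$ is invertible by hypothesis, the variable $u_0$ can
be eliminated, obtaining
$$
\begin{bmatrix}
-A_0 \Sigma_0^{-1} A_0^T - B_0 R_0^{-1} B_0^T & -I & \\
-I & Q_1 & \cdots \\
 & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix} \lambda_0 \\ x_1 \\ \vdots \end{bmatrix}
= \begin{bmatrix} -b_0 + A_0 \Sigma_0^{-1} \sigma_0 + B_0 R_0^{-1} r_0 \\ -q_1 \\ \vdots \end{bmatrix}.
$$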
Since the matrix $R_0$ is assumed to be diagonal, the computation of the inverse
$R_0^{-1}$ is a cheap operation. Finally, since the matrix $P_1^{-1} = A_0 \Sigma_0^{-1} A_0^T + B_0 R_0^{-1} B_0^T$
is invertible (due to the fact that $\Sigma_0$ and $R_0$ are invertible, and
the matrix $\begin{bmatrix} A_0 & B_0 \end{bmatrix}$ has full row rank), the variable $\lambda_0$ can be eliminated by
explicitly computing the inverse $P_1$ of the matrix $P_1^{-1}$, obtaining
$$
(Q_1 + P_1) x_1 = -q_1 - P_1 (-b_0 + A_0 \Sigma_0^{-1} \sigma_0 + B_0 R_0^{-1} r_0) = -q_1 + P_1 \bar x_1 = -q_1 - p_1,
$$
that can be rewritten in the more compact form
$$
\Sigma_1 x_1 = -\sigma_1. \tag{8.11}
$$
In general, the matrix Σ1 is dense, since in general the matrix P1 is dense.
In the MHE case, as a comparison with the Covariance Kalman Filter, notice
that the vector $\bar x_1$ can be interpreted as the one-step-ahead state prediction in
the Kalman filter given the measurements up to stage 0, and $P_1$ is the relative
information matrix. Furthermore, notice that the matrix $P_1^{-1} = \tilde P_1$ has the
form (by using the matrix inversion lemma to compute $\Sigma_0^{-1}$)
$$
\begin{aligned}
\tilde P_1 = P_1^{-1} &= B_0 R_0^{-1} B_0^T + A_0 \Sigma_0^{-1} A_0^T = B_0 \tilde Q_0 B_0^T + A_0 (C_0^T \tilde R_0^{-1} C_0 + \tilde P_0^{-1})^{-1} A_0^T \\
&= B_0 \tilde Q_0 B_0^T + A_0 \tilde P_0 A_0^T - A_0 \tilde P_0 C_0^T (\tilde R_0 + C_0 \tilde P_0 C_0^T)^{-1} C_0 \tilde P_0 A_0^T
\end{aligned}
$$
that is the classical expression of the forward Riccati recursion of the Covariance
Kalman Filter. However, in the forward Schur-complement recursion, the
inverse of the matrix $\Sigma_0$ is computed explicitly (similarly to the information
Kalman filter) instead of by means of the matrix inversion lemma (as in the
covariance Kalman filter).
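Equation (8.11) closes the recursion, since now the top-left corner of the KKT
matrix is
$$
\begin{bmatrix}
\Sigma_1 & & A_1^T & & \\
 & R_1 & B_1^T & & \\
A_1 & B_1 & & -I & \\
 & & -I & Q_2 & \cdots \\
 & & & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix} x_1 \\ u_1 \\ \lambda_1 \\ x_2 \\ \vdots \end{bmatrix}
= \begin{bmatrix} -\sigma_1 \\ -r_1 \\ -b_1 \\ -q_2 \\ \vdots \end{bmatrix},
$$
that is in the same form as (8.10). The recursion can therefore be repeated at
the following stage.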
Dense Hessian case The following derivation makes no assumption on the
sparsity structure of the matrices $Q_n$, $R_n$ and $S_n$: therefore they are assumed
to be dense. In the MHE case, no comparison is made with the covariance
Kalman Filter.
The matrices $\begin{bmatrix} Q_n & S_n^T \\ S_n & R_n \end{bmatrix}$ and $Q_N$ are assumed to be positive definite, and the
matrices $\begin{bmatrix} A_n & B_n \end{bmatrix}$ are assumed to have full row rank. These are sufficient
conditions to guarantee the invertibility of all matrices in the recursion. In
general, however, these are not necessary conditions.
By introducing the definitions
$$
\Sigma_0 = Q_0, \quad \sigma_0 = q_0
$$
the top-left corner of the KKT matrix is
$$
\begin{bmatrix}
\Sigma_0 & S_0^T & A_0^T & & \\
S_0 & R_0 & B_0^T & & \\
A_0 & B_0 & & -I & \\
 & & -I & Q_1 & \cdots \\
 & & & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix} x_0 \\ u_0 \\ \lambda_0 \\ x_1 \\ \vdots \end{bmatrix}
= -\begin{bmatrix} \sigma_0 \\ r_0 \\ b_0 \\ q_1 \\ \vdots \end{bmatrix}. \tag{8.12}
$$
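Since the matrix $\begin{bmatrix} \Sigma_0 & S_0^T \\ S_0 & R_0 \end{bmatrix}$ is invertible by hypothesis, the variables $x_0$ and $u_0$
can be eliminated using the Schur complement of $\begin{bmatrix} \Sigma_0 & S_0^T \\ S_0 & R_0 \end{bmatrix}$, obtaining
$$
\begin{bmatrix}
-\begin{bmatrix} A_0 & B_0 \end{bmatrix} \begin{bmatrix} \Sigma_0 & S_0^T \\ S_0 & R_0 \end{bmatrix}^{-1} \begin{bmatrix} A_0^T \\ B_0^T \end{bmatrix} & -I & \\
-I & Q_1 & \cdots \\
 & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix} \lambda_0 \\ x_1 \\ \vdots \end{bmatrix}
= -\begin{bmatrix}
b_0 - \begin{bmatrix} A_0 & B_0 \end{bmatrix} \begin{bmatrix} \Sigma_0 & S_0^T \\ S_0 & R_0 \end{bmatrix}^{-1} \begin{bmatrix} \sigma_0 \\ r_0 \end{bmatrix} \\
q_1 \\ \vdots
\end{bmatrix}.
$$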
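Finally, since the matrix
$$
P_1^{-1} = \begin{bmatrix} A_0 & B_0 \end{bmatrix} \begin{bmatrix} \Sigma_0 & S_0^T \\ S_0 & R_0 \end{bmatrix}^{-1} \begin{bmatrix} A_0^T \\ B_0^T \end{bmatrix}
$$
is invertible (due to the fact that the matrix $\begin{bmatrix} \Sigma_0 & S_0^T \\ S_0 & R_0 \end{bmatrix}$ is invertible, and the matrix $\begin{bmatrix} A_0 & B_0 \end{bmatrix}$
has full row rank), the variable $\lambda_0$ can be eliminated, obtaining
$$
\begin{aligned}
(Q_1 + P_1) x_1 &= -\left( q_1 - P_1 \left( b_0 - \begin{bmatrix} A_0 & B_0 \end{bmatrix} \begin{bmatrix} \Sigma_0 & S_0^T \\ S_0 & R_0 \end{bmatrix}^{-1} \begin{bmatrix} \sigma_0 \\ r_0 \end{bmatrix} \right) \right) \\
&= -(q_1 + p_1),
\end{aligned}
$$
that can be rewritten in the more compact form
$$
\Sigma_1 x_1 = -\sigma_1.
$$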
This closes the recursion, since now the top-left corner of the KKT matrix is
\[
\begin{bmatrix}
\Sigma_1 & S_1^T & A_1^T & & \\
S_1 & R_1 & B_1^T & & \\
A_1 & B_1 & & -I & \\
& & -I & Q_2 & \cdots \\
& & & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix} x_1 \\ u_1 \\ \lambda_1 \\ x_2 \\ \vdots \end{bmatrix}
= -
\begin{bmatrix} \sigma_1 \\ r_1 \\ b_1 \\ q_2 \\ \vdots \end{bmatrix},
\]
that is in the same form as (8.12). The recursion can therefore be repeated at
the following stage.
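For reference, the stage-0 elimination of the dense Hessian case can be restated for a generic stage n. This is only a summary sketch, assuming the same steps carry over unchanged to every stage; the shorthand symbols M_n and \bar{b}_n are introduced here for compactness and are not part of the derivation above.

```latex
\begin{aligned}
M_n &= \begin{bmatrix} \Sigma_n & S_n^T \\ S_n & R_n \end{bmatrix}, \\
P_{n+1}^{-1} &= \begin{bmatrix} A_n & B_n \end{bmatrix} M_n^{-1}
                \begin{bmatrix} A_n^T \\ B_n^T \end{bmatrix}, \\
\bar{b}_n &= b_n - \begin{bmatrix} A_n & B_n \end{bmatrix} M_n^{-1}
             \begin{bmatrix} \sigma_n \\ r_n \end{bmatrix}, \\
\Sigma_{n+1} &= Q_{n+1} + P_{n+1}, \\
\sigma_{n+1} &= q_{n+1} - P_{n+1}\, \bar{b}_n ,
\qquad n = 0, \dots, N-1 .
\end{aligned}
```

With n = 0 these expressions reduce to the definitions of P1, Σ1 and σ1 given above.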
8.2.1.3 End of recursion
At the end of the recursion, two cases are distinguished, based on the presence
or not of equality constraints on the last stage.
Case nd = 0 If there are no equality constraints on the last stage variables,
the last stage looks like
ΣNxN = −σN .
This linear system of equations can be solved by means of Cholesky factorization
of the positive deﬁnite matrix ΣN and forward-backward substitutions.
In the MHE case, ΣN is the information matrix of the estimate xN , and it is
available at no extra computational cost.
Case nd > 0 If there are equality constraints on the last stage variables, the
last stage looks like
\[
\begin{bmatrix} \Sigma_N & D_N^T \\ D_N & \end{bmatrix}
\begin{bmatrix} x_N \\ \lambda_N \end{bmatrix}
= -
\begin{bmatrix} \sigma_N \\ d_N \end{bmatrix}
\]
This can be solved by computing the Schur complement matrix of ΣN, that
gives the linear system of equations
\[
\left( D_N \Sigma_N^{-1} D_N^T \right) \lambda_N = d_N - D_N \Sigma_N^{-1} \sigma_N
\]
The matrix $D_N \Sigma_N^{-1} D_N^T$ is positive definite in the hypothesis that the matrix
ΣN is positive definite and that the constraint matrix DN has full row rank. In
that hypothesis, the system can be easily solved for λN by means of Cholesky
factorization followed by forward-backward substitutions.
Once the value of λN is known, the value of xN can be computed as
\[
x_N = \Sigma_N^{-1} \left( -\sigma_N - D_N^T \lambda_N \right).
\]
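As a small worked instance of the constrained last stage, consider the following illustrative numbers (chosen here only for the example, not taken from the thesis): two states, with a single constraint fixing the first one.

```latex
\Sigma_N = \begin{bmatrix} 2 & 0 \\ 0 & 4 \end{bmatrix}, \quad
D_N = \begin{bmatrix} 1 & 0 \end{bmatrix}, \quad
\sigma_N = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \quad
d_N = -1 .
\\[1ex]
D_N \Sigma_N^{-1} D_N^T = \tfrac{1}{2}, \qquad
d_N - D_N \Sigma_N^{-1} \sigma_N = -1 - \tfrac{1}{2} = -\tfrac{3}{2}
\;\Rightarrow\; \lambda_N = -3 ,
\\[1ex]
x_N = \Sigma_N^{-1} \left( -\sigma_N - D_N^T \lambda_N \right)
    = \begin{bmatrix} \tfrac{1}{2} & 0 \\ 0 & \tfrac{1}{4} \end{bmatrix}
      \begin{bmatrix} 2 \\ -2 \end{bmatrix}
    = \begin{bmatrix} 1 \\ -\tfrac{1}{2} \end{bmatrix} .
```

The result can be checked against both block rows of the last-stage system: D_N x_N = 1 = -d_N, and Σ_N x_N + D_N^T λ_N = (-1, -2)^T = -σ_N.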
In the MHE case, the information matrix of the estimate in the null-space can
be computed as
\[
\Sigma_{Z,N} = Z^T \Sigma_N Z
\]
where Z is a null-space matrix of DN [67]. If the DN matrix fixes the value
of some of the states (i.e. it consists of rows from an identity matrix), then
a suitable Z matrix is trivially a collection of the columns from an identity
matrix corresponding to the free states, and Σ_{Z,N} reduces to the elements of ΣN
corresponding to the free states.
8.2.2 Implementation
In this section, the implementation of the forward Schur-complement recursion is
presented. First, the algorithm is presented using standard BLAS and LAPACK
routines, and the computational complexity (as number of flops) is derived
for both the dense and the diagonal Hessian of the cost function.
Then the algorithm is presented using the custom linear algebra routines
proposed in Part I of the thesis.
8.2.2.1 Implementation using BLAS and LAPACK
The forward Schur-complement recursion can be implemented using the standard
BLAS and LAPACK routines. The algorithm is summarized in Algorithm 4,
where the names of the BLAS and LAPACK routines employed in the
implementation appear as comments.
In the case of the square-root backward Riccati recursion, the key operation is the
computation of $Q + A^T P A$, where Q is a positive semi-definite matrix. If all
matrices A, P and Q have size n, then the most efficient way to compute this
operation is
\[
Q + A^T P A = Q + A^T (L L^T) A = Q + (A^T L)(A^T L)^T \qquad (8.13)
\]
where L is the lower Cholesky factor of P. Using specialized BLAS routines,
the cost of this operation is $\tfrac{1}{3}n^3$ (potrf) + $n^3$ (trmm) + $n^3$ (syrk) = $\tfrac{7}{3}n^3$ flops.
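In the case of the forward Schur-complement recursion, the key operation is the
computation of $Q + A P^{-1} A^T$, where Q is a positive definite matrix. Despite
the presence of a matrix inversion, this operation can be computed in the exact
same number of flops as the operation in (8.13). In fact, the matrix inversion is
computed implicitly, as
\[
Q + A P^{-1} A^T = Q + A (L L^T)^{-1} A^T = Q + (A L^{-T})(A L^{-T})^T \qquad (8.14)
\]
where again L is the lower Cholesky factor of P. Since the matrix L is triangular,
the operation $A L^{-T}$ can be computed efficiently using the routine trsm
to solve a triangular system of linear equations with matrix RHS. Using
specialized BLAS routines, the cost of this operation is $\tfrac{1}{3}n^3$ (potrf) + $n^3$ (trsm)
+ $n^3$ (syrk) = $\tfrac{7}{3}n^3$ flops. This makes the forward Schur-complement recursion
competitive with respect to the square-root backward Riccati recursion.

Algorithm 4 Forward Schur-complement recursion - factorization - dense Hessian
 1: $\mathcal{Q} \leftarrow \begin{bmatrix} Q_0 & 0 \\ S_0 & R_0 \end{bmatrix}$
 2: $\mathcal{A} \leftarrow \begin{bmatrix} A_0 & B_0 \end{bmatrix}$
 3: $L_0 \leftarrow \mathcal{Q}^{1/2}$   ▷ potrf
 4: $\mathcal{AL}_0 \leftarrow \mathcal{A} \cdot L_0^{-T}$   ▷ trsm
 5: $P_{\mathrm{inv}} \leftarrow \mathcal{AL}_0 \cdot \mathcal{AL}_0^T$   ▷ syrk
 6: $L \leftarrow P_{\mathrm{inv}}^{1/2}$   ▷ potrf
 7: $U_1 \leftarrow L^{-T}$   ▷ trtri
 8: for $n \leftarrow 1, \dots, N-1$ do
 9:   $\Sigma \leftarrow Q_n + U_n \cdot U_n^T$   ▷ lauum
10:   $\mathcal{Q} \leftarrow \begin{bmatrix} \Sigma & 0 \\ S_n & R_n \end{bmatrix}$
11:   $\mathcal{A} \leftarrow \begin{bmatrix} A_n & B_n \end{bmatrix}$
12:   $L_n \leftarrow \mathcal{Q}^{1/2}$   ▷ potrf
13:   $\mathcal{AL}_n \leftarrow \mathcal{A} \cdot L_n^{-T}$   ▷ trsm
14:   $P_{\mathrm{inv}} \leftarrow \mathcal{AL}_n \cdot \mathcal{AL}_n^T$   ▷ syrk
15:   $L \leftarrow P_{\mathrm{inv}}^{1/2}$   ▷ potrf
16:   $U_{n+1} \leftarrow L^{-T}$   ▷ trtri
17: end for
18: $\Sigma \leftarrow Q_N + U_N \cdot U_N^T$   ▷ lauum
19: $L_N \leftarrow \Sigma^{1/2}$   ▷ potrf
20: if $n_d > 0$ then
21:   $\mathcal{A} \leftarrow \begin{bmatrix} D_N \end{bmatrix}$
22:   $\mathcal{AL}_N \leftarrow \mathcal{A} \cdot L_N^{-T}$   ▷ trsm
23:   $P_{\mathrm{inv}} \leftarrow \mathcal{AL}_N \cdot \mathcal{AL}_N^T$   ▷ syrk
24:   $L_N \leftarrow P_{\mathrm{inv}}^{1/2}$   ▷ potrf
25: end if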
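The implicit inversion in (8.14) can be illustrated with a minimal pure-Python sketch. The three helper functions below are hypothetical stand-ins that only mimic the roles of potrf, trsm and syrk on small lists of lists; they are not the BLAS/LAPACK or HPMPC routines, and they make no attempt at performance.

```python
import math

def cholesky_lower(P):
    """Lower Cholesky factor L with P = L * L^T (plays the role of potrf)."""
    n = len(P)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = P[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = P[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L

def solve_right_lower_transposed(A, L):
    """Return X = A * L^{-T} by solving X * L^T = A row by row (role of trsm)."""
    n = len(L)
    X = []
    for a in A:
        x = [0.0] * n
        for j in range(n):
            s = a[j] - sum(x[k] * L[j][k] for k in range(j))
            x[j] = s / L[j][j]
        X.append(x)
    return X

def add_gram(Q, X):
    """Return Q + X * X^T (role of syrk)."""
    m, n = len(X), len(X[0])
    return [[Q[i][j] + sum(X[i][k] * X[j][k] for k in range(n))
             for j in range(m)] for i in range(m)]

def schur_update(Q, A, P):
    """Compute Q + A * P^{-1} * A^T without ever forming P^{-1}, as in (8.14)."""
    L = cholesky_lower(P)
    AL = solve_right_lower_transposed(A, L)
    return add_gram(Q, AL)

# Illustrative data (not from the thesis).
Q = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0], [0.0, 1.0]]
P = [[4.0, 2.0], [2.0, 3.0]]
print(schur_update(Q, A, P))
```

For this 2x2 example the result can be checked by hand against the explicit inverse of P, which gives Q + A P^{-1} A^T = [[2.375, 0.75], [0.75, 1.5]].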
The following analysis of the computational cost includes only terms that are
cubic in the stage variable sizes, and assumes the number of states and inputs
to be constant stage-wise, and equal to nx and nu respectively. In case of dense
Hessian of the cost function, the computational cost of Algorithm 4 is of
\[
N \left( \tfrac{10}{3} n_x^3 + 4 n_x^2 n_u + 2 n_x n_u^2 + \tfrac{1}{3} n_u^3 \right)
+ \tfrac{1}{3} n_x^3 + n_d n_x^2 + n_d^2 n_x + \tfrac{1}{3} n_d^3
\]
flops for the MHE case, and $\tfrac{4}{3} n_x^3 + 3 n_x^2 n_u + n_x n_u^2$ flops less for the MPC case.
If the Hessian of the cost function is sparse, the computational cost can be
reduced, and this is true also stage-wise (i.e. if the Hessian matrices are sparse
only at some stages). Namely:
• a zero Sn matrix can be exploited at all stages, to reduce the computational
cost by $3 n_x^2 n_u + n_x n_u^2$ flops per stage.
• if Sn is a zero matrix, a diagonal Rn matrix can be exploited at all stages
to reduce the computational cost by additional $n_x n_u^2 + \tfrac{1}{3} n_u^3$ flops per stage.
• a diagonal Qn matrix can be exploited only at the first stage, since afterwards
it is updated with a dense matrix, as in lines 9 and 18 of Algorithm
4. In this case, the cost of the first iteration can be reduced by additional
$\tfrac{4}{3} n_x^3$ flops if $S_0 = 0$, or additional $\tfrac{4}{3} n_x^3 + n_x^2 n_u$ flops if $S_0 \neq 0$.
Notice that BLAS and LAPACK do not have support for diagonal matrices, and
therefore custom routines need to be employed for that. In conclusion, if the
Hessian of the cost function is diagonal at all stages, the computational cost of
Algorithm 4 is of
\[
N \left( \tfrac{10}{3} n_x^3 + n_x^2 n_u \right) - n_x^3 + n_d n_x^2 + n_d^2 n_x + \tfrac{1}{3} n_d^3
\]
flops for the MHE case, and $n_x^3$ flops less for the MPC case. Therefore in case
of diagonal Hessian of the cost function, the computational complexity of the
factorization part of the forward Schur-complement recursion is linear in nu.
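The cubic-term flop counts of Algorithm 4 quoted in this section, for the dense and the diagonal Hessian, can be tabulated with a short helper. This is only a sketch: the function names are hypothetical, the formulas are transcribed from the text, and exact rational arithmetic is used so that the thirds do not introduce rounding.

```python
from fractions import Fraction

THIRD = Fraction(1, 3)

def factorization_flops_dense(N, nx, nu, nd=0, mpc=False):
    """Cubic-term flop count of Algorithm 4 with dense Hessian (MHE by default)."""
    per_stage = (Fraction(10, 3) * nx**3 + 4 * nx**2 * nu
                 + 2 * nx * nu**2 + THIRD * nu**3)
    last_stage = THIRD * nx**3 + nd * nx**2 + nd**2 * nx + THIRD * nd**3
    total = N * per_stage + last_stage
    if mpc:
        # saving quoted in the text for the MPC case
        total -= Fraction(4, 3) * nx**3 + 3 * nx**2 * nu + nx * nu**2
    return total

def factorization_flops_diagonal(N, nx, nu, nd=0, mpc=False):
    """Cubic-term flop count of Algorithm 4 with diagonal Hessian at all stages."""
    total = (N * (Fraction(10, 3) * nx**3 + nx**2 * nu)
             - nx**3 + nd * nx**2 + nd**2 * nx + THIRD * nd**3)
    if mpc:
        total -= nx**3
    return total

# Illustrative sizes: nu = nx / 2, as in the tests of Section 8.3.
N, nx, nu = 10, 40, 20
dense = factorization_flops_dense(N, nx, nu)
diag = factorization_flops_diagonal(N, nx, nu)
print(float(dense), float(diag), float(dense / diag))
```

The diagonal count equals the dense count minus the three per-item savings listed above (zero Sn, diagonal Rn, and diagonal Q0 with S0 = 0), which gives a quick consistency check of the formulas.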
The algorithm for the solution of the KKT system given the factorization of
the KKT matrix is presented in Algorithm 5, where the name of the BLAS and
LAPACK routines employed in the implementation appears as a comment.
Algorithm 5 consists of forward and backward substitutions. Again, triangular
matrices are exploited by means of specialized routines. The following analysis
of the computational cost includes only terms that are quadratic in the stage
variable sizes, and assumes the number of states and inputs to be constant
stage-wise, and equal to nx and nu respectively. In case of dense Hessian of the cost
function, the computational cost of Algorithm 5 is of
\[
N \left( 10 n_x^2 + 8 n_x n_u + 2 n_u^2 \right) + 2 n_x^2 + 4 n_x n_d + 2 n_d^2
\]
flops for the MHE case, and $6 n_x^2 + 4 n_x n_u$ flops less for the MPC case.
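Algorithm 5 Forward Schur-complement recursion - solution - dense Hessian
 1: $\begin{bmatrix} \bar{q}_0 \\ \bar{r}_0 \end{bmatrix} \leftarrow L_0^{-1} \cdot \begin{bmatrix} \sigma_0 \\ r_0 \end{bmatrix}$   ▷ trsv
 2: $\bar{b}_0 \leftarrow b_0 - \mathcal{AL}_0 \cdot \begin{bmatrix} \bar{q}_0 \\ \bar{r}_0 \end{bmatrix}$   ▷ gemv
 3: for $n \leftarrow 1, \dots, N-1$ do
 4:   $\sigma_n \leftarrow q_n - U_n \cdot U_n^T \cdot \bar{b}_{n-1}$   ▷ trmv
 5:   $\begin{bmatrix} \bar{q}_n \\ \bar{r}_n \end{bmatrix} \leftarrow L_n^{-1} \cdot \begin{bmatrix} \sigma_n \\ r_n \end{bmatrix}$   ▷ trsv
 6:   $\bar{b}_n \leftarrow b_n - \mathcal{AL}_n \cdot \begin{bmatrix} \bar{q}_n \\ \bar{r}_n \end{bmatrix}$   ▷ gemv
 7: end for
 8: $\sigma_N \leftarrow q_N - U_N \cdot U_N^T \cdot \bar{b}_{N-1}$   ▷ trmv
 9: if $n_d = 0$ then
10:   $x_N \leftarrow -L_N^{-T} \cdot L_N^{-1} \cdot \sigma_N$   ▷ trsv
11: else
12:   $\bar{b}_N \leftarrow d_N - \mathcal{AL}_N \cdot L_N^{-1} \cdot \sigma_N$   ▷ gemv & trsv
13:   $\lambda_N \leftarrow L_N^{-T} \cdot L_N^{-1} \cdot \bar{b}_N$   ▷ trsv
14:   $x_N \leftarrow L_N^{-T} \cdot \left( -\sigma_N - \mathcal{AL}_N^T \cdot \lambda_N \right)$   ▷ gemv & trsv
15: end if
16: for $n \leftarrow N-1, \dots, 0$ do
17:   $\lambda_n \leftarrow U_{n+1} \cdot U_{n+1}^T \cdot (\bar{b}_n - x_{n+1})$   ▷ trmv
18:   $\begin{bmatrix} x_n \\ u_n \end{bmatrix} \leftarrow L_n^{-T} \cdot \left( -\begin{bmatrix} \bar{q}_n \\ \bar{r}_n \end{bmatrix} - \mathcal{AL}_n^T \cdot \lambda_n \right)$   ▷ gemv & trsv
19: end for

If the Hessian of the cost function is sparse, the computational cost can be
reduced, and this is true also stage-wise (i.e. if the Hessian matrices are sparse
only at some stages). Namely:
• a zero Sn matrix can be exploited at all stages, to reduce the computational
cost by $4 n_x n_u$ flops per stage.
• if Sn is a zero matrix, a diagonal Rn matrix can be exploited at all stages
to reduce the computational cost by additional $2 n_u^2$ flops per stage.
• a diagonal Qn matrix can be exploited only at the first stage, since afterwards
it is updated with a dense matrix, as in lines 9 and 18 of Algorithm
4. In this case, the cost of the first iteration can be reduced by additional
$2 n_x^2$ flops.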
Again, BLAS and LAPACK do not have support for diagonal matrices, and
therefore custom routines need to be employed for that. In conclusion, if the
Hessian of the cost function is diagonal at all stages, the computational cost of
Algorithm 5 is of
\[
N \left( 10 n_x^2 + 4 n_x n_u \right) + 4 n_x n_d + 2 n_d^2
\]
flops for the MHE case, and $4 n_x^2$ flops less for the MPC case. Therefore in case of
diagonal Hessian of the cost function, the computational complexity of the
solution part of the forward Schur-complement recursion is also linear in nu, and
therefore so is the computational complexity of the entire recursion.
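To make the structure of the factorization and solution phases concrete, the following sketch runs the forward Schur-complement recursion end to end on a scalar problem (nx = nu = 1, nd = 0). It is an illustration under simplifying assumptions, not the implementation described in this chapter: the 2x2 stage systems are solved by Cramer's rule instead of Cholesky factorization, all names are hypothetical, and the list entry `P[n]` holds the matrix called P_{n+1} in the derivation.

```python
def solve2(M, v):
    """Solve the 2x2 linear system M * z = v by Cramer's rule."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [(d * v[0] - b * v[1]) / det, (a * v[1] - c * v[0]) / det]

def forward_schur_scalar(A, B, b, Q, S, R, q, r):
    """A, B, b, S, R, r hold N scalars; Q, q hold N + 1 scalars."""
    N = len(A)
    Sigma, sigma = Q[0], q[0]
    M, P, bbar, sig = [], [], [], []
    # Forward sweep: eliminate (x_n, u_n), then lambda_n, at each stage.
    for n in range(N):
        Mn = [[Sigma, S[n]], [S[n], R[n]]]
        w = solve2(Mn, [A[n], B[n]])
        p_inv = A[n] * w[0] + B[n] * w[1]        # [A B] M^{-1} [A B]^T
        z = solve2(Mn, [sigma, r[n]])
        bb = b[n] - (A[n] * z[0] + B[n] * z[1])  # reduced right-hand side
        M.append(Mn); P.append(1.0 / p_inv); bbar.append(bb); sig.append(sigma)
        Sigma = Q[n + 1] + P[n]
        sigma = q[n + 1] - P[n] * bb
    # Last stage without equality constraints: Sigma_N x_N = -sigma_N.
    x, u, lam = [0.0] * (N + 1), [0.0] * N, [0.0] * N
    x[N] = -sigma / Sigma
    # Backward sweep: recover the multipliers and the stage variables.
    for n in range(N - 1, -1, -1):
        lam[n] = P[n] * (bbar[n] - x[n + 1])
        x[n], u[n] = solve2(M[n], [-sig[n] - A[n] * lam[n],
                                   -r[n] - B[n] * lam[n]])
    return x, u, lam

def kkt_residual(A, B, b, Q, S, R, q, r, x, u, lam):
    """Largest absolute residual over all rows of the KKT system."""
    N = len(A)
    res = abs(Q[N] * x[N] - lam[N - 1] + q[N])
    for n in range(N):
        prev = lam[n - 1] if n > 0 else 0.0
        res = max(res,
                  abs(Q[n] * x[n] + S[n] * u[n] + A[n] * lam[n] - prev + q[n]),
                  abs(S[n] * x[n] + R[n] * u[n] + B[n] * lam[n] + r[n]),
                  abs(A[n] * x[n] + B[n] * u[n] - x[n + 1] + b[n]))
    return res

# Illustrative MHE-like data with N = 3 (x_0 is free, Q_0 > 0).
A, B, b = [0.9, 1.1, 1.0], [0.5, 0.4, 0.3], [0.1, -0.2, 0.05]
Q, S, R = [1.0, 2.0, 1.5, 1.0], [0.1, 0.0, 0.2], [0.5, 1.0, 0.8]
q, r = [0.3, -0.1, 0.2, 0.4], [0.1, 0.0, -0.2]
x, u, lam = forward_schur_scalar(A, B, b, Q, S, R, q, r)
print(kkt_residual(A, B, b, Q, S, R, q, r, x, u, lam))
```

The residual check evaluates every block row of the KKT system in the form (8.12), so a value close to machine precision indicates that the forward elimination and the backward substitution are consistent with each other.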
8.2.2.2 Implementation using custom linear algebra routines
The use of custom linear algebra routines can improve the implementation of
the forward Schur-complement recursion algorithm.
Panel-major matrix format The use of the panel-major matrix format as
the default matrix format in the forward Schur-complement recursion enables
the use of the linear algebra routines proposed in Part I of the thesis. In particular,
all matrices passed to the Schur-complement recursion routine are assumed
to be in the panel-major matrix format, and the results of the internal operations
are in this matrix format as well. Therefore, the efficient linear algebra routines
for embedded optimization can be used without the need to convert the matrices
from row-major or column-major into the panel-major format. For small-scale
problems, this notably increases the computational performance with respect to
the use of standard BLAS and LAPACK routines.
Merging of linear algebra routines Custom routines merging two or more
standard BLAS or LAPACK linear algebra routines can be employed in the
implementation of Algorithms 4 and 5. In particular, lines 3 and 4, 12 and 13,
19 and 22 of Algorithm 4 can be implemented using the single routine potrf
proposed in Section 3.3.3: this routine merges the standard LAPACK routine
potrf and the standard BLAS routine trsm. Additionally, lines 9 to 12 and
lines 18 to 19 of Algorithm 4 can be implemented using the single routine lauum_potrf
proposed in Section 3.3.3: this routine merges the standard LAPACK routines
lauum and potrf and the standard BLAS routine trsm. The gemv and trsv
operations in lines 1 and 2, 5 and 6, 12, 14, 18 of Algorithm 5 can be implemented
using the single routine trsv proposed in Section 4.3.3. The use of merged
linear algebra routines gives better computational performance, especially in
case of small problems, by reducing the number of calls to linear algebra kernels
and increasing the size of the processed matrices and vectors.
8.3 Comparison of structure-exploiting factorizations
This section contains the comparison of the implementation of the structure-
exploiting recursive factorization algorithms. Several tests are performed: the
backward Riccati recursion is compared to the forward Schur-complement re-
cursion (for both full and diagonal Hessian of the cost function) when both re-
cursions are implemented using the custom linear algebra routines in HPMPC.
Furthermore, diﬀerent implementations of each single recursion are compared,
namely when implemented employing linear algebra routines in HPMPC or in
BLAS and LAPACK libraries (reference BLAS and LAPACK 3.5.0 from Netlib,
MKL 11.3 and OpenBLAS 0.2.15). Finally, the tests are performed on both
an Intel Ivy-Bridge processor (supporting the AVX ISA) and an Intel Haswell
processor (supporting the AVX2 and FMA ISAs). The two test machines have
the same memory conﬁguration, namely 8 GB of DDR3/DDR3L memory in
dual-channel conﬁguration (for a total data width of 128 bits), running at 1600
MHz, that gives a maximum bandwidth of 25.6 GB/s. Therefore, the diﬀerence
in performance is solely due to the processors.
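The quoted peak bandwidth follows directly from the memory configuration. As a short check, reading the stated 1600 MHz as 1600 million transfers per second (an interpretation assumed here for DDR3 memory):

```latex
\frac{128\ \text{bit}}{8\ \text{bit/B}} \times 1600 \times 10^{6}\ \text{transfers/s}
= 16\ \text{B} \times 1.6 \times 10^{9}\ \text{s}^{-1}
= 25.6 \times 10^{9}\ \text{B/s}
= 25.6\ \text{GB/s}.
```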
The test problem is an LTI MPC test problem, the unconstrained version of
the widely-employed mass-spring system [90]. Flush to zero of denormals is
enabled: this avoids the large decrease of performance that would otherwise
happen when the matrix exponential operation in the discretization of the test
matrices produces denormal numbers. The exact nature of the test problem
does not aﬀect the computational times, provided that denormals are ﬂushed
to zero, and that no Inf or NaN are produced in the computations. Since the
complexity of all algorithms is linear in N , the horizon length is ﬁxed to N = 10
(or to N = 100 in a few tests). The number of states nx is varied between 4 and
300 in steps of 4 (plus the sizes nx = 6 and nx = 10 to increase the resolution for
small-scale systems), and the number of inputs nu is equal to half of the number
of states (and therefore varied between 2 and 150 in steps of 2). In the case of
the MPC problem, (at least) the ﬁrst stage of the forward Schur-complement
recursion requires regularization. Therefore, a regularization of $\varepsilon = 10^{-9}$ is
employed at all stages in the forward Schur-complement recursion algorithms.
8.3.1 Comparison on Intel Ivy-Bridge micro-architecture
The test processor is the same Intel i7 3520M used for the tests in Sections 3.5.1
and 4.4.1. The Ivy-Bridge micro-architecture supports the AVX ISA, and it can
perform a 256-bit-wide multiplication and a 256-bit addition every clock cycle.
8.3.1.1 Comparison of algorithms in HPMPC
Figure 8.1 contains the results of timing and performance tests for the developed
recursive factorizations and solutions.
Figure 8.1a compares the KKT matrix factorization time for a horizon of
N = 10. For the tested problem sizes, the backward Riccati recursion is always
faster than the forward Schur-complement recursion in case of dense Hessian of
the cost function. In case of diagonal Hessian of the cost function, for large-
scale problems the forward Schur-complement recursion is slightly faster than
the backward Riccati recursion. Figure 8.1c compares the computational performance
of the KKT matrix factorization routines for a horizon of N = 10. The
backward Riccati recursion has slightly better performance than the forward
Schur-complement recursion (in both cases of dense and diagonal Hessian of the
cost function). This is due to the fact that in the forward Schur-complement
recursion a larger fraction of ﬂops is due to LAPACK routines (that generally
attain a slightly lower computational performance, especially for small matri-
ces, see Section 3.5). Figure 8.1e compares the computational performance of
the KKT matrix factorization routines for a horizon of N = 100. The figure
is very similar to the one for N = 10 (Figure 8.1c), with the only difference
that the performance of routines (and especially the forward Schur-complement
routines) is slightly lower for nx larger than about 40, with the performance
penalty decreasing as nx increases. The behavior is due to the fact that, for nx
smaller than about 40, all the data structure can ﬁt in L3 cache at once, and
therefore the single matrices do not need to be streamed from main memory.
For nx larger than about 40, each single matrix needs to be streamed from main
memory, but then it ﬁts in cache when the matrix elements are reused in level
3 BLAS and LAPACK routines. Therefore, for large values of nx the cost to
stream each matrix from memory is amortized over a large number of ﬂops, and
the performance is almost indistinguishable from the case N = 10.
Figure 8.1b compares the KKT system solution time, once the KKT matrix
is factorized, for N = 10. For the test problem sizes, the backward Riccati
recursion is always faster. The diﬀerence is small for small problems, but it
gets larger for nx larger than about 100, especially in the case of dense Hessian
of the cost function. The performance plots in Figure 8.1d (for N = 10) can
explain this. The performance is very similar for all routines for small-scale
problems, but for nx larger than about 100, the performance decreases quickly
for the forward Schur-complement recursion, especially in the case of dense
Hessian of the cost function. A comparison with the performance level of level
2 BLAS routines in Section 4.4 reveals that for nx smaller than about 100, the
performance is about at the same level as the performance when data is streamed
from level 3 cache, while for large nx it settles at a much lower level, since the
data is streamed from main memory. Figure 8.1f contains the performance plot
for N = 100. In this case, the performance drops for a much smaller value
of nx. Since in level 2 BLAS routines there is no reuse of matrix elements,
once the performance drops due to the streaming of matrix data from main
memory, it stays low as nx increases. In the tested LTI
case, the performance of the backward Riccati recursion solution is slightly
higher since the dynamical system matrices are re-used in consecutive iterations,
and are therefore already present in cache (at least for all tested problem sizes; for
values of nx large enough that a single matrix does not fit in L3
cache, the performance drops also for the backward Riccati recursion solution
in the LTI case). All other matrices in the backward Riccati recursion solution
and all matrices in the forward Schur-complement recursion solution change at
each stage, preventing any reuse of matrix data in cache. The same happens for
all solvers in case of time-variant test problems.

[Figure 8.1, plots omitted. Panels: (a) factorization time, N = 10; (b) solution time, N = 10; (c) factorization Gflops, N = 10; (d) solution Gflops, N = 10; (e) factorization Gflops, N = 100; (f) solution Gflops, N = 100. Horizontal axis: nx (with nu = nx/2). Curves: backward Riccati; forward Schur, full Hessian; forward Schur, diagonal Hessian.]
Figure 8.1: KKT recursive factorization tests on Intel i7 3520M (Intel Ivy-Bridge micro-architecture, supporting the AVX ISA).
The KKT system solution routine is implemented using exclusively level 2 BLAS
routines, and therefore there is no reuse of matrix elements within the single
operations. The same matrices can be reused at consecutive stages (for the
dynamical system matrices in case of LTI problems, as in the test), or at the
same stage in the backward and forward recursions (in all cases). Therefore, if
the problem is small enough such that the entire problem data ﬁts in some cache
level, the performance is analogous to that of the level 2 BLAS routines for
matrices ﬁtting in that cache level. As a consequence, if the problem matrices are
time-variant, or if the horizon length N is large, the performance drop happens
for smaller values of nx.
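The dependence of the performance drop on N and nx can be explored with a rough footprint estimate. The sketch below is a back-of-the-envelope model, not a measurement: it assumes that the solution phase touches, per stage, one (nx+nu)x(nx+nu) factor, one nx x (nx+nu) matrix and one nx x nx matrix (the roles of Ln, ALn and Un in Algorithm 4), stored as dense double-precision arrays, and it ignores vectors, panel-major padding and any other data sharing the cache. The function names are hypothetical.

```python
def solution_footprint_bytes(nx, nu, N, bytes_per_entry=8):
    """Approximate matrix data (in bytes) touched by the solution phase."""
    per_stage = (nx + nu) ** 2 + nx * (nx + nu) + nx ** 2
    return bytes_per_entry * N * per_stage

def largest_nx_in_cache(cache_bytes, N):
    """Largest even nx (with nu = nx / 2) whose footprint fits in the cache."""
    nx = 0
    while solution_footprint_bytes(nx + 2, (nx + 2) // 2, N) <= cache_bytes:
        nx += 2
    return nx

L3_IVY_BRIDGE = 4 * 2**20   # 4 MB L3 cache of the Ivy-Bridge test processor

for horizon in (10, 100):
    print(horizon, largest_nx_in_cache(L3_IVY_BRIDGE, horizon))
```

Under these assumptions the footprint exceeds a 4 MB cache at roughly nx = 100 for N = 10 and at roughly nx = 30 for N = 100, which is in line with the qualitative behavior described above; the exact thresholds depend on details that this model leaves out.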
8.3.1.2 Comparison of algorithms between HPMPC and optimized
BLAS
In Figure 8.2, the backward Riccati recursion has been implemented using linear
algebra routines from HPMPC or from diﬀerent BLAS and LAPACK versions.
Figure 8.2a contains the results for the factorization time. For small-scale problems,
all BLAS versions perform similarly, while HPMPC gives a clear performance
boost. For larger-scale problems, the performance advantage of HPMPC
over optimized BLAS libraries gets smaller, while it stays constant at about
5 times as fast as the reference BLAS and LAPACK. Both optimized BLAS
libraries (MKL and OpenBLAS) perform very similarly, with OpenBLAS be-
ing slightly better than MKL. This is conﬁrmed also by the analysis of the
performance plot in Figure 8.2c. The performance of the factorization routine
employing the linear algebra routines in HPMPC increases steeply for small ma-
trices, and keeps steady around 80-85% of full FP throughput for larger problem
sizes. Employing linear algebra routines from optimized BLAS and LAPACK
libraries, the performance increases as the problem size increases. On the contrary,
reference linear algebra routines give low performance (about 15-18% of
full FP throughput) for all problem sizes.

[Figure 8.2, plots omitted. Panels: (a) backward Riccati recursion factorization time (back_ric_trf, N = 10); (b) backward Riccati recursion solution time (back_ric_trs, N = 10); (c) backward Riccati recursion factorization Gflops; (d) backward Riccati recursion solution Gflops. Horizontal axis: nx (with nu = nx/2). Curves: Netlib, HPMPC, MKL, OpenBLAS.]
Figure 8.2: Backward Riccati recursion tests on Intel i7 3520M (Intel Ivy-Bridge micro-architecture, supporting the AVX ISA).
Figure 8.2b contains the results for the solution time. In this case, the linear
algebra routines in HPMPC still give a nice performance boost for small ma-
trices, but as the problem size increases the performance gets similar for all
implementations. This is due to the fact that the solution time is bounded by
the time needed to stream data from main memory. In the implementation of
the solution algorithm, optimized BLAS versions give little advantage over the
reference BLAS version. The performance plot in Figure 8.2d conﬁrms this,
showing that the performance drops for all implementations as the problem
size increases. Nonetheless, even if partially counter-balanced by a much lower
performance, the solution time is still about one order of magnitude lower than
the factorization time in case of large-scale problems.
In Figure 8.3, the forward Schur-complement recursion has been implemented
using linear algebra routines from HPMPC or from diﬀerent BLAS and LA-
PACK versions. Only the case of full Hessian of the cost function is considered.
Figure 8.3a contains the results for the factorization time. Overall, the results
are very similar to the ones for the backward Riccati recursion, with the diﬀer-
ence that the factorization times are slightly higher. This is partially due to the
higher ﬂop count, and partially to the slightly lower performance of the factor-
ization routine, as shown in Figure 8.3c. The lower performance is due to the
fact that these routines contain more LAPACK routines (triangular matrix
factorizations and inversions), that typically attain a lower performance. The
linear algebra routines in HPMPC still give the best performance, while the
use of reference BLAS and LAPACK gives very low performance. Regarding the optimized
BLAS versions, this time MKL slightly outperforms OpenBLAS, likely due to
the extremely poor performance of the triangular matrix inversion routine in
OpenBLAS.
Figure 8.3b contains the results for the solution time. Again, the results are very
similar to the ones for the backward Riccati recursion in Figure 8.2b. The linear
algebra routines in HPMPC give some performance improvements for small-
scale problems, while the performance is very similar for all implementations
for large-scale problems.
[Figure 8.3, plots omitted. Panels: (a) forward Schur-complement recursion factorization time (forward_schur_dense_trf, N = 10); (b) forward Schur-complement recursion solution time (forward_schur_dense_trs, N = 10); (c) forward Schur-complement recursion factorization Gflops; (d) forward Schur-complement recursion solution Gflops. Horizontal axis: nx (with nu = nx/2). Curves: Netlib, HPMPC, MKL, OpenBLAS.]
Figure 8.3: Forward Schur-complement recursion tests on Intel i7 3520M (Intel Ivy-Bridge micro-architecture, supporting the AVX ISA).
8.3.2 Comparison on Intel Haswell micro-architecture
The test processor is the same Intel i7 4800MQ used for the tests in Sections
3.5.2 and 4.4.2. The Haswell micro-architecture supports the AVX2 and FMA
ISAs, and it can perform two 256-bit-wide fused multiply-add operations every
clock cycle.
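The per-cycle peak floating-point throughput implied by these figures can be worked out as follows, assuming double-precision arithmetic (four 64-bit values per 256-bit register) and counting a fused multiply-add as two flops:

```latex
\text{Ivy-Bridge:}\quad 4\ \text{(mul)} + 4\ \text{(add)} = 8\ \text{flops/cycle},
\qquad
\text{Haswell:}\quad 2 \times 4 \times 2 = 16\ \text{flops/cycle}.
```

This factor of two is the reason why the same absolute performance corresponds to half the fraction of peak throughput on the Haswell processor.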
Qualitatively, the results on the Haswell processor are similar to those on the
Ivy-Bridge processor, with the main differences outlined here.
8.3.2.1 Comparison of algorithms in HPMPC
In case of the factorization routines, Figure 8.4a is similar to the Ivy-Bridge
counterpart in Figure 8.1a, with the backward Riccati recursion being up to
twice as fast for small-scale problems, and with the diagonal Hessian version
of the forward Schur-complement recursion being slightly faster for large-scale
problems. Comparing the performance plots (Figures 8.4c and 8.4e), in the case
of the Haswell processor the diﬀerence in performance between the backward
Riccati recursion and the forward Schur-complement recursion is slightly larger,
especially in case of long horizon (Figure 8.4e), where a performance drop is
clearly visible when the data footprint exceeds L3 cache size. The performance
of the forward Schur-complement recursion is more sensitive to cache size due
to the larger memory usage.
In case of the solution routines, the main diﬀerence between the Ivy-Bridge
processor and the Haswell processor is the fact that the drop in performance
happens for larger matrix sizes, since the L3 cache size is 6 MB on the Haswell
processor and 4 MB on the Ivy-Bridge processor. The performance levels are
similar on the two processors, since the performance is mainly due to the bandwidth
of L3 cache and of main memory, that is similar for the two processors.
8.3.2.2 Comparison of algorithms between HPMPC and optimized
BLAS
In the comparison with other BLAS and LAPACK implementations, there is
some noteworthy diﬀerence with respect to the Ivy-Bridge processor.
In the factorization routines, HPMPC and MKL roughly double performance
with respect to the Ivy-Bridge processor, and therefore the performance ratio
between the two is about the same. OpenBLAS however shows a lower performance,
especially in the case of the forward Schur-complement recursion in
Figure 8.6c, due to its less mature support for the AVX2 and FMA ISAs. Also
the reference BLAS and LAPACK versions give a lower performance than in
the Ivy-Bridge case, reaching about 10-12% of full FP throughput, showing once
again that it is increasingly hard for generic code to give high performance on
modern processors. The reference BLAS and LAPACK versions are about 6-8
times slower than the HPMPC versions.

[Figure 8.4, plots omitted. Panels: (a) factorization time, N = 10; (b) solution time, N = 10; (c) factorization Gflops, N = 10; (d) solution Gflops, N = 10; (e) factorization Gflops, N = 100; (f) solution Gflops, N = 100. Horizontal axis: nx (with nu = nx/2). Curves: backward Riccati; forward Schur, full Hessian; forward Schur, diagonal Hessian.]
Figure 8.4: KKT recursive factorization tests on Intel i7 4800MQ (Intel Haswell micro-architecture, supporting the AVX2 and FMA ISAs).
In the solution routines, there is an interesting diﬀerence with respect to the
Ivy-Bridge ones. Namely, the reference BLAS version gives a much lower perfor-
mance compared to all the other implementations, also for large-scale problems.
This is due to the fact that all reference level 2 BLAS routines where the matrix
is transposed perform extremely poorly on the Haswell machine when the FMA
ISA is employed (compare e.g. the performance of the 'N' and 'T' variants of the
dgemv routine in Figures 4.2a and 4.2b). All other versions perform similarly to
the Ivy-Bridge case, giving a similar absolute performance (since L3 cache and
memory bandwidth are similar), and therefore a lower fraction of the peak FP
throughput (since it doubled in the Haswell micro-architecture with respect to
the Ivy-Bridge micro-architecture).
8.4 Conclusion
This section presented two structure-exploiting recursive factorizations of the
KKT matrix of the unconstrained (linear) MPC and MHE problems. The backward
Riccati recursion begins the factorization at the last stage, and it can
naturally handle unconstrained MPC problems, but it cannot handle additional
equality constraints at the last stage. The forward Schur-complement
recursion begins the factorization at the first stage, and it can naturally handle
unconstrained MHE problems (while it may require regularization to handle
unconstrained MPC problems), and furthermore it can explicitly handle additional
equality constraints at the last stage.
The derivation and implementation (for both standard BLAS and LAPACK
routines, and tailored routines in HPMPC) are presented in detail. Furthermore,
several tests are performed to investigate the performance of the
implemented recursions, and their relation to the employed linear algebra routines.
[Figure 8.5, plots omitted. Panels: (a) backward Riccati recursion factorization time (back_ric_trf, N = 10); (b) backward Riccati recursion solution time (back_ric_trs, N = 10); (c) backward Riccati recursion factorization Gflops; (d) backward Riccati recursion solution Gflops. Horizontal axis: nx (with nu = nx/2). Curves: Netlib, HPMPC, MKL, OpenBLAS.]
Figure 8.5: Backward Riccati recursion tests on Intel i7 4800MQ (Intel Haswell micro-architecture, supporting the AVX2 and FMA ISAs).

[Figure 8.6, plots omitted. Panels: (a) forward Schur-complement recursion factorization time (forward_schur_dense_trf, N = 10); (b) forward Schur-complement recursion solution time (forward_schur_dense_trs, N = 10); (c) forward Schur-complement recursion factorization Gflops; (d) forward Schur-complement recursion solution Gflops. Horizontal axis: nx (with nu = nx/2). Curves: Netlib, HPMPC, MKL, OpenBLAS.]
Figure 8.6: Forward Schur-complement recursion tests on Intel i7 4800MQ (Intel Haswell micro-architecture, supporting the AVX2 and FMA ISAs).

The main findings of the tests for the recursive KKT matrix factorization routines
are:
• The main result is that the difference in computation time between
different recursive KKT matrix factorization algorithms is smaller than the
difference between different implementations.
The considered factorization routines share the same asymptotic complexity,
but with different coefficients. However, the difference in performance
between linear algebra implementations can be up to an order of magnitude,
which is much larger than the ratio between the flop counts of the
considered factorization routines.
• Regarding the KKT system solution algorithms, the performance is heavily
affected by the data memory footprint: there is no reuse of matrix elements
in level 2 BLAS routines, so the performance is bounded by the time needed
to stream the data. The difference between implementations is significant
only as long as the data memory footprint does not exceed the L2 cache size
(with the exception of reference BLAS on the Haswell micro-architecture,
which gives noticeably worse performance). Therefore, for the small-scale
problems typical of embedded optimization, the implementation of level 2
BLAS routines can play an important role.
• In case of well-optimized routines, computational throughput is the main
factor affecting the performance of the factorization routines, whereas the
cache size and the memory bandwidth are the main factors affecting the
performance of the solution routines. The latency of FP instructions
affects the performance for small-scale problems.
• The backward Riccati recursion is better suited than the forward
Schur-complement recursion to be used as a routine in solvers for
constrained MPC. In fact, the backward Riccati factorization routine has
some performance advantage for small-scale problems, while the backward
Riccati solution has some performance advantage for large-scale problems.
• For small-scale problems, the time required to factorize the KKT matrix
is only slightly larger than the time required to solve the KKT system
once the KKT matrix is factorized. This has important consequences for the
choice of optimization algorithms for constrained MPC and MHE problems.
For small-scale problems, algorithms requiring a small combined number of
KKT matrix factorizations and solutions (e.g. IPMs) are preferable. For
large-scale problems, algorithms requiring a small number of KKT matrix
factorizations (e.g. first-order methods such as ADMM) are preferable,
since the solution time is about an order of magnitude smaller than the
factorization time.
As a final note, the findings of these tests have wider scope than the considered
recursive factorization algorithms. In the remainder of the thesis, the tests will
be limited to a single processor, namely the Intel Core i7 3520M, supporting the
AVX ISA. Furthermore, comparisons with BLAS and LAPACK libraries will
not be repeated for other algorithms, the results being analogous.
Chapter 9
Condensing methods
Condensing traditionally refers to a solution method for the MPC problem
where the states are removed from the problem formulation by using the
state-space equation to rewrite them as a function of the initial state (a
datum) and the inputs (retained as optimization variables). This leads to a
smaller but dense and unstructured optimization problem, which can be solved
using general-purpose methods: e.g. Cholesky factorization for the
unconstrained problem, active set algorithms for the constrained problem.
However, condensing can have a different interpretation: condensing can be seen
as a technique that transforms an MPC problem with horizon length N into an
MPC problem with horizon length 1, at the expense of increasing the input
vector size from nu to Nnu. In this interpretation, suggested by the recently
proposed work on partial condensing [18], the fact that the states are removed
from the problem formulation is just accidental, and due to the fact that the
initial state is a datum in the MPC problem.
This second interpretation of condensing can naturally be applied to the MHE
problem: in fact, in this case the initial state is an optimization variable in
both the original and the condensed MHE problems. Therefore the condensed
formulation of an MHE problem with horizon N, nx states and nu inputs is just
an MHE problem with horizon 1, nx states and Nnu inputs.
All numerical tests in this section are performed on a laptop equipped with
the Intel Core i7 3520M @ 3.6 GHz in turbo mode, running Linux 14.04. All
algorithms are implemented using matrices in the panel-major format and the
linear algebra routines that are part of HPMPC, as described in Part I of this
thesis.
As a ﬁnal note, in the computation of the ﬂop counts, it is assumed that addi-
tions and multiplications are performed through a FMA pipeline, and therefore
that they have a cost of 2 ﬂops. This reﬂects the common implementation of
these instructions on modern architectures, where generally additions, multipli-
cations and FMA instructions have the same throughput.
9.1 Condensing methods for MPC
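Let us assume that N = 3. Equation (7.2b) can be rewritten as
\[
\bar{x} = \bar{A}^{-1}\bar{B}\bar{u} + \bar{A}^{-1}\bar{b} = \Gamma_u\bar{u} + \Gamma_{x,b} \tag{9.1}
\]
where the matrices $\bar{x}$, $\bar{u}$, $\bar{A}$, $\bar{B}$, $\bar{b}$ are
\[
\bar{x} = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}, \quad
\bar{u} = \begin{bmatrix} u_0 \\ u_1 \\ u_2 \end{bmatrix},
\]
\[
\bar{A} = \begin{bmatrix} I & & & \\ -A_0 & I & & \\ & -A_1 & I & \\ & & -A_2 & I \end{bmatrix}, \quad
\bar{B} = \begin{bmatrix} 0 & & \\ B_0 & & \\ & B_1 & \\ & & B_2 \end{bmatrix}, \quad
\bar{b} = \begin{bmatrix} \hat{x}_0 \\ b_0 \\ b_1 \\ b_2 \end{bmatrix}.
\]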
Notice that x0 is considered an optimization variable, and therefore it is part
of the x¯ vector, even if its value is constrained to xˆ0. This choice simpliﬁes
the derivation of some algorithms, even if it may be convenient to drop x0
from the optimization variables to improve the computational time of practical
implementations.
Furthermore, notice that the matrix $\bar{A}$ is invertible, sparse (namely
block bi-diagonal, containing $O(N)$ non-zero elements), and lower triangular.
The matrix $\bar{A}_N^{-1} = \bar{A}^{-1}$ (where the index $N$ means that the
matrix is related to an MPC problem with horizon length $N$) can be computed
recursively by means of the explicit formula for the inversion of a lower
triangular matrix
\[
\begin{bmatrix} X & \\ Y & Z \end{bmatrix}^{-1}
= \begin{bmatrix} X^{-1} & \\ -Z^{-1}YX^{-1} & Z^{-1} \end{bmatrix} \tag{9.2}
\]
as
\[
\bar{A}_N^{-1}
= \begin{bmatrix} \bar{A}_{N-1} & \\ -A_{N-1}E_{N-1} & I \end{bmatrix}^{-1}
= \begin{bmatrix} \bar{A}_{N-1}^{-1} & \\ A_{N-1}E_{N-1}\bar{A}_{N-1}^{-1} & I \end{bmatrix} \tag{9.3}
\]
where $E_n$ is the matrix
\[
E_n = \begin{bmatrix} 0 & \dots & 0 & I \end{bmatrix} \tag{9.4}
\]
of size $n_x \times ((n+1)\cdot n_x)$. Therefore the expression
$E_n\bar{A}_n^{-1}$ is the last block-row of the matrix $\bar{A}_n^{-1}$.
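Notice that the matrix $\bar{A}^{-1}$ is dense (namely lower triangular,
containing $O(N^2)$ non-zero elements), and for N = 3 it looks like
\[
\bar{A}^{-1}
= \begin{bmatrix} I & & & \\ -A_0 & I & & \\ & -A_1 & I & \\ & & -A_2 & I \end{bmatrix}^{-1}
= \begin{bmatrix} I & & & \\ A_0 & I & & \\ A_1A_0 & A_1 & I & \\ A_2A_1A_0 & A_2A_1 & A_2 & I \end{bmatrix} \tag{9.5}
\]
Therefore the matrices $\Gamma_u$ and $\Gamma_{x,b}$ are
\[
\Gamma_u = \begin{bmatrix} 0 & & \\ B_0 & & \\ A_1B_0 & B_1 & \\ A_2A_1B_0 & A_2B_1 & B_2 \end{bmatrix}, \quad
\Gamma_{x,b} = \begin{bmatrix} \hat{x}_0 \\ A_0\hat{x}_0 + b_0 \\ A_1(A_0\hat{x}_0 + b_0) + b_1 \\ A_2(A_1(A_0\hat{x}_0 + b_0) + b_1) + b_2 \end{bmatrix}.
\]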
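Inserting (9.1) in the cost function
\[
\phi = \tfrac{1}{2}\left(\bar{x}^T\bar{Q}\bar{x} + \bar{x}^T\bar{S}^T\bar{u} + \bar{u}^T\bar{S}\bar{x} + \bar{u}^T\bar{R}\bar{u}\right) + \bar{q}^T\bar{x} + \bar{r}^T\bar{u} + \tfrac{1}{2}\rho
\]
where the matrices $\bar{Q}$, $\bar{S}$, $\bar{R}$, $\bar{q}$, $\bar{r}$, defined in (7.3), are
\[
\bar{Q} = \begin{bmatrix} Q_0 & & & \\ & Q_1 & & \\ & & Q_2 & \\ & & & Q_3 \end{bmatrix}, \quad
\bar{S} = \begin{bmatrix} S_0 & & & \\ & S_1 & & \\ & & S_2 & 0 \end{bmatrix}, \quad
\bar{R} = \begin{bmatrix} R_0 & & \\ & R_1 & \\ & & R_2 \end{bmatrix},
\]
\[
\bar{q} = \begin{bmatrix} q_0 \\ q_1 \\ q_2 \\ q_3 \end{bmatrix}, \quad
\bar{r} = \begin{bmatrix} r_0 \\ r_1 \\ r_2 \end{bmatrix},
\]
the cost function is rewritten as
\[
\phi = \tfrac{1}{2}\bar{u}^TH_R\bar{u} + g_r^T\bar{u} + \rho_\rho \tag{9.6}
\]
where
\[
\begin{aligned}
H_R &= \Gamma_u^T\bar{Q}\Gamma_u + \Gamma_u^T\bar{S}^T + \bar{S}\Gamma_u + \bar{R} \\
g_r &= \Gamma_u^T\bar{Q}\Gamma_{x,b} + \bar{S}\Gamma_{x,b} + \Gamma_u^T\bar{q} + \bar{r} \\
\rho_\rho &= \tfrac{1}{2}\left(\Gamma_{x,b}^T\bar{Q}\Gamma_{x,b} + 2\bar{q}^T\Gamma_{x,b} + \rho\right)
\end{aligned}
\]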
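As an illustrative special case (a worked example added for clarity, not part
of the original derivation), take N = 1 and assume that $\bar{S}$ pairs each
input $u_i$ with the state $x_i$ of the same stage. Then
$\Gamma_u = [0;\, B_0]$ and $\Gamma_{x,b} = [\hat{x}_0;\, A_0\hat{x}_0 + b_0]$,
the product $\bar{S}\Gamma_u$ vanishes, and the condensed quantities reduce to

```latex
\begin{aligned}
H_R &= R_0 + B_0^T Q_1 B_0, \\
g_r &= r_0 + S_0\hat{x}_0 + B_0^T\left(Q_1(A_0\hat{x}_0 + b_0) + q_1\right).
\end{aligned}
```

The Hessian does not depend on $\hat{x}_0$, while the gradient is affine in it.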
Condensing algorithms can be useful in several cases: to provide the Hessian
or the factorized Hessian to gradient and active set methods, as a way to solve
unconstrained MPC problems, possibly embedded in interior-point methods.
Therefore, in this section several algorithms will be considered: three algorithms
to compute the condensed Hessian matrix HR, two algorithms to compute the
factorization of the condensed Hessian matrix, ﬁve combined algorithms to com-
pute the Cholesky factor of the Hessian matrix, and ﬁnally two algorithms to
solve the condensed MPC problem. All these algorithms are characterized by
diﬀerent asymptotic complexity, and therefore are better suited for diﬀerent
problem instances.
9.1.1 Condensing algorithms for MPC
This section contains algorithms to build the condensed Hessian matrix for the
MPC problem.
In the ﬁrst part, eﬃcient algorithms to build Γu and Γx,b are presented, that
exploit the sparsity of the matrix A¯.
Then three different approaches for the computation of the condensed Hessian
matrix are presented, and each approach leads to a different asymptotic
complexity. The first approach is the classical way to build the condensed
Hessian matrix, and it has a computational complexity $O(N^3)$ and $O(n_x^2)$.
The second approach has been recently proposed in [31] and [16], and it has a
computational complexity $O(N^2)$ and $O(n_x^2)$. The third approach is, to my
knowledge, novel, and it employs a recursion that resembles the backward
Riccati recursion, giving a complexity $O(N^2)$ and $O(n_x^3)$, but with a
smaller quadratic term in $N$.
Afterwards, the same three approaches are employed for the computation of the
gradient vector. In this case, however, the last two approaches have the same
computational complexity.
Finally, the three approaches for the computation of the condensed Hessian
matrix are compared, both as number of ﬂops and as execution time of the
algorithms implemented using linear algebra routines in HPMPC.
9.1.1.1 $O(N^2)$ computation of $\Gamma_u$
The matrices $\Gamma_u$ and $\Gamma_{x,b}$ can be efficiently computed in time
$O(N^2)$ and $O(N)$ respectively. This can be obtained by exploiting the
structure of the matrix $\bar{A}$.
The key idea is to avoid the explicit computation of the matrix
$\bar{A}^{-1}$, and instead directly compute $\Gamma_u$ and $\Gamma_{x,b}$
using a structure-exploiting lower triangular system solution (i.e. a forward
substitution).
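For N = 3, the generic system to be solved is in the form
\[
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{bmatrix}
= \begin{bmatrix} I & & & \\ -A_0 & I & & \\ & -A_1 & I & \\ & & -A_2 & I \end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
= \begin{bmatrix} x_0 \\ x_1 - A_0x_0 \\ x_2 - A_1x_1 \\ x_3 - A_2x_2 \end{bmatrix}
\]
that gives
\[
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
= \begin{bmatrix} I & & & \\ -A_0 & I & & \\ & -A_1 & I & \\ & & -A_2 & I \end{bmatrix}^{-1}
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{bmatrix}
= \begin{bmatrix} y_0 \\ y_1 + A_0x_0 \\ y_2 + A_1x_1 \\ y_3 + A_2x_2 \end{bmatrix}. \tag{9.7}
\]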
Therefore the solution can clearly be computed recursively in time $O(N)$, as
the solution algorithm requires $2Nn_x^2$ flops, where $n_x \times n_x$ is the
size of each matrix $A_n$. In case of a matrix instead of a vector at the
right-hand-side, the same algorithm can be applied column-wise.
The computation of $\Gamma_{x,b}$ requires the application of the algorithm to
a single column at the right-hand-side, and therefore it has a cost of about
$2Nn_x^2$ flops.
The algorithm for the computation of $\Gamma_{x,b}$ is presented in Algorithm 6.
Algorithm 6 Computation of Γx,b
1: Γx,b[0]← xˆ0
2: for i← 0, . . . , N − 1 do
3: Γx,b[i+ 1]← Ai · Γx,b[i] + bi
4: end for
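Algorithm 6 can be sketched in a few lines of plain Python. This is only an
illustration, not the HPMPC implementation: the helper names `matvec` and
`gamma_xb` and the double-integrator data are assumptions introduced here, and
matrices are plain nested lists rather than panel-major arrays.

```python
def matvec(M, v):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gamma_xb(A, b, x0_hat):
    """Return the blocks [Gamma_xb[0], ..., Gamma_xb[N]] following Algorithm 6."""
    blocks = [list(x0_hat)]                      # Gamma_xb[0] <- x0_hat
    for A_i, b_i in zip(A, b):
        Ax = matvec(A_i, blocks[-1])             # A_i * Gamma_xb[i]
        blocks.append([p + q for p, q in zip(Ax, b_i)])
    return blocks


# Hypothetical time-invariant data: a discrete double integrator with N = 3.
N = 3
A = [[[1, 1], [0, 1]]] * N
b = [[0, 1]] * N
result = gamma_xb(A, b, [1, 0])
print(result)  # [[1, 0], [1, 1], [2, 2], [4, 3]]
```

Each block costs one matrix-vector product, which is where the $2Nn_x^2$ flop
count comes from.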
In the computation of $\Gamma_u$, it is possible to exploit the fact that the
last elements of each row are 0. The total number of flops can be computed as
\[
2n_x^2n_u\sum_{n=0}^{N-1} n \approx N^2n_x^2n_u.
\]
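The approximation follows from the closed form of the arithmetic sum. As a
short worked step (added for clarity), block row $n+1$ of $\Gamma_u$ requires
multiplying $A_n$ by $n$ blocks of size $n_x \times n_u$, so that

```latex
2 n_x^2 n_u \sum_{n=0}^{N-1} n
  = 2 n_x^2 n_u \, \frac{N(N-1)}{2}
  = N(N-1)\, n_x^2 n_u
  \approx N^2 n_x^2 n_u .
```

The dropped term $N n_x^2 n_u$ is of lower order in $N$.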
The algorithm for the computation of Γu is presented in Algorithm 7.
Algorithm 7 Computation of Γu
1: Γu[1, 0]← B0
2: for i← 1, . . . , N − 1 do
3: Γu[i+ 1, 0 : i− 1]← Ai · Γu[i, 0 : i− 1]
4: Γu[i+ 1, i]← Bi
5: end for
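A minimal Python sketch of Algorithm 7 is given below. It is an illustration
under stated assumptions: only the non-zero blocks of $\Gamma_u$ are stored, in
a dictionary keyed by (block row, block column); the names `matmul` and
`gamma_u` and the numerical data are hypothetical.

```python
def matmul(X, Y):
    """Dense matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def gamma_u(A, B):
    """Non-zero blocks of Gamma_u as {(row, col): block}, following Algorithm 7."""
    N = len(B)
    G = {(1, 0): B[0]}                           # Gamma_u[1, 0] <- B_0
    for i in range(1, N):
        for j in range(i):                       # Gamma_u[i+1, 0:i-1] <- A_i * Gamma_u[i, 0:i-1]
            G[(i + 1, j)] = matmul(A[i], G[(i, j)])
        G[(i + 1, i)] = B[i]                     # Gamma_u[i+1, i] <- B_i
    return G


# Hypothetical data with N = 3, nx = 2, nu = 1.
N = 3
A = [[[1, 1], [0, 1]], [[1, 2], [0, 1]], [[2, 0], [1, 1]]]
B = [[[0], [1]], [[1], [0]], [[1], [1]]]
G = gamma_u(A, B)
print(G[(3, 0)])  # A_2 A_1 B_0 = [[4], [3]]
```

Only the N(N+1)/2 blocks below the block diagonal are formed, which is the
structure exploited in the flop count above.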
If the condensing algorithm is implemented using matrices in the panel-major
format, then the computation of $\Gamma_u^T$ is preferred over the computation
of $\Gamma_u$. In fact, in the computation of $\Gamma_u^T$ the condensing
algorithms operate on block-columns, which are all properly aligned in memory.
The algorithm for the computation of $\Gamma_u^T$ is presented in Algorithm 8.
Algorithm 8 Computation of $\Gamma_u^T$
1: $\Gamma_u^T[0, 1] \leftarrow B_0^T$
2: for $i \leftarrow 1, \dots, N-1$ do
3:     $\Gamma_u^T[0:i-1, i+1] \leftarrow \Gamma_u^T[0:i-1, i] \cdot A_i^T$
4:     $\Gamma_u^T[i, i+1] \leftarrow B_i^T$
5: end for
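Multiplication of matrices on the left by $\bar{A}^{-T}$ can be computed
efficiently using a structure-exploiting upper triangular system solution
(i.e. a backward substitution). For N = 3, the generic system to be solved is
in the form
\[
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{bmatrix}
= \begin{bmatrix} I & -A_0^T & & \\ & I & -A_1^T & \\ & & I & -A_2^T \\ & & & I \end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
= \begin{bmatrix} x_0 - A_0^Tx_1 \\ x_1 - A_1^Tx_2 \\ x_2 - A_2^Tx_3 \\ x_3 \end{bmatrix}
\]
that gives
\[
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
= \begin{bmatrix} I & -A_0^T & & \\ & I & -A_1^T & \\ & & I & -A_2^T \\ & & & I \end{bmatrix}^{-1}
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{bmatrix}
= \begin{bmatrix} y_0 + A_0^Tx_1 \\ y_1 + A_1^Tx_2 \\ y_2 + A_2^Tx_3 \\ y_3 \end{bmatrix}. \tag{9.8}
\]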
The cost to compute each column is the same as in the case of the structured
forward substitution, and equal to about $2Nn_x^2$ flops.
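The structured backward substitution in (9.8) can be sketched in plain Python
as follows. The function names `matvec_t` and `solve_abar_t` and the data are
hypothetical; the sketch handles a single right-hand-side column and checks the
result by multiplying back with the block rows of $\bar{A}^T$.

```python
def matvec_t(M, v):
    """Compute M^T v without forming the transpose."""
    return [sum(row[c] * x for row, x in zip(M, v)) for c in range(len(M[0]))]


def solve_abar_t(A, y):
    """Solve Abar^T x = y by structured backward substitution, as in (9.8)."""
    N = len(A)
    x = [None] * (N + 1)
    x[N] = list(y[N])                            # last block: x_N = y_N
    for i in range(N - 1, -1, -1):
        At_x = matvec_t(A[i], x[i + 1])          # A_i^T x_{i+1}
        x[i] = [p + q for p, q in zip(y[i], At_x)]
    return x


# Hypothetical data with N = 3 and nx = 2.
A = [[[1, 1], [0, 1]], [[1, 2], [0, 1]], [[2, 0], [1, 1]]]
y = [[1, 0], [0, 1], [1, 1], [2, -1]]
x = solve_abar_t(A, y)

# Block row i of Abar^T x is x_i - A_i^T x_{i+1}; the last block row is x_N.
residual_ok = x[3] == y[3] and all(
    [p - q for p, q in zip(x[i], matvec_t(A[i], x[i + 1]))] == y[i]
    for i in range(3)
)
print(x[0], residual_ok)  # [5, 13] True
```

One transposed matrix-vector product is performed per stage, matching the cost
of the forward substitution.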
Multiplication of matrices on the right by $\bar{A}^{-1}$ can be computed
efficiently as the transpose of the multiplication of matrices on the left by
$\bar{A}^{-T}$, i.e. by computing a row at a time as
\[
\begin{bmatrix} x_0^T & x_1^T & x_2^T & x_3^T \end{bmatrix}
= \begin{bmatrix} y_0^T + x_1^TA_0 & y_1^T + x_2^TA_1 & y_2^T + x_3^TA_2 & y_3^T \end{bmatrix}. \tag{9.9}
\]
9.1.1.2 $O(N^3)$ and $O(n_x^2)$ computation of $H_R$
The key operation in the condensing method is the computation of the matrix
$\Gamma_u^T\bar{Q}\Gamma_u$. One natural algorithm is
\[
\Gamma_u^T\bar{Q}\Gamma_u = \Gamma_u^T \cdot (\bar{Q} \cdot \Gamma_u)
\]
where the $\Gamma_u$ matrix has been previously computed. If the algorithm is
implemented using the implementation techniques employed in HPMPC, a better
algorithm is
\[
\Gamma_u^T\bar{Q}\Gamma_u = (\Gamma_u^T \cdot \bar{Q}) \cdot (\Gamma_u^T)^T
\]
where the matrix $\Gamma_u^T$ is precomputed. In this variant, it is possible
to operate on block-column sub-matrices (which are properly memory aligned).
The matrix-matrix products are computed exploiting the block-triangular
structure of the $\Gamma_u$ matrix.
The matrix $\Gamma_u^T \cdot \bar{Q}$ is
\[
\Gamma_u^T \cdot \bar{Q} =
\begin{bmatrix}
0 & B_0^TQ_1 & B_0^TA_1^TQ_2 & B_0^TA_1^TA_2^TQ_3 \\
0 & & B_1^TQ_2 & B_1^TA_2^TQ_3 \\
0 & & & B_2^TQ_3
\end{bmatrix}
\]
and it can be computed one block-column at a time, at a cost of about
$N^2n_x^2n_u$ flops. Notice that, if the matrices $Q_i$ are diagonal, this cost
can be reduced to about $N^2n_xn_u$ flops, linear in $n_x$.
Once the matrix $\Gamma_u^T \cdot \bar{Q}$ has been computed, the
lower-triangular part of the product $(\Gamma_u^T \cdot \bar{Q}) \cdot \Gamma_u$ is
\[
(\Gamma_u^T \cdot \bar{Q}) \cdot \Gamma_u =
\begin{bmatrix} T_{00} & * & * \\ T_{10} & T_{11} & * \\ T_{20} & T_{21} & T_{22} \end{bmatrix}
\]
where
\[
\begin{aligned}
T_{00} &= B_0^TQ_1B_0 + B_0^TA_1^TQ_2A_1B_0 + B_0^TA_1^TA_2^TQ_3A_2A_1B_0 \\
T_{10} &= B_1^TQ_2A_1B_0 + B_1^TA_2^TQ_3A_2A_1B_0 \\
T_{20} &= B_2^TQ_3A_2A_1B_0 \\
T_{11} &= B_1^TQ_2B_1 + B_1^TA_2^TQ_3A_2B_1 \\
T_{21} &= B_2^TQ_3A_2B_1 \\
T_{22} &= B_2^TQ_3B_2
\end{aligned}
\]
and it can be computed using either high- or low-rank updates.
In the case of high-rank updates, the matrix $\Gamma_u^T\bar{Q}\Gamma_u$ is
computed one block of size $n_u \times n_u$ at a time. This requires
$\frac{N(N-1)}{2}$ calls to the gemm BLAS routine and $N$ calls to the syrk
BLAS routine. The rank of the updates ranges between $n_x$ and $Nn_x$. If the
algorithm is implemented using matrices in panel-major format, this version has
the issue that blocks of row index larger than 1 may not be properly aligned in
memory. The cost of the algorithm is
\[
2n_xn_u^2\sum_{m=1}^{N}\sum_{n=1}^{m} n \approx 2n_xn_u^2\sum_{m=1}^{N}\tfrac{1}{2}m^2 \approx \tfrac{1}{3}N^3n_xn_u^2
\]
flops, exploiting symmetry.
In the case of low-rank updates, the matrix $\Gamma_u^T\bar{Q}\Gamma_u$ is
computed by multiplying the i-th block-column of $\Gamma_u^T\bar{Q}$ by the
i-th block-row of $\Gamma_u$ by means of $N$ calls to the syrk BLAS routine.
The rank of the update is fixed to $n_x$, while the size of the updated
sub-matrix ranges from $n_u$ to $Nn_u$. If the algorithm is implemented using
matrices in panel-major format, this version has the advantage that all updated
sub-matrices are at the top-left corner, and therefore properly aligned in
memory. The cost of the algorithm is the same as in the high-rank update case:
\[
n_xn_u^2\sum_{n=1}^{N} n^2 \approx \tfrac{1}{3}N^3n_xn_u^2
\]
flops, exploiting symmetry.
The block-diagonal of the $H_R$ matrix is initialized with the $R_i$ matrices.
The strictly block-lower-triangular part of the $H_R$ matrix is initialized
with the matrix $\bar{S} \cdot \Gamma_u = (\Gamma_u^T \cdot \bar{S}^T)^T$,
which is computed using the gemm routine as
\[
\bar{S} \cdot \Gamma_u =
\begin{bmatrix} 0 & & \\ S_1B_0 & 0 & \\ S_2A_1B_0 & S_2B_1 & 0 \end{bmatrix}
\]
at the cost of about $N^2n_xn_u^2$ flops.
The $O(N^3)$ condensing algorithm for the low-rank case is summarized in
Algorithm 9. Besides the cost to compute $\Gamma_u^T$, the cost of the
algorithm is about
\[
\tfrac{1}{3}N^3n_xn_u^2 + N^2n_x^2n_u + N^2n_xn_u^2 \approx \tfrac{1}{3}N^3n_xn_u^2 + N^2n_x^2n_u
\]
flops if the matrices $Q_i$ and $S_i$ are full, and about
\[
\tfrac{1}{3}N^3n_xn_u^2
\]
flops if the matrices $Q_i$ are diagonal and the matrices $S_i$ are zero.
Algorithm 9 Computation of $H_R$, $O(N^3)$ and $O(n_x^2)$ algorithm
Require: $\Gamma_u^T$
1: for $i \leftarrow 0, \dots, N-1$ do
2:     $(\Gamma_u^T\bar{Q})[0:i, i+1] \leftarrow \Gamma_u^T[0:i, i+1] \cdot Q_{i+1}$
3: end for
4: for $i \leftarrow 0, \dots, N-1$ do
5:     $H_R[i, i] \leftarrow R_i$
6: end for
7: for $i \leftarrow 0, \dots, N-2$ do
8:     $H_R[i+1, 0:i] \leftarrow (\Gamma_u^T[0:i, i+1] \cdot S_{i+1}^T)^T$
9: end for
10: for $i \leftarrow 0, \dots, N-1$ do
11:     $H_R[0:i, 0:i] \leftarrow H_R[0:i, 0:i] + (\Gamma_u^T\bar{Q})[0:i, i+1] \cdot (\Gamma_u^T[0:i, i+1])^T$
12: end for
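The low-rank variant of Algorithm 9 can be sketched in plain Python as follows.
This is an illustration under stated assumptions, not the HPMPC code: blocks
are stored in dictionaries keyed by (block row, block column), the sketch works
with $\Gamma_u$ rather than with the panel-major $\Gamma_u^T$, only the lower
triangular blocks of $H_R$ are formed, and all names (`mm`, `tr`, `add`,
`gamma_u`, `hessian_alg9`, `hessian_ref`) and data are hypothetical. The result
is compared with a direct block-by-block evaluation of
$\Gamma_u^T\bar{Q}\Gamma_u + \bar{S}\Gamma_u + \bar{R}$.

```python
def mm(X, Y):
    """Dense matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def tr(X):
    """Matrix transpose."""
    return [list(col) for col in zip(*X)]

def add(X, Y):
    """Elementwise matrix sum."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def gamma_u(A, B):
    """Non-zero blocks Gamma_u[i, j] = A_{i-1} ... A_{j+1} B_j for i > j."""
    G = {}
    for j in range(len(B)):
        G[(j + 1, j)] = B[j]
        for i in range(j + 2, len(B) + 1):
            G[(i, j)] = mm(A[i - 1], G[(i - 1, j)])
    return G

def hessian_alg9(A, B, Q, R, S):
    N, G = len(B), gamma_u(A, B)
    H = {(i, i): R[i] for i in range(N)}              # block diagonal <- R_i
    for i in range(1, N):
        for j in range(i):
            H[(i, j)] = mm(S[i], G[(i, j)])           # strictly lower part <- S_bar Gamma_u
    for k in range(1, N + 1):                         # rank-nx update with block row k
        for i in range(k):
            GtQ = mm(tr(G[(k, i)]), Q[k])             # block of Gamma_u^T Q_bar
            for j in range(i + 1):
                H[(i, j)] = add(H[(i, j)], mm(GtQ, G[(k, j)]))
    return H

def hessian_ref(A, B, Q, R, S):
    """Direct evaluation of each lower triangular block, used only as a check."""
    N, G = len(B), gamma_u(A, B)
    H = {}
    for i in range(N):
        for j in range(i + 1):
            blk = R[i] if i == j else mm(S[i], G[(i, j)])
            for k in range(i + 1, N + 1):
                blk = add(blk, mm(tr(G[(k, i)]), mm(Q[k], G[(k, j)])))
            H[(i, j)] = blk
    return H

# Hypothetical integer data: N = 3, nx = 2, nu = 1, symmetric Q_i.
A = [[[1, 1], [0, 1]], [[1, 2], [0, 1]], [[2, 0], [1, 1]]]
B = [[[0], [1]], [[1], [0]], [[1], [1]]]
Q = [[[2, 1], [1, 3]] for _ in range(4)]
R = [[[1]], [[2]], [[3]]]
S = [[[1, 0]], [[0, 1]], [[1, 1]]]
H9 = hessian_alg9(A, B, Q, R, S)
matches = H9 == hessian_ref(A, B, Q, R, S)
```

The triple loop over `k`, `i`, `j` is what gives the cubic dependence on N.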
9.1.1.3 $O(N^2)$ and $O(n_x^2)$ computation of $H_R$
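By means of the recursive expression for the $\bar{A}^{-1} = \bar{A}_N^{-1}$
matrix in (9.3), it is possible to write the analogous recursive expression for
the $\Gamma_u = \Gamma_{u,N}$ matrix
\[
\begin{aligned}
\Gamma_{u,N} = \bar{A}_N^{-1}\bar{B}_N
&= \begin{bmatrix} \bar{A}_{N-1}^{-1} & \\ A_{N-1}E_{N-1}\bar{A}_{N-1}^{-1} & I \end{bmatrix}
\begin{bmatrix} \bar{B}_{N-1} & \\ & B_{N-1} \end{bmatrix} \\
&= \begin{bmatrix} \bar{A}_{N-1}^{-1}\bar{B}_{N-1} & \\ A_{N-1}E_{N-1}\bar{A}_{N-1}^{-1}\bar{B}_{N-1} & B_{N-1} \end{bmatrix}
= \begin{bmatrix} \Gamma_{u,N-1} & \\ A_{N-1}E_{N-1}\Gamma_{u,N-1} & B_{N-1} \end{bmatrix}
\end{aligned} \tag{9.10}
\]
where $E_n$ is defined in (9.4) and the expression $E_n\Gamma_{u,n}$ is the last
block-row of the matrix $\Gamma_{u,n}$, and similarly the expression
$\Gamma_{u,n}^TE_n^T$ is the last block-column of the matrix $\Gamma_{u,n}^T$.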
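By means of (9.10), it is possible to investigate the inner structure of the
expression $\Gamma_u^T\bar{Q}\Gamma_u = \Gamma_{u,N}^T\bar{Q}_N\Gamma_{u,N}$ as
\[
\begin{aligned}
(\Gamma_{u,N}^T\bar{Q}_N)\Gamma_{u,N}
&= \begin{bmatrix} \Gamma_{u,N-1}^T & \Gamma_{u,N-1}^TE_{N-1}^TA_{N-1}^T \\ & B_{N-1}^T \end{bmatrix}
\begin{bmatrix} \bar{Q}_{N-1} & \\ & Q_N \end{bmatrix}
\begin{bmatrix} \Gamma_{u,N-1} & \\ A_{N-1}E_{N-1}\Gamma_{u,N-1} & B_{N-1} \end{bmatrix} \\
&= \begin{bmatrix} \Gamma_{u,N-1}^T\bar{Q}_{N-1} & \Gamma_{u,N-1}^TE_{N-1}^TA_{N-1}^TQ_N \\ & B_{N-1}^TQ_N \end{bmatrix}
\begin{bmatrix} \Gamma_{u,N-1} & \\ A_{N-1}E_{N-1}\Gamma_{u,N-1} & B_{N-1} \end{bmatrix} \\
&= \begin{bmatrix} \Gamma_{u,N-1}^T\bar{Q}_{N-1}\Gamma_{u,N-1} + \Gamma_{u,N-1}^TE_{N-1}^TA_{N-1}^TQ_NA_{N-1}E_{N-1}\Gamma_{u,N-1} & * \\ (B_{N-1}^TQ_NA_{N-1})E_{N-1}\Gamma_{u,N-1} & B_{N-1}^TQ_NB_{N-1} \end{bmatrix}
\end{aligned}
\]
Defining $\tilde{D}_{N-1} = B_{N-1}^TQ_NB_{N-1}$ and
$\tilde{M}_{N-1} = B_{N-1}^TQ_NA_{N-1}$, the last block-row of the matrix
$\Gamma_{u,N}^T\bar{Q}_N\Gamma_{u,N}$ can be computed at the cost of
$2Nn_xn_u^2$ flops as (using N = 3 to make notation easier)
\[
\begin{bmatrix} \tilde{M}_2A_1B_0 & \tilde{M}_2B_1 & \tilde{D}_2 \end{bmatrix}. \tag{9.11}
\]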
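The top-left block of $\Gamma_{u,N}^T\bar{Q}_N\Gamma_{u,N}$ has the structure
\[
\begin{aligned}
&\Gamma_{u,N-1}^T\bar{Q}_{N-1}\Gamma_{u,N-1} + \Gamma_{u,N-1}^TE_{N-1}^TA_{N-1}^TQ_NA_{N-1}E_{N-1}\Gamma_{u,N-1} = \\
&\quad = \left(\Gamma_{u,N-1}^T\bar{Q}_{N-1} + \Gamma_{u,N-1}^TE_{N-1}^TA_{N-1}^TQ_NA_{N-1}E_{N-1}\right)\Gamma_{u,N-1}
\end{aligned} \tag{9.12}
\]
which has the same structure as the matrix
$(\Gamma_{u,N}^T\bar{Q}_N)\Gamma_{u,N}$. In fact, the expression for the matrix
$\Gamma_{u,N-1}^T\bar{Q}_{N-1}$ is obtained from the expression for the matrix
$\Gamma_{u,N}^T\bar{Q}_N$ with $N-1$ in place of $N$
\[
\Gamma_{u,N-1}^T\bar{Q}_{N-1} =
\begin{bmatrix} \Gamma_{u,N-2}^T\bar{Q}_{N-2} & \Gamma_{u,N-2}^TE_{N-2}^TA_{N-2}^TQ_{N-1} \\ & B_{N-2}^TQ_{N-1} \end{bmatrix}
\]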
and the matrix $\Gamma_{u,N-1}^TE_{N-1}^TA_{N-1}^TQ_NA_{N-1}E_{N-1}$ is in the form
\[
\begin{aligned}
\Gamma_{u,N-1}^TE_{N-1}^TA_{N-1}^TQ_NA_{N-1}E_{N-1}
&= \begin{bmatrix} \Gamma_{u,N-2}^T & \Gamma_{u,N-2}^TE_{N-2}^TA_{N-2}^T \\ & B_{N-2}^T \end{bmatrix}
\begin{bmatrix} 0 \\ I \end{bmatrix}
A_{N-1}^TQ_NA_{N-1}
\begin{bmatrix} 0 & I \end{bmatrix} \\
&= \begin{bmatrix} 0 & \Gamma_{u,N-2}^TE_{N-2}^TA_{N-2}^TA_{N-1}^TQ_NA_{N-1} \\ 0 & B_{N-2}^TA_{N-1}^TQ_NA_{N-1} \end{bmatrix}
\end{aligned} \tag{9.13}
\]
where only the last block-column has non-zero elements, and therefore the
matrix $\Gamma_{u,N-1}^T\bar{Q}_{N-1} + \Gamma_{u,N-1}^TE_{N-1}^TA_{N-1}^TQ_NA_{N-1}E_{N-1}$
can be computed as an update of the last block-column of the matrix
$\Gamma_{u,N-1}^T\bar{Q}_{N-1}$. For N = 3, this matrix is
\[
\Gamma_{u,N-1}^T\bar{Q}_{N-1} + \Gamma_{u,N-1}^TE_{N-1}^TA_{N-1}^TQ_NA_{N-1}E_{N-1} =
\begin{bmatrix}
0 & B_0^TQ_1 & B_0^TA_1^TQ_2 + B_0^TA_1^TA_2^TQ_3A_2 \\
0 & & B_1^TQ_2 + B_1^TA_2^TQ_3A_2
\end{bmatrix}
\]
where it can be seen that only the last block-column has been updated. The
recursion thus can close.
Notice that, exploiting the recursive form of $\Gamma_{u,N}^T\bar{Q}_N$, it holds
\[
\begin{bmatrix} \Gamma_{u,N-2}^TE_{N-2}^TA_{N-2}^TA_{N-1}^TQ_N \\ B_{N-2}^TA_{N-1}^TQ_N \end{bmatrix}
= \Gamma_{u,N-1}^TE_{N-1}^TA_{N-1}^TQ_N
\]
which are the first $N-1$ blocks of the last block-column of the matrix
$\Gamma_{u,N}^T\bar{Q}_N$.
Therefore the matrix in (9.13) can be computed at a cost of $2Nn_x^2n_u$ flops,
besides the cost to compute $\Gamma_u^T\bar{Q}$ (equal to about $N^2n_x^2n_u$
flops if the matrices $Q_i$ are dense, and to about $N^2n_xn_u$ flops if the
matrices $Q_i$ are diagonal). Notice that at the following recursion steps,
the matrices to be computed have decreasing size $N-1, \dots, 1$, and summing
up over all recursion steps, the computational cost is about $N^2n_x^2n_u$
flops (besides the cost to compute $\Gamma_u^T\bar{Q}$). Therefore this update
avoids the computation of $O(n_x^3)$ operations. In Section 9.1.1.4 a different
update is proposed, which makes use of $O(n_x^3)$ operations to decrease the
$O(N^2)$ terms in the computational complexity.
Furthermore, notice that the computation of the matrix $\bar{S}\Gamma_u$ can be
embedded in the computation of the matrix $\Gamma_u^T\bar{Q}\Gamma_u$ at no
extra cost. In fact, the last block-row of the matrix
$\bar{S}\Gamma_u + \Gamma_u^T\bar{Q}\Gamma_u$ can be computed easily by using
\[
M_{N-1} = S_{N-1} + \tilde{M}_{N-1} = S_{N-1} + B_{N-1}^TQ_NA_{N-1} \tag{9.14}
\]
in place of $\tilde{M}_{N-1}$ in (9.11). Similarly, the last block-diagonal
element of the matrix $\bar{R} + \Gamma_u^T\bar{Q}\Gamma_u$ can be computed
easily by using
\[
D_{N-1} = R_{N-1} + \tilde{D}_{N-1} = R_{N-1} + B_{N-1}^TQ_NB_{N-1} \tag{9.15}
\]
in place of $\tilde{D}_{N-1}$ in (9.11).
At the end of the algorithm, the (lower triangular part of the) condensed
Hessian matrix $H_R$ shows the structure
\[
H_R = \begin{bmatrix} D_0 & * & * \\ M_1B_0 & D_1 & * \\ M_2A_1B_0 & M_2B_1 & D_2 \end{bmatrix}
\]
where
\[
\begin{aligned}
D_0 &= R_0 + B_0^TQ_1B_0 + B_0^TA_1^TQ_2A_1B_0 + B_0^TA_1^TA_2^TQ_3A_2A_1B_0 \\
D_1 &= R_1 + B_1^TQ_2B_1 + B_1^TA_2^TQ_3A_2B_1 \\
M_1 &= S_1 + B_1^TQ_2A_1 + B_1^TA_2^TQ_3A_2A_1 \\
D_2 &= R_2 + B_2^TQ_3B_2 \\
M_2 &= S_2 + B_2^TQ_3A_2
\end{aligned}
\]
This $O(N^2)$ and $O(n_x^2)$ condensing algorithm is summarized in Algorithm 10.
Besides the cost to compute $\Gamma_u^T$, the cost of the algorithm is about
\[
2N^2n_x^2n_u + N^2n_xn_u^2
\]
flops if the matrices $Q_i$ are dense, and about
\[
N^2n_x^2n_u + N^2n_xn_u^2
\]
flops if the matrices $Q_i$ are diagonal.
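Algorithm 10 Computation of $H_R$, $O(N^2)$ and $O(n_x^2)$ algorithm
Require: $\Gamma_u^T$
1: $\Gamma_w^T[0:N-1, N] \leftarrow \Gamma_u^T[0:N-1, N] \cdot Q_N$
2: for $i \leftarrow N-1, \dots, 1$ do
3:     $\Gamma_w^T[0:i, i] \leftarrow \Gamma_w^T[0:i, i+1] \cdot A_i$
4:     $D_i \leftarrow R_i + B_i^T \cdot (\Gamma_w^T[i, i+1])^T$
5:     $H_R[i, i] \leftarrow D_i$
6:     $M_i \leftarrow S_i + \Gamma_w^T[i, i]$
7:     $H_R[i, 0:i-1] \leftarrow (\Gamma_u^T[0:i-1, i] \cdot M_i^T)^T$
8:     $\Gamma_w^T[0:i-1, i] \leftarrow \Gamma_w^T[0:i-1, i] + \Gamma_u^T[0:i-1, i] \cdot Q_i$
9: end for
10: $D_0 \leftarrow R_0 + B_0^T \cdot (\Gamma_w^T[0, 1])^T$
11: $H_R[0, 0] \leftarrow D_0$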
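The steps of Algorithm 10 can be sketched in plain Python as follows. This is
an illustration under stated assumptions: blocks are stored in dictionaries
keyed by (block row, block column), the work matrix $\Gamma_w^T$ is the
dictionary `W`, the matrices $Q_i$ are assumed symmetric, and all names
(`mm`, `tr`, `add`, `gamma_u`, `hessian_alg10`, `hessian_ref`) and data are
hypothetical. The comments refer to the numbered steps of Algorithm 10.

```python
def mm(X, Y):
    """Dense matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def tr(X):
    """Matrix transpose."""
    return [list(col) for col in zip(*X)]

def add(X, Y):
    """Elementwise matrix sum."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def gamma_u(A, B):
    """Non-zero blocks Gamma_u[i, j] = A_{i-1} ... A_{j+1} B_j for i > j."""
    G = {}
    for j in range(len(B)):
        G[(j + 1, j)] = B[j]
        for i in range(j + 2, len(B) + 1):
            G[(i, j)] = mm(A[i - 1], G[(i - 1, j)])
    return G

def hessian_alg10(A, B, Q, R, S):
    N, G = len(B), gamma_u(A, B)
    Gt = {(j, i): tr(blk) for (i, j), blk in G.items()}       # blocks of Gamma_u^T
    W = {(j, N): mm(Gt[(j, N)], Q[N]) for j in range(N)}      # step 1
    H = {}
    for i in range(N - 1, 0, -1):
        for j in range(i + 1):
            W[(j, i)] = mm(W[(j, i + 1)], A[i])               # step 3
        H[(i, i)] = add(R[i], mm(tr(B[i]), tr(W[(i, i + 1)])))  # steps 4-5
        M = add(S[i], W[(i, i)])                              # step 6
        for j in range(i):
            H[(i, j)] = tr(mm(Gt[(j, i)], tr(M)))             # step 7
            W[(j, i)] = add(W[(j, i)], mm(Gt[(j, i)], Q[i]))  # step 8
    H[(0, 0)] = add(R[0], mm(tr(B[0]), tr(W[(0, 1)])))        # steps 10-11
    return H

def hessian_ref(A, B, Q, R, S):
    """Direct evaluation of each lower triangular block, used only as a check."""
    N, G = len(B), gamma_u(A, B)
    H = {}
    for i in range(N):
        for j in range(i + 1):
            blk = R[i] if i == j else mm(S[i], G[(i, j)])
            for k in range(i + 1, N + 1):
                blk = add(blk, mm(tr(G[(k, i)]), mm(Q[k], G[(k, j)])))
            H[(i, j)] = blk
    return H

# Hypothetical integer data: N = 3, nx = 2, nu = 1, symmetric Q_i.
A = [[[1, 1], [0, 1]], [[1, 2], [0, 1]], [[2, 0], [1, 1]]]
B = [[[0], [1]], [[1], [0]], [[1], [1]]]
Q = [[[2, 1], [1, 3]] for _ in range(4)]
R = [[[1]], [[2]], [[3]]]
S = [[[1, 0]], [[0, 1]], [[1, 1]]]
H10 = hessian_alg10(A, B, Q, R, S)
matches = H10 == hessian_ref(A, B, Q, R, S)
```

Each pass of the outer loop touches one block-column of the work matrix, so
the number of block products grows quadratically with N.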
9.1.1.4 $O(N^2)$ and $O(n_x^3)$ computation of $H_R$
It is possible to compute the matrix $\Gamma_u^T\bar{Q}\Gamma_u$ by means of a
different algorithm, which makes use of $O(n_x^3)$ operations in order to
reduce the $O(N^2)$ terms in the computational complexity of the algorithm.
Namely, the update in (9.12) is performed as
\[
\begin{aligned}
&\Gamma_{u,N-1}^T\bar{Q}_{N-1}\Gamma_{u,N-1} + \Gamma_{u,N-1}^TE_{N-1}^TA_{N-1}^TQ_NA_{N-1}E_{N-1}\Gamma_{u,N-1} = \\
&\quad = \Gamma_{u,N-1}^T\left(\bar{Q}_{N-1} + E_{N-1}^TA_{N-1}^TQ_NA_{N-1}E_{N-1}\right)\Gamma_{u,N-1}.
\end{aligned}
\]
The matrix $E_{N-1}^TA_{N-1}^TQ_NA_{N-1}E_{N-1}$ is zero everywhere but in the
bottom-right block. Therefore the update reduces to the update of the
bottom-right element of $\bar{Q}_{N-1}$, that is
\[
P_{N-1} = Q_{N-1} + A_{N-1}^TP_NA_{N-1} \tag{9.16}
\]
where $P_N = Q_N$. This is an operation with computational cost $O(n_x^3)$, and
it can be computed efficiently in $\tfrac{7}{3}n_x^3$ flops as
\[
P_{N-1} = Q_{N-1} + (A_{N-1}^TL_N)(A_{N-1}^TL_N)^T
\]
where $L_N$ is the lower triangular Cholesky factor of $P_N$.
Summing up over the $N$ stages gives a computational complexity of
$\tfrac{7}{3}Nn_x^3$, which replaces a term $2N^2n_x^2n_u$ in the computational
cost of Algorithm 10. Besides the cost to compute $\Gamma_u^T$ (equal to about
$N^2n_x^2n_u$ flops), the cost of the algorithm is about
\[
N^2n_xn_u^2 + \tfrac{7}{3}Nn_x^3 + 3Nn_x^2n_u + Nn_xn_u^2 \approx N^2n_xn_u^2 + \tfrac{7}{3}Nn_x^3
\]
flops. The algorithm is summarized in Algorithm 11.
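The Riccati-like recursion behind Algorithm 11 can be sketched in plain Python
as follows. This is an illustration under stated assumptions: the products
$A_i^TP_{i+1}A_i$, $B_i^TP_{i+1}A_i$ and $B_i^TP_{i+1}B_i$ are formed directly
instead of through the Cholesky factor $L_{i+1}$ (which changes the flop count
but not the result), the matrices $Q_i$ are assumed symmetric, blocks are
stored in dictionaries, and all names and data are hypothetical.

```python
def mm(X, Y):
    """Dense matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def tr(X):
    """Matrix transpose."""
    return [list(col) for col in zip(*X)]

def add(X, Y):
    """Elementwise matrix sum."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def gamma_u(A, B):
    """Non-zero blocks Gamma_u[i, j] = A_{i-1} ... A_{j+1} B_j for i > j."""
    G = {}
    for j in range(len(B)):
        G[(j + 1, j)] = B[j]
        for i in range(j + 2, len(B) + 1):
            G[(i, j)] = mm(A[i - 1], G[(i - 1, j)])
    return G

def hessian_alg11(A, B, Q, R, S):
    N, G = len(B), gamma_u(A, B)
    P = Q[N]                                          # P_N <- Q_N
    H = {}
    for i in range(N - 1, -1, -1):
        PB, PA = mm(P, B[i]), mm(P, A[i])             # P_{i+1} B_i and P_{i+1} A_i
        H[(i, i)] = add(R[i], mm(tr(B[i]), PB))       # D_i = R_i + B_i^T P_{i+1} B_i
        if i > 0:
            M = add(S[i], mm(tr(B[i]), PA))           # M_i = S_i + B_i^T P_{i+1} A_i
            for j in range(i):
                H[(i, j)] = mm(M, G[(i, j)])          # H_R[i, j] = M_i Gamma_u[i, j]
            P = add(Q[i], mm(tr(A[i]), PA))           # (9.16): P_i = Q_i + A_i^T P_{i+1} A_i
    return H

def hessian_ref(A, B, Q, R, S):
    """Direct evaluation of each lower triangular block, used only as a check."""
    N, G = len(B), gamma_u(A, B)
    H = {}
    for i in range(N):
        for j in range(i + 1):
            blk = R[i] if i == j else mm(S[i], G[(i, j)])
            for k in range(i + 1, N + 1):
                blk = add(blk, mm(tr(G[(k, i)]), mm(Q[k], G[(k, j)])))
            H[(i, j)] = blk
    return H

# Hypothetical integer data: N = 3, nx = 2, nu = 1, symmetric Q_i.
A = [[[1, 1], [0, 1]], [[1, 2], [0, 1]], [[2, 0], [1, 1]]]
B = [[[0], [1]], [[1], [0]], [[1], [1]]]
Q = [[[2, 1], [1, 3]] for _ in range(4)]
R = [[[1]], [[2]], [[3]]]
S = [[[1, 0]], [[0, 1]], [[1, 1]]]
H11 = hessian_alg11(A, B, Q, R, S)
matches = H11 == hessian_ref(A, B, Q, R, S)
```

Compared with the previous approach, no work matrix of the size of
$\Gamma_u^T$ is propagated: the stage information is carried by the single
$n_x \times n_x$ matrix `P`.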
9.1.1.5 $O(N^2)$ computation of $g_r$
Similar arguments apply to the computation of the right-hand-side $g_r$. The
key operation of the algorithm is the computation of
\[
\Gamma_u^T \cdot (\bar{Q}\Gamma_{x,b} + \bar{q}). \tag{9.17}
\]
If the matrix $\Gamma_u$ (assumed to be precomputed) is used, this operation
can be done in $N^2n_xn_u$ flops, which is quadratic in $N$ and linear in
$n_x$. The algorithm for the computation of the right-hand-side $g_r$ is
summarized in Algorithm 12. Besides the cost to compute $\Gamma_{x,b}$, the
cost of the algorithm is about
\[
N^2n_xn_u + 2Nn_x^2 + 2Nn_xn_u \approx N^2n_xn_u + 2Nn_x^2
\]
flops if the matrices $Q_i$ and $S_i$ are dense, and about
\[
N^2n_xn_u
\]
flops if the matrices $Q_i$ are diagonal and the matrices $S_i$ are zero, which
is linear in $n_x$.
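Algorithm 11 Computation of $H_R$, $O(N^2)$ and $O(n_x^3)$ algorithm
Require: $\Gamma_u^T$
1: $P_N \leftarrow Q_N$
2: for $i \leftarrow N-1, \dots, 1$ do
3:     $L_{i+1} \leftarrow P_{i+1}^{1/2}$
4:     $\begin{bmatrix} B_i^T \\ A_i^T \end{bmatrix}_L \leftarrow \begin{bmatrix} B_i^T \\ A_i^T \end{bmatrix} \cdot L_{i+1}$
5:     $\begin{bmatrix} D_i & \\ M_i^T & P_i \end{bmatrix} \leftarrow \begin{bmatrix} R_i & \\ S_i^T & Q_i \end{bmatrix} + \left(\begin{bmatrix} B_i^T \\ A_i^T \end{bmatrix}_L\right) \cdot \left(\begin{bmatrix} B_i^T \\ A_i^T \end{bmatrix}_L\right)^T$
6:     $H_R[i, i] \leftarrow D_i$
7:     $H_R[i, 0:i-1] \leftarrow (\Gamma_u^T[0:i-1, i] \cdot M_i^T)^T$
8: end for
9: $L_1 \leftarrow P_1^{1/2}$
10: $(B_0^T)_L \leftarrow B_0^T \cdot L_1$
11: $D_0 \leftarrow R_0 + (B_0^T)_L \cdot ((B_0^T)_L)^T$
12: $H_R[0, 0] \leftarrow D_0$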
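Algorithm 12 Computation of $g_r$, $O(N^2)$ algorithm
Require: $\Gamma_u^T$, $\Gamma_{x,b}$
1: for $i \leftarrow 0, \dots, N-1$ do
2:     $g_r[i] \leftarrow r_i + S_i \cdot \Gamma_{x,b}[i]$
3: end for
4: for $i \leftarrow 0, \dots, N$ do
5:     $(\bar{Q}\Gamma_{x,b} + \bar{q})[i] \leftarrow q_i + Q_i \cdot \Gamma_{x,b}[i]$
6: end for
7: for $i \leftarrow 0, \dots, N-1$ do
8:     $g_r[0:i] \leftarrow g_r[0:i] + \Gamma_u^T[0:i, i+1] \cdot (\bar{Q}\Gamma_{x,b} + \bar{q})[i+1]$
9: end for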
9.1.1.6 $O(N)$ computation of $g_r$ - (1)
If the structure of the matrix $\Gamma_u = \bar{A}^{-1}\bar{B}$ is exploited,
it is possible to compute (9.17) as
\[
\Gamma_u^T \cdot (\bar{Q}\Gamma_{x,b} + \bar{q}) = \bar{B}^T \cdot (\bar{A}^{-T} \cdot (\bar{Q}\Gamma_{x,b} + \bar{q}))
\]
trading off an increase of the computational complexity in $n_x$ with a
reduction in $N$. In fact, the operation
$\bar{A}^{-T} \cdot (\bar{Q}\Gamma_{x,b} + \bar{q})$ can be computed in
$2Nn_x^2$ flops using (9.8). The multiplication by $\bar{B}^T$ can then be
computed in $2Nn_xn_u$ flops exploiting the fact that $\bar{B}$ is
block-diagonal. The algorithm for the computation of the right-hand-side $g_r$
is summarized in Algorithm 13. Besides the cost to compute $\Gamma_{x,b}$, the
cost of the algorithm is about
\[
4Nn_x^2 + 4Nn_xn_u
\]
flops if the matrices $Q_i$ and $S_i$ are dense, and about
\[
2Nn_x^2 + 2Nn_xn_u
\]
flops if the matrices $Q_i$ are diagonal and the matrices $S_i$ are zero, which
is still quadratic in $n_x$.
Algorithm 13 Computation of gr, O(N) algorithm - (1)
Require:
Γx,b
1: for i← 0, . . . , N − 1 do
2: gr[i]← ri + Si · Γx,b[i]
3: end for
4: for i← 0, . . . , N do
5: (Q¯Γx,b + q¯)[i]← qi +Qi · Γx,b[i]
6: end for
7: t[N ]← (Q¯Γx,b + q¯)[N ]
8: for i← N − 1, . . . , 0 do
9: t[i]← (Q¯Γx,b + q¯)[i] +ATi · t[i+ 1]
10: end for
11: for i← 0, . . . , N − 1 do
12: gr[i]← gr[i] +BTi · t[i+ 1]
13: end for
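Algorithm 13 can be sketched in plain Python as follows. This is an
illustration under stated assumptions: vectors and matrices are plain lists,
the backward substitution and the multiplication by $\bar{B}^T$ are fused into
a single loop (so that only one block of `t` is kept), and all names and data
are hypothetical. The result is compared with a reference that applies every
block of $\Gamma_u^T$ explicitly.

```python
def mv(M, v):
    """Matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def mtv(M, v):
    """Transposed product M^T v, without forming M^T."""
    return [sum(row[c] * x for row, x in zip(M, v)) for c in range(len(M[0]))]

def vadd(u, v):
    """Elementwise vector sum."""
    return [p + q for p, q in zip(u, v)]

def common_terms(A, B, Q, S, q, r, b, x0_hat):
    """Return w = Q_bar Gamma_xb + q_bar and g = r_bar + S_bar Gamma_xb."""
    N = len(B)
    gx = [list(x0_hat)]                               # Gamma_xb via Algorithm 6
    for i in range(N):
        gx.append(vadd(mv(A[i], gx[i]), b[i]))
    w = [vadd(q[i], mv(Q[i], gx[i])) for i in range(N + 1)]
    g = [vadd(r[i], mv(S[i], gx[i])) for i in range(N)]
    return w, g

def gradient_alg13(A, B, Q, S, q, r, b, x0_hat):
    w, g = common_terms(A, B, Q, S, q, r, b, x0_hat)
    t = w[len(B)]                                     # t[N]
    for i in range(len(B) - 1, -1, -1):
        g[i] = vadd(g[i], mtv(B[i], t))               # g_r[i] += B_i^T t[i+1]
        t = vadd(w[i], mtv(A[i], t))                  # t[i] = w[i] + A_i^T t[i+1]
    return g

def gradient_reference(A, B, Q, S, q, r, b, x0_hat):
    """Apply each block Gamma_u[k, j]^T = B_j^T A_{j+1}^T ... A_{k-1}^T explicitly."""
    w, g = common_terms(A, B, Q, S, q, r, b, x0_hat)
    N = len(B)
    for j in range(N):
        for k in range(j + 1, N + 1):
            v = w[k]
            for m in range(k - 1, j, -1):
                v = mtv(A[m], v)
            g[j] = vadd(g[j], mtv(B[j], v))
    return g

# Hypothetical integer data: N = 3, nx = 2, nu = 1.
A = [[[1, 1], [0, 1]], [[1, 2], [0, 1]], [[2, 0], [1, 1]]]
B = [[[0], [1]], [[1], [0]], [[1], [1]]]
Q = [[[2, 1], [1, 3]] for _ in range(4)]
S = [[[1, 0]], [[0, 1]], [[1, 1]]]
q = [[1, 0], [0, 1], [1, 1], [2, 0]]
r = [[1], [0], [-1]]
b = [[0, 1], [1, 0], [0, 0]]
args = (A, B, Q, S, q, r, b, [1, -1])
g13 = gradient_alg13(*args)
g_ref = gradient_reference(*args)
```

The fused loop performs two transposed matrix-vector products per stage, which
is linear in N.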
9.1.1.7 $O(N)$ computation of $g_r$ - (2)
The approach of the $O(N^2)$ and $O(n_x^3)$ method to condense the Hessian
matrix can be applied to the condensing of the gradient vector. This algorithm
employs the $P_i$ and $M_i$ matrices computed in Algorithm 11, and therefore
its use makes sense only in connection with that Hessian condensing algorithm.
Furthermore, the two algorithms could be merged into a single one, similarly to
the fact that the backward substitution can be merged with the backward Riccati
recursion (see Algorithm 1). For N = 3, at the end of the algorithm the
gradient vector looks like
\[
g_r = \begin{bmatrix} m_0 + M_0\hat{x}_0 \\ m_1 + M_1(A_0\hat{x}_0 + b_0) \\ m_2 + M_2(A_1(A_0\hat{x}_0 + b_0) + b_1) \end{bmatrix}
\]
where
\[
\begin{aligned}
m_0 &= r_0 + B_0^T(P_1b_0 + p_1) \\
m_1 &= r_1 + B_1^T(P_2b_1 + p_2) \\
m_2 &= r_2 + B_2^T(P_3b_2 + p_3)
\end{aligned}
\]
where in turn
\[
\begin{aligned}
p_1 &= q_1 + A_1^T(P_2b_1 + p_2) \\
p_2 &= q_2 + A_2^T(P_3b_2 + p_3) \\
p_3 &= q_3
\end{aligned}
\]
and the matrices $P_i$ and $M_i$ are defined in the Hessian condensing
algorithm. The algorithm is presented in Algorithm 14, and it has a
computational complexity of
\[
4Nn_x^2 + 4Nn_xn_u
\]
flops, irrespective of whether the matrices $Q_i$ are dense or diagonal.
9.1.1.8 Comparison of condensing algorithms for MPC
In this section the three condensing algorithms for the computation of HR are
compared, both in terms of ﬂops and in terms of running times of a practical
implementation. The algorithms for the computation of gr are not compared,
since this is generally not the key operation.
Algorithm 14 Computation of $g_r$, $O(N)$ algorithm - (2)
Require: $\Gamma_{x,b}$, $P_i$, $M_i$
1: $p_N \leftarrow q_N$
2: for $i \leftarrow N-1, \dots, 1$ do
3:     $t_i \leftarrow P_{i+1} \cdot b_i + p_{i+1}$
4:     $p_i \leftarrow q_i + A_i^T \cdot t_i$
5:     $m_i \leftarrow r_i + B_i^T \cdot t_i$
6:     $g_r[i] \leftarrow m_i + M_i \cdot \Gamma_{x,b}[i]$
7: end for
8: $t_0 \leftarrow P_1 \cdot b_0 + p_1$
9: $m_0 \leftarrow r_0 + B_0^T \cdot t_0$
10: $g_r[0] \leftarrow m_0 + M_0 \cdot \Gamma_{x,b}[0]$
In the comparison in terms of flops, two cases are considered:
• dense $R_i$, $S_i$ and $Q_i$ matrices (dense Hessian of the cost function);
• diagonal $R_i$ and $Q_i$ matrices and zero $S_i$ matrix (diagonal Hessian of
the cost function).
Table 9.1: Comparison of condensing algorithms in terms of flops. In all cases,
an additional $N^2n_x^2n_u$ flops are needed to compute $\Gamma_u^T$.
algorithm    | dense cost function                           | diagonal cost function
Algorithm 9  | $\tfrac{1}{3}N^3n_xn_u^2 + N^2n_x^2n_u$       | $\tfrac{1}{3}N^3n_xn_u^2$
Algorithm 10 | $2N^2n_x^2n_u + N^2n_xn_u^2$                  | $N^2n_x^2n_u + N^2n_xn_u^2$
Algorithm 11 | $N^2n_xn_u^2 + \tfrac{7}{3}Nn_x^3$            | $N^2n_xn_u^2 + \tfrac{7}{3}Nn_x^3$
The number of flops for the three condensing algorithms for these two cases
is reported in Table 9.1. As a rule of thumb, Algorithm 9 (cubic in $N$ but
only quadratic in $n_x$) is advantageous for smaller values of the ratio
$N/n_x$, while Algorithm 11 (quadratic in $N$ but cubic in $n_x$) is
advantageous for larger values of the ratio $N/n_x$. Algorithm 10 has the best
asymptotic complexity in both dimensions, but with slightly larger
coefficients, so this algorithm should work reasonably well in all cases.
Algorithms 9 and 10 have a reduced computational complexity in case of a
diagonal Hessian of the cost function, while Algorithm 11 has the same
computational complexity.
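The leading-order flop counts of Table 9.1 can be evaluated directly to see
where the algorithms cross over. The following Python sketch is only an
illustration: the function names and the two problem sizes are hypothetical,
and the cost of computing $\Gamma_u^T$, common to all three algorithms, is left
out.

```python
def flops_alg9(N, nx, nu, diagonal=False):
    """Leading-order flops of Algorithm 9 (Table 9.1)."""
    cubic = N**3 * nx * nu**2 / 3
    return cubic if diagonal else cubic + N**2 * nx**2 * nu


def flops_alg10(N, nx, nu, diagonal=False):
    """Leading-order flops of Algorithm 10 (Table 9.1)."""
    return (1 if diagonal else 2) * N**2 * nx**2 * nu + N**2 * nx * nu**2


def flops_alg11(N, nx, nu, diagonal=False):
    """Leading-order flops of Algorithm 11 (Table 9.1); same in both cases."""
    return N**2 * nx * nu**2 + 7 * N * nx**3 / 3


def ranking(N, nx, nu, diagonal=False):
    """Algorithm labels sorted from the lowest to the highest flop count."""
    counts = {
        "Algorithm 9": flops_alg9(N, nx, nu, diagonal),
        "Algorithm 10": flops_alg10(N, nx, nu, diagonal),
        "Algorithm 11": flops_alg11(N, nx, nu, diagonal),
    }
    return sorted(counts, key=counts.get)


long_horizon = ranking(N=100, nx=4, nu=2)    # large ratio N/nx
many_states = ranking(N=5, nx=40, nu=2)      # small ratio N/nx
print(long_horizon)  # ['Algorithm 11', 'Algorithm 10', 'Algorithm 9']
print(many_states)   # ['Algorithm 9', 'Algorithm 10', 'Algorithm 11']
```

For these two sizes with a dense cost function, Algorithm 10 has the second
lowest count in both cases.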
In the comparison in terms of running times, the algorithms are implemented
using matrices in panel-major format, and using the linear algebra routines
in HPMPC. The results are in Figure 9.1. Numerical tests confirm that in
general Algorithm 9 is advantageous for smaller values of the ratio $N/n_x$,
while Algorithm 11 is advantageous for larger values of the ratio $N/n_x$.
Algorithm 10 is always the second best. In case of a diagonal Hessian of the
cost function (case (2)), it is possible to reduce the computational cost of
Algorithms 9 and 10, but not of Algorithm 11, which therefore is convenient
only for larger values of the ratio $N/n_x$.
9.1.2 Factorization algorithms for MPC
In this section two Hessian factorization algorithms are reviewed. The first
algorithm is the classical Cholesky factorization algorithm, which is commonly
employed to factorize the positive-definite condensed Hessian matrix. The
second algorithm is a structure-exploiting Cholesky factorization of the
reverse condensed Hessian matrix $\hat{H}_R$, that is the permutation of the
condensed Hessian $H_R$ such that the input vector is
\[
\hat{u} = \begin{bmatrix} u_{N-1} \\ \vdots \\ u_0 \end{bmatrix} \tag{9.18}
\]
The two algorithms have very different computational complexity, and can be
combined (with some limitations) with the Hessian condensing algorithms
presented in Section 9.1.1.
9.1.2.1 $O(N^3)$ Cholesky factorization of $H_R$
Once the $H_R$ matrix has been built using the Hessian condensing Algorithm 9,
10 or 11, it can be factorized using the standard Cholesky factorization. More
precisely, besides the cost to build the condensed Hessian $H_R$ using
Algorithm 9, 10 or 11, the cost of the algorithm is about
\[
\tfrac{1}{3}N^3n_u^3
\]
flops, and therefore cubic in both $N$ and $n_u$, and not dependent on $n_x$.
This is the classical way to factorize $H_R$.
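For reference, the classical dense Cholesky factorization can be sketched in
plain Python as follows. This is a textbook column-by-column variant, not the
blocked routine used in HPMPC; the function name `cholesky` and the small
matrix `H`, which stands in for a condensed Hessian of size $Nn_u = 3$, are
hypothetical.

```python
import math


def cholesky(H):
    """Return the lower triangular L with L L^T = H, for H symmetric positive definite."""
    n = len(H)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the squares of the already computed row entries.
        d = H[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            L[i][j] = (H[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L


H = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
L = cholesky(H)
print(L)  # [[2.0, 0.0, 0.0], [1.0, 2.0, 0.0], [1.0, 1.0, 2.0]]
```

The three nested loops (over `j`, `i` and the inner sums over `k`) are the
source of the cubic cost in the matrix size $Nn_u$.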
9.1.2.2 $O(N)$ Cholesky factorization of $\hat{H}_R$
As shown in the paper [34], it is possible to exploit the structure still
present in the reversed condensed Hessian matrix $\hat{H}_R$ to compute a
structure-exploiting Cholesky factorization that has a computational
complexity of $O(N)$ flops instead of the $O(N^3)$ flops of the classical
Cholesky factorization of the condensed Hessian matrix. As a drawback, the
computational complexity in $n_x$ is increased, but asymptotically the cost of
the factorization algorithm is completely hidden by the cost of the Hessian
condensing algorithm.
[Figure 9.1: Comparison of Hessian condensing algorithms for MPC: Algorithm 9
with cost $O(N^3)$ and $O(n_x^2)$ (blue), Algorithm 10 with cost $O(N^2)$ and
$O(n_x^2)$ (red), Algorithm 11 with cost $O(N^2)$ and $O(n_x^3)$ (green). The
time to compute $\Gamma_u^T$ is included in the total solution time. Panels
(time [s]): (a) N varying, dense cost function (nx=16, nu=8); (b) N varying,
diagonal cost function (nx=16, nu=8); (c) nx varying, dense cost function
(N=30, nu=2); (d) nx varying, diagonal cost function (N=30, nu=2).]
The main difference between this algorithm and the classical Cholesky factorization of the condensed Hessian $H_R$ is that the condensed Hessian matrix is factorized while it is being built: the matrix $H_R$ is therefore not computed explicitly, and it is not available at the end of the algorithm. The key idea is that the correction part of the Cholesky factorization (accounting for the $O(N^3)$ term) is replaced by a downgrade of the matrices $Q_i$ at a cost of $N n_x^2 n_u$ flops, similarly to what happens in the Riccati recursion. However, this structure-exploiting factorization needs to start from the last stage $N-1$. Since the Cholesky factorization operates from the top-left corner, and the condensed Hessian is traditionally written with the matrices of the first stage 0 at the top-left corner and the matrices of the last stage $N-1$ at the bottom-right corner, there are two options to implement this algorithm:
• computation of a reverse Cholesky factorization, which gives the lower triangular factor $L_R$ such that $L_R^T \cdot L_R = H_R$. This factorization starts from the bottom-right corner, but there are no standard LAPACK routines for it.
• computation of the classical Cholesky factorization of the permutation $\hat{H}_R$ of the condensed Hessian $H_R$ such that the input vector is (9.18). In this way, the factorization can start from the top-left corner using the standard LAPACK routine dpotrf.
In the following, the second approach is considered, due to the use of standard
linear algebra routines.
The structure-exploiting factorization procedure boils down to the following three steps, which have to be performed at each stage, starting from the last stage $N-1$ down to the first stage 0:
• computation of the (lower triangular) Cholesky factor of $D_n$ in (9.15), as $\Lambda_n = D_n^{1/2}$.
• solution of a triangular system with $M_n$ in (9.14) as right-hand side, as $L_n^T = M_n^T \Lambda_n^{-T}$. This step replaces the solution step in the classical Cholesky factorization, which accounts for the $O(N^2)$ term in the computational complexity.
• downgrade of the matrix $Q_n$ as $Q_n^* = Q_n - L_n^T L_n$, which is used in the following stage to compute $D_{n-1}$ and $M_{n-1}$. This step replaces the correction step in the classical Cholesky factorization, which accounts for the $O(N^3)$ term in the computational complexity.
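The link between these three steps and the Riccati recursion can be made explicit with a short calculation. The following is only a sketch based on the definitions above, assuming that $\Lambda_n$ is the lower triangular factor with $\Lambda_n \Lambda_n^T = D_n$:

```latex
\begin{align*}
L_n^T L_n &= \left(M_n^T \Lambda_n^{-T}\right)\left(\Lambda_n^{-1} M_n\right)
           = M_n^T \left(\Lambda_n \Lambda_n^T\right)^{-1} M_n
           = M_n^T D_n^{-1} M_n , \\
Q_n^* &= Q_n - L_n^T L_n = Q_n - M_n^T D_n^{-1} M_n .
\end{align*}
```

Under this reading, the downgrade of $Q_n$ is a Schur complement with respect to $D_n$, which is the same correction term that appears in the backward Riccati recursion.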
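The lower triangular Cholesky factor $\hat{L}_R$ of the matrix $\hat{H}_R$ is then computed as (using $N = 3$ to make notation easier)
\[ \hat{L}_R = \begin{bmatrix} \Lambda_2 & & \\ B_1^T L_2^T & \Lambda_1 & \\ B_0^T A_1^T L_2^T & B_0^T L_1^T & \Lambda_0 \end{bmatrix} \tag{9.19} \]
in the same way as the (lower triangular part of the) matrix $\hat{H}_R$ is computed as
\[ \hat{H}_R = \begin{bmatrix} D_2 & & \\ B_1^T M_2^T & D_1 & \\ B_0^T A_1^T M_2^T & B_0^T M_1^T & D_0 \end{bmatrix} . \]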
Besides the cost of the Hessian condensing algorithm, the computational complexity of the factorization algorithm is about
\[ N n_x^2 n_u + N n_x n_u^2 + \tfrac{1}{3} N n_u^3 \]
flops. When this is added to the computational complexity of the Hessian condensing algorithm, asymptotically the first two terms are totally hidden by the $O(N^2 n_x^2 n_u)$ and $O(N^2 n_x n_u^2)$ terms. Therefore, the only term adding to the asymptotic complexity is $O(N n_u^3)$, which is greatly reduced compared to the $O(N^3 n_u^3)$ computational complexity of the classical Cholesky factorization of the condensed Hessian.
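The two flop counts can be compared numerically for given problem sizes. The following is a minimal sketch that only evaluates the approximate formulas stated above; the function names are hypothetical, and flop counts alone do not determine running times.

```python
def classical_cholesky_flops(N, nu):
    """Approximate flops of the dense Cholesky factorization: (1/3) N^3 nu^3."""
    return (N * nu) ** 3 / 3.0


def structured_cholesky_flops(N, nx, nu):
    """Approximate flops of the structure-exploiting factorization:
    N nx^2 nu + N nx nu^2 + (1/3) N nu^3."""
    return N * nx**2 * nu + N * nx * nu**2 + N * nu**3 / 3.0


def flop_ratio(N, nx, nu):
    """Ratio classical / structure-exploiting; values above 1 favor the O(N) variant."""
    return classical_cholesky_flops(N, nu) / structured_cholesky_flops(N, nx, nu)


if __name__ == "__main__":
    # Sizes taken from the test cases used in the figures (nx = 16, nu = 8).
    for N in (1, 10, 30, 100):
        print(N, flop_ratio(N, nx=16, nu=8))
```

For short horizons the ratio falls below 1, since the structure-exploiting variant pays additional terms in $n_x$, which agrees with the qualitative discussion in terms of the ratio $N/n_x$.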
Since the factorization procedure is embedded in the condensing procedure, a suitable condensing algorithm has to be chosen. Namely, Algorithms 10 and 11 can be used, since they explicitly compute the quantities $D_n$ and $M_n$. On the other hand, Algorithm 9 cannot be used, since it does not provide these quantities, and a modification of Algorithm 9 that explicitly provides $D_n$ and $M_n$ would increase the computational complexity in terms of $n_x$, making the algorithm unattractive with respect to e.g. Algorithm 10. Furthermore notice that, even if $Q_n$ is diagonal, in general $Q_n^*$ is not diagonal for $n \neq N$, and therefore only the versions of the Hessian condensing algorithms considering $Q_n$ as dense can be used.
Figure 9.2: Comparison of Hessian factorization algorithms for MPC: Cholesky factorization of $H_R$ with cost $\tfrac{1}{3} N^3 n_u^3$ (red), structure-exploiting Cholesky factorization of $\hat{H}_R$ with cost $N n_x^2 n_u + N n_x n_u^2 + \tfrac{1}{3} N n_u^3$ (blue). Panels (running time in seconds, logarithmic axes): (a) N varying, nx = 16, nu = 8; (b) nx varying, N = 30, nu = 2.
9.1.2.3 Comparison of factorization algorithms for MPC
Figure 9.2 compares the two Hessian factorization algorithms; the cost to build the condensed Hessian matrix using Algorithm 9, 10 or 11 is not included. The plots show that the classical Cholesky factorization is efficient for small values of the ratio $N/n_x$, while the structure-exploiting Cholesky factorization is efficient for large values of the ratio $N/n_x$.
9.1.3 Condensing and factorization algorithms for MPC
In this section the combination of Hessian condensing and factorization algorithms is presented. All Hessian condensing algorithms (possibly tailored to a diagonal Hessian of the cost function) can be combined with the classical $O(N^3)$ Cholesky factorization. However, only Algorithms 10 and 11 in the version for dense Hessian of the cost function can be combined with the structure-exploiting $O(N)$ Cholesky factorization. The feasible combinations are summarized in Table 9.2.
In the remainder of the section, the two feasible combinations of Hessian condensing algorithms with the structure-exploiting $O(N)$ Cholesky factorization algorithm are presented in detail. Finally, different combinations of Hessian condensing and factorization algorithms for the MPC problem are compared.
Table 9.2: Feasible combinations of Hessian condensing algorithms and Hessian factorization algorithms in the MPC case (x: feasible, -: not feasible).
factorization algorithm | Algorithm 9, dense Qn | Algorithm 9, diag Qn | Algorithm 10, dense Qn | Algorithm 10, diag Qn | Algorithm 11, dense Qn | Algorithm 11, diag Qn
O(N^3) | x | x | x | x | x | x
O(N) | - | - | x | - | x | -
9.1.3.1 $O(N^2)$ and $O(n_x^2)$ computation of the Cholesky factor $\hat{L}_R$
The Hessian condensing Algorithm 10 can be easily combined with the $O(N)$ structure-exploiting reversed Hessian factorization algorithm for the computation of $\hat{L}_R$. The resulting algorithm is summarized in Algorithm 15. Notice that the matrix $\Gamma_u^T$ internally used by Algorithm 15 is not permuted, i.e. it is the same as in Algorithm 10: this simplifies the implementation if the matrices are in panel-major format, since the top-left corner of all sub-matrices used in the computation is properly aligned in memory. However, the algorithm builds the lower Cholesky factor $\hat{L}_R$ of the reversed Hessian $\hat{H}_R$: the permutation is performed in line 7 (diagonal blocks) and in the for loop in lines 10-12 (off-diagonal blocks).
Besides the cost to compute $\Gamma_u^T$ (equal to about $N^2 n_x^2 n_u$ flops), the overall computational complexity of the algorithm is about
\[ 2N^2 n_x^2 n_u + N^2 n_x n_u^2 + N n_x^2 n_u + N n_x n_u^2 + \tfrac{1}{3} N n_u^3 \approx 2N^2 n_x^2 n_u + N^2 n_x n_u^2 + \tfrac{1}{3} N n_u^3 \]
flops. This means that, asymptotically, the additional cost of Algorithm 15 with respect to Algorithm 10 is equal to $\tfrac{1}{3} N n_u^3$ (in place of the $\tfrac{1}{3} N^3 n_u^3$ obtained using the classical Cholesky factorization). Notice that, even if $Q_i$ is diagonal, in general $Q_i^*$ is not diagonal for $i \neq N$, and therefore (except for the last stage) it is not possible to reduce the computational cost in case of diagonal Hessian of the cost function.
Algorithm 15 Computation of the lower Cholesky factor $\hat{L}_R$ of $\hat{H}_R$, $O(N^2)$ and $O(n_x^2)$ algorithm
Require: $\Gamma_u^T$
1: $Q_N^* \leftarrow Q_N$
2: $\Gamma_w^T[0:N-1, N] \leftarrow \Gamma_u^T[0:N-1, N] \cdot Q_N^*$
3: for $i \leftarrow N-1, \dots, 1$ do
4:   $\Gamma_w^T[0:i, i] \leftarrow \Gamma_w^T[0:i, i+1] \cdot A_i$
5:   $D_i \leftarrow R_i + B_i^T \cdot (\Gamma_w^T[i, i+1])^T$
6:   $\Lambda_i \leftarrow D_i^{1/2}$
7:   $\hat{L}_R[N-1-i, N-1-i] \leftarrow \Lambda_i$
8:   $M_i \leftarrow S_i + \Gamma_w^T[i, i]$
9:   $L_i^T \leftarrow M_i^T \cdot \Lambda_i^{-T}$
10:   for $j \leftarrow 0, \dots, i-1$ do
11:     $\hat{L}_R[N-1-j, N-1-i] \leftarrow \Gamma_u^T[j, i] \cdot L_i^T$
12:   end for
13:   $Q_i^* \leftarrow Q_i - L_i^T \cdot L_i$
14:   $\Gamma_w^T[0:i-1, i] \leftarrow \Gamma_w^T[0:i-1, i] + \Gamma_u^T[0:i-1, i] \cdot Q_i^*$
15: end for
16: $D_0 \leftarrow R_0 + B_0^T \cdot (\Gamma_w^T[0, 1])^T$
17: $\Lambda_0 \leftarrow D_0^{1/2}$
18: $\hat{L}_R[N-1, N-1] \leftarrow \Lambda_0$
9.1.3.2 $O(N^2)$ and $O(n_x^3)$ computation of the Cholesky factor $\hat{L}_R$
Alternatively, the Hessian condensing Algorithm 11 can be combined with the $O(N)$ structure-exploiting reversed Hessian factorization algorithm for the computation of $\hat{L}_R$. The resulting algorithm is summarized in Algorithm 16. Notice that by using $Q_{N-1}^*$ in place of $Q_{N-1}$ in equation (9.34), the classical backward Riccati recursion is obtained:
\[ P_{N-1} = Q_{N-1}^* + A_{N-1}^T P_N A_{N-1} = Q_{N-1} + A_{N-1}^T P_N A_{N-1} - L_{N-1}^T L_{N-1} . \]
Therefore it is possible to use the classical backward Riccati recursion to compute a structure-exploiting Cholesky factorization of the reversed condensed Hessian $\hat{H}_R$ [19]. It is possible to embed the computation of $\Lambda_n$, $L_n$ and $Q_n^*$ in the Cholesky factorization of $P_n$, in the same way as in the backward Riccati recursion implementation in Algorithm 2: this is done in line 4 of Algorithm 16.
Besides the cost to compute $\Gamma_u^T$ (equal to about $N^2 n_x^2 n_u$ flops), the computational complexity of the algorithm is about
\[ N^2 n_x n_u^2 + \tfrac{7}{3} N n_x^3 + 4 N n_x^2 n_u + 2 N n_x n_u^2 + \tfrac{1}{3} N n_u^3 \approx N^2 n_x n_u^2 + \tfrac{7}{3} N n_x^3 + \tfrac{1}{3} N n_u^3 \]
flops. Notice that the third and the fourth term are excluded from the final approximation of the computational cost, since asymptotically they are totally hidden by the $O(N^2 n_x^2 n_u)$ and $O(N^2 n_x n_u^2)$ terms. Also notice that, besides the term $N^2 n_x^2 n_u$ coming from the computation of $\Gamma_u^T$ and the term $N^2 n_x n_u^2$ coming from the build of the factor $\hat{L}_R$ as in (9.19), the computational complexity is identical to that of the backward Riccati recursion, even if it is computed by summing up the complexity of Algorithm 11 and of the $O(N)$ reversed Hessian factorization algorithm.
Algorithm 16 Computation of the lower Cholesky factor $\hat{L}_R$ of $\hat{H}_R$, $O(N^2)$ and $O(n_x^3)$ algorithm
Require: $\Gamma_u^T$
1: $\mathcal{L}_N \leftarrow Q_N^{1/2}$
2: for $i \leftarrow N-1, \dots, 1$ do
3:   $\begin{bmatrix} B_i^T \\ A_i^T \end{bmatrix}_{\mathcal{L}} \leftarrow \begin{bmatrix} B_i^T \\ A_i^T \end{bmatrix} \cdot \mathcal{L}_{i+1}$
4:   $\begin{bmatrix} \Lambda_i & \\ L_i^T & \mathcal{L}_i \end{bmatrix} \leftarrow \left( \begin{bmatrix} R_i & \\ S_i^T & Q_i \end{bmatrix} + \begin{bmatrix} B_i^T \\ A_i^T \end{bmatrix}_{\mathcal{L}} \cdot \begin{bmatrix} B_i^T \\ A_i^T \end{bmatrix}_{\mathcal{L}}^T \right)^{1/2}$
5:   $\hat{L}_R[N-1-i, N-1-i] \leftarrow \Lambda_i$
6:   for $j \leftarrow 0, \dots, i-1$ do
7:     $\hat{L}_R[N-1-j, N-1-i] \leftarrow \Gamma_u^T[j, i] \cdot L_i^T$
8:   end for
9: end for
10: $(B_0^T)_{\mathcal{L}} \leftarrow B_0^T \cdot \mathcal{L}_1$
11: $\Lambda_0 \leftarrow \left( R_0 + (B_0^T)_{\mathcal{L}} \cdot (B_0^T)_{\mathcal{L}}^T \right)^{1/2}$
12: $\hat{L}_R[N-1, N-1] \leftarrow \Lambda_0$
9.1.3.3 Comparison of condensing and factorization algorithms for MPC
In Figure 9.3 there is a comparison of algorithms for the computation of the Cholesky factor of the condensed Hessian $H_R$ or of the reversed condensed Hessian $\hat{H}_R$. Namely, three algorithms are compared:
• Algorithm 9 + classical $O(N^3)$ condensed Hessian Cholesky factorization. This combines the Hessian condensing algorithm and the Hessian factorization algorithm performing better for small values of the ratio $N/n_x$. The overall algorithm has computational complexity $O(N^3)$ and $O(n_x^2)$.
• Algorithm 15. It is a combination of the Hessian condensing Algorithm 10 and of the structure-exploiting $O(N)$ reversed condensed Hessian Cholesky factorization. The overall algorithm has computational complexity $O(N^2)$ and $O(n_x^2)$.
• Algorithm 16. It is a combination of the Hessian condensing Algorithm 11 and of the structure-exploiting $O(N)$ reversed condensed Hessian Cholesky factorization, both performing well for large values of the ratio $N/n_x$. The overall algorithm has computational complexity $O(N^2)$ and $O(n_x^3)$.
The computational complexity of all other algorithms in Table 9.2 falls above or between the computational complexity of the considered algorithms, so they are not of much interest.
In the comparison in terms of running times, the algorithms are implemented using the matrices in panel-major format, and using the linear algebra routines in HPMPC. The results are in Figure 9.3.
Numerical tests confirm that the combination of the Hessian condensing Algorithm 9 and of the classical $O(N^3)$ Cholesky factorization performs well for small values of the ratio $N/n_x$. Furthermore, this combination of algorithms can exploit a diagonal cost function to decrease the computational complexity. On the contrary, the remaining two algorithms cannot exploit a diagonal cost function. Algorithm 16 performs well for large values of the ratio $N/n_x$. Algorithm 15 has a good asymptotic complexity in terms of both $N$ and $n_x$, and it is the second best in almost all tests.
Figure 9.3: Comparison of Hessian condensing and factorization algorithms for MPC: Algorithm 9 + $O(N^3)$ Cholesky factorization, with cost $O(N^3)$ and $O(n_x^2)$ (blue), Algorithm 15 with cost $O(N^2)$ and $O(n_x^2)$ (red), Algorithm 16 with cost $O(N^2)$ and $O(n_x^3)$ (green). The time to compute $\Gamma_u^T$ is included in the total solution time. Panels (running time in seconds, logarithmic axes): (a) N varying, dense cost function, nx = 16, nu = 8; (b) N varying, diagonal cost function, nx = 16, nu = 8; (c) nx varying, dense cost function, N = 30, nu = 2; (d) nx varying, diagonal cost function, N = 30, nu = 2.
9.1.4 Solution algorithms for MPC
If the aim of the condensing procedure is the solution of the KKT system of the MPC problem, a solution procedure must be employed after the completion of the KKT matrix factorization procedure.
The classical way to do so is to use a triangular Cholesky factor of the condensed Hessian, e.g. the lower triangular factor $L_R$, to perform a forward-backward substitution with $g_r$ as right-hand side. The lower Cholesky factor can be computed using one of the presented methods.
Alternatively, the structure still present in the Cholesky factor can be exploited to reduce the computational cost, as proposed in the paper [34].
9.1.4.1 $O(N^2)$ solution
Once the lower Cholesky factor $L_R$ of the condensed Hessian matrix $H_R$ and the condensed gradient $g_r$ have been computed, it is possible to compute the minimizer of the cost function (9.6) by setting its gradient to zero, as
\[ H_R u + g_r = 0 \quad \Rightarrow \quad u = -L_R^{-T} L_R^{-1} g_r \]
The forward and backward substitutions can be performed using the dense linear algebra routine trsv in BLAS, at a total cost of $2N^2 n_u^2$ flops, which is quadratic in both $N$ and $n_u$ and independent of $n_x$.
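The forward-backward substitution can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper names `cholesky_lower` and `solve_condensed` are hypothetical, matrices are stored as lists of rows, and an optimized implementation would rely on the BLAS and LAPACK routines mentioned in the text.

```python
import math


def cholesky_lower(H):
    """Return the lower triangular L with L * L^T = H (H symmetric positive definite)."""
    n = len(H)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = H[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = H[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L


def solve_condensed(L, g):
    """Solve H u + g = 0 given the lower Cholesky factor L of H."""
    n = len(L)
    # Forward substitution: L y = -g
    y = [0.0] * n
    for i in range(n):
        y[i] = (-g[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    # Backward substitution: L^T u = y
    u = [0.0] * n
    for i in reversed(range(n)):
        u[i] = (y[i] - sum(L[k][i] * u[k] for k in range(i + 1, n))) / L[i][i]
    return u


if __name__ == "__main__":
    H = [[4.0, 2.0], [2.0, 3.0]]
    g = [2.0, 1.0]
    print(solve_condensed(cholesky_lower(H), g))
```

Each substitution visits every entry of the triangular factor once, which is where the quadratic dependence on the size $N n_u$ of the condensed system comes from.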
9.1.4.2 $O(N)$ solution
If the $O(N)$ factorization algorithm is employed to compute the lower triangular Cholesky factor $\hat{L}_R$ of the permuted Hessian matrix $\hat{H}_R$ as in Algorithms 15 or 16, it is possible to exploit the structure still present in $\hat{L}_R$ to reduce the computational complexity in $N$ of the solution algorithm.
Moreover, as shown in the paper [34], it is not even necessary to explicitly build the lower Cholesky factor $\hat{L}_R$, which reduces the factorization cost by $N^2 n_x n_u^2$ flops (and e.g. makes Algorithm 16 linear in $N$, with a complexity identical to the backward Riccati recursion). In fact, only the matrices $\Lambda_i$ and $L_i$ are employed in the solution algorithm.
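The lower Cholesky factor $\hat{L}_R$ in (9.19) has the structure
\[ \hat{L}_R = \begin{bmatrix} \Lambda_2 & & \\ B_1^T L_2^T & \Lambda_1 & \\ B_0^T A_1^T L_2^T & B_0^T L_1^T & \Lambda_0 \end{bmatrix} = \hat{\Lambda} + \hat{\Gamma}_u^T \hat{L}^T = \hat{\Lambda} + \hat{B}^T \hat{A}^{-T} \hat{L}^T \]
where
\[ \hat{\Lambda} = \begin{bmatrix} \Lambda_2 & & \\ & \Lambda_1 & \\ & & \Lambda_0 \end{bmatrix}, \quad \hat{L}^T = \begin{bmatrix} & & \\ L_2^T & & \\ & L_1^T & \\ & & \end{bmatrix}, \]
\[ \hat{B}^T = \begin{bmatrix} B_2^T & & & \\ & B_1^T & & \\ & & B_0^T & \end{bmatrix}, \quad \hat{A}^{-T} = \begin{bmatrix} I & & & \\ -A_2^T & I & & \\ & -A_1^T & I & \\ & & -A_0^T & I \end{bmatrix}^{-1} . \]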
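By defining the vector $\hat{y} = \hat{L}_R^T \hat{u}$, the forward substitution is in the form
\[ \hat{L}_R \hat{y} = \left( \hat{\Lambda} + \hat{B}^T \hat{A}^{-T} \hat{L}^T \right) \hat{y} = -\hat{g}_r \]
that gives $\hat{y}$ with the recursion
\[ \hat{y} = -\hat{\Lambda}^{-1} \left( \hat{g}_r + \hat{B}^T \hat{A}^{-T} \hat{L}^T \hat{y} \right) \]
that for $N = 3$ looks like
\[ \begin{bmatrix} y_2 \\ y_1 \\ y_0 \end{bmatrix} = \begin{bmatrix} -\Lambda_2^{-1} (g_2) \\ -\Lambda_1^{-1} \left( g_1 + B_1^T L_2^T y_2 \right) \\ -\Lambda_0^{-1} \left( g_0 + B_0^T A_1^T L_2^T y_2 + B_0^T L_1^T y_1 \right) \end{bmatrix} . \]
Notice that this is a backward recursion with respect to the indexes of the data matrices, due to the permutation of the condensed Hessian.
The backward substitution is in the form
\[ \hat{L}_R^T \hat{u} = \left( \hat{\Lambda}^T + \hat{L} \hat{A}^{-1} \hat{B} \right) \hat{u} = \hat{y} \]
that gives the recursion
\[ \hat{u} = \hat{\Lambda}^{-T} \left( \hat{y} - \hat{L} \hat{A}^{-1} \hat{B} \hat{u} \right) \]
that for $N = 3$ looks like
\[ \begin{bmatrix} u_2 \\ u_1 \\ u_0 \end{bmatrix} = \begin{bmatrix} \Lambda_2^{-T} \left( y_2 - L_2 B_1 u_1 - L_2 A_1 B_0 u_0 \right) \\ \Lambda_1^{-T} \left( y_1 - L_1 B_0 u_0 \right) \\ \Lambda_0^{-T} (y_0) \end{bmatrix} . \]
Notice that this is a forward recursion with respect to the indexes of the data matrices, due to the permutation of the condensed Hessian.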
The algorithm is summarized in Algorithm 17. The computational cost of the algorithm is $4N n_x^2 + 8N n_x n_u + 2N n_u^2$ flops. In addition, it enables a reduction of the cost of the factorization Algorithms 15 and 16 by $N^2 n_x n_u^2$ flops compared with the $O(N^2)$ solution algorithm.
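Algorithm 17 Computation of the solution of the condensed system, $O(N)$ algorithm
Require: $\Lambda_i$, $L_i$
1: $t_{N-1} \leftarrow 0$
2: $y_{N-1} \leftarrow -\Lambda_{N-1}^{-1} (g_{N-1})$
3: for $i \leftarrow N-2, \dots, 0$ do
4:   $t_i \leftarrow L_{i+1}^T y_{i+1} + A_{i+1}^T t_{i+1}$
5:   $y_i \leftarrow -\Lambda_i^{-1} \left( g_i + B_i^T t_i \right)$
6: end for
7: $u_0 \leftarrow \Lambda_0^{-T} (y_0)$
8: $t_0 \leftarrow B_0 u_0$
9: for $i \leftarrow 1, \dots, N-1$ do
10:   $u_i \leftarrow \Lambda_i^{-T} (y_i - L_i t_{i-1})$
11:   $t_i \leftarrow B_i u_i + A_i t_{i-1}$
12: end for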
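The recursions of Algorithm 17 can be illustrated on a scalar, time-invariant system ($n_x = n_u = 1$), where all matrices reduce to numbers. The following is a minimal sketch under simplifying assumptions that are not part of the original algorithm: $S_i = 0$, $x_0 = 0$, and hypothetical function names `factor_scalar` and `solve_scalar`. The factors $\Lambda_i$ and $L_i$ are obtained from the three factorization steps of Section 9.1.2.2 written in Riccati form.

```python
import math


def factor_scalar(a, b, q, r, N):
    """Backward recursion returning the lists Lam[i] and Lg[i], i = 0..N-1.

    q has N + 1 entries; q[0] does not affect the factors that are used.
    """
    Lam = [0.0] * N
    Lg = [0.0] * N
    P = q[N]                                # P_N = Q_N
    for i in range(N - 1, -1, -1):
        D = r + b * P * b                   # D_i = R_i + B_i^T P_{i+1} B_i
        M = b * P * a                       # M_i = B_i^T P_{i+1} A_i (S_i = 0 assumed)
        Lam[i] = math.sqrt(D)               # Lambda_i = D_i^{1/2}
        Lg[i] = M / Lam[i]                  # L_i = Lambda_i^{-1} M_i
        P = q[i] + a * P * a - Lg[i] ** 2   # P_i = Q_i + A^T P_{i+1} A - L_i^T L_i
    return Lam, Lg


def solve_scalar(a, b, Lam, Lg, g):
    """Scalar version of Algorithm 17: forward substitution, then backward substitution."""
    N = len(g)
    y = [0.0] * N
    t = [0.0] * N
    y[N - 1] = -g[N - 1] / Lam[N - 1]
    for i in range(N - 2, -1, -1):          # backward in the stage index
        t[i] = Lg[i + 1] * y[i + 1] + a * t[i + 1]
        y[i] = -(g[i] + b * t[i]) / Lam[i]
    u = [0.0] * N
    u[0] = y[0] / Lam[0]
    t[0] = b * u[0]
    for i in range(1, N):                   # forward in the stage index
        u[i] = (y[i] - Lg[i] * t[i - 1]) / Lam[i]
        t[i] = b * u[i] + a * t[i - 1]
    return u


if __name__ == "__main__":
    Lam, Lg = factor_scalar(1.0, 1.0, [1.0, 1.0, 1.0], 1.0, 2)
    print(solve_scalar(1.0, 1.0, Lam, Lg, [1.0, 0.0]))
```

In the second loop the working variable `t[i]` coincides with the predicted state $x_{i+1}$ (for $x_0 = 0$), so the update of `u[i]` has the form of a state feedback, as in the forward sweep of the Riccati-based solver.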
9.2 Condensing methods for MHE
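Assuming that $n_d = 0$ (i.e. that there are no equality constraints on the last stage), equation (7.7b) is
\[ \bar{A} \bar{x} = \bar{B} \bar{u} + \bar{b} \tag{9.20} \]
where the matrices $\bar{x}$, $\bar{u}$, $\bar{A}$, $\bar{B}$, $\bar{b}$, defined in (7.8), are (for $N = 3$)
\[ \bar{x} = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}, \quad \bar{u} = \begin{bmatrix} u_0 \\ u_1 \\ u_2 \end{bmatrix}, \]
\[ \bar{A} = \begin{bmatrix} & & & \\ -A_0 & I & & \\ & -A_1 & I & \\ & & -A_2 & I \end{bmatrix}, \quad \bar{B} = \begin{bmatrix} & & \\ B_0 & & \\ & B_1 & \\ & & B_2 \end{bmatrix}, \quad \bar{b} = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix} . \]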
Notice that also in the MHE case x0 is considered an optimization variable, and
therefore it is part of the x¯ vector, but its value is not constrained. Furthermore,
notice that in the MHE case the matrix A¯ is not invertible, due to the lack of
the identity matrix corresponding to the initial state constraint. Invertibility is
a key feature in the condensing methods for MPC, and in the MHE case it can
be recovered by using diﬀerent approaches.
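Approach 1 Invertibility can be recovered by adding an additional state space equation for a previous stage. Defining the state at stage $-1$ as $x_{-1} = 0$ and choosing $B_{-1} = I$, the state at stage 0 can be written as
\[ x_0 = A_{-1} x_{-1} + B_{-1} u_{-1} + b_{-1} = A_{-1} \cdot 0 + I x_0 + 0 \]
where the value of the matrix $A_{-1}$ is irrelevant, and the input at stage $-1$ is equal to the state at stage 0, $u_{-1} = x_0$. By disregarding the state $x_{-1}$ in the state vector, equation (9.20) can be rewritten as
\[ \bar{x} = \bar{\mathcal{A}}^{-1} \bar{\mathcal{B}} \bar{v} + \bar{\mathcal{A}}^{-1} \bar{b} \doteq \Gamma_v \bar{v} + \Gamma_b \]
where
\[ \bar{\mathcal{A}} = \begin{bmatrix} I & & & \\ -A_0 & I & & \\ & -A_1 & I & \\ & & -A_2 & I \end{bmatrix}, \quad \bar{\mathcal{B}} = \begin{bmatrix} I & & & \\ & B_0 & & \\ & & B_1 & \\ & & & B_2 \end{bmatrix}, \quad \bar{v} = \begin{bmatrix} x_0 \\ u_0 \\ u_1 \\ u_2 \end{bmatrix} \]
and the matrix $\bar{\mathcal{A}}$ is clearly invertible.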
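Approach 2 Alternatively, the matrix $\bar{A}$ can be directly split such that
\[ \bar{A} \bar{x} = \begin{bmatrix} & & & \\ -A_0 & I & & \\ & -A_1 & I & \\ & & -A_2 & I \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} I & & & \\ -A_0 & I & & \\ & -A_1 & I & \\ & & -A_2 & I \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} - \begin{bmatrix} I \\ 0 \\ 0 \\ 0 \end{bmatrix} x_0 = \bar{\mathcal{A}} \bar{x} - \hat{E}_0 x_0 . \]
By means of this definition, equation (9.20) can be rewritten as
\[ \bar{x} = \bar{\mathcal{A}}^{-1} \hat{E}_0 x_0 + \bar{\mathcal{A}}^{-1} \bar{B} \bar{u} + \bar{\mathcal{A}}^{-1} \bar{b} = \Gamma_x x_0 + \Gamma_u \bar{u} + \Gamma_b \tag{9.21} \]
\[ \phantom{\bar{x}} = \bar{\mathcal{A}}^{-1} \bar{\mathcal{B}} \bar{v} + \bar{\mathcal{A}}^{-1} \bar{b} = \Gamma_v \bar{v} + \Gamma_b \tag{9.22} \]
where
\[ \bar{\mathcal{B}} = \begin{bmatrix} \hat{E}_0 & \bar{B} \end{bmatrix} = \begin{bmatrix} I & & & \\ & B_0 & & \\ & & B_1 & \\ & & & B_2 \end{bmatrix}, \quad \bar{v} = \begin{bmatrix} x_0 \\ \bar{u} \end{bmatrix} = \begin{bmatrix} x_0 \\ u_0 \\ u_1 \\ u_2 \end{bmatrix} . \]
Equation (9.21) keeps the initial state and the input vectors separated, while equation (9.22) merges them in a single vector, and it is formally identical to (9.1).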
The former interpretation is useful when the condensed MHE problem is considered as an MHE problem with horizon length 1 and input size $N n_u$, since it makes it possible to clearly identify the different components of the condensed cost function.
The latter interpretation suggests that it is possible to adapt all condensing algorithms developed for the MPC case to the MHE case, provided that the free initial state $x_0$ is considered as an extra input variable, corresponding to stage $-1$. Therefore, if the condensing algorithms are coded with the option to have a stage-varying number of variables, it is possible to use them in the MHE case straight away. Otherwise, the analogy between equations (9.1) and (9.22) can be exploited to develop condensing algorithms tailored to the MHE case. This will be done in the following, where detailed complexity analyses are performed also for the MHE case.
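The matrix $\bar{\mathcal{A}}^{-1} = \bar{\mathcal{A}}_N^{-1}$ (where the index $N$ means that the matrix is related to an MHE problem with horizon length $N$) can be computed recursively by means of the explicit formula for the inverse of a lower triangular matrix (9.2) as
\[ \bar{\mathcal{A}}_N^{-1} = \begin{bmatrix} \bar{\mathcal{A}}_{N-1} & \\ -A_{N-1} E_{N-1} & I \end{bmatrix}^{-1} = \begin{bmatrix} \bar{\mathcal{A}}_{N-1}^{-1} & \\ A_{N-1} E_{N-1} \bar{\mathcal{A}}_{N-1}^{-1} & I \end{bmatrix} \tag{9.23} \]
where the $E_n$ matrix is defined in (9.4). Notice that the matrix $\bar{\mathcal{A}}^{-1}$ is dense (namely lower triangular, containing $O(N^2)$ non-zero elements), and for $N = 3$ it looks like
\[ \bar{\mathcal{A}}^{-1} = \begin{bmatrix} I & & & \\ -A_0 & I & & \\ & -A_1 & I & \\ & & -A_2 & I \end{bmatrix}^{-1} = \begin{bmatrix} I & & & \\ A_0 & I & & \\ A_1 A_0 & A_1 & I & \\ A_2 A_1 A_0 & A_2 A_1 & A_2 & I \end{bmatrix} . \tag{9.24} \]
Therefore the matrices $\Gamma_v$ and $\Gamma_b$ are
\[ \Gamma_v = \begin{bmatrix} I & & & \\ A_0 & B_0 & & \\ A_1 A_0 & A_1 B_0 & B_1 & \\ A_2 A_1 A_0 & A_2 A_1 B_0 & A_2 B_1 & B_2 \end{bmatrix}, \quad \Gamma_b = \begin{bmatrix} 0 \\ b_0 \\ A_1 b_0 + b_1 \\ A_2 (A_1 b_0 + b_1) + b_2 \end{bmatrix} . \tag{9.25} \]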
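Inserting (9.22) in the cost function expression
\[ \phi = \tfrac{1}{2} \left( \bar{x}^T \bar{Q} \bar{x} + \bar{x}^T \bar{S}^T \bar{u} + \bar{u}^T \bar{S} \bar{x} + \bar{u}^T \bar{R} \bar{u} \right) + \bar{q}^T \bar{x} + \bar{r}^T \bar{u} + \tfrac{1}{2} \bar{\rho} \tag{9.26} \]
where the matrices $\bar{Q}$, $\bar{S}$, $\bar{R}$, $\bar{q}$, $\bar{r}$ are analogous to the ones defined in (7.8), padded with zeros to take into account the fact that $x_0$ is considered also as an input variable,
\[ \bar{Q} = \begin{bmatrix} Q_0 & & & \\ & Q_1 & & \\ & & Q_2 & \\ & & & Q_3 \end{bmatrix}, \quad \bar{S} = \begin{bmatrix} & & & \\ S_0 & & & \\ & S_1 & & \\ & & S_2 & \end{bmatrix}, \]
\[ \bar{R} = \begin{bmatrix} 0 & & & \\ & R_0 & & \\ & & R_1 & \\ & & & R_2 \end{bmatrix}, \quad \bar{q} = \begin{bmatrix} q_0 \\ q_1 \\ q_2 \\ q_3 \end{bmatrix}, \quad \bar{r} = \begin{bmatrix} 0 \\ r_0 \\ r_1 \\ r_2 \end{bmatrix}, \]
the cost function is rewritten as
\[ \phi = \tfrac{1}{2} \bar{v}^T \mathcal{H}_R \bar{v} + \gamma_r^T \bar{v} + \rho_\rho \tag{9.27} \]
where
\[ \mathcal{H}_R = \Gamma_v^T \bar{Q} \Gamma_v + \Gamma_v^T \bar{S}^T + \bar{S} \Gamma_v + \bar{R} \]
\[ \gamma_r = \Gamma_v^T \bar{Q} \Gamma_b + \bar{S} \Gamma_b + \Gamma_v^T \bar{q} + \bar{r} \]
\[ \rho_\rho = \tfrac{1}{2} \left( \Gamma_b^T \bar{Q} \Gamma_b + 2 \bar{q}^T \Gamma_b + \bar{\rho} \right) . \]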
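On the other hand, inserting (9.21) in the cost function expression (9.26), where the original definitions in (7.8) for the matrices $\bar{Q}$, $\bar{S}$, $\bar{R}$, $\bar{q}$, $\bar{r}$ are employed, gives the equivalent formulation
\[ \begin{aligned} \phi &= \tfrac{1}{2} \begin{bmatrix} x_0^T & \bar{u}^T \end{bmatrix} \begin{bmatrix} \Gamma_x^T \bar{Q} \Gamma_x & \Gamma_x^T \bar{Q} \Gamma_u + \Gamma_x^T \bar{S}^T \\ \Gamma_u^T \bar{Q} \Gamma_x + \bar{S} \Gamma_x & \Gamma_u^T \bar{Q} \Gamma_u + \Gamma_u^T \bar{S}^T + \bar{S} \Gamma_u + \bar{R} \end{bmatrix} \begin{bmatrix} x_0 \\ \bar{u} \end{bmatrix} \\ &\quad + \begin{bmatrix} \Gamma_x^T \bar{Q} \Gamma_b + \Gamma_x^T \bar{q} \\ \Gamma_u^T \bar{Q} \Gamma_b + \bar{S} \Gamma_b + \Gamma_u^T \bar{q} + \bar{r} \end{bmatrix}^T \begin{bmatrix} x_0 \\ \bar{u} \end{bmatrix} + \tfrac{1}{2} \left( \Gamma_b^T \bar{Q} \Gamma_b + 2 \bar{q}^T \Gamma_b + \bar{\rho} \right) \\ &= \tfrac{1}{2} \begin{bmatrix} x_0^T & \bar{u}^T \end{bmatrix} \begin{bmatrix} H_Q & H_S^T \\ H_S & H_R \end{bmatrix} \begin{bmatrix} x_0 \\ \bar{u} \end{bmatrix} + \begin{bmatrix} g_q^T & g_r^T \end{bmatrix} \begin{bmatrix} x_0 \\ \bar{u} \end{bmatrix} + \rho_\rho \end{aligned} \tag{9.28} \]
where the components of the cost function associated with the states or the inputs are clearly recognizable. This gives the equations
\[ \mathcal{H}_R = \begin{bmatrix} H_Q & H_S^T \\ H_S & H_R \end{bmatrix}, \quad \gamma_r = \begin{bmatrix} g_q \\ g_r \end{bmatrix} . \]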
Notice that the expression for the Hessian matrix $\mathcal{H}_R$ is formally identical to the expression for $H_R$ (and to the equivalent matrix in the MPC case), provided that $\Gamma_v$ replaces $\Gamma_u$. Despite the fact that the initial state $x_0$ is retained as an optimization variable, the expression for the gradient $\gamma_r$ is also formally identical to the expression for $g_r$ (and to the equivalent vector in the MPC case). On the other hand, the expressions for $H_Q$, $H_S$ and $g_q$ are not used in the condensing of the MPC problem, since there the initial state is considered a datum and is not retained as an optimization variable.
In the following, the algorithms are derived using the formal analogy between $\mathcal{H}_R$ in the MHE problem and $H_R$ in the MPC problem, and matrix partitioning is employed to emphasize the $H_Q$, $H_S$, $H_R$ sub-matrices. The same applies to $\gamma_r$ and to the working matrices.
9.2.1 Condensing algorithms for MHE
9.2.1.1 $O(N^2)$ computation of $\Gamma_v$
Also in the MHE case the matrices $\Gamma_v$ and $\Gamma_b$ can be efficiently computed in time $O(N^2)$ and $O(N)$ respectively, by exploiting the structure of the matrix $\bar{\mathcal{A}}^{-1}$. Similarly to the MPC case, this can be done by means of (9.7), which in the MHE case also includes the matrix $A_0$.
The computation of $\Gamma_b$ requires about $2N n_x^2$ flops (identical to the MPC case), and the algorithm is presented in Algorithm 18.
Algorithm 18 Computation of Γb
1: Γb[0]← 0
2: for i← 0, . . . , N − 1 do
3: Γb[i+ 1]← Ai · Γb[i] + bi
4: end for
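The recursion of Algorithm 18 can be sketched in plain Python as follows. This is only an illustration: the function names are hypothetical, and matrices and vectors are stored as nested lists rather than in the panel-major format used in the implementation.

```python
def mat_vec(A, x):
    """Dense matrix-vector product, with A stored as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def compute_gamma_b(A_list, b_list):
    """Return the blocks Gamma_b[0..N], with Gamma_b[0] = 0 and
    Gamma_b[i+1] = A_i * Gamma_b[i] + b_i."""
    nx = len(b_list[0])
    gamma_b = [[0.0] * nx]
    for A_i, b_i in zip(A_list, b_list):
        prev = gamma_b[-1]
        gamma_b.append([v + w for v, w in zip(mat_vec(A_i, prev), b_i)])
    return gamma_b


if __name__ == "__main__":
    A = [[1.0, 1.0], [0.0, 1.0]]
    print(compute_gamma_b([A, A, A], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

Each stage performs one matrix-vector product of size $n_x \times n_x$, i.e. about $2 n_x^2$ flops, which gives the $2N n_x^2$ total quoted above.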
The computation of $\Gamma_v$ requires about $N^2 n_x^2 n_u + 2N n_x^3$ flops (that is $2N n_x^3$ higher than in the MPC case, and cubic in $n_x$). The algorithm is presented in Algorithm 19.
Algorithm 19 Computation of Γv
1: Γv[0, 0]← I
2: for i← 0, . . . , N − 1 do
3: Γv[i+ 1, 0 : i]← Ai · Γv[i, 0 : i]
4: Γv[i+ 1, i+ 1]← Bi
5: end for
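A block-wise sketch of Algorithm 19 in plain Python is given below. It is an illustration under assumptions: the non-zero blocks of $\Gamma_v$ are stored in a dictionary keyed by (block row, block column), which is a hypothetical storage choice, and the helper names do not come from the implementation.

```python
def mat_mul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def identity(n):
    """Identity matrix of size n."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def compute_gamma_v(A_list, B_list):
    """Return the non-zero blocks of Gamma_v as a dict {(row, col): block}."""
    nx = len(A_list[0])
    gamma_v = {(0, 0): identity(nx)}
    for i, (A_i, B_i) in enumerate(zip(A_list, B_list)):
        # Block row i+1, block columns 0..i: A_i times block row i
        for j in range(i + 1):
            gamma_v[(i + 1, j)] = mat_mul(A_i, gamma_v[(i, j)])
        # Diagonal block of block row i+1
        gamma_v[(i + 1, i + 1)] = B_i
    return gamma_v


if __name__ == "__main__":
    A = [[1.0, 1.0], [0.0, 1.0]]
    blocks = compute_gamma_v([A, A], [[[0.0], [1.0]], [[1.0], [0.0]]])
    for key in sorted(blocks):
        print(key, blocks[key])
```

The first block column involves only products of the $A_i$ matrices, which is the origin of the additional $2N n_x^3$ term with respect to the MPC case.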
9.2.1.2 $O(N^3)$ computation of $\mathcal{H}_R$
This section contains the adaptation to the MHE case of the $O(N^3)$ and $O(n_x^2)$ algorithm for the MPC case. However, in the MHE case it is not possible to avoid the $O(n_x^3)$ terms in the computational cost.
The key operation in the condensing method is the computation of the matrix $\Gamma_v^T \bar{Q} \Gamma_v$. In the MPC case, the algorithm
\[ \Gamma_v^T \bar{Q} \Gamma_v = \Gamma_v^T \cdot (\bar{Q} \cdot \Gamma_v) \]
has been employed in order to avoid $O(n_x^3)$ terms. Since it is not possible to do so in the MHE case, a better algorithm is
\[ \Gamma_v^T \bar{Q} \Gamma_v = \Gamma_v^T (\bar{L}_Q \bar{L}_Q^T) \Gamma_v = (\Gamma_v^T \cdot \bar{L}_Q) \cdot (\Gamma_v^T \cdot \bar{L}_Q)^T \]
since it preserves symmetry and reduces the computational cost. The matrix $\Gamma_v$ is pre-computed. The matrix-matrix products are computed exploiting the block-triangular structure of $\Gamma_v$.
The matrix $\bar{L}_Q$ is block-diagonal and it contains the lower Cholesky factors $L_i$ of the $Q_i$ matrices,
\[ \bar{L}_Q = \begin{bmatrix} L_0 & & & \\ & L_1 & & \\ & & L_2 & \\ & & & L_3 \end{bmatrix} \]
and it can be computed at a cost of $\tfrac{1}{3} N n_x^3$ flops using the LAPACK routine potrf. Notice that, if the matrices $Q_i$ are diagonal, this cost can be reduced to about $N n_x$ flops, which is linear in $n_x$.
The matrix $\Gamma_v^T \cdot \bar{L}_Q$ is
\[ \Gamma_v^T \cdot \bar{L}_Q = \begin{bmatrix} L_0 & A_0^T L_1 & A_0^T A_1^T L_2 & A_0^T A_1^T A_2^T L_3 \\ & B_0^T L_1 & B_0^T A_1^T L_2 & B_0^T A_1^T A_2^T L_3 \\ & & B_1^T L_2 & B_1^T A_2^T L_3 \\ & & & B_2^T L_3 \end{bmatrix} \]
and it can be computed one block-column at a time, at a cost of about $\tfrac{1}{2} N^2 n_x^2 n_u + N n_x^3$ flops, using the BLAS routine trmm. Notice that, if the matrices $Q_i$ are diagonal, this cost can be reduced to about $N^2 n_x n_u + 2N n_x^2$ flops, which is quadratic in $n_x$.
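Once the matrix $\Gamma_v^T \cdot \bar{L}_Q$ has been computed, the lower triangular part of the product $(\Gamma_v^T \cdot \bar{L}_Q) \cdot (\Gamma_v^T \cdot \bar{L}_Q)^T$ is
\[ (\Gamma_v^T \cdot \bar{L}_Q) \cdot (\Gamma_v^T \cdot \bar{L}_Q)^T = \begin{bmatrix} T_{-1,-1} & * & * & * \\ T_{0,-1} & T_{0,0} & * & * \\ T_{1,-1} & T_{1,0} & T_{1,1} & * \\ T_{2,-1} & T_{2,0} & T_{2,1} & T_{2,2} \end{bmatrix} \]
where
\[ \begin{aligned} T_{-1,-1} &= Q_0 + A_0^T Q_1 A_0 + A_0^T A_1^T Q_2 A_1 A_0 + A_0^T A_1^T A_2^T Q_3 A_2 A_1 A_0 \\ T_{0,-1} &= B_0^T Q_1 A_0 + B_0^T A_1^T Q_2 A_1 A_0 + B_0^T A_1^T A_2^T Q_3 A_2 A_1 A_0 \\ T_{0,0} &= B_0^T Q_1 B_0 + B_0^T A_1^T Q_2 A_1 B_0 + B_0^T A_1^T A_2^T Q_3 A_2 A_1 B_0 \\ T_{1,-1} &= B_1^T Q_2 A_1 A_0 + B_1^T A_2^T Q_3 A_2 A_1 A_0 \\ T_{1,0} &= B_1^T Q_2 A_1 B_0 + B_1^T A_2^T Q_3 A_2 A_1 B_0 \\ T_{1,1} &= B_1^T Q_2 B_1 + B_1^T A_2^T Q_3 A_2 B_1 \\ T_{2,-1} &= B_2^T Q_3 A_2 A_1 A_0 \\ T_{2,0} &= B_2^T Q_3 A_2 A_1 B_0 \\ T_{2,1} &= B_2^T Q_3 A_2 B_1 \\ T_{2,2} &= B_2^T Q_3 B_2 \end{aligned} \]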
This operation can be performed using the syrk BLAS routine, at a cost of $\tfrac{1}{3} N^3 n_x n_u^2 + N^2 n_x^2 n_u + N n_x^3$ flops, that is $N^2 n_x^2 n_u + N n_x^3$ higher than in the MPC case.
The $\mathcal{H}_R$ matrix is initialized with the $\bar{R}$ matrix. The strictly block-lower-triangular part of the $\mathcal{H}_R$ matrix is initialized with the matrix $\bar{S} \cdot \Gamma_v = (\Gamma_v^T \cdot \bar{S}^T)^T$, which is computed using the gemm routine as
\[ \bar{S} \cdot \Gamma_v = \begin{bmatrix} & & & \\ S_0 & & & \\ S_1 A_0 & S_1 B_0 & & \\ S_2 A_1 A_0 & S_2 A_1 B_0 & S_2 B_1 & \end{bmatrix} \]
at the cost of about $N^2 n_x n_u^2 + 2N n_x^2 n_u$ flops.
The $O(N^3)$ condensing algorithm for the MHE case is summarized in Algorithm 20. Besides the cost to compute $\Gamma_v^T$ (equal to about $N^2 n_x^2 n_u + 2N n_x^3$ flops), the cost of the algorithm is about
\[ \tfrac{1}{3} N^3 n_x n_u^2 + \tfrac{3}{2} N^2 n_x^2 n_u + \tfrac{7}{3} N n_x^3 + N^2 n_x n_u^2 + 2N n_x^2 n_u \approx \tfrac{1}{3} N^3 n_x n_u^2 + \tfrac{3}{2} N^2 n_x^2 n_u + \tfrac{7}{3} N n_x^3 \]
flops if the matrices $Q_i$ and $S_i$ are dense, and about
\[ \tfrac{1}{3} N^3 n_x n_u^2 + N^2 n_x^2 n_u + N n_x^3 \]
flops if the matrices $Q_i$ are diagonal and the matrices $S_i$ are zero.
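9.2.1.3 $O(N^2)$ computation of $\mathcal{H}_R$ - (1)
Similarly to the MPC case, by means of the recursive expression for the $\bar{\mathcal{A}}^{-1} = \bar{\mathcal{A}}_N^{-1}$ matrix in (9.23), it is possible to write the analogous recursive expression for the $\Gamma_v = \Gamma_{v,N}$ matrix
\[ \begin{aligned} \Gamma_{v,N} = \bar{\mathcal{A}}_N^{-1} \bar{\mathcal{B}}_N &= \begin{bmatrix} \bar{\mathcal{A}}_{N-1}^{-1} & \\ A_{N-1} E_{N-1} \bar{\mathcal{A}}_{N-1}^{-1} & I \end{bmatrix} \begin{bmatrix} \bar{\mathcal{B}}_{N-1} & \\ & B_{N-1} \end{bmatrix} \\ &= \begin{bmatrix} \bar{\mathcal{A}}_{N-1}^{-1} \bar{\mathcal{B}}_{N-1} & \\ A_{N-1} E_{N-1} \bar{\mathcal{A}}_{N-1}^{-1} \bar{\mathcal{B}}_{N-1} & B_{N-1} \end{bmatrix} = \begin{bmatrix} \Gamma_{v,N-1} & \\ A_{N-1} E_{N-1} \Gamma_{v,N-1} & B_{N-1} \end{bmatrix} \end{aligned} \tag{9.29} \]
where $E_n$ is defined in (9.4) and the expression $E_n \Gamma_{v,n}$ is the last block-row of the matrix $\Gamma_{v,n}$, and similarly the expression $\Gamma_{v,n}^T E_n^T$ is the last block-column of the matrix $\Gamma_{v,n}^T$.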
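Algorithm 20 Computation of $\mathcal{H}_R$, $O(N^3)$ algorithm
Require: $\Gamma_v^T$
1: for $i \leftarrow 0, \dots, N$ do
2:   $L_i \leftarrow Q_i^{1/2}$
3: end for
4: for $i \leftarrow 0, \dots, N$ do
5:   $(\Gamma_v^T \bar{Q})[0:i, i] \leftarrow \Gamma_v^T[0:i, i] \cdot Q_i$
6: end for
7: $\mathcal{H}_R[0, 0] \leftarrow 0$
8: for $i \leftarrow 0, \dots, N-1$ do
9:   $\mathcal{H}_R[i+1, i+1] \leftarrow R_i$
10: end for
11: for $i \leftarrow 0, \dots, N-1$ do
12:   $\mathcal{H}_R[i+1, 0:i] \leftarrow (\Gamma_v^T[0:i, i] \cdot S_i^T)^T$
13: end for
14: for $i \leftarrow 0, \dots, N$ do
15:   $\mathcal{H}_R[0:i, 0:i] \leftarrow \mathcal{H}_R[0:i, 0:i] + (\Gamma_v^T \bar{Q})[0:i, i] \cdot (\Gamma_v^T[0:i, i])^T$
16: end for
By means of (9.10), it is possible to investigate the inner structure of the expression $\Gamma_v^T \bar{Q} \Gamma_v = \Gamma_{v,N}^T \bar{Q}_N \Gamma_{v,N}$ as
\[ \begin{aligned} (\Gamma_{v,N}^T \bar{Q}_N) \Gamma_{v,N} &= \begin{bmatrix} \Gamma_{v,N-1}^T & \Gamma_{v,N-1}^T E_{N-1}^T A_{N-1}^T \\ & B_{N-1}^T \end{bmatrix} \begin{bmatrix} \bar{Q}_{N-1} & \\ & Q_N \end{bmatrix} \begin{bmatrix} \Gamma_{v,N-1} & \\ A_{N-1} E_{N-1} \Gamma_{v,N-1} & B_{N-1} \end{bmatrix} \\ &= \begin{bmatrix} \Gamma_{v,N-1}^T \bar{Q}_{N-1} & \Gamma_{v,N-1}^T E_{N-1}^T A_{N-1}^T Q_N \\ & B_{N-1}^T Q_N \end{bmatrix} \begin{bmatrix} \Gamma_{v,N-1} & \\ A_{N-1} E_{N-1} \Gamma_{v,N-1} & B_{N-1} \end{bmatrix} \\ &= \begin{bmatrix} \Gamma_{v,N-1}^T \bar{Q}_{N-1} \Gamma_{v,N-1} + \Gamma_{v,N-1}^T E_{N-1}^T A_{N-1}^T Q_N A_{N-1} E_{N-1} \Gamma_{v,N-1} & * \\ (B_{N-1}^T Q_N A_{N-1}) E_{N-1} \Gamma_{v,N-1} & B_{N-1}^T Q_N B_{N-1} \end{bmatrix} \end{aligned} \]
Defining $\tilde{D}_{N-1} = B_{N-1}^T Q_N B_{N-1}$ and $\tilde{M}_{N-1} = B_{N-1}^T Q_N A_{N-1}$, the last block-row of the matrix $\Gamma_{v,N}^T \bar{Q}_N \Gamma_{v,N}$ can be computed at the cost of $2N n_x n_u^2 + 2 n_x n_u^2$ flops as (using $N = 3$ to make notation easier)
\[ \begin{bmatrix} \tilde{M}_2 A_1 A_0 & \tilde{M}_2 A_1 B_0 & \tilde{M}_2 B_1 & \tilde{D}_2 \end{bmatrix} . \tag{9.30} \]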
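The top-left block of $\Gamma_{v,N}^T \bar{Q}_N \Gamma_{v,N}$ has the structure
\[ \Gamma_{v,N-1}^T \bar{Q}_{N-1} \Gamma_{v,N-1} + \Gamma_{v,N-1}^T E_{N-1}^T A_{N-1}^T Q_N A_{N-1} E_{N-1} \Gamma_{v,N-1} = \left( \Gamma_{v,N-1}^T \bar{Q}_{N-1} + \Gamma_{v,N-1}^T E_{N-1}^T A_{N-1}^T Q_N A_{N-1} E_{N-1} \right) \Gamma_{v,N-1} \tag{9.31} \]
which has the same structure as the matrix $(\Gamma_{v,N}^T \bar{Q}_N) \Gamma_{v,N}$. The proof is analogous to the MPC case. In the MHE case, for $N = 3$, the matrix $\Gamma_{v,N-1}^T \bar{Q}_{N-1} + \Gamma_{v,N-1}^T E_{N-1}^T A_{N-1}^T Q_N A_{N-1} E_{N-1}$ is
\[ \Gamma_{v,N-1}^T \bar{Q}_{N-1} + \Gamma_{v,N-1}^T E_{N-1}^T A_{N-1}^T Q_N A_{N-1} E_{N-1} = \begin{bmatrix} Q_0 & A_0^T Q_1 & A_0^T A_1^T Q_2 + A_0^T A_1^T A_2^T Q_3 A_2 \\ & B_0^T Q_1 & B_0^T A_1^T Q_2 + B_0^T A_1^T A_2^T Q_3 A_2 \\ & & B_1^T Q_2 + B_1^T A_2^T Q_3 A_2 \end{bmatrix} \]
where again it can be seen that only the last block-column has been updated.
Also in the MHE case the computation of the matrix $\bar{S} \Gamma_v$ can be embedded in the computation of the matrix $\Gamma_v^T \bar{Q} \Gamma_v$ at no extra cost. In fact, the last block-row of the matrix $\bar{S} \Gamma_v + \Gamma_v^T \bar{Q} \Gamma_v$ can be computed easily by using
\[ M_{N-1} = S_{N-1} + \tilde{M}_{N-1} = S_{N-1} + B_{N-1}^T Q_N A_{N-1} \tag{9.32} \]
in place of $\tilde{M}_{N-1}$ in (9.30). Similarly, the last block-diagonal element of the matrix $\bar{R} + \Gamma_v^T \bar{Q} \Gamma_v$ can be computed easily by using
\[ D_{N-1} = R_{N-1} + \tilde{D}_{N-1} = R_{N-1} + B_{N-1}^T Q_N B_{N-1} \tag{9.33} \]
in place of $\tilde{D}_{N-1}$ in (9.30).
At the end of the algorithm, the (lower triangular part of the) condensed Hessian matrix $\mathcal{H}_R$ shows the structure
\[ \mathcal{H}_R = \begin{bmatrix} P_0 & * & * & * \\ M_0 & D_0 & * & * \\ M_1 A_0 & M_1 B_0 & D_1 & * \\ M_2 A_1 A_0 & M_2 A_1 B_0 & M_2 B_1 & D_2 \end{bmatrix} \]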
where
\[ \begin{aligned} P_0 &= Q_0 + A_0^T Q_1 A_0 + A_0^T A_1^T Q_2 A_1 A_0 + A_0^T A_1^T A_2^T Q_3 A_2 A_1 A_0 \\ M_0 &= S_0 + B_0^T Q_1 A_0 + B_0^T A_1^T Q_2 A_1 A_0 + B_0^T A_1^T A_2^T Q_3 A_2 A_1 A_0 \\ D_0 &= R_0 + B_0^T Q_1 B_0 + B_0^T A_1^T Q_2 A_1 B_0 + B_0^T A_1^T A_2^T Q_3 A_2 A_1 B_0 \\ D_1 &= R_1 + B_1^T Q_2 B_1 + B_1^T A_2^T Q_3 A_2 B_1 \\ M_1 &= S_1 + B_1^T Q_2 A_1 + B_1^T A_2^T Q_3 A_2 A_1 \\ D_2 &= R_2 + B_2^T Q_3 B_2 \\ M_2 &= S_2 + B_2^T Q_3 A_2 \end{aligned} \]
This $O(N^2)$ condensing algorithm is summarized in Algorithm 21. Besides the cost to compute $\Gamma_v^T$ (equal to about $N^2 n_x^2 n_u + 2N n_x^3$ flops), the cost of the algorithm is about
\[ 2N^2 n_x^2 n_u + N^2 n_x n_u^2 + 4N n_x^3 \]
flops if the matrices $Q_i$ are dense, and about
\[ N^2 n_x^2 n_u + N^2 n_x n_u^2 + 2N n_x^3 \]
flops if the matrices $Q_i$ are diagonal.
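Algorithm 21 Computation of $\mathcal{H}_R$, $O(N^2)$ algorithm - (1)
Require: $\Gamma_v^T$
1: $\Gamma_w^T[0:N, N] \leftarrow \Gamma_v^T[0:N, N] \cdot Q_N$
2: for $i \leftarrow N-1, \dots, 0$ do
3:   $\Gamma_w^T[0:i+1, i] \leftarrow \Gamma_w^T[0:i+1, i+1] \cdot A_i$
4:   $D_i \leftarrow R_i + B_i^T \cdot (\Gamma_w^T[i+1, i+1])^T$
5:   $\mathcal{H}_R[i+1, i+1] \leftarrow D_i$
6:   $M_i \leftarrow S_i + \Gamma_w^T[i+1, i]$
7:   $\mathcal{H}_R[i+1, 0:i] \leftarrow (\Gamma_v^T[0:i, i] \cdot M_i^T)^T$
8:   $\Gamma_w^T[0:i, i] \leftarrow \Gamma_w^T[0:i, i] + \Gamma_v^T[0:i, i] \cdot Q_i$
9: end for
10: $P_0 \leftarrow 0 + I^T \cdot (\Gamma_w^T[0, 0])^T$
11: $\mathcal{H}_R[0, 0] \leftarrow P_0$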
9.2.1.4 $O(N^2)$ computation of $H_R$ - (2)
It is possible to compute the matrix $\Gamma_u^T \bar Q \Gamma_u$ by means of a different algorithm
that reduces the $O(N^2)$ terms in the computational complexity.
This algorithm is particularly convenient in the MHE case, since the $O(N n_x^3)$
terms are present also in the other condensing algorithms, while this is not so
in the MPC case.
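The update in (9.31) is performed as
\[
\Gamma_{v,N-1}^T \bar Q_{N-1} \Gamma_{v,N-1}
+ \Gamma_{v,N-1}^T E_{N-1}^T A_{N-1}^T Q_N A_{N-1} E_{N-1} \Gamma_{v,N-1}
= \Gamma_{v,N-1}^T \left( \bar Q_{N-1} + E_{N-1}^T A_{N-1}^T Q_N A_{N-1} E_{N-1} \right) \Gamma_{v,N-1}.
\]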
The matrix $E_{N-1}^T A_{N-1}^T Q_N A_{N-1} E_{N-1}$ is zero everywhere but in the bottom-right
block. Therefore the update reduces to the update of the bottom-right
element of $\bar Q_{N-1}$, that is
\[ P_{N-1} = Q_{N-1} + A_{N-1}^T P_N A_{N-1} \tag{9.34} \]
where $P_N = Q_N$. This is an operation with computational cost $O(n_x^3)$, and it
can be computed efficiently in $\tfrac{7}{3} n_x^3$ flops as
\[ P_{N-1} = Q_{N-1} + (A_{N-1}^T L_N)(A_{N-1}^T L_N)^T \]
where $L_N$ is the lower triangular Cholesky factor of $P_N$.
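As a small numerical illustration of (9.34), the following NumPy sketch checks that the Cholesky-based update returns the same $P_{N-1}$ as the direct triple product. The matrix size and the random data are assumptions made only for this example; a tuned implementation would call a triangular matrix multiply and a symmetric rank-k update rather than general products.

```python
import numpy as np

rng = np.random.default_rng(0)
nx = 4                                   # state size (arbitrary, for illustration)
A = rng.standard_normal((nx, nx))        # stands for A_{N-1}
G = rng.standard_normal((nx, nx))
P_next = G @ G.T + nx * np.eye(nx)       # symmetric positive definite P_N
Q = np.eye(nx)                           # stands for Q_{N-1}

L = np.linalg.cholesky(P_next)           # lower triangular, P_N = L L^T
AL = A.T @ L                             # A^T L (triangular multiply in practice)
P_chol = Q + AL @ AL.T                   # symmetric rank-nx update
P_direct = Q + A.T @ P_next @ A          # direct evaluation of (9.34)

print(np.allclose(P_chol, P_direct))
```

Written this way the result is symmetric by construction, which is convenient when $P_{N-1}$ is factorized again at the next stage.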
Summing up over the $N$ stages gives a computational complexity of $\tfrac{7}{3} N n_x^3$, that
replaces the terms $2 N^2 n_x^2 n_u + 4 N n_x^3$ in the computational cost of Algorithm
21. Besides the cost to compute $\Gamma_v^T$ (equal to about $N^2 n_x^2 n_u + 2 N n_x^3$ flops), the
cost of the algorithm is of about
\[ N^2 n_x n_u^2 + \tfrac{7}{3} N n_x^3 + 5 N n_x^2 n_u + N n_x n_u^2 \approx N^2 n_x n_u^2 + \tfrac{7}{3} N n_x^3 \]
flops, where the leading terms are unchanged with respect to the MPC case. The
algorithm cannot exploit the fact that the matrices $Q_i$ are diagonal to reduce
the computational complexity. The algorithm is summarized in Algorithm 22.
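The recursion can be sketched in NumPy as follows. This is an illustrative sketch in the spirit of Algorithm 22, not the HPMPC implementation: the function names, the ordering $v = (x_0, u_0, \dots, u_{N-1})$ of the condensed variables and the random test data are assumptions, and the sensitivities (the block rows of $\Gamma_v$) are rebuilt inside the sketch only to keep it self-contained. The result is compared with a dense evaluation of the condensed Hessian.

```python
import numpy as np

def sensitivities(A, B):
    """G[i] = d x_i / d v with v = [x_0, u_0, ..., u_{N-1}] (block rows of Gamma_v)."""
    N, nx, nu = len(A), A[0].shape[0], B[0].shape[1]
    G = [np.zeros((nx, nx + N * nu)) for _ in range(N + 1)]
    G[0][:, :nx] = np.eye(nx)
    for i in range(N):
        G[i + 1] = A[i] @ G[i]
        G[i + 1][:, nx + i * nu: nx + (i + 1) * nu] = B[i]
    return G

def condense_riccati_like(A, B, Q, R, S):
    """Lower triangle of H_R from the backward recursion on P_i."""
    N, nx, nu = len(A), A[0].shape[0], B[0].shape[1]
    G = sensitivities(A, B)
    H = np.zeros((nx + N * nu, nx + N * nu))
    P = Q[N]
    for i in range(N - 1, -1, -1):
        L = np.linalg.cholesky(P)                      # L_{i+1}
        BAL = np.vstack([B[i].T, A[i].T]) @ L          # [B_i^T; A_i^T] L_{i+1}
        W = np.block([[R[i], S[i]], [S[i].T, Q[i]]]) + BAL @ BAL.T
        D, M, P = W[:nu, :nu], W[:nu, nu:], W[nu:, nu:]   # D_i, M_i, P_i
        r = nx + i * nu                                # first row of the u_i block
        H[r:r + nu, r:r + nu] = D
        H[r:r + nu, :r] = M @ G[i][:, :r]              # M_i times sensitivities of x_i
    H[:nx, :nx] = P                                    # P_0
    return np.tril(H)

def condense_dense(A, B, Q, R, S):
    """Reference: accumulate the Hessian of the cost in v directly."""
    N, nx, nu = len(A), A[0].shape[0], B[0].shape[1]
    G = sensitivities(A, B)
    H = sum(G[i].T @ Q[i] @ G[i] for i in range(N + 1))
    for i in range(N):
        c = slice(nx + i * nu, nx + (i + 1) * nu)
        H[c, c] += R[i]
        H[c, :] += S[i] @ G[i]
        H[:, c] += G[i].T @ S[i].T
    return H

rng = np.random.default_rng(1)
N, nx, nu = 4, 3, 2

def spd(n):
    X = rng.standard_normal((n, n))
    return X @ X.T + n * np.eye(n)

A = [rng.standard_normal((nx, nx)) for _ in range(N)]
B = [rng.standard_normal((nx, nu)) for _ in range(N)]
Q = [spd(nx) for _ in range(N + 1)]
R = [spd(nu) for _ in range(N)]
S = [rng.standard_normal((nu, nx)) for _ in range(N)]

H_fast = condense_riccati_like(A, B, Q, R, S)
H_ref = condense_dense(A, B, Q, R, S)
print(np.allclose(H_fast, np.tril(H_ref)))
```

The only $O(N^2)$ work left in the sketch is the product of $M_i$ with the sensitivities of $x_i$, which matches the leading term $N^2 n_x n_u^2$ quoted above.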
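9.2.1.5 $O(N^2)$ computation of $\gamma_r$
In the computation of the right-hand side $\gamma_r$, the key operation of the algorithm
is the computation of
\[ \Gamma_v^T \cdot (\bar Q \Gamma_b + \bar q). \tag{9.35} \]
If the matrix $\Gamma_v$ (assumed to be precomputed) is used, this operation can be
done in $N^2 n_x n_u + 2 N n_x^2$ flops. The algorithm for the computation of the
right-hand side $\gamma_r$ is summarized in Algorithm 23. Besides the cost to compute $\Gamma_b$,
the cost of the algorithm is of about
\[ N^2 n_x n_u + 4 N n_x^2 + 2 N n_x n_u \approx N^2 n_x n_u + 4 N n_x^2 \]
flops if the matrices $Q_i$ are dense, and of about
\[ N^2 n_x n_u + 2 N n_x^2 \]
flops if the matrices $Q_i$ are diagonal.
Algorithm 22 Computation of $H_R$, $O(N^2)$ algorithm - (2)
Require: $\Gamma_v^T$
$P_N \leftarrow Q_N$
for $i \leftarrow N-1, \dots, 0$ do
    $L_{i+1} \leftarrow P_{i+1}^{1/2}$
    $\begin{bmatrix} B_i^T \\ A_i^T \end{bmatrix}_L \leftarrow \begin{bmatrix} B_i^T \\ A_i^T \end{bmatrix} \cdot L_{i+1}$
    $\begin{bmatrix} D_i & * \\ M_i^T & P_i \end{bmatrix} \leftarrow \begin{bmatrix} R_i & * \\ S_i^T & Q_i \end{bmatrix} + \left( \begin{bmatrix} B_i^T \\ A_i^T \end{bmatrix}_L \right) \cdot \left( \begin{bmatrix} B_i^T \\ A_i^T \end{bmatrix}_L \right)^T$
    $H_R[i+1, i+1] \leftarrow D_i$
    $H_R[i+1, 0:i] \leftarrow (\Gamma_v^T[0:i, i] \cdot M_i^T)^T$
end for
$H_R[0, 0] \leftarrow P_0$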
Algorithm 23 Computation of $\gamma_r$, $O(N^2)$ algorithm
Require: $\Gamma_v^T$, $\Gamma_b$
for $i \leftarrow 0, \dots, N$ do
    $\gamma_r[i] \leftarrow r_i + S_i \cdot \Gamma_b[i]$
end for
for $i \leftarrow 0, \dots, N$ do
    $(\bar Q \Gamma_b + \bar q)[i] \leftarrow q_i + Q_i \cdot \Gamma_b[i]$
end for
for $i \leftarrow 0, \dots, N$ do
    $\gamma_r[0:i] \leftarrow \gamma_r[0:i] + \Gamma_v^T[0:i, i] \cdot (\bar Q \Gamma_b + \bar q)[i]$
end for
9.2.1.6 $O(N)$ computation of $\gamma_r$ - (1)
If the structure of the matrix $\Gamma_v = \bar A^{-1} \bar B$ is exploited, it is possible to compute
(9.35) as
\[ \Gamma_v^T \cdot (\bar Q \Gamma_b + \bar q) = \bar B^T \cdot (\bar A^{-T} \cdot (\bar Q \Gamma_b + \bar q)) \]
trading off an increase of the computational complexity in $n_x$ with a reduction
in $N$. In fact, the operation $\bar A^{-T} \cdot (\bar Q \Gamma_b + \bar q)$ can be computed in $2 N n_x^2$ flops
using (9.8). The multiplication by $\bar B^T$ can then be computed in $2 N n_x n_u$ flops
exploiting the fact that it is block-diagonal. The algorithm for the computation
of the right-hand side $\gamma_r$ is summarized in Algorithm 24. Besides the cost to
compute $\Gamma_b$, the cost of the algorithm is of about
\[ 4 N n_x^2 + 4 N n_x n_u \]
flops if the matrices $Q_i$ are dense, and of about
\[ 2 N n_x^2 + 2 N n_x n_u \]
flops if the matrices $Q_i$ are diagonal.
Algorithm 24 Computation of $\gamma_r$, $O(N)$ algorithm - (1)
Require: $\Gamma_b$
for $i \leftarrow 0, \dots, N$ do
    $\gamma_r[i] \leftarrow r_i + S_i \cdot \Gamma_b[i]$
end for
for $i \leftarrow 0, \dots, N$ do
    $(\bar Q \Gamma_b + \bar q)[i] \leftarrow q_i + Q_i \cdot \Gamma_b[i]$
end for
$t[N] \leftarrow (\bar Q \Gamma_b + \bar q)[N]$
for $i \leftarrow N-1, \dots, 0$ do
    $t[i] \leftarrow (\bar Q \Gamma_b + \bar q)[i] + A_i^T \cdot t[i+1]$
end for
for $i \leftarrow 0, \dots, N$ do
    $\gamma_r[i] \leftarrow \gamma_r[i] + B_i^T \cdot t[i]$
end for
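The backward (adjoint-like) recursion behind Algorithm 24 can be sketched in NumPy as below. The sketch makes its own indexing assumptions, which are not taken from the listing: the condensed variables are ordered as $v = (x_0, u_0, \dots, u_{N-1})$, the dynamics are $x_{i+1} = A_i x_i + B_i u_i + b_i$, and the function names and random data are hypothetical. The result is checked against the $O(N^2)$ computation that builds $\Gamma_v$ explicitly.

```python
import numpy as np

def gradient_adjoint(A, B, b, Q, S, q, r):
    """Condensed gradient w.r.t. v = [x_0, u_0, ..., u_{N-1}] in O(N) operations."""
    N, nx = len(A), A[0].shape[0]
    gb = [np.zeros(nx)]                               # Gamma_b: response to b only
    for i in range(N):
        gb.append(A[i] @ gb[i] + b[i])
    w = [Q[i] @ gb[i] + q[i] for i in range(N + 1)]   # (Qbar Gamma_b + qbar)[i]
    t = w[N]                                          # adjoint at stage N
    g_u = [None] * N
    for i in range(N - 1, -1, -1):
        g_u[i] = r[i] + S[i] @ gb[i] + B[i].T @ t     # here t is the adjoint at stage i+1
        t = w[i] + A[i].T @ t                         # adjoint at stage i
    return np.concatenate([t] + g_u)                  # x_0 block is the adjoint at stage 0

def gradient_dense(A, B, b, Q, S, q, r):
    """Reference: form Gamma_v explicitly and multiply by its transpose."""
    N, nx, nu = len(A), A[0].shape[0], B[0].shape[1]
    G = [np.zeros((nx, nx + N * nu)) for _ in range(N + 1)]
    G[0][:, :nx] = np.eye(nx)
    gb = [np.zeros(nx)]
    for i in range(N):
        G[i + 1] = A[i] @ G[i]
        G[i + 1][:, nx + i * nu: nx + (i + 1) * nu] = B[i]
        gb.append(A[i] @ gb[i] + b[i])
    g = sum(G[i].T @ (Q[i] @ gb[i] + q[i]) for i in range(N + 1))
    for i in range(N):
        g[nx + i * nu: nx + (i + 1) * nu] += r[i] + S[i] @ gb[i]
    return g

rng = np.random.default_rng(2)
N, nx, nu = 5, 3, 2
A = [rng.standard_normal((nx, nx)) for _ in range(N)]
B = [rng.standard_normal((nx, nu)) for _ in range(N)]
b = [rng.standard_normal(nx) for _ in range(N)]
Q = [np.eye(nx) * (i + 1) for i in range(N + 1)]
S = [rng.standard_normal((nu, nx)) for _ in range(N)]
q = [rng.standard_normal(nx) for _ in range(N + 1)]
r = [rng.standard_normal(nu) for _ in range(N)]

g_fast = gradient_adjoint(A, B, b, Q, S, q, r)
g_ref = gradient_dense(A, B, b, Q, S, q, r)
print(np.allclose(g_fast, g_ref))
```

The adjoint version only performs matrix-vector products with $A_i^T$, $B_i^T$, $Q_i$ and $S_i$, so its cost grows linearly with the horizon length.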
9.2.1.7 $O(N)$ computation of $\gamma_r$ - (2)
The approach of the $O(N^2)$ and $O(n_x^3)$ method used to condense the Hessian matrix
can also be applied to the condensing of the gradient vector. This algorithm employs
the $P_i$ and $M_i$ matrices computed in Algorithm 11, and therefore its use makes
sense only in connection with that Hessian condensing algorithm. Furthermore,
the two algorithms could be merged into a single one, similarly to the fact that
the backward substitution can be merged with the backward Riccati recursion
(see Algorithm 1). For $N = 3$, at the end of the algorithm the gradient vector
looks like
\[
\gamma_r = \begin{bmatrix}
p_0 \\
m_0 \\
m_1 + M_1 b_0 \\
m_2 + M_2 (A_1 b_0 + b_1)
\end{bmatrix}
\]
where
\begin{align*}
p_0 &= q_0 + A_0^T (P_1 b_0 + p_1) \\
m_0 &= r_0 + B_0^T (P_1 b_0 + p_1) \\
m_1 &= r_1 + B_1^T (P_2 b_1 + p_2) \\
m_2 &= r_2 + B_2^T (P_3 b_2 + p_3)
\end{align*}
where in turn
\begin{align*}
p_1 &= q_1 + A_1^T (P_2 b_1 + p_2) \\
p_2 &= q_2 + A_2^T (P_3 b_2 + p_3) \\
p_3 &= q_3
\end{align*}
and the matrices $P_i$ and $M_i$ are defined in the Hessian condensing algorithm.
The algorithm is presented in Algorithm 25, and it has a computational
complexity of
\[ 4 N n_x^2 + 4 N n_x n_u \]
flops, irrespective of whether the matrices $Q_i$ are dense or diagonal.
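Algorithm 25 Computation of $\gamma_r$, $O(N)$ algorithm - (2)
Require: $\Gamma_b$, $P_i$, $M_i$
$p_N \leftarrow q_N$
for $i \leftarrow N-1, \dots, 0$ do
    $t_i \leftarrow P_{i+1} \cdot b_i + p_{i+1}$
    $p_i \leftarrow q_i + A_i^T \cdot t_i$
    $m_i \leftarrow r_i + B_i^T \cdot t_i$
    $\gamma_r[i+1] \leftarrow m_i + M_i \cdot \Gamma_b[i]$
end for
$\gamma_r[0] \leftarrow p_0$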
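A minimal NumPy sketch of the recursion in Algorithm 25 is given below. For self-containedness it recomputes $P_i = Q_i + A_i^T P_{i+1} A_i$ and $M_i = S_i + B_i^T P_{i+1} A_i$ inside the loop instead of receiving them from the Hessian condensing step, which is an assumption of the sketch, as are the function names and the random data. The gradient is validated against central differences of the cost at $v = 0$, which are exact for a quadratic function up to rounding.

```python
import numpy as np

def gradient_riccati_like(A, B, b, Q, S, q, r):
    """Condensed gradient w.r.t. v = [x_0, u_0, ..., u_{N-1}] via the p_i recursion."""
    N, nx = len(A), A[0].shape[0]
    gb = [np.zeros(nx)]                        # Gamma_b
    for i in range(N):
        gb.append(A[i] @ gb[i] + b[i])
    P, p = Q[N], q[N]
    g_u = [None] * N
    for i in range(N - 1, -1, -1):
        M = S[i] + B[i].T @ P @ A[i]           # M_i, built from P_{i+1}
        t = P @ b[i] + p                       # t_i
        g_u[i] = r[i] + B[i].T @ t + M @ gb[i] # m_i + M_i Gamma_b[i]
        p = q[i] + A[i].T @ t                  # p_i
        P = Q[i] + A[i].T @ P @ A[i]           # P_i
    return np.concatenate([p] + g_u)

def cost(v, A, B, b, Q, R, S, q, r):
    """Cost as a function of v, obtained by simulating the dynamics."""
    N, nx, nu = len(A), A[0].shape[0], B[0].shape[1]
    x, val = v[:nx], 0.0
    for i in range(N):
        u = v[nx + i * nu: nx + (i + 1) * nu]
        val += 0.5 * u @ R[i] @ u + u @ S[i] @ x + 0.5 * x @ Q[i] @ x + r[i] @ u + q[i] @ x
        x = A[i] @ x + B[i] @ u + b[i]
    return val + 0.5 * x @ Q[N] @ x + q[N] @ x

rng = np.random.default_rng(5)
N, nx, nu = 3, 3, 2
A = [0.5 * rng.standard_normal((nx, nx)) for _ in range(N)]
B = [rng.standard_normal((nx, nu)) for _ in range(N)]
b = [rng.standard_normal(nx) for _ in range(N)]
Q = [np.eye(nx) for _ in range(N + 1)]
R = [np.eye(nu) for _ in range(N)]
S = [rng.standard_normal((nu, nx)) for _ in range(N)]
q = [rng.standard_normal(nx) for _ in range(N + 1)]
r = [rng.standard_normal(nu) for _ in range(N)]

g = gradient_riccati_like(A, B, b, Q, S, q, r)
n = nx + N * nu
g_fd = np.array([(cost(e, A, B, b, Q, R, S, q, r) - cost(-e, A, B, b, Q, R, S, q, r)) / 2.0
                 for e in np.eye(n)])
print(np.allclose(g, g_fd))
```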
Table 9.3: Comparison of condensing algorithms in terms of flops. In all cases,
additional $N^2 n_x^2 n_u + 2 N n_x^3$ flops are needed to compute $\Gamma_v^T$.
algorithm    | dense cost function                                                      | diagonal cost function
Algorithm 20 | $\tfrac13 N^3 n_x n_u^2 + \tfrac32 N^2 n_x^2 n_u + \tfrac73 N n_x^3$ | $\tfrac13 N^3 n_x n_u^2 + N^2 n_x^2 n_u + N n_x^3$
Algorithm 21 | $2 N^2 n_x^2 n_u + N^2 n_x n_u^2 + 4 N n_x^3$                          | $N^2 n_x^2 n_u + N^2 n_x n_u^2 + 2 N n_x^3$
Algorithm 22 | $N^2 n_x n_u^2 + \tfrac73 N n_x^3$                                      | $N^2 n_x n_u^2 + \tfrac73 N n_x^3$
9.2.1.8 Comparison of condensing algorithms for MHE
In this section the three condensing algorithms for the computation of $H_R$ are
compared, both in terms of flops and in terms of running times of a practical
implementation. The algorithms for the computation of $\gamma_r$ are not compared,
since this is generally not the key operation.
In the comparison in terms of flops, two cases are considered:
• dense $R_i$, $S_i$ and $Q_i$ matrices (dense Hessian of the cost function);
• diagonal $R_i$ and $Q_i$ matrices and zero $S_i$ matrix (diagonal Hessian of the
cost function).
The number of flops for the three condensing algorithms in these two cases is
reported in Table 9.3. In the MHE case, all algorithms for the computation of the
condensed Hessian matrix $H_R$ have a computational complexity that is cubic in
$n_x$. Furthermore, Algorithm 22 has the same computational complexity as in
the MPC case, and therefore it is likely to be the best choice in most cases. In
the case of dense Hessian of the cost function, the computational complexity of
Algorithm 22 is the lowest for all problem sizes. In the case of diagonal Hessian
of the cost function, Algorithm 20 could be the best choice if the horizon length
is short and the state vector size is large, since it has a smaller coefficient of the
$N n_x^3$ term in the computational complexity.
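The flop counts of Table 9.3 are easy to evaluate for a given problem size. The short Python sketch below encodes the three formulas and returns the algorithm with the lowest count; the function names are hypothetical, and the common cost of computing $\Gamma_v^T$ is left out because it is the same for all three algorithms.

```python
from fractions import Fraction as F

def flops(alg, N, nx, nu, dense=True):
    """Flop count of Table 9.3 for Algorithm 20, 21 or 22 (Gamma_v^T cost excluded)."""
    if alg == 20:
        if dense:
            return F(1, 3) * N**3 * nx * nu**2 + F(3, 2) * N**2 * nx**2 * nu + F(7, 3) * N * nx**3
        return F(1, 3) * N**3 * nx * nu**2 + N**2 * nx**2 * nu + N * nx**3
    if alg == 21:
        c = 2 if dense else 1
        return c * N**2 * nx**2 * nu + N**2 * nx * nu**2 + 2 * c * N * nx**3
    if alg == 22:
        return N**2 * nx * nu**2 + F(7, 3) * N * nx**3
    raise ValueError("unknown algorithm")

def cheapest(N, nx, nu, dense=True):
    """Algorithm number with the lowest flop count for the given sizes."""
    return min((20, 21, 22), key=lambda a: flops(a, N, nx, nu, dense))

print(cheapest(30, 16, 8, dense=True))     # sizes used in the N-varying plots
print(cheapest(1, 100, 2, dense=False))    # very short horizon, large state
```

For example, with $N = 30$, $n_x = 16$, $n_u = 8$ and a dense cost function the lowest count is obtained by Algorithm 22, while with $N = 1$, $n_x = 100$, $n_u = 2$ and a diagonal cost function it is obtained by Algorithm 20.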
In the comparison in terms of running times, the algorithms are implemented
using matrices in panel-major format and the linear algebra routines
in HPMPC. The results are in Figure 9.4. The numerical results confirm the
flop count analysis. In particular, in case of dense Hessian of the cost function,
Algorithm 22 is the best choice for all problem sizes. In case of diagonal Hessian
of the cost function, Algorithm 22 is the best choice for most problem sizes, while
Algorithm 20 can be slightly faster for very short $N$ or very large $n_x$.
[Figure 9.4: four log-log plots of time [s]. Panels: (a) $N$ varying, $n_x = 16$, $n_u = 8$, dense cost function; (b) $N$ varying, $n_x = 16$, $n_u = 8$, diagonal cost function; (c) $n_x$ varying, $N = 30$, $n_u = 2$, dense cost function; (d) $n_x$ varying, $N = 30$, $n_u = 2$, diagonal cost function. Curves: $N^3 n_x^2$ alg, $N^2 n_x^2$ alg, $N^2 n_x^3$ alg.]
Figure 9.4: Comparison of Hessian condensing algorithms for MHE: Algorithm 20
(corresponding to the MPC Algorithm 9 with cost $O(N^3)$ and $O(n_x^2)$) (blue),
Algorithm 21 (corresponding to the MPC Algorithm 10 with cost $O(N^2)$ and
$O(n_x^2)$) (red), Algorithm 22 (corresponding to the MPC Algorithm 11 with cost
$O(N^2)$ and $O(n_x^3)$) (green). The time to compute $\Gamma_v^T$ is included in the total
solution time.
9.2.2 Factorization algorithms for MHE
In this section two Hessian factorization algorithms are reviewed. The first
algorithm is the classical Cholesky factorization algorithm, that is commonly
employed to factorize the positive-definite condensed Hessian matrix. The second
algorithm is the adaptation to the MHE case of the structure-exploiting
Cholesky factorization of the reverse condensed Hessian matrix $\hat H_R$, that is the
permutation of the condensed Hessian $H_R$ such that the input vector is
\[
\hat u = \begin{bmatrix} u_{N-1} \\ \vdots \\ u_0 \\ x_0 \end{bmatrix} \tag{9.36}
\]
The two algorithms have very different computational complexity, and can be
combined (with some limitations) with the Hessian condensing algorithms
presented in Section 9.2.1, similarly to the MPC case.
9.2.2.1 $O(N^3)$ Cholesky factorization of $H_R$
Once the $H_R$ matrix has been built using the Hessian condensing Algorithm 20, 21 or
22, it can be factorized directly using the standard Cholesky factorization.
More precisely, besides the cost to build the condensed Hessian $H_R$ using
Algorithm 20, 21 or 22, in the MHE case the cost of the algorithm is of about
\[ \tfrac13 N^3 n_u^3 + N^2 n_x n_u^2 + N n_x^2 n_u + \tfrac13 n_x^3 \]
flops, and therefore cubic in $(N n_u + n_x)$. This is the classical way to factorize
$H_R$.
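As a worked check, the four terms of this flop count are exactly the binomial expansion of the standard dense Cholesky cost, one third of the cube of the matrix size, applied to a matrix of size $N n_u + n_x$:

```latex
\[
\tfrac13 (N n_u + n_x)^3
  = \tfrac13 N^3 n_u^3 + N^2 n_u^2 n_x + N n_u n_x^2 + \tfrac13 n_x^3 .
\]
```

Setting $n_x = 0$ in this expression leaves only the term $\tfrac13 N^3 n_u^3$, which is the Cholesky cost for a matrix of size $N n_u$.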
9.2.2.2 $O(N)$ Cholesky factorization of $\hat H_R$
This section presents the adaptation to the MHE case of the $O(N)$ Hessian
factorization algorithm proposed in the paper [34] for the MPC case.
Since this structure-exploiting factorization starts from the last stage, also in
the MHE case it is chosen to compute the classical Cholesky factorization of
the permutation $\hat H_R$ of the condensed Hessian $H_R$ such that the input vector is
(9.36).
The only differences with respect to the MPC case are that at the last stage the
matrix $L_0$ needs to be computed, and the matrix $P_0$ of size $n_x \times n_x$ has to be
factorized.
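The lower triangular Cholesky factor $\hat L_R$ of the matrix $\hat H_R$ is then computed as
(using $N = 3$ to make notation easier)
\[
\hat L_R = \begin{bmatrix}
\Lambda_2 & & & \\
B_1^T L_2^T & \Lambda_1 & & \\
B_0^T A_1^T L_2^T & B_0^T L_1^T & \Lambda_0 & \\
A_0^T A_1^T L_2^T & A_0^T L_1^T & L_0^T & L_0
\end{bmatrix} \tag{9.37}
\]
in the same way as the matrix $\hat H_R$ is computed as
\[
\hat H_R = \begin{bmatrix}
D_2 & & & \\
B_1^T M_2^T & D_1 & & \\
B_0^T A_1^T M_2^T & B_0^T M_1^T & D_0 & \\
A_0^T A_1^T M_2^T & A_0^T M_1^T & M_0^T & P_0
\end{bmatrix} .
\]
Notice that in (9.37) the diagonal block $L_0$ in the last row denotes the Cholesky
factor of $P_0$, while the off-diagonal block $L_0^T$ in the same row is the block
associated with stage 0, analogous to $L_1^T$ and $L_2^T$.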
Besides the cost of the Hessian condensing algorithm, the computational
complexity of the factorization algorithm is of about
\[ \tfrac13 n_x^3 + N n_x^2 n_u + N n_x n_u^2 + \tfrac13 N n_u^3 \]
flops, that is lower than the cost of the classical Cholesky factorization in the MHE case.
When this is added to the computational complexity of the Hessian condensing
algorithm, asymptotically the first three terms are totally hidden. Therefore,
also in the MHE case the only term adding to the asymptotic complexity
is $O(N n_u^3)$, that is greatly reduced compared to the $O(N^3 n_u^3)$ computational
complexity of the classical Cholesky factorization of the condensed Hessian.
Since the factorization procedure is embedded in the condensing procedure, a
suitable condensing algorithm has to be chosen. Namely, Algorithms 21 and 22
can be used, since they compute explicitly the quantities $D_n$, $M_n$ and $P_0$. On
the other hand, Algorithm 20 cannot be used, since it does not provide these
quantities, and a modification of Algorithm 20 that explicitly provides $D_n$, $M_n$
and $P_0$ would increase the computational complexity, making the algorithm
unattractive with respect to e.g. Algorithm 21. Furthermore notice that, even
if $Q_n$ is diagonal, in general $Q_n^*$ is not diagonal for $n \neq N$, and therefore only
the versions of the Hessian condensing algorithms considering $Q_n$ as dense can
be used.
[Figure 9.5: two log-log plots of time [s]. Panels: (a) $N$ varying, $n_x = 16$, $n_u = 8$; (b) $n_x$ varying, $N = 30$, $n_u = 2$. Curves: $N^3$ alg, $N$ alg.]
Figure 9.5: Comparison of Hessian factorization algorithms for MHE:
Cholesky factorization of $H_R$ with cost $\tfrac13 N^3 n_u^3$ (red), structure-exploiting
Cholesky factorization of $\hat H_R$ with cost $N n_x^2 n_u + N n_x n_u^2 + \tfrac13 N n_u^3$ (blue).
9.2.2.3 Comparison of factorization algorithms for MHE
Figure 9.5 compares the running times of the
two Hessian factorization algorithms, excluding the cost to
build the condensed Hessian matrix using Algorithm 20, 21 or 22. In the MHE
case, the $O(N)$ factorization algorithm has a lower computational complexity
for all problem sizes. However, from the plots it appears that the classical
Cholesky factorization is slightly more efficient for small values of $N$, or for
very large values of $n_x$. This is due to the efficiency of the implementation: in fact,
the Cholesky factorization is a single linear algebra operation that has a large
matrix operand, and therefore it has high performance. On the other hand, the
$O(N)$ factorization is implemented as $O(N)$ calls to linear algebra routines. In
particular, besides the $O(n_x^3)$ factorization of $P_0$, most of the computational cost
comes from the computation of the terms $Q_n^* = Q_n - L_n^T L_n$, that in case of
$n_x \gg n_u$ are low-rank updates, and therefore generally attain a rather low
performance. A tailored implementation of the syrk routine for the low-rank
cases can partially mitigate this performance penalty.
9.2.3 Condensing and factorization algorithms for MHE
In this section the combination of Hessian condensing and factorization
algorithms is presented. Similarly to the MPC case, all Hessian condensing
algorithms (possibly tailored to a diagonal Hessian of the cost function) can be
combined with the classical $O(N^3)$ Cholesky factorization. However, only
Algorithms 21 and 22 in the version for dense Hessian of the cost function can
be combined with the structure-exploiting $O(N)$ Cholesky factorization. The
feasible combinations are summarized in Table 9.4.
Table 9.4: Feasible combinations of Hessian condensing algorithms and
Hessian factorization algorithms in the MHE case.
                |              condensing algorithms
factorization   | Algorithm 20  | Algorithm 21  | Algorithm 22
algorithms      | dense | diag  | dense | diag  | dense | diag
                | $Q_n$ | $Q_n$ | $Q_n$ | $Q_n$ | $Q_n$ | $Q_n$
$O(N^3)$        |   x   |   x   |   x   |   x   |   x   |   x
$O(N)$          |       |       |   x   |       |   x   |
In the remainder of the section, the two feasible combinations of Hessian
condensing algorithms with the structure-exploiting $O(N)$ Cholesky factorization
algorithm are presented in detail. Finally, different combinations of Hessian
condensing and factorization algorithms for the MHE problem are compared.
9.2.3.1 $O(N^2)$ computation of the Cholesky factor $\hat L_R$ - (1)
The Hessian condensing Algorithm 21 can be easily combined with the $O(N)$
structure-exploiting reversed Hessian factorization algorithm for the computation
of $\hat L_R$. The resulting algorithm is summarized in Algorithm 26. Notice that
the matrix $\Gamma_v^T$ internally used by Algorithm 26 is not permuted, i.e. it is the
same as in Algorithm 21: this simplifies the implementation if the matrices are
in panel-major format, since the top-left corner of all sub-matrices used in the
computation is properly aligned in memory. However, the algorithm builds the
lower Cholesky factor $\hat L_R$ of the reversed Hessian $\hat H_R$: the permutation is
performed in line 7 (diagonal blocks) and in the for loop in lines 10-12 (off-diagonal
blocks).
Besides the cost to compute $\Gamma_v^T$ (equal to about $N^2 n_x^2 n_u + 2 N n_x^3$ flops), the
overall computational complexity of the algorithm is of about
\[
2 N^2 n_x^2 n_u + N^2 n_x n_u^2 + 4 N n_x^3 + N n_x^2 n_u + N n_x n_u^2 + \tfrac13 n_x^3 + \tfrac13 N n_u^3
\approx 2 N^2 n_x^2 n_u + N^2 n_x n_u^2 + 4 N n_x^3 + \tfrac13 N n_u^3
\]
flops. This means that, asymptotically, the additional cost of Algorithm 26 with
respect to Algorithm 21 is equal to $\tfrac13 N n_u^3$ (in place of $\tfrac13 (N n_u + n_x)^3$ obtained
using the classical Cholesky factorization). Notice that, even if $Q_i$ is diagonal,
in general $Q_i^*$ is not diagonal for $i \neq N$, and therefore (except for the last stage)
it is not possible to reduce the computational cost in case of diagonal Hessian
of the cost function.
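Algorithm 26 Computation of the lower Cholesky factor $\hat L_R$ of $\hat H_R$, $O(N^2)$
algorithm - (1)
Require: $\Gamma_v^T$
1: $Q_N^* \leftarrow Q_N$
2: $\Gamma_w^T[0:N, N] \leftarrow \Gamma_v^T[0:N, N] \cdot Q_N$
3: for $i \leftarrow N-1, \dots, 0$ do
4:     $\Gamma_w^T[0:i+1, i] \leftarrow \Gamma_w^T[0:i+1, i+1] \cdot A_i$
5:     $D_i \leftarrow R_i + B_i^T \cdot (\Gamma_w^T[i+1, i+1])^T$
6:     $\Lambda_i \leftarrow D_i^{1/2}$
7:     $\hat L_R[N-1-i, N-1-i] \leftarrow \Lambda_i$
8:     $M_i \leftarrow S_i + \Gamma_w^T[i+1, i]$
9:     $L_i^T \leftarrow M_i^T \cdot \Lambda_i^{-T}$
10:    for $j \leftarrow 0, \dots, i$ do
11:        $\hat L_R[N-j, N-1-i] \leftarrow \Gamma_v^T[j, i] \cdot L_i^T$
12:    end for
13:    $Q_i^* \leftarrow Q_i - L_i^T \cdot L_i$
14:    $\Gamma_w^T[0:i, i] \leftarrow \Gamma_w^T[0:i, i] + \Gamma_v^T[0:i, i] \cdot Q_i^*$
15: end for
16: $P_0 \leftarrow 0 + I^T \cdot (\Gamma_w^T[0, 0])^T$
17: $L_0 \leftarrow P_0^{1/2}$
18: $\hat L_R[N, N] \leftarrow L_0$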
9.2.3.2 $O(N^2)$ computation of the Cholesky factor $\hat L_R$ - (2)
Alternatively, similarly to the MPC case, the Hessian condensing Algorithm 22
can be combined with the $O(N)$ structure-exploiting reversed Hessian factorization
algorithm for the computation of $\hat L_R$. The resulting algorithm is summarized
in Algorithm 27. It is possible to embed the computation of $\Lambda_n$, $L_n$
and $Q_n^*$ with the Cholesky factorization of $P_n$, in the same way as in the backward
Riccati recursion implementation in Algorithm 2: this is done in line 4 of
Algorithm 27.
Besides the cost to compute $\Gamma_v^T$ (equal to about $N^2 n_x^2 n_u + 2 N n_x^3$ flops), the
computational complexity of the algorithm is of about
\[
N^2 n_x n_u^2 + \tfrac73 N n_x^3 + 6 N n_x^2 n_u + 2 N n_x n_u^2 + \tfrac13 n_x^3 + \tfrac13 N n_u^3
\approx N^2 n_x n_u^2 + \tfrac73 N n_x^3 + \tfrac13 N n_u^3
\]
flops. Notice that the third, the fourth and the fifth term are excluded from
the final approximation of the computational cost, since asymptotically they are
totally hidden by the $O(N^2 n_x^2 n_u)$ and $O(N^2 n_x n_u^2)$ terms. The leading terms of
the computational complexity are identical to the ones in the MPC case.
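Algorithm 27 Computation of the lower Cholesky factor $\hat L_R$ of $\hat H_R$, $O(N^2)$
algorithm - (2)
Require: $\Gamma_v^T$
1: $L_N \leftarrow Q_N^{1/2}$
2: for $i \leftarrow N-1, \dots, 0$ do
3:     $\begin{bmatrix} B_i^T \\ A_i^T \end{bmatrix}_L \leftarrow \begin{bmatrix} B_i^T \\ A_i^T \end{bmatrix} \cdot L_{i+1}$
4:     $\begin{bmatrix} \Lambda_i & \\ L_i^T & L_i \end{bmatrix} \leftarrow \left( \begin{bmatrix} R_i & * \\ S_i^T & Q_i \end{bmatrix} + \left( \begin{bmatrix} B_i^T \\ A_i^T \end{bmatrix}_L \right) \cdot \left( \begin{bmatrix} B_i^T \\ A_i^T \end{bmatrix}_L \right)^T \right)^{1/2}$
5:     $\hat L_R[N-1-i, N-1-i] \leftarrow \Lambda_i$
6:     for $j \leftarrow 0, \dots, i$ do
7:         $\hat L_R[N-j, N-1-i] \leftarrow \Gamma_v^T[j, i] \cdot L_i^T$
8:     end for
9: end for
10: $\hat L_R[N, N] \leftarrow L_0$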
9.2.3.3 Comparison of condensing and factorization algorithms for MHE
In Figure 9.6 there is a comparison of algorithms for the computation of the
Cholesky factor of the condensed Hessian $H_R$ or of the reversed condensed
Hessian $\hat H_R$. The same three algorithms considered in the MPC case are compared
also in the MHE case. Namely, the three algorithms are:
• Algorithm 20 + classical $O(N^3)$ condensed Hessian Cholesky factorization.
• Algorithm 26. It is a combination of the Hessian condensing Algorithm 21
and of the structure-exploiting $O(N)$ reversed condensed Hessian Cholesky
factorization.
• Algorithm 27. It is a combination of the Hessian condensing Algorithm 22
and of the structure-exploiting $O(N)$ reversed condensed Hessian Cholesky
factorization.
In the comparison in terms of running times, the algorithms are implemented
using matrices in panel-major format and the linear algebra routines
in HPMPC. The results are in Figure 9.6. The results are similar to the results of
the condensing algorithms for the MHE case in Figure 9.4, since the factorization
cost is generally smaller than the condensing cost. Namely, in case of dense
Hessian of the cost function, Algorithm 27 is the best choice for all problem
sizes. In case of diagonal Hessian of the cost function, Algorithm 27 is the best
choice for most problem sizes, with Algorithm 20 + classical $O(N^3)$ Cholesky
factorization being slightly faster in case of very small $N$ or very large $n_x$, due
to the smaller coefficient of the $O(N n_x^3)$ term in Algorithm 20.
9.2.4 Solution algorithms for MHE
If the aim of the condensing procedure is the solution of the KKT system of the
MPC problem, a solution procedure must be employed after the completion of
the KKT matrix factorization procedure.
The classical way to do so is to use, for example, the lower triangular Cholesky factor
$L_R$ of the condensed Hessian to perform a forward-backward substitution with
$\gamma_r$ as right-hand side. The lower Cholesky factor can be computed using one of
the presented methods.
Alternatively, the structure still present in the Cholesky factor can be exploited
to reduce the computational cost, as proposed in the paper [34] for the MPC
case.
9.2.4.1 $O(N^2)$ solution
Once the lower Cholesky factor $L_R$ of the condensed Hessian matrix
$H_R$ and the condensed gradient $\gamma_r$ have been computed, it is possible to compute
the minimizer of the cost function (9.27) by setting its gradient to zero, as
\[ H_R u + \gamma_r = 0 \quad \Rightarrow \quad u = -L_R^{-T} L_R^{-1} \gamma_r \]
The forward and backward substitutions can be performed using the dense linear
algebra routine trsv in BLAS, at a total cost of $2 (N n_u + n_x)^2$ flops, that is
quadratic in $N n_u$ and $n_x$.
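The dense solution step can be sketched in NumPy as follows. The explicit loops play the role of the two trsv calls; the random positive definite matrix standing in for $H_R$, the vector standing in for $\gamma_r$ and the helper names are assumptions made only for the example.

```python
import numpy as np

def forward_sub(L, rhs):
    """Solve L y = rhs for lower triangular L."""
    y = np.zeros_like(rhs)
    for i in range(len(rhs)):
        y[i] = (rhs[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def backward_sub(L, rhs):
    """Solve L^T u = rhs for lower triangular L."""
    u = np.zeros_like(rhs)
    for i in range(len(rhs) - 1, -1, -1):
        u[i] = (rhs[i] - L[i + 1:, i] @ u[i + 1:]) / L[i, i]
    return u

rng = np.random.default_rng(3)
n = 7                                    # stands for N*nu + nx
X = rng.standard_normal((n, n))
H = X @ X.T + n * np.eye(n)              # positive definite stand-in for H_R
gamma = rng.standard_normal(n)           # stand-in for gamma_r

L = np.linalg.cholesky(H)                # H = L L^T
u = -backward_sub(L, forward_sub(L, gamma))
print(np.allclose(H @ u + gamma, 0.0))
```

Each substitution touches every entry of the triangular factor once, which is where the quadratic cost in the size of the condensed system comes from.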
[Figure 9.6: four log-log plots of time [s]. Panels: (a) $N$ varying, $n_x = 16$, $n_u = 8$, dense cost function; (b) $N$ varying, $n_x = 16$, $n_u = 8$, diagonal cost function; (c) $n_x$ varying, $N = 30$, $n_u = 2$, dense cost function; (d) $n_x$ varying, $N = 30$, $n_u = 2$, diagonal cost function. Curves: $N^3 n_x^2$ alg, $N^2 n_x^2$ alg, $N^2 n_x^3$ alg.]
Figure 9.6: Comparison of Hessian condensing and factorization algorithms for
MHE: Algorithm 20 + $O(N^3)$ Cholesky factorization (corresponding to the MPC
Algorithm 9 + $O(N^3)$ Cholesky factorization with cost $O(N^3)$ and $O(n_x^2)$) (blue),
Algorithm 26 (corresponding to the MPC Algorithm 15 with cost $O(N^2)$ and
$O(n_x^2)$) (red), Algorithm 27 (corresponding to the MPC Algorithm 16 with cost
$O(N^2)$ and $O(n_x^3)$) (green). The time to compute $\Gamma_v^T$ is included in the total
solution time.
9.2.4.2 $O(N)$ solution
If the $O(N)$ factorization algorithm is employed to compute the lower triangular
Cholesky factor $\hat L_R$ of the permuted Hessian matrix $\hat H_R$ as in Algorithms 26
or 27, it is possible to exploit the structure still present in $\hat L_R$ to reduce the
computational complexity in $N$ of the solution algorithm.
Moreover, as shown in the paper [34] for the MPC case, it is not even necessary
to explicitly build the lower Cholesky factor $\hat L_R$, reducing the factorization cost
by $N^2 n_x n_u^2 + 2 N n_x^2 n_u$ flops (and e.g. making Algorithm 27 linear in $N$). In
fact, only the matrices $\Lambda_i$, $L_i$ and $L_0$ are employed in the solution algorithm.
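The lower Cholesky factor $\hat L_R$ in (9.37) has the structure
\[
\hat L_R = \begin{bmatrix}
\Lambda_2 & & & \\
B_1^T L_2^T & \Lambda_1 & & \\
B_0^T A_1^T L_2^T & B_0^T L_1^T & \Lambda_0 & \\
A_0^T A_1^T L_2^T & A_0^T L_1^T & L_0^T & L_0
\end{bmatrix}
= \hat\Lambda + \hat\Gamma_u^T \hat L^T = \hat\Lambda + \hat B^T \hat A^{-T} \hat L^T
\]
where
\[
\hat\Lambda = \begin{bmatrix} \Lambda_2 & & & \\ & \Lambda_1 & & \\ & & \Lambda_0 & \\ & & & L_0 \end{bmatrix} , \quad
\hat L^T = \begin{bmatrix} & & & \\ L_2^T & & & \\ & L_1^T & & \\ & & L_0^T & \end{bmatrix} ,
\]
\[
\hat B^T = \begin{bmatrix} B_2^T & & & \\ & B_1^T & & \\ & & B_0^T & \\ & & & I \end{bmatrix} , \quad
\hat A^{-T} = \begin{bmatrix} I & & & \\ -A_2^T & I & & \\ & -A_1^T & I & \\ & & -A_0^T & I \end{bmatrix}^{-1} .
\]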
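By defining the vector $\hat y = \hat L_R^T \hat u$, the forward substitution is in the form
\[ \hat L_R \hat y = \left( \hat\Lambda + \hat B^T \hat A^{-T} \hat L^T \right) \hat y = -\hat\gamma_r \]
that gives $\hat y$ with the recursion
\[ \hat y = -\hat\Lambda^{-1} \left( \hat\gamma_r + \hat B^T \hat A^{-T} \hat L^T \hat y \right) \]
that for $N = 3$ looks like
\[
\begin{bmatrix} y_3 \\ y_2 \\ y_1 \\ y_0 \end{bmatrix} =
\begin{bmatrix}
-\Lambda_2^{-1} (\gamma_3) \\
-\Lambda_1^{-1} \left( \gamma_2 + B_1^T L_2^T y_3 \right) \\
-\Lambda_0^{-1} \left( \gamma_1 + B_0^T A_1^T L_2^T y_3 + B_0^T L_1^T y_2 \right) \\
-L_0^{-1} \left( \gamma_0 + A_0^T A_1^T L_2^T y_3 + A_0^T L_1^T y_2 + L_0^T y_1 \right)
\end{bmatrix} .
\]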
Notice that this is a backward recursion with respect to the indexes of the data
matrices, due to the permutation of the condensed Hessian.
The backward substitution is in the form
\[ \hat L_R^T \hat u = \left( \hat\Lambda^T + \hat L \hat A^{-1} \hat B \right) \hat u = \hat y \]
that gives the recursion
\[ \hat u = \hat\Lambda^{-T} \left( \hat y - \hat L \hat A^{-1} \hat B \hat u \right) \]
that for $N = 3$ looks like
\[
\begin{bmatrix} u_2 \\ u_1 \\ u_0 \\ x_0 \end{bmatrix} =
\begin{bmatrix}
\Lambda_2^{-T} (y_3 - L_2 B_1 u_1 - L_2 A_1 B_0 u_0 - L_2 A_1 A_0 x_0) \\
\Lambda_1^{-T} (y_2 - L_1 B_0 u_0 - L_1 A_0 x_0) \\
\Lambda_0^{-T} (y_1 - L_0 x_0) \\
L_0^{-T} (y_0)
\end{bmatrix} .
\]
Notice that this is a forward recursion with respect to the indexes of the data
matrices, due to the permutation of the condensed Hessian.
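The two recursions can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions, not the HPMPC implementation: the blocks $\Lambda_i$ and $L_i$ and the Cholesky factor of $P_0$ are obtained from a Riccati-like backward recursion, the solution is returned in the natural ordering $(x_0, u_0, \dots, u_{N-1})$, the blocks of $\hat y$ are indexed by the input they belong to rather than as in the $N = 3$ example, and general solves are used where a real implementation would use triangular solves. The result is checked against a dense solve with the condensed Hessian, built explicitly only for verification.

```python
import numpy as np

def riccati_factors(A, B, Q, R, S):
    """Lambda_i, L_i = Lambda_i^{-1} M_i, and the Cholesky factor of P_0."""
    N = len(A)
    Lam, L, P = [None] * N, [None] * N, Q[N]
    for i in range(N - 1, -1, -1):
        Lam[i] = np.linalg.cholesky(R[i] + B[i].T @ P @ B[i])
        L[i] = np.linalg.solve(Lam[i], S[i] + B[i].T @ P @ A[i])
        P = Q[i] + A[i].T @ P @ A[i] - L[i].T @ L[i]
    return Lam, L, np.linalg.cholesky(P)

def structured_solve(A, B, Lam, L, L0, g_x, g_u):
    """Solve H v = -gamma for v = [x_0, u_0, ..., u_{N-1}] without forming H."""
    N = len(A)
    y, t = [None] * N, np.zeros(A[0].shape[0])
    for i in range(N - 1, -1, -1):             # forward substitution, backward in i
        y[i] = -np.linalg.solve(Lam[i], g_u[i] + B[i].T @ t)
        t = L[i].T @ y[i] + A[i].T @ t
    y_x = -np.linalg.solve(L0, g_x + t)
    x0 = np.linalg.solve(L0.T, y_x)            # backward substitution, forward in i
    u, s = [None] * N, x0
    for i in range(N):
        u[i] = np.linalg.solve(Lam[i].T, y[i] - L[i] @ s)
        s = A[i] @ s + B[i] @ u[i]
    return np.concatenate([x0] + u)

def dense_hessian(A, B, Q, R, S):
    """Condensed Hessian in the ordering [x_0, u_0, ..., u_{N-1}] (reference only)."""
    N, nx, nu = len(A), A[0].shape[0], B[0].shape[1]
    G = np.zeros((nx, nx + N * nu))
    G[:, :nx] = np.eye(nx)                     # sensitivity of x_0
    H = np.zeros((nx + N * nu, nx + N * nu))
    for i in range(N):
        c = slice(nx + i * nu, nx + (i + 1) * nu)
        H += G.T @ Q[i] @ G
        H[c, c] += R[i]
        H[c, :] += S[i] @ G
        H[:, c] += G.T @ S[i].T
        G = A[i] @ G                           # sensitivity of x_{i+1}
        G[:, c] = B[i]
    return H + G.T @ Q[N] @ G

rng = np.random.default_rng(4)
N, nx, nu = 5, 3, 2
A = [0.5 * rng.standard_normal((nx, nx)) for _ in range(N)]
B = [rng.standard_normal((nx, nu)) for _ in range(N)]
W = []
for _ in range(N):                             # positive definite stage cost Hessians
    X = rng.standard_normal((nu + nx, nu + nx))
    W.append(X @ X.T + np.eye(nu + nx))
R = [w[:nu, :nu] for w in W]
S = [w[:nu, nu:] for w in W]
Q = [w[nu:, nu:] for w in W] + [np.eye(nx)]
gamma = rng.standard_normal(nx + N * nu)       # stand-in for the condensed gradient

Lam, L, L0 = riccati_factors(A, B, Q, R, S)
g_u = [gamma[nx + i * nu: nx + (i + 1) * nu] for i in range(N)]
v = structured_solve(A, B, Lam, L, L0, gamma[:nx], g_u)
v_ref = np.linalg.solve(dense_hessian(A, B, Q, R, S), -gamma)
print(np.allclose(v, v_ref))
```

The structured solve never touches an $O(N^2)$ object: both sweeps keep a single running vector of size $n_x$, which is what makes the solution linear in $N$.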
The algorithm is summarized in Algorithm 28. The computational cost of the
algorithm is of $4 N n_x^2 + 8 N n_x n_u + 2 N n_u^2$ flops; in addition, it enables a reduction of
the cost of the factorization Algorithms 26 and 27 of $N^2 n_x n_u^2 + 2 N n_x^2 n_u$ flops
compared with the $O(N^2)$ solution algorithm.
Algorithm 28 Computation of the solution of the condensed system, $O(N)$
algorithm
Require: $\Lambda_i$, $L_i$, $L_0$
$t_N \leftarrow 0$
$y_N \leftarrow -\Lambda_{N-1}^{-1} (\gamma_N)$
for $i \leftarrow N-1, \dots, 1$ do
    $t_i \leftarrow L_i^T y_{i+1} + A_i^T t_{i+1}$
    $y_i \leftarrow -\Lambda_{i-1}^{-1} \left( \gamma_i + B_{i-1}^T t_i \right)$
end for
$t_0 \leftarrow L_0^T y_1 + A_0^T t_1$
$y_0 \leftarrow -L_0^{-1} (\gamma_0 + t_0)$
$x_0 \leftarrow L_0^{-T} (y_0)$
$t_0 \leftarrow x_0$
for $i \leftarrow 0, \dots, N-1$ do
    $u_i \leftarrow \Lambda_i^{-T} (y_{i+1} - L_i t_i)$
    $t_{i+1} \leftarrow B_i u_i + A_i t_i$
end for
9.3 Conclusion
This chapter presented and compared in a systematic way several condensing
algorithms, for both the MPC and the MHE problem: namely, three condensing
algorithms, two factorization algorithms, two combined condensing-factorization
algorithms and finally two solution algorithms.
Regarding condensing algorithms, in the MPC case, the three condensing
algorithms have different asymptotic computational complexities, of $O(N^3)$ and
$O(n_x^2)$, of $O(N^2)$ and $O(n_x^2)$ and of $O(N^2)$ and $O(n_x^3)$ flops. The first algorithm
is the classical condensing algorithm and it is the best option for small
values of the ratio $N/n_x$. The last algorithm employs a recursion somewhat
analogous to the Riccati recursion, and it is the best option for large values of the
ratio $N/n_x$. The middle algorithm has the best computational complexity but
larger coefficients, and it is therefore always the second best. In the MHE case,
all condensing algorithms have a complexity $O(n_x^3)$, and therefore the last
algorithm is generally the best choice for all problem sizes. Notice that in both
the MPC and the MHE case it is possible to compute the condensed Hessian
matrix in time $O(N^2)$.
Regarding factorization algorithms, in the MPC case, the two factorization
algorithms are the classical Cholesky factorization of the condensed Hessian matrix
(of size $N n_u$) with computational cost $O((N n_u)^3)$ (and therefore cubic in $N$ and
constant in $n_x$) and a structure-exploiting factorization with computational cost
of $O(N (n_x^2 n_u + n_x n_u^2 + n_u^3))$ (and therefore linear in $N$ and quadratic in $n_x$). In
the MHE case, the condensed Hessian matrix has size $n_x + N n_u$, and therefore
the classical Cholesky factorization has a computational cost of $O((n_x + N n_u)^3)$,
that is always larger than the computational cost of the structure-exploiting
factorization (equal to $O(n_x^3 + N (n_x^2 n_u + n_x n_u^2 + n_u^3))$ flops in the MHE case).
The structure-exploiting factorization needs to be combined with either the
middle or the last condensing algorithm. The resulting algorithm has approximately
the same computational complexity as the condensing algorithm alone,
thanks to the low computational complexity of the $O(N)$ structure-exploiting
factorization. In particular, in both the MPC and the MHE case it is possible
to compute the Cholesky factorization of the condensed Hessian matrix in time
$O(N^2)$.
Regarding the solution algorithms, it is possible to solve the KKT system with
the classical backward-forward substitution employing the Cholesky factorization
of the condensed Hessian at a computational cost of $O(N^2)$ flops, or it is
possible to employ a structure-exploiting solution at a computational cost of
$O(N)$ flops.
Notice that the use of the structure-exploiting solution removes the need to
explicitly build the condensed Hessian matrix, reducing the computational cost
of the condensing algorithms. Therefore, it is possible to solve the KKT system
in time $O(N)$ by combining the last condensing algorithm stopped before the
explicit Hessian build, the structure-exploiting factorization and the
structure-exploiting solution. The resulting algorithm has analogies with the backward
Riccati recursion, and they share the same computational complexity.
On the other hand, if the condensed Hessian matrix or its Cholesky factor is
explicitly needed, the $O(N^2)$ cost cannot be avoided in general, simply because
the condensed Hessian matrix contains $O(N^2)$ elements.
Chapter 10
Partial condensing
The paper [18] proposes techniques to control the level of sparsity in MPC
problems, trading-oﬀ horizon length and input vector size. The introduction of
this degree of freedom in the formulation of the MPC problem can be used to ﬁnd
the optimal trade-oﬀ, minimizing the ﬂop count for the MPC solver. Namely,
the horizon length can be reduced at the expense of a larger input vector size
(partial condensing), or conversely the input vector size can be reduced at the
expense of a larger horizon length (sequential update). If roughly nx > nu, it
is found that a decrease of the horizon length leads to a reduction in the ﬂop
count, while if roughly nx < nu, an increase of the horizon length leads to a
reduction in the ﬂop count.
In this chapter, only the partial condensing case is considered, i.e. the
decrease of the horizon length at the expense of augmenting the input vector size,
even if some of the results about the optimal horizon length selection cover the
sequential update case as well. Moreover, the influence of the implementation of the
linear algebra routines on the performance of partial condensing is investigated.
Therefore, partial condensing is considered as a technique to reduce the solution
time of unconstrained linear MPC (and MHE) problems, disregarding the effect
on inequality constraints.
The idea of partial condensing can be introduced through an example. Let
us consider an MPC problem with horizon length $N = 6$, and let us define the
partition of the state and input vectors such that the states and inputs of $N_c = 3$
consecutive stages are grouped together into blocks (where a block is defined as
a group of consecutive stages)
\[
\bar x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
= \begin{bmatrix} \bar x_0 \\ \bar x_3 \\ x_6 \end{bmatrix} , \quad
\bar u = \begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \\ u_4 \\ u_5 \end{bmatrix}
= \begin{bmatrix} \bar u_0 \\ \bar u_3 \end{bmatrix} .
\]
These can be seen as the state and input vectors of an MPC problem with horizon
length $N_p = N / N_c = 2$.
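The cost function matrices and vectors can be partitioned accordingly, as
\[
\bar Q = \begin{bmatrix} Q_0 & & \\ & \ddots & \\ & & Q_6 \end{bmatrix}
= \begin{bmatrix} \bar Q_0 & 0 & 0 \\ 0 & \bar Q_3 & 0 \\ 0 & 0 & Q_6 \end{bmatrix} , \quad
\bar q = \begin{bmatrix} q_0 \\ \vdots \\ q_6 \end{bmatrix}
= \begin{bmatrix} \bar q_0 \\ \bar q_3 \\ q_6 \end{bmatrix} ,
\]
\[
\bar S = \begin{bmatrix} S_0 & & & 0 \\ & \ddots & & \vdots \\ & & S_5 & 0 \end{bmatrix}
= \begin{bmatrix} \bar S_0 & 0 & 0 \\ 0 & \bar S_3 & 0 \end{bmatrix} ,
\]
\[
\bar R = \begin{bmatrix} R_0 & & \\ & \ddots & \\ & & R_5 \end{bmatrix}
= \begin{bmatrix} \bar R_0 & 0 \\ 0 & \bar R_3 \end{bmatrix} , \quad
\bar r = \begin{bmatrix} r_0 \\ \vdots \\ r_5 \end{bmatrix}
= \begin{bmatrix} \bar r_0 \\ \bar r_3 \end{bmatrix} .
\]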
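Therefore the cost function expression can be rewritten as
\begin{align*}
\phi &= \left( \sum_{n=0}^{N-1} \phi_n \right) + \phi_N \\
&= \left( \sum_{n=0}^{N-1} \tfrac12 u_n^T R_n u_n + u_n^T S_n x_n + \tfrac12 x_n^T Q_n x_n + r_n^T u_n + q_n^T x_n + \tfrac12 \rho_n \right) + \phi_N \\
&= \left( \sum_{i=0}^{N_p-1} \bar\phi_{\{k = N_c i\}} \right) + \phi_N \\
&= \left( \sum_{i=0}^{N_p-1} \tfrac12 \bar u_k^T \bar R_k \bar u_k + \bar u_k^T \bar S_k \bar x_k + \tfrac12 \bar x_k^T \bar Q_k \bar x_k + \bar r_k^T \bar u_k + \bar q_k^T \bar x_k + \tfrac12 \bar\rho_k \right) + \phi_N
\end{align*}
where $k = N_c i$ and
\[ \phi_N = \left( \tfrac12 x_N^T Q_N x_N + q_N^T x_N + \tfrac12 \rho_N \right) . \]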
The idea of partial condensing is to remove all but the first state component in
each block. This can be done by means of condensing. More precisely, since the
first state component is retained as an optimization variable, the condensing
within each block is analogous to the condensing for the MHE problem.
Namely, given a horizon within each block (or block size) of $N_c = 3$, the state
vector of each block $\bar x_k$ can be computed as a function of the initial state $x_k$ and
the inputs $\bar u_k$ of the block, as
\[ \bar x_k = \Gamma_{x,k} x_k + \Gamma_{u,k} \bar u_k + \Gamma_{b,k} \tag{10.1} \]
where
\[
\Gamma_{x,k} = \begin{bmatrix} I \\ A_k \\ A_{k+1} A_k \end{bmatrix} , \quad
\Gamma_{u,k} = \begin{bmatrix} 0 & 0 & 0 \\ B_k & 0 & 0 \\ A_{k+1} B_k & B_{k+1} & 0 \end{bmatrix} , \quad
\Gamma_{b,k} = \begin{bmatrix} 0 \\ b_k \\ A_{k+1} b_k + b_{k+1} \end{bmatrix} .
\]
The initial state of the following block xk+Nc = xk+3 can be computed as a
function of the initial state xk and the inputs u¯k of the current block as
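\[
x_{k+3} = \bar{A}_k x_k + \bar{B}_k \bar{u}_k + \bar{b}_k \tag{10.2}
\]
where
\[
\begin{aligned}
\bar{A}_k &= A_{k+2} A_{k+1} A_k \\
\bar{B}_k &= \begin{bmatrix} A_{k+2} A_{k+1} B_k & A_{k+2} B_{k+1} & B_{k+2} \end{bmatrix} \\
\bar{b}_k &= A_{k+2} (A_{k+1} b_k + b_{k+1}) + b_{k+2}.
\end{aligned}
\]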
This equation is the state-space equation of a MPC problem with horizon Np =
2, that has the same state vector size nx and larger input vector size Ncnu.
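As an illustration of (10.2), the following Python sketch accumulates the condensed block matrices stage by stage and checks them against a direct simulation of the original dynamics. It is only a minimal sketch: the helper names (matmul, matvec, vadd, condense_block) and the numerical data are hypothetical, and matrices are stored as plain lists of rows to avoid any external dependency.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(X, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in X]

def vadd(*vs):
    """Element-wise sum of any number of vectors."""
    return [sum(t) for t in zip(*vs)]

def condense_block(A, B, b):
    """Return (A_bar, B_bar, b_bar) of (10.2) for one block.

    A, B, b are lists with the stage data A_k, ..., A_{k+Nc-1} (and likewise
    for B and b) of the stages grouped in the block.
    """
    nx = len(A[0])
    A_bar = [[float(i == j) for j in range(nx)] for i in range(nx)]  # identity
    B_blocks = []        # one nx-by-nu block per input of the block
    b_bar = [0.0] * nx
    for Aj, Bj, bj in zip(A, B, b):
        A_bar = matmul(Aj, A_bar)
        B_blocks = [matmul(Aj, blk) for blk in B_blocks] + [Bj]
        b_bar = vadd(matvec(Aj, b_bar), bj)
    # concatenate the blocks horizontally into the nx-by-(Nc*nu) matrix B_bar
    B_bar = [sum((blk[i] for blk in B_blocks), []) for i in range(nx)]
    return A_bar, B_bar, b_bar

# Illustrative time-variant data with nx = 2, nu = 1, Nc = 3
A = [[[1.0, 1.0], [0.0, 1.0]], [[1.0, 2.0], [0.0, 1.0]], [[0.5, 0.0], [1.0, 1.0]]]
B = [[[0.0], [1.0]], [[1.0], [0.0]], [[1.0], [1.0]]]
b = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
x0 = [1.0, -1.0]
u = [[2.0], [-1.0], [0.5]]

# Direct simulation of the three original stages
x_direct = list(x0)
for Aj, Bj, bj, uj in zip(A, B, b, u):
    x_direct = vadd(matvec(Aj, x_direct), matvec(Bj, uj), bj)

# Single step of the partially condensed dynamics
A_bar, B_bar, b_bar = condense_block(A, B, b)
u_bar = [ui for uj in u for ui in uj]
x_block = vadd(matvec(A_bar, x0), matvec(B_bar, u_bar), b_bar)
```

Both computations should return the same state at the beginning of the following block, which is the property that lets the block be treated as a single stage with input vector of size Ncnu.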
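Combining the expressions (10.1) and (10.2) gives
\[
\begin{bmatrix} \bar{x}_k \\ x_{k+3} \end{bmatrix}
= \Gamma_x x_k + \Gamma_u \bar{u}_k + \Gamma_b
\]
where
\[
\Gamma_x =
\begin{bmatrix} I \\ A_k \\ A_{k+1} A_k \\ A_{k+2} A_{k+1} A_k \end{bmatrix},
\quad
\Gamma_u =
\begin{bmatrix}
0 & 0 & 0 \\
B_k & 0 & 0 \\
A_{k+1} B_k & B_{k+1} & 0 \\
A_{k+2} A_{k+1} B_k & A_{k+2} B_{k+1} & B_{k+2}
\end{bmatrix},
\]
\[
\Gamma_b =
\begin{bmatrix} 0 \\ b_k \\ A_{k+1} b_k + b_{k+1} \\ A_{k+2} (A_{k+1} b_k + b_{k+1}) + b_{k+2} \end{bmatrix}.
\]
These expressions clearly resemble the expressions for \(\Gamma_v = \begin{bmatrix} \Gamma_x & \Gamma_u \end{bmatrix}\) and \(\Gamma_b\) in
(9.25).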
By inserting (10.1) in the cost function, it is possible to rewrite it as a function
of the initial state xk and of the inputs u¯k of each block. Namely, the cost
function at each block becomes
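\[
\begin{aligned}
\bar{\phi}_k ={}& \frac{1}{2}
\begin{bmatrix} x_k \\ \bar{u}_k \end{bmatrix}^T
\begin{bmatrix}
\Gamma_{x,k}^T \bar{Q}_k \Gamma_{x,k} & \Gamma_{x,k}^T \bar{Q}_k \Gamma_{u,k} + \Gamma_{x,k}^T \bar{S}_k^T \\
\Gamma_{u,k}^T \bar{Q}_k \Gamma_{x,k} + \bar{S}_k \Gamma_{x,k} & \Gamma_{u,k}^T \bar{Q}_k \Gamma_{u,k} + \Gamma_{u,k}^T \bar{S}_k^T + \bar{S}_k \Gamma_{u,k} + \bar{R}_k
\end{bmatrix}
\begin{bmatrix} x_k \\ \bar{u}_k \end{bmatrix} \\
&+
\begin{bmatrix}
\Gamma_{x,k}^T \bar{Q}_k \Gamma_{b,k} + \Gamma_{x,k}^T \bar{q}_k \\
\Gamma_{u,k}^T \bar{Q}_k \Gamma_{b,k} + \bar{S}_k \Gamma_{b,k} + \Gamma_{u,k}^T \bar{q}_k + \bar{r}_k
\end{bmatrix}^T
\begin{bmatrix} x_k \\ \bar{u}_k \end{bmatrix}
+ \frac{1}{2} \left( \Gamma_{b,k}^T \bar{Q}_k \Gamma_{b,k} + 2 \bar{q}_k^T \Gamma_{b,k} + \bar{\rho}_k \right) \\
={}& \frac{1}{2}
\begin{bmatrix} x_k \\ \bar{u}_k \end{bmatrix}^T
\begin{bmatrix} H_{Q,k} & H_{S,k}^T \\ H_{S,k} & H_{R,k} \end{bmatrix}
\begin{bmatrix} x_k \\ \bar{u}_k \end{bmatrix}
+
\begin{bmatrix} g_{q,k} \\ g_{r,k} \end{bmatrix}^T
\begin{bmatrix} x_k \\ \bar{u}_k \end{bmatrix}
+ \rho_{\rho,k}
\end{aligned}
\]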
The expressions for HQ,k, HS,k, HR,k, gq,k, gr,k are formally identical to the
expressions in (9.28), provided that Γx,k, Γu,k, Γb,k are employed in place of Γx,
Γu, Γb. In practice, the diﬀerence is that Qk+Nc and qk+Nc are considered in
the computation of the cost function at the following block, and they should not
be considered twice. Therefore, the expressions for HQ,k, HS,k, HR,k, gq,k, gr,k
can be obtained by employing the condensing algorithms for MHE presented
in Section 9.2.1 with QN = 0 and qN = 0, and possibly by exploiting this to
reduce the computational cost.
10.1 Partial condensing algorithms
The condensing algorithms for MHE presented in Section 9.2.1 can be employed
for condensing within each block, provided that the terminal cost is set to zero,
i.e. Qk+Nc = 0 and qk+Nc = 0. This can be exploited to reduce the computation
cost of the condensing algorithms.
Regarding the choice between the three Hessian condensing algorithms presented in
Section 9.2.1, the safest choice is Algorithm 22, which is found to give the best
performance in case of free initial state. Furthermore, this condensing algorithm
allows the computation of the condensed Hessian matrix and of the condensed
gradient vector to be merged into a single routine. A version of this algorithm tailored to the
case of terminal cost equal to zero is presented in Algorithm 29.
Algorithm 29 Computation of HQ, HS , HR, gq, gr, partial condensing case
Require:
Γx, Γu, Γb
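1: \( \begin{bmatrix} D_{N-1} & & \\ M_{N-1}^T & P_{N-1} & \\ m_{N-1}^T & p_{N-1}^T & * \end{bmatrix} \leftarrow \begin{bmatrix} R_{N-1} & & \\ S_{N-1}^T & Q_{N-1} & \\ r_{N-1}^T & q_{N-1}^T & * \end{bmatrix} \)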
2: HR[N − 1, N − 1]← DN−1
3: HR[N − 1, 0 : N − 2]←MN−1 · Γu[N − 1, 0 : N − 2]
4: HS [N − 1]←MN−1 · Γx[N − 1]
5: gr[N − 1]← mN−1 +MN−1 · Γb[N − 1]
6: for i← N − 2, . . . , 0 do
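7: \( \begin{bmatrix} L_{00} & \\ L_{10} & * \end{bmatrix} \leftarrow \begin{bmatrix} P_{i+1} & \\ p_{i+1}^T & * \end{bmatrix}^{1/2} \)
8: \( A^T L \leftarrow \begin{bmatrix} B_i^T \\ A_i^T \\ b_i^T \end{bmatrix} \cdot L_{00} + \begin{bmatrix} 0 \\ 0 \\ L_{10} \end{bmatrix} \)
9: \( \begin{bmatrix} D_i & & \\ M_i^T & P_i & \\ m_i^T & p_i^T & * \end{bmatrix} \leftarrow \begin{bmatrix} R_i & & \\ S_i^T & Q_i & \\ r_i^T & q_i^T & * \end{bmatrix} + (A^T L) \cdot (A^T L)^T \)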
10: HR[i, i]← Di
11: HR[i, 0 : i− 1]←Mi · Γu[i, 0 : i− 1]
12: HS [i]←Mi · Γx[i]
13: gr[i]← mi +Mi · Γb[i]
14: end for
15: HQ ← P0
16: gq ← p0
If partial condensing is employed to condense Nc stages, the computational cost
of the algorithm is of about \(N_c^2 n_x n_u^2 + \tfrac{7}{3} N_c n_x^3\) flops, plus \(N_c^2 n_x^2 n_u + 2 N_c n_x^3\) flops
to compute Γx, Γu and Γb.
Let Np denote the horizon length of the partially condensed MPC (or MHE)
problem. If Np is an exact divisor of N , then each stage of the new MPC
problem (in the following, partially condensed MPC problem) is obtained by
condensing Nc = N/Np stages of the original MPC problem. In this case, all
stages of the partially condensed MPC problem have an input vector with equal
size Ncnu, where nu is the input vector size of the original MPC problem. On the
contrary, if Np is not an exact divisor of N , then diﬀerent stages of the partially
condensed MPC problem can have diﬀerent input vector size. Therefore MPC
solvers supporting stage-variant nu and nx are needed to solve the partially
condensed MPC problem.
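To make the stage-variant case concrete, the sketch below distributes N stages over Np blocks and returns the resulting input vector sizes. The even split, with the larger blocks placed first, is only an assumption made for this illustration (the text does not prescribe how the stages are distributed), and the function names block_sizes and condensed_input_sizes are hypothetical.

```python
def block_sizes(N, Np):
    """Split N stages into Np consecutive blocks whose sizes differ by at most one.

    Assumption of this sketch: the N % Np larger blocks come first.
    """
    base, extra = divmod(N, Np)
    return [base + 1 if i < extra else base for i in range(Np)]

def condensed_input_sizes(N, Np, nu):
    """Input vector size of each stage of the partially condensed problem."""
    return [nc * nu for nc in block_sizes(N, Np)]

sizes_divisor = condensed_input_sizes(20, 4, 1)      # Np divides N: equal sizes
sizes_non_divisor = condensed_input_sizes(20, 3, 1)  # Np does not divide N
```

When Np divides N all entries are equal to Ncnu, otherwise the entries differ and a solver supporting stage-variant sizes is needed.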
Assuming that Nc is an integer divisor of N and that N = Nc · Np, the par-
tially condensed MPC (or MHE) problem has Np stages. If the MPC (or MHE)
problem is time-invariant, then the cost function matrices of the partially con-
densed problem are time-invariant as well, and therefore the partial condensing
algorithm needs to be executed only once. If the MPC (or MHE) problem is
time-variant, then the partial condensing algorithm needs to be executed Np
times, once per block.
10.2 Choice of Np
This section presents a method to ﬁnd the theoretically best choice for the trade-
oﬀ between horizon length and input vector size. The analysis assumes that the
partially condensed MPC problem is solved by means of the backward Riccati
recursion presented in Section 8.1. Only the MPC case is considered, since the MHE
case is analogous.
The computational cost of the backward Riccati recursion is of
\[
\left( \tfrac{1}{3} n_x^3 \right)
+ (N-1) \left( \tfrac{7}{3} n_x^3 + 4 n_x^2 n_u + 2 n_x n_u^2 + \tfrac{1}{3} n_u^3 \right)
+ \left( n_x^2 n_u + n_x n_u^2 + \tfrac{1}{3} n_u^3 \right) \tag{10.3}
\]
flops, where terms due to the last, middle and first stages are grouped together.
The following analysis requires some assumptions.
Assumption 1 N is large, such that the computational cost can be approximated as
\[
N \left( \tfrac{7}{3} n_x^3 + 4 n_x^2 n_u + 2 n_x n_u^2 + \tfrac{1}{3} n_u^3 \right)
= n_x^3 N \left( \frac{1}{3 m^3} + \frac{2}{m^2} + \frac{4}{m} + \frac{7}{3} \right)
\]
where
\[
m = \frac{n_x}{n_u}
\]
is the ratio between the number of states and the number of inputs.
Assumption 2 The horizon length and the input vector size can be traded
off continuously. This means that the horizon length in the partially condensed
problem can be chosen as \(N_p = N / \alpha\) for some \(\alpha \in \mathbb{R}\), \(\alpha > 0\), and consequently
the input vector size is \(\alpha n_u\). Therefore, the computational cost to solve the
partially condensed problem using the backward Riccati recursion is of
\[
n_x^3 \frac{N}{\alpha} \left( \frac{1}{3 m^3} \alpha^3 + \frac{2}{m^2} \alpha^2 + \frac{4}{m} \alpha + \frac{7}{3} \right)
\]
flops, and it is a function of α. Notice that α > 1 corresponds to a reduction
of the horizon length (the case covered here), while α < 1 corresponds to an
increase of the horizon length (the case not covered here). The minimizer of the
function can be found by setting its derivative to zero. For α > 0, the minimum
of this function is attained for
\[
\alpha = 0.94224\, m \quad \Rightarrow \quad N_p \approx \frac{N}{0.94224\, m} \approx \frac{N}{m}
\]
and it is a function of the ratio m only. In particular, for nx ≈ nu, there is no
advantage in reducing or augmenting the horizon length.
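As a sketch of where the factor 0.94224 comes from: dividing the cost in Assumption 2 by \(n_x^3 N\), substituting \(\alpha = c\,m\) and setting the derivative with respect to α to zero gives the cubic equation \(2c^3 + 6c^2 - 7 = 0\). The Python snippet below finds its positive root by bisection and evaluates the real-valued Np for the illustrative values N = 20, nx = 4, nu = 1. The function name optimal_ratio and the bracketing interval are choices made for this example only.

```python
def optimal_ratio(tol=1e-12):
    """Positive root c of 2 c^3 + 6 c^2 - 7 = 0, such that alpha = c * m."""
    def f(c):
        return 2 * c**3 + 6 * c**2 - 7

    lo, hi = 0.0, 2.0  # f(0) = -7 < 0 and f(2) = 33 > 0 bracket the root
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

c = optimal_ratio()

# Illustrative problem sizes
N, nx, nu = 20, 4, 1
m = nx / nu
alpha = c * m          # optimal (real-valued) block size
Np_real = N / alpha    # optimal (real-valued) horizon length
```

Since the cubic does not depend on m, the ratio α/m is the same for every problem, which is why the optimal trade-off depends on m only.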
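The approximate maximum flops reduction factor in solving the MPC problem
is
\[
f(m) =
\frac{ n_x^3 \dfrac{N}{\alpha} \left( \dfrac{1}{3 m^3} \alpha^3 + \dfrac{2}{m^2} \alpha^2 + \dfrac{4}{m} \alpha + \dfrac{7}{3} \right) }
     { n_x^3 N \left( \dfrac{1}{3 m^3} + \dfrac{2}{m^2} + \dfrac{4}{m} + \dfrac{7}{3} \right) }
= \frac{8.6568\, m^2}{ \tfrac{7}{3} m^3 + 4 m^2 + 2 m + \tfrac{1}{3} }
\]
and it is a function of m only.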
The function f(m) is plotted for 0.01 < m < 100 in Figure 10.1. Since α ≈ m,
the maximum flop reduction is obtained by reducing the horizon length for
m > 1 and by increasing it for m < 1. Furthermore, the maximum flop reduction
is rather small for nu ≈ nx, and it becomes significant only if nx and nu are of rather
different size.
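The closed-form expression of f(m) can be evaluated directly. The short Python sketch below (the function name flop_reduction is hypothetical) computes it for a few ratios: a value close to 1 means that no reduction is possible, while smaller values correspond to larger savings.

```python
def flop_reduction(m):
    """Approximate maximum flop reduction factor f(m), with m = nx / nu."""
    return 8.6568 * m**2 / (7 / 3 * m**3 + 4 * m**2 + 2 * m + 1 / 3)

f_equal = flop_reduction(1.0)         # nx = nu
f_more_states = flop_reduction(4.0)   # e.g. nx = 4, nu = 1
f_more_inputs = flop_reduction(0.25)  # e.g. nx = 1, nu = 4
```

For m = 1 the factor is within about 0.2% of 1, in agreement with the remark that the reduction is small when nx and nu have similar size.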
In practice, the best value of Np is affected by the value of N, since a small value
of N limits the number of possible choices for Np. Furthermore, the different
performance of the linear algebra routines for different matrix sizes affects the
practical reduction in computational time, as shown in the next section.
[Figure 10.1: plot of the maximum flop reduction factor (vertical axis, from 0 to 1) against log10(m) = log10(nx / nu) (horizontal axis, from -2 to 2).]
Figure 10.1: Maximum flop reduction factor in choosing the optimal N vs nu
trade-off.
10.3 Influence of linear algebra routines performance
This section investigates the inﬂuence of linear algebra routines performance on
the optimal choice for Np.
The performance of linear algebra routines generally depends on implementation
choices and matrix sizes, as widely shown in Part I of this thesis. In particular,
for linear algebra routines in HPMPC, the performance increases quickly for ma-
trix size up to about 20-50, and then it stabilizes close to the peak performance
for matrices of size up to about 300-400, before slightly decreasing due to lack
of blocking for cache. The exact intervals depend on the ISA and on the speciﬁc
linear algebra routine. Routines generally can attain only a small fraction of
the peak throughput for very small matrices, limiting the solver performance in
case of small-scale MPC problems. Partial condensing is therefore expected to
further improve performance beyond the ﬂop reduction, due to the use of larger
matrices.
Figure 10.2 investigates this through an example. The test processor is the Intel
Core i7-4800MQ often considered in this thesis, with compiler gcc. Two MPC
problems are considered, one small with nx = 4 and nu = 1, and one large with
nx = 40 and nu = 10. Both problems have the same horizon length N = 20,
and the same ratio m = nx/nu = 4. For these values, the analysis in Section
10.2 gives an optimal (real) value of Np equal to 5.3, and a flop reduction factor
of 0.625. These values are valid for both the small-scale and the large-scale
problems.
These findings are first compared with the reduction in the flop count for the
MPC problems for different values of Np when Assumptions 1 and 2 are dropped.
In case Np is not an integer divisor of N, the input size can be different at
different stages of the partially condensed MPC problem: the backward Riccati
recursion implementation is designed to handle values of nx and nu varying
stage-wise. It is clear that Figures 10.2a and 10.2b are identical, besides the fact
that the y axis is scaled exactly by 1000 = 10^3 (the computational cost in (10.3)
is cubic in nx and nu, and the values of the two problems are scaled by a factor
10). The optimal value of Np is found to be 4, for a reduction factor in the
flop count of 0.559. Starting from an Np value equal to N = 20 on the right of
the plots, the flop count decreases steadily as Np decreases until it reaches a
minimum. For smaller values of Np, the flop count quickly increases, and for
Np = 1 it is larger than the one of the original MPC problem with N = 20.
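The exact flop count can be reproduced by evaluating (10.3) on the partially condensed problem. The Python sketch below does so for the small-scale problem, under two simplifying assumptions of this example: only values of Np that divide N are considered (so that all stages have input size (N/Np)nu), and only the cost of the backward Riccati recursion is counted, not the cost of the condensing itself. The function names riccati_flops and reduction_factors are hypothetical.

```python
def riccati_flops(N, nx, nu):
    """Flop count (10.3) of the backward Riccati recursion."""
    last = nx**3 / 3
    middle = 7 / 3 * nx**3 + 4 * nx**2 * nu + 2 * nx * nu**2 + nu**3 / 3
    first = nx**2 * nu + nx * nu**2 + nu**3 / 3
    return last + (N - 1) * middle + first

def reduction_factors(N, nx, nu):
    """Flop count ratio w.r.t. the original problem, for every Np dividing N."""
    base = riccati_flops(N, nx, nu)
    return {Np: riccati_flops(Np, nx, (N // Np) * nu) / base
            for Np in range(1, N + 1) if N % Np == 0}

ratios = reduction_factors(20, 4, 1)
best_Np = min(ratios, key=ratios.get)
```

With these assumptions the minimum is found at Np = 4 with a factor of about 0.559, and the factor for Np = 1 is larger than 1, matching the values quoted in the text.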
The second row of figures represents the running times, when the linear algebra
routines are provided by the C99 target ISA in HPMPC. The size of the gemm
kernel is 4 × 4. For the large scale problem, the plot in Figure 10.2d closely
resembles the figures from the flop count. However, for the small scale problem,
the plot in Figure 10.2c is rather different: the best value for Np is 2, and for
Np = 1 the solution time is much lower than that of the original MPC problem. This
is due to the fact that for such a small problem size, there is a big advantage
in replacing many operations on very small matrices (smaller than the dgemm
kernel size) with fewer operations on larger matrices, where linear algebra
routines can attain a better performance.
This trend becomes even clearer in the last row of figures, where the linear
algebra routines are provided by the AVX2 target ISA in HPMPC. The optimal
kernel size is 12 × 4, and therefore larger matrices are needed to obtain high
performance. For the large scale problem, the plot in Figure 10.2f somewhat
resembles the figures from the flop count, but the optimal Np value is 3, and for
Np = 1 the solution time is still lower than that of the original MPC problem (despite
the larger number of flops). For the small scale problem, the plot in Figure
10.2e shows a rather steady decrease in the solution time as Np decreases, with
the optimal value attained for Np = 1 (corresponding to a reduction factor in solution
time of 0.234, despite an increase in the flop count with respect to the original
problem).
[Figure 10.2, six panels, each plotted against Np (horizontal axis, from 0 to 20) for N = 20:
(a) Small-scale (nx = 4, nu = 1), flops (vertical axis in units of 10^3 flops).
(b) Large-scale (nx = 40, nu = 10), flops (vertical axis in units of 10^6 flops).
(c) Small-scale, time [µs], C99 ISA.
(d) Large-scale, time [µs], C99 ISA.
(e) Small-scale, time [µs], AVX2 ISA.
(f) Large-scale, time [µs], AVX2 ISA.]
Figure 10.2: Flop count / solution time for the reformulations of two MPC
problems (small scale one on the left, large scale one on the right),
obtained using partial condensing for different values of Np (Np =
20 is fully sparse, Np = 1 is fully condensed). The flop/time value
of the original MPC problem is represented using the red line.
10.4 Conclusion
Partial condensing is a technique that gives a reformulation of the MPC problem,
decreasing the horizon length at the expense of an increase in the input vector
size. It has been proposed in [18], together with sequential update, consisting in
decreasing the input vector size at the expense of increasing the horizon length.
In this chapter, only partial condensing has been considered. By choosing as a
solver for the unconstrained MPC problem the backward Riccati recursion in
Section 8.1, a simpliﬁed analysis shows that the reduction in the ﬂop count is
a function only of the ratio m = nx/nu, and gives a simple formula to compute
the theoretical optimal horizon length Np and the corresponding ﬂop reduction
factor. Numerical tests conﬁrm the validity of the analysis, in the case of linear
algebra routines giving steady performance for all matrix sizes of interest.
In practice, in case of small matrices the performance of linear algebra routines
increases with the matrix size, and therefore for small-scale MPC problems the
optimal value of Np is smaller than the one minimizing the ﬂop count, since the
use of larger matrices boosts performance. In this sense, partial condensing is an
extremely interesting technique, since it makes it easier to achieve high FP throughput
and to obtain high performance also in case of small-scale MPC problems.
Chapter 11
Unconstrained MPC problems with time-invariant matrices
This chapter presents tailored algorithms for the solution of unconstrained MPC
problems where all matrices of cost function and state-space system are time-
invariant, while all vectors are time variant, in the special case where the arrival
cost is given by the solution of the Discrete Algebraic Riccati Equation (DARE).
This problem formulation is a special instance of (7.1), and therefore it can be
solved using e.g. the backward Riccati recursion presented in Section 8.1 in time
O(N(nx + nu)^3), or one of the condensing methods presented in Section 9.1.
Generally speaking, since all matrices are time-invariant, the KKT matrix can
be factorized oﬀ-line. However, the fact that the Hessian of the arrival cost is
the solution of the DARE can be exploited to factorize the KKT matrix in time
constant in N. This reduces the solution time to O(N(nx + nu)^2). Furthermore,
the memory usage of the solution algorithms can be greatly reduced, further
improving performance of the level 2 BLAS linear algebra routines employed in
the solution algorithm.
Formally, the results of this chapter could be extended to the MHE problem.
However, it does not make sense to consider a MHE problem with Q0 = Q
and QN = P∞. Therefore, the MHE problem is not further mentioned in this
chapter.
11.1 Problem formulation
The unconstrained MPC problem with Time-Invariant Matrices (MPC-TIM) is
deﬁned as
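\begin{align}
\min_{u,x} \quad & \sum_{n=0}^{N-1} \left( \frac{1}{2}
\begin{bmatrix} u_n \\ x_n \end{bmatrix}^T
\begin{bmatrix} R & S \\ S^T & Q \end{bmatrix}
\begin{bmatrix} u_n \\ x_n \end{bmatrix}
+
\begin{bmatrix} r_n \\ q_n \end{bmatrix}^T
\begin{bmatrix} u_n \\ x_n \end{bmatrix} \right)
+ \frac{1}{2} x_N^T P_\infty x_N + q_N^T x_N \tag{11.1a} \\
\text{s.t.} \quad & x_{n+1} = A x_n + B u_n + b_n \tag{11.1b} \\
& x_0 = \hat{x}_0 \tag{11.1c}
\end{align}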
where all matrices in the dynamical system and the cost function are time-invariant,
while the vectors can be time-variant. The optimization is performed
over the finite horizon N, and
\[
\begin{bmatrix} R & S \\ S^T & Q \end{bmatrix} \geq 0, \quad R > 0 \quad \text{and} \quad P \geq 0.
\]
The Hessian of the arrival cost is initialized to the solution P∞ of the DARE,
\[
P_\infty = Q + A^T P_\infty A - (S^T + A^T P_\infty B)(R + B^T P_\infty B)^{-1}(S + B^T P_\infty A). \tag{11.2}
\]
11.2 Motivation
Problems in the form of the MPC-TIM (11.1) are encountered in a number
of situations. The choice P = P∞ is either required by or compatible with
all these cases, and it is the key to improve the performance of the solution
methods.
11.2.1 Linear time-invariant control problems
Problems in the form (11.1) arise in the (unconstrained) predictive control of
linear systems [52]. In this class of problems, the state-space and the cost
function matrices are time-invariant. However, the vectors can be time-variant.
The terms qn and rn can e.g. be employed to predictively handle changes in the
set point. The term bn can be employed to predictively handle changes in known
disturbances, as bn = Edn. Alternatively, the term bn arises in the control of
state space systems in innovation form [53], as b0 = Ken.
11.2.2 Sub-problem in splitting methods for linear MPC
The linear time-invariant (finite horizon) MPC problem
\begin{align}
\min_{u,x} \quad & \sum_{n=0}^{N-1} \left( \tfrac{1}{2} x_n^T Q x_n + \tfrac{1}{2} u_n^T R u_n \right) + \tfrac{1}{2} x_N^T P_\infty x_N \tag{11.3a} \\
\text{s.t.} \quad & x_{n+1} = A x_n + B u_n \tag{11.3b} \\
& x_0 = \hat{x}_0 \tag{11.3c} \\
& C_x x_n \leq c_x \tag{11.3d} \\
& C_u u_n \leq c_u \tag{11.3e}
\end{align}
can be solved by applying splitting methods such as the Alternating Minimization
Algorithm (AMA) to the dual of the problem [82]. At each iteration of the
splitting method, the input un and state xn sequences are obtained by solving
a problem in the form
\begin{align}
\min_{u,x} \quad & \sum_{n=0}^{N-1} \left( \tfrac{1}{2} u_n^T R u_n + \tfrac{1}{2} x_n^T Q x_n -
\begin{bmatrix} C_x x_n - c_x \\ C_u u_n - c_u \end{bmatrix}^T \lambda_n \right)
+ \tfrac{1}{2} x_N^T P_\infty x_N \tag{11.4a} \\
\text{s.t.} \quad & x_{n+1} = A x_n + B u_n \tag{11.4b} \\
& x_0 = \hat{x}_0 \tag{11.4c}
\end{align}
(i.e. in the form of the MPC-TIM (11.1)) where λn are not optimization
variables in (11.4). The advantage of the use of AMA over other splitting
methods such as the Alternating Direction Method of Multipliers (ADMM) is
that the Hessian of the cost function in (11.4) is the same as in the original MPC
problem, and therefore P is still initialized to the solution of the DARE.
11.2.3 Sub-problem in splitting methods for constrained LQR
In the papers [80, 81] the authors propose to solve the (infinite horizon) Constrained
Linear-Quadratic Regulator (CLQR) problem
\begin{align}
\min_{u,x} \quad & \sum_{n=0}^{\infty} \tfrac{1}{2} x_n^T Q x_n + \tfrac{1}{2} u_n^T R u_n \tag{11.5a} \\
\text{s.t.} \quad & x_{n+1} = A x_n + B u_n \tag{11.5b} \\
& x_0 = \hat{x}_0 \tag{11.5c} \\
& C_x x_n \leq c_x \tag{11.5d} \\
& C_u u_n \leq c_u \tag{11.5e}
\end{align}
by applying splitting methods such as Alternating Minimization Algorithm
(AMA) and Forward-Backward Splitting (FBS) to the dual of the problem. Under
assumptions clearly stated in the papers, the proposed algorithmic scheme
requires a finite amount of computations: if the un and xn sequences last hit a
constraint at stage T^k, the sequence of the Lagrangian multipliers associated
with the inequality constraints is zero for all stages n > T^k.
For n > T^k the problem reduces to the unconstrained LQR [56], and the optimal
input sequence can be computed as the time-invariant state feedback
\[
u_n = K_\infty x_n
\]
where in this case the LQ-gain is
\[
K_\infty = -(R + B^T P_\infty B)^{-1} (B^T P_\infty A)
\]
and P∞ is the solution of the Discrete Algebraic Riccati Equation (DARE)
associated with the matrices A, B, Q, R:
\[
P_\infty = Q + A^T P_\infty A - (B^T P_\infty A)^T (R + B^T P_\infty B)^{-1} (B^T P_\infty A).
\]
For n ≤ T^k, the input un and state xn sequences are obtained by solving a
problem in the form
\begin{align}
\min_{u,x} \quad & \sum_{n=0}^{N-1} \left( \tfrac{1}{2} u_n^T R u_n + \tfrac{1}{2} x_n^T Q x_n -
\begin{bmatrix} C_x x_n - c_x \\ C_u u_n - c_u \end{bmatrix}^T \lambda_n \right)
+ \tfrac{1}{2} x_N^T P_\infty x_N \tag{11.6a} \\
\text{s.t.} \quad & x_{n+1} = A x_n + B u_n \tag{11.6b} \\
& x_0 = \hat{x}_0 \tag{11.6c}
\end{align}
(that is formally identical to (11.4)) at each iteration of the splitting method,
where λn are not optimization variables in (11.6). The key difference with
respect to the MPC case (i.e. the case with finite horizon) is that the horizon
length N = T^k can increase with the iteration count k of the splitting method.
If the assumption P = P∞ is dropped, then for general problems in the form (11.1)
the factorization of the sparse KKT system or of the condensed Hessian needs to be
updated whenever the horizon length N increases, or computed off-line for a
sufficiently long horizon N. However, the fact that P = P∞ in (11.6) can be
exploited to completely avoid the factorization of the sparse KKT matrix or of
the condensed Hessian.
11.3 Sparse formulation
If problem (11.1) is considered in the sparse formulation where both states x
and inputs u are retained as optimization variables, then the natural choice is
its solution using the backward Riccati recursions. In fact, the backward Riccati
recursion can be seen as a stage-wise factorization of the KKT matrix [70].
The key idea is that the backward Riccati recursion
\[
P_n = Q + A^T P_{n+1} A - (S + B^T P_{n+1} A)^T (R + B^T P_{n+1} B)^{-1} (S + B^T P_{n+1} A)
\]
reduces to the DARE (11.2) if P = P∞. This means that all blocks of the
stage-wise factorization of the KKT matrix are constant over the stages, and
therefore can be computed oﬀ-line. As a further advantage, only the matrices of
one stage need to be stored, making the memory requirement for the factorized
KKT matrix constant with respect to the horizon length N . This fact can be
exploited to design algorithms to solve the problem (11.1) without the need to
explicitly build or factorize the sparse KKT matrix (or condensed Hessian, in
the next section).
Since the linear terms in the dynamic equations and cost function are time-variant,
a backward vector recursion has to be performed on-line. The optimal
value of un is therefore computed as the affine state feedback
\[
u_n = K_\infty x_n + k_n
\]
where K∞ is the LQ-gain, computed off-line using the P∞ matrix as
\[
K_\infty = -(R + B^T P_\infty B)^{-1} (S + B^T P_\infty A), \tag{11.7}
\]
while kn is computed on-line as
\[
k_n = -(R + B^T P_\infty B)^{-1} (r_n + B^T (P_\infty b_n + p_{n+1})),
\]
and pn is computed by means of the backward recursion
\[
p_n = q_n + A^T (P_\infty b_n + p_{n+1}) + K_\infty^T (r_n + B^T (P_\infty b_n + p_{n+1})).
\]
The algorithm is presented in Algorithm 30. The computational cost of the
algorithm is of
\[
N (6 n_x^2 + 8 n_x n_u + 2 n_u^2)
\]
flops, that can be reduced by \(2 n_x^2\) flops if bn is time invariant and the product
P∞ · b is computed off-line. Besides the storage of the un and xn sequences, the
working memory requirements for the algorithm are only the Nnu elements of
the kn sequence.
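Algorithm 30 Riccati algorithm for the MPC-TIM problem (11.1)
Require:
\( P_\infty \)
\( D_\infty^{-1} = (R + B^T \cdot P_\infty \cdot B)^{-1} \)
\( K_\infty = -D_\infty^{-1} \cdot (S + B^T \cdot P_\infty \cdot A) \)
1: \( p \leftarrow p_N \)
2: for \( n \leftarrow N-1, \ldots, 0 \) do
3: \( p \leftarrow P_\infty \cdot b_n + p \)
4: \( t \leftarrow r_n + B^T \cdot p \)
5: \( p \leftarrow q_n + A^T \cdot p + K_\infty^T \cdot t \)
6: \( k_n \leftarrow -D_\infty^{-1} \cdot t \)
7: end for
8: for \( n \leftarrow 0, \ldots, N-1 \) do
9: \( u_n \leftarrow K_\infty \cdot x_n + k_n \)
10: \( x_{n+1} \leftarrow A \cdot x_n + B \cdot u_n + b_n \)
11: end for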
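The following Python sketch illustrates the structure of Algorithm 30 on a small example. It is not the HPMPC implementation: to stay dependency-free it assumes a single input (nu = 1), so that D∞ is a scalar and B, S and K∞ are stored as vectors of length nx. The matrix P∞ is obtained here by simply iterating the Riccati recursion to a fixed point, which is an assumption of this example (the text does not prescribe how the DARE is solved). All function names and data are hypothetical.

```python
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(X, v):
    return [sum(a * b for a, b in zip(row, v)) for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dare_single_input(A, B, Q, R, S, iters=500):
    """Off-line part: iterate the Riccati recursion (convergence is assumed)."""
    n, P = len(Q), [row[:] for row in Q]
    for _ in range(iters + 1):
        PA = matmul(P, A)
        D = R + dot(B, matvec(P, B))                              # R + B^T P B
        M = [s + m for s, m in zip(S, matvec(transpose(PA), B))]  # S + B^T P A
        P_new = matmul(transpose(A), PA)
        P_new = [[Q[i][j] + P_new[i][j] - M[i] * M[j] / D for j in range(n)]
                 for i in range(n)]
        P_prev, P = P, P_new
    return P_prev, D, M  # D and M are consistent with the returned P

def solve_mpc_tim(A, B, P, D, M, q, r, b, qN, x0):
    """On-line part (Algorithm 30 with nu = 1): backward, then forward sweep."""
    N, At = len(r), transpose(A)
    K = [-m / D for m in M]                  # LQ-gain, stored as a vector
    p, k = list(qN), [0.0] * N
    for n in reversed(range(N)):
        p = [a + c for a, c in zip(matvec(P, b[n]), p)]    # p <- P b_n + p
        t = r[n] + dot(B, p)                               # t <- r_n + B^T p
        p = [qi + ai + Ki * t for qi, ai, Ki in zip(q[n], matvec(At, p), K)]
        k[n] = -t / D
    x, us = list(x0), []
    for n in range(N):
        u = dot(K, x) + k[n]                               # affine feedback
        x = [ax + Bi * u + bi for ax, Bi, bi in zip(matvec(A, x), B, b[n])]
        us.append(u)
    return us

def cost(us, A, B, Q, R, S, P, q, r, b, qN, x0):
    """Value of the cost function (11.1a) for a given input sequence."""
    x, J = list(x0), 0.0
    for n, u in enumerate(us):
        J += (0.5 * R * u * u + u * dot(S, x) + 0.5 * dot(x, matvec(Q, x))
              + r[n] * u + dot(q[n], x))
        x = [ax + Bi * u + bi for ax, Bi, bi in zip(matvec(A, x), B, b[n])]
    return J + 0.5 * dot(x, matvec(P, x)) + dot(qN, x)

# Illustrative data: double integrator with time-variant vectors
N = 5
A, B = [[1.0, 1.0], [0.0, 1.0]], [0.5, 1.0]
Q, R, S = [[1.0, 0.0], [0.0, 1.0]], 1.0, [0.1, 0.0]
q = [[0.1 * n, -0.2] for n in range(N)]
r = [0.3] * N
b = [[0.0, 0.05 * n] for n in range(N)]
qN, x0 = [0.5, 0.0], [1.0, -1.0]

P, D, M = dare_single_input(A, B, Q, R, S)
us = solve_mpc_tim(A, B, P, D, M, q, r, b, qN, x0)
J_opt = cost(us, A, B, Q, R, S, P, q, r, b, qN, x0)
```

Only matrix-vector operations appear in the on-line part, and the same P∞, D∞ and K∞ are reused at every stage. A simple sanity check is that perturbing any single input of the returned sequence increases the cost.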
11.4 Condensed formulation
Condensing techniques have been widely used to speed up the solution of MPC
problems, by reducing the size of the corresponding optimization problem. Namely
they exploit the system dynamic equation (11.1b) to rewrite future states xn as a
function of the initial state x̂0 (datum) and of the future inputs un (retained
as optimization variables). The output of the condensing process is a dense
positive-definite system of equations, that has traditionally been considered
unstructured and solved by means of Cholesky factorization and forward-backward
substitution, at a cost O(N^3 nu^3).
The condensing methods presented in Section 9.1 could be employed to solve the
MPC-TIM problem (11.1). However, the special structure of the problem can be
exploited to reduce the computational cost, similarly to the sparse formulation
case.
Namely, the relation between the backward Riccati recursion and the factoriza-
tion of the condensed Hessian matrix in Algorithm 16 suggests that also in the
condensed case the factorization of the entire Hessian matrix can be reduced to
the factorization of a single stage.
When the condensed Hessian factorization algorithm 16 is combined with the
solution algorithm 17, it is possible to avoid altogether the build of the condensed
Hessian matrix or its Cholesky factor.
Therefore, Algorithm 17 can be tailored to the solution of systems of linear
equations in the form
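\[
H \bar{u} = \bar{g} \tag{11.8}
\]
where H is the condensed Hessian matrix associated with problem (11.1) and
\[
\bar{u} =
\begin{bmatrix} u_0 \\ u_1 \\ \vdots \\ u_{N-1} \end{bmatrix},
\qquad
\bar{g} =
\begin{bmatrix} g_0 \\ g_1 \\ \vdots \\ g_{N-1} \end{bmatrix}.
\]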
The fact that in this special case the backward Riccati recursion reduces to
the DARE (11.2) implies that there is no need to explicitly build or factorize
the Hessian matrix H. In fact, all quantities needed in the backward-forward
recursions in Algorithm 17 can be computed oﬀ-line using the P∞ matrix.
The resulting algorithm is presented in Algorithm 31. Line 1 means that the
computations can be carried out in place, overwriting the ḡ vector with the
solution vector ū. The computational cost of the algorithm is of
\[
N (4 n_x^2 + 8 n_x n_u + 2 n_u^2)
\]
flops. Besides the storage of the un and xn sequences and the problem data, the
algorithm requires a constant amount of working memory (with respect to the
horizon length N), equal to the nx elements of the working vector t.
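Algorithm 31 Condensing algorithm for the MPC-TIM (11.1)
Require:
\( M_\infty = S + B^T \cdot P_\infty \cdot A \)
\( D_\infty^{-1} = (R + B^T \cdot P_\infty \cdot B)^{-1} \)
\( K_\infty = -D_\infty^{-1} \cdot M_\infty \)
1: \( \bar{u} \leftarrow \bar{g} \)
2: \( u_{N-1} \leftarrow -D_\infty^{-1} \cdot u_{N-1} \)
3: \( t \leftarrow M_\infty^T \cdot u_{N-1} \)
4: \( u_{N-2} \leftarrow -D_\infty^{-1} \cdot (u_{N-2} + B^T \cdot t) \)
5: for \( n \leftarrow N-3, \ldots, 0 \) do
6: \( t \leftarrow A^T \cdot t + M_\infty^T \cdot u_{n+1} \)
7: \( u_n \leftarrow -D_\infty^{-1} \cdot (u_n + B^T \cdot t) \)
8: end for
9: \( t \leftarrow B \cdot u_0 \)
10: \( u_1 \leftarrow u_1 + K_\infty \cdot t \)
11: for \( n \leftarrow 2, \ldots, N-1 \) do
12: \( t \leftarrow A \cdot t + B \cdot u_{n-1} + b_n \)
13: \( u_n \leftarrow u_n + K_\infty \cdot t \)
14: end for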
11.5 Implementation aspects
From an implementation point of view, both Algorithms 30 and 31 are very
eﬃcient. Both algorithms are implemented using the gemv (general matrix-
vector multiplication) and dsymv (symmetric matrix-vector multiplication) rou-
tines (that are part of level 2 BLAS).
Generally speaking, as shown in the ﬁrst part of this thesis, the performance
of level 2 BLAS routines heavily depends on the memory footprint of the data,
due to the lack of reuse of matrix elements (or limited reuse, in case of dsymv).
The performance gets particularly poor if the memory footprint of the control
problem data is large enough to exceed cache size, as data needs to be streamed
from main memory.
Since in the special case analyzed in the current chapter all matrices in the fac-
torized KKT matrix are time invariant, the amount of memory needed to store
them is constant (i.e. does not increase with N), since the same matrices are
reused at all stages. This greatly reduces the memory footprint, meaning that
the gemv routine needed in the implementation can attain its best performance.
[Figure 11.1, two panels showing the performance in Gflops (vertical axis, from 0 to 50) of the "standard ric" and "tailored ric" solvers:
(a) as a function of nx (from 0 to 300), with nu = nx/2 and N = 30.
(b) as a function of N (from 0 to 300), with nx = 40 and nu = 20.]
Figure 11.1: Performance plot for Algorithm 17 (blue) and Algorithm 30 (red)
in the solution of problem (11.1) once the KKT matrix has been
factorized. Test performed on an Intel Core i7 4800MQ processor,
gcc compiler, AVX2 target ISA in HPMPC.
Figure 11.1 compares the computational performance (in Gflops) of the standard
(Algorithm 17) and the tailored (Algorithm 30) Riccati solvers for the solution
of the KKT system once it has been factorized. Therefore, this shows the
performance advantages given by the reduced memory footprint, besides the
reduction in the factorization time. Figure 11.1a shows that the performance
of the tailored version is higher for large nx and nu, since the matrices of only
one stage are needed, thus they can grow larger and still fit in cache. Figure
11.1b shows that the performance of the tailored version does not decrease as N
increases, since the same matrices are reused at all stages (making their memory
footprint constant with respect to N). This makes this algorithm particularly
favorable in case of long horizons.
The results are analogous for the tailored condensing Algorithm 31, and therefore
they are not repeated.
11.6 Conclusion
This chapter describes solution methods tailored for the MPC-TIM problem
(11.1), that is a special instance of the unconstrained MPC problem (7.1) where
all matrices are time-invariant, the arrival cost matrix is initialized to the solu-
tion of the DARE, while all vectors can in general be time-variant.
The choice of P implies that the KKT matrix can be factorized in time constant
in N, since the backward Riccati recursion reduces to the DARE, and therefore,
if the Riccati recursion is used to factorize the KKT matrix, all matrices
of the factorized KKT matrix are constant stage-wise. The optimal input is
computed as an aﬃne state feedback where the time-invariant feedback matrix
is the LQ-gain, while the feedback vector is time-varying. Since problems in
the form (11.1) arise as sub-problems in splitting methods for the CLQR, this
implies that the optimal input sequence of the CLQR can be obtained as an
aﬃne state feedback where the feedback matrix is still the LQ-gain.
From an implementation point of view, the fact that the matrices of the factor-
ized KKT matrix are constant stage-wise greatly reduces the required amount
of memory, since only the matrices of one stage are needed. As a further advan-
tage, this means that the problem data can ﬁt in cache for much larger values
of N (since only vectors require a diﬀerent instance per each stage), allowing
the level 2 BLAS routines to perform well.
As a ﬁnal note, there is no need to update the KKT matrix factor as the hori-
zon length increases in splitting methods for the CLQR, making the developed
algorithms a perfect ﬁt for solving the sub-problems in these methods.
Part III
Algorithms for Constrained
and Non-Linear MPC

Chapter 12
Constrained MPC problem formulations
Part III deals with the solution of constrained and non-linear MPC and MHE
problems. This last part is much shorter than the ﬁrst two, and it does not
contain novel optimization algorithms for optimal control. The algorithms pre-
sented in this part are only meant to be examples of the use of the eﬃcient
routines developed in the previous parts of the thesis. The fact that the solvers
presented in this section can outperform state-of-the-art solvers for MPC is ul-
timately due to the superior performance of the routines for linear algebra and
unconstrained MPC and MHE problems.
As a diﬀerence with respect to part II, in part III only algorithms for MPC
are explicitly considered. This choice is justiﬁed by the fact that algorithms for
MHE are essentially identical, since the speciﬁc features of MHE are exploited
at the level of the unconstrained sub-problems.
12.1 Linear MPC problem
The linear MPC problem with soft box constraints and hard general constraints
is the quadratic program
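\begin{align}
\min_{u,x,s} \quad & \sum_{n=0}^{N-1} \frac{1}{2}
\begin{bmatrix} u_n \\ x_n \\ 1 \end{bmatrix}^T
\begin{bmatrix} R_n & S_n & r_n \\ S_n^T & Q_n & q_n \\ r_n^T & q_n^T & \rho_n \end{bmatrix}
\begin{bmatrix} u_n \\ x_n \\ 1 \end{bmatrix}
+ \frac{1}{2}
\begin{bmatrix} x_N \\ 1 \end{bmatrix}^T
\begin{bmatrix} Q_N & q_N \\ q_N^T & \rho_N \end{bmatrix}
\begin{bmatrix} x_N \\ 1 \end{bmatrix} + \tag{12.1a} \\
& + \sum_{n=0}^{N} \frac{1}{2}
\begin{bmatrix} s_n \\ 1 \end{bmatrix}^T
\begin{bmatrix} Z_n & z_n \\ z_n^T & 0 \end{bmatrix}
\begin{bmatrix} s_n \\ 1 \end{bmatrix} \tag{12.1b} \\
\text{s.t.} \quad & x_{n+1} = A_n x_n + B_n u_n + b_n \tag{12.1c} \\
& x_0 = \hat{x}_0 \tag{12.1d} \\
& u^l_{n,i} \leq u_{n,i} \leq u^u_{n,i} \tag{12.1e} \\
& x^l_{n,i} \leq x_{n,i} \leq x^u_{n,i} \tag{12.1f} \\
& x^l_{n,i} \leq x_{n,i} + s^l_{n,i} \tag{12.1g} \\
& x_{n,i} - s^u_{n,i} \leq x^u_{n,i} \tag{12.1h} \\
& s^l_{n,i}, \; s^u_{n,i} \geq 0 \tag{12.1i} \\
& d^l_{0,i} \leq D_{0,i} u_0 \leq d^u_{0,i} \tag{12.1j} \\
& d^l_{n,i} \leq C_{n,i} x_n + D_{n,i} u_n \leq d^u_{n,i} \tag{12.1k} \\
& d^l_{N,i} \leq C_{N,i} x_N \leq d^u_{N,i} \tag{12.1l}
\end{align}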
where the exponents l and u indicate the quantities associated with the lower
and upper constraints, and i is the index of the constraint. The vector sn is
the vector of the slack variables associated with the soft constraints, and it
contains the slack variables associated with the lower soft constraint (sln), and
the slack variables associated with the upper soft constraint (sun). There are
both a quadratic and a linear penalty term on the slack variables of the soft
constraints.
In this definition, only the box constraints on states are softened: this keeps
the notation somewhat lighter, and the extension to soft general polytopic
constraints is straightforward.
Notice that, unlike the unconstrained MPC problem in 7.1, there are no
additional equality constraints at the last stage. If needed, equality
constraints can be modeled as box constraints with equal lower and upper
bounds, and processed directly as inequality constraints.
As an example, the IPM implemented in the HPMPC library is an infeasible-start
variant of Mehrotra's predictor-corrector method, which can handle this class
of problems that are not strictly feasible with respect to the inequality
constraints.
12.1.1 Matrix formulation
In order to simplify the notation let us assume that all inputs are hard bounded,
and all states are soft bounded.
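The linear MPC problem can be reformulated in the matrix form
\begin{equation}
\begin{aligned}
\min_{\bar y} \quad & \frac{1}{2} \bar y^T \bar H \bar y + \bar g^T \bar y + \gamma \\
\text{s.t.} \quad & \bar A \bar y = \bar b \\
& \bar C \bar y \geq \bar d
\end{aligned}
\tag{12.2}
\end{equation}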
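where, in terms of the slack, input and state vectors, the problem reads
\begin{equation*}
\begin{aligned}
\min_{\bar s,\bar u,\bar x} \quad & \frac{1}{2}
\begin{bmatrix} \bar s \\ \bar u \\ \bar x \end{bmatrix}^T
\begin{bmatrix} \bar Z & & \\ & \bar R & \bar S \\ & \bar S^T & \bar Q \end{bmatrix}
\begin{bmatrix} \bar s \\ \bar u \\ \bar x \end{bmatrix}
+
\begin{bmatrix} \bar z \\ \bar r \\ \bar q \end{bmatrix}^T
\begin{bmatrix} \bar s \\ \bar u \\ \bar x \end{bmatrix} \\
\text{s.t.} \quad &
\begin{bmatrix} 0 & \bar B & \bar A \end{bmatrix}
\begin{bmatrix} \bar s \\ \bar u \\ \bar x \end{bmatrix} = \bar b \\
&
\begin{bmatrix} & \bar D_b & \\ I_{2N_x} & & \bar C_b \\ & \bar D_g & \bar C_g \end{bmatrix}
\begin{bmatrix} \bar s \\ \bar u \\ \bar x \end{bmatrix}
\geq
\begin{bmatrix} \bar d_{b,u} \\ \bar d_{b,x} \\ \bar d_g \end{bmatrix}
\end{aligned}
\end{equation*}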
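where in turn (for N = 3) $\bar u$, $\bar x$, $\bar A$, $\bar B$, $\bar b$, $\bar Q$, $\bar R$, $\bar S$, $\bar q$, $\bar r$
are defined in 7.3, $I_{2N_x}$ is the identity matrix of size $2Nn_x$ and
\begin{equation}
\begin{gathered}
\bar s = \begin{bmatrix} s^l_1 \\ s^l_2 \\ s^l_3 \\ s^u_1 \\ s^u_2 \\ s^u_3 \end{bmatrix}, \quad
\bar Z = \begin{bmatrix}
Z^l_1 & & & & & \\
 & Z^l_2 & & & & \\
 & & Z^l_3 & & & \\
 & & & Z^u_1 & & \\
 & & & & Z^u_2 & \\
 & & & & & Z^u_3
\end{bmatrix}, \\
\bar C_b = \begin{bmatrix}
I_x & & \\ & I_x & \\ & & I_x \\ -I_x & & \\ & -I_x & \\ & & -I_x
\end{bmatrix}, \quad
\bar d_{b,x} = \begin{bmatrix} x^l_1 \\ x^l_2 \\ x^l_3 \\ -x^u_1 \\ -x^u_2 \\ -x^u_3 \end{bmatrix}, \quad
\bar D_b = \begin{bmatrix}
I_u & & \\ & I_u & \\ & & I_u \\ -I_u & & \\ & -I_u & \\ & & -I_u
\end{bmatrix}, \quad
\bar d_{b,u} = \begin{bmatrix} u^l_0 \\ u^l_1 \\ u^l_2 \\ -u^u_0 \\ -u^u_1 \\ -u^u_2 \end{bmatrix}, \\
\bar D_g = \begin{bmatrix}
D_0 & & \\ & D_1 & \\ & & D_2 \\ 0 & 0 & 0 \\ -D_0 & & \\ & -D_1 & \\ & & -D_2 \\ 0 & 0 & 0
\end{bmatrix}, \quad
\bar C_g = \begin{bmatrix}
0 & 0 & 0 \\ C_1 & & \\ & C_2 & \\ & & C_3 \\ 0 & 0 & 0 \\ -C_1 & & \\ & -C_2 & \\ & & -C_3
\end{bmatrix}, \quad
\bar d_g = \begin{bmatrix} d^l_0 \\ d^l_1 \\ d^l_2 \\ d^l_3 \\ -d^u_0 \\ -d^u_1 \\ -d^u_2 \\ -d^u_3 \end{bmatrix}.
\end{gathered}
\tag{12.3}
\end{equation}
The upper bounds enter with a negative sign because each upper constraint is
rewritten in the form $\bar C \bar y \geq \bar d$ of (12.2); for example,
(12.1h) becomes $-x_{n,i} + s^u_{n,i} \geq -x^u_{n,i}$.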
The matrices Ix and Iu are the identity matrices of size nx and nu respectively.
The matrices of the inequality constraints are highly structured, and this struc-
ture should be exploited in the solvers.
12.1.2 Optimality conditions
The KKT optimality conditions for the constrained MPC problem can be found
e.g. in [31].
Introducing the slack variables $\bar t = \bar C \bar y - \bar d \geq 0$, the KKT conditions for (12.2)
are
\begin{align}
\bar H \bar y + \bar g - \bar A^T \bar\pi - \bar C^T \bar\lambda &= 0 \tag{12.4a} \\
\bar A \bar y - \bar b &= 0 \tag{12.4b} \\
\bar C \bar y - \bar d - \bar t &= 0 \tag{12.4c} \\
\bar\lambda^T \bar t &= 0 \tag{12.4d} \\
(\bar\lambda, \bar t) &\geq 0 \tag{12.4e}
\end{align}
where $\bar\pi$ are the Lagrangian multipliers of the equality constraints and $\bar\lambda$ are the
Lagrangian multipliers of the inequality constraints. Equations (12.4a)-(12.4d)
form a system of non-linear equations $f(\bar y, \bar\pi, \bar\lambda, \bar t) = 0$ due to the non-linear
equation (12.4d). The solution of this system of equations is made harder by the
constraint on the sign of Lagrangian multipliers and slack variables in (12.4e).
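The conditions (12.4) can be checked by hand on a problem small enough to solve mentally. The sketch below, a hypothetical helper that is not part of any library, evaluates the KKT residuals for the scalar QP min 0.5*h*y^2 + g*y subject to c*y >= d, which has no equality constraints, so (12.4b) and the multiplier of the equality constraints do not appear.

```python
def kkt_residuals(h, g, c, d, y, lam):
    """Residuals of (12.4) for min 0.5*h*y**2 + g*y s.t. c*y >= d (scalars)."""
    t = c * y - d                              # slack variable, so (12.4c) holds
    return {
        "stationarity": h * y + g - c * lam,   # (12.4a), no equality constraints
        "complementarity": lam * t,            # (12.4d)
        "sign_ok": lam >= 0.0 and t >= 0.0,    # (12.4e)
    }


# The unconstrained minimizer is y = 0, so the bound y >= 1 is active.
active = kkt_residuals(h=1.0, g=0.0, c=1.0, d=1.0, y=1.0, lam=1.0)
# With the bound y >= -1 the unconstrained minimizer is optimal and lam = 0.
inactive = kkt_residuals(h=1.0, g=0.0, c=1.0, d=-1.0, y=0.0, lam=0.0)
# A feasible point that is not optimal: stationarity is violated.
wrong = kkt_residuals(h=1.0, g=0.0, c=1.0, d=1.0, y=2.0, lam=0.0)
```

In both optimal cases the product of multiplier and slack is zero, either because the slack vanishes (active constraint) or because the multiplier vanishes (inactive constraint); this either-or structure is what makes (12.4d) non-linear.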
Chapter 13
Solution of sub-problems in
linear MPC and MHE
problems
The algorithms for unconstrained MPC and MHE problems can be used as
routines in optimization algorithms for constrained and nonlinear MPC and
MHE problems.
Since the computationally most expensive part of optimization algorithms for
constrained MPC and MHE problems is typically the solution of systems of
linear equations in the form of unconstrained MPC and MHE problems, the use
of tailored algorithms for these problems can greatly reduce the solution time.
The algorithms presented in this chapter are a far-from-complete selection of
the optimization algorithms commonly employed in MPC. These algorithms employ
the backward Riccati recursion presented in Section 8.1 for the solution of the
sub-problems in the form of unconstrained MPC problems. The backward Riccati
recursion has been chosen over other solution methods, such as the forward
Schur complement recursion or the various condensing methods, since it provides
reasonably good performance over a wide range of problem sizes. Other solution
methods can perform better in special cases: condensing methods if nx is much
larger than Nnu, and the forward Schur complement recursion if there are only
equality constraints on the last stage and no inequality constraints, since
this algorithm can handle that case directly.
Furthermore, partial condensing is brieﬂy investigated as a technique to further
decrease the solution time of some optimization algorithms.
13.1 Interior-point methods
Interior-point methods (IPM) are second-order optimization methods. The
methods in this class, which also contains Active-Set (AS) methods [28, 29],
have been widely used in optimal control, since they use second-order
information to converge to the solution in few iterations.
In the case of IPM, the number of iterations is largely unaffected by the
problem instance: if well initialized, these methods typically converge in 8-15
iterations. Each iteration is computationally rather heavy, since computing the
search direction requires the factorization and solution of a system of linear
equations. The factorization makes use of level 3 BLAS and therefore requires a
number of flops that is cubic in the input and state size. The use of IPM to
solve linear MPC problems is not new [70, 62, 26, 75], but it is somewhat in
contrast with the current trend in linear MPC toward the use of first-order
methods.
In the MPC framework, Riccati-based IPM have been proposed in [70]. As a
solver for the unconstrained MPC problem, the Riccati recursion gives a reason-
ably good performance for a wide range of problem sizes [31]: in particular, the
solution time scales linearly with the horizon length. Therefore, a Riccati-based
IPM should reasonably give a good performance for a wide range of problem
sizes.
13.1.1 Basics about interior-point methods
IPM have been widely employed in the solution of quadratic programs. They
arise from the application of the Newton method to the system of nonlinear
equations given by the KKT conditions (12.4a)-(12.4d), while enforcing the sign
conditions in (12.4e). This section summarizes some basics about IPM. A
detailed presentation can be found e.g. in [92].
IPMs often relax the equation (12.4d) as
\begin{equation}
\bar\lambda^T \bar t = \tau_k \tag{13.1}
\end{equation}
with the parameter $\tau_k$ going to zero as the iteration count k increases. Different
IPMs primarily differ in the choice of $\tau_k$.
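Given an initial guess $(\bar y_0, \bar\pi_0, \bar\lambda_0, \bar t_0)$, at each iteration k primal-dual IPMs
attempt to solve (12.4) by performing steps
\begin{equation*}
(\bar y_{k+1}, \bar\pi_{k+1}, \bar\lambda_{k+1}, \bar t_{k+1}) = (\bar y_k, \bar\pi_k, \bar\lambda_k, \bar t_k)
+ \alpha (\Delta\bar y_{\rm aff}, \Delta\bar\pi_{\rm aff}, \Delta\bar\lambda_{\rm aff}, \Delta\bar t_{\rm aff})
\end{equation*}
in the direction given by the Newton step, found by solving the linear system
\begin{equation*}
\nabla f(\bar y_k, \bar\pi_k, \bar\lambda_k, \bar t_k)
(\Delta\bar y_{\rm aff}, \Delta\bar\pi_{\rm aff}, \Delta\bar\lambda_{\rm aff}, \Delta\bar t_{\rm aff})
= -f(\bar y_k, \bar\pi_k, \bar\lambda_k, \bar t_k)
\end{equation*}
that takes the form
\begin{equation}
\begin{bmatrix}
\bar H & -\bar A^T & -\bar C^T & \\
\bar A & & & \\
\bar C & & & -I \\
 & & \bar T_k & \bar\Lambda_k
\end{bmatrix}
\begin{bmatrix} \Delta\bar y_{\rm aff} \\ \Delta\bar\pi_{\rm aff} \\ \Delta\bar\lambda_{\rm aff} \\ \Delta\bar t_{\rm aff} \end{bmatrix}
= -
\begin{bmatrix}
\bar H \bar y_k - \bar A^T \bar\pi_k - \bar C^T \bar\lambda_k + \bar g \\
\bar A \bar y_k - \bar b \\
\bar C \bar y_k - \bar t_k - \bar d \\
\bar\Lambda_k \bar T_k e - \tau_k
\end{bmatrix}
\tag{13.2}
\end{equation}
where $\bar\Lambda_k$ and $\bar T_k$ are the diagonal matrices with $\bar\lambda_k$ and $\bar t_k$ on the diagonal, and
e is a vector of ones. Notice that the step length $\alpha$ has to be chosen such that
the condition (12.4e) is strictly satisfied at each iteration.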
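The system of equations (13.2) can be rewritten in the form
\begin{equation}
\begin{bmatrix}
\bar H & -\bar A^T & -\bar C^T & \\
\bar A & & & \\
\bar C & & & -I \\
 & & \bar T_k & \bar\Lambda_k
\end{bmatrix}
\begin{bmatrix} \bar y_{\rm aff} \\ \bar\pi_{\rm aff} \\ \bar\lambda_{\rm aff} \\ \bar t_{\rm aff} \end{bmatrix}
= -
\begin{bmatrix} \bar g \\ -\bar b \\ -\bar d \\ -\bar\Lambda_k \bar T_k e - \tau_k \end{bmatrix}
\tag{13.3}
\end{equation}
and the search direction can then be computed as
\begin{equation}
(\Delta\bar y_{\rm aff}, \Delta\bar\pi_{\rm aff}, \Delta\bar\lambda_{\rm aff}, \Delta\bar t_{\rm aff})
= (\bar y_{\rm aff}, \bar\pi_{\rm aff}, \bar\lambda_{\rm aff}, \bar t_{\rm aff}) - (\bar y_k, \bar\pi_k, \bar\lambda_k, \bar t_k).
\tag{13.4}
\end{equation}
The last row of the right-hand side in (13.3) follows from moving the term
$\bar T_k \bar\lambda_k + \bar\Lambda_k \bar t_k = 2 \bar\Lambda_k \bar T_k e$ of (13.2) to the right-hand side.
From a computational point of view, this formulation has a lower complexity:
the computation of the right-hand side in (13.2) requires the computation
of the residuals (which has a cost quadratic in nx and nu), while in (13.3) and
(13.4) the computation of the right-hand side, and of the step given the affine
iterate, has a cost linear in nx and nu. For small-scale systems, the
computation of the residuals has a cost comparable to the factorization of the
KKT matrix of the unconstrained problem.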
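The system in (13.3) can be rewritten in the form
\begin{equation}
\begin{bmatrix}
\bar H + \bar C^T \bar T_k^{-1} \bar\Lambda_k \bar C & -\bar A^T \\
-\bar A &
\end{bmatrix}
\begin{bmatrix} \bar y_{\rm aff} \\ \bar\pi_{\rm aff} \end{bmatrix}
= -
\begin{bmatrix}
\bar g - \bar C^T (\bar\lambda_k + \bar T_k^{-1} \bar\Lambda_k \bar d + \bar T_k^{-1} \tau_k) \\
\bar b
\end{bmatrix}
\tag{13.5}
\end{equation}
that is the KKT system of an equality-constrained QP, where the cost
function is updated to take into account the constraints.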
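To make the steps above concrete, the following sketch applies a primal-dual IPM to the scalar QP min 0.5*h*y^2 + g*y subject to y >= d. It is only an illustration under simplifying assumptions: there are no equality constraints, the relaxation target is chosen as a fixed fraction `sigma` of the current complementarity product (a plain path-following rule, not the Mehrotra predictor-corrector scheme used in HPMPC), and the function name and the damping factor 0.995 are choices made for this example. The elimination of the steps in the slack and in the multiplier mirrors the reduction that leads to (13.5).

```python
def solve_bound_qp(h, g, d, sigma=0.1, tol=1e-10, max_iter=50):
    """Primal-dual IPM for min 0.5*h*y**2 + g*y subject to y >= d (scalars)."""
    y, lam, t = 0.0, 1.0, 1.0           # infeasible start, but lam > 0 and t > 0
    for _ in range(max_iter):
        r_dual = h * y + g - lam         # stationarity residual, cf. (12.4a)
        r_prim = y - d - t               # slack definition residual, cf. (12.4c)
        mu = lam * t                     # complementarity product, cf. (12.4d)
        if max(abs(r_dual), abs(r_prim), mu) < tol:
            break
        tau = sigma * mu                 # relaxed complementarity target, cf. (13.1)
        r_comp = mu - tau
        # Eliminate dt and dlam: one condensed scalar equation in dy, cf. (13.5)
        dy = (-r_dual - (r_comp + lam * r_prim) / t) / (h + lam / t)
        dt = dy + r_prim
        dlam = -(r_comp + lam * dt) / t
        # Largest step (damped by 0.995) keeping lam and t strictly positive
        alpha = 1.0
        for value, step in ((lam, dlam), (t, dt)):
            if step < 0.0:
                alpha = min(alpha, -0.995 * value / step)
        y, lam, t = y + alpha * dy, lam + alpha * dlam, t + alpha * dt
    return y, lam, t


# The unconstrained minimizer y = 0 violates y >= 1: expect y = 1 and lam = 1.
y_opt, lam_opt, t_opt = solve_bound_qp(h=1.0, g=0.0, d=1.0)
```

The coefficient `h + lam / t` is the scalar counterpart of the updated Hessian in (13.5): the inequality constraint only changes the cost function of an otherwise unconstrained problem.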
13.1.2 Interior-point methods for the linear MPC problem
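The use of IPM in the solution of linear MPC problems has been discussed in
[70]. In this section, some of the basics are reported.
In the case of the linear MPC problem in (12.2), the system of equations (13.5)
takes the form
\begin{equation*}
\begin{bmatrix}
\bar Z + \bar\Gamma_{x,k} & & \bar\Gamma_{x,k} \bar C_b & \\
 & \bar R + \bar D_b^T \bar\Gamma_{u,k} \bar D_b + \bar D_g^T \bar\Gamma_{g,k} \bar D_g & \bar S + \bar D_g^T \bar\Gamma_{g,k} \bar C_g & -\bar B^T \\
\bar C_b^T \bar\Gamma_{x,k} & \bar S^T + \bar C_g^T \bar\Gamma_{g,k} \bar D_g & \bar Q + \bar C_b^T \bar\Gamma_{x,k} \bar C_b + \bar C_g^T \bar\Gamma_{g,k} \bar C_g & -\bar A^T \\
 & -\bar B & -\bar A &
\end{bmatrix}
\begin{bmatrix} \bar s_{\rm aff} \\ \bar u_{\rm aff} \\ \bar x_{\rm aff} \\ \bar\pi_{\rm aff} \end{bmatrix}
= -
\begin{bmatrix}
\bar z - \bar\gamma_{x,k} \\
\bar r - \bar D_b^T \bar\gamma_{u,k} - \bar D_g^T \bar\gamma_{g,k} \\
\bar q - \bar C_b^T \bar\gamma_{x,k} - \bar C_g^T \bar\gamma_{g,k} \\
\bar b
\end{bmatrix}
\end{equation*}
where
\begin{align*}
\bar\Gamma_{u,k} &= (\bar T_u)_k^{-1} (\bar\Lambda_u)_k \\
\bar\Gamma_{x,k} &= (\bar T_x)_k^{-1} (\bar\Lambda_x)_k \\
\bar\Gamma_{g,k} &= (\bar T_g)_k^{-1} (\bar\Lambda_g)_k \\
\bar\gamma_{u,k} &= (\bar\lambda_u)_k + \bar\Gamma_{u,k} \bar d_u + (\bar T_u)_k^{-1} \tau_k \\
\bar\gamma_{x,k} &= (\bar\lambda_x)_k + \bar\Gamma_{x,k} \bar d_x + (\bar T_x)_k^{-1} \tau_k \\
\bar\gamma_{g,k} &= (\bar\lambda_g)_k + \bar\Gamma_{g,k} \bar d_g + (\bar T_g)_k^{-1} \tau_k.
\end{align*}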
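Since the diagonal matrix $\bar Z + \bar\Gamma_{x,k}$ is positive definite at each IPM iteration
k, the Schur complement can be used to eliminate the $\bar s_{\rm aff}$ variable, obtaining the
system
\begin{equation}
\begin{bmatrix}
\bar R + \bar D_b^T \bar\Gamma_{u,k} \bar D_b + \bar D_g^T \bar\Gamma_{g,k} \bar D_g & \bar S + \bar D_g^T \bar\Gamma_{g,k} \bar C_g & -\bar B^T \\
\bar S^T + \bar C_g^T \bar\Gamma_{g,k} \bar D_g & \bar Q + \bar C_b^T \tilde\Gamma_{x,k} \bar C_b + \bar C_g^T \bar\Gamma_{g,k} \bar C_g & -\bar A^T \\
-\bar B & -\bar A &
\end{bmatrix}
\begin{bmatrix} \bar u_{\rm aff} \\ \bar x_{\rm aff} \\ \bar\pi_{\rm aff} \end{bmatrix}
= -
\begin{bmatrix}
\bar r - \bar D_b^T \bar\gamma_{u,k} - \bar D_g^T \bar\gamma_{g,k} \\
\bar q - \bar C_b^T \tilde\gamma_{x,k} - \bar C_g^T \bar\gamma_{g,k} \\
\bar b
\end{bmatrix}
\tag{13.6}
\end{equation}
where
\begin{equation}
\begin{aligned}
\tilde\Gamma_{x,k} &= \bar\Gamma_{x,k} - \bar\Gamma_{x,k} (\bar Z + \bar\Gamma_{x,k})^{-1} \bar\Gamma_{x,k} \\
\tilde\gamma_{x,k} &= \bar\gamma_{x,k} + \bar\Gamma_{x,k} (\bar Z + \bar\Gamma_{x,k})^{-1} (\bar z - \bar\gamma_{x,k})
\end{aligned}
\tag{13.7}
\end{equation}
The minus sign in $\tilde\Gamma_{x,k}$ is the one of the Schur complement: the first block row gives
$\bar s_{\rm aff} = -(\bar Z + \bar\Gamma_{x,k})^{-1} (\bar\Gamma_{x,k} \bar C_b \bar x_{\rm aff} + \bar z - \bar\gamma_{x,k})$, which is then inserted in the
third block row.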
The take-home idea about the use of IPMs for MPC problems is that system
(13.6) is the KKT system of a linear-quadratic control problem (7.1). Therefore
the search direction at each IPM iteration can be found by means of a solver
for problem (7.1).
Note that the constraint matrices $\bar C_b$ and $\bar D_b$ of the box constraints are highly
structured and very sparse: this can be exploited to compute the update to the
$\bar R$, $\bar Q$ matrices and the $\bar r$, $\bar q$ vectors at a cost O(N nb), linear in the number
of box constraints. In case of general polytopic constraints, the update to the
matrices $\bar R$, $\bar S$ and $\bar Q$ can be computed at a cost O(N ng (nu + nx)^2), while the
update to the vectors $\bar r$ and $\bar q$ can be computed at a cost O(N ng (nu + nx)).
The only extra cost due to the introduction of soft constraints is the computation
of the matrix $\tilde\Gamma_{x,k}$ and of the vector $\tilde\gamma_{x,k}$ in (13.7), requiring O(N ns) flops, linear
in the number of soft constraints.
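The elimination of the slack variables can be verified numerically on a single soft constraint. The sketch below assumes one state component with C_b = 1, collects all other contributions to the state Hessian in a hypothetical scalar `q_hat`, and compares the result of the scalar Schur complement with a direct solution of the full 2 x 2 system by Cramer's rule. All names and numbers are illustrative assumptions.

```python
def eliminate_soft_slack(Z, Gam, q_hat, z, gam, q):
    """Scalar Schur complement eliminating the slack s of one soft constraint.

    Full system (one state component, C_b = 1):
        [Z + Gam   Gam        ] [s]     [z - gam]
        [Gam       q_hat + Gam] [x] = - [q - gam]
    """
    pivot = Z + Gam                           # positive, since Z >= 0 and Gam > 0
    hess = (q_hat + Gam) - Gam * Gam / pivot  # reduced Hessian seen by x
    rhs = -(q - gam) + Gam * (z - gam) / pivot
    x = rhs / hess
    s = -(z - gam + Gam * x) / pivot          # slack recovered afterwards
    return x, s


def solve_full_system(Z, Gam, q_hat, z, gam, q):
    """Reference solution of the same 2 x 2 system by Cramer's rule."""
    a11, a12, a21, a22 = Z + Gam, Gam, Gam, q_hat + Gam
    b1, b2 = -(z - gam), -(q - gam)
    det = a11 * a22 - a12 * a21
    s = (b1 * a22 - a12 * b2) / det
    x = (a11 * b2 - a21 * b1) / det
    return x, s


values = dict(Z=100.0, Gam=2.0, q_hat=1.0, z=10.0, gam=3.0, q=0.5)
x_red, s_red = eliminate_soft_slack(**values)
x_ref, s_ref = solve_full_system(**values)
```

Since the pivot is diagonal, each soft constraint is eliminated with a fixed number of scalar operations, independently of all the others, which is consistent with the linear cost stated above.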
13.1.3 Interior-point methods implementation choices
The IPM considered in this thesis is the infeasible start Mehrotra's predictor-
corrector IPM [64], proposed in the MPC context e.g. in [70, 26].
The computation of the search direction in (13.6) is the key step in IPMs, and
it typically requires most of the computation time. In the implemented
Riccati-based IPM, the prediction direction is computed by solving a system in
the form (7.1) by means of the backward Riccati recursion presented in detail
in Section 8.1: the cost of this step is therefore cubic in the number of
optimization variables per stage and linear in the horizon length. The
correction direction is computed reusing the KKT matrix factorized while
computing the prediction direction: the cost of this step is quadratic in the
number of optimization variables per stage and linear in the horizon length.
All other operations in the IPM have a cost linear in both the number of
optimization variables per stage and the horizon length.
Therefore, for large-scale problems the cost of each IPM iteration is dominated
by the cost of computing the prediction search direction. However, for the
small-scale problems typical of MPC, the quadratic and linear cost terms
account for a considerable fraction of the IPM iteration cost. Therefore,
particular care is taken in their implementation, by reducing memory movements
and by using vector instructions when supported by the hardware. The use of
faster single-precision division instructions can speed up the computation of
the step length α, if exact line search is employed in place of backtracking.
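The exact computation of the step length mentioned above amounts to a ratio test over the multipliers and the slack variables. The following sketch shows one way to write it; the function name and the test data are assumptions for this example, and a practical solver would typically also damp the result by a factor slightly smaller than one, so that the iterate stays strictly positive.

```python
def max_step_length(values, steps, alpha_max=1.0):
    """Largest alpha in [0, alpha_max] with values[i] + alpha * steps[i] >= 0."""
    alpha = alpha_max
    for v, dv in zip(values, steps):
        if dv < 0.0:                     # only decreasing components can reach zero
            alpha = min(alpha, -v / dv)  # one division per such component
    return alpha


# Hypothetical multipliers, slacks and their search directions.
lam = [1.0, 0.5, 2.0]
dlam = [-2.0, 0.1, -1.0]
t = [0.2, 1.0, 3.0]
dt = [-0.1, -4.0, 0.5]

alpha = max_step_length(lam + t, dlam + dt)
```

The test requires one division for each decreasing component, which explains why the speed of the division instruction matters for small problems.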
Generally, IPM can converge to a solution in few iterations, if well
initialized. Warm starting is not particularly beneficial: on the contrary, it
can considerably increase the number of iterations in case of large
disturbances. Therefore, as a general rule, cold starting is preferred in case
of IPM. Even if infeasible-start IPM can converge also if the initial guess is
infeasible, a good choice of the initial guess can reduce the number of
iterations. In particular, a good initial guess lies away from the constraints,
close to the central path [92]. Furthermore,
the slack variables and the associated Lagrangian multipliers should be chosen
such that the initial value of the duality measure µ0 is not too small compared
to the largest Lagrangian multiplier at the solution. Empirically, if the MPC
problem is well scaled, a good choice for µ0 is the largest absolute value of the
elements in the cost function matrices. In particular, this is found to work well
also in the soft-constrained case, where typically the penalty on the slack vari-
ables of the soft constraints is set to very large values. With this choice, the
number of iterations in the soft-constrained case is only slightly larger than in
the box-constrained case.
As a ﬁnal note, the backward Riccati recursion routines presented in Section 8.1
can deal with input and state vector sizes varying at each stage, and the IPM
is implemented in the same fashion. This allows the use of partial condensing
as a preparation step before the IPM for all possible values of the horizon Np
of the partially condensed MPC problem.
13.1.4 Partial condensing for linear MPC problems
This section investigates the effect of partial condensing on the performance
of solvers for the linear MPC problem (12.1). The presentation is limited to
the case of bounds in the original MPC problem, the extension to general
polytopic constraints being straightforward.
Partial condensing can be employed to reformulate an MPC problem into another
one with shorter horizon length and larger input vector size. This is
analogous to the case of the unconstrained MPC problem (7.1), with the
difference that in the case of problem (12.1) also the constraints need to be
partially condensed.
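The change in problem size produced by partial condensing can be sketched as follows. The code assumes that the N stages are grouped into Np consecutive blocks of nearly equal length, with the longer blocks placed first; this splitting rule and the function name are assumptions made for illustration, and the rule actually used in a given implementation may differ.

```python
def partial_condensing_sizes(N, nx, nu, Np):
    """Per-stage (state size, input size) after grouping N stages into Np blocks."""
    base, extra = divmod(N, Np)
    # The first `extra` blocks take one stage more, so block lengths sum to N
    block_lengths = [base + 1 if i < extra else base for i in range(Np)]
    # Each block keeps its first state (size nx) and stacks all its inputs
    return [(nx, length * nu) for length in block_lengths]


sizes = partial_condensing_sizes(N=10, nx=8, nu=3, Np=4)
```

For Np = N the original problem is recovered, while for Np = 1 all inputs are stacked in a single stage, as in full condensing.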
13.1.4.1 Condensing of bounds on states
The bounds on states that are condensed become general polytopic constraints
on the inputs and remaining states. Two cases can be distinguished: all states
are condensed (called in the following 'MPC case', since it is analogous to the
condensing of an MPC problem), or all states are condensed except the first one
(called in the following 'MHE case', since it is analogous to the condensing of
an MHE problem).
Let us consider a state constraint in the form
\begin{equation}
\bar x^l \leq \bar C \bar x \leq \bar x^u \tag{13.8}
\end{equation}
where the matrix $\bar C$ describes the constraint shape. In case of bounds on all
state components, the matrix $\bar C$ is the identity: this case will be assumed in the
following, the extension to the general case being straightforward.
In the remainder of Section 13.1.4.1 it is assumed that the MPC and MHE
problems to be condensed have horizon length N, state vector size nx and input
vector size nu, and therefore these quantities do not refer to the size of the MPC
problem to be partially condensed.
MPC case. Inserting the expression for $\bar x$ in (9.1), equation (13.8) becomes
\begin{equation*}
\bar x^l - \Gamma_{x,b} \leq \Gamma_u \bar u \leq \bar x^u - \Gamma_{x,b}
\end{equation*}
that can be considered as a general polytopic constraint on the inputs. The
matrix $\Gamma_u$ has size $Nn_x \times Nn_u$, and therefore if it is processed as the matrix of
the general polytopic constraints in an IPM (i.e. $\bar D_g = \Gamma_u$), the update of the
condensed Hessian matrix $H_R$ at each iteration of the IPM
\begin{equation*}
H_R + \bar D_g^T \bar\Gamma_{g,k} \bar D_g
\end{equation*}
(where the matrix $\bar\Gamma_{g,k}$ is diagonal) can be computed in about $N^3 n_x n_u^2$ flops.
The IPM could be modified to exploit the structure of the $\Gamma_u$ matrix. In this
case, the update can be computed in $\frac{1}{3} N^3 n_x n_u^2$ flops by exploiting the fact that
$\Gamma_u$ is block lower triangular, similarly to the computation of the term $\Gamma_u^T \bar Q \Gamma_u$ in
Algorithm 9. Alternatively, the fact that $\Gamma_u = \bar A^{-1} \bar B$ can be exploited similarly
to Algorithms 10 and 11.
If the condensing algorithm is employed as a routine in partial condensing, the
bound on the last state is not condensed, and therefore the matrix $\Gamma_u$ has size
$(N-1)n_x \times Nn_u$. This slightly reduces the computational cost.
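The transformation of state bounds into constraints on the inputs can be illustrated on a scalar system. The sketch below assumes the dynamics x_{n+1} = a*x_n + b*u_n + c with a known initial state, builds the lower triangular matrix that maps the inputs to the states, and shifts the bounds by the free response. The names `gamma_u` and `free` play the role of the matrix and of the offset in the condensed constraint above, but the code is an independent illustration and not a transcription of the algorithms referenced in the text.

```python
def condense_state_bounds(a, b, c, x0, x_lower, x_upper):
    """Turn bounds on x_1..x_N into row bounds on gamma_u @ u (scalar system).

    Dynamics: x_{n+1} = a*x_n + b*u_n + c, so that x = gamma_u u + free,
    where `free` is the response to x0 and c alone.
    """
    N = len(x_lower)
    # gamma_u is lower triangular: x_{i+1} depends on u_j only for j <= i
    gamma_u = [[a ** (i - j) * b if j <= i else 0.0 for j in range(N)]
               for i in range(N)]
    free, x = [], x0
    for _ in range(N):
        x = a * x + c
        free.append(x)
    lower = [lo - f for lo, f in zip(x_lower, free)]
    upper = [up - f for up, f in zip(x_upper, free)]
    return gamma_u, free, lower, upper


a, b, c, x0 = 0.5, 2.0, 1.0, 4.0
u = [1.0, -1.0, 0.5]
gamma_u, free, lower, upper = condense_state_bounds(a, b, c, x0, [-4.0] * 3, [4.0] * 3)

# States predicted through the condensed map ...
gu = [sum(g * uj for g, uj in zip(row, u)) for row in gamma_u]
x_condensed = [v + f for v, f in zip(gu, free)]
# ... and by simulating the dynamics directly.
x_simulated, x = [], x0
for un in u:
    x = a * x + b * un + c
    x_simulated.append(x)
```

Here the first state exceeds its upper bound, and the same violation appears in the first row of the condensed constraint, since `gu[0]` is larger than `upper[0]`. The zeros above the diagonal of `gamma_u` are the structure that a tailored IPM could exploit.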
MHE case. Inserting the expression for $\bar x$ in (9.22), equation (13.8) becomes
\begin{equation*}
\bar x^l - \Gamma_b \leq \Gamma_v \bar v \leq \bar x^u - \Gamma_b
\end{equation*}
that can be considered as a general polytopic constraint on the inputs and the
first state. The matrix $\Gamma_v$ has size $Nn_x \times (n_x + Nn_u)$, and therefore if it is
processed as the matrix of the general polytopic constraints in an IPM (i.e.
$\bar D_g = \Gamma_v$), the update of the $H_R$ matrix at each iteration of the IPM
\begin{equation*}
H_R + \bar D_g^T \bar\Gamma_{g,k} \bar D_g
\end{equation*}
(where the matrix $\bar\Gamma_{g,k}$ is diagonal) can be computed in about
$N n_x^3 + 2 N^2 n_x^2 n_u + N^3 n_x n_u^2$ flops.
Also in the MHE case the IPM could be modified to exploit the structure of
the $\Gamma_u$ matrix. In this case, the update can be computed in
$N n_x^3 + N^2 n_x^2 n_u + \frac{1}{3} N^3 n_x n_u^2$ flops by exploiting the fact that $\Gamma_u$ is block lower
triangular, similarly to the computation of the term $\Gamma_u^T \bar Q \Gamma_u$ in Algorithm 20.
Alternatively, the fact that $\Gamma_u = \bar A^{-1} \bar B$ can be exploited similarly to Algorithms
21 and 22.
If the condensing algorithm is employed as a routine in partial condensing, the
bound on the last state is not condensed, and therefore the matrix $\Gamma_u$ has size
$(N-1)n_x \times (n_x + Nn_u)$. This slightly reduces the computational cost.
13.1.5 Comparison of solvers for linear MPC problems
This section presents the results of some tests for the IPM solver part of the
HPMPC library. The predictor and corrector search directions are computed
using the Riccati solvers in Algorithm 1 and Algorithm 3, tested in Section 8.3.
The IPM solver can be called through two interfaces: a low-level interface, and
a high-level interface that is a wrapper around the low-level one (see Figure
1.1).
The low-level interface exposes all features of the IPM solver, and gives the best
performance. It requires all matrices to be stored using the panel-major matrix
format in Figure 3.1: this avoids the cost of converting them from row-major
or column-major order on-line. The data matrices are passed to the solver
as double** type, and this can be exploited to reduce memory usage in case
of time-invariant systems or cost functions. All data matrices and vectors are
assumed to be properly aligned in memory to SIMD or cache line boundaries:
these, together with the panel width bs, are architecture-dependent quantities.
The high-level interface is a wrapper hiding all architecture-dependent details
and converting data from row-major or column-major order into the panel-major
format. Therefore it adds the overhead of this on-line conversion to the
solver execution time. Furthermore, the ability to reduce memory usage in case
of time-invariant data is lost. In fact, the data matrices are passed to this
interface as double* type, i.e. as a sequence of matrices, with the matrices for
the different stages following each other (i.e. as produced using the repmat
function in Matlab). This interface is designed to simplify the integration of
the solver into existing software, at the cost of some performance penalty.
Unless diﬀerently stated, all tests in the remainder of the chapter are performed
on the laptop equipped with the Intel Core i7 4800MQ already considered in
this thesis. The linear algebra is given by the AVX2 target in HPMPC.
13.1.5.1 Box-constrained linear MPC problems
In this section the performance of the proposed IPM solver is compared against
the successful IPM solver FORCES [26] and the IPM solver in FORCES_Pro
[8] in the solution of a box-constrained linear MPC problem.
The test problem is the linear mass-spring system often used as a benchmark
for linear solvers in MPC [89]. Namely, the tests performed in [26] (where the
FORCES solver is shown to outperform by a wide margin the solvers CVXGEN
[62], CPLEX [7] and MA57 [27] in this benchmark) are repeated. The only
exception is the largest problem: the horizon length had to be decreased from
30 to 15 in case of the FORCES solver and to 10 in case of the FORCES_Pro
solver (for larger test problems the server running the code generation returns
a timeout error). Therefore, the values reported for N = 30 are estimates
based on the values obtained for the shorter horizons and the assumption that
the solution time is linear in the horizon length.
The dynamic system to be controlled is the undamped linear mass-spring system,
consisting of a chain of M masses connected to each other by springs. The
system is modeled using 2 states per mass, one representing the deviation of the
mass from the rest position and one representing the mass velocity (the total
number of states is nx = 2M). The inputs are nu forces acting on the first nu
masses. The continuous-time system is discretized with a sampling time of 0.5
seconds: therefore the discrete-time matrices An and Bn in (12.1) are dense.
Forces act on all but the last mass, so nu = M − 1. There are nb = nu + nx
two-sided box constraints: inputs must satisfy the bounds |u| ≤ 0.5, while
states must satisfy the bounds |x| ≤ 4. The cost-function matrices are R = 2I,
Q = P = I, S = 0, r = 0, q = p = 0. Notice that the test problem is linear
time-invariant.
Both the low- and high-level interfaces of the IPM in HPMPC are tested against
each other and against FORCES and FORCES_Pro. The results of the test are
in Table 13.1, where the number of IPM iterations is ﬁxed to 10 for all solvers.
In the comparison of the low- and high-level HPMPC interfaces, for all problems
the high-level interface is between about 13% and 17% slower than the low-level
one (see Table 13.1), showing that the on-line cost for packing matrices is
well amortized over the 10 IPM iterations.
The comparison of the IPM solver in HPMPC with FORCES or FORCES_Pro
clearly shows the performance gains of the proposed implementation approach.
The performance of FORCES and FORCES_Pro is rather similar, with the
latter showing a small performance improvement. For this benchmark problem,
FORCES (and presumably FORCES_Pro) has a slightly lower cost in terms of
flops with respect to the Riccati-based IPM in HPMPC. However, the IPM in
HPMPC is from about 2 (smallest problem) to more than 10 (largest problem)
times faster than FORCES_Pro, with the performance gap increasing with the
problem size. Also the high-level interface in HPMPC is considerably faster than
FORCES_Pro, showing that in an MPC framework it makes sense to invest time
in arranging data in a way optimal for the linear-algebra routines, even if this
has to be done on-line.
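The speed-up figures quoted above can be reproduced from the run times in Table 13.1. The short script below transcribes the HPMPC low-level and FORCES_Pro columns of the table (the entry for M = 30 of FORCES_Pro is the estimated value) and computes their ratio for each problem size; the dictionary layout is only a convenience for this example.

```python
# Run times in seconds from Table 13.1: M -> (HPMPC low-level, FORCES_Pro)
run_times = {
    2: (5.39e-5, 1.0e-4),
    4: (9.05e-5, 3.1e-4),
    6: (5.07e-4, 1.84e-3),
    11: (3.94e-4, 3.29e-3),
    15: (7.03e-4, 7.49e-3),
    30: (1.10e-2, 1.25e-1),   # FORCES_Pro value estimated from N = 10
}

# Ratio of the FORCES_Pro run time to the HPMPC low-level run time
speedup = {M: forces_pro / hpmpc for M, (hpmpc, forces_pro) in run_times.items()}

for M in sorted(speedup):
    print(f"M = {M:2d}: HPMPC low-level is {speedup[M]:.1f}x faster than FORCES_Pro")
```

The ratio grows monotonically with the number of masses, from slightly below 2 for M = 2 to more than 11 for M = 30.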
Table 13.1: Comparison of solvers for the box-constrained linear MPC problem:
low- and high-level interfaces for the IPM in HPMPC, FORCES IPM and
FORCES_Pro IPM. Run times are presented in seconds. For each problem size and
solver, the number of IPM iterations is fixed to 10. Values marked with *
(blue in the original) are estimates based on the results for N = 15 (FORCES)
and N = 10 (FORCES_Pro) and the assumption of linear solution time in N.

 M   nx   nu   nb   N    HPMPC low-level   HPMPC high-level   FORCES           FORCES_Pro
 2    4    1    5   10   5.39 · 10^-5      6.31 · 10^-5       1.1 · 10^-4      1.0 · 10^-4
 4    8    3   11   10   9.05 · 10^-5      1.04 · 10^-4       3.4 · 10^-4      3.1 · 10^-4
 6   12    5   17   30   5.07 · 10^-4      5.74 · 10^-4       2.11 · 10^-3     1.84 · 10^-3
11   22   10   32   10   3.94 · 10^-4      4.60 · 10^-4       3.96 · 10^-3     3.29 · 10^-3
15   30   14   44   10   7.03 · 10^-4      8.17 · 10^-4       9.47 · 10^-3     7.49 · 10^-3
30   60   29   89   30   1.10 · 10^-2      1.26 · 10^-2       1.67 · 10^-1 *   1.25 · 10^-1 *

13.1.5.2 Soft-constrained linear MPC problems
This section tests the performance of the IPM solver in HPMPC in the solution
of a soft-constrained linear MPC problem. Two tests are performed: the first
handles the soft constraints using the specialized interface, while the second,
used as a reference, handles them as general polytopic constraints.
The tests are the same as in Section 13.1.5.1, with the difference that
the bounds are tightened in order to make the problem infeasible if the bounds
are considered as hard. Namely, the bounds on inputs are still |u| ≤ 0.5, but
the bounds on states are now |x| ≤ 1. Furthermore, the cost function matrices
are now Q = P = 0, R = 2I, S = 0, q = p = 0 and r = 0: the expected behavior
for the controller is to let the masses freely oscillate around the rest
position, while ensuring that displacements and velocities of the masses satisfy
the bound |x| ≤ 1.
When the solver tailored to handle soft-constraints is considered, the problem
has nb = nu two-sided box hard constraints on inputs, and ns = nx two-sided
box soft constraints on states. The slack variables s in (12.1) are not explicitly
considered as dynamical system variables, but rather are internally handled by
the solver as shown in Section 13.1.2. This keeps the size of the dynamic system
as small as in the hard-constrained case.
When the generic IPM solver in HPMPC is considered, the slack variables s
have to be considered as extra input variables: in the current test problem, this
means that the new input size is $\tilde n_u = n_u + 2n_x$. In order to keep the number
of constraints as small as possible, the two one-sided constraints (12.1g) and
(12.1h) are rewritten as the single two-sided constraint
\begin{equation*}
x^l_{n,i} \leq x_{n,i} + s^l_{n,i} - s^u_{n,i} \leq x^u_{n,i}.
\end{equation*}
The condition on the sign of the slack variables (12.1i) and the penalties
$Z^l > 0$ or $z^l > 0$ ensure the equivalence of the two formulations at the optimum.
Therefore there are $\tilde n_g = n_x$ two-sided general constraints. The $2n_x$ one-sided
constraints on the sign of the slack variables (12.1i) are handled as the lower
bounds of $2n_x$ two-sided box constraints, where the upper bound is arbitrarily
fixed to a large enough value. In total, there are $\tilde n_b = n_u + 2n_x$ two-sided box
constraints and $\tilde n_g = n_x$ general polytopic constraints per stage.
Compared to the (hard) box-constrained test problem in Table 13.1, the solver
tailored to the soft-constrained linear MPC problem is about 20% slower for the
smaller problems, down to about 10% slower for the largest ones. On the other
hand, the general IPM solver applied to the soft-constrained linear MPC problem
is from about 2x slower (smallest problem) to about 8x slower (largest problem)
than the tailored solver, due to the increased size of the dynamical system.

Table 13.2: Comparison of solvers for the soft-constrained linear MPC problem:
tailored IPM solver (soft-box) and generic IPM solver for (hard) general
polytopic constraints (hard-gen). Run times are presented in seconds. For each
problem size and solver, the number of IPM iterations is fixed to 10.

                              HPMPC                            HPMPC
 M   nx   nu   N    nb   ns   soft-box       ñu    ñb    ñg    hard-gen
 2    4    1   10    1    4   7.0 · 10^-5      9     9     4   1.3 · 10^-4
 4    8    3   10    3    8   1.2 · 10^-4     19    19     8   2.9 · 10^-4
 6   12    5   30    5   12   6.4 · 10^-4     29    29    12   1.9 · 10^-3
11   22   10   10   10   22   4.6 · 10^-4     52    52    22   2.2 · 10^-3
15   30   14   10   14   30   8.7 · 10^-4     74    74    30   4.6 · 10^-3
30   60   29   30   29   60   1.2 · 10^-2    149   149    60   9.3 · 10^-2
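The growth of the problem size in the generic reformulation follows directly from the counting rules given above. The helper below, whose name is an assumption made for this example, applies them to the smallest and to the largest mass-spring test problem, using nx = 2M and nu = M − 1.

```python
def generic_reformulation_sizes(nx, nu):
    """Per-stage sizes when soft state bounds are handled by the generic solver."""
    n_u_tilde = nu + 2 * nx   # inputs plus lower and upper slack variables
    n_b_tilde = nu + 2 * nx   # input bounds plus sign bounds on the slacks
    n_g_tilde = nx            # one two-sided general constraint per state
    return n_u_tilde, n_b_tilde, n_g_tilde


# Smallest (M = 2) and largest (M = 30) test problems
smallest = generic_reformulation_sizes(nx=2 * 2, nu=2 - 1)
largest = generic_reformulation_sizes(nx=2 * 30, nu=30 - 1)
```

For the largest problem the input size grows from 29 to 149, which is consistent with the larger slowdown observed for that problem, since the cost of the Riccati recursion is cubic in the stage size.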
13.1.5.3 Partial condensing
This section contains the results of some numerical investigation on the use of
partial condensing to speed up the solution of linear MPC problems.
More precisely, partial condensing is employed as a preparation step before
the call to the IPM solver, with the aim of changing the size of the linear
MPC problem. The IPM has not been tailored to take advantage of the special
structure of the Γu matrix, that therefore is processed as the constraint matrix
of general polytopic constraints.
The test problem is the mass-spring problem with N = 10, nx = 8 and nu = 3.
Two series of tests are performed: in one all inputs and states are bounded (and
therefore nb = 11), in the other only inputs are bounded (and therefore
nb = 3). These are the two limiting cases: the former is the worst case (all
state bounds except the last one are turned into general polytopic constraints),
while the latter is the best case (there are no general polytopic constraints).
The performance of all other cases lies between the two limiting cases.
[Figure 13.1: two plots of time [s] versus Np (from 1 to 10), each with the
curves "part cond", "IPM 10 iter" and "total". (a) All states bounded: N=10,
nx=8, nu=3, nb=11. (b) No state bounded: N=10, nx=8, nu=3, nb=3.]
Figure 13.1: Solution time for the mass-spring MPC problem with N = 10,
nx = 8, nu = 3 and either bounds on all inputs and states (nb = 11) or input
bounds only (nb = 3). A combination of partial condensing and a
predictor-corrector IPM (fixed to 10 iterations) is employed for different
values of the horizon length Np of the partially condensed MPC problem. Test
performed on an Intel Core i5 3520M processor, gcc compiler, AVX target ISA in
HPMPC.
Figure 13.1 contains the results of some numerical tests, for values of the
horizon length of the partially condensed MPC problem Np ∈ [1, 10]. In the
partial condensing, the linear MPC problem is assumed to be time-variant, and
therefore the matrices are recomputed at each stage of the partially condensed
MPC problem: this is the worst case, and it is the expected behavior in solving
the linear MPC sub-problems in NMPC.
The cost of partial condensing is well amortized over the 10 IPM iterations, and
accounts for a small fraction of the total computational time. The use of partial
condensing is particularly advantageous in case there are input constraints only.
In this case, for Np = 2 there is a reduction in the computational time of
about 40%. On the other hand, in case all states are bounded, the reduction in
computational time is only about 10%, since the more expensive update of the
cost function at each iteration of the IPM offsets most of the savings.
13.2 Alternating direction method of multipliers
The Alternating Direction Method of Multipliers (ADMM) is a first-order
optimization method belonging to the class of splitting methods. ADMM was
first proposed in the field of optimal control in [68].
First-order optimization methods (such as gradient methods [74, 73] and
splitting methods [68, 82]) are generally straightforward to implement and can
easily exploit sparsity and special problem structures. They perform many but
cheap iterations, where the cost per iteration is quadratic in the input and
state size (requiring level 2 BLAS operations such as a matrix-vector
multiplication or the solution of a system of linear equations whose matrix is
factorized off-line). In general, the number of iterations (and therefore the
solution time) can vary significantly with the number of active constraints and
the problem conditioning. Generally speaking, first-order methods can quickly
find an approximate solution, but require a large number of iterations to
improve the accuracy of the solution.
13.2.1 Notation and basics about ADMM
This section brieﬂy reports some basics about ADMM. A more detailed presen-
tation can be found in [68]. The notation follows the one in that reference.
The Alternating Direction Method of Multipliers (ADMM), also known as Douglas-
Rachford (D-R) splitting, is an operator splitting method. It breaks the problem
into two parts, a quadratic optimal control problem and a set of single period
optimization problems. The splitting is generally not unique. The solution is
found iterating over these two steps.
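In the following, only the case of the linear MPC problem is considered. In this
framework, $\phi(\bar x, \bar u)$ indicates the (convex) quadratic cost function of the linear
MPC problem,
\begin{equation*}
\phi(\bar x, \bar u) = \frac{1}{2} \sum_{n=0}^{N-1}
\begin{bmatrix} u_n \\ x_n \\ 1 \end{bmatrix}^T
\begin{bmatrix} R_n & S_n & r_n \\ S_n^T & Q_n & q_n \\ r_n^T & q_n^T & 0 \end{bmatrix}
\begin{bmatrix} u_n \\ x_n \\ 1 \end{bmatrix}
+ \frac{1}{2}
\begin{bmatrix} x_N \\ 1 \end{bmatrix}^T
\begin{bmatrix} Q_N & q_N \\ q_N^T & 0 \end{bmatrix}
\begin{bmatrix} x_N \\ 1 \end{bmatrix}.
\end{equation*}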
The set C indicates the closed convex feasible set with respect to the inequality
constraints. Therefore, $\psi(\bar{x}, \bar{u})$ indicates the (convex)
non-quadratic cost function used to describe the inequality constraints, as the
indicator function $I_C$ of the set C,
$$
\psi(\bar{x}, \bar{u}) = I_C(\bar{x}, \bar{u}) =
\begin{cases}
0, & (\bar{x}, \bar{u}) \in C \\
\infty, & (\bar{x}, \bar{u}) \notin C
\end{cases}
$$
Similarly, D indicates the set of the feasible pairs (x¯, u¯) with respect to the
equality constraints (system dynamic equations)
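$$
D = \{ (\bar{x}, \bar{u}) \mid x_0 = \hat{x}_0, \; x_{n+1} = A_n x_n + B_n u_n + b_n, \; n = 0, 1, \dots, N-1 \}
$$
and $I_D$ indicates the indicator function of D,
$$
I_D(\bar{x}, \bar{u}) =
\begin{cases}
0, & (\bar{x}, \bar{u}) \in D \\
\infty, & (\bar{x}, \bar{u}) \notin D
\end{cases}
$$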
With this notation, the linear MPC problem can be rewritten as
$$
\min_{(\bar{x}, \bar{u})} \; I_D(\bar{x}, \bar{u}) + \phi(\bar{x}, \bar{u}) + \psi(\bar{x}, \bar{u}).
$$
An algorithm for ADMM is presented in [68]: starting from any initial guess
(x˜0, u˜0) and (z¯0, y¯0), for each iteration k the next iterate is obtained from the
recursion
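$$
(\bar{x}^{k+1}, \bar{u}^{k+1}) = \arg\min_{(\bar{x}, \bar{u})}
\left( I_D(\bar{x}, \bar{u}) + \phi(\bar{x}, \bar{u})
+ \frac{\rho}{2} \| (\bar{x}, \bar{u}) - (\tilde{x}^k, \tilde{u}^k) - (\bar{z}^k, \bar{y}^k) \|_2^2 \right)
\tag{13.9}
$$
$$
(\tilde{x}^{k+1}, \tilde{u}^{k+1}) = \arg\min_{(\tilde{x}, \tilde{u})}
\left( \psi(\tilde{x}, \tilde{u})
+ \frac{\rho}{2} \| (\bar{x}^{k+1}, \bar{u}^{k+1}) - (\tilde{x}, \tilde{u}) - (\bar{z}^k, \bar{y}^k) \|_2^2 \right)
\tag{13.10}
$$
$$
(\bar{z}^{k+1}, \bar{y}^{k+1}) = (\bar{z}^k, \bar{y}^k) + (\tilde{x}^{k+1}, \tilde{u}^{k+1}) - (\bar{x}^{k+1}, \bar{u}^{k+1})
\tag{13.11}
$$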
where ρ > 0 is an algorithm parameter.
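The recursion (13.9)-(13.11) can be tried on a scalar toy problem. The sketch
below is not taken from HPMPC: it assumes a single variable, no dynamics (so D
is the whole real line), the cost 0.5*q*x^2 + c*x and one box constraint. The
names admm_scalar, admm_state and clip are hypothetical.

```c
#include <assert.h>

/* Iterates of the toy ADMM run. */
struct admm_state {
    double x;       /* iterate of step (13.9)  */
    double x_tilde; /* iterate of step (13.10) */
    double z;       /* iterate of step (13.11) */
};

/* Clip v to the interval [lo, hi]. */
static double clip(double v, double lo, double hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Run `iters` ADMM iterations on the scalar problem
 *     min 0.5*q*x^2 + c*x   s.t.  lo <= x <= hi,
 * starting from x_tilde = 0 and z = 0.
 */
static struct admm_state admm_scalar(double q, double c, double lo, double hi,
                                     double rho, int iters)
{
    assert(q > 0.0 && rho > 0.0 && lo <= hi && iters >= 0);
    struct admm_state s = { 0.0, 0.0, 0.0 };
    for (int k = 0; k < iters; k++) {
        /* (13.9): minimize 0.5*q*x^2 + c*x + rho/2*(x - x_tilde - z)^2 */
        s.x = (rho * (s.x_tilde + s.z) - c) / (q + rho);
        /* (13.10): the indicator function turns this step into a projection */
        s.x_tilde = clip(s.x - s.z, lo, hi);
        /* (13.11): update of z */
        s.z = s.z + s.x_tilde - s.x;
    }
    return s;
}
```

With q = 1, c = -3 and the box [0, 2], the unconstrained minimizer 3 lies outside
the box, and both x and x_tilde approach the bound 2 as the iterations proceed.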
13.2.2 Box constraints
The case of box constraints is a particularly favorable case for ADMM. In case
of box constrained MPC problems, the step in (13.9) requires the solution of an
unconstrained problem formally identical to the unconstrained MPC problem
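(7.1). In fact, defining $(\bar{p}, \bar{t}) = -(\tilde{x}^k, \tilde{u}^k) - (\bar{z}^k, \bar{y}^k)$,
the penalty on the two-norm can be written as
$$
\frac{\rho}{2} \| (\bar{x}, \bar{u}) + (\bar{p}, \bar{t}) \|_2^2 =
\frac{1}{2} \sum_{n=0}^{N-1}
\begin{bmatrix} u_n \\ x_n \\ 1 \end{bmatrix}^T
\begin{bmatrix} \rho I & 0 & \rho t_n \\ 0 & \rho I & \rho p_n \\ \rho t_n^T & \rho p_n^T & \ast \end{bmatrix}
\begin{bmatrix} u_n \\ x_n \\ 1 \end{bmatrix}
$$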
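that is an update of the cost function. Notice that the quadratic part of the
cost function does not depend on $(\bar{p}, \bar{t})$, and therefore in the KKT
system (in matrix notation)
$$
\begin{bmatrix}
\bar{R} + \rho I & \bar{S} & -\bar{B}^T \\
\bar{S}^T & \bar{Q} + \rho I & -\bar{A}^T \\
\bar{B} & \bar{A} & 0
\end{bmatrix}
\begin{bmatrix} \bar{u} \\ \bar{x} \\ \bar{\pi} \end{bmatrix}
= -
\begin{bmatrix} \bar{r} + \rho \bar{t} \\ \bar{q} + \rho \bar{p} \\ \bar{b} \end{bmatrix}
\tag{13.12}
$$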
the KKT matrix can be factorized only once (oﬀ-line or at the ﬁrst iteration),
and reused at all subsequent iterations. The cost of this step is thus quadratic in
the state and input dimension. The solution of this step, (x¯k+1, u¯k+1) is primal
feasible, but in general dual infeasible.
The step in (13.10) requires the solution of a QP that does not have equality
constraints (there are no dynamic equations) but only box constraints. Defining
$(\bar{p}, \bar{t}) = (\bar{x}^{k+1}, \bar{u}^{k+1}) - (\bar{z}^k, \bar{y}^k)$, the
direction of the negative gradient of this QP is simply
$(\bar{h}, \bar{g}) = -\rho^{-1}(-\rho(\bar{p}, \bar{t})) = (\bar{p}, \bar{t})$.
Since the box constraints are completely separable, the solution of this step is
computed by clipping $(\bar{p}, \bar{t})$ to the value of the box constraints,
$$
(\tilde{x}^{k+1}, \tilde{u}^{k+1}) =
\min\left( \max\left( (\bar{p}, \bar{t}), (\bar{x}^l, \bar{u}^l) \right), (\bar{x}^u, \bar{u}^u) \right)
$$
This step can be computed in time linear in the state and input size. The solu-
tion of this step, (x˜k+1, u˜k+1) is dual feasible, but in general primal infeasible.
Step in (13.11) is simply the consensus step, and it can clearly be computed in
time linear in the state and input size.
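The clipping step and the consensus step can be sketched as two short loops.
This is only an illustration under assumptions: states and inputs are stacked
into one flat vector v, the bounds are given component-wise, and the function
names admm_box_step and admm_consensus_step are hypothetical rather than part
of HPMPC.

```c
#include <assert.h>
#include <stddef.h>

/* Step (13.10) for box constraints: clip p = v_bar - z component-wise. */
static void admm_box_step(size_t n, const double *v_bar, const double *z,
                          const double *lb, const double *ub, double *v_tilde)
{
    assert(v_bar && z && lb && ub && v_tilde);
    for (size_t i = 0; i < n; i++) {
        double p = v_bar[i] - z[i];
        double lo = p > lb[i] ? p : lb[i];     /* max(p, lower bound)   */
        v_tilde[i] = lo < ub[i] ? lo : ub[i];  /* min(..., upper bound) */
    }
}

/* Step (13.11): z <- z + v_tilde - v_bar, updated in place. */
static void admm_consensus_step(size_t n, const double *v_bar,
                                const double *v_tilde, double *z)
{
    assert(v_bar && v_tilde && z);
    for (size_t i = 0; i < n; i++)
        z[i] += v_tilde[i] - v_bar[i];
}
```

Both loops touch each component once, which is the linear cost stated above.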
13.2.3 Soft constraints
The case of soft box constraints is also a rather favorable case for ADMM,
at least regarding the computational cost per iteration, and disregarding the
possible increase in the number of iterations required to converge.
Namely, the splitting of dynamic and constraints into two separate steps makes
it easy to adapt the ADMM scheme to the case of soft constraints on the states.
In this case, also the slack variables s of the soft constraints are optimization
variables.
In step (13.9), $s$ enters the cost function formulation, but does not enter the
dynamics. If we define
$(\bar{l}, \bar{p}, \bar{t}) = -(\tilde{s}^k, \tilde{x}^k, \tilde{u}^k) - (\bar{w}^k, \bar{z}^k, \bar{y}^k)$,
this means that the new KKT system to be solved at step (13.9)
$$
\begin{bmatrix}
\bar{Z} + \rho I & 0 & 0 & 0 \\
0 & \bar{R} + \rho I & \bar{S} & -\bar{B}^T \\
0 & \bar{S}^T & \bar{Q} + \rho I & -\bar{A}^T \\
0 & \bar{B} & \bar{A} & 0
\end{bmatrix}
\begin{bmatrix} \bar{s} \\ \bar{u} \\ \bar{x} \\ \bar{\pi} \end{bmatrix}
= -
\begin{bmatrix} \bar{z} + \rho \bar{w} \\ \bar{r} + \rho \bar{t} \\ \bar{q} + \rho \bar{p} \\ \bar{b} \end{bmatrix}
\tag{13.13}
$$
is block diagonal, since $s$ is completely separated from the other variables.
Since the matrix $\bar{Z} + \rho I$ is diagonal and constant over the ADMM
iterations (and therefore it can be inverted off-line or at the first iteration),
and the bottom-right block can be solved exactly as in (13.12), the cost of this
step is almost identical to the case with box constraints.
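The slack block of (13.13) decouples into scalar equations, so its solution can
be sketched as follows. The sketch assumes that the diagonal of Z is stored as
a plain array and uses the right-hand side exactly as printed in (13.13); the
function names slack_factorize and slack_solve are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Off-line part: store the reciprocals of the diagonal of Z + rho*I. */
static void slack_factorize(size_t ns, const double *Z_diag, double rho,
                            double *inv_diag)
{
    assert(rho > 0.0);
    for (size_t i = 0; i < ns; i++) {
        assert(Z_diag[i] + rho > 0.0);
        inv_diag[i] = 1.0 / (Z_diag[i] + rho);
    }
}

/* Per-iteration part: s = -(Z + rho*I)^{-1} (z + rho*w),
 * i.e. the first block row of (13.13). */
static void slack_solve(size_t ns, const double *inv_diag, const double *z_lin,
                        const double *w, double rho, double *s)
{
    for (size_t i = 0; i < ns; i++)
        s[i] = -inv_diag[i] * (z_lin[i] + rho * w[i]);
}
```

Only slack_solve runs at each ADMM iteration, at the cost of a few flops per
slack variable.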
The computation of step (13.10) is more interesting. The inputs u¯ are box
constrained, and therefore treated as in the case of box-constrained MPC. The
states are soft constrained, but the constraints are completely separable over
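both the control horizon N and the state components. Therefore, defining
$(\bar{l}, \bar{p}, \bar{t}) = (\bar{s}^{k+1}, \bar{x}^{k+1}, \bar{u}^{k+1}) - (\bar{w}^k, \bar{z}^k, \bar{y}^k)$,
for each stage $n \in \{1, \dots, N\}$ and for each component
$i \in \{0, \dots, n_x - 1\}$, the variables $\tilde{x}^{k+1}_{n,i}$,
$\tilde{s}^{u,k+1}_{n,i}$ and $\tilde{s}^{l,k+1}_{n,i}$ are computed by solving the
optimization problem (dropping the iteration index $k+1$ and using the index $i$
instead of $n,i$ in the equations for clarity of presentation)
$$
\begin{aligned}
\min_{\tilde{x}_i, \tilde{s}^u_i, \tilde{s}^l_i} \quad
& \tfrac{1}{2} \tilde{x}_i^2 - p_i \tilde{x}_i
+ \tfrac{1}{2} (\tilde{s}^u_i)^2 - l^u_i \tilde{s}^u_i
+ \tfrac{1}{2} (\tilde{s}^l_i)^2 - l^l_i \tilde{s}^l_i \\
\text{s.t.} \quad
& \tilde{x}_i - \tilde{s}^u_i \le x^u_i \\
& \tilde{s}^u_i \ge 0 \\
& x^l_i \le \tilde{x}_i + \tilde{s}^l_i \\
& \tilde{s}^l_i \ge 0
\end{aligned}
\tag{13.14}
$$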
This optimization problem can be solved analytically. In fact, since the upper
and the lower soft constraints can not be violated at the same time, there are
only three possible cases:
• both the upper and the lower soft constraints are inactive. In this case,
both $\tilde{s}^{u,k+1}_{n,i} = 0$ and $\tilde{s}^{l,k+1}_{n,i} = 0$, and the
minimization problem (13.14) reduces to
$$
\min_{\tilde{x}_i} \; \tfrac{1}{2} \tilde{x}_i^2 - p_i \tilde{x}_i
$$
that has the solution (with all indexes)
$$
\tilde{x}^{k+1}_{n,i} = p_{n,i}
$$
as in the box-constrained case when the constraints are inactive.
• the bottom soft constraint is inactive, but the upper one is active. This
means that $\tilde{s}^{l,k+1}_{n,i} = 0$ and
$\tilde{x}^{k+1}_{n,i} - \tilde{s}^{u,k+1}_{n,i} = x^u_{n,i}$. Inserting these
values in the cost function expression in (13.14) gives the minimization problem
$$
\min_{\tilde{s}^u_i} \; (\tilde{s}^u_i)^2 - (-x^u_i + p_i + l^u_i) \tilde{s}^u_i
$$
giving the analytic solution (with all indexes)
$$
\tilde{s}^{u,k+1}_{n,i} = \tfrac{1}{2} (-x^u_{n,i} + p_{n,i} + l^u_{n,i}) \ge 0.
$$
• the upper soft constraint is inactive, but the bottom one is active. This
means that $\tilde{s}^{u,k+1}_{n,i} = 0$ and
$x^l_{n,i} = \tilde{x}^{k+1}_{n,i} + \tilde{s}^{l,k+1}_{n,i}$. Inserting these
values in the cost function expression in (13.14) gives the minimization problem
$$
\min_{\tilde{s}^l_i} \; (\tilde{s}^l_i)^2 - (x^l_i - p_i + l^l_i) \tilde{s}^l_i
$$
giving the analytic solution (with all indexes)
$$
\tilde{s}^{l,k+1}_{n,i} = \tfrac{1}{2} (x^l_{n,i} - p_{n,i} + l^l_{n,i}) \ge 0.
$$
Since for each stage n and index i the computation requires a constant number
of ﬂops, the step in (13.10) can be computed in time linear in state and input
dimension also in the case of soft-constraints.
Step in (13.11) is again simply the consensus step, and it can clearly be computed
in time linear in the state and input size. Notice that this time also the consensus
variables $\bar{w}^{k+1}$ associated with the slack variables $s$ are updated in this step.
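The three cases of the analytic solution of (13.14) can be written as a short
routine for one state component. This is a sketch and not the HPMPC code. The
rule used to select the case (the sign of the candidate slack) is an assumption
of this sketch, and situations outside the three cases listed in the text (for
example a nonzero slack with an inactive constraint, or both candidates being
positive) are not handled. The names soft_step and soft_step_component are
hypothetical.

```c
#include <assert.h>

/* Values of one state component after step (13.10) with soft bounds. */
struct soft_step {
    double x;   /* x_tilde                       */
    double su;  /* slack of the upper soft bound */
    double sl;  /* slack of the lower soft bound */
};

/* p, lu, ll: components of (p, l^u, l^l); xl, xu: lower and upper bound. */
static struct soft_step soft_step_component(double p, double lu, double ll,
                                            double xl, double xu)
{
    assert(xl <= xu);
    struct soft_step r = { p, 0.0, 0.0 };  /* case 1: both inactive     */
    double su = 0.5 * (-xu + p + lu);      /* candidate slack, case 2   */
    double sl = 0.5 * (xl - p + ll);       /* candidate slack, case 3   */
    if (su > 0.0) {                        /* case 2: upper bound active */
        r.su = su;
        r.x = xu + su;                     /* from x - su = xu          */
    } else if (sl > 0.0) {                 /* case 3: lower bound active */
        r.sl = sl;
        r.x = xl - sl;                     /* from xl = x + sl          */
    }
    return r;
}
```

Each call performs a fixed number of operations, in line with the linear cost of
the step claimed above.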
13.2.4 ADMM implementation choices
The key step in the ADMM algorithm is the solution of the KKT system at
each iteration. Since the KKT matrix does not change at diﬀerent iterations,
it can be factorized once and reused several times (or even factorized oﬀ-line in
case of time-invariant problems), well amortizing the factorization cost.
The factorization and the solution are performed using the backward Riccati
Algorithms 2 and 3 respectively. Riccati-based ADMM implementations can
be found in [17, 76]. In the solution Algorithm 3, it is possible to save many
ﬂops and use much less memory if the matrices Ln,22 are not employed. This is
the case if the Lagrangian multipliers λ are not computed, and if the quantity
Pn+1bn is precomputed during the factorization step. A further advantage of the
use of less memory is that the solution routine attains a better computational
performance, since the computational performance steadily decreases as soon as
the MPC problem memory footprint exceeds cache levels [37].
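The idea of factorizing once and solving many times can be illustrated on a
small dense KKT matrix. The sketch below deliberately swaps the Riccati-based
Algorithms 2 and 3 for a generic LDL^T factorization without pivoting, which is
assumed to be adequate only for this toy matrix (positive definite Hessian
block ordered first, one equality constraint). The function names ldl_factorize
and ldl_solve are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* In-place factorization A = L*D*L^T of a symmetric n x n matrix stored
 * row-major; only the lower triangle is read and written (D on the diagonal,
 * unit-diagonal L below it). Returns -1 if a zero pivot is met. */
static int ldl_factorize(size_t n, double *A)
{
    assert(A);
    for (size_t j = 0; j < n; j++) {
        double d = A[j * n + j];
        for (size_t k = 0; k < j; k++)
            d -= A[j * n + k] * A[j * n + k] * A[k * n + k];
        if (d == 0.0)
            return -1;               /* pivoting would be required */
        A[j * n + j] = d;
        for (size_t i = j + 1; i < n; i++) {
            double v = A[i * n + j];
            for (size_t k = 0; k < j; k++)
                v -= A[i * n + k] * A[j * n + k] * A[k * n + k];
            A[i * n + j] = v / d;
        }
    }
    return 0;
}

/* Solve L*D*L^T x = b in place, reusing a previously computed factor. */
static void ldl_solve(size_t n, const double *LD, double *b)
{
    assert(LD && b);
    for (size_t i = 0; i < n; i++)            /* forward substitution  */
        for (size_t k = 0; k < i; k++)
            b[i] -= LD[i * n + k] * b[k];
    for (size_t i = 0; i < n; i++)            /* diagonal scaling      */
        b[i] /= LD[i * n + i];
    for (size_t i = n; i-- > 0;)              /* backward substitution */
        for (size_t k = i + 1; k < n; k++)
            b[i] -= LD[k * n + i] * b[k];
}
```

In an ADMM loop, ldl_factorize would be called once before the iterations, and
ldl_solve once per iteration with the updated right-hand side. Each solve costs
a number of flops quadratic in the matrix size, against a cubic number for the
factorization.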
Table 13.3: Numerical test for the ADMM solver implemented using routines
in HPMPC in the solution of the box-constrained linear MPC
problem. Run times are presented in seconds. For each problem
size and solver, the number of ADMM iterations is fixed to 50.

   M   nx   nu   nb    N   HPMPC ADMM
   2    4    1    5   10   7.02 · 10^-5
   4    8    3   11   10   1.06 · 10^-4
   6   12    5   17   30   5.13 · 10^-4
  11   22   10   32   10   3.08 · 10^-4
  15   30   14   44   10   4.65 · 10^-4
  30   60   29   89   30   4.73 · 10^-3
13.2.5 Numerical results for the linear MPC problem
In this section, the numerical tests for the box-constrained and soft-constrained
linear MPC problems (employed in the testing of the IPM solver in Section
13.1.5) are repeated for the ADMM solver.
13.2.5.1 Box-constrained linear MPC problems
This section contains the results on the use of the ADMM in the solution of
box-constrained MPC problems. The results are in table 13.3, where the size of
the problem is the same as in the IPM case in table 13.1. The factorization is
performed oﬀ-line, and the number of ADMM iterations is ﬁxed to 50.
Generally speaking, the ADMM is slightly slower than the IPM for small prob-
lems, and almost twice as fast for the larger one. This is due to the fact that, as
the problem size increases, the solution of the KKT system (requiring a num-
ber of ﬂops quadratic in the input and state size) gets increasingly cheap with
respect to the factorization of the KKT matrix (requiring a cubic number of
ﬂops).
However, the accuracy reached in 50 iterations appears to decrease as the prob-
lem size increases, and therefore in practice more iterations may be necessary
in case of large problems, decreasing the performance advantage. Furthermore,
the accuracy of the solution varies considerably with the problem instance.
Table 13.4: Numerical test for the ADMM solver implemented using routines
in HPMPC in the solution of the soft-constrained linear MPC
problem. Run times are presented in seconds. For each problem
size and solver, the number of ADMM iterations is fixed to 50.

   M   nx   nu    N   nb   ns   HPMPC ADMM
   2    4    1   10    1    4   9.32 · 10^-5
   4    8    3   10    3    8   1.55 · 10^-4
   6   12    5   30    5   12   7.49 · 10^-4
  11   22   10   10   10   22   4.23 · 10^-4
  15   30   14   10   14   30   6.18 · 10^-4
  30   60   29   30   29   60   5.74 · 10^-3
13.2.5.2 Soft-constrained linear MPC problems
This section contains the results on the use of the ADMM in the solution of soft-
constrained MPC problems. The results are in table 13.4, where the size of the
problem is the same as in the IPM case in table 13.2. Also in the soft-constrained
case the factorization is performed oﬀ-line, and the number of ADMM iterations
is ﬁxed to 50.
Generally speaking, in the soft-constrained case the cost per iteration increases
only slightly with respect to the box-constrained case, as a result of the
structure-exploiting ADMM algorithm. Comparing the run times in Tables 13.3 and
13.4, the increase is between about 30% and 45% for the smaller problems, and
about 20% for the largest one.
However, the accuracy obtained in 50 iterations is much lower than in the
box-constrained case. This is significantly different from the IPM case, where the
number of iterations increased only slightly with respect to the box-constrained
case. The reference [49] analyzes the reason for the steep increase in the iteration
count and suggests that proper scaling can mitigate the issue. However, this has not
been employed in the current tests.
13.3 Conclusion
In this section, IPMs and ADMMs for the solution of linear MPC and MHE
problems are reviewed. Efficient implementations are proposed, that make use of
the backward Riccati recursion (presented in Section 8.1) for the solution of the
unconstrained sub-problems. Techniques to efficiently process a soft formulation
of the constraints are presented, and numerical tests confirm a very small increase
in the cost-per-iteration compared to the hard formulation of the constraints.
In particular, IPMs require the factorization and solution of a system of linear
equations at each IPM iteration, in the computation of the search direction. The
factorization step makes use of level 3 BLAS and LAPACK routines, that, if
well optimized, can attain a large fraction of the full FP throughput. Therefore,
the performance of IPMs is much more sensitive to the implementation of linear
algebra routines than e.g. ﬁrst order methods (that only use level 2 BLAS
routines). The IPM implemented using the linear algebra routines in HPMPC
can outperform other state-of-the-art IPMs for optimal control by about an
order of magnitude in case of medium to large size MPC problems.
As a conclusion, an IPM implemented using the proposed linear algebra routines
is an excellent choice for a solver for linear MPC problems, quickly and reliably
ﬁnding the solution of linear MPC problems for a wide range of problem sizes
and constraint structures.
Chapter 14
Summary and considerations about solution of sub-problems in nonlinear
MPC and MHE problems
Part I of the thesis proposed implementation techniques specially tailored to
embedded optimization. The focus of the implementation is on getting the best
possible performance for small to medium size matrices (i.e. for matrices of size
up to a few hundreds). One of the key features is the adoption of a special ma-
trix format (called panel-major), roughly corresponding to the innermost level
of packing in the GotoBLAS's approach [44], clearly exposed in [86]. This ma-
trix format provides optimal performance for matrices roughly ﬁtting in the last
level of cache, and avoids the need for packing matrices on-line. Furthermore,
a specialized linear algebra kernel (based on the gemm kernel) is implemented
for each linear algebra routine, carefully optimized for a number of computer
architectures. The linear algebra routines implemented using these kernels out-
perform both optimized BLAS libraries and code-generated routines commonly
employed in embedded optimization.
Part II of the thesis presented tailored algorithms for the solution of uncon-
strained MPC and MHE problems. In particular, a backward Riccati recursion
is found to be a reliable choice, giving good performance for a wide range of
problem sizes, while a forward Schur-complement recursion can handle MPC
and MHE problems with further equality constraints on the last stage. Con-
densing methods for MPC and MHE can be employed as routines in partial
condensing, that in turn can be employed as a preparation step to reformulate
the MPC and MHE problems in the size giving the best performance for the
solver at hand. All these algorithms are implemented using the linear algebra
routines developed in Part I of the thesis and therefore all matrices are assumed
to be in the panel-major matrix format.
Part III of the thesis quickly reviewed some optimization algorithms that can
be implemented using the solvers from Part II to solve the unconstrained
sub-problems. In particular, a Riccati-based IPM implemented using the linear
algebra routines in HPMPC is found to outperform state-of-the-art solvers for
embedded linear MPC problems by more than one order of magnitude for medium
to large scale MPC problems. These optimization algorithms still assume that
all matrices are in the panel-major matrix format. The conversion from the
standard row-major or column-major formats to the panel-major format can
be performed in a wrapper around the IPM routine. In this way, the conversion
overhead is well amortized over a large number of IPM iterations. Numerical
tests confirm the effectiveness of the approach.
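A conversion wrapper of the kind mentioned above could look like the sketch
below. The layout is an assumption made for illustration: rows are grouped in
panels of height 4, each panel is stored column by column, and the last panel
is padded with zero rows. The actual HPMPC format (panel height, padding and
alignment) is defined in Part I and may differ. The names PANEL_HEIGHT,
panel_major_size, panel_major_index and col_major_to_panel_major are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PANEL_HEIGHT 4  /* assumed number of rows per panel */

/* Number of doubles needed to store an m x n matrix in the assumed layout. */
static size_t panel_major_size(size_t m, size_t n)
{
    size_t panels = (m + PANEL_HEIGHT - 1) / PANEL_HEIGHT;
    return panels * PANEL_HEIGHT * n;
}

/* Index of element (i, j) in a panel-major buffer with n columns. */
static size_t panel_major_index(size_t i, size_t j, size_t n)
{
    return (i / PANEL_HEIGHT) * PANEL_HEIGHT * n  /* start of the panel      */
         + j * PANEL_HEIGHT                       /* column inside the panel */
         + i % PANEL_HEIGHT;                      /* row inside the column   */
}

/* Copy a column-major m x n matrix (leading dimension lda) to panel-major. */
static void col_major_to_panel_major(size_t m, size_t n, const double *A,
                                     size_t lda, double *pA)
{
    assert(A && pA && lda >= m);
    size_t total = panel_major_size(m, n);
    for (size_t k = 0; k < total; k++)
        pA[k] = 0.0;                              /* zero the padding rows */
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++)
            pA[panel_major_index(i, j, n)] = A[i + j * lda];
}
```

The conversion touches every element once, so its cost is linear in the number
of matrix entries and becomes small once spread over the IPM iterations.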
14.1 Interface with existing solvers for NMPC
The solution of linear MPC and MHE problems is a key step in the solution of
NMPC problems using the multiple-shooting and direct collocation discretiza-
tion methods. In fact, in that case the sub-problems that need to be solved at
each Sequential Quadratic Programming (SQP) iteration can be seen as time-
variant instances of linear MPC problems [83].
In this section, the strength of the solvers for the MPC and MHE problems
in HPMPC is demonstrated on the state estimation and control of a nonlinear
system. Namely, the results of closed-loop real-time simulations of rotational
start-up for an airborne wind energy system [93] are presented. The system is
modeled as a diﬀerential-algebraic equation (DAE), with 27 diﬀerential states,
1 algebraic state and 4 control inputs.
To solve the NMPC and NMHE formulations, the ACADO Code Generation
Tool (CGT) [47, 48] is employed. ACADO implements the real-time iteration
(RTI) scheme [25, 59]. Multiple shooting is employed for the discretization. The
QPs underlying the NMPC solver have box constraints, and they are handled
with the Riccati-based IPM presented in Section 13.1. The QPs underlying the
NMHE solver do not have general or box constraints, but they have further
equality constraints on the last stage: they are solved eﬃciently using the for-
ward Schur-complement recursion presented in Section 8.2. More details about
the experimental setup can be found in [88] (where the use of the Riccati-based
IPM is investigated for the solution of the MPC sub-problems) and in [40]
(where the use of the forward Schur-complement recursion is investigated for
the solution of the MHE sub-problems).
The linear MPC sub-problems have nx = 27 states, nu = 4 controls and a
horizon length of Nc = 50. An augmented model, one that includes a disturbance
model, is used for the NMHE. Therefore the linear MHE sub-problems have
nx = 33 states and nw = 6 disturbance inputs. Consistency conditions of
the DAE model yield nd = 9 equality constraints on the last stage, while the
number of estimation intervals is Ne = 15. More details can be found in [87]
and references therein.
The simulation results are reported in Figure 14.1. A control interval begins
with a feedback step of the RTI scheme for the NMHE (MHE FBK), after
which the current state estimate is obtained. Afterwards, the NMPC feed-
back step is triggered (MPC FBK) for calculation of optimal control inputs. In
essence, the execution times of the feedback steps amount to solutions of the
underlying QPs. After each feedback step, the corresponding preparation step is
executed (MHE PREP and MPC PREP), which includes model integration, sensitivity
generation and linearization of the objective and the constraints. In this setting,
NMHE and NMPC run on separate CPU cores.
Figure 14.1a presents the results when the qpOASES [28] solver is used to solve
the QPs underlying the NMHE and NMPC formulations. In that case, the
feedback step of the NMHE requires about 3.5 ms, while the preparation step
requires about 8.5 ms. The feedback step for the NMPC requires about 12 ms,
while the preparation step requires slightly less than 20 ms.
Figure 14.1b presents the results when the solvers in HPMPC are employed
instead for the solution of the underlying QPs. In that case, the feedback step
of the NMHE requires about 0.5 ms (that is about 7 times as fast as in the
qpOASES case), while the preparation step requires about 5 ms. The feedback
step for the NMPC requires about 3 ms (that is about 4 times as fast as in
the qpOASES case), while the preparation step requires less than 8 ms. In total,
the maximum feedback delay is always less than 3.5 ms, far below the control
period of 40 ms.
[Plot omitted: timing diagrams showing, in execution order, the MHE FBK, MHE PREP,
MPC FBK and MPC PREP steps over the control period; time axis: 20 ms/div.
Panels: (a) ACADO-qpOASES, (b) ACADO-HPMPC.]
Figure 14.1: Feedback and preparation step times for the MHE and MPC
using the ACADO-qpOASES and ACADO-HPMPC solvers in
the rotational start-up of an airborne wind energy system.
The much shorter feedback steps of the solvers in HPMPC are due to the fact
that the forward Schur-complement method can eﬃciently solve the exact MHE
sub-problem formulations, while the long horizon length of Nc = 50 in the MPC
sub-problems favors sparse solvers. The much shorter preparation steps are
mainly due to the fact that the solvers in HPMPC do not require a condensing
step, while that is employed in the qpOASES case. The integration algorithms
are identical in the two cases.
Appendix A
Custom gcc Compiler
During the optimization of the dpotrf and dtrtri kernels on the Haswell micro-
architecture, a performance issue has been found in the gcc compiler: the per-
formance of these kernels was about 20% lower than expected based on the
performance of the other level 3 BLAS kernels.
An analysis of the produced assembly code showed that the fnmadd intrinsics
(implementing c ← c − a · b, where a, b, c are registers) used in the kernels
implementation have been translated into fmadd (implementing c ← c + a · b)
and xor (used to change the sign of either a or b) instructions. This causes a
performance degradation due to the introduction of the xor instructions in an
already resource-constrained loop.
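The update c ← c − a · b discussed above can be written with the fnmadd
intrinsic as in the following sketch. It is not one of the HPMPC kernels: the
function name update_c_minus_ab is hypothetical, unaligned loads are used for
simplicity, and a scalar fallback is compiled when AVX and FMA are not enabled
(for example without the -mavx -mfma options of gcc).

```c
#include <assert.h>
#include <stddef.h>

#if defined(__AVX__) && defined(__FMA__)
#include <immintrin.h>
#endif

/* c[i] <- c[i] - a[i] * b[i] for i = 0..n-1, with n a multiple of 4. */
static void update_c_minus_ab(size_t n, const double *a, const double *b,
                              double *c)
{
    assert(n % 4 == 0);
#if defined(__AVX__) && defined(__FMA__)
    for (size_t i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);
        __m256d vb = _mm256_loadu_pd(b + i);
        __m256d vc = _mm256_loadu_pd(c + i);
        /* fnmadd computes -(va * vb) + vc; ideally it maps to one instruction */
        vc = _mm256_fnmadd_pd(va, vb, vc);
        _mm256_storeu_pd(c + i, vc);
    }
#else
    /* Portable fallback when AVX/FMA are not enabled at compile time. */
    for (size_t i = 0; i < n; i++)
        c[i] -= a[i] * b[i];
#endif
}
```

Inspecting the assembly produced for the loop (for example with the -S option of
gcc) shows whether a single fnmadd instruction or an fmadd plus xor pair is
emitted.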
In gcc, the intrinsics are implemented as calls to builtin functions, that are
functions defined in the compiler itself. Generally, there is a builtin for each
intrinsic, but this is not the case for the various fnmadd, fmsub and fnmsub
intrinsics. An inspection of the
main_gcc_folder\gcc\config\i386\fmaintrin.h
header file shows that they are all implemented with a call to the fmadd builtin,
plus the negation of the proper operands, as (in the case of the 256-bit
double-precision version of the intrinsics, similarly for 128-bit and/or
single-precision versions of the intrinsics)
extern __inline __m256d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_fmadd_pd (__m256d __A, __m256d __B, __m256d __C)
{
  return (__m256d)__builtin_ia32_vfmaddpd256 ((__v4df)__A, (__v4df)__B,
                                              (__v4df)__C);
}

extern __inline __m256d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_fmsub_pd (__m256d __A, __m256d __B, __m256d __C)
{
  return (__m256d)__builtin_ia32_vfmaddpd256 ((__v4df)__A, (__v4df)__B,
                                              -(__v4df)__C);
}

extern __inline __m256d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_fnmadd_pd (__m256d __A, __m256d __B, __m256d __C)
{
  return (__m256d)__builtin_ia32_vfmaddpd256 (-(__v4df)__A, (__v4df)__B,
                                              (__v4df)__C);
}

extern __inline __m256d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_fnmsub_pd (__m256d __A, __m256d __B, __m256d __C)
{
  return (__m256d)__builtin_ia32_vfmaddpd256 (-(__v4df)__A, (__v4df)__B,
                                              -(__v4df)__C);
}
The builtins are deﬁned in the ﬁle
main_gcc_folder\gcc\config\i386\i386.c
where the enumeration ix86_builtins contains the codes for all the builtins,
that are deﬁned in the bdesc_multi_arg array of builtin_description struc-
tures. The lines
IX86_BUILTIN_VFMSUBPD256,
IX86_BUILTIN_VFNMADDPD256,
IX86_BUILTIN_VFNMSUBPD256,
have been added to the ix86_builtins enumeration, while the lines
{ OPTION_MASK_ISA_FMA | OPTION_MASK_ISA_FMA4, CODE_FOR_fma4i_fmsub_v4df,
"__builtin_ia32_vfmsubpd256", IX86_BUILTIN_VFMSUBPD256,
UNKNOWN, (int)MULTI_ARG_3_DF2 },
{ OPTION_MASK_ISA_FMA | OPTION_MASK_ISA_FMA4, CODE_FOR_fma4i_fnmadd_v4df,
"__builtin_ia32_vfnmaddpd256", IX86_BUILTIN_VFNMADDPD256,
UNKNOWN, (int)MULTI_ARG_3_DF2 },
{ OPTION_MASK_ISA_FMA | OPTION_MASK_ISA_FMA4, CODE_FOR_fma4i_fnmsub_v4df,
"__builtin_ia32_vfnmsubpd256", IX86_BUILTIN_VFNMSUBPD256,
UNKNOWN, (int)MULTI_ARG_3_DF2 },
have been added to the bdesc_multi_arg array. The builtin_description
structure is deﬁned as
struct builtin_description
{
  const HOST_WIDE_INT mask;
  const enum insn_code icode;
  const char *const name;
  const enum ix86_builtins code;
  const enum rtx_code comparison;
  const int flag;
};
The ﬁle
main_gcc_folder\gcc\config\i386\sse.md
contains the machine description of the various SSE, AVX and FMA instructions.
In particular, it contains an RTL (Register Transfer Language, a low-level
intermediate representation) pattern for each instruction that the target machine
supports (or that it is worth telling the compiler about). The names of the
patterns defined with either a named define_insn or a define_expand are used to
generate a list of insns, that then undergoes various optimization passes. Finally,
the RTL of the insn list is matched against the RTL templates in the define_insn
patterns to produce assembly code.
In the sse.md file, there is a define_insn for each of the fmadd, fmsub, fnmadd
and fnmsub instructions, but there is only a single define_expand, that
corresponds to the single builtin __builtin_ia32_vfmaddpd256 defined in the
file i386.c (that in the struct builtin_description has insn_code equal to
CODE_FOR_fma4i_fmadd_v4df). This means that the compiler can emit all
fmadd, fmsub, fnmadd and fnmsub instructions if the corresponding patterns
are recognized in the optimized list of insns, but that only the fmadd builtin
can be used to generate the patterns in the list of insns. In practice, the insns
generated by the combinations of fmadd and xor builtins (corresponding to the
fmsub, fnmadd and fnmsub intrinsics) undergo the various optimization steps,
and their patterns are not recognized as belonging to the fmsub, fnmadd and
fnmsub instructions when producing the assembly code.
This is solved by adding a define_expand for each of the fmsub, fnmadd and
fnmsub builtins deﬁned in i386.c. Namely, the lines
(define_expand "fma4i_fmsub_<mode>"
  [(set (match_operand:FMAMODE_AVX512 0 "register_operand")
        (fma:FMAMODE_AVX512
          (match_operand:FMAMODE_AVX512 1 "nonimmediate_operand")
          (match_operand:FMAMODE_AVX512 2 "nonimmediate_operand")
          (neg:FMAMODE_AVX512
            (match_operand:FMAMODE_AVX512 3 "nonimmediate_operand"))))])

(define_expand "fma4i_fnmadd_<mode>"
  [(set (match_operand:FMAMODE_AVX512 0 "register_operand")
        (fma:FMAMODE_AVX512
          (neg:FMAMODE_AVX512
            (match_operand:FMAMODE_AVX512 1 "nonimmediate_operand"))
          (match_operand:FMAMODE_AVX512 2 "nonimmediate_operand")
          (match_operand:FMAMODE_AVX512 3 "nonimmediate_operand")))])

(define_expand "fma4i_fnmsub_<mode>"
  [(set (match_operand:FMAMODE_AVX512 0 "register_operand")
        (fma:FMAMODE_AVX512
          (neg:FMAMODE_AVX512
            (match_operand:FMAMODE_AVX512 1 "nonimmediate_operand"))
          (match_operand:FMAMODE_AVX512 2 "nonimmediate_operand")
          (neg:FMAMODE_AVX512
            (match_operand:FMAMODE_AVX512 3 "nonimmediate_operand"))))])
are added to the sse.md ﬁle.
Finally, the fmaintrin.h header can explicitly use the newly deﬁned builtins
for the fmsub, fnmadd and fnmsub instructions, as
extern __inline __m256d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_fmadd_pd (__m256d __A, __m256d __B, __m256d __C)
{
  return (__m256d)__builtin_ia32_vfmaddpd256 ((__v4df)__A, (__v4df)__B,
                                              (__v4df)__C);
}

extern __inline __m256d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_fmsub_pd (__m256d __A, __m256d __B, __m256d __C)
{
  return (__m256d)__builtin_ia32_vfmsubpd256 ((__v4df)__A, (__v4df)__B,
                                              (__v4df)__C);
}

extern __inline __m256d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_fnmadd_pd (__m256d __A, __m256d __B, __m256d __C)
{
  return (__m256d)__builtin_ia32_vfnmaddpd256 ((__v4df)__A, (__v4df)__B,
                                               (__v4df)__C);
}

extern __inline __m256d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_fnmsub_pd (__m256d __A, __m256d __B, __m256d __C)
{
  return (__m256d)__builtin_ia32_vfnmsubpd256 ((__v4df)__A, (__v4df)__B,
                                               (__v4df)__C);
}
In a similar fashion, the builtins for the 128-bit and/or single precision version
of the fmsub, fnmadd and fnmsub can be added.
After these modifications, the gcc compiler correctly emits the fnmadd
instructions in the dpotrf and dtrtri kernels, improving their performance by
about 20%. Figure A.1 compares the performance of the dpotrf and dtrtri routines
when the code is compiled with the stock gcc compiler (version 5.2.0) and with
the custom gcc compiler (customization of version 5.2.0).
[Plot omitted: performance in Gflops versus matrix size n (0 to 300) for the
custom gcc and stock gcc compilers. Panels: (a) dpotrf, (b) dtrtri.]
Figure A.1: Performance test for routines dpotrf and dtrtri using stock
and custom gcc compilers, on an Intel Core i7 4800MQ processor
(Haswell micro-architecture, supporting the AVX2 and FMA
ISAs).
A modification of the latest gcc (version 6.0.0) can be found in [9].
Bibliography
[1] http://www.realworldtech.com/.
[2] http://www.agner.org/optimize/.
[3] ACML. http://developer.amd.com/tools-and-sdks/archive/amd-core-
math-library-acml/.
[4] ATLAS. http://math-atlas.sourceforge.net/.
[5] BLAS. http://www.netlib.org/blas/.
[6] BLIS. https://github.com/flame/blis.
[7] CPLEX. http://www.ibm.com/software/integration/optimization/cplex-
optimization-studio/.
[8] FORCES_Pro. https://www.embotech.com.
[9] gcc. https://github.com/giaf/gcc.git.
[10] GotoBLAS. https://www.tacc.utexas.edu/research-development/tacc-
software/gotoblas2.
[11] LAPACK. http://www.netlib.org/lapack/.
[12] MKL. https://software.intel.com/en-us/intel-mkl.
[13] PLASMA. http://icl.cs.utk.edu/plasma/.
[14] BDO Anderson and JB Moore. Optimal Filtering. Prentice-Hall, Engle-
wood Cliﬀs, NJ, 1979.
[15] E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, Jack J. Don-
garra, J. Du Croz, S. Hammarling, A. Greenbaum, A. McKenney, and
D. Sorensen. LAPACK Users' Guide (Third Ed.). Society for Industrial
and Applied Mathematics, Philadelphia, PA, USA, 1999.
[16] J. Andersson. A General-Purpose Software Framework for Dynamic Opti-
mization. PhD thesis, Faculty of Engineering Science, K.U. Leuven, Hev-
erlee, Belgium, 2013.
[17] M. Annergren, A. Hansson, and B. Wahlberg. An ADMM algorithm for
solving l_1 regularized MPC. In IEEE Conference on Decision and Control,
pages 4486–4491. IEEE, 2012.
[18] D. Axehill. Controlling the level of sparsity in MPC. Systems & Control
Letters, 76:1–7, 2015.
[19] D. Axehill and M. Morari. An alternative use of the Riccati recursion for
efficient optimization. Systems & Control Letters, 61:37–40, 2012.
[20] A. Bemporad, F. Borelli, and M. Morari. Model predictive control based
on linear programming - the explicit solution. IEEE Transactions on
Automatic Control, 47(12):1974–1985, 2002.
[21] A. Bemporad, M. Morari, V. Dua, and E. N. Pistikopoulos. The Explicit
Linear Quadratic Regulator for Constrained Systems. Automatica,
38(1):3–20, January 2002.
[22] D.P. Bertsekas. Dynamic Programming and Optimal Control, volume 1.
Athena Scientiﬁc, Belmont, Massachusetts, 1995.
[23] A. Buttari, J. Dongarra, J. Langou, J. Langou, P. Luszczek, and
J. Kurzak. Mixed precision iterative refinement techniques for the solution of
dense linear systems. The International Journal of High Performance Computing
Applications, 21(4):457–466, 2007.
[24] G. Constantinides. Tutorial paper: Parallel architectures for model predic-
tive control. In IEEE European Control Conference, 2009.
[25] M. Diehl. Real-Time Optimization for Large Scale Nonlinear Process. PhD
thesis, Universität Heidelberg, 2001.
[26] A. Domahidi, A. Zgraggen, M. N. Zeilinger, M. Morari, and C. N. Jones.
Efficient interior point methods for multistage problems arising in receding
horizon control. In IEEE Conference on Decision and Control (CDC),
pages 668–674, Maui, HI, USA, December 2012.
[27] I. S. Duff. MA57 - a code for the solution of sparse symmetric definite and
indefinite systems. ACM Trans. Math. Softw., 30:118–144, June 2004.
[28] H. J. Ferreau, H. G. Bock, and M. Diehl. An online active set strategy to
overcome the limitations of explicit mpc. International Journal of Robust
and Nonlinear Control, 18(8), 2008.
[29] H. J. Ferreau, C. Kirches, A. Potschka, H. G. Bock, and M. Diehl.
qpOASES: A parametric active-set algorithm for quadratic programming.
Mathematical Programming Computation, 6(4):327–363, 2014.
[30] G. Frison. HPMPC. https://github.com/giaf/hpmpc.git.
[31] G. Frison. Numerical methods for model predictive control. Master's thesis,
Department of Informatics and Mathematical Modelling, Technical University
of Denmark, Kgs. Lyngby, Denmark, 2012.
[32] G. Frison, H. H. B. Sørensen, B. Dammann, and J. B. Jørgensen.
High-performance small-scale solvers for linear model predictive control. In
IEEE European Control Conference, pages 128–133. IEEE, 2014.
[33] G. Frison and J. B. Jørgensen. Efficient implementation of the Riccati
recursion for solving linear-quadratic control problems. In IEEE
Multi-conference on Systems and Control, pages 1117–1122. IEEE, 2013.
[34] G. Frison and J. B. Jørgensen. A fast condensing method for solution of
linear-quadratic control problems. In IEEE Conference on Decision and
Control, pages 7715–7720. IEEE, 2013.
[35] G. Frison and J. B. Jørgensen. Parallel implementation of Riccati
recursion for solving linear-quadratic control problems. In 18th Nordic Process
Control Workshop, 2013.
[36] G. Frison and J. B. Jørgensen. Efficient Solvers for Soft-Constrained MPC.
In 18th Nordic Process Control Workshop, 2015.
[37] G. Frison and J. B. Jørgensen. MPC related computational capabilities of
ARMv7A processors. In IEEE European Control Conference. IEEE, 2015.
[38] G. Frison, D. K. M. Kufoalor, L. Imsland, and J. B. Jørgensen. Efficient
implementation of solvers for linear model predictive control on embedded
devices. In IEEE Multi-conference on Systems and Control, pages 1954–1959.
IEEE, 2014.
[39] G. Frison, L. E. Sokoler, and J. B. Jørgensen. A family of high-performance
solvers for linear model predictive control. In IFAC World Congress, pages
30743079. IFAC, 2014.
[40] G. Frison, M. Vukov, N. K. Poulsen, M. Diehl, and J. B. Jørgensen.
High-Performance Small-Scale Solvers for Moving Horizon Estimation. In IFAC
Conference on Nonlinear Model Predictive Control, pages 80–86. IFAC, 2015.
[41] N. F. Gade-Nielsen, J. B. Jørgensen, and B. Dammann. MPC toolbox with
GPU accelerated optimization algorithms. In 10th European Workshop on
Advanced Control and Diagnosis, 2012.
[42] G. H. Golub and C. F. van Loan. Matrix Computations. The Johns Hopkins
University Press, 3rd edition, 1996.
[43] K. Goto and R. A. van de Geijn. On reducing TLB misses in matrix
multiplication. Technical report, Department of Computer Sciences, The
University of Texas at Austin, 2002.
[44] K. Goto and R. A. van de Geijn. Anatomy of high-performance matrix
multiplication. ACM Trans. Math. Softw., 34(3), 2008.
[45] K. Goto and R. A. van de Geijn. High-performance implementation of the
level-3 BLAS. ACM Trans. Math. Softw., 35(1), 2008.
[46] N. Haverbeke. Efficient Numerical Methods for Moving Horizon Estimation.
PhD thesis, Katholieke Universiteit Leuven, 2011.
[47] B. Houska, H. J. Ferreau, and M. Diehl. ACADO Toolkit – An Open-Source
Framework for Automatic Control and Dynamic Optimization. Optimal
Control Applications and Methods, 32(3):298–312, 2011.
[48] B. Houska, H. J. Ferreau, and M. Diehl. An Auto-Generated Real-Time
Iteration Algorithm for Nonlinear MPC in the Microsecond Range.
Automatica, 47(10):2279–2285, 2011.
[49] J. Jerez, P. Goulart, S. Richter, G. Constantinides, E. Kerrigan, and
M. Morari. Embedded online optimization for model predictive control
at megahertz rates. IEEE Transactions on Automatic Control, 2013.
[50] J. Jerez, P. Goulart, S. Richter, G. Constantinides, E. C. Kerrigan, and
M. Morari. Embedded Predictive Control on an FPGA using the Fast
Gradient Method. In European Control Conference, pages 3614–3620,
Zurich, Switzerland, 2013.
[51] J. B. Jørgensen. Moving Horizon Estimation and Control. PhD thesis,
Department of Chemical Engineering, Technical University of Denmark,
Kgs. Lyngby, Denmark, 2005.
[52] J. B. Jørgensen, G. Frison, N. F. Gade-Nielsen, and B. Dammann. Numerical
methods for solution of the extended linear quadratic control problem. In
IFAC Conference on Nonlinear Model Predictive Control, pages 187–193.
IFAC, 2012.
[53] J. B. Jørgensen, J. K. Huusom, and J. B. Rawlings. Finite horizon MPC
for systems in innovation form. In 50th IEEE Conference on Decision and
Control and European Control Conference (CDC-ECC), pages 1896–1903.
IEEE, 2011.
[54] J. B. Jørgensen, J. B. Rawlings, and S. B. Jørgensen. Numerical methods
for large-scale moving horizon estimation and control. In Int. Symposium
on Dynamics and Control Process Systems (DYCOPS), volume 7, 2004.
[55] T. Kailath, A. H. Sayed, and B. Hassibi. Linear estimation. Prentice Hall
information and system sciences series. Upper Saddle River, N.J. Prentice
Hall, 2000.
[56] R. Kalman. Contributions to the theory of optimal control. Boletin de la
Sociedad Matematica Mexicana, 1960.
[57] D. K. M. Kufoalor, G. Frison, L. Imsland, T. A. Johansen, and
J. B. Jørgensen. Block factorization of step response model predictive control
problems. Journal of Process Control, 2015. Submitted.
[58] D. K. M. Kufoalor, S. Richter, L. Imsland, T. A. Johansen, M. Morari,
and G. O. Eikrem. Embedded Model Predictive Control on a PLC using a
Primal-Dual First-Order Method for a Subsea Separation Process. In 22nd
IEEE Mediterranean Conference on Control and Automation, Palermo,
Italy, 2014.
[59] P. Kühl, M. Diehl, T. Kraus, J. P. Schlöder, and H. G. Bock. A real-time
algorithm for moving horizon state and parameter estimation. Computers
& Chemical Engineering, 35(1), 2011.
[60] T. M. Low, F. D. Igual, T. M. Smith, and E. S. Quintana-Ortí. Analytical
models for the BLIS framework. ACM Transactions on Mathematical
Software, 2015. Pending.
[61] J. M. Maciejowski. Predictive Control with Constraints. Prentice Hall,
2002.
[62] J. Mattingley and S. Boyd. CVXGEN: a code generator for embedded
convex optimization. Optimization and Engineering, 13(1):1–27, March
2012.
[63] D. Q. Mayne, J. B. Rawlings, C. V. Rao, and P. O. M. Scokaert.
Constrained model predictive control: Stability and optimality. Automatica,
36(6):789–814, 2000.
[64] S. Mehrotra. On the implementation of a primal-dual interior point method.
SIAM Journal on Optimization, 2(4):575–601, 1992.
[65] G. O. Mutambara. Decentralized estimation and control for multi-sensor
systems. CRC press, 1998.
[66] I. Nielsen, D. Ankelhed, and D. Axehill. Low-rank modifications of
Riccati factorizations with applications to model predictive control. In IEEE
Conference on Decision and Control, pages 3684–3690. IEEE, 2013.
[67] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York,
2nd edition, 2006.
[68] B. O'Donoghue, G. Stathopoulos, and S. Boyd. A splitting method
for optimal control. IEEE Transactions on Control Systems Technology,
21(6):2432–2442, 2013.
[69] S. J. Qin and T. A. Badgwell. A survey of industrial model predictive
control technology. Control Engineering Practice, 11:733–764, 2003.
[70] C. V. Rao, S. J. Wright, and J. B. Rawlings. Application of interior-point
methods to model predictive control. Journal of optimization theory and
applications, 99:723–757, 1998.
[71] J. B. Rawlings. Tutorial overview of model predictive control. Control
Systems, 20(3):38–52, 2000.
[72] J. B. Rawlings and D. Q. Mayne. Model predictive control: theory and
design. Nob Hill Publishing, 2009.
[73] S. Richter, C. N. Jones, and M. Morari. Real-time input-constrained MPC
using fast gradient methods. In IEEE Conference on Decision and Control,
pages 7387–7393, 2009.
[74] S. Richter, M. Morari, and C. N. Jones. Towards Computational Complexity
Certification for Constrained MPC Based on Lagrange Relaxation and the
Fast Gradient Method. In IEEE Conference on Decision and Control, pages
5223–5229, 2011.
[75] A. Shahzad, E. C. Kerrigan, and G. A. Constantinides. A fast
well-conditioned interior point method for predictive control. In CDC, pages
508–513. IEEE, 2010.
[76] L. E. Sokoler, G. Frison, M. S. Andersen, and J. B. Jørgensen.
Input-constrained model predictive control via the alternating direction method
of multipliers. In IEEE European Control Conference, pages 115–120. IEEE,
2014.
[77] L. E. Sokoler, G. Frison, K. Edlund, A. Skajaa, and J. B. Jørgensen. A
Riccati based homogeneous and self-dual interior-point method for linear
economic model predictive control. In IEEE Multi-conference on Systems
and Control, pages 592–598. IEEE, 2013.
[78] L. E. Sokoler, G. Frison, A. Skajaa, R. Halvgaard, and J. B. Jørgensen.
A homogeneous and self-dual interior-point linear programming algorithm
for economic model predictive control. IEEE Transactions on Automatic
Control, 2015.
[79] L. E. Sokoler, A. Skajaa, G. Frison, R. Halvgaard, and J. B. Jørgensen. A
warm-started homogeneous and self-dual interior-point method for linear
economic model predictive control. In IEEE Conference on Decision and
Control, pages 3677–3683. IEEE, 2013.
[80] G. Stathopoulos, M. Korda, and C. N. Jones. Solving the infinite-horizon
constrained LQR problem using splitting techniques. In IFAC World
Congress, pages 2285–2290. IFAC, 2014.
[81] G. Stathopoulos, M. Korda, and C. N. Jones. Solving the infinite-horizon
constrained LQR problem using accelerated dual proximal methods.
arXiv:1501.04352, 2015.
[82] G. Stathopoulos, A. Szucs, Y. Pu, and C. N. Jones. Splitting methods
in control. In 13th IEEE European Control Conference, pages 2478–2483.
IEEE, 2014.
[83] M. C. Steinbach. A structured interior point SQP method for nonlinear
optimal control problems. Springer, 1994.
[84] H. Sutter. The free lunch is over: A fundamental turn toward concurrency
in software. Dr. Dobb's Journal, 30(3):202–210, 2005.
[85] F. G. Van Zee, T. Smith, F. D. Igual, M. Smelyanskiy, X. Zhang, M. Kistler,
V. Austel, J. Gunnels, T. M. Low, B. Marker, L. Killough, and R. A.
van de Geijn. The BLIS framework: Experiments in portability. ACM
Transactions on Mathematical Software, 2015. Accepted.
[86] F. G. Van Zee and R. A. van de Geijn. BLIS: A framework for rapidly
instantiating BLAS functionality. ACM Transactions on Mathematical
Software, 41(3):14:1–14:33, 2015.
[87] M. Vukov. Embedded Model Predictive Control and Moving Horizon
Estimation for Mechatronics Applications. PhD thesis, KU Leuven, April
2015.
[88] M. Vukov, S. Gros, G. Horn, G. Frison, K. Geebelen, J. B. Jørgensen,
J. Swevers, and M. Diehl. Real-time nonlinear MPC and MHE for a large-scale
mechatronic application. Control Engineering Practice, 45:64–78, 2015.
[89] Y. Wang and S. Boyd. Fast model predictive control using online
optimization. In IFAC World Congress, pages 6974–6997, Seoul, July 2008.
[90] Y. Wang and S. Boyd. Fast model predictive control using online
optimization. IEEE Transactions on Control Systems Technology, 18(2):267–278,
2010.
[91] R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical
optimization of software and the ATLAS project. Parallel Computing, 27:2001,
2000.
[92] S. J. Wright. Primal-dual interior-point methods. SIAM, 1997.
[93] M. Zanon, S. Gros, and M. Diehl. Rotational Start-up of Tethered
Airplanes Based on Nonlinear MPC and MHE. In Proceedings of the European
Control Conference, 2013.
[94] X. Zhang. OpenBLAS. http://www.openblas.net/.
