Abstract-Solutions to constrained linear and hybrid model predictive control (MPC) can be explicitly characterized in terms of piecewise linear (PWL) state feedback control. A PWL controller is pre-computed using parametric programming, and the exact explicit MPC implementation corresponds to the evaluation of a PWL function in the control unit. It has recently been shown that such evaluation can be boosted by binary search tree data structures. We report digital hardware architecture design results for this type of PWL control, and show that explicit MPC solutions can be implemented in a standard field programmable gate array (FPGA) or an application specific integrated circuit (ASIC) with about 20 kGates leading to computation times around one microsecond per evaluation for typical applications. This facilitates implementation of constrained MPC in small-scale industrial and consumer electronics applications that are characterized by fast sampling or low production cost, such as mechatronics, MEMS, rotating machinery, power electronics, and acoustics.
I. INTRODUCTION
Model predictive control (MPC) is a powerful control method that allows constraints and MIMO systems to be treated systematically. Until recently, MPC has been exclusively used for relatively slow systems, mainly process plants. However, fast implementation of model predictive control based on online optimization in real-time systems has been considered in [1] , [2] , [3] , [4] , [5] , [6] , [7] , [8] for applications such as MEMS and mechatronics. Several modifications of the conventional MPC formulations that reduces online computational complexity has also been suggested [9] , [10] .
Moreover, it has recently been shown that several classes of constrained MPC problems admit explicit piecewise linear (PWL) solutions. This includes constrained linear MPC [11] , [12] , approximate explicit PWL solutions to nonlinear constrained MPC [13] , [14] , and hybrid MPC [15] . Implementation of explicit MPC amounts to the relatively simple task of evaluating a PWL function that is pre-computed using multi-parametric programming.
A binary search tree representation of arbitrary polyhedral PWL functions was suggested in [16] , [17] . It leads to very low processing requirements for evaluation in the control unit, but requires additional memory to store a precomputed binary search tree data structure. In this work we report some results on hardware synthesis of PWL function evaluation based on such a data structure.
Compared to conventional MPC, that relies on extensive real-time numerical optimization, the benefits include verifiability, low computational complexity, no need for floating point arithmetics (no recursive numerical computations), and deterministic execution. The main limitation of the explicit MPC approach is that the offline computational load (for solving multi-parametric programs and generating the binary search tree data structure) and control unit memory requirements usually increase quickly with the dimension and complexity of the problem, making it useful mainly for small scale problems. Still, this may not be prohibitive in many application areas. The main contribution of the present paper is to consider the option of direct hardware implementation using a high-level hardware synthesis tool [18] , which may prove to be beneficial or cost-efficient in some applications.
II. EXPLICIT MODEL PREDICTIVE CONTROL
Below we give a short summary of linear MPC problems and their explicit solutions. Consider the linear system
n is the state variable, u t ∈ R m is the input variable, A ∈ R n×n , B ∈ R n×m and (A, B) is a controllable pair. The output and the control input are subject to the bounds
where y min < y max and u min < u max . For the current x t , MPC solves the optimization problem
subject to 
For ease of notation, we may in the sequel skip the index t, and use u for u t and x for x t . These problems can be reformulated and solved as the following multi-parametric programs:
1) mp-QP (p = 2):
where
T and is a vector of slack variables, see [19] for details.
3) Robust MPC (p = ∞). In [20] the authors show that when introducing uncertainty to the linear model (1) , that is
where v(t) and w(t) are unknown input disturbances and parametric uncertainties, respectively, a min-max optimization problem analogous to (3)-(9) can be solved by N mp-LPs. 4) mp-MILP (p = 1 or p = ∞): Here the linear model (1) is replaced by the piecewise affine model.
where f i ∈ R n are constant vectors and {X i } is a polyhedral partition of the state+input space. The problem can be reformulated as a mp-MILP, see [21] . Similar formulations can also be given for similar and extended MPC problems, e.g. with non-zero setpoint, input rate constraints, infeasibility handling etc. [11] . These control design methods described above result in PWL control functions k :
with a polyhedral partition P = {X 1 , ....., X N } where the polyhedral sets are represented by linear inequalities (hyperplanes)
1 Although this does not strictly define a norm, we choose this description for ease of notation.
for i = 1, ..., N . Such a partition satisfies intX i ∩intX j = ∅ for i = j (they intersect only at the boundary), where intX i denotes the open interior of the set X i . The PWL controller is completely characterized by the following matrices and vectors:
The controller output will be given by the PWL function u t = k(x t ) and the argument x t will typically change at every sampling instant. Controller implementation thus requires evaluation of a PWL function (17)- (18) at each sampling instant in the control unit.
In some variations of explicit MPC, such as [22] , the polyhedral sets X i are represented by vertices
where conv() denotes the convex hull. These representations are equivalent, but require some modification of the algorithms used for evaluation. In other variants of explicit MPC, such as [14] , [23] , the partition has an orthogonal structure (quad-tree or k − d-tree) that further reduces computational complexity since the partition consists of hyper-rectangles
rather than general polyhedra.
III. POLYHEDRAL PARTITION SEARCH STRATEGIES
A typical polyhedral partition used to define a PWL function solving an MPC problems contains several hundred or thousands of polyhedra, even if complexity reduction methods such as [24] , [25] are used. Often, neighboring regions contain the same linear mapping, cf. (17) , such that the number of linear mapping coefficient matrices to store is significantly less than the number of polyhedral regions that must be stored as linear inequality coefficient matrices.
A. Binary search tree representation
Evaluating a PWL function (17) for a given x consists of two steps:
• First, identify the polyhedral region index i such that
x ∈ X i , and • second, evaluate the linear function K i x + g i . The second step is completely straightforward, while the first step can be implemented using a computationally efficient strategy described in [17] . It builds a binary search tree data structure that supports efficient search for the polyhedral region X i that satisfies x ∈ X i . The overall idea is based on the observation that evaluation of a linear expression c i x − d i corresponding to a single hyper-plane cut c i x − d i = 0 may provide information that significantly reduce the number of candidate polyhedral regions. This is illustrated in Figure 1 , where the partition P = {X 1 , X 2 , X 3 , X 4 , X 5 , X 6 , X 7 } contains only 7 polyhedra. Consider the point x near the center of the area. In order to determine which polyhedron this point belongs to, the linear expression c 1 x − d 1 may be evaluated. This hyper-plane essentially cuts the partition into two parts: (below the hyper-plan) and {X 4 , X 5 , X 6 , X 7 } (above the hyper-plane). With the given x, the sign of the evaluated expression c i x − d i provides information that x is located above the hyper-plane, and one can infer that x ∈ X 4 ∪X 5 ∪ X 6 ∪ X 7 . In other words, 3 out of 7 polyhedral regions has been excluded, and the remaining problem is greatly reduced. This procedure can be repeated, as shown in the 2nd part of Figure 1 . By evaluating the linear expression
In other words, 2 out of 4 remaining regions have been eliminated and the problem is again greatly reduced. Since X 4 and X 5 are separated by a single hyper-plane, one can easily detect that x ∈ X 5 by evaluating a third linear expression. In summary, evaluation of three linear expressions is sufficient to determine which polyhedral region x belongs to in this case. We observe that each region is characterized by 3-4 linear inequalities, such that an exhaustive search may require the evaluation of more than 20 linear expressions in this case. For general algorithms to construct such a binary search tree, we refer to [17] . The procedure illustrated above is completely general, and corresponds to the construction and traversal of a binary search tree where at each node there is a linear expression corresponding to a hyper-plane that cuts the remaining partition into three parts: Polyhedra that are completely on one side of the hyper-plane, polyhedra that are completely on the other side of the hyper-plane, and polyhedra that are cut by the hyper-plane. Estimating that each of these parts are of similar size, each node in the search tree will exclude approximately 1/3 of the polyhedral regions. This leads to a search tree depth and computational complexity that is logarithmic in the number of polyhedra in the partition [17] .
B. Numerical round-off errors
Due to the complicated numerical computations that leads to the polyhedral representation of the PWL function, see the references in the introduction, the mathematical partition property intX i ∩ intX j = ∅ for i = j and X 1 ∪ X 2 ∪ ... ∪ X N = X (X is the whole region of interest) will hold only "approximately". This means that some regions may slightly overlap, and there may be small "numerical gaps" between some regions. The binary search procedure will automatically complete the partition since when a leaf node is reached, the corresponding linear function will be evaluated without regard to numerical errors in the representation of the polyhedra. This means that the linear functions will be arbitrarily extended to cover the small "numerical gaps". Likewise, the boundary linear functions will be arbitrarily extrapolated outside the original partition if x happens to be located outside the partition.
The insensitivity to numerical errors and lack of recursive numerical computations means that fixed point arithmetics may be utilized without any complications since round-off errors cannot accumulate. The accuracy of implementation will decrease gracefully as the number of bits used to represent the numerical data decreases.
C. Binary tree search algorithm
The PWL function evaluation problem consists of executing at each sample two sets of nested loops
• Tree search loop: Starting at the root node of the binary search tree the loop iterates through the nodes until a leaf node is reached. The leaf node identifies the linear expression to be evaluated to compute the PWL function value.
• Control evaluation loop: Evaluation of the linear expression k(x) = K i x + g i where the index i corresponds to the leaf node. This is a nested loop corresponding to a matrix multiplication operation. Together the two nested loops form a control block. The total time for execution of a control block is the sum of the times for the two parts.
IV. HARDWARE DESIGN RESULTS
In this section we consider how the binary tree search algorithm in section III-C can be implemented in digital hardware, and how the complexity of the resulting hardware design scales with the problem parameters. The key parameters characterizing the PWL function complexity are defined in Table I . 
A. Hardware design approach
The hardware design program PICO-Express (a product of Synfora Inc., http://synfora.com), based on the HP Labs PICO (Program In Chip Out) research [18] takes the C source code of PWL function evaluation, as described in section III-C, as an input along with some assumptions about the memory bandwidth. This program then computes an optimal architecture including cache size, functional register assignments, ALUs, etc. It also generates estimates of performance, execution time, memory requirements etc. PICO produces the hardware design expressed at the register-transfer level (RTL) in a hardware design language, either VHDL or Verilog. The RTL design is tested and verified by PICO as well. This design can be implemented in an FPGA or an ASIC. In either case, routing and placement tools, as well as libraries of parameterized macrocells (for the RTL level components such as adders, registers, etc.) are used to generate the FPGA or ASIC implementation.
The tree search and control evaluation loops are processed in a pipeline. It consists of a pipe-fill phase, the prolog, where iterations are initiated at regular intervals. These intervals are know as initiation intervals. No iterations are complete during the prolog. Next, the loop enters the steady-state where every initiation interval, one iteration is completed and one new iteration is initiated. When the loop nears termination, it enters the pipe-drain state, know as the epilog, in which the pipeline is drained by allowing iterations to complete without new iterations being initiated. Iterations complete every initiation interval until all iterations have completed.
The hardware design program tries to deliver a given initiation interval with a minimum gate count. An initiation interval of one can be achieved by PICO, since the inner loop recurrence is through an addition, and PICO can generate designs using a one-cycle adder. There is also an important dependence of larger latency. At the end of the inner loop, the dot product of the input/parameter vector x with one hyperplane normal c i is first compared with a constant d i , then a branch is taken to select one of two possible child nodes in the search tree, then the index of the selected node is used to start the lookup of the parameters of the next hyperplane from memory. To accommodate the latency of these operations and still achieve an initiation interval of one, the inner loop is padded with a few "slack" iterations at the front. These do nothing, but they increase the number of inner-loop iterations between the completing one dot product and starting the next, so that there are enough cycles to cover this latency. For these experiments, all the tables of hyperplane normals and pointer arrays that define the search tree structure are assumed to be stored in fast SRAM in the accelerator device. PICO can also synthesize designs in which data arrays are kept in main memory.
B. Benchmark problems
The characteristics of the benchmark problems are given in Table II . The double integrator and helicopter examples are described in more detail in [26] , and the quadruple integrator example in [27] .
C. Number of clock cycles per iteration
The total time to execute one control block in clock cycles is summarized in Table III and illustrated in Figure  2 . The number of clock cycles does not explicitly depend on the number of parameters but rather on the search depth of the tree. In fact the number of clock cycles, NClock is roughly given by
where SLACK is the number of padding iterations needed, as discussed above, and is 2 at 10 -20 MHz clock speeds and 4 for 200 MHz clock speeds. This variable represents the time it takes to set up the inner loop and the number of clock cycles for memory access. The first term is the search time and the second is the function evaluation term. Because the search depth increases as log 2 (M ), as expected, the execution time also increases as log 2 (M ). This scaling can continue until the number of nodes is so large that the data can no longer be held in scratch SRAM memory and must be added as general system memory. Then the access time jumps to roughly 50 nsec. For clock speeds of 10 and 20 MHz, this is not a problem but for 200 MHz clock speeds, this transition can lead to slower performance requiring a larger value for SLACK. In Figure 3 the dependence of the tree search clock cycles as a function of clock speed and number of nodes is shown. The dependence of the tree search on clock speed is not strong up to 200 MHz indicating that for these problems clock speed translates directly into improved control loop speed. It should be mentioned that for 200 MHz clock speeds, even the slowest benchmark problem can execute in about 1.1 μs.
D. Number of gates
The number of gates as a function of problem dimensions are given in Table III and Figure 4 . Basically, the number of gates (20 kGates) for the tree search does not depend strongly on the problem parameters. The number of gates for the function evaluation also does not increase much with problem complexity. This result occurs because, in all cases, we have fixed the performance at one loop iteration per clock cycle, and PICO has produced the least costly hardware that it can, while achieving this fixed (independent of parameters) throughput. Total latency, as indicated above, depends strongly on the parameters. If we were to change the performance requirement up or down, we would see an increase or a decrease in the gate count.
The size of the embedded SRAMs does not change, however, with performance, and these may dominate the chip cost. However, the number of SRAM cells scales directly with the number of search tree nodes, see Figure  4 . For larger problems, this cost is by far the dominant cost of implementing the chip.
V. DISCUSSION AND CONCLUSIONS If the clock frequency is assumed to be 200 MHz, then the control loop time would range from 0.2 μs to 1.1 μs for the benchmark problems. This result is rather important in that it indicates that control for mechanical, thermal, and acoustic time scales can be handled by these controllers with speed to spare. In particular, lower level constrained multi-dimensional controls in mechanical systems such as robotics, cars etc. can readily be implemented using hardware implementation of explicit model predictive control. Because the complexity of the problem scales strongly with the number of parameters, this number should be minimized in order to improve the performance. The biggest cost factor appears to be the memory requirements for the number of nodes M , but reducing the complexity of the PWL function representation is in fact an active area of research [16] , [26] , [28] , [24] , [25] , [23] .
In our case study, the hardware synthesis applied floating point arithmetics. Sufficient numeric accuracy will be achieved with implementation using fixed point arithmetics. This would lead to a reduction in memory requirements because data of lower resolution are stored, and in addition a significant reduction in the number of gates.
