Abstract-The general solution to constrained linear and piecewise linear model predictive control (MPC) has recently been explicitly characterized in terms of piecewise-linear (PWL) state feedback control. This means that a PWL controller can be precomputed using parametric programming, and the exact explicit MPC implementation amounts to the evaluation of a PWL function in the control unit. It has recently been shown that PWL function evaluation can be accelerated by searching a binary tree data structure, leading to highly efficient, accurate, and verifiable software implementation in low-cost embedded control units. In this work, we report hardware synthesis results for this type of PWL control, and show that explicit MPC solutions can be implemented in an application specific integrated circuit (ASIC) with about 20 000 gates, leading to computation times in the microsecond scale. This opens the way for the use of highly advanced control designs such as constrained MPC in small-scale industrial and consumer electronics application areas that are characterized by fast sampling or low cost, including mechatronics, microelectromechanical systems (MEMS), automotive control, power electronics, and acoustics. The main limitation of the approach is that the memory requirements increase rapidly with the problem dimensions.
I. INTRODUCTION
R ECENTLY, several control design and synthesis methods resulting in piecewise linear (PWL) state feedback control structures have been developed. These include exact explicit PWL solutions to constrained linear model predictive control (MPC) [1] - [3] , MPC of piecewise linear systems [4] , approximate explicit PWL solutions to nonlinear constrained MPC [5] , [6] , hybrid MPC [7] , in addition to optimal constrained control allocation problems [8] , [9] .
These control design methods result in PWL controller functions represented as if if . . . where is the input to the controller function, is the dimension of this vector, is the output dimension of the function , and and are gain matrices and vectors. The polyhedral sets of the polyhedral partition are represented by linear inequalities (half-spaces separated by hyper-planes) (2) for . Such a partition may be assumed to satisfy for (they intersect only at the boundary), where denotes the open interior of the closed set . The PWL controller is completely characterized by the following data:
. The controller output will be given by the PWL function and the argument will typically change at every sampling instant based on measurements, user input, and signals from a higher level control system. Controller implementation, thus, requires evaluation of a PWL function (1) and (2) at each sampling instant in the control unit.
In some variations of approximate explicit MPC, such as [10] , the polyhedral sets are represented by vertices
where denotes the convex hull. These representations are equivalent, but require some modification of the algorithms used for evaluation. In other variants of approximate explicit MPC, such as [6] and [11] , the partition has an orthogonal structure (quad-tree or -tree [12] , [13] ) that may reduce computational complexity since the partition consists of hyperrectangles (4) rather than general polyhedra.
A binary search tree representation of arbitrary polyhedral PWL functions (1) and (2) was suggested in [14] and [15] . It leads to very low requirements for processing in the control unit, but requires additional memory to store a precomputed binary search tree data structure. In this work, we report some results on digital hardware synthesis for PWL function evaluation logic based on such a data structure. Compared to conventional MPC, which relies on extensive numerical optimization in real time, the benefits of explicit PWL evaluation include simpler verification, low computational complexity, no recursive numerical computations, and deterministic execution. The main limitation of the explicit MPC approach is that the offline computational load (during synthesis) and control unit memory requirements usually increase quickly with the dimension and complexity of the problem, making it useful mainly for small scale problems. Still, this may not be prohibitive in some applications from areas such as automotive [16] - [18] , biomedical systems, aerospace, power electronics [19] , microelectronics/ microelectromechanical systems (MEMS) [20] , acoustics, and rotating machinery [5] . Recently, the problem of implementing explicit MPC efficiently on microprocessors and microcontrollers in practical application 1 has been given much attention [21] - [26] .
A different approach to a computationally efficient implementation of MPC is to make the core computations of a numerical online solver as fast as possible. Suboptimal and simplified MPC strategies [27] - [30] , allow computational complexity to be reduced at the cost of performance loss. A general purpose processor is used in [31] for the efficient implementation of MPC, while in [32] a parallel processing real-time architecture for MPC is investigated, and in [33] a field-programmable gate array (FPGA) implementation of an iterative numeric quadratic programming solver is proposed and tested. Dynamically reconfigurable analog/digital hardware capable of handling MPC computation requirements was proposed in [34] , dynamic scheduling of real-time MPC was considered in [35] , while [36] and [37] describe methods for reducing the precision of a microprocessor to the minimum while maintaining close to optimal control performance. A logarithmic number system (LNS)-based microprocessor was proposed in [38] for computational cost savings. Compared to explicit MPC, such approaches are expected to scale much better with respect to memory use as the dimensions of the problem increases.
The main contribution of the present paper is to consider the option of direct hardware synthesis of explicit PWL MPC controllers using a high-level hardware synthesis tool [39] . It may prove to be beneficial or cost efficient in some applications where an application-specific integrated circuit (ASIC) or FPGA implementation would be preferable to a microprocessor-based implementation.
II. POLYHEDRAL PARTITION BINARY SEARCH
A typical polyhedral partition used to define a PWL function solving a MPC problem contains several hundred or thousands of polyhedra, even if complexity reduction methods such as [40] and [41] are used. Often, neighboring regions contain the same linear mapping such that the number of linear mapping coefficient matrices to store is significantly less than the polyhedral regions that must be stored as linear inequality coefficient matrices.
Evaluating a PWL function (1) for a given consists of the following two steps:
• identify the polyhedral region index such that ; • evaluate the corresponding linear function . The second step is completely straightforward, while the first step can be implemented in at least two ways, as described below (see, also, [42] for a third way). Direct implementation of the first step by sequentially searching through each polyhedral region in order to determine the one that satisfies is a simple, but computationally inefficient strategy. In the worst case, matrix operations of the form must be carried out, which is computationally expensive since the number of polyhedral regions may amount to several thousands (see [2] for worst case complexity results). The use of a binary search tree to organize the search is a more efficient strategy [15] , and will be used here.
A. Binary Search Tree Representation
The PWL evaluation strategy described in [15] is to build a binary search tree data structure that supports efficient search for the polyhedral region that satisfies for any given . The overall idea is based on the observation that evaluation of a linear expression corresponding to a single hyperplane cut may significantly reduce the number of candidate polyhedral regions. This is illustrated in Fig. 1 , where the partition contains 7 polyhedra. Consider the point near the center of the area. In order to determine which polyhedron this point belongs to, the linear expression may be evaluated. This hyperplane cuts the partition into two parts:
(below the hyperplane) and (above the hyperplane). With the given , the sign of the evaluated expression shows that is located above the hyperplane, and one can infer that . In other words, 3 out of 7 polyhedral regions have been excluded, and the remaining problem is greatly reduced. This procedure can be repeated, as shown in the second part of Fig. 1 . By evaluating the linear expression one can infer that . In other words, two out of four remaining regions have been eliminated and the problem is again greatly reduced. Since and are separated by a single hyperplane, one can easily detect that by evaluating this third linear expression. In summary, evaluation of three linear expressions is sufficient to determine which polyhedral region belongs to in this case. We observe that each region is characterized by three or four linear inequalities, such that an exhaustive search may require the evaluation of more than 20 linear expressions in this case. For general algorithms to construct such a binary search tree, we refer to [15] and remark that the computational benefits are relatively much more significant in larger examples.
The procedure illustrated in Fig. 1 is completely general, and corresponds to the construction and traversal of a binary search tree where at each node there is a linear expression corresponding to a hyperplane that cuts the remaining partition into three parts: Polyhedra that are completely on one side of the hyperplane, polyhedra that are completely on the other side of the hyperplane, and polyhedra that are cut by the hyperplane. Estimating that each of these parts are of similar size, each node in the search tree will exclude approximately 1/3 of the polyhedral regions. This leads to a search tree depth (5) that is estimated to be logarithmic in , the number of polyhedra in the partition [15] .
B. Binary Tree Search Algorithm
The PWL function evaluation problem consists of executing at each sample two sets of nested loops. • Tree search loop: Starting at the root node of the binary search tree the loop iterates through the nodes until a leaf node is reached. At each loop iteration, an inner loop evaluates the hyperplane cut . A binary tree branch is based on the sign of this expression. The leaf node identifies the linear expression to be evaluated to compute the PWL function value.
• Control evaluation loop: Evaluation of the linear expression , where the index corresponds to the leaf node. This is a nested loop corresponding to a matrix multiplication operation. Together the two loops form a control block. The total time for execution of a control block is the sum of the times for the two parts.
C. Numerical Roundoff Errors
Due to roundoff errors in numerical computations that leads to the polyhedral representation of the PWL function (see [2] and [3] ), the mathematical partition property for and ( is the whole region of interest) will hold only approximately due to numerical errors. This means that some regions may slightly overlap, and there may be small gaps between some regions. The binary search procedure will automatically complete the partition since when a leaf node is reached, the corresponding linear function will be evaluated without regard to numerical errors in the representation of the polyhedra. This means that the polyhedra will be extended to cover the small gaps. Likewise, overlapping regions will be uniquely resolved, and the boundary linear functions will be extrapolated outside the original partition if happens to be located outside the partition.
Numerical roundoff errors cannot accumulate in the tree search loop since the only information carried from one iteration to the next is the binary branching decision. The resulting insensitivity to numerical errors means that fixed-point arithmetics may be utilized as an alternative to floating-point arithmetics without any complications. Consequently, the accuracy of implementation will degrade gracefully as the number of bits used to represent the numerical data decreases (roundoff errors), in the sense that numerical instability will not occur. The hardware synthesis tests that we performed used 32-bit integer arithmetic to represent the PWL function parameters and 16-bit integers for array indices. Less costly circuits may be achievable if shorter integers can be used, but scaling and accuracy becomes an issue.
III. HARDWARE SYNTHESIS
In this section, we consider digital hardware synthesis of the binary tree search algorithm in Section II-B, and how the complexity of the resulting hardware design scales with the problem parameters. The key parameters characterizing the PWL function complexity are defined in Table I .
A. Hardware Synthesis Approach
The hardware design program PICO-Express (a product of Synfora Inc.), based on the HP Labs PICO research [39] takes the C source code of the PWL function evaluation algorithm, as described in Section II-B, as an input along with some assumptions about the memory bandwidth. This program then computes an efficient hardware architecture including cache size, functional register assignments, and arithmetic logic units (ALUs). It also generates estimates of performance, execution time, and memory requirements. PICO produces the hardware design expressed at the register-transfer level (RTL) in a hardware design language, either VHDL or Verilog. The RTL design is tested and verified by PICO as well. This design can be implemented in an FPGA or an ASIC. In either case, routing and placement tools, as well as libraries of parameterized macrocells (for the RTL level components such as adders and registers) are used to generate the FPGA or ASIC implementation. The main differences between ASIC and FPGA implementations are as follows. ASICs can be faster, can include analog as well as digital signals, can demonstrate lower power consumption and are less expensive for large unit volumes. FPGAs are more flexible during design time, less expensive for small lots, and require less lead time.
The tree search loop and control evaluation loop are processed in a pipeline. It consists of a pipefill phase, the prolog, where iterations are initiated at regular intervals. The initiation interval is the number of clock cycles between sequential starts of the inner loop. No iterations are complete during the prolog. Next, the loop enters the steady-state where for every initiation interval one iteration is completed and one new iteration is initiated. When the loop nears termination, it enters the pipedrain state, known as the epilog, in which the pipeline is drained by allowing iterations to complete without new iterations being initiated. PICO synthesizes a hardware design with the requested initiation interval using heuristics to reduce the hardware cost.
(If an unachievable initiation interval is requested, PICO reports that the requested throughput is impossible using its library of functional elements.)
Both the tree search loop and the control evaluation loop are doubly nested loops. In the tree search loop, whose synthesis we report here, the outer loop is a loop over depth in the search tree. The inner loop is a loop of iterations for the evaluation of the dot product of the input vector with one hyperplane normal vector . An initiation interval equal to one can be achieved by PICO, since the inner loop recurrence (evaluation of ) is through an addition, and PICO can generate designs using a one-cycle adder. There is also an important dependence of larger latency. At the end of the inner loop, the dot product is first compared with a constant , then a branch is taken to select one of two possible child nodes in the search tree, then the index of the selected node is used to start the lookup of the parameters of the next hyperplane from memory. To accommodate the latency of these operations and still achieve an initiation interval equal to one, the inner loop is padded with a few "slack" iterations at the front. These have no functionality, but they increase the number of inner-loop iterations between the completion of one dot product and the start of the next one, so that there are enough cycles to cover this latency. For the case studies, all the tables of hyperplane normals and pointer arrays that define the search tree structure are assumed to be stored in fast SRAM in the accelerator device. PICO can also synthesize designs in which data arrays are kept in main memory.
B. Benchmark Problems
The characteristics of the benchmark problems used in our case study are given in Table II . The double integrator and helicopter examples are described in more detail in [14] , and the quadruple integrator example in [42] . The MPC horizon is the number of samples the MPC looks into the future when optimizing the control input in order not to violate the constraints at some future point in time. Please see these references for a description of the control design, performance, and additional details.
C. Number of Clock Cycles Per Control Blocks
The total number of clock cycles required to execute one control block is summarized in Table III and illustrated in Fig. 2 . The number of clock cycles does not explicitly depend on the number of parameters but rather on the search depth of the tree. The number of clock cycles , is approximately given by (6) where is the number of padding iterations needed, as previously discussed. It is 2 at 10-20 MHz clock speeds and 4 for 200 MHz clock speeds. This variable represents the time it takes to set up the inner loop and the number of clock cycles for memory access. The first term is the number of clock cycles for the tree search; the second is the number of clock cycles needed for evaluation of once the appropriate polyhedron has been determined. Because the search depth increases as , the execution time also increases as . This scaling can continue until the number of nodes is so large that the data can no longer be held in scratch SRAM memory and must be added as general system memory. The limits depend on the technology of the chip but roughly 20 MB of SRAM is put on current Intel chips so this represents a break point. In the future, this number will go up. Beyond this limit, the access time jumps to roughly 50 ns. For clock speeds of 10 and 20 MHz, this is not a problem, but for 200 MHz clock speeds, this transition can lead to slower performance requiring a larger value for . In Fig. 3 , the dependence of the tree search clock cycles as a function of clock speed and number of nodes is shown. The dependence of the tree search on clock speed is not strong up to 200 MHz indicating that for these problems clock speed translates directly into improved control loop speed. It should be mentioned that for 200 MHz clock speeds, even the slowest benchmark problem can execute in about 1.1 s. 
D. Number of Gates and Chip Circuit Area
The number of gates as a function of problem dimensions are given in Table III and Fig. 4 . Basically, the number of gates (20 kGates) for the tree search does not depend strongly on the problem parameters. The number of gates for the function evaluation also does not increase much with problem complexity. This result occurs because, in all cases, we have fixed the performance at one loop iteration per clock cycle, and PICO has produced the least costly hardware that it can, while achieving this fixed (independent of parameters) throughput. Total latency, as previously indicated, depends strongly on the parameters. If we were to change the performance requirement up or down, we would see an increase or a decrease in the gate count.
The size of the embedded SRAMs does not change, however, with performance, and these may dominate the chip cost. The number of SRAM cells scales directly with the number of search tree nodes, see Fig. 4 . For larger problems, this cost is by far the dominant one for implementing the chip and may be prohibitive for some applications.
PICO's output is a hardware design expressed in Verilog, which is an industry-standard hardware description language. In order to generate an ASIC implementation, the Verilog specification [combined with the rest of the specification of an entire system-on-chip (SoC) design, normally] is input to a logic-synthesis step, which in turn creates a netlist, input to a place-androute step. We used a standard logic synthesis tool, Synopsys dc Ultra, on the Helicopter design. The process targeted was TSMC at 0.13 m. Synopsys gives a process-specific estimate of chip area for an ASIC implementation. For this design, its estimate is 91 000 (square microns); of this total, 53 000 is for sequential circuit elements (the memory used for the hyperplace normals takes up most of this) and 38 000 is for the combinational circuit elements, the gates.
E. Discussion
If the clock frequency is assumed to be 200 MHz, then the control loop execution time would range from 0.2 to 1.1 s for the benchmark problems, see Table III and Fig. 5 . This result is rather important in that it indicates that control for mechanical, thermal, and acoustic time scales can be handled by these controllers with speed to spare, e.g., [16] - [18] . In particular, lower level constrained multidimensional controls in mechanical systems may readily be implemented using hardware implementation of explicit MPC. Because the complexity of the problem grows with the dimension of the parameter vector and the MPC horizon , [2] , [14] these numbers should be minimized in order to improve the performance. The biggest cost factor appears to be the memory needed for storing the hyperplane normals; reducing the complexity of the PWL function representation is in fact an active area of research [4] , [11] , [14] , [40] , [41] , [43] . Moreover, a tradeoff between computational and memory requirements can be made by constructing a binary search tree of less depth such that at each leaf node a number of candidate linear expression can be compared using a sequential search [44] or other evaluation methods [42] .
IV. CONCLUSION
We have shown that small-scale explicit MPC solutions exactly represented as PWL functions can be efficiently implemented in an ASIC using about 20 kGates and resulting in execution times around 1 s with a 200-MHz clock frequency. The binary search tree representation of polyhedral PWL mappings is the key data structure that allows an efficient binary tree search to be applied. The cost of implementation is largely determined by the requirement for memory to store the data structures needed to hold the PWL functions and the associated search tree. Methods for complexity reduction and PWL function approximation will greatly reduce implementation cost.
