Abstract-In this paper we propose a digital architecture suited for fast, low-power and small-size electronic implementation of PieceWise Affine (PWA) functions defined over n-dimensional domains partitioned into multi-resolution hyperrectangles. The point location problem, which requires most of the computational effort, is solved through an orthogonal search tree, which is easily and efficiently implementable. In the case of domains partitioned into single-resolution hyperrectangles, a simpler and even faster architecture is proposed. After introducing the new architectures, their key features are discussed and compared to previous architectures implementing PWA functions with domains partitioned into different types of polytopes. Case studies concerning the FPGA implementation of so-called explicit Model Predictive Control (MPC) laws for constrained linear systems are used as benchmarks to compare the different architectures.
Abstract-In this paper we propose a digital architecture suited for fast, low-power and small-size electronic implementation of PieceWise Affine (PWA) functions defined over n-dimensional domains partitioned into multi-resolution hyperrectangles. The point location problem, which requires most of the computational effort, is solved through an orthogonal search tree, which is easily and efficiently implementable. In the case of domains partitioned into single-resolution hyperrectangles, a simpler and even faster architecture is proposed. After introducing the new architectures, their key features are discussed and compared to previous architectures implementing PWA functions with domains partitioned into different types of polytopes. Case studies concerning the FPGA implementation of so-called explicit Model Predictive Control (MPC) laws for constrained linear systems are used as benchmarks to compare the different architectures.
Index Terms-Digital circuits, digital control, field programmable gate arrays, piecewise linear techniques, predictive control.
I. INTRODUCTION
T HE need for circuits evaluating nonlinear functions in real time arises in many engineering problems where it is not possible (usually due to size limitations) to resort to personal computers or other general-purpose devices, such as DSP boards. Applications that may benefit of the availability of electronic circuits implementing multivariate nonlinear functions can be found in many different fields, ranging from biology (neural networks) [1] , to computer simulation (Internet models) [2] , economics (heterogeneous agents models) [3] , electrical power grids [4] , and so forth. Another appealing application of circuits implementing nonlinear functions is the synthesis of embedded control systems where the control law has a nonlinear dependence on the state variables [5] , [6] .
Among possible techniques for the approximation and circuit synthesis of nonlinear functions, piecewise affine (PWA) techniques present the advantages of being versatile and having the ability of approximating any nonlinear system (or nonlinear control law) arbitrarily well [7] - [10] .
PWA functions, which already have a long history in circuit theory [11] - [16] and control theory [17] , [18] are characterized by expressing the output values in terms of a collection of affine functions and corresponding domains. In particular, given an input point, the value of the function is expressed by an affine map whose coefficients depend on the region containing the input. Recently, the attention in the control community for PWA functions has increased due to the discovery that popular optimization-based control laws, called Model Predictive Control (MPC), can be written explicitly as a PWA state feedback law defined using a polytopic partition of the feasible set [19] , [20] . However, the resulting partitioning is typically irregular and often consists of a huge number of regions. As a consequence, the problem of determining the region the state belongs to, called the point location problem [21] , requires a large amount of computational effort. To solve this problem, in [5] a binary search tree is built offline to minimize the resulting depth. By exploring the tree online, the polytope containing the input can be located in a smaller time, if compared to a combinatorial approach. Digital architectures for the implementation of these PWA Generic (henceforth denoted as PWAG) functions have been recently proposed, solving the point location problem through a VLSI [6] or a FPGA [22] implementation of the binary search tree proposed in [5] . However, the irregular shapes of the polyhedral regions can still lead to an enormous computation time for large-scale problems. In [23] , a lattice approach is introduced to allow a reduction of online computation and storage when dealing with explicit MPC solutions that have many polyhedral regions with equal affine control laws. In [24] , algorithms for online evaluation of PWAG functions, based on a truncated binary search tree and/or lattice representation are presented, while in [25] the implementation of PWAG functions is simplified by a two-stage algorithm utilizing the concept of hash tables together with the usual direct search. The latter two methods enable the designer to tradeoff between (off-line and on-line) time and storage complexity. A digital architecture implementing PWA functions based on a lattice representation has been proposed in [26] , [27] .
1549-8328/$31.00 © 2012 IEEE Another way to improve the computation efficiency in the online evaluation of PWAG functions is to approximate the optimal PWAG control law through suboptimal PWA control laws defined over more regularly-shaped partitions [25] , [28] - [32] . For instance, the point location problem can be highly simplified by using a simplicial [33] or (hyper)rectangular [30] - [32] partition of the domain. PWA Simplicial (PWAS) functions can be easily implemented on specific digital or mixed-signal architectures [8] , [34] , [35] . Two completely digital solutions suitable for FPGA implementation, one mainly serial and one fully parallel (henceforth referred to as architectures A and B, respectively), were proposed in [10] . Generalizations of them in [36] allow to represent PWAS functions defined over non-uniform partitions, but the number of coefficients required still grows exponentially with the number of dimensions, thus making this approach affected by the "curse of dimensionality".
The shape of a PWA function is coded by a set of coefficients that are embedded in the circuit, and one of the main problems the circuit designers are concerned with is to minimize the number of coefficients, i.e., to compress the information since this heavily affects the overall circuit size. This can be fruitfully done by resorting to multi-resolution methods (see, e.g., [37] , [38] in the field of wavelets), where the domains over which the function is affine are characterized by regularity and self-similarity in scale. The circuit architectures proposed in this paper rely on a multiresolution-based compression algorithm for the approximation of PWAG (control) functions by PWA functions defined over a multi-resolution hyperrectangular partition of the domain (which we will denote as PWAR functions) [30] - [32] . We remark that the proposed method has been adopted not owing to its compression performances (other algorithms are more efficient from this point of view), but since it is suitable for circuit implementation. As stated above, from a circuit implementation perspective, a multi-resolution partition of the domain [29] - [32] , [39] is of particular interest since it provides a mitigation to the curse of dimensionality.
Since suitable digital implementations of PWAR functions are still missing, the focus of this paper is on introducing suitable architectures to evaluate (possibly discontinuous) PWAR functions, in particular those with a multi-resolution partition. The point location problem is solved through an orthogonal tree [30] , [31] , 1 which provides fast on-line computation and a low circuit complexity for large-scale problems. In the case the domain is partitioned into single-resolution hyperrectangles, a simpler and even faster architecture is proposed.
The outline of the paper is as follows. Some notations together with the necessary preliminaries are introduced in Section II. In Section III the procedure adopted for the evaluation of PWAR functions is presented, with a particular focus on the point location strategy. Two different implementations are proposed in Section IV for each of the new architectures and they are compared with other existing circuit solutions in Section V. The results obtained by the FPGA implementations for two case studies of MPC are used to point out the main 1 Note that the type of search tree used here (specifically, a generalized octree or -tree [21] ) is different from the binary search tree used e.g., in [5] . features of the different digital realizations in Section VI. Section VII contains the concluding remarks.
II. PRELIMINARIES
We call a collection of regions , a partition of a domain , if and for . If in addition all sets are polytopes, we call it a polytopic partition. A scalar function on a domain is called a PWAG scalar function if there is a polytopic partition of and coefficients and such that satisfies
where denotes the transpose of matrix . Fig. 1(a) shows an example of a polytopic partition of a two-dimensional domain. While PWAG functions are a rich class of non-linear functions, from an implementation point of view it is convenient (as we will show in Section IV) to consider subclasses of PWA functions defined over domains partitioned into more regularly shaped regions. In the next subsections we define three particular partitions of compact domains: simplicial, single-resolution hyperrectangular and multi-resolution hyperrectangular.
A. Simplicial Partition
The hyperrectangular compact domain can be subdivided by using a simplicial boundary configuration [40] such that the -th dimensional component of contains subintervals of ampli-tude . Thus, is partitioned in a particular manner (see [40] for details) into hyperrectangles, each hyperrectangle containing nonoverlapping hypertriangular simplices, whose vertices are collected in the set . For circuit implementation reasons, we choose , with , so that the resulting number of vertices for each dimension is an exact power of two.
We call the obtained partition a simplicial partition [33] , [41] , and we can use it to define a subclass of continuous PWAG functions henceforth referred to as PWAS. In fact, PWAS functions are all PWAG functions, as in (1), in which the underlying partition is a simplicial partition and the function is continuous. The interested reader is referred to [28] for further information.
As we shall see in the following, from a circuit-synthesis point of view, it is convenient to handle variables ranging over scaled intervals. For this reason we introduce an affine transformation of the original input variable to map the aforementioned hyperrectangular domain to a domain , where the hyperrectangles become hypercubes and the amplitude of the subintervals is [22] , [34] . Fig. 1(b) shows a scaled two-dimensional domain partitioned into simplices with .
B. Single-Resolution Hyperrectangular Partition
Also in this case we subdivide each axis of the hyperrectangular domain into subintervals of amplitude , but we choose , with , so that the resulting partition contains hyperrectangles of identical shape. 2 We call the obtained partition a single-resolution hyperrectangular partition; the term single-resolution refers to the fact that we choose a single resolution for each axis, thus leading to a domain partitioned into equally-sized hyperrectangles. All PWAG functions defined over such a regular partition form a subclass of PWAG functions, referred to as sPWAR (single-resolution PWA hyperRectangular). For circuit implementation reasons, we restrict our attention to PWAR functions defined over an -dimensional compact domain . Each coordinate axis is then divided into subintervals of unitary length . Fig. 1(c) shows a two-dimensional example of a scaled domain partitioned into single-resolution hyperrectangles with and .
C. Multi-Resolution Hyperrectangular Partition
Consider a similar procedure as above, where each state axis of the domain is divided not just into intervals with a single 2 The reason why it is preferable to choose a different number of subdivisions in the simplicial and in the hyperrectangular cases is due to the way in which the PWA expressions are evaluated in the two cases. Indeed, the PWAS function evaluation is based on the values of the function at the vertices of the partition, as detailed in Section V.B, whereas in the hyperrectangular case the function evaluation requires the values of the coefficients for each hyperrectangle. As we will see, this difference in the evaluation procedures justifies the choice of the number of subintervals for each case, as well as the different affine mappings introduced.
resolution given by , but instead divided subsequently into higher and higher levels of resolution by halving the subintervals each time (also called dyadic or binary division), up to some chosen maximum resolution. That is, , where , with levels of resolution . This procedure of refinement results in a collection of overlapping sets corresponding to different levels of resolution , also referred to as levels of refinement or simply levels. Clearly their union is the whole domain. By selecting now regions , out of this collection, such that they still cover , but are non-overlapping, i.e., and for , a partition of is obtained. The resulting partition contains at most hyperrectangles , thus
. We call such a partition a multi-resolution hyperrectangular partition.
All (possibly discontinuous) PWAG functions of the form (1) based on a multi-resolution hyperrectangular partition where is the total number of multi-resolution hyperrectangular regions, form a subclass of PWAG functions referred to as mPWAR (multi-resolution PWA hyperRectangular) functions.
For circuit implementation reasons we introduce again an affine mapping of the original domain, passing from to the scaled domain , where the multi-resolution hyperrectangles become hypercubes. Fig. 1(d) displays a two-dimensional example of a such a partition with levels and regions.
III. PWAR FUNCTION EVALUATION
On-line evaluation of PWAG functions as in (1) for an input is usually performed in three steps: 1) solve the point location problem to obtain the index of the region that contains the current input, i.e., ; 2) obtain the function parameters in (1) from a lookup  table; 3) construct the output by evaluating the affine expression . From a circuit point of view, the second and third steps can be carried out in a simple way by a few functional blocks. Indeed, what we need is a memory to store the values of the coefficients and a multiplier and an adder to evaluate the affine expression. In particular, once index is known, the corresponding and are supposed to be already available, as a result of off-line computation. The first step, known as the point location problem, can become computationally very expensive, especially in case of irregular partitions and/or a large input dimension.
To reduce the computational complexity of the point location step, approximation methods have been proposed in recent years to obtain functions on partitions with more regular shapes such as simplices [28] , [29] or hyperrectangles [30] - [32] . For PWAG and PWAS forms, several circuit implementations have already been proposed [6] , [8] , [10] , [22] , [42] . In this paper, we will focus on circuit implementations of the PWAR family of functions, for both the single-and multi-resolution case.
A. Point Location for sPWAR Partitions
When having a sPWAR function, the point location problem can be solved through a simple strategy resulting in short computation times.
Once we have fixed the resolution for each input variable, , in the single-resolution case, each of the domain axes is partitioned into subintervals, thus leading to a total of single-resolution hyperrectangular regions. This means that if we use bits to code the components of the input variable in the domain , each hyperrectangle can be univocally determined by the most significant bits (MSBs), coding the signed integer part of each input component, . Let us define the operator as the operator extracting the signed integer part of the input vector . Then, we define as the binarising operator that, given a vector of signed integer values and a vector of precisions for , returns a -long string of bits concatenating the 2's complement binary values of the elements of the vector starting from down to . For instance, suppose that the current input belongs to the gray hyperrectangle in Fig. 1 (c) for a two-dimensional case, with and , then . Therefore, allows to compute an unambiguous binary label for the single-resolution hyperrectangle the current input belongs to, and we can map each of the single-resolution hyperrectangles making up the partition to the corresponding binary label simply by joining the MSBs from each of the input components: (2) This simple operation can be performed in only one clock cycle in a parallel fashion, as detailed in Section IV.B.
Remark 1: We point out that the binary label does not correspond to the mere binary conversion of since this would not allow the efficient digital implementation (detailed in Section IV.B) of the point location strategy described in this section.
B. Point Location for mPWAR Partitions
The main idea at the base of our multi-resolution approach is to build offline an orthogonal search tree made up of a number of nodes , such that the real-time search complexity is logarithmic with respect to the number of regions. In the following, we often refer to the multi-resolution regions as hypercubes instead of hyperrectangles, since the hyperrectangles of the original domain become hypercubes in the scaled domain . The root of the tree corresponds to the initial partition , the non-leaf nodes correspond to hypercubic regions, which are in turn partitioned into higher-resolution hypercubes, whereas the leaf nodes represent the hypercubic multi-resolution sets . The level of depth where the leaf nodes are located can vary from 1 to . When exploring the tree online from the root to a leaf node, one ends up with a region containing the input and leading to the corresponding control law, characterized by the coefficients and .
The orthogonal tree depth influences the maximum time required by the on-line tree exploration, whereas the total number of nodes influences the circuit dimensions. Each non-leaf node has children representing the refined hypercubes within the current region, thus leading to an upper bound of nodes. Fig. 2 (a) shows an example, where a two-dimensional domain is partitioned at a maximum of levels of refinement with regions. The corresponding orthogonal search tree, illustrated in Fig. 2(b) , is characterized by nodes. Remark 2: We choose to label both the nodes and the regions in increasing order, primarily with respect to the refinement level, and secondarily with the dimensional component within each hypercube (see Fig. 2 ). The orthogonal tree structure is particularly suited for an FPGA implementation with fixed point arithmetic, since each node can be easily related to a state of a Finite State Machine (FSM). Exploiting the regularity of the hypercubic partition, we can indeed associate each node of the tree (and thus each state of the related FSM) to an unambiguous binary label that can be easily computed from the Most Significant Bits (MSB) of the input variable .
Let us define as the binarising operator that, given a scalar precision , returns a -long string of bits concatenating the 2's complement binary values of the elements of the vector starting from down to . Therefore, allows to compute an unambiguous binary label for each node making up the tree:
This simple operation can be performed in a maximum of clock cycles in a parallel fashion, regardless of the total number of inputs. The on-line exploration of the tree, and therefore the computation of the label , is described below.
Remark 3: The binary label does not correspond to the mere binary conversion of , since this would not allow the efficient digital implementation (detailed in Section IV.A) of the point location algorithm described below.
At the first step, i.e., at , the MSB is checked in parallel for all the input variables. If the input belongs to a level-1 hypercube, the corresponding leaf node is uniquely determined by the first digit taken from each input component and the point location procedure ends. E.g., in the example of Fig. 2(a) , region 1 is identified by . Otherwise, at the second step, i.e., at , we must check the -th bit from each input component. If the input belongs to a level-2 hypercube, the point location procedure ends and we obtain a -long label that identifies the current leaf-node (e.g., in the example of Fig. 2 
(a), region 2 is identified by
). In general, at each step, i.e., at each clock cycle, the -th bit is checked in parallel for all the input variables. This point is stressed by the couple of gray digits beside each branch of the tree in Fig. 2(b) , where the left digit represents the -th bit of , whereas the right one is the -th bit of . In an -dimensional case we would have digits where the leftmost digit comes from while the rightmost one comes from . This search is repeated up to a maximum of clock cycles, unless a leaf node has already been reached before. Once a leaf node is reached, the address from which to retrieve the coefficients in the circuit memory can be outputted as a simple binary conversion of the number of the current region. Indeed, if we define , each region making up the multi-resolution partition can be labeled by a string of bits representing the binary conversion of the region number . We now define as the binarising operator that, given an integer scalar value, returns a string of bits corresponding to the binary unsigned representation of the provided scalar value. Therefore each multi-resolution hypercube making up the partition can be labeled by: (4) A simple example of application of the point location algorithm is given below.
Remark 4: The main drawback exhibited by the proposed approach is then represented by the total number of FSM nodes, which for large-scale applications with a high level of refinement may become too large to fit on a (particular) FPGA. In order to reduce this problem, in the next section we describe an enhancement in the FSM implementation, which allows to avoid the instantiation of the states belonging to the last level of refinement .
Example: In the following, we provide a simple two-dimensional example to show how the search strategy works online. On the domain axes of the example shown in Fig. 2(a) , the first binary digits coding the signed integer part of and normalized in the range are shown. To see how the search strategy works, suppose that the current input belongs to region 23 (gray hypercube in Fig. 2(a) ), i.e.,
. Starting from the root of the tree, we first check the MSB from each input component, thus obtaining the string "00" and the binary label . Taking the corresponding branch of the tree, we end up in a node (node 4 in Fig. 2(b) ) which is a non-leaf node, thus we go on checking the -th bit from each input component, which provides us with the string "10". Again, we take the corresponding branch of the tree and we end up in node 15, associated with the binary label which is a non-leaf node. Finally, we check the -th from each input component, thus obtaining the string "01". This final step brings us to node 30, associated with the binary label which is the leaf node corresponding to region 23. Once the leaf node is reached, the address from which to retrieve the coefficients in the circuit memory can be outputted as a string of bits: .
IV. CIRCUIT IMPLEMENTATIONS OF MPWAR AND SPWAR FUNCTIONS
In this section we give a description of the main blocks composing the digital architecture for the implementation of mPWAR functions using the solution to the point location problem proposed in Section III.B. A different architecture is proposed for the implementation of sPWAR functions. Moreover, we present an alternative implementation of the proposed architectures, which allows a lower latency at the cost of a higher number of multipliers. The first proposed architectures, mainly serial, will be denoted by letter A, whereas the alternative architectures, mainly parallel, will be denoted by letter B. The proposed architectures have been realized through a VHDL description which is fully parametric, i.e., all the parameters can be tuned at the user's demand. In this way it is possible to reprogram the digital architectures in order to represent different mPWAR and sPWAR functions by simply tuning a few parameters.
A. The mPWAR(A) Architecture
The computation of the mPWAR function is carried out through the tree steps described at the beginning of Section III. A simplified scheme of the mPWAR architecture is shown in Fig. 3 . For ease of presentation, we show the particular case of a two-dimensional mPWAR function implementation. The CK and RESET signals as well as the control signals are identified by dashed lines. Since signal lines may cross, we explicitly show when crossing lines are connected by the presence of a dot where the lines cross.
The architecture is made up of six main blocks:
• Input for a synchronous and parallel acquisition of the input vector , represented with bits; this means that all the bits of the inputs are supposed to be available at the same time; • FSM, implementing the orthogonal search tree described in Section III.B to find the -th region the input belongs to; • Memory_i blocks, storing the coefficients and ; • A Multiply-and-Accumulate (MAC) block performing the products sequentially and adding them in the internal accumulator; • A two-input Adder, adding together the value of the accumulator output from the MAC block and the constant term from Memory output; • Control, a block managing the control signals. The other inputs to the circuit are the clock signal CK, responsible for the synchronization of the whole circuit, and the RESET signal. The overall timings of the circuit are managed by the Control block, which is a FSM whose internal state, assigned to the signal sel in Fig. 3 , can assume the integer values in the range . The control signals guarantee the synchronization between the operations performed by the digital architecture. The restart signal states that the computation of the previous output has completed and the circuit is ready for the acquisition of a new input. When restart is asserted, the FSM block is reset to state 0, corresponding to the root of the orthogonal tree described in Section III.B. The go signal is sent from the FSM block to the Control block and it is asserted when the point location has concluded and the on-line exploration of the orthogonal tree has led to a leaf-node, thus allowing the architecture to retrieve from memory the coefficients for the evaluation of the mPWAR expression. The ready control signal states if the output value from the circuit is valid or not, so the output is discarded when . The timings of the architecture are shown in Fig. 4 . Once the RESET signal is deasserted, the circuit starts processing. In the first Clock Cycle , the input variable is acquired by the Input block and the MSBs of each input component are sent to the FSM through the signal , which is a bus of bits. In the same clock cycle, the whole input variable is passed to the Mux_x block so that it is already available to perform the multiplications once the coefficients are retrieved. In , the internal state of the control sel, changes to 1 and the FSM block starts the point location for the actual input. The maximum level of refinement states the maximum number of clock cycles required to perform the point location. In particular, since we have to consider also the state 0, corresponding to the root of the search tree, i.e., to the initial domain , a maximum of clock cycles will be required. Actually, a simple enhancement (described later on) allows to reduce the required clock cycles from to . For this reason, from clock cycle 1 to the circuit will be involved in the point Fig. 4 . Timings example for architecture mPWAR(A). In this case we assume from Fig. 2(a) , i.e., and .
location. In clock cycle , the point location procedure has concluded and a leaf node of the search tree has been reached, 3 so the go signal is asserted and the address of the coefficients to be retrieved from the Memory_i blocks is ready in output from the FSM block. In the next clock cycle the coefficients are retrieved from the Memory_i blocks and sent to Mux_f. From to , the products are performed sequentially and added in the accumulator within the MAC block. In the last clock cycle the constant term is added to the acc output signal from the MAC by a two-input combinatorial Adder, so that the value of the mPWAR function is ready in output. The ready and restart signals are asserted in the last clock cycle, the former stating that the evaluation of has concluded and the value in output is valid, the latter resetting the FSM block to state 0 so that a new point location can start in the next clock cycle. From the timings analysis we get that the overall circuit latency is given by , where indicates the number of clock cycles.
The main drawback of the mPWAR architecture is represented by the total number of nodes of the FSM, whose upper bound is expressed by . As an enhancement, we can avoid the instantiation of the states belonging to the last level of refinement . When the exploration of the tree leads to a leaf node located at a level of depth from 1 to , the FSM behaves as a Moore machine [43] , i.e., the address provided in output depends only on the current state, regardless of the input value. On the contrary, when the online exploration leads to a non-leaf node belonging to level , it necessarily means that its children are leaf nodes of the tree, so instead of waiting for the next clock cycle to provide the output, we can directly get the address of the first component of the coefficient vector in memory making a combinatorial check on the -th bits of the input, thus adopting a Mealy machine approach [44] . An example of this behavior is shown in the piece of pseudo-code reported in Algorithm 1, which describes a section of the FSM output process. This enhancement reduces the maximum number of clock cycles required by the point location task from to and, most importantly, allows a circuit complexity reduction avoiding the instantiation of the states belonging to the last level of refinement.
B. The sPWAR(A) Architecture
In the previous subsection the digital architecture for the multi-resolution case was described. Interestingly, when having a single-resolution PWA hyperrectangular (sPWAR) function, one can even obtain a simpler digital architecture resulting in a shorter evaluation time, as described in this section. Fig. 5 shows a simplified scheme of the sPWAR architecture in the case of a two-dimensional sPWAR function implementation. More in general, for a -dimensional case, the circuit would contain Memory blocks. The CK and RESET signals as well as the control signals are identified by dashed lines.
Most of the remarks about the mPWAR architecture still hold for this digital circuit. The main difference lies in the presence of Fig. 6 . Timings example for architecture sPWAR(A). In this case we assume (gray hyperrectangle in Fig. 1(c) ).
a combinatorial Address Generator block, which allows to solve the point location problem in one clock cycle, regardless of the dimensionality of the problem. Fig. 1(c) shows an example of a two-dimensional domain partitioned into single-resolution hyperrectangles. As we have already stated in Section III.A, the hyperrectangular region the actual input variable belongs to can be obtained simply by joining the MSBs from each of the input variables. The simple operation of joining the MSBs from the input variables into a single string identifying the current hyperrectangular partition is performed by the Address Generator block.
The timings for the sPWAR architecture are shown in Fig. 6 . The circuit inputs are the state variable and the signals CK and RESET. In this case, the internal state of the Control block assumes integer values in the range . In the first clock cycle , the components of the input are acquired in parallel by the Input block and the most significant bits of each state variable are sent to the Address Generator (signals and in Fig. 5 ). In one clock cycle, the Address Generator joins together the MSBs coming from the Input block and sends the address signal in parallel to the memories. In the coefficients are retrieved from the memories and sent to Mux_f. From to the products are performed sequentially and added in the accumulator within the MAC block. In the last clock cycle the constant term is added to the acc output signal from the MAC block by a two-input combinatorial Adder, so that the value of the sPWAR function is ready in output after a latency of .
C. The mPWAR(B) and sPWAR(B) Architectures
In architecture mPWAR(B) the point location problem is solved exactly in the same way as in architecture mPWAR(A), the only difference being the way in which the mPWAR expression is evaluated. Indeed, in the mPWAR(B) architecture the instantiation of multipliers allows to perform the products in parallel. From to the architecture performs the point location and produces the address signal in output from the FSM block. In , the coefficients are retrieved from the Memory_i blocks and the value of the expression is obtained immediately owing to the presence of multipliers and a combinatorial -input adder.
The gain in latency, from to , is obtained at the cost of a greater use of hardware resources, since multipliers are required.
Also in the case of the sPWAR(B) architecture, the only difference with respect to architecture sPWAR(A) lies in the presence of multipliers. Once the point location problem has been solved in , the sPWAR function evaluation can be performed in just one clock cycle, thus leading to an overall latency of .
V. COMPARISONS WITH OTHER PWA IMPLEMENTATIONS
Many different forms have been proposed in literature for the representation of PWA functions, but most of them are not tailored to digital circuit implementation. For comparison reasons, we focus on the PWAG and PWAS forms, as digital implementations are available for such forms.
In particular, in this section we provide a direct comparison between the new architectures for mPWAR and sPWAR functions introduced in the previous section and three digital architectures implementing PWAG and PWAS functions described in [22] for the PWAG case and in [10] (A and B) for the PWAS case.
We point out that in many cases prior to the implementation of PWAS, sPWAR and mPWAR functions, an approximation step is required as a general PWAG function is available (e.g., through explicit model predictive control), which can not be written exactly in the form of PWAS and/or PWAR functions. As a consequence, this approximation step provides PWAS or PWAR functions that are not the same as the original PWAG function and an approximation error intrinsic in the (mathematical) approximation problem is introduced. Note that this approximation error is not due to the circuit implementations. Clearly, there will be a tradeoff between the approximation error and the complexity of the resulting PWAS or PWAR functions obtained using the approximation method. For this reason, when comparing different implementations of PWA functions we must keep in mind that the overall effectiveness of the implemented function is determined both by the approximation method and the digital architecture adopted. For an overview of methods to approximate PWAG functions through canonical PWA functions defined on more regularly shaped partitions, the reader is referred to [14] , [28] , [29] , [39] , [45] , [46] (simplicial partitions), and [30] - [32] (hyperrectangular partitions).
The considered PWA forms (PWAG, PWAS, sPWAR and mPWAR) differ in representation capability, expressed in terms of the size of the PWA function class the form belongs to, and complexity, related to the number of parameters and coefficients required to represent a given PWA function. In particular, the circuit implementations of PWAG forms can obviously represent any PWAG function, whereas the realizations based on a regular partitioning (simplicial or hyperrectangular) can represent only a subclass of PWAG functions that are affine over each simplex/hyperrectangle of the simplicial/hyperrectangular partition. Regarding complexity, the generic structure of PWAG typically results in more complex/less efficient implementations, while the particular structure of the partitions in PWAS and PWAR functions can be exploited for fast and efficient on-line implementations. Moreover, the proposed hyperrectangular implementations offer the possibility to realize discontinuous PWA functions, which is not the case of the PWAS architectures realized until now, and the multi-resolution version provides a partial solution to the "curse of dimensionality", because it requires fewer parameters to store.
These implementation benefits of mPWAR, sPWAR and PWAS functions over PWAG functions were exactly the motivations for the control community to investigate how the PWAG control laws (e.g., obtained by optimization-based control methodologies such as MPC) can be approximated by mPWAR, sPWAR and PWAS functions without sacrificing much of the control performance specification [28] - [32] .
A. The PWAG Architecture
The architecture we have considered for the implementation of PWAG functions is mainly based on the digital circuit described in [22] . The main difference with respect to the mPWAR architecture is represented by the digital block responsible for the point location task. Indeed, the PWAG architecture solves this problem by a FSM implementation of a binary search tree whose depth is in general much larger than the depth of the orthogonal tree implemented by the mPWAR architecture. In the following we provide a brief description of the elements that determine the complexity of the PWAG architecture. For a more detailed description of the blocks making up the circuit, the reader is referred to [22] .
Each edge of the polytopic partition in the PWAG case (see Fig. 1(a) ) is an -dimensional hyperplane in the form , where and is the total number of edges of the polytopic partition. The next-step decision at each non-leaf node requires clock cycles for the evaluation of the inequality , which determines the branch to be taken [22] . The total time required by the circuit to provide the output , i.e., the latency of the circuit, is , where is the maximum depth of the binary search tree. An upper bound to the number of nodes of the tree and, consequently, to the number of states of the FSM is given by [22] . In the mPWAR architecture the time required by the next-step decision is independent of the problem dimensionality thus leading to a faster on-line search. Moreover, in the PWAG architecture the coefficients must be stored for each of the edges determining the polytopic partition of the domain, thus increasing the memory requirements up to cells.
B. The PWAS Architectures
For a simplicial partition, the algorithm usually adopted to locate the simplex a given input vector belongs to is based on Kuhn's lemmas [33] , [41] , [47] and is optimal with respect to the number of inputs [48] . The value of any PWAS function can be calculated as a linear interpolation of the values at the vertices of the simplex containing the input vector , i.e., as a linear interpolation of a subset of coefficients :
where and ( being the total number of vertices of the partition) is the set of indices corresponding to the vertices of the simplex in the scaled domain . The values are scalar interpolation weights and . The evaluation of a PWAS function through this algorithm requires three main elements: 1) a memory where the coefficients are stored; 2) a block that finds, for every given input , the set of indices and the interpolation weights (this step requires a sorting algorithm [34] ); 3) a block performing the weighted sum (5). We assume that the function to be implemented is already scaled and defined over a -dimensional compact domain . Each component of is represented by a digital word of bits: codes the integer part of bits are used for the decimal part. We note here that in the point location strategy the role played by the first bits in the PWAS case is similar to that played by the first bits in the sPWAR case; they allow to find the hyperrectangle containing the current input variable.
However, the simplicial architecture requires also the sorting of the decimal parts of the input to find the correct simplex within the actual hyperrectangle. In summary, determines the number of sub-intervals along each dimension ( for any ) as well as determines the number of subdivisions along each dimensional component . Concerning the three elements listed above, the weights are stored in an internal memory; the decreasing ordering of the decimal parts of is obtained by using a sorter suitable for digital implementations [49] , the weighted sum is performed by the high-speed multipliers and accumulators internal to the FPGA. For a detailed description of the blocks making up the circuit, see [10] .
We note here that the difference between architectures A and B proposed in [10] is that the former is serial, while the latter is completely parallel, thus requiring the instantiation of Multiplier as well as Memory blocks. For architecture A, the on-line computation time grows with and the memory size increases according to the exponential law . For architecture B, the latency of the circuit is 3 clock cycles while the memory size is . The serial architecture results in a slower but smaller circuit, while the parallel one achieves extremely small computation times but has larger memory requirements.
C. Comparisons
Summing up, the PWAS and sPWAR architectures are suitable for small-scale problems with high real-time requirements owing to their higher speed. The sPWAR architecture presents an extremely simple structure, since all the operations can be carried out with very simple functional blocks, and offers the further capability to implement discontinuous functions. On the other hand, both architectures are affected by the "curse of dimensionality", since the memory size grows exponentially with the input dimension . The PWAG architecture has the great advantage of implementing every generic PWA function, but the latency of the circuit grows very fast for problems with a highly irregular domain partitioning or a large number of regions. The mPWAR architecture allows much shorter computation times as well as considerable memory savings with respect to the I  CIRCUIT PARAMETERS   TABLE II  CIRCUIT PERFORMANCES to the single-resolution approach. The main drawback in this case is due to the dimensions of the FSM, which can become too large for high-dimensional problems requiring a high level of refinement.
All the presented architectures have been obtained through fully parametric VHDL descriptions, thus allowing to represent different PWA functions simply by tuning few parameters and to automatically write the VHDL descriptions.
The most significant parameters for the implemented architectures are reported in Table I. Table II summarizes how the circuit performances are influenced by the most significant parameters. 4 These tables are a key result, as they provide a detailed comparison of the complexities of the considered digital implementations. The column in Table II shows the number of clock cycles required by each architecture to complete the computation of the output function. The column Memory Cells reports the number of elements that must be stored for each architecture. In the fourth column, we list the maximum number of nodes for the architectures implementing a FSM (this is a worst-case measure since not all the states would be instantiated in a typical implementation). The last column shows the number of multipliers instantiated for each implementation.
From the results reported in Table II it is possible to infer the complexity and performance of a PWAG, PWAS or PWAR implementation as the parameters reported in Table I change. In TABLE III  DOUBLE INTEGRATOR the next section we provide an experimental verification of these results by the FPGA implementation of two MPC controllers.
VI. EXAMPLES
We consider two examples of PWA functions to quantitatively illustrate some of the aspects discussed in the previous sections. The first example is related to a two-dimensional system, which is a benchmark problem in the context of MPC. The second example is related to MPC of a three-dimensional PWA system [50] . In both examples, all the different PWA functions are implemented on a Xilinx Spartan III FPGA (xc3s200). The developed VHDL description has been realized with the software tool Xilinx ISE 12.3 and ModelSim has been used for simulation and debugging.
The original PWA control functions are obtained by explicit MPC (EMPC) [19] , [20] algorithms and implemented using the PWAG architecture based on a binary search tree proposed in [22] , as discussed shortly in Section V.A. The PWAS approximations have been obtained through the techniques described in [28] whereas the implementations rely on the architectures proposed in [10] and shortly described in Section V.B. The mPWAR and sPWAR approximations are based on the procedures described in [31] and implemented using the architectures proposed in this paper.
In order to provide a comparison between the approximation capabilities of the different architectures, we use two different quantitative measures of the approximation errors, the first one being (6) where is the PWAG optimal control law, is the PWAS, mPWAR or sPWAR approximation. Hence is the maximum approximation error for each case.
The second quantitative measure is the relative error evaluated over a uniform grid of 100 points for each domain axis, given by (7) where are the sample points.
A. Double Integrator 1) Model and Control Laws:
The discrete-time double integrator with sampling time of one time unit can be represented as follows [20] :
where denotes the state variable and the control input at discrete time . The control objective is to regulate the system to the origin under the hard constraint . The state domain is given by 5 . An optimal explicit control law (PWAG) has been obtained with EMPC as in [51] using the Matlab Hybrid Toolbox (v1.2.8) [52] . 6 The PWAG control law is characterized by regions and edges while the off-line search tree results in 87 nodes with a maximum depth of . In order to provide a fair comparison between the approximations and , we have selected coherent values for the architectures parameters. The PWAS approximation is obtained with subintervals along each dimension, while for the sPWAR case we choose , i.e., . The mPWAR approximation, characterized by a maximum refinement level of , is defined over multi-resolution hyperrectangles, and the orthogonal search tree is made up of 53 nodes. The fact that the number of regions is larger than the number of nodes is due to the enhancement described in Section IV.A which allows to avoid the instantiation of the nodes belonging to the last level of refinement.
The obtained approximation errors are shown in Table III . The time evolution of the system controlled with the optimal (PWAG) and the approximated (PWAS, sPWAR, mPWAR) control laws is shown in Fig. 7 . Panels (a) and (b) show the evolution of the state variables, whereas panel (c) shows the optimal and the approximated control laws. The black solid lines are related to the optimal MPC control, the gray solid lines correspond to the PWAS approximations while the dashed lines are related 5 For simplicity we use the notation to refer to the state variable in the original non-scaled domain. 6 Using cost function , where is chosen as the solution of the LQR problem with weights , and the horizon is taken . to the sPWAR (black) and mPWAR (gray) approximations. The constraints to be fulfilled are represented by horizontal dashed lines where appropriate.
2) Circuit Implementations:
In all cases we have used bits to represent the data. The performance of the architectures on the chosen FPGA are summarized in Table III . We remark here that the error due to the 12-bit representation is negligible in all cases with respect to the approximation errors shown in Tables III and IV [53] .
The column Slices reports the percentage of space occupied on the device after the place-and-route process performed in Xilinx ISE. The Latency values are obtained by multiplying the number of clock cycles needed to perform the PWA computation, reported in Table II , by a clock period of 50 ns. The chosen clock frequency, i.e., 20 MHz, allows to obtain correct results on each of the digital architectures after the post-route simulation on the FPGA. The values expressed in the column Memory are obtained by multiplying the number of required cells, reported in Table II , by the number of bits used to represent the coefficients in memory, i.e., . The last column shows the total power consumption estimated by Xilinx XPower Estimator (XPE) 11.1 after the place-and-route process performed with a clock frequency of 20 MHz and an external temperature of 25 C.
In this simple example all the architectures use a limited amount of the resources available on the FPGA, but the digital architectures implementing PWAS and PWAR functions already offer a considerable gain in terms of computation time with respect to the PWAG architecture. Moreover, the mPWAR architectures require smaller memories if compared to the single-resolution approaches.
B. 3D PWA System 1) Model and Control Laws:
The second example is the control of a third-order PWA system from [50] , defined by The state domain is given by . The control constraint is . The explicit solution has been obtained as in [54] using the Matlab Multi-Parametric toolbox (v2.6.3) [55] . 7 The resulting control law is a PWAG function of the state , defined over the domain partitioned into polytopes (after simplification of the original 719) by edges. The binary search tree computed offline results in 949 nodes with a maximum depth of . The approximated PWAS control function is defined over a domain partitioned into and subintervals, where the subscript stands for the input component, thus implying 19530 simplices and 4096 vertices.
The sPWAR approximation is characterized by , thus leading to a total of 4096 hy- 7 Using cost function , with (no stability enforced). The horizon length is chosen . perrectangular regions and 16384 coefficients. In the mPWAR case, the off-line tree has a maximum depth of and is made up of 1337 nodes, and the input domain is partitioned into regions. The approximation errors, computed according to (6) and (7), are summarized in Table IV . From Fig. 8 we can see how in this case the approximated control laws result in slight oscillations around the origin, but they are still effective in stabilizing the state to a region close to the origin. Because no (asymptotic) stability requirements were put on the approximations, the approximation errors around the origin are allowed to remain relatively high, resulting in these oscillations. Note that in certain cases it is possible to impose stability on the PWA approximations [29] , [31] , [32] , thereby removing these oscillations.
2) Circuit Implementations: Also in this case we choose bits to represent the data. The PWAG architecture occupies around half of the space available on the FPGA board and the memory requirements are still quite small since the number of polytopic regions of the domain is limited. On the other hand, the latency of the circuit is high if compared to the other architectures, since the maximum depth of the binary search tree associated with the domain partition is and the evaluation of an affine expression requiring clock cycles is performed at each step within the search tree.
The mPWAR architectures offer the advantage of a much faster computation at the cost of a slightly larger memory and a larger space occupation on the FPGA. The sPWAR architectures are much simpler than the other implementations and the space occupied on the circuit is extremely small. In the sPWAR(B) implementation the latency is more than one order of magnitude less than the PWAG case but the great advantages offered by the sPWAR architectures are balanced by a significant increase in the memory requirements.
The PWAS architectures still offer the advantage of a lower latency if compared to the PWAG case and the memory requirements are smaller than for the sPWAR case, especially in the PWAS(A) implementation, at the cost of a larger maximum approximation error.
In conclusion, the considered examples provide a validation of Table II and demonstrate the advantages and disadvantages offered by the different implementations.
VII. CONCLUSION
In this paper we have introduced two new digital architectures for the implementation of PWA functions defined over both single-resolution and multi-resolution hyperrectangular partitions. The proposed architectures allow a significant reduction of the computation time of the PWA function with respect to previous solutions implementing generic PWA functions. Moreover, the chosen hyperrectangular partition allows the realization of very simple digital architectures in the single-resolution case, which makes our solution more appealing for large-scale problems with respect to previous digital architectures implementing PWA functions defined over domain partitioned into different types of polytopes. In order to mitigate the curse of dimensionality, which affects the architectures for the evaluation of PWA functions defined over regularly-shaped partitions, we have introduced a multi-resolution architecture, which allows a better control over the memory requirements at the cost of a slightly higher latency and a larger circuit complexity. These conclusions were illustrated by two numerical case studies.
The proposed architectures are completely reconfigurable and the chosen FPGA implementation allows customization of the hardware at an attractive price even in low quantities and this makes them effective for cost and time-to-market savings when making individual configurations of standard products [56] . These advantages make the proposed architectures of particular interest for the implementation of optimization-based control methodologies such as EMPC for fast, small-size and low-power applications.
Future research will involve a further integration between the algorithms for the control approximation and the circuit implementation, as this will lead to a highly automated procedure for the creation of embedded controllers, making the quantitative properties of the designed controllers predictable without having to realize the circuit implementations. This integration should lead to the development of a toolbox to support the whole design chain.
