We 
Introduction
With the combined explosion of the electronic appliances market and of the complexity of systems implemented on a chip today, system designers face the challenge of silicon compilation. The long and error-prone manual design of VLSI systems will be replaced by a step-by-step compilation from high-level code, in order to speed up and secure the design process. In this design process, one difficult stage is the determination of the data-path bit width. As pointed in [17] , when processor memory was expensive, programmers paid attention to this problem even for software implementations. The problem however did not receive much attention from the software community because of the standardization of the 32 bits (and then 64 bits) architectures. In silicon design however, the area of a circuit is always very important because it is directly proportional to the price of the chip.
The goal of the bit width determination stage in the design process is to find a tradeoff between a large data path ensuring correctness and precision, and a smaller data path yielding low power and cheap circuits. When going from a high level algorithmic description to a hardware implementation, the designer has to transform variables with infinite precision and unbounded values into bit vectors of fixed size. In this process, two distinct problems may arise: overflows, which appear if the size of the integral part is too small, and precision loss if the fractional part is too small. These problems are solved during the determination of, respectively, the range of the signal and the precision of the signal. This process is tightly coupled with the global design methodology hence it is difficult to precisely formalize it.
In this paper, we consider this problem in the context of the derivation of an architectural description from a high level (functional) description. Starting from an algorithmic description of a system, we refine it down to an architectural description in synthesizable VHDL. We focus on computation 1 intensive algorithms, i.e., loop nests. Moreover, our descriptions are generic, i.e., they handle symbolic parameters. To ensure reusability of the design, these parameters must be instantiated as late as possible. Parametric loop nest manipulations use the polyhedral model [8, 4] . Existing tools for high level synthesis do not handle parameterized loop nests because of the complexity of this model. In this paper, we propose new techniques for the symbolic determination of integral bit width in VLSI implementation of programs containing loop nests.
The paper is organized as follows. Section 2 gives an overview of related work, Section 3 presents our modeling, Section 4 illustrates our methodology on a simple example, Section 5 presents our main contribution, and Section 6 gives concluding remarks.
Related work
Starting from a floating point (or theoretical infinite precision) representation of a system, one must transform it into fixed point representation. We will assume that the numbers are coded in the two's complement system, which is widely accepted as the "default" standard.
A real value r will be represented by a valuer.r is a Q n m signal if it is composed of n binary digits before the dot (integer bit width needed to code the range of the signal) and m binary digits after the dot (fractional bit width used to set the precision of the implementation). The VLSI designer chooses these different bit widths from information obtained by static analysis tools (like the computation of the signal to noise ratio) or extensive simulations. This method is difficult to apply to algorithms containing loop nests for which some parameters (number of iterations for instance) are fixed later in the design process. The integral bit width determination of parameterized loop nests algorithms was the major motivation of the present work.
Bit-width determination methods can be classified into two categories: the formal methods attempt to statically extract bit-width information from the code and the simulation-based methods try to reach an admissible bit width by successive simulations. It turns out that a combination of both types of methods is mandatory. Formal methods are very useful to orient the design space exploration for the simulation phase. The approach presented here is a formal method.
Among simulation-based methodologies, one can find the following: [10, 9, 15, 3, 11] and the proposed implementations in Ptolemy [6] , DSP Station, or SPW. In addition the Ptolemy environment provides an interesting typing mechanism that defines a hierarchy of types and allows data refining within this hierarchy. As these methods are based on simulation, input samples have to be carefully chosen and no general symbolic result can be obtained for parameterized architectures.
The Gaut synthesis tool [16] uses a formal strategy to derive signal bit width in DSP applications. This strategy defines a model for operators and computes noise standard deviation using algorithms that operate on the data-flow graph of the architectural description. Given constraints such as the output signal-to-noise ratio, Gaut is able to symbolically compute the signal bit widths. The main restriction of this approach is the way it handles loops: loops are unrolled, and as a consequence loops with parameterized bounds cannot be handled.
An interesting methodology was presented by Amarasinghe et al. [17] and implemented in a silicon compiler. Once again, a typing mechanism is proposed for finite precision numbers, together with a refinement algorithm that allows the shortening of signal bit width by a bidirectional value range propagation method and a heuristic classification of variables appearing in loops. The bit widths of variables appearing in loops are set according to their type (constant, linear, polynomial,...) and the loop bounds but these bounds are not handled as symbolic parameters.
The HP labs [12] propose a formal method for integral bit-width selection in the PICO system. [13] . They used mathematical tools from the signal processing theory to compute the noise resulting from a linear 1 system. By using the transfer function of the linear system, they explain how to express, as a closed-form expression, the noise produced as the output of this system. Their method, used in conjunction with the one presented here, will enable the derivation of the complete bit width of signals for hardware implementation of loop nest programs representing linear systems (the case of non-linear systems will still need a simulation phase to derive the bit width of the fractional part).
Signals and operators modeling
As we just mentioned, the designer goal is to determine the bit widths that are necessary to both the range and the precision by means of formal techniques. In this paper, we focus on the range determination. In the following, we first briefly mention how one can define a modeling for the determination of precision, then we concentrate on modeling with respect to range.
Modeling w.r.t. precision. Given any signal sequence X(n) of M successive values, we denote byX(n) the approximated value of X(n) obtained, for instance, in a VLSI implementation of a function computing X(n). We call noise associated with the signal X(n) the value e X (n): e X (n) = X(n) ;X(n). The noise is abstracted by its standard deviation X . The signal-to-noise ratio (SNR): R X = 1 0 log 10
for a normalized signal is commonly used to evaluate the validity of an implementation, by deriving static results concerning the necessary bit widths of the signals. For each operator, an abstract version that models its behavior with respect to the noise standard deviation is given. Many authors propose such a modeling and methods to use it for the determination of fractional bit width [18, 13, 9] .
Modeling w.r.t. range. The integral bit width for signal X will be noted L X (by convention we include the sign bit of the two's complement representation in this length). As for the precision, the range of signal can be abstracted by a probabilistic distribution [9] but, in that case, the result should systematically be validated by an extensive simulations. A conservative approach is usually chosen to prevent overflows, which leads to the following equations 2 :
These equations define, for each operator op, an abstract modelõp which specifies the behavior of this operator with respect to the integral bit width (e.g. a+b = max(a b) + 1 ). This abstract model is called "Opcode transfer function" in [12] . It can be naturally extended to complex arithmetic expressions.
Solving the equations. Given a particular implementation of an algorithm, we have to compute two quantities for each variable X: X and L X . L X will give the integer bit width of the signal and X will indicate to the designer if the result is precise enough. For simple algorithms (non parameterized loops), equations defining X can be solved with traditional algebraic techniques [18] and equations defining L X can be solved by iterative constraint propagation [12] . However, if we want to implement a loop for which we do not exactly know the number of iterations (parameterized loop), the determination of L X is difficult due to the presence of max operations. We explain how to overcome this difficulty in the next sections.
Parameterized nested loops: an illustrative example
In this section, we explain on a simple example how we propose to determine the integral bit width of each variable in a high-level design flow from a nested loop program. The example we chose is the Finite Impulse Response (FIR) filter, which is a simple convolution filter used in many digital signal processing algorithms (correlation, adaptive filtering, etc.).
The problem is the following: given a possibly infinite sequence of values x(n), (n 0) and N weights w(i) (0 i N ; 1), compute the sequence y(n), (n N ; 1) where
If we want to design a VLSI circuit for this computation, we are given some information about the inputs (w and x) and some requirements about the output (y). During the design of the circuit, we will have to determine how to encode these signals.
For our example, we will assume that input signals x and w are composed of rational values that are encoded in a specific finite precision format and are associated with a similar bit width
The program shown on figure 1 (expressed in the recurrence equations formalism) is a possible implementation of the FIR algorithm. It expresses the computations, but in addition it explicitly defines when and where each computation should take place. Indeed, if we interpret index t as time and index p as space (processor number), we get a very precise description of a pipelined systolic execution of the FIR filter on a SIMD-like machine.
Figure 1. Recurrence equation specification for the FIR of equation (1)
Consider for instance the equation that defines XP (pipelined value of x) in Figure 1 . In each processor, this equation represents a signal, hence we can associate with it a range L X P (p) related to its integral part, which depends on the processor number. The equation that defines XP indicates that this value is simply propagated from one processor to the next with a delay of two clock cycles.
This equation directly leads to the equation defining L X P :
This equation can be solved easily, giving L X P (p) = L x for any processor p. Moreover, we know from our assumption about the input signal x that L x = L 0 . Similarly, the same quantity may be computed for W: L W (p) = L w = L 0 for any value of p 3 .
Observe now the equation that defines Y in Figure 1 . It has recently been shown [13] that the parameterized fractional bit width could be computed automatically (as a function of the desired signal to noise ratio). The equations defining the range L Y (p) of Y are the following:
We will explain in Section 5 how to solve it for a given value of L 0 . With traditional algebraic techniques, we can find the result if L 0 = 0. Indeed, in that case, we will necessarily have L Y (p) = p. Now, if we are able to solve Equation (3) for any value of L 0 , we will really have an automatic procedure for determining L for loop nest programs as soon as we have a space-time representation of the architecture. Note now that Equation (3) can be automatically generated. Indeed, Equation (3) is very close to the specification of Figure 1 (intuitively, they are obtained by "removing" index t of the program). This property is very important since it implies that the whole process that determines the bit width can be automatically driven from the recurrence equation operational specification of Figure 1 . This is the main contribution of the paper, and it is detailed hereafter.
Solving integral bit width equations for parameterized nested loops
In the previous section, we have shown how to generate integral bit width equations for the FIR example. In this section, we generalize this process and solve the equations for any algorithm containing loop nests implemented on a linear array architecture.
Generation of integral bit width equations
We start from an operational description of the linear architecture: a recurrence equation specification (i.e. single assignment specification) where time and space indices have been identified. : : : esac : : : Each variable V i is defined over a polyhedral domain D i and can be considered as a generalized array whose shape is not rectangular but polyhedral. In these equations, each D i j is a sub-domain of domain D i , and F i j are arithmetic functions. Index t is the time index, while p is the processor index, and each f i j k is an affine function on indices called dependency function. These two features (affine dependencies between indices and polyhedral domains) are the basis for a rich set of formal transformations that form the core of the MMAlpha environment that has been designed to manipulate such recurrence equations [19, 5] .
For our purposes, we restrict ourselves to the case of uniform programs, i.e., programs where all functions f i j k are translations on indices: f i j k (z) = z ; d i j k where d i j k is a constant value. Note however that any recurrence equation program can automatically be translated into an equivalent uniform one. Such a uniform recurrence equation program can be associated to its reduced dependence graph (dependence graph for short) composed of vertices corresponding to the variables V i and arcs corresponding to dependencies between these variables (each dependency f i j k leading to an arc labeled by d i j k ).
From these equations, we deduce the integral bit width equations of the system by using the modeling introduced in section 3. To do that, we substitute each signal V i by its integer bit width L V i and each arithmetic expression F i j by its abstract versionF i j . A variable V i (t p) corresponds to a signal V i in processor p at time step t. In the final architecture, L V i (p) must be the maximum bit width of V i (t p) over time. This can systematically be obtained by projecting every polyhedral domain and every dependency function on index p along the time axis (we note d 0 i j k for the projection of d i j k on index p). The action of projecting parameterized domains can be easily performed with the polyhedral library Polylib [19] which is used in many loop manipulation environments [5] . After this projection, the case branches are merged into a single expression which is the maximum of all the expressions inferred from each branch. The system of equations we get after sup- 
Solving the bit width equations in the (max +) algebra
To be able to solve this system, we will make two assumptions.
The first assumption we make is that we have a linear monodirectional array: data are carried in the array from one processor to its neighbor following a unique direction. Each L V i (p)
depends on values L V j (p ; 1). It can be easily shown that any system of uniform recurrence equations can be rewritten with unit dependencies by introducing temporary variables. Note, however, that it may happen that L V i (p) depends on itself (d 0 i j k may be equal to 0). This happens, for instance, in auto-adaptive algorithms such as the DLMS filter [14] . In that case, we will not be able to find the integer bit width (note however that none of the existing techniques will be able to automatically find the fractional bit width either as DLMS is a non linear system).
Our second assumption is that no multiplication occurs in a strongly connected component of the dependence graph (they only occur between strongly connected components of the graph). This situation is almost never true in treatments implemented in VLSI (signal processing, coding, etc.). Indeed, this would lead to bit width growing too fast (values of order A p after p operations). In the following, we will decompose the dependence graph in strongly connected components and treat them in sequence. This is exactly what we did in the example of Section 4: our dependence graph contained three strongly connected components, the three vertices X, W and Y. This decomposition into strongly connected components is also present in the work of Amarasinghe et al [17] (they use the word sequences).
Under these assumptions, the system of bit width equations for variables contained in a single strongly connected component of the dependence graph is such that no addition between two bit widths occur in the right hand side of any equation. As the addition operation is distributive with respect to the max operation, we can extract all the max operations and keep only one maximum applied on several sums. If two occurrences of a particular L V j (p ; 1) appear in the definition of L V i (p), one of them can be statically eliminated by taking the maximum of the two. The general form of the system can finally be restricted to: : : :
The idea that we introduce here is to use the (max +) algebra theory [7, 1] to solve this system. (max +) is an algebra where max stands for the additive operation and + stands for the multiplicative operation. We will show that, with the help of the Perron-Frobenius theorem adapted to (max +), we will be able to solve these equations. If we call L V (p) the vector composed of the L V i (p) values, we can write the definition of L V (p) in terms of L V (p ; 1) as a matrix vector multiplication in the (max +) algebra. From now on, we will note for the max operator, for the + operator and for the (max +) matrix vector product (hence,
We denote the zero of (usually ;1).
The good thing with a matrix-vector product is that it is a linear operation and hence, algebraic theory provides tools for manipulating it in a (max +) algebra. In a first step, we assume that all the constant coefficients i m+1 are . We explain hereafter how to generalize the technique to the case where these coefficients are not equal to . Under this assumption, the integer bit width system (4) can be expressed as: Let us now consider the graph whose adjacency matrix is M (a non-value in M i j corresponds to an arc in this graph, between vertex i and vertex j labeled with M i j ). This graph is isomorphic to the subgraph of the dependence graph corresponding to the strongly connected component we study (an arc from vertex i to vertex j correspond to a dependency between V i and V j ). A property of adjacency matrices of strongly connected graphs is that they are irreducible. In classical algebra, the irreducibility of a matrix A is used in the Perron-Frobenius theorem to show that the spectral radius A of A is an eigenvalue of A, and the associated eigenspace is a ray whose components are all non-negative. This result can be used to approximate the value of the vector recursively defined by b(n) = A b(n ; 1) for large n (intuitively b(n) ' ( n A )).
In the (max +) algebra, a similar result can be shown [7] . The spectral ray of the matrix M above is the maximal geometric mean of weights of elementary cycles in the graph corresponding to the matrix. This gives an easy way for computing the spectral ray M of matrix M. The concept of cyclicity is also useful. The cyclicity c(A) of a matrix A is the the GCD of the length of elementary cycles in the graph associated to the matrix.
We can now use the extension to (max +) algebra of the Perron-Frobenius theorem: 
With this theorem, we can express the asymptotic behavior of the bit width L V i (p) for the variables V i contained in one strongly connected component of the dependence graph. Their exact value is computed by hand for small values of p. In summary, we have the following result: for every 2-dimensional uniform loop nest that satisfies the following conditions, the loop nest contains parallelism (this implies that it will lead to a linear array with the standard systolic design methodology), the loop nest leads to a computable system of equations for the integral bit width (i.e., bit width L i (p) does not depend on itself as for auto-adaptive filters), the strongly connected cycles of the reduced dependence graph do not contain multiplication, we can automatically find the bit width of each signal even if the size of the array is unknown. Having a parameterized bit selection is essential in a high level design process as well as for the re-usability of the design.
Solving the example
We illustrate how to use this technique on the FIR example. In this example, we had to solve As the system given by Equation (3) is not linear but affine, we will have to perform a classical change in the specification. The constant will be considered as a variable: C(p), and we will make sure that the value of C(p) is always 0. The new equation in (max +) is (with L Y (;1) = 0 and C(;1) = 0:
The new matrix M 0 is not reducible anymore but, as the spectral ray of M is positive, the result of Theorem 5.1 still holds with M (intuitively, as M is strictly positive, L Y (p) is increasing with p, hence, taking the maximum with a constant can change the behavior for small values of p but not asymptotically). Now, using theorem 5.1, we know that there exists N such that
This is the result we had in section 4with L 0 = 0 (with corresponding N equal to -1). In the general case, N upper bound can be computed for N [2] . As we deal with small matrices and small integer values, this bound is usually small in practice. Hence the most efficient way to find it is to compute 
Conclusion
We have presented a method for the integral bit width determination of signals for loop nest programs implemented on one-dimensional linear arrays. The main originality of the paper lies in the fact that we use results of the (max +) algebra theory to solve the problem for parameterized loop nests (i.e., loop nests for which the bounds of loop indices depend on parameters not necessarily known at compile time). Having a parameterized bit selection is essential in a high level design process as well as for the re-usability of the design. We have demonstrated the method on a classical signal processing treatment: the FIR filter. To be complete the method presented here should be used in conjunction with the result of [13] for the determination of fractional bit widths of linear signal processing systems. We insist on the fact the the possibility of handling parameterized designs is the major improvement presented here. We are able to derive, for instance, the integer bit width of signals for the FIR filter, even without knowing the actual number of filter taps.
The work presented here has been initiated in the context of a high-level design methodology using recurrence equation, but it can be used in any high level design environment using a different modeling for operators. However, its application requires an equational description of the behavior of the architecture. Once this description is obtained, the generation and the resolution of the bit width equations are similar to the work presented here.
Some open questions remain. First, of course, we have to prove the applicability of the technique in practice. The number N of manual iterations to unroll must be reasonable. We are quite confident in that because the systems to solve are usually very simple compared to other systems arising in (max +) theory [7] . But the most important question concern the restriction of the method to monodimensional arrays: this restriction will disappear if we show how to solve multi-dimensional (max +) linear systems of equations. We are currently working on this topic.
