This article deals with the fixed-point computation of the sum-of-products, necessary for the implementation of several algorithms, including linear filters. Fixed-point arithmetic implies output errors to be controlled. So, a new method is proposed to perform accurate computation of the filter and minimize the word-lengths of the operations. This is done by removing bits from operands that don't impact the final result under a given limit. Then, the final output of linear filter is guaranteed to be a faithful rounding of the real output.
INTRODUCTION
Usually, embedded digital signal processing algorithms are specified using floating-point arithmetic and next implemented using fixed-point (FxP) arithmetic (Padgett and Anderson, 2009 ) for cost, size and power consumption reasons. FxP arithmetic is used as an approximation of real numbers based on integers and implicit fixed scaling by a power of 2. Of course, the quantization of coefficients and the rounding errors due to FxP computations lead to a degraded numerical accuracy of the implemented algorithm. Therefore, it is a great interest for the designer of embedded system to determine and control the implementation error while maintaining low computational effort.
In fixed-point arithmetic, a main current problem is to minimize the word-lengths of operands under constraints of precision in order to minimize area and/or power consumption (Constantinides et al., 2004) . In this paper, a new method to reduce the number of bits to consider in each sum-of-products (SoP, also called Multiply-And-Accumulate) is proposed. The SoPs are one of the elementary operations of DSP algorithms. The main point of our approach is that if the final fixed-point format is known, then the bits having no impact in the final result can be detected and therefore discarded. Each term of the sum-of-products can be then reformatted into a new fixed-point format having less bits.
Some fixed-point arithmetic definitions and notations are reminded in section ??. Section 3 formalizes the proposed approach, which is decomposed into two formatting, for most significant bits and least significant bits, respectively. Section 4 describes the error analysis for Direct Form I filters implemented with the bit formatting technique. Finally, an illustrative example is given with a 4 th order Butterworth filter, before conclusion in section 6.
FIXED-POINT ARITHMETIC AND SUM-OF-PRODUCTS
In this article we consider signed FxP arithmetic in two's complement representation. Let x be such a FxP number with w bits as word-length:
where x i ∈ B {0, 1} is the i th bit of x, m and are the position of the most significant bit (MSB) and least significant bits (LSB), respectively ( Fig. 1 ). It can be noted that m > and
In a digital system, x is represented by an integer X, composed by the w bits {x i } i m . In other words, X = x.2 − , or equivalently Through this paper the notation (m, ) is used to denote the Fixed-Point Format (FPF) of such a fixedpoint number and m, and w will be suffixed by the variable or constant they refer to.
Remark 1. In FxP arithmetic, there is no restriction on the the position of the MSB and LSB. The FPF is often chosen with m 0 and 0. FPF with > 0 are also possible (the quantization step is greater than 1) or m < 0 (the largest represented number is lower than 1 2 ).
Conversion from Real to Fixed-point
Many SoP-based DSP algorithms involve real coefficients that have to be converted into FxP arithmetic. Let consider a real constant c ∈ R * . The position of the most significant bit m of its w-bit wide FxP representation in binary two's complement is:
where · and · are the round to the integer towards minus infinity and round towards plus infinity operators, respectively. For some very special cases, eq. (4) should be adapted (Hilaire and Lopez, 2013) . The position of the least significant bit is deduced from eq. (2) and the w-bit integer C representing c is computed:
where · is the round to the nearest integer operator.
Sum-of-Products
In digital signal processing, the computation of filter or controller algorithms requires the evaluation of one or several SoP. Their type and number depend on the algorithm chosen (Hanselmann, 1987; Istepanian and Whidborne, 2001; Gevers and Li, 1993) . For instance, the direct forms require only 1 SoP, whereas the n-th order state-space require n + 1 SoPs. The products considered in such a SoP are products of real constants and real variables. But, in the context of fixed-point design, only fixed-point variables and fixed-point constants are considered. In this article, we consider SoPs whose constants have already been converted in FxP format.
More formally, we consider SoPs
where {c i } 1 i n are given non-null FxP constants and {v i } 1 i n FxP variables only known to be in known intervals [v i ; v i ]. We focus on the best way (i.e. employing the minimum word-lengths) to obtain a rounding of the exact sum s at a given format.
Remark 2. It is also possible to consider the {c i } to be real constants instead of FxP constants, so as to analyze the impact of their quantization that is not considered here. However, this impact is well studied with sensitivity measures such as the transfer function sensitivity (Tavşanoglu and Thiele, 1984; Gevers and Li, 1993; Hinamoto et al., 2006) , the pole/zero sensitivity (Gevers and Li, 1993; Li, 1998) or IIR stability (Lu and Hinamoto, 2003) .
BITS FORMATTING
The main point of the proposed approach is that if the final fixed-point format of a sum s = ∑ p i , denoted FPF f = (m f , f ) is known, then it is probably possible to discard some useless bits.
More formally, this paper is focused on bits of p i s with positions lower than f (section 3.2) and greater than m f (section 3.3) in order to determine their impact on the result. We determine the useless bits and remove them from p i s before the sum is computed. The p i s are rounded into an intermediate format (m i , f − δ), where δ is the number of non-useless bits with position lower than f . Then, the sum of these modified p i s is computed and rounded into the final format FPF f in order to obtain the final result (see Figure 2 (b)).
Definitions and notations
In this section the fixed-point rounding modes used in this article are defined. are defined by:
The operator d (x) is the faithful rounding of x at the d th bit, i.e.
The round-to-the-nearest operation always returns the nearest representable point of the real exact value, while the faithful rounding operation produces either the nearest or next-nearest point.
Notations. Some notations need to be explicitly defined before explaining the proposed approach:
• p i c i × v i denotes the result of the fixed-point product of c i and v i . According to the fixed-point multiplication rule (Lopez et al., 2012) , the fixedpoint format of p i is defined as
• p i, j is the j th bit of p i , for 1 i n and i j m i .
and
log 2 (n) corresponds to the number of carry bits to consider for the sum of n terms.
Moreover, three different sums are also considered, where d is a given common rounding mode (roundto-nearest or truncate):
where δ is a given positive constant to be discussed later.
• s f f (s δ ) is the rounding of the sum s δ into the final format FPF f . Modular fixed-point sum. As reminded in equation (3), a fixed-point number x is coded in computer with a w-bit signed integer X. As a consequence, all the operations are done modulo 2 w on this integer. Proposition 1 specifies the modular fixed-point sum as an extension of the modular sum on integers.
Proposition 1 (Modular fixed-point sum). The sum modulo 2 d of two fixed-point numbers x and y sharing the same FPF (m, ), is noted x d ⊕ y and is computed as:
(13) Moreover, the modular fixed-point sum of n fixedpoint numbers x i is noted and computed as:
Proof: The fixed-point sum modulo d, x d ⊕ y, corresponds to the d − -bits sum of the integers X = x.2 − and Y = y.2 − . Therefore, adding two signed fixed-point numbers requires to convert them in positive integers, add them modulo 2 d− +1 , and convert the result back into signed fixed-point number.
Example 1. Adding 12.5 and 3.75 in FPF (4, −3) (two's complement with 8 bits) is given by 12.5 4 ⊕ 3.75 and leads to −15.75 according to eq. (13) because of the overflow. 12.5 is coded by 01100.100 B , 3.75 is coded by 00011.110 B , so the modular sum leads to 10000.010 B into format (4, −3), that is interpreted as −15.75.
LSBs Formatting
Let consider the final FxP format (m f , f ) of a SoP. It appears that not all the least significant bits are usefull in order to correctly round the result of a SoP to (m f , f ). The value δ is the position such that all bits with a position lower than f − δ are insignificant and can be removed from p i s (Fig. 2(b) ). Therefore, only p i s such that i < f − δ are rounded, the other remain unchanged. The sum s δ of rounded p i s is computed on format (m f , f − δ) and finally rounded onto the final format (m f , f ).
The following proposition formalizes the choice of δ.
Proof: This proof is done for truncation rounding mode.The same reasoning can be established for round-to-nearest mode.
Computation of s involves all bits p i, j whereas s δ requires only bits p i, j for j 2 l f −δ , so:
The trivial case s = s δ implies s f = s f , so thereafter only the case s − s δ > 0 is considered and the difference s − s δ is evaluated precisely as the sum of bits p i, j for 1 i n and L j f − δ − 1:
Since ∑ f −δ−1 j=L 2 j p i, j < 2 f −δ for 1 i n, s − s δ can be bounded as follows:
Now, the difference s f − s f corresponds to the rounding of the difference s − s δ according to the f th bit:
It also can be viewed as the carry bits greater than 2 f implied by the difference s − s δ . Using equation (18), the difference s f − s f can also be bounded:
Since δ needs to be determined in order to verify |s f − s f | 2 f , the following inequality comes from equation (21):
The smallest integer solving inequality (22) is δ = log 2 (n f ) .
Remark 3. With Proposition 2, it may happened that one p i (or more) has a MSB lesser than f − δ, and so all bits of this p i will be removed by applying this technique. Therefore, a good idea will be to redetermine δ with n f minus the number of removed p i . So equation (15) can be replaced by Algorithm 1.
Algorithm 1: Evaluation of the integer δ.
Input:
δ ← log 2 (n ) ; 5 n δ ← Card({i | 1 i n and m i < f − δ}); 6 until n = n δ ; 7 return δ Formating method. The first step of LSB formatting is a direct application of proposition 2: it removes useless bits. After this step i f − δ, ∀1 i n. The second step involves having i = f − δ, ∀1 i n. To do this, either FPF p i can be changed from
consisting to add i − f +δ zeros to the right of these p i s), or multipliers can be rewritten to perform operation into a given word-length. Let M i be the multiplier computing p i = c i × v i . Then, w M i , the word-length of the result of M i is given by:
where m c i and m v i are MSBs of c i and v i respectively.
FormattingBitstoBetterImplementSignalProcessingAlgorithms
Remark 4. For a better accuracy, m M i can be evaluated using formulas from section 2.1, it avoids double sign bit in general case (given by the +1 in eq. (24)). Moreover, if c i + v i > M i , then c i + v i − M i zeros are added to the right of p i s to ensure i = f − δ for these p i s.
Error evaluation. Adding two numbers in FxP arithmetic requires to align them onto the same LSB using right-shifts. A rounding error may occur, which introduces a numerical error. After the second step of LSB formatting, where rounding errors may be introduced, all p i s have the same LSB, i.e. f − δ. There is no need of right-shift to align operands of additions and therefore no additional rounding errors are introduced by the global sum of p i s. The total number of right-shifts involved in the first step of our method can be bounded as follows.
Proposition 3. With this LSB formatting technique and for a n th -order SoP, the number of right-shifts is bounded by n + 1, at most one right-shift by multiplier (denoted d i for multiplier M i ) and exactly one final right-shift (denoted d f ). Their values are:
where I = {i | 1 i n and f − δ > c i + v i }.
Proof: d i is the right-shift in multiplier M i if a rightshift is necessary, i.e. if f − δ > c i + v i so the number of bits to remove to ensure
All the additions are computed on w f + δ bits, and the result is w f bits long, so the final right-shift value is δ.
Remark 5. In Proposition 3 only non-zero rightshifts are considered. If d i is defined as max
multipliers have a right-shift, possibly null, and so the exact number of right-shifts is n + 1.
Finally, it is possible to bound the error introduced by our method. As seen in (Hilaire and Lopez, 2013) 
Round to nearest: Proof: By using (28) on a multiplier i, e equals to −2 i +d i −1 + 2 i where d i is the right-shift value given by Proposition 3 and p i is the initial LSB of the multiplier result, i.e. the optimal LSB which is the sum of LSBs of product operands c i and v i . For the final right-shift, the initial LSB equals to f − δ and the final result is δ bits right-shifted. Remark 6. The precise bounds of the global interval error shown in Proposition 4 can be bounded by a power of 2. Indeed, for truncation rounding mode the global interval error is included in ] − 2 f +1 ; 0], whereas for round-to-nearest rounding mode it is in-
In (Lopez et al., 2012) , all p i s have different LSBs, and therefore the global error depends on the order of the additions. Consequently, all the different evaluation schemes (ES), i.e. all the different possible orders of the additions, are generated and the choice is made meanly for ES with a minimal error. Here, all ES have the same global error value (Proposition 4), so error can not be a criteria to choose the best ES representing the sum. The criteria chosen in the section 5 is the infinite parallelism criteria, i.e. the most parallelizable ES.
MSBs Formatting
The MSBs of p i s having a greater positions than the final MSB, m f can be removed using a new formalization of the Jackson's Rule (Jackson, 1970) . This Rule states that in consecutive additions and/or subtractions in two's complement arithmetic, some intermediate results and operands may overflow. As long as the final result representation can handle the final result without overflow, then the result is valid. Example 2. Let us consider a sum S of three 8-bit integers with two's complement arithmetic, for example 104 + 82 − 94. The result S = 92 is in the range of 8bit signed numbers, but the intermediate sum 104+82
produces an overflow and equals to −70 into this format (instead of 186 that cannot be represented). The final sum −70 − 94 also produces an overflow and equals to 92 into the final format, that is the correct result.
With this paper's notations, it means that bits with a greater position than m f can be removed from concerned p i s. Proposition 5 (Fixed-Point Jackson's Rule). Let s be a sum of n fixed-point number p i s, in format (M, L). If s is known to have a final MSB equals to m f with m f < M, then:
Proof: s = ∑ n i=1 p i , so, from (1):
All bits of s greater than 2 m f (from p i, j with j m f + 1 and from the carry bits produced by p i, j with j < m f + 1) are repetitions of the sign bits, since −2 m f s < 2 m f (by definition of the final FPF). Therefore, s can be only computed with p i, j with j < m f + 1 in the format FPF f . Thus, in our method, the MSB formatting is an application of the propostion to a sum s δ previously LSB-formatted p i s, i.e. with L = f − δ. Therefore, s δ can be computed only using bits p i, j with 1 i n, f − δ j m f without considering intermediate overflows.
OUTPUT ERROR ANALYSIS
Let us consider a n-th order IIR 1 filter having H as a transfer function:
This filter is usually realized with the following algorithm
where u(k) is the input at step k and y(k) the output at step k. So the evaluation of the filter relies on the evaluation of a SoP. As seen in previous sections, the fixedpoint evaluation of eq. (36) implies the add of an error 1 Infinite Impulse Response e(k) at time k, and only y † (the output contaminated with roundoff error) can be computed:
. (37) In (Lopez et al., 2012) , it has been shown that the implemented system eq. (37) can be seen as the initial system (36) with an error added on the output, as shown in Figure 3 : by subtracting equations (37) and (36), it comes
So the output error ∆y(k) y † (k) − y(k) can be seen as the result of the error e(k) through the filter H e defined by H e (z) = 1 1 + a 1 z −1 + · · · + a n z −n , ∀z ∈ C. (39) Since the error e(k) done in the evaluation of the SoP is known to be in a given interval [e; e] (see Proposition 4), then the following proposition (Hilaire and Lopez, 2013) gives the output error interval: Proposition 6 (Output error interval). ∆y(k) is the output of the error e(k) through the filter H e . If the error e(k) is in [e; e], then ∆y(k) is in [∆y; ∆y] with: Figure 4 : Bits representation of the sum of the example.
RESULTS AND COMPARISONS
A 4-th order Butterworth filter is used as an illustrative example. The chosen realization to compute this filter is the Direct Form I:
Its coefficients a i s and b i s are given by the Matlab command butter(4,0.136):
For implementations, the output variables y(i)s, input variables u(i)s, constants a i s and b i s are 16-bit words.
Moreover, variables u(i)s are considered in this example to be in the interval [−13; 13] (the corresponding FPF is (4, −11)), and variables y(i)s (including result output y(k)) are known to be in the interval [−17.123541221107534; 17.123541221107534] (corresponding to the final FPF (m f , f ) = (5, −10)).
From these informations, the operands to be summed, p i s, can be obtained with their respective FPF: Four implementations are compared: a double precision implementation and three fixed-point implementations using bit formatting approach using different values of δ. In the first FxP implementation (denoted Fix 1 ), all bits are considered. This means that δ 1 is chosen such that f − δ 1 = min i ( i ). In other word, we have δ 1 = f − min i ( i ) = 25. The Fix 1 implementation corresponds to the computation of the sum s with no LSB formatting.
In the second FxP implementation Fix 2 , no additional bits are considered. The intermediate format is the final format, (m f , f ), which corresponds to δ 2 = 0. A large LSB reduction is performed.
The third FxP implementation Fix 3 is the faithful implementation, with δ 3 determined from Proposition 2, i.e. δ 3 = log 2 (n f ) with n f = n = 9, so δ 3 = 4. Only 4 guards bits are used in the LSB formatting. Figure 4 illustrates this example. For Fix 1 , since the intermediate format is (5, −35), additional bits equal to 0 are considered to align all p i s onto this format. For Fix 2 and Fix 3 , the intermediate formats, (5, −10) and (5, −14) respectively, permit to remove bits, blue and green hatched bits and blue hatched bits respectively. Finally, the intermediate sums are rounded to the final format (5, −10), except s δ 2 which is already in the final format.
The global interval error [e; e] is computed, for the implementation Fix 3 , using Proposition 4 : e = −1.4645302 × 10 −3 , e = 0.
From equation (43), DC-gain and worst-case peak gain of H e are obtained :
|H e | DC = 49.5647, H e ∞ = 66.8474.
Finally, the output error interval (Proposition 6) [∆y; ∆y] is computed from equations (45) and (46) : ∆y = −8.52445240×10 −2 , ∆y = 1.26555189×10 −2 .
(47) As illustration (but not proof), of the theoretical result (46), a simulation in FxP and floating point arithmetic has been done with a white noise input u(k) in [−13; 13] . The error between the double floating result and each of the FxP implementations is shown in Figure 5 .
The number of additional bits considered in Fix 3 is small compared with Fix 1 which considers all bits, but it is good enough to have errors measures far better than Fix 2 and really close to Fix 1 . The plotted error for implementation Fix 3 is in the bound predicted by the theory.
The fixed-point implementation Fix 3 is given by algorithm 2. In this implementation, sum modulo 2 m f +1 (i.e. 2 6 ) is performed, but since algorithm considers integer computations, the sum is performed modulo 2 m f +1− f +δ (i.e. 2 20 ).
Algorithm 2: Fixed-point algorithm.
Input: U0 to U4: 16-bit input (4, −11) Y 1 to Y 4: 16-bit input (5, −10) Output: Y : 16-bit output (5, −10) Data: Rx: 20-bit registers ⊕: the 20-bit sum
