Error analysis of digital filters using HOL theorem proving  by Akbarpour, Behzad & Tahar, Sofiène
Journal of Applied Logic 5 (2007) 651–666
www.elsevier.com/locate/jal
Error analysis of digital filters using HOL theorem proving ✩
Behzad Akbarpour ∗, Sofiène Tahar
Electrical and Computer Engineering Department, Concordia University, Montreal, Canada
Received 12 July 2005; accepted 13 November 2006
Available online 12 January 2007
Abstract
When a digital filter is realized with floating-point or fixed-point arithmetics, errors and constraints due to finite word length are
unavoidable. In this paper, we show how these errors can be mechanically analysed using the HOL theorem prover. We first model
the ideal real filter specification and the corresponding floating-point and fixed-point implementations as predicates in higher-order
logic. We use valuation functions to find the real values of the floating-point and fixed-point filter outputs and define the error as the
difference between these values and the corresponding output of the ideal real specification. Fundamental analysis lemmas have
been established to derive expressions for the accumulation of roundoff error in parametric Lth-order digital filters, for each of the
three canonical forms of realization: direct, parallel, and cascade. The HOL formalization and proofs are found to be in a good
agreement with existing theoretical paper-and-pencil counterparts.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Error analysis; Digital filters; Theorem proving; Fixed-point; Floating-point; HOL
1. Introduction
Signal processing through digital techniques has become increasingly attractive with the rapid technological
advancement in digital integrated circuits, devices, and systems. The availability of large scale general purpose com-
puters and special purpose hardware has made real time digital filtering both practical and economical. Digital filters
are a particularly important class of DSP (Digital Signal Processing) systems. A digital filter is a discrete time system
that transforms a sequence of input numbers into another sequence of output, by means of a computational algorithm
[13]. Digital filters are used in a wide variety of signal processing applications, such as spectrum analysis, digital
image and speech processing, and pattern recognition. Due to their well-known advantages, digital filters are often
replacing classical analog filters. The three distinct and most outstanding advantages of the digital filters are their
flexibility, reliability, and modularity. Excellent methods have been developed to design these filters with desired
characteristics. The design of a filter is the process of determination of a transfer function from a set of specifications
given either in the frequency domain, or in the time domain, or for some applications, in both. The design of a digital
✩ This paper is an extended version of an earlier publication in TPHOLs 2004 [B. Akbarpour, S. Tahar, Error analysis of digital filters using
theorem proving, in: Theorem Proving in Higher Order Logics, in: Lecture Notes in Computer Science, vol. 3223, Springer-Verlag, 2004,
pp. 1–16].
* Corresponding author.
E-mail addresses: behzad@ece.concordia.ca (B. Akbarpour), tahar@ece.concordia.ca (S. Tahar).1570-8683/$ – see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.jal.2006.11.001
652 B. Akbarpour, S. Tahar / Journal of Applied Logic 5 (2007) 651–666Fig. 1. Error analysis approach.
filter starts from an ideal real specification. In a theoretical analysis of the digital filters, we generally assume that
signal values and system coefficients are represented in the real number system and are expressed to an infinite pre-
cision. When implemented as a special-purpose digital hardware or as a computer algorithm, we must represent the
signals and coefficients in some digital number system that must always be of a finite precision. Therefore, arithmetic
operations must be carried out with an accuracy limited by this finite word length. There is a variety of types of arith-
metic used in the implementation of digital systems. Among the most common are the floating-point and fixed-point.
Here, all operands are represented by a special format or assigned a fixed word length and a fixed exponent, while
the control structure and the operations of the ideal program remain unchanged. The transformation from the real to
the floating-point and fixed-point forms is quite tedious and error-prone. On the implementation side, the fixed-point
model of the algorithm has to be transformed into the best suited target description, either using a hardware description
or a programming language. This design process can be aided by a number of specialized CAD tools such as SPW
(Cadence) [3], CoCentric (Synopsys) [20], Matlab-Simulink (Mathworks) [16], and FRIDGE (Aachen UT) [22].
In this paper, we propose a methodology for the error analysis of digital filters using the HOL theorem proving
environment [5] based on the commutating diagram shown in Fig. 1. Thereafter, we first model the ideal real filter
specification and the corresponding floating-point and fixed-point implementations as predicates in higher-order logic.
For this, we make use of existing theories in HOL on the construction of real numbers [7], the formalization of
IEEE-754 standard based floating-point arithmetic [8,9], and the formalization of fixed-point arithmetic [2]. We use
valuation functions to find the real values of the floating-point and fixed-point filter outputs and define the errors as
the differences between these values and the corresponding output of the ideal real specification. Then we establish
fundamental lemmas on the error analysis of the floating-point and fixed-point roundings and arithmetic operations
against their abstract mathematical counterparts. Finally, we use these lemmas as a model to derive expressions for
the accumulation of the roundoff error in parametric Lth-order digital filters, for each of the three canonical forms
of realization: direct, parallel, and cascade [18]. Using these forms, our verification methodology can be scaled up to
any larger-order filter, either directly or by decomposing the design into a combination of internal sub-blocks. While
the theoretical work on computing the errors due to finite precision effects has been extensively studied since the late
sixties [15], it is for the first time in this paper, that a formalization and proof of this analysis for digital filters is done
using a mechanical theorem prover, here the HOL. Our results are found to be in a good agreement with the theoretical
ones.
The rest of this paper is organized as follows: Section 2 gives a review of the related work. Section 3 introduces
the fundamental lemmas in HOL for the error analysis of the floating-point and fixed-point rounding and arithmetic
operations. Section 4 describes the details of the error analysis in HOL of the class of linear difference equation digital
filters implemented in the three canonical forms of realization. Finally, Section 5 concludes the paper.
2. Related work
Work on the analysis of the errors due to the finite precision effects in the realization of the digital filters has
always existed since their early days, however, using theoretical paper-and-pencil proofs and simulation techniques.
For digital filters realized with the fixed-point arithmetic, error problems have been studied extensively. For instance,
Knowles and Edwards [14] proposed a method for analysis of the finite word length effects in fixed-point digital
B. Akbarpour, S. Tahar / Journal of Applied Logic 5 (2007) 651–666 653filters. Gold and Radar [6] carried out a detailed analysis of the roundoff error for the first-order and second-order
fixed-point filters. Jackson [12] analyzed the roundoff noise for the cascade and parallel realizations of the fixed-
point digital filters. While the roundoff noise for the fixed-point arithmetic enters into the system additively, it is a
multiplicative component in the case of the floating-point arithmetic. This problem is analyzed first by Sandberg [19],
who discussed the roundoff error accumulation and input quantization effects in the direct realization of the filter
excited by a deterministic input. He also derived a bound on the time average of the squared error at the output. Liu
and Kaneko [15] presented a general approach to the error analysis problem of digital filters using the floating-point
arithmetic and calculated the error at the output due to the roundoff accumulation and input quantization. Expressions
are derived for the mean square error for each of the three canonical forms of realization: direct, cascade, and parallel.
Upper bounds that are useful for a special class of the filters are given. Oppenheim and Weinstein [17] discussed in
some details the effects of the finite register length on implementations of the linear recursive difference equation
digital filters, and the fast Fourier transform (FFT) algorithm. Comparisons of the roundoff noise in the digital filters
using the different types of arithmetics have also been reported in [21].
In order to validate the error analysis, most of the above work compare the theoretical results with corresponding
experimental simulations. In this paper, we show how the above error analysis can be mechanically performed using
the HOL theorem prover, providing a superior approach to validation by simulation. Our focus will be on the process
of translating the hand proofs into equivalent proofs in HOL. The analysis we propose is mostly inspired by the work
done by Liu and Kaneko [15], who defined a general approach to the error analysis problem of digital filters using
the floating-point arithmetic. Following a similar approach, we have extended this theoretical analysis for fixed-point
digital filters. In both cases, a good agreement between the HOL formalized and the theoretical results are obtained.
Through our work, we confirmed and strengthened the main results of the previously published theoretical error
analysis, though we uncovered some minor errors in the hand proofs and located a few subtle corners that were
overlooked informally. For example, in the theoretical fixed-point error analysis it is always assumed that the fixed-
point addition causes no error and only the roundoff error in the fixed-point multiplication is analyzed [17]. This is
under the assumption that there is no overflow in the result and also the input operands have the same attributes as the
output. Using a mechanical theorem prover, we provide a more general error analysis in which we cover the roundoff
errors in both the fixed-point addition and multiplication operations. On top of that, for the floating-point error analysis,
we have used the formalization in HOL of the IEEE-754 [8], a standard which has not yet been established at the time
of the above mentioned theoretical error analysis. This enabled us to cover a more complete set of rounding and
overflow modes and degenerate cases which are not discussed in earlier theoretical work.
Previous work on the error analysis in formal verification was done by Harrison [9] who verified the floating-point
algorithms such as the exponential function against their abstract mathematical counterparts using the HOL Light
theorem prover. As the main theorem, he proved that the floating-point exponential function has a correct overflow
behavior, and in the absence of overflow the error in the result is bounded to a certain amount. He also reported on an
error in the hand proof mostly related to forgetting some special cases in the analysis. This error analysis is very similar
to the type of analysis performed for DSP algorithms. The major difference, however, is the use of statistical methods
and mean square error analysis for DSP algorithms which is not covered in the error analysis of the mathematical
functions used by Harrison. In this method, the error quantities are treated as independent random variables uniformly
distributed over a specific interval depending on the type of arithmetic and the rounding mode. Then the error analysis
is performed to derive expressions for the variance and mean square error. To perform such an analysis in HOL,
we need to develop a mechanized theory on the properties of random variables and random processes. This type of
analysis is not addressed in this paper and is a part of our work in progress. Huhn et al. [11] proposed a hybrid formal
verification method combining different state-of-the-art techniques to guide the complete design flow of imprecisely
working arithmetic circuits starting at the algorithmic down to the register transfer level. The usefulness of the method
is illustrated with the example of the discrete cosine transform algorithms. In particular, the authors have shown the
use of computer algebra systems like Mathematica or Maple at the algorithmic level to reason about real numbers
and to determine certain error bounds for the results of numerical operations. In contrast to [11], we propose an error
analysis for digital filters using the HOL theorem prover. Although the computer algebraic systems such as Maple or
Mathematica are much more popular and have many powerful decision procedures and heuristics, theorem provers
are more expressive, more precise, and more reliable [10]. One option is to combine the rigour of the theorem provers
with the power of computer algebraic systems as proposed in [10].
654 B. Akbarpour, S. Tahar / Journal of Applied Logic 5 (2007) 651–6663. Error analysis models
In this section we introduce the fundamental error analysis theorems [4,23], and the corresponding lemmas in
HOL for the floating-point [8,9] and fixed-point [2] arithmetics. These theorems are then used in the next sections as
a model for the analysis of the roundoff error in digital filters.
3.1. Floating-point error model
In analyzing the effects of floating-point roundoff, the effects of rounding will be represented multiplicatively. The
following theorem is the most fundamental in the floating-point rounding-error theory [4,23].
Theorem 1. If the real number x located within the floating-point range is rounded to the closest floating-point number
xR , then
(1)xR = x(1 + δ), where |δ| 2−p
and p is the precision of the floating-point format.
In HOL, we established this theorem according to the available formalization of IEEE 754 floating-point standard
[8,9], as follows:
 (normalizes x) ⇒
∃e.
(abs e  inv (2 pow ((fracwidth X) + 1))) ∧
(Val (float (round X mode x)) = x * (1 + e))
where the function normalizes defines the criteria for an arbitrary real number to be in the normalized range of
floating-point numbers, fracwidth extracts the fraction width parameter from the floating-point format X, Val is
the floating-point valuation function, float is the bijection function that converts a triplet of natural numbers into the
floating-point type, round is the floating-point rounding function, and mode is the corresponding rounding mode.
To prove this theorem [4], we first proved the following lemma which locates a real number in a binade (the
floating-point numbers between two adjacent powers of 2):
 (normalizes x) ⇒
∃j.
(j  ((emax X) − 2)) ∧
((2 pow (j + 1) / 2 pow (bias X))  abs x) ∧
(abs x < (2 pow (j + 2) / 2 pow (bias X)))
where the function emax defines the maximum exponent in a given floating-point format, and bias defines the ex-
ponent bias in the floating-point format which is a constant used to make the exponent’s range nonnegative. Using this
lemma we can rewrite the general floating-point absolute error bound theorem (ERROR_BOUND_NORM_STRONG)
developed in [9] as follows:
 (normalizes x) ⇒
∃j.
(abs (error x)  (2 pow j / 2 pow (bias X + fracwidth X)))
which states that if the absolute value of a real number is in the representable range of the normalized floating-point
numbers, then the absolute value of the error is less than or equal to 2j /2(biasX + fracwidthX). The function error,
defines the error resulting from rounding a real number to a floating-point value which is defined as follows [9]:
def error x = (Val (float (round X mode x)) − x)
Since (2(j+1) / 2(bias X))  |x| for the real numbers in the normalized region, we have (|error x| / |x|) 
(2j / 2(bias X + fracwidthX)) /(2(j+1) / 2(biasX)) or (|error x| / |x|)  (1 / 2((fracwidth X) + 1)). Finally, defining
e = (error x / x) will complete the proof of the floating-point relative rounding error bound theorem.
B. Akbarpour, S. Tahar / Journal of Applied Logic 5 (2007) 651–666 655Next, we apply the floating-point relative rounding error analysis theorem (Theorem 1) to the verification of the
arithmetic operations. The goal is to prove the following theorem in which floating-point arithmetic operations such
as addition, subtraction, multiplication, and division are related to their abstract mathematical counterparts according
to the corresponding errors.
Theorem 2. Let ∗ denote any of the floating-point operations +, −, ×, /. Then
(2)f l(x ∗ y) = (x ∗ y)(1 + δ), where |δ| 2−p
and p is the precision of the floating-point format. The notation f l(.) is used to denote that the operation is performed
using the floating-point arithmetic.
To prove this theorem in HOL, we start from the already proved lemmas on absolute analysis of rounding error in
floating-point arithmetic operations (FLOAT_ADD,FLOAT_SUB,FLOAT_MUL,FLOAT_DIV) developed in [9]. We
have converted these lemmas to the following relative error analysis version, using the relative error bound analysis
of floating-point rounding (Theorem 1):
 [(Finite a) ∧ (Finite b) ∧ (normalizes (Val a + Val b))] ⇒
(Finite (a + b)) ∧
∃e.
abs e  inv (2 pow ((fracwidth X) + 1)) ∧
(Val (a + b) = (Val a + Val b) * (1 + e))
 [(Finite a) ∧ (Finite b) ∧ (normalizes (Val a - Val b))] ⇒
(Finite (a − b)) ∧
∃e.
abs e  inv (2 pow ((fracwidth X) + 1)) ∧
(Val (a − b) = (Val a − Val b) * (1 + e))
 [(Finite a) ∧ (Finite b) ∧ (normalizes (Val a * Val b))] ⇒
(Finite (a * b)) ∧
∃e.
abs e  inv (2 pow ((fracwidth X) + 1)) ∧
(Val (a * b) = (Val a * Val b) * (1 + e))
 [(Finite a) ∧ (Finite b) ∧ (¬ Iszero b) ∧ (normalizes (Val a / Val b))] ⇒
(Finite (a / b)) ∧
∃e.
abs e  inv (2 pow ((fracwidth X) + 1)) ∧
(Val (a / b) = (Val a / Val b) * (1 + e))
where the function Finite defines the finiteness criteria for the floating-point numbers, and the function Iszero
checks if a given floating-point number is equal to zero. Note that we use the conventional symbols for arithmetic
operations on floating-point numbers using the operator overloading feature of HOL. The lemmas are composed of
two parts. The first part is about the finiteness of the floating-point operation output. It states that for each pair of finite
floating-point numbers, if the real result is in the representable range of normalized floating-point numbers, then the
output result is also finite. For floating-point division, the second operand should be nonzero to avoid the division by
zero. The second part of the lemmas states that the result of a floating-point operation is the exact result, perturbed by
a relative error of bounded magnitude. It is of great importance to note that in these theorems the format parameter X
is hidden inside the floating-point arithmetic operations (a + b, etc.) using the HOL operator overloading feature.
3.2. Fixed-point error model
While the rounding error for the floating-point arithmetic enters into the system multiplicatively, it is an addi-
tive component for the fixed-point arithmetic. In this case the fundamental error analysis theorem can be stated as
follows [23].
656 B. Akbarpour, S. Tahar / Journal of Applied Logic 5 (2007) 651–666Theorem 3. If the real number x located in the range of the fixed-point numbers with format X’ is rounded to the
closest fixed-point number x′R , then
(3)x′R = x + , where ||  2−fracbits (X
′)
and fracbits is a function that extracts the number of bits that are to the right of the binary point in the given fixed-point
format.
This theorem is proved in HOL as follows [2]:
 [(validAttr X’) ∧ (representable X’ x)] ⇒
∃e.
abs e  inv (2 pow fracbits X’) ∧
(value (Fxp_round X’ x) = x + e)
where the function validAttr defines the validity of the fixed-point format X’, representable defines the
criteria for a real number to be in the representable range of the fixed-point format, and Fxp_round is the fixed-
point rounding function.
The verification of the fixed-point arithmetic operations using the absolute error analysis of the fixed-point rounding
(Theorem 3) can be stated as the following theorem in which the fixed-point arithmetic operations are related to their
abstract mathematical counterparts according to the corresponding errors.
Theorem 4. Let ∗ denote any of the fixed-point operations +, −, ×, /, with a given format X′. Then
(4)fxp(x ∗ y) = (x ∗ y) + , where || 2−fracbits (X′)
and the notation fxp (.) is used to denote that the operation is performed using the fixed-point arithmetic.
This theorem is proved in HOL using the following lemmas [2]:
 [(Isvalid a) ∧ (Isvalid b) ∧ validAttr (X’) ∧
(representable X’ (value a + value b))] ⇒
[(Isvalid (FxpAdd X’ a b)) ∧
∃e.
abs e  inv (2 pow (fracbits X’)) ∧
value (FxpAdd X’ a b) = (value a + value b) + e]
 [(Isvalid a) ∧ (Isvalid b) ∧ validAttr (X’) ∧
(representable X’ (value a − value b))] ⇒
[(Isvalid (FxpSub X’ a b)) ∧
∃e.
abs e  inv (2 pow (fracbits X’)) ∧
value (FxpSub X’ a b) = (value a − value b) + e]
 [(Isvalid a) ∧ (Isvalid b) ∧ validAttr (X’) ∧
(representable X’ (value a * value b))] ⇒
[(Isvalid (FxpMul X’ a b)) ∧
∃e.
abs e  inv (2 pow (fracbits X’)) ∧
value (FxpMul X’ a b) = (value a * value b) + e]
 [(Isvalid a) ∧ (Isvalid b) ∧ ¬ (value b = 0) ∧ validAttr (X’) ∧
(representable X’ (value a / value b))] ⇒
[(Isvalid (FxpDiv X’ a b)) ∧
∃e.
abs e  inv (2 pow (fracbits X’)) ∧
value (FxpDiv X’ a b) = (value a / value b) + e]
B. Akbarpour, S. Tahar / Journal of Applied Logic 5 (2007) 651–666 657where the function Isvalid defines the validity of a fixed-point number, value is the fixed-point valuation func-
tion, and FxpAdd, FxpSub, FxpMul, and FxpDiv are the corresponding functions for fixed-point addition,
subtraction, multiplication, and division operations, respectively. According to these lemmas, if the input fixed-point
numbers and the output attributes are valid, then the result of fixed-point operations is valid. For fixed-point division,
the second operand should be nonzero to avoid the division by zero. The result of the fixed-point operations is the
exact result, perturbed by an absolute error of bounded magnitude.
As explained before, in the theoretical fixed-point error analysis of digital filters, it is always assumed that the fixed-
point addition causes no error and only the roundoff error in the fixed-point multiplication is analyzed. This is under
the assumption that there is no overflow in the result and also the input operands have the same attributes as the output.
However, as explained in details in [2], each fixed-point number is defined as a pair consisting of a binary string and
a set of attributes (Binary String, Attributes). The attributes specify how the binary string is interpreted. Therefore,
each fixed-point number has its own attributes based on which the corresponding real value can be computed. On the
other hand, each fixed-point arithmetic operation such as addition, takes two fixed-point input operands and stores the
result into a third. The attributes of the inputs and output need not match one another. The result is formatted into the
output as specified by the output attributes and by the overflow and quantization mode parameters. Therefore, when
we write FxpAdd X’ a b, X’ is the fixed-point addition operation output attributes which can be different from
the attributes of a and b. In fact, the attributes of a and b are hidden in their fixed-point data type representations.
For example, if a = (111101,(6,3,u)) = 111.101 = + 7.625 and b = (110010,(5,3,u)) =
110.01 = + 6.250 and X1 = (7,4,u) then the result will be (1101111,(7,4,u)) = 1101.111 =
+ 13.875 and there is no roundoff error in the result. But if we select X2 = (6,4,u) then the result will
be (1101111,(6,4,u)) = 1101.11 = + 13.75 and the corresponding error is error = 0.125. Note
that in this example both inputs together with the output are unsigned. The situation could be even worse if we choose
different sign formats for them. In general, if the number of fractional bits in the output attributes is less than the one of
inputs then we should expect error in the fixed-point addition result. Using a mechanical theorem prover, we provide
a more general error analysis in which we cover the roundoff error in both the fixed-point addition and multiplication
operations.
4. Error analysis of digital filters using HOL
In this section, the principal results for roundoff accumulation in digital filters using theorem proving are derived
and summarized. We shall employ the models for floating- and fixed-point roundoff errors in HOL presented in the
previous section. To illustrate our approach, we will first consider the case of first- and second-order digital filters.
Then, we extend this analysis to the general case of the direct form realization of a parametric Lth-order filter of
which the first- and second-order filters are special cases. Finally, we apply our approach to the parallel and cascade
forms. Using these forms, larger-order filters can be treated as a combination of first- and second-order filters. Then,
the total error is computed by accumulating the error in all internal sub-filters. In the following, we will first describe
in details the theory behind the analysis and then explain how each step of this analysis is performed in HOL. For the
sake of space, in this paper we will not show all the details. A complete analysis can be found in [1].
The class of digital filters considered in this paper is that of linear constant coefficient filters specified by the
difference equation:
(5)wn =
M∑
i=0
bixn−i −
L∑
i=1
aiwn−i
where {xn} is the input sequence and {wn} is the output sequence. L is the order of the filter, and M can be any
positive number less than L. There are three canonical forms of realizing a digital filter, namely the direct, parallel,
and cascade forms (Fig. 2) [18].
If the output sequence is calculated by using Eq. (5), the digital filter is said to be realized in the direct form.
Fig. 2(a) illustrates the direct form realization of the filter using the corresponding blocks for the addition, multiplica-
tion by a constant operations, and the delay element.
The implementation of a digital filter in the parallel form is shown in Fig. 2(b) in which the entire filter is visual-
ized as the parallel connection of the simpler filters Hi of a lower order. In this case, K intermediate outputs {win},
658 B. Akbarpour, S. Tahar / Journal of Applied Logic 5 (2007) 651–666Fig. 2. Canonical forms of digital filter realizations.
i = 1,2, . . . ,K , are first calculated and then summed to form the total output {wn}. Therefore, for the input sequence
{xn} we have:
(6)win = fixn + gixn−1 − ciwin−1 − diwin−2
where the parameters fi, gi, ci , and di are obtained from the parameters ai and bi in Eq. (5) using the parallel expan-
sion. The output of the entire filter wn, is then related to win by:
(7)wn = w1n + w2n + · · · + wKn
The implementation of a digital filter in the cascade form is shown in Fig. 2(c) in which the filter is visualized as
a cascade of lower filters. From the input {xn}, the intermediate output {w1n} is first calculated, and then this is the
input to the second filter. Continuing in this manner, the final output wKn = wn is calculated. Since the output of the
ith section (win) is the input of the (i + 1)th section, the following equation holds:
(8)wi+1n = win + kiwin−1 + liwin−2 − ciwi+1n−1 − diwi+1n−2
where the parameters ki, li , ci , and di are obtained from the parameters ai and bi in Eq. (5) using the serial expansion.
There are three common sources of errors associated with the filter of Eq. (5), namely [15]:
1. Input quantization: caused by the quantization of the input signal {xn} into a set of discrete levels.
B. Akbarpour, S. Tahar / Journal of Applied Logic 5 (2007) 651–666 6592. Coefficient inaccuracy: caused by the representation of the filter coefficients {ak} and {bk} by a finite word length.
3. Roundoff accumulation: caused by the accumulation of roundoff errors at arithmetic operations.
In the following analysis, we will first focus on the roundoff accumulation error, and then describe how the results
can be modified by considering the effects of the other two above mentioned error sources. Therefore, for the digital
filter of Eq. (5) the actual computed output reference is in general different from {wn}. We denote the actual floating-
point and fixed-point outputs by {yn} and {vn}, respectively. Then, we define the corresponding errors at the nth output
sample as:
(9)en = yn − wn
(10)e′n = vn − wn
(11)e′′n = vn − yn
where en and e′n are defined as the errors between the actual floating-point and fixed-point implementations and the
ideal real specification, respectively. e′′n is the error in the transition from the floating-point to fixed-point levels.
It is clear from the above discussion that for the digital filter of Eq. (5) realized in the direct form, we have:
(12)yn = f l
(
M∑
k=0
bkxn−k −
L∑
k=1
akyn−k
)
and
(13)vn = fxp
(
M∑
k=0
bkxn−k −
L∑
k=1
akvn−k
)
The calculation of Eq. (12) is to be performed in the following manner. First, the output products ak yn−k ,
k = 1,2, . . . ,L, are calculated separately and then summed. Next, the same is done for the input products bk xn−k ,
k = 0,1, . . . ,M . Finally, the output summation is subtracted from the input one to obtain the main floating-point
output yn. Similar discussion can be applied for the calculation of the fixed-point output vn according to Eq. (13).
The corresponding flowgraph showing the effect of roundoff error using the fundamental error analysis theorems
(Theorems 2 and 4) according to Eqs. (2) and (4), is given by Fig. 3 which also indicates the order of the calculation.
Formally, a flowgraph is a network of directed branches that connect at nodes. Associated with each node is a
variable or node value. Each branch has an input signal and an output signal with a direction indicated by an arrowhead
on it. In a linear flowgraph, the output of a branch is a linear transformation of the input to the branch. The simplest
examples are constant multipliers and adders, i.e., when the output of the branch is simply a multiplication or an
addition of the input to the branch with a constant value, which are the only classes we consider in this paper. The
linear operation represented by the branch is typically indicated next to the arrowhead showing the direction of the
branch. For the case of a constant multiplier and adder, the constant is simply shown next to the arrowhead. When
an explicit indication of the branch operation is omitted, this indicates a branch transmittance of unity, or identity
transformation. By definition, the value at each node in a flowgraph is the sum of the outputs of all the branches
entering the node. To complete the definition of the flowgraph notation, we define two special types of nodes. (1)
Source nodes that have no entering branches. They are used to represent the injection of the external inputs or signal
sources into a flowgraph. (2) Sink nodes that have only entering branches. They are used to extract the outputs from a
flowgraph [18].
The quantities δn,k , k = 0,1, . . . ,M , n,k , k = 1,2, . . . ,L, ζn,k , k = 1,2, . . . ,M , ηn,k , k = 2,3, . . . ,L, and ξn are
errors caused by floating-point roundoff at each arithmetic step. The corresponding error quantities for fixed-point
roundoff are δ′n,k , k = 0,1, . . . ,M , ′n,k , k = 1,2, . . . ,L, ζ ′n,k , k = 1,2, . . . ,M , η′n,k , k = 2,3, . . . ,L, and ξ ′n.
Note that we have used one flowgraph to represent both the floating-point and fixed-point cases, simultaneously.
For floating-point errors, the branch operations are interpreted as constant multiplications, while for fixed-point errors
the branch operations are interpreted as constant additions. We have surrounded the fixed-point error quantities and
output samples by parentheses to distinguish them from their floating-point counterparts.
660 B. Akbarpour, S. Tahar / Journal of Applied Logic 5 (2007) 651–666Fig. 3. Error flowgraph for Lth-order filter (direct form).
Therefore, the actual outputs yn and vn are seen to be given explicitly by:
(14)yn =
M∑
k=0
bkθn,kxn−k −
L∑
k=1
akφn,kyn−k
where
θn,0 = (1 + ξn)(1 + δn,0)
M∏
i=1
(1 + ζn,i)
θn,j = (1 + ξn)(1 + δn,j )
M∏
i=j
(1 + ζn,i), j = 1,2, . . . ,M
φn,1 = (1 + ξn)(1 + n,1)
L∏
i=2
(1 + ηn,i)
φn,j = (1 + ξn)(1 + n,j )
L∏
i=j
(1 + ηn,i), j = 2,3, . . . ,L
and
(15)vn =
M∑
k=0
bk xn−k −
L∑
k=1
akvn−k +
M∑
k=0
δ′n,k +
M∑
k=1
ζ ′n,k +
L∑
k=1
′n,k +
L∑
k=2
η′n,k + ξ ′n
For the error analysis, we need to calculate the yn and vn sequences from Eqs. (14) and (15), and compare them
with the ideal output sequence wn specified by Eq. (5) to obtain the corresponding errors en, e′n, and e′′n , according
to Eqs. (9), (10), and (11), respectively. Therefore, the difference equations for the errors between the different levels
showing the accumulation of the roundoff error are derived as the following error analysis cases:
B. Akbarpour, S. Tahar / Journal of Applied Logic 5 (2007) 651–666 661Fig. 4. Error flowgraph for Lth-order filter (parallel form).
1. Real to floating-point error analysis:
(16)en +
L∑
k=1
aken−k =
M∑
k=0
bk(θn,k − 1)xn−k −
L∑
k=1
ak(φn,k − 1)yn−k
2. Real to fixed-point error analysis:
(17)e′n +
L∑
k=1
ake
′
n−k =
M∑
k=0
δ′n,k +
M∑
k=1
ζ ′n,k +
L∑
k=1
′n,k +
L∑
k=2
η′n,k + ξ ′n
3. Floating-point to fixed-point error analysis:
e′′n +
L∑
k=1
ake
′′
n−k =
M∑
k=0
δ′n,k +
M∑
k=1
ζ ′n,k +
L∑
k=1
′n,k +
L∑
k=2
η′n,k + ξ ′n
(18)−
M∑
k=0
bk(θn,k − 1)xn−k +
L∑
k=1
ak(φn,k − 1)yn−k
Similar analysis is performed for the parallel and cascade realization forms based on the error flowgraphs as shown
in Figs. 4 and 5, respectively [1].
4.1. Effects of input quantization and coefficient inaccuracy
The discussion presented in previous section concerns only the roundoff accumulation effect. As mentioned before,
there are two other common causes of error due to the finite word length in implementation of digital filters. They are
the quantization of the input data {x(n)} and the inaccuracy of the coefficients {ak} and {bk}.
Let x′(n) and x′′(n) be the floating- and fixed-point quantized versions of x(n), respectively. Then from the dis-
cussion in Section 3 we can write
662 B. Akbarpour, S. Tahar / Journal of Applied Logic 5 (2007) 651–666Fig. 5. Error flowgraph for Lth-order filter (cascade form).
(19)x′(n) = x(n)(1 + θn)
(20)x′′(n) = x(n) + θ ′n
where θn is the error caused by floating-point quantization, and θ ′n is the error caused by fixed-point quantization in
the input signal.
Also, let {a′k}, {a′′k } and {b′k}, {b′′k } be the floating-point and fixed-point quantized versions of {ak} and {bk}, respec-
tively. Then from the discussion in Section 3 we can write
(21)a′k = ak(1 + ϕk), a′′k = ak + ϕ′k
(22)b′k = bk(1 + ψk), b′′k = bk(p) + ψ ′k
where ϕk and ψk are the errors caused by floating-point quantizations, and ϕ′k and ψ ′k are the errors caused by fixed-
point quantizations in the coefficients. One may now proceed with the analysis of Section 4 by adding the factors
(1 + θn), (1 + ϕk), (1 + ψk), θ ′n, ϕ′k , and ψ ′k in appropriate places in the error accumulation equations.
4.2. Error analysis in HOL
In HOL, we first specified parametric Lth-order digital filters at the real, floating-point, and fixed-point abstraction
levels, as predicates in higher-order logic. The direct form is defined in HOL using Eq. (5). For the real specification,
we used the expression sum (m,n) f denoting
∑m+n−1
i=m f (i), which is a function available in the HOL real library
[7] and defines the finite summation on the real numbers. For the floating-point and fixed-point specifications, we
defined similar functions for the finite summations on the floating-point (float_sum) and fixed-point (fxp_sum)
numbers, using the recursive definition in HOL. The corresponding codes in HOL are as follows.
def L_Order_Filter_Direct_Form_Ideal_Spec a b x w M L =
∀n.
w n =
sum (0,SUC M) (λi. b i * x (n − i)) −
sum (1,L) (λi. a i * w (n − i))
def L_Orde_Filter_Direct_Form_Float_Imp a’ b’ x’ y M L =
∀n.
y n =
float_sum (0,SUC M) (λi. b’ i * x’ (n − i)) −
float_sum (1,L) (λi. a’ i * y (n − i))
def L_Order_Filter_Direct_Form_Fxp_Imp X a” b” v M L =
∀n.
v n =
FxpSub X
(fxp_sum (0,SUC M) X (λi. FxpMul X b” i x” (n − i)))
(fxp_sum (1,L) X (λi. FxpMul X a i y (n − i)))
B. Akbarpour, S. Tahar / Journal of Applied Logic 5 (2007) 651–666 663For the error analysis of the digital filters in HOL, we first defined the finite product on the real numbers recursively
as the expression mul (m,n) f denoting
∏m+n−1
i=m f (i) as follows:
def ∀f n m. (mul (n,0) f = 1) ∧
(mul (n,SUC m) f = mul (n,m) f * f (n + m))
Then we established the following lemmas to compute the output real values of the floating-point and fixed-point
filters according to Eqs. (14) and (15) for the direct form of realization.
 L_Order_Filter_Direct_Form_Float_Imp X a’ b’ x’ y M L ⇒
∃t f.
(Val (y n) =
(if L = 0 then
sum (0,SUC M) (λi. Val (b’ i) * t i * Val (x’ (n − i)))
else
sum (0,SUC M) (λi. Val (b’ i) * t i * Val (x’ (n − i))) −
sum (1,L) (λi. Val (a’ i) * f i * Val (y (n − i))))) ∧
∃k d p e z.
abs k  inv (2 pow ((fracwidth X) + 1)) ∧
(∀i. i  M ⇒ abs (d i)  inv (2 pow ((fracwidth X) + 1))) ∧
(∀i. i  M ⇒ abs (p i)  inv (2 pow ((fracwidth X) + 1))) ∧
(∀i. i  L ⇒ abs (e i)  inv (2 pow ((fracwidth X) + 1))) ∧
(∀i. i  L ⇒ abs (z i)  inv (2 pow ((fracwidth X) + 1))) ∧
(t 0 = (1 + k) * (1 + d 0) * mul (1,M) (λi. 1 + p i)) ∧
(∀j.
1  j ∧ j  M ⇒
(t j = (1 + k) * (1 + d j) * mul (j,M − (j − 1)) (λj. 1 + p j))) ∧
(f 1 = (1 + k) * (1 + e 1) * mul (2,L − 1) (λi. 1 + z i)) ∧
(∀j.
2  j ∧ j  L ⇒
(f j = (1 + k) * (1 + e j) * mul (j,L − j + 1) (λj. 1 + z j)))
 L_Order_Filter_Direct_Form_Fxp_Imp X’ a” b” x” v M L ⇒
∃k d p e z.
abs k  inv (2 pow fracbits X’) ∧
(∀i. i  M ⇒ abs (d i)  inv (2 pow fracbits X’)) ∧
(∀i. i  M ⇒ abs (p i)  inv (2 pow fracbits X’)) ∧
(∀i. i  L ⇒ abs (e i)  inv (2 pow fracbits X’)) ∧
(∀i. i  L ⇒ abs (z i)  inv (2 pow fracbits X’)) ∧
(value (v n) =
(if L = 0 then
sum (0,SUC M) (λi. value (b” i) * value (x” (n − i))) +
sum (0,SUC M) (λi. d i) + sum (1,M) (λj. p j) + k
else
sum (0,SUC M) (λi. value (b” i) * value (x” (n − i))) +
sum (0,SUC M) (λi. d i) + sum (1,M) (λj. p j) -
(sum (1,L) (λi. value (a” i) * value (v (n − i))) +
sum (1,L) (λi. e i) + sum (2,L − 1) (λj. z j)) + k))
Finally, we defined the errors as the differences between the output of the real filter specification and the cor-
responding real values of the floating-point and fixed-point filter implementations (Float_Error,Fxp_Error),
as well as the error in transition from the floating-point to fixed-point levels (Float_Fxp_Error), according to
Eqs. (9), (10), and (11), respectively. Then, we established lemmas for the accumulation of the roundoff error between
the different levels, according to Eqs. (16), (17), and (18).
 [L_Order_Filter_Direct_Form_Ideal_Spec a b x w M L ∧
L_Order_Filter_Direct_Form_Float_Imp X a’ b’ x’ y M L] ⇒
∃t f.
(if L = 0 then
664 B. Akbarpour, S. Tahar / Journal of Applied Logic 5 (2007) 651–666Float_Error n =
sum (0,SUC M) (λi. Val (b’ i) * (t i − 1) * Val (x’ (n − i)))
else
Float_Error n + sum (1,L) (λi. a i * Float_Error (n − i)) =
sum (0,SUC M) (λi. Val (b’ i) * (t i − 1) * Val (x’ (n − i))) −
sum (1,L) (λi. Val (a’ i) * (f i − 1) * Val (y (n − i)))) ∧
∃k d p e z.
abs k  inv (2 pow ((fracwidth X) + 1)) ∧
(∀i. i  M ⇒ abs (d i)  inv (2 pow ((fracwidth X) + 1))) ∧
(∀i. i  M ⇒ abs (p i)  inv (2 pow ((fracwidth X) + 1))) ∧
(∀i. i  L ⇒ abs (e i)  inv (2 pow ((fracwidth X) + 1))) ∧
(∀i. i  L ⇒ abs (z i)  inv (2 pow ((fracwidth X) + 1))) ∧
(t 0 = (1 + k) * (1 + d 0) * mul (1,M) (λi. 1 + p i)) ∧
(∀j.
1  j ∧ j  M ⇒
(t j = (1 + k) * (1 + d j) * mul (j,M − (j − 1)) (λj. 1 + p j))) ∧
(f 1 = (1 + k) * (1 + e 1) * mul (2,L − 1) (λi. 1 + z i)) ∧
(∀j.
2  j ∧ j  L ⇒
(f j = (1 + k) * (1 + e j) * mul (j,L − j + 1) (λj. 1 + z j)))
 [L_Order_Filter_Direct_Form_Ideal_Spec a b x w M L ∧
L_Order_Filter_Direct_Form_Fxp_Imp X’ a” b” x” v M L] ⇒
∃k d p e z.
abs k  inv (2 pow fracbits X’) ∧
(∀i. i  M ⇒ abs (d i)  inv (2 pow fracbits X’)) ∧
(∀i. i  M ⇒ abs (p i)  inv (2 pow fracbits X’)) ∧
(∀i. i  L ⇒ abs (e i)  inv (2 pow fracbits X’)) ∧
(∀i. i  L ⇒ abs (z i)  inv (2 pow fracbits X’)) ∧
(if L = 0 then
Fxp_Error n =
sum (0,SUC M) (λi. d i) + sum (1,M) (λj. p j) + k
else
Fxp_Error n + sum (1,L) (λi. a i * Fxp_Error (n − i)) =
sum (0,SUC M) (λi. d i) + sum (1,M) (λj. p j) −
(sum (1,L) (λi. e i) + sum (2,L − 1) (λj. z j)) + k)
 [L_Order_Filter_Direct_Form_Ideal_Spec a b x w M L ∧
L_Order_Filter_Direct_Form_Float_Imp X a’ b’ x’ y M L ∧
L_Order_Filter_Direct_Form_Fxp_Imp X’ a” b” x” v M L] ⇒
∃t f k’ d’ p’ e’ z’.
(if L = 0 then
Float_Fxp_Error n =
sum (0,SUC M) (λi. d’ i) + sum (1,M) (λj. p’ j) + k’ −
sum (0,SUC M) (λi. Val (b’ i) * (t i − 1) * Val (x’ (n − i)))
else
Float_Fxp_Error n +
sum (1,L) (λi. a i * Float_Fxp_Error (n − i)) =
sum (0,SUC M) (λi. d’ i) + sum (1,M) (λj. p’ j) −
(sum (1,L) (λi. e’ i) + sum (2,L − 1) (λj. z’ j)) + k’ −
(sum (0,SUC M) (λi. Val (b’ i) * (t i − 1) * Val (x’ (n − i))) −
sum (1,L) (λi. Val (a’ i) * (f i − 1) * Val (y (n − i)))))
We proved these lemmas using the fundamental floating-point and fixed-point error analysis lemmas, based on the
error models presented in Section 3. The lemmas are proved by induction on the parameters L and M for the direct
form of realization. Similar analysis is performed in HOL for the parallel and cascade realization forms. For these
cases, we proved the corresponding lemmas by induction on the parameter K which is defined as the number of the
B. Akbarpour, S. Tahar / Journal of Applied Logic 5 (2007) 651–666 665internal sub-filters connected in parallel or cascade forms to generate the final output. A complete list of the derived
HOL definitions and theorems can be found in [1].
The HOL formalization and proofs are found to be in a good agreement with existing theoretical paper-and-pencil
counterparts [15]. However, we uncovered some minor errors in the hand proofs and located a few subtle corner cases
that were overlooked informally. Namely, in contrast to the related work, which neglects the roundoff error in fixed-
point addition, we considered (as depicted in Fig. 3 and derived in the corresponding formulas) the roundoff error in
both fixed-point addition and multiplication. Therefore, we confirmed and strengthened the related work. On top of
that, for the floating-point error analysis, we used the formalization in HOL of the IEEE-754 [9], a standard which had
not yet been established at the time of the above mentioned theoretical error analysis. This enabled us to cover a more
complete set of rounding and overflow modes and degenerate cases which were not discussed in earlier theoretical
work.
As explained in Section 1 and illustrated in Fig. 1, our goal in this paper is to propose a complete framework
for error analysis of digital filters in transition from the ideal real specification down to the floating-point and fixed-
point algorithmic level implementations. We considered the real domain as the golden reference and compared it
with the real values of the floating-point and fixed-point domains, and defined three different errors. To make this
analysis feasible, we applied the general floating-point error analysis method proposed in [15] to both floating-point
and fixed-point error analysis. Therefore, all derivations in this paper on the error analysis of real to fixed-point, and
floating-point to fixed-point transitions are new and done for the first time. The definitions in Eqs. (10), (11), (13), (15),
(17), and (18) are new. The key error flowgraphs in Figs. 3, 4, and 5, are the extended versions of the corresponding
flowgraphs in reference [15], to cover both floating-point and fixed-point analyses, simultaneously. We extended the
definition of error flowgraphs to cover both multiplicative and additive behaviours for floating-point and fixed-point
roundoff errors, respectively. Also, the discussion on Section 4.1 are adopted to cover the fixed-point error analysis.
Almost all the related work on error analysis of fixed-point digital filters were oriented towards statistical analysis and
none of them had performed an analysis based on the accumulation of error as developed here. This enabled us to
establish a framework for comparison between different domains.
5. Conclusions
In this paper, we described a comprehensive methodology for the error analysis of generic digital filters using
the HOL theorem prover. The proposed approach covers the three canonical forms (direct, parallel and cascade)
of realization entirely specified in HOL. We made use of existing theories in HOL on real, IEEE standard based
floating-point, and fixed-point arithmetic to model the ideal filter specification and the corresponding implementations
in higher-order logic. We used valuation functions to define the errors as the differences between the real values
of the floating-point and fixed-point filter implementation outputs and the corresponding output of the ideal real
filter specification. Finally, we established fundamental analysis lemmas as our model to derive expressions for the
accumulation of the roundoff error in digital filters. Related work did exist since the late sixties using theoretical
paper-and-pencil proofs and simulation techniques. We believe this is the first time a complete formal framework
is considered using mechanical proofs in HOL for the error analysis of digital filters. As a future work, we plan to
extend these lemmas to analyse the worst-case, average, and variance errors. We also plan to extend the verification to
the lower levels of abstraction, and prove that the implementation of a digital filter at the register transfer and netlist
gate levels implies the corresponding fixed-point specification using classical hierarchical verification in HOL, hence
bridging the gap between the hardware implementation and high levels of the mathematical specification. Finally, we
plan to link HOL with computer algebra systems to create a sound, reliable, and powerful system for the verification of
DSP systems. This opens new avenues in using formal methods for the verification of DSP systems as a complement
to the traditional theoretical (analytical) and simulation techniques.
References
[1] B. Akbarpour, Modeling and verification of DSP designs in HOL, Ph.D. Thesis, Concordia University, Department of Electrical and Computer
Engineering, Montreal, QC, Canada, March 2005.
[2] B. Akbarpour, S. Tahar, A. Dekdouk, Formalization of fixed-point arithmetic in HOL, Formal Methods in System Design 27 (1/2) (2005)
173–200.
[3] Cadence Design Systems, Inc., Signal Processing WorkSystem (SPW) User’s Guide, USA, July 1999.
666 B. Akbarpour, S. Tahar / Journal of Applied Logic 5 (2007) 651–666[4] G. Forsythe, C.B. Moler, Computer Solution of Linear Algebraic Systems, Prentice-Hall, 1967.
[5] M.J.C. Gordon, T.F. Melham, Introduction to HOL: A Theorem Proving Environment for Higher-Order Logic, Cambridge University Press,
1993.
[6] B. Gold, C. M. Radar, Effects of quantization noise in digital filters, in: Proceedings AFIPS Spring Joint Computer Conference, vol. 28, 1966,
pp. 213–219.
[7] J.R. Harrison, Constructing the real numbers in HOL, Formal Methods in System Design 5 (1/2) (1994) 35–59.
[8] J.R. Harrison, A machine-checked theory of floating-point arithmetic, in: Theorem Proving in Higher Order Logics, in: Lecture Notes in
Computer Science, vol. 1690, Springer-Verlag, 1999, pp. 113–130.
[9] J.R. Harrison, Floating-point verification in HOL Light: The exponential function, Formal Methods in System Design 16 (3) (2000) 271–305.
[10] J.R. Harrison, L. Théry, A skeptic’s approach to combining HOL and Maple, Journal of Automated Reasoning 21 (1998) 279–294.
[11] M. Huhn, K. Schneider, T. Kropf, G. Logothetis, Verifying imprecisely working arithmetic circuits, in: Proceedings Design Automation and
Test in Europe Conference, March 1999, pp. 65–69.
[12] L.B. Jackson, Roundoff-noise analysis for fixed-point digital filters realized in cascade or parallel form, IEEE Transactions on Audio and
Electroacoustics AU-18 (June 1970) 107–122.
[13] J.F. Kaiser, Digital filters, in: F.F. Kuo, J.F. Kaiser (Eds.), System Analysis by Digital Computer, Wiley, 1966, pp. 218–285.
[14] J.B. Knowles, R. Edwards, Effects of a finite-word-length computer in a sampled-data feedback system, IEE Proceedings 112 (June 1965)
1197–1207.
[15] B. Liu, T. Kaneko, Error analysis of digital filters realized with floating-point arithmetic, Proceedings of the IEEE 57 (October 1969) 1735–
1747.
[16] Mathworks, Inc., Simulink Reference Manual, USA, 1996.
[17] A.V. Oppenheim, C.J. Weinstein, Effects of finite register length in digital filtering and the fast Fourier transform, Proceedings of the
IEEE 60 (8) (August 1972) 957–976.
[18] A.V. Oppenheim, R.W. Schafer, Discrete-Time Signal Processing, Prentice-Hall, 1989.
[19] I.W. Sandberg, Floating-point-roundoff accumulation in digital filter realization, The Bell System Technical Journal 46 (October 1967) 1775–
1791.
[20] Synopsys, Inc., CoCentricTM System Studio User’s Guide, USA, August 2001.
[21] C. Weinstein, A.V. Oppenheim, A comparison of roundoff noise in floating point and fixed point digital filter realizations, Proceedings of the
IEEE (Proceedings Letters) 57 (June 1969) 1181–1183.
[22] H. Keding, M. Willems, M. Coors, H. Meyr, FRIDGE: A fixed-point design and simulation environment, in: Proceedings Design Automation
and Test in Europe Conference, February 1998, pp. 429–435.
[23] J.H. Wilkinson, Rounding Errors in Algebraic Processes, Prentice-Hall, 1963.
