Dual FiXed-point (DFX) is a new data representation which is an efficient compromise between fixed-point and floating-point representations. DFX has an implementation complexity similar to that of a fixed-point system, with the improved dynamic range capability of a floating-point system. Automating the process of DFX scaling optimisation requires knowledge of its truncation/rounding noise properties. This paper presents truncation and rounding error models for DFX arithmetic, as traditional error models do not apply to DFX. The models were tested on a 159-tap FIR filter, and the benefits of using DFX over floating-point are demonstrated with implementations on a Xilinx Virtex II Pro.
INTRODUCTION
FPGAs have long been an attractive alternative to digital signal processors in DSP applications, provided that floating-point is not required. In recent years, the size of FPGA devices has increased to such an extent that floating-point implementations have become possible [1, 2]. Since floating-point designs are considerably larger and slower than their fixed-point counterparts, their use is only justified where a very large dynamic range is required.
In [3] the authors introduced a new representation known as Dual FiXed-point (DFX). That work showed that, for the same chip area, DFX designs outperform floating-point designs due to their reduced complexity while having a similar dynamic range. To obtain an efficient DFX implementation while satisfying the computational accuracy constraints imposed by the designer, design automation tools that automatically determine the optimum parameters in a DFX design need to be developed. One prerequisite to the development of such tools is an accurate truncation error model for DFX arithmetic modules. Unfortunately, traditional techniques such as the additive roundoff error model for fixed-point [4] and the relative roundoff error model for floating-point [5] cannot be utilised because DFX is a scaled number representation and is not normalised.
This paper reports our latest work on deriving an accurate error model for DFX arithmetic modules. The original contributions of this paper are: 1) detailed implementations of DFX arithmetic modules with rounding are examined and the sources of truncation errors identified; 2) analytical models for the errors introduced by DFX arithmetic modules are developed; 3) these error models are verified against simulation results and are shown to be accurate; 4) the models are applied to a 159-tap FIR filter, and the size, speed and noise performance of the DFX implementations are compared against those using floating-point.
The paper is organised as follows. Section 2 presents the background and definition of DFX. The basic arithmetic modules using DFX and the sources of truncation errors are described in Section 3. Section 4 presents the error models of each individual arithmetic module. A case study showing the benefits of DFX on a FIR filter and a test of the error models are given in Section 5. Section 6 concludes the paper and presents suggestions for future work.
DUAL FIXED-POINT: BACKGROUND AND DEFINITION

[Figure 1: a DFX word comprising a 1-bit exponent E and an (n − 1)-bit signed significand X.]
Fig. 1. DFX Format
An n-bit Dual FiXed-point (DFX) number consists of an exponent bit E and n − 1 bits of a signed significand X, as shown in Figure 1. The exponent selects between two scalings for the significand X, giving two possible ranges for the number. The lower number range is referred to as Num0 while the higher number range is referred to as Num1. To achieve two different scalings, Num0 has $p_0$ fractional bits and Num1 has $p_1$. The DFX format notation used is written as $n\_p_0\_p_1$.
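The value of a DFX number, D, is given by

$$D = \begin{cases} X \cdot 2^{-p_0} & \text{if } E = 0 \\ X \cdot 2^{-p_1} & \text{if } E = 1 \end{cases} \qquad (1)$$

A boundary value, B, is needed to decide the best scaling to use and hence the value of E. E is determined as follows,

$$E = \begin{cases} 0 & \text{if } -B \le D < B \\ 1 & \text{otherwise} \end{cases} \qquad (2)$$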
In order to simplify the design of the arithmetic units, the boundary value is defined as the next incremental value after the maximum positive number of Num0, i.e. $B = 2^{n-p_0-2}$ (−2 because of the exponent and sign bits). The range and precision of Num0 and Num1 are illustrated in Figure 2.
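The definition can be illustrated with a minimal Python sketch that encodes a real value into an (E, X) pair and decodes it again. It assumes that X is held as a signed integer scaled by $2^{-p_0}$ or $2^{-p_1}$, that E = 0 exactly when −B ≤ D < B, and that quantisation is plain truncation. The names `DFXFormat`, `dfx_encode` and `dfx_decode` are hypothetical, and the format n = 16, p0 = 9, p1 = 4 is chosen only for illustration; none of this is taken from the VHDL modules.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DFXFormat:
    """Hypothetical container for a DFX format n_p0_p1."""
    n: int   # total word-length in bits (1 exponent bit + n-1 significand bits)
    p0: int  # fractional bits of the Num0 range
    p1: int  # fractional bits of the Num1 range

    @property
    def boundary(self) -> float:
        # B = 2^(n - p0 - 2): first value above the largest positive Num0.
        return 2.0 ** (self.n - self.p0 - 2)


def dfx_encode(value: float, fmt: DFXFormat) -> tuple[int, int]:
    """Return (E, X) for value, truncating towards minus infinity (assumption)."""
    b = fmt.boundary
    e = 0 if -b <= value < b else 1
    p = fmt.p0 if e == 0 else fmt.p1
    x = math.floor(value * 2 ** p)
    # The signed significand has n-1 bits, so X must fit in that range.
    lo, hi = -(1 << (fmt.n - 2)), (1 << (fmt.n - 2)) - 1
    if not lo <= x <= hi:
        raise OverflowError(f"{value} does not fit in the Num1 range")
    return e, x


def dfx_decode(e: int, x: int, fmt: DFXFormat) -> float:
    """Return the real value represented by the pair (E, X)."""
    p = fmt.p0 if e == 0 else fmt.p1
    return x * 2.0 ** -p


fmt = DFXFormat(n=16, p0=9, p1=4)
e, x = dfx_encode(100.3, fmt)
print(e, x, dfx_decode(e, x, fmt))
```

With this illustrative format the boundary is B = 32, so 100.3 is stored as a Num1 with a resolution of $2^{-4}$, whereas values inside [−32, 32) keep the finer $2^{-9}$ resolution of Num0.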
DFX ARITHMETIC MODULES
The basic arithmetic modules for DFX have been designed in VHDL. Unlike the modules introduced in [3], the modules presented here perform rounding instead of truncation.
DFX Adder
The DFX Adder module (Figure 3) adds together two DFX numbers. As in a floating-point adder, DFX inputs may need aligning before addition. Unlike floating-point, however, the number of bits shifted is known a priori. This means that only multiplexers, rather than barrel shifters, are needed to perform the necessary scaling. As a result, the DFX Adder is both smaller and faster than an equivalent floating-point adder. Note that ">>" and "<<" are the right and left shift operators respectively, which require only wire routing, and "mod $2^{n-1}$" simply extracts the least significant (n − 1) bits.
Whenever the input exponents differ, i.e. one input is a Num0 and the other a Num1, the Num0 input is shifted up to the Num1 range. When both exponents are the same, no shifting takes place. gBits either retains the bits truncated by this shift or carries all zeros. The retained bits are used to reduce the Adder's output truncation error (Section 4.4). Rounding is done after all the scaling is complete, and two factors determine the rounding decision. Firstly, if either input is shifted right and the Sum is not shifted, the rounding decision depends on the rounding bit of the shifted input. Secondly, if the Sum is shifted right, the rounding decision depends on the Sum's rounding bit.
DFX-H Multiplier
A DFX-Half (DFX-H) Multiplier ( Figure 4 ) takes one DFX input and a constant fixed-point multiplier. This is particularly useful in applications such as filtering where one of the operands is a constant. Unlike the DFX Adder, the inputs to the multiplier do not need aligning. However, the product P needs to be properly scaled and converted back to DFX.
Consider the multiplication of a DFX $n\_p_0\_p_1$ number with a fixed-point $n_m\_p_m$ constant ($n_m$ is the word-length of the multiplier m and $p_m$ is its fractional length). The product of the multiplication, Prod, would be in the format DFX 
It is then converted back to a DFX $n\_p_0\_p_1$ formatted number by the Rescaler Block (Figure 4(b)). Table 5 shows all possible input and output combinations with their respective output truncations and the output shifts required. The truncation $p_a \to p_b$ represents converting from a number with the binary point $p_a$ to $p_b$. When the multiplier |m| > 1, Case 4 will never happen. On the other hand, Case 3 will never happen when |m| < 1.
The decision to round depends on the output scaling and its rounding bit, provided that the Num0 number does not overflow. Unlike ordinary truncation, rounding may cause an overflow due to the non-symmetrical nature of 2's complement representation. Overflow of the Num0 range is the most critical case, as we need to guarantee that there is no overflow within the whole number range.
DFX Encoder and Decoder
In order to utilize this number system, a method is needed to convert a number from a known type to DFX and vice-versa. 
Arithmetic Module Comparisons
For completeness, the DFX modules (rounded and truncated) are compared with equivalent floating-point implementations [6]. All modules have a 16-bit word-length. The DFX modules are of the format 16_19_14 and the floating-point modules are of the format M7 E8 (7 mantissa bits and 8 exponent bits). The DFX modules with rounding are not much larger than their truncated counterparts. This is because the addition logic for rounding is absorbed into the multiplexer stage before it.
ERROR ANALYSIS OF DFX ARITHMETIC MODULES
The noise for each DFX arithmetic module is modelled as the addition of an error source at the output of the module. These errors are highly dependent on the distribution and correlation of the module's inputs, thus ordinary static error analysis [4] is not possible. Provided that we know the probability distribution function of an arithmetic module's inputs, we can estimate the output error. The distribution function is obtained by performing a single-pass profiling simulation, which is explained later. This paper focuses on the noise added by each DFX arithmetic module, therefore all inputs to the modules are assumed to contain no errors.
Background
Errors are introduced into a system whenever truncation takes place. In the case of DFX modules, truncation occurs whenever there is a right shift in the data path. A two's complement signal with binary point $p_a$ truncated to $p_b$ will introduce an error with the mean and variance given by (3), which uses a discrete error distribution [7]. The equations are derived from the assumption that all combinations of the low-end truncated bits are equally likely, which holds true in practice if the signals have sufficient dynamic range over that bit-width. If rounding is performed instead of truncation, (3) still applies but the error mean becomes zero while the variance remains the same.
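The equally-likely assumption can be turned into closed-form moments and checked by enumeration. The sketch below assumes that $p_a$ and $p_b$ count fractional bits with $p_a \ge p_b$, in which case the discrete uniform error has mean $-(2^{-p_b} - 2^{-p_a})/2$ and variance $(2^{-2p_b} - 2^{-2p_a})/12$. These expressions are derived here from the stated assumption and are not copied from (3); the function names are hypothetical.

```python
from fractions import Fraction


def truncation_moments(pa: int, pb: int) -> tuple[Fraction, Fraction]:
    """Closed-form error mean and variance for truncating pa -> pb fractional bits."""
    if pa < pb:
        raise ValueError("truncation requires pa >= pb")
    qa = Fraction(2) ** -pa  # weight of the LSB before truncation
    qb = Fraction(2) ** -pb  # weight of the LSB after truncation
    mean = -(qb - qa) / 2
    variance = (qb ** 2 - qa ** 2) / 12
    return mean, variance


def brute_force_moments(pa: int, pb: int) -> tuple[Fraction, Fraction]:
    """Enumerate every pattern of the discarded bits, each equally likely."""
    count = 2 ** (pa - pb)
    # Discarding the low bits with value k * 2^-pa gives an error of -k * 2^-pa.
    errors = [Fraction(-k) * Fraction(2) ** -pa for k in range(count)]
    mean = sum(errors) / count
    variance = sum((e - mean) ** 2 for e in errors) / count
    return mean, variance


print(truncation_moments(9, 4))
print(brute_force_moments(9, 4))
```

Exact rational arithmetic is used so that the closed form and the enumeration can be compared for equality rather than to within a tolerance.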
Since DFX has dual precision, more than one truncation/rounding error may occur within each arithmetic module. Let T be the set of all the possible truncations/roundings that may occur and let i ∈ T. For every truncation/rounding i, there is a corresponding error mean, $\mu_i$, error variance, $\sigma_i^2$, and probability of truncation occurring, $P_i$. From the profiling simulation, we can determine the probability of all the sources of truncation within each module. Therefore, the output error mean and error variance are given by (4). Again, if rounding is performed, the error mean will be zero and the variance is calculated with the zero error means.
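One way to combine such sources is to treat the module error as a mixture: source i is active with probability $P_i$, and with the remaining probability no truncation occurs and the error is zero. The sketch below uses the standard mixture moments, $\mu = \sum_i P_i \mu_i$ and $\sigma^2 = \sum_i P_i(\sigma_i^2 + \mu_i^2) - \mu^2$. This form is an assumption made for illustration and is not copied from (4); the function name is hypothetical.

```python
def combine_error_sources(sources: list[tuple[float, float, float]]) -> tuple[float, float]:
    """Combine mutually exclusive error sources given as (P_i, mu_i, var_i).

    Any probability not covered by the sources is treated as "no truncation",
    which contributes an error of exactly zero.
    """
    total_probability = sum(p for p, _, _ in sources)
    if total_probability > 1.0 + 1e-12:
        raise ValueError("probabilities of exclusive sources cannot exceed 1")
    mean = sum(p * mu for p, mu, _ in sources)
    second_moment = sum(p * (var + mu * mu) for p, mu, var in sources)
    return mean, second_moment - mean * mean


# Truncation: a single source that is active half of the time.
print(combine_error_sources([(0.5, -1.0, 0.0)]))
# Rounding: all means are zero, so the variance is the weighted sum of variances.
print(combine_error_sources([(0.25, 0.0, 4.0), (0.5, 0.0, 2.0)]))
```

With zero means the second term vanishes, which matches the statement that the rounding variance is calculated with the zero error means.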
Profiling Simulation
In the system context, the errors of each DFX arithmetic module are highly dependent on the correlation between the input signals. The purpose of the profiling simulation is to obtain the probability distribution function, or joint probability distribution function, of the inputs to each arithmetic module within the system. This is done by feeding a set of typical representative data into the system for a single-pass simulation. While the simulation is running, information regarding the magnitude, sign and correlation of the inputs is gathered for each module. With this information, the probability distribution function (PDF) can be obtained for modules with a single input, or the joint probability distribution function for dual-input modules. The information gathered by this single-pass simulation, together with the following error models, is sufficient to estimate the errors of any DFX format required.
Error: DFX Encoder Module
This module performs two forms of quantisation depending on its input. If the output is a Num0, the output is truncated by $T_{E0}$, and if the output is a Num1, the output is truncated by $T_{E1}$. The quantisation means and variances of $T_{E0}$ and $T_{E1}$ are shown in Table 3. From the profiling simulation, the PDF of the input can be obtained as shown in Figure 6. The probability $P_{TE0}$ is the integral of the PDF over the region where the input is a Num0. Likewise, the probability $P_{TE1}$ is the integral of the PDF over the region where the input is a Num1. Therefore, using (4), the modelled error mean and error variance are given by (5). If rounding is performed, the error mean is zero and the error variance is calculated as for ordinary truncation but with zero error mean.
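As a rough illustration of how the two probabilities might be extracted from profiling data, the sketch below keeps the profiled samples sorted and counts the fraction that falls inside [−B, B) for a candidate format. It uses the empirical distribution in place of an integrated PDF, and the class name `InputProfile` and its methods are hypothetical. Because the samples are stored once, different values of n and $p_0$ can be evaluated without repeating the simulation.

```python
import bisect


class InputProfile:
    """Hypothetical single-pass profile of one module input."""

    def __init__(self, samples: list[float]) -> None:
        if not samples:
            raise ValueError("at least one sample is required")
        self._sorted = sorted(samples)

    def prob_num0(self, n: int, p0: int) -> float:
        """Fraction of samples in [-B, B), with B = 2^(n - p0 - 2)."""
        boundary = 2.0 ** (n - p0 - 2)
        # bisect_left returns the index of the first sample >= its argument,
        # so the difference counts samples with -B <= value < B.
        lo = bisect.bisect_left(self._sorted, -boundary)
        hi = bisect.bisect_left(self._sorted, boundary)
        return (hi - lo) / len(self._sorted)

    def prob_num1(self, n: int, p0: int) -> float:
        """Fraction of samples outside the Num0 range."""
        return 1.0 - self.prob_num0(n, p0)


profile = InputProfile([-40.0, -32.0, -1.0, 0.0, 5.0, 31.9, 32.0, 100.0])
print(profile.prob_num0(16, 9), profile.prob_num1(16, 9))  # B = 32
print(profile.prob_num0(16, 8))                            # B = 64
```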
Error: DFX Adder Module
The analysis of the error model for this module begins with analysing all possible input combinations. Table 4 shows that truncation happens in only 2 out of 6 cases. Ideally, if the inputs were independent of each other, obtaining the probability distribution of each input individually would be sufficient. In practice, however, input signals have some degree of correlation between them. A joint probability distribution table (Figure 7) of the inputs is therefore obtained via the profiling simulation. The table can be viewed as a graph with the x-axis for the input X and the y-axis for the input Y. Boundaries marked on the table separate the regions where the inputs are a Num0 and a Num1. The shaded area denotes the region where the adder's result is truncated.
Fig. 7. DFX Adder input joint probability distribution table
Provided that the DFX Adder's inputs and output have the same DFX format, the truncations within the DFX Adder will have the same error mean and error variance, as given by (6). The probability of truncation occurring, $P_{TA}$, is the integral of the shaded area of the joint probability distribution table (Figure 7).
Therefore, the DFX Adder's injected output error mean and error variance are given by (7). Again, if rounding is performed, the error mean is zero and the error variance is calculated with zero error mean.
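The probability of truncation can be estimated directly from paired samples of the two adder inputs, which preserves any correlation between them. The sketch below rests on an assumption that is inferred from the adder description rather than taken from Table 4: truncation is taken to occur when at least one input is a Num0 and the sum falls in the Num1 range, so that Num0 bits are shifted out. The function names are hypothetical.

```python
def in_num0(value: float, boundary: float) -> bool:
    """True when value lies in the Num0 range [-B, B)."""
    return -boundary <= value < boundary


def adder_truncation_probability(xs: list[float], ys: list[float],
                                 boundary: float) -> float:
    """Estimate the truncation probability from paired input samples.

    Assumption (not from Table 4): a truncation is counted when at least
    one input is a Num0 and the sum x + y is a Num1.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    hits = 0
    for x, y in zip(xs, ys):
        any_num0_input = in_num0(x, boundary) or in_num0(y, boundary)
        sum_is_num1 = not in_num0(x + y, boundary)
        if any_num0_input and sum_is_num1:
            hits += 1
    return hits / len(xs)


xs = [1.0, 20.0, 10.0, 100.0, 100.0, 10.0]
ys = [2.0, 20.0, 100.0, 200.0, -90.0, -50.0]
print(adder_truncation_probability(xs, ys, boundary=32.0))
```

Counting paired samples is equivalent to summing the cells of a joint histogram over the truncating region, so no independence assumption between X and Y is needed.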
Error: DFX-Half Multiplier (DFX-H Multiplier)
As mentioned earlier in Section 3.2, the DFX-H Multiplier multiplies a DFX number, X, with a fixed-point constant multiplier, m, with the format $n_m\_p_m$. Table 5 shows all the possible truncation error means and variances.
When |m| > 1, the output product will never be a Num0 if the input is a Num1, which means Case 4 will never occur. The probabilities for each case can be found from Figure 8(a) through integration. Therefore, using (4), the output truncation error mean and variance when |m| > 1 are given by (8).
However, when |m| < 1, the output product will never be a Num1 if the input is a Num0, which means Case 3 never happens. The probabilities for each case can be found from Figure 8(b) through integration. Once again, the output truncation error mean and variance when |m| < 1 are given by (9). As before, if rounding is performed, the error mean is zero and the error variance is calculated with zero error mean.
Error Model Evaluation
To verify the error models, audio samples were used as the input. The error is the difference between the estimated output and the actual (double precision) output. Tables 6 and 7 show that, with two different DFX formats, the truncation and rounding error models are capable of providing error estimates that are within ±3% of the actual error.
CASE STUDY
159-tap transposed direct-form FIR filters were implemented in both DFX and floating-point on a Xilinx Virtex II Pro XC2VP70-6ff1517. The DFX designs were built using the rounding modules described in Section 3, while the floating-point designs were built using the floating-point library of [6]. Table 8 shows the size, latency and signal-to-noise ratio (SNR) performance of these filters for different word-lengths; the best scaling was used for each word-length. SNR is the ratio of the desired output power to the noise power. D1-3 are DFX designs and P1-3 are floating-point designs.
For designs with the same word-length, the DFX designs are about 5 times smaller and 3.1 times faster than the equivalent floating-point designs. Furthermore, the SNR of the floating-point designs is about 7 dB lower than that of the DFX designs. The SNR of the DFX designs was predicted using the error models described in the previous section. Provided that all the cross-correlations between the individual injected errors can be accounted for, the predicted SNR is found to be within 3% of the actual SNR, as shown in Table 8.
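The SNR definition used here can be sketched in a few lines of Python. The first function measures SNR from a reference output and a quantised output; the second predicts it from a modelled error mean and variance, assuming that the total noise power is the variance plus the squared mean. That assumption treats the combined output error as a single source and leaves the handling of cross-correlations to whoever supplies the combined moments. Both function names are hypothetical.

```python
import math


def snr_db(reference: list[float], actual: list[float]) -> float:
    """Measured SNR in dB: desired output power over noise power.

    Raises ZeroDivisionError if the two sequences are identical (no noise).
    """
    if len(reference) != len(actual) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    signal_power = sum(r * r for r in reference) / len(reference)
    noise_power = sum((a - r) ** 2 for r, a in zip(reference, actual)) / len(reference)
    return 10.0 * math.log10(signal_power / noise_power)


def predicted_snr_db(signal_power: float, error_mean: float,
                     error_variance: float) -> float:
    """Predicted SNR in dB, taking noise power as variance + mean^2 (assumption)."""
    noise_power = error_variance + error_mean ** 2
    return 10.0 * math.log10(signal_power / noise_power)


print(snr_db([1.0, -1.0, 1.0, -1.0], [1.5, -1.0, 1.0, -1.0]))
print(predicted_snr_db(signal_power=1.0, error_mean=0.0, error_variance=0.01))
```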
CONCLUSION
As a prerequisite to automating the design process of DSP systems using our new DFX data representation, an accurate and reliable error model of DFX arithmetic modules has been developed. A single profiling simulation run is all that is required to obtain the probability distribution tables necessary to perform error estimation. In a system context, the characteristics of the output error can be found provided that the cross-correlations between the errors injected by each arithmetic module are known. Future work will include the exploration of multiple word-length designs using DFX and the optimisation of DFX designs for area, accuracy and speed.
