An efficient scheme is presented for implementing the sign LMS algorithm in block floating point (BFP) format, which permits processing of data over a wide dynamic range at a processor complexity and cost as low as that of a fixed point processor. The proposed scheme adopts appropriate formats for representing the filter coefficients and the data. It also employs a scaled representation for the step size, with a time-varying mantissa and a time-varying exponent. Using these and an upper bound on the step size mantissa, update relations for the filter weight mantissas and exponent are developed, taking care that no overflow occurs and that quantities which are already very small are not multiplied directly. A separate update relation is also worked out for the step size mantissa. The proposed scheme employs mostly fixed-point-based operations, and thus achieves considerable speedup over its floating-point-based counterpart.
INTRODUCTION
Sufficient signal-to-quantization noise ratio over a large dynamic range is a desirable feature of modern-day digital signal processing systems. While the floating point (FP) data format is ideally suited to achieve this due to its normalized data representation, the accompanying high processing cost restricts its usage in many applications. This is especially true for resource-constrained contexts like battery-operated, low-power devices, where custom implementations on FPGA/ASIC are the primary mode of realization. In such contexts, the block floating point (BFP) format provides a viable alternative to the FP scheme. In BFP, a common exponent is assigned to a group of variables. As a result, computations involving these variables can be carried out in a simple fixed point (FxP) manner, while the presence of the exponent provides an FP-like high dynamic range.
Over the years, the BFP format has been used by several researchers for efficient realization of many signal processing systems and algorithms. These include various forms of fixed coefficient digital filters (see [1-6]), adaptive filters (see [7, 8]), and unitary transforms (see [9-11]) on one hand, and several audio data transmission standards like NICAM (stereophonic sound system for the PAL TV standard), the audio part of MUSE (Japanese HDTV standard), and DSR (German digital satellite radio system) on the other. Of the various systems studied, adaptive filters pose special challenges to their implementation in BFP arithmetic. This is mainly because (i) unlike in a fixed coefficient filter, the filter coefficients in an adaptive filter cannot be represented in the simpler fixed point form, as the coefficients in effect evolve from the data via a time update relation;
(ii) the two principal operations in an adaptive filter, namely, filtering and weight updating, are mutually coupled, thus requiring an appropriate arrangement for joint prevention of overflow.
Recently, a BFP-based approach has been proposed for efficient realization of LMS-based transversal adaptive filters [7], which was later extended to the normalized LMS algorithm [8] and the gradient adaptive lattice [12]. In this paper, we extend the philosophy used in [7] to a BFP realization of the sign LMS algorithm [13]. The sign LMS algorithm forms a popular class of adaptive filters within the LMS family, which considers mainly the sign of the gradient in the weight update process and thus does not require multipliers in the weight update loop. The proposed scheme adopts an appropriate BFP format for the filter coefficients which remains invariant as the coefficients are updated in time. Using this and the BFP representation of the data as used in [7], separate time update relations for the filter weight mantissas and the exponent are developed. Unlike [7], however, the proposed scheme requires a scaled representation for the step size, with a time-varying mantissa and a time-varying exponent. A separate time update relation for the step size mantissa is worked out. It is also shown that in order to maintain an overflow-free condition, the step size mantissa must, at all times, remain bounded by an upper limit, which is ensured by setting its initial value appropriately. Again, the weight update relation of the sign LMS algorithm differs from that of the LMS algorithm, and thus new steps are needed for the computation of the update term, taking care that no overflow occurs and that quantities which are already very small are not multiplied directly. As expected, the proposed scheme employs mostly FxP-based operations and thus achieves considerable speedup over its FP-based counterpart, which is verified both by detailed complexity analysis and by the synthesis report of an FPGA-based realization.
The organization of the paper is as follows: in Section 2, we discuss the BFP arithmetic and present a new block formatting algorithm for FP as well as FxP data. Section 3 presents the proposed BFP realization of the sign LMS algorithm. Complexity issues vis-à-vis an FP-based realization are discussed in Section 4, while finite-precision simulation results as well as the FPGA synthesis summary are presented in Section 5. Throughout the paper, variables with an overbar denote mantissas, and boldfaced lowercase letters denote vectors.
THE BFP ARITHMETIC AND A BLOCK-FORMATTING ALGORITHM
The BFP representation can be considered as a special case of the FP format, where every nonoverlapping block of N incoming data samples has a joint scaling factor corresponding to the data sample with the highest magnitude in the block. In other words, given a block $[x_0, \ldots, x_{N-1}]$, each element is represented as $x_l = \bar{x}_l 2^{\gamma}$, $l = 0, \ldots, N-1$, where the block exponent $\gamma$ is defined as
$$\gamma = \lfloor \log_2 \mathrm{Max} \rfloor + 1 + S,$$
where $\mathrm{Max} = \max(|x_0|, \ldots, |x_{N-1}|)$, $\lfloor \cdot \rfloor$ is the so-called floor function, meaning rounding down to the closest integer, and the integer S is a scaling factor, used for preventing overflow during the filtering operation.
In practice, if the data is given in an FP format, that is, if $x_l = M_l 2^{e_l}$, $l = 0, 1, \ldots, N-1$, with $|M_l| < 1$, and the 2's complement system is used, the above block formatting may be carried out by Algorithm 1.
Algorithm 1 (Block-formatting algorithm). First, count the number, say $n_l$, of binary 0's (if $x_l$ is positive) or binary 1's (if $x_l$ is negative) between the binary point of $M_l$ and the first binary 1 or binary 0 from the left, respectively. Compute $e_{\max} = \max\{(e_l - n_l) \mid l = 0, 1, \ldots, N-1\}$. Shift each $M_l$ right or left by $(e_{\max} + S - e_l)$ bits depending on whether $(e_{\max} + S - e_l)$ is positive or negative, respectively. Take the block exponent as $e_{\max} + S$.
Note. For cases where $x_l$ is negative with $M_l$ having only binary 0's after the first $n_l$ bits from the binary point, $n_l$ should be replaced by $n_l - 1$ in the above computation.
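To make Algorithm 1 concrete, the following Python sketch block-formats a set of FP numbers. It is a minimal illustration rather than the paper's implementation: the function and variable names are ours, Python floats stand in for 2's complement registers, and the special case covered by the note above is omitted for brevity.

```python
def sign_extension_bits(m, frac_bits=11):
    """Count binary 0's (m >= 0) or binary 1's (m < 0) between the
    binary point of mantissa m (|m| < 1) and the first differing bit,
    as required by Algorithm 1. The corner case of the note above
    (negative m with all zeros after the first n_l bits) is omitted."""
    q = int(round(m * (1 << frac_bits)))   # 2's complement integer view
    n = 0
    for k in range(1, frac_bits + 1):
        bit = (q >> (frac_bits - k)) & 1
        if bit == (0 if m >= 0 else 1):
            n += 1
        else:
            break
    return n

def block_format(mantissas, exponents, S, frac_bits=11):
    """Return (block mantissas, block exponent) for one block of FP data."""
    e_max = max(e - sign_extension_bits(m, frac_bits)
                for m, e in zip(mantissas, exponents))
    gamma = e_max + S                       # common block exponent
    # shift each mantissa by (gamma - e_l) bit positions
    return [m * 2.0 ** (e - gamma) for m, e in zip(mantissas, exponents)], gamma
```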
When the data is given in FxP format, the corresponding block formatting turns out to be a special case of the above, for which $x_l \equiv M_l$, $e_l = 0$, and $e_{\max}$ is given by $-\min\{n_l \mid l = 0, 1, \ldots, N-1\}$. Note that due to the presence of S, the range of each mantissa is given as $0 \le |\bar{x}_l| < 2^{-S}$. The scaling factor S can be calculated from the inner product computation representing the filtering operation [3]. An inner product is calculated in BFP arithmetic as
$$y(n) = \mathbf{w}^t \mathbf{x}(n) = \left(\mathbf{w}^t \bar{\mathbf{x}}(n)\right) 2^{\gamma} = \bar{y}(n) 2^{\gamma},$$
where $\mathbf{w}$ is a length-L, fixed point filter coefficient vector, and $\mathbf{x}(n)$ is the data vector at the nth index, represented in the aforesaid BFP format. For no overflow in $\bar{y}(n)$, we need $S \ge \lceil \log_2 (\sum_{k=0}^{L-1} |w_k|) \rceil$ in order to have $|\bar{y}(n)| < 1$ satisfied, where $\lceil \cdot \rceil$ denotes the so-called ceiling function, meaning rounding up to the closest integer.
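As a small worked illustration of this scaling rule, the snippet below computes S for the 3-tap coefficient set used later in Section 5 and performs the BFP inner product. The names are illustrative, and Python floats stand in for FxP mantissas; this is a sketch, not the paper's implementation.

```python
import math

# Scaling factor per the rule above: S >= ceil(log2 sum|w_k|)
# guarantees |y_bar(n)| < 1 for the mantissa inner product.
w = [0.7, 0.65, 0.25]                               # coefficients from Section 5
S = math.ceil(math.log2(sum(abs(c) for c in w)))    # = 1 here

def bfp_inner_product(w, x_bar, gamma):
    """Filtering in BFP: FxP-like MACs on mantissas plus a single
    exponent addition; y(n) = y_bar * 2**gamma."""
    y_bar = sum(wk * xk for wk, xk in zip(w, x_bar))
    return y_bar, gamma
```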
THE PROPOSED IMPLEMENTATION
Consider a length-L sign LMS based adaptive filter [13] that takes an input sequence x(n) and updates the weights as
$$\mathbf{w}(n+1) = \mathbf{w}(n) + \mu \, \mathrm{sgn}\{e(n)\} \, \mathbf{x}(n), \tag{2}$$
where
$$e(n) = d(n) - y(n)$$
is the output error corresponding to the nth index. The sequence d(n) is the so-called desired response, available during the initial training period, and $y(n) = \mathbf{w}^t(n)\mathbf{x}(n)$ is the filter output at the nth index, with $\mu$ denoting the so-called step size parameter. The operator sgn{·} is the well-known signum function, which returns +1 or −1 depending on whether the operand is nonnegative or negative, respectively. The proposed scheme uses a scaled format to represent the filter coefficient vector $\mathbf{w}(n)$ as
$$\mathbf{w}(n) = \bar{\mathbf{w}}(n) 2^{\psi_n},$$
where $\bar{\mathbf{w}}(n)$ and $\psi_n$ are, respectively, the filter mantissa vector and the filter block exponent, which are updated separately over n. The chosen format thus normalizes all components of $\mathbf{w}(n)$ by a common factor $2^{\psi_n}$ at each index n. In our treatment, the exponent $\psi_n$ is a nondecreasing function of n with zero initial value and is chosen to ensure
$$|\bar{w}_k(n)| < \tfrac{1}{2}, \quad k \in Z_L.$$
If the data vector $\mathbf{x}(n)$ is given in the aforesaid BFP format as $\mathbf{x}(n) = \bar{\mathbf{x}}(n)2^{\gamma}$, where $\bar{\mathbf{x}}(n) = [\bar{x}(n), \bar{x}(n-1), \ldots, \bar{x}(n-L+1)]^t$ is the mantissa vector, $\gamma = ex + S$ is the block exponent, and S is an appropriate scaling factor, then the filter output y(n) can be expressed as $y(n) = \bar{y}(n)2^{\gamma+\psi_n}$, with $\bar{y}(n) = \bar{\mathbf{w}}^t(n)\bar{\mathbf{x}}(n)$ denoting the output mantissa. To prevent overflow in $\bar{y}(n)$, it is required that $|\bar{y}(n)| < 1$. In the proposed scheme, however, we restrict $\bar{y}(n)$ to lie between +1/2 and −1/2, that is, $|\bar{y}(n)| < 1/2$. The conditions $|\bar{w}_k(n)| < 1/2$ and $|\bar{y}(n)| < 1/2$ ensure no overflow during the updating of $\bar{\mathbf{w}}(n)$ and the computation of the output error mantissa, respectively, as shown later.
The proposed implementation
The proposed BFP realization consists of the following three stages.
(i) Buffering: here, the input sequence x(n) and the desired response d(n) are jointly partitioned into nonoverlapping blocks of length N each, with the ith block given by
$$\{x(n), d(n) \mid n \in Z_i\}, \quad Z_i = \{iN, iN+1, \ldots, iN+N-1\}.$$
For this, x(n) and d(n) are shifted into buffers of size N each. We take N ≥ L − 1, as otherwise the complexity of implementation would go up. The buffers are cleared and their contents transferred jointly to a block formatter once every N input clock cycles.
(ii) Block formatting: here, the data samples x(n) and d(n) which constitute the ith block, $i \in Z$, and which are available in either FP or FxP form, are block formatted as per the block formatting algorithm of Section 2, resulting in the BFP representation
$$x(n) = \bar{x}(n)2^{\gamma_i}, \quad d(n) = \bar{d}(n)2^{\gamma_i}, \quad n \in Z_i, \quad \gamma_i = ex_i + S_i,$$
where $ex_i$ is the exponent of the largest-magnitude sample in the block and $S_i$ is the scaling factor. The scaling factor $S_i$ is chosen to ensure that (i) $S_i \ge S_{\min}$, and (ii) $\bar{\mathbf{x}}(n)$ has a uniform BFP representation during the block-to-block transition phase as well, that is, when part of $\bar{\mathbf{x}}(n)$ comes from the ith block and part from the (i − 1)th block. This is realized by the following exponent assignment algorithm (see Algorithm 2).
Algorithm 2 (Exponent assignment algorithm). Assign $S_{\min} = \lceil \log_2 L \rceil$ as the scaling factor to the first block, and for any (i − 1)th block, assume a block exponent $\gamma_{i-1} = ex_{i-1} + S_{i-1}$. For the ith block, take the block exponent as $\gamma_i = \max(ex_i, ex_{i-1}) + S_{\min}$. Defining $\Delta\gamma_i = \gamma_i - \gamma_{i-1}$, we change $S_{i-1}$ to an effective scaling factor of $S'_{i-1} = S_{i-1} + \Delta\gamma_i$. This permits a BFP representation of the data vector $\bar{\mathbf{x}}(n)$ with common exponent $\gamma_i$ during the block-to-block transition phase as well.
In practice, such rescaling is effected by passing each of the delayed terms $\bar{x}(n-j)$, $j = 1, \ldots, L-1$, through a rescaling unit that applies $|\Delta\gamma_i|$ right or left shifts on $\bar{x}(n-j)$ depending on whether $\Delta\gamma_i$ is positive or negative, respectively. This is, however, done only at the beginning of each block, that is, at the indices $n = iN$, $i \in Z^+$. Also note that while $\Delta\gamma_i \ge 0$ when $ex_i \ge ex_{i-1}$, for $ex_i < ex_{i-1}$ we have $\Delta\gamma_i \le 0$, meaning that in the latter case the aforesaid mantissas from the (i − 1)th block are actually scaled up by $2^{-\Delta\gamma_i}$. It is, however, not difficult to see that the effective scaling factor $S'_{i-1}$ for the elements $\bar{x}(iN-L+1), \ldots, \bar{x}(iN-1)$ still remains lower bounded by $S_{\min}$, thus ensuring no overflow during the filtering operation.
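The sketch below captures this block-boundary bookkeeping under the simplifying assumption, consistent with the overflow analysis later in this section, that the new block exponent is $\gamma_i = \max(ex_i, ex_{i-1}) + S_{\min}$. Function names are ours, and floats stand in for shift operations on registers.

```python
def assign_block_exponent(ex_i, ex_prev, gamma_prev, S_min):
    """Block exponent for the ith block, consistent with the two cases
    used in the overflow analysis of this section:
    ex_i >= ex_prev -> gamma_i = ex_i + S_min,
    ex_i <  ex_prev -> gamma_i = ex_prev + S_min."""
    gamma_i = max(ex_i, ex_prev) + S_min
    return gamma_i, gamma_i - gamma_prev        # (gamma_i, delta_gamma_i)

def rescale_tail(tail_mantissas, delta_gamma):
    """Bring the last L-1 mantissas of the (i-1)th block to the new
    common exponent: multiply by 2**(-delta_gamma), i.e., right shifts
    if delta_gamma > 0 and left shifts otherwise."""
    return [m * 2.0 ** (-delta_gamma) for m in tail_mantissas]
```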
(iii) Filtering and weight updating: the block formatter inputs (a) $\bar{x}(n)$, $\bar{d}(n)$, $n \in Z_i$, and (b) the rescaled mantissas for $x(iN-k)$, $k = 1, 2, \ldots, L-1$, to the transversal filter, which computes $\bar{y}(n) = \bar{\mathbf{w}}^t(n)\bar{\mathbf{x}}(n)$ for all $n \in Z_i$. Since the data in (b), coming from the (i − 1)th block, are rescaled so as to have the same exponent $\gamma_i$, the above computation can be made faster via the overlap-save method, as sketched below. This employs an (N + L − 1)-point FFT on data frames formed by appending the data in (b) to the left of $[\bar{x}(iN), \ldots, \bar{x}(iN+N-1)]$ and discarding the first L − 1 outputs. Since the FFT is FxP-based, it requires much less computational complexity than an FP-based evaluation.
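A floating point model of this frame arrangement is sketched below. It treats the weight mantissas as fixed over the frame, whereas the actual scheme updates them per sample and carries out the transforms in FxP; function and variable names are ours.

```python
import numpy as np

def filter_block_overlap_save(w_bar, x_tail, x_block):
    """Output mantissas for one block via overlap-save. x_tail holds
    the L-1 rescaled mantissas of the previous block, x_block the N
    current mantissas; all share the common exponent gamma_i."""
    L, N = len(w_bar), len(x_block)
    nfft = N + L - 1
    frame = np.concatenate([x_tail, x_block])           # length N + L - 1
    Y = np.fft.rfft(frame, nfft) * np.fft.rfft(w_bar, nfft)
    y_bar = np.fft.irfft(Y, nfft)
    return y_bar[L - 1:]                                # discard first L-1 outputs
```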
Next, the output error e(n) is evaluated as $e(n) = \bar{e}(n)2^{\gamma_i+\psi_n}$, where the mantissa $\bar{e}(n)$ is given by
$$\bar{e}(n) = \bar{d}(n)2^{-\psi_n} - \bar{y}(n). \tag{5}$$
It is easy to see that $|\bar{e}(n)| < 1$, that is, the computation in (5) above does not produce any overflow, since
$$|\bar{e}(n)| \le |\bar{d}(n)|2^{-\psi_n} + |\bar{y}(n)| < 2^{-\psi_n}\frac{1}{L} + \frac{1}{2},$$
as $2^{-S_i} \le 1/L$. Except for $\psi_n = 0$, $L = 1$, the right-hand side is always less than or equal to 1.
For the above descriptions of $\bar{e}(n)$, $\bar{\mathbf{x}}(n)$, $\bar{\mathbf{w}}(n)$, and noting that $\mathrm{sgn}\{e(n)\} = \mathrm{sgn}\{\bar{e}(n)\}$, the weight update equation (2) can now be written as $\mathbf{w}(n+1) = \mathbf{v}(n)2^{\psi_n}$, where
$$\mathbf{v}(n) = \bar{\mathbf{w}}(n) + \bar{\mu}_n \, \mathrm{sgn}\{\bar{e}(n)\} \, \bar{\mathbf{x}}(n) 2^{\gamma_i}, \tag{6}$$
with $\bar{\mu}_n = \mu 2^{-\psi_n}$. In other words, the proposed scheme employs a scaled representation for $\mu$ as $\mu = \bar{\mu}_n 2^{\psi_n}$, with $\bar{\mu}_n$ updated from a knowledge of $\psi_n$ and $\psi_{n+1}$ as
$$\bar{\mu}_{n+1} = \bar{\mu}_n 2^{(\psi_n - \psi_{n+1})}. \tag{7}$$
As stated earlier, $\mathbf{w}(n+1)$ is required to satisfy $|\bar{w}_k(n+1)| < 1/2$ for all $k \in Z_L$, which can be realized in several ways. Our preferred option is to limit $\mathbf{v}(n)$ so that $|v_k(n)| < 1$ for all $k \in Z_L$. Then, if each $v_k(n)$ happens to lie within ±1/2, we make the assignments
$$\bar{\mathbf{w}}(n+1) = \mathbf{v}(n), \quad \psi_{n+1} = \psi_n. \tag{8}$$
Otherwise, we scale down $\mathbf{v}(n)$ by 2, in which case
$$\bar{\mathbf{w}}(n+1) = \frac{\mathbf{v}(n)}{2}, \quad \psi_{n+1} = \psi_n + 1. \tag{9}$$
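A minimal sketch of this normalization step, assuming Python floats in place of FxP registers and illustrative names, is given below; it also applies the step size mantissa update (7).

```python
def normalize_weights(v, psi_n, mu_bar_n):
    """Assignments (8)/(9): keep every |w_bar_k(n+1)| < 1/2 by halving
    v(n) and incrementing the weight exponent when needed, then update
    the step size mantissa as per (7)."""
    if all(abs(vk) < 0.5 for vk in v):
        w_bar_next, psi_next = list(v), psi_n                    # (8)
    else:
        w_bar_next, psi_next = [vk / 2.0 for vk in v], psi_n + 1  # (9)
    mu_bar_next = mu_bar_n * 2.0 ** (psi_n - psi_next)            # (7)
    return w_bar_next, psi_next, mu_bar_next
```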
In order to have $|v_k(n)| < 1$ for all $k \in Z_L$, note from (6) that
$$|v_k(n)| \le |\bar{w}_k(n)| + \bar{\mu}_n |\bar{x}(n-k)| 2^{\gamma_i} < \frac{1}{2} + \bar{\mu}_n 2^{ex_i},$$
since $|\bar{x}(n-k)| < 2^{-S_i}$ and $\gamma_i = ex_i + S_i$. It is thus sufficient to have
$$\bar{\mu}_n \le 2^{-(ex_i+1)}. \tag{11}$$
It is easy to verify that the above bound for $\bar{\mu}_n$ is valid not only when each element of $\bar{\mathbf{x}}(n)$ in (6) comes purely from the ith block, but also during the transition from the (i − 1)th to the ith block with $ex_i \ge ex_{i-1}$, for which, after the necessary rescaling, we have $S'_{i-1} \ge S_i = S_{\min}$, implying $|\bar{x}(n-k)| < 2^{-S_i}$. For $ex_i < ex_{i-1}$, however, the upper bound expression given by (11) gets modified with $ex_i$ replaced by $ex_{i-1}$, as in that case we have $\gamma_i = ex_{i-1} + S_{i-1}$ with $S_{i-1} = S_{\min} < S_i$, meaning $|\bar{x}(n-k)| < 2^{-S_{i-1}}$. From the above, we obtain a general upper bound for $\bar{\mu}_n$ by replacing $ex_i$ with $ex_{\max} = \max\{ex_i \mid i \in Z^+\}$, which is given by
$$\bar{\mu}_n \le 2^{-(ex_{\max}+1)}.$$
In order to satisfy the above upper bound, first note from (8) and (9) that $\psi_n$ is a nondecreasing function of n. This, together with (7), implies that $\bar{\mu}_{n+1} \le \bar{\mu}_n$ for all n. To satisfy the above upper bound, it is thus enough to fix the initial value of $\bar{\mu}_n$ by setting the first $ex_{\max}+1$ bits of the corresponding register following the binary point to zero, if $ex_{\max}+1 \ge 0$. If, however, $ex_{\max}+1 < 0$, one can retain $|ex_{\max}+1|$ data bits to the left of the binary point. Note also that since the initial value of $\psi_n$ is zero, the initial value of $\bar{\mu}_n$ actually determines the step size $\mu$. Finally, for practical implementation of $\mathbf{v}(n)$ as given by (6), we need to evaluate the product $\bar{\mu}_n \bar{x}(n-k)2^{\gamma_i}$ in such a way that no overflow occurs in any of the intermediate products or shift operations. At the same time, we need to avoid a direct product of quantities which could be very small, as that may lead to the loss of several useful bits via truncation. For this purpose, we proceed as follows: if $ex_i \ge ex_{i-1}$, then $S_i = S_{\min}$ and we express $2^{\gamma_i}$ as $2^{\gamma_i} = 2^{ex_i} 2^{S_{\min}}$. If, instead, $ex_i < ex_{i-1}$, then $S_{i-1} = S_{\min}$, $\gamma_i = ex_{i-1} + S_{i-1}$, and we decompose $2^{\gamma_i}$ as $2^{\gamma_i} = 2^{ex_{i-1}} 2^{S_{\min}}$. The factors $2^{ex_i}$ (or $2^{ex_{i-1}}$) and $2^{S_{\min}}$ are then distributed to compute the update term as follows.
Step 1. $\bar{\mu}_{1,n} = \bar{\mu}_n 2^{ex_i}$ if $ex_i \ge ex_{i-1}$; $\bar{\mu}_{1,n} = \bar{\mu}_n 2^{ex_{i-1}}$ if $ex_i < ex_{i-1}$.

Step 2. $\tilde{x}(n-k) = \bar{x}(n-k) 2^{S_{\min}}$, $k = 0, 1, \ldots, L-1$.

Step 3. $\mathbf{v}(n) = \bar{\mathbf{w}}(n) + \bar{\mu}_{1,n} \, \mathrm{sgn}\{\bar{e}(n)\} \, \tilde{\mathbf{x}}(n)$, where $\tilde{\mathbf{x}}(n) = [\tilde{x}(n), \ldots, \tilde{x}(n-L+1)]^t$.
Note that in Step 2, only the current mantissa $\bar{x}(n)$ needs to be shifted by $S_{\min}$ bits, as the other terms $\bar{x}(n-k)$, $k = 1, 2, \ldots, L-1$, were already shifted at the previous indices. For n = iN, that is, the starting index of the ith block, these terms correspond to the last (L − 1) mantissas of the (i − 1)th block, rescaled by $2^{-\Delta\gamma_i}$. Further scaling of these terms by $2^{S_{\min}}$ can be carried out during the block formatting stage, that is, before the processing of the ith block (see the sketch below).
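The following sketch, with illustrative names and floats in place of FxP registers, strings Steps 1-3 together with the normalization of (8) and (9), reusing normalize_weights from the earlier sketch.

```python
def update_term(mu_bar_n, e_bar, x_tilde, ex_i, ex_prev):
    """Steps 1-3 of this section. x_tilde holds the mantissas
    x_bar(n-k) pre-scaled by 2**S_min (Step 2); only the newest sample
    needs this shift at run time, per the note above."""
    mu1 = mu_bar_n * 2.0 ** max(ex_i, ex_prev)       # Step 1
    s = 1.0 if e_bar >= 0 else -1.0                  # sgn{e_bar(n)}
    return [mu1 * s * xk for xk in x_tilde]          # Step 3 update term

def sign_lms_iterate(w_bar, psi_n, mu_bar_n, e_bar, x_tilde, ex_i, ex_prev):
    """One weight update: v(n) per (6), then normalization per (8)/(9)."""
    v = [wk + tk for wk, tk in zip(
        w_bar, update_term(mu_bar_n, e_bar, x_tilde, ex_i, ex_prev))]
    return normalize_weights(v, psi_n, mu_bar_n)     # from the sketch above
```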
The proposed BFP treatment of the sign LMS algorithm is summarized in Table 1. The three units, viz., (i) buffering, (ii) block formatting, and (iii) filtering and weight updating, are actually pipelined, and their relative timing is shown in Figure 1. Also, for the filtering and weight updating unit, the internal processing is illustrated in Figure 2.
Table 1: Summary of the proposed BFP treatment of the sign LMS algorithm.

(1) Preprocessing: using the data for the ith block, x(n) and d(n), $n \in Z_i$, $i \in Z^+$ (stored during the processing of the (i − 1)th block):
(a) Evaluate the block exponent $\gamma_i$ as per the exponent assignment algorithm of Section 3 and express $x(n) = \bar{x}(n)2^{\gamma_i}$, $d(n) = \bar{d}(n)2^{\gamma_i}$, $n \in Z_i$.
(b) Rescale the last L − 1 mantissas of the (i − 1)th block, $\bar{x}(iN-1), \ldots, \bar{x}(iN-L+1)$, by $2^{-\Delta\gamma_i}$ (to absorb Step 2 of Section 3 in advance, rescale the same separately by $2^{-\Delta\gamma_i + S_{\min}}$).

(2) Processing for the ith block, $n \in Z_i$:
(a) Filter output (mantissa) computation: $\bar{y}(n) = \bar{\mathbf{w}}^t(n)\bar{\mathbf{x}}(n)$; ex_out(n) = $\gamma_i + \psi_n$ (ex_out(n) is the filter output exponent at the nth index).
(b) Output error (mantissa) computation: $\bar{e}(n) = \bar{d}(n)2^{-\psi_n} - \bar{y}(n)$ (Eq. (5)).
(c) Filter weight updating: compute the filter weight mantissas and exponent (Eqs. (8) and (9)) using Steps 1-3; update the step size mantissa (Eq. (7)).

Repeat (1) and (2) for each successive block.
COMPLEXITY ISSUES
The proposed scheme relies mostly on FxP arithmetic, resulting in computational complexity much less than that of an FP-based counterpart. For example, to compute the filter output in Table 1, L "multiply and accumulate" (MAC) operations (FxP) are needed to evaluate $\bar{y}(n)$ and, at most, one exponent addition operation to compute the exponent ex_out(n). In FP, by contrast, each multiply-accumulate additionally involves mantissa alignment shifts, exponent comparison (EC), and exponent addition (EA). In other words, in FP, computation of y(n) will require the following additional operations over the BFP-based realization: (a) 2L shifts (assuming availability of single-cycle barrel shifters), (b) L EC, and (c) 2L − 1 EA. Similar advantages exist in weight updating as well. Table 2 provides a comparative account of the two approaches in terms of the number of operations required per iteration. Note that the number of additional operations required under FP increases linearly with the filter length L. It is easy to verify from Table 2 that, given a low cost, simple FxP processor with single-cycle MAC and barrel shifter units, the proposed scheme is about three times faster than an FP-based implementation for moderately large values of L.
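The following back-of-the-envelope counter, based only on the operation counts quoted above (Table 2 itself is not reproduced here), illustrates the trend for the filtering stage; the overall speedup, with weight updating included, is the roughly 3:1 figure quoted in the text.

```python
def filtering_ops(L):
    """Single-cycle operation counts for computing y(n), per the
    breakdown above: BFP uses L FxP MACs plus one exponent addition;
    FP additionally needs 2L shifts, L ECs, and 2L - 1 EAs."""
    bfp = L + 1
    fp = bfp + 2 * L + L + (2 * L - 1)
    return bfp, fp

bfp, fp = filtering_ops(32)                     # a moderately large L
print(f"filtering-stage speedup ~ {fp / bfp:.1f}x")
# Weight updating narrows the overall figure to about 3x (Table 2).
```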
SIMULATION AND FINITE PRECISION IMPLEMENTATION
The proposed scheme was implemented in finite precision in the context of a system identification problem. A system modelled by a 3-tap FIR filter was used to generate an output $y(n) = 0.7x(n) + 0.65x(n-1) + 0.25x(n-2) + v(n)$, with v(n) and x(n) being the observation noise and the system input, respectively. The quantity $\max\{\sigma_x, \sigma_y\}$, with $\sigma_x$ and $\sigma_y$ denoting the standard deviations of x(n) and y(n), was found to be 0.935. To calculate the upper bound of $\mu$ $(= 2^{-ex_{\max}}/2)$, the quantity $M = \max\{|x(n)|, |y(n)| \mid n \in Z\}$ was calculated as $1.99\max\{\sigma_x, \sigma_y\}$, so as to contain about 95% of the samples of x(n) and y(n). This gives rise to $ex_{\max} = 1$ and thus an upper bound of 0.25 for $\mu$. Taking $\mu = 2^{-6}$, block length N = 20, and allocating 12 (1 + 11) bits for the mantissas and 4 (1 + 3) bits for the exponents of both the data and the filter coefficients, the proposed scheme was implemented in VHDL. For this, the Xilinx ISE 7.1i software was used for a target device of the Xilinx Virtex series FPGA XCV1000bg (speed grade 6). Details of this implementation, like hardware requirements, respective gate counts, and execution times, are provided later in this section. Here we study the finite precision effects by plotting the learning curves for this as well as for an FP-based realization under the same precision for both the exponent and the mantissa. The learning curves, shown in Figure 3 by solid and dashed lines, respectively, demonstrate that both implementations have similar rates of convergence. However, in the steady state, the BFP scheme has a slightly higher excess mean square error (EMSE) than the FP scheme, which may be caused by the block formatting of the data. This slight increase in EMSE is, however, offset by the speedup that is achieved, as verified by comparing the execution times of the proposed realization with its FP counterpart.
FPGA synthesis summary
The proposed scheme as well as the FP-based algorithm are implemented using basic hardware units like adders, multipliers, registers, multiplexers, shifters, and so forth. The step size $\mu$ is taken to be a power of two, as this eliminates the need for a multiplier in the weight update loop. For the proposed scheme, the three stages, (a) buffering, (b) block formatting, and (c) filtering and weight updating, have the following hardware requirements. (c) Filtering and weight updating: for filtering, a MAC operation (FxP) is used iteratively L times, where L is the filter length (L = 3 for the example considered). The MAC unit requires one 12 × 12 multiplier, one 24-bit adder, and two 24-bit registers, one for holding the result of multiplication and the other for storing the MAC output. This is followed by computation of the output error mantissa, which uses one 12-bit shifter and one 12-bit subtractor. For updating each tap weight, first note that since $\mu$ is a power of 2, that is, $\mu = \mu_0 = 2^s$ (say), we have $\bar{\mu}_n = 2^{s_n}$, where $s_n = s - \psi_n$. For updating $\bar{\mu}_n$, it is then enough to update $s_n$, which requires a 4-bit subtractor and a 4-bit register, but does not require the shifter implied in the general update relation (7). Steps 1-3 of Section 3 also get simplified, as it is then sufficient to use two 4-bit adders and one 4-bit register to compute $s_n + ex_i + S_{\min}$, 2L 12-bit shifters to shift $\bar{x}(n-k)$, $k \in Z_L$, left/right by $s_n + ex_i + S_{\min}$, and L 12-bit adders/subtractors to evaluate $\mathbf{v}(n)$ as per (6). Finally, to realize the update relations (8) and (9), we need a 4-bit adder and a 4-bit register to update $\psi_n$, and L 12-bit shifters as well as L 12-bit registers to compute $\bar{\mathbf{w}}(n)$.
An FP-based realization, on the other hand, has only two operations, namely, filtering and weight updating, both requiring FP addition and multiplication. If two FP numbers having an r-bit mantissa and an m-bit exponent each are multiplied, we need one r × r multiplier, one m-bit adder, and two registers of length m bits and 2r bits. If, on the other hand, the two numbers are added, we need one m-bit comparator, one m-bit subtractor, two r-bit shifters, two r-bit 2:1 MUXes, one r-bit adder, and, for renormalization of the result, two r-bit shifters and one m-bit adder/subtractor. We also need registers of length m bits and r bits for storing the mantissa and exponent of the result of addition. To realize the filtering operation, an FP-based MAC operation is used iteratively L times, using one FP multiplication with r = 12 and m = 4, and one FP addition with r = 24 and m = 4. For computing the output error, an FP addition with r = 12 and m = 4 is deployed. For updating each weight, a 4-bit adder is used to add the exponents of the step size and data, followed by an FP addition with r = 12 and m = 4.
The total equivalent gate count for the proposed scheme with N = 20 was found to be 9227, while that for an FP-based implementation was 12,468. The minimum clock period needed for the FP-based implementation was 16.052 ns. For the proposed scheme, the minimum clock periods required for the three stages, (a) buffering, (b) block formatting, and (c) filtering and weight updating, were 0.232 ns, 4.575 ns, and 6.695 ns, respectively. In other words, the minimum clock period needed for the proposed scheme was 6.695 ns, and thus the BFP realization is about 2.39 times faster than the FP-based realization, which also conforms to our observation from Table 2 for L = 3.
CONCLUSION
The sign LMS algorithm is presented in a BFP framework that ensures simple FxP-based operations in most of the computations, while maintaining an FP-like wide dynamic range via a block exponent. The proposed scheme is particularly useful for custom implementations on ASIC or FPGA, where hardware and power efficiency constitute important factors. For identical resource constraints, the proposed scheme achieves a speedup in the range of 2:1 to 3:1 over an FP-based implementation, as observed both from operation counts and from a custom implementation on FPGA. Finite-precision simulations also did not show any noticeable difference in the convergence characteristics in moving from the FP to the BFP format.
ACKNOWLEDGMENT
This work was supported in part by the Institute of Information Technology Assessment (IITA), South Korea.
