Abstract
Introduction
In a radix-$r$ LNS, a number $X$ is represented as a signed-exponent word, $x$, and its value is determined as $X = \pm r^{x}$. Addition and subtraction in LNS arithmetic require the computation of the functions $\log_r(1 + r^{v})$ and $\log_r(1 - r^{v})$, where $v$ is the difference between the two operand exponents; this computation is usually performed by table lookup. An increase in the word length of the LNS number implies an exponential increase in the table size. To reduce the hardware cost of computing these two functions, many table-reduction techniques [1] [2] and computational methods, either to compute [3] [4] or to avoid [5] [6] these two functions, have been proposed. In [7], digit-parallel additive-normalization [8] and digit on-line [9] multiplicative-normalization methods are adopted to compute the exponential function and the logarithmic function, respectively, within these two functions. The size of the lookup tables required by the LNS unit proposed in [7] is only a third-order polynomial function of the word length, and the cost of the other circuits for computing the two functions is only proportional to the square of the word length of the LNS unit. However, the cost and the pipeline latency of the circuits in [7] remain large.

This paper proposes two approaches for improving the performance of the LNS addition/subtraction computation introduced in [7]. For convenience, the LNS unit designed in [7] is referred to as the basic LNS unit hereafter. We first propose that the base-e exponential function be computed to obtain the required base-$r$ exponential. We assume the base of the LNS is two, $r = 2$. One advantage of this approach is the simplification of the exponent computation. The other advantage is that the operations in half of the pipeline stages can be replaced by just one stage of multiplication-and-accumulate (MAC) operations. The pipeline latency and the hardware cost can thus be greatly reduced. The other approach for improving the performance of the LNS unit is the adoption of signed-digit (SD) arithmetic.
We developed SD exponential, SD discretization, and SD on-line logarithmic algorithms for computing the two functions required by LNS addition and subtraction. Fast carry-propagation-free SD adders can be used instead of the slower carry-propagate adders. The latency of each stage and the pipeline latency of the LNS unit can thus be greatly reduced.
We have designed the architectures of a 32-bit LNS unit and a 64-bit LNS unit by using the base-e exponential algorithm. We call these LNS units the hardware-reduced LNS units. We then designed the architectures of a 32-bit LNS unit and a 64-bit LNS unit based on both the SD algorithms and the hardware-reduced algorithms. We call these two LNS units the SD hardware-reduced LNS units. From our analysis, we estimate that the proposed hardware-reduced approach removes about half of the lookup-table size and half of the circuit cost of the 64-bit basic LNS unit. If the SD approach is also used, the throughput of the 64-bit unit becomes 4.62 times that of the 64-bit basic LNS unit. Furthermore, its pipeline latency is reduced by 85.88%. It then becomes possible to design a 64-bit LNS unit whose number of pipeline stages is only one-seventh of that of the 64-bit basic LNS unit. We conclude that the proposed approaches can significantly improve the performance of the basic LNS unit. In the following, the proposed algorithms are presented in Sections 2 and 3. The architectural design and simulation results of the SD hardware-reduced LNS unit are described in Section 4. Section 5 compares the proposed LNS units with the basic LNS units. Conclusions are drawn in Section 6.
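The role of the two functions in LNS addition and subtraction can be illustrated with a short numerical sketch. It assumes base 2, positive operands, and floating-point evaluation of the logarithm in place of the table lookup or pipelined computation discussed in this paper; the names `lns_add` and `lns_sub` are hypothetical.

```python
import math

def lns_add(x, y):
    """Exponent of 2**x + 2**y for two positive base-2 LNS operands."""
    hi, lo = max(x, y), min(x, y)
    v = lo - hi                       # exponent difference, v <= 0
    return hi + math.log2(1.0 + 2.0 ** v)

def lns_sub(x, y):
    """Exponent of 2**x - 2**y; assumes x > y so the result is positive."""
    v = y - x                         # exponent difference, v < 0
    return x + math.log2(1.0 - 2.0 ** v)

if __name__ == "__main__":
    # 2**3 + 2**1 and 2**3 - 2**1, recovered from the LNS exponents
    print(2.0 ** lns_add(3.0, 1.0), 2.0 ** lns_sub(3.0, 1.0))
```

Only the exponent difference enters the nonlinear function, which is why a single-argument table or pipeline suffices for each operation.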
Hardware-reduced computation of LNS addition/subtraction
In [7], the algorithm for computing the two functions includes three digit-pipeline stages: the exponential, the discretization, and the digit on-line logarithmic stages.
The base-e exponential algorithm
The purpose of each exponential stage is to compute the partial result of 2 
As a result, we can replace the last $N/2$ pipeline stages by an MAC operation. The discretization algorithm first computes an intermediate value and then determines the value of the digit $z_j$ by the three-way selection rule given in the discretization algorithm below.
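The replacement of the trailing exponential stages by a single MAC can be illustrated numerically. The following is a minimal sketch, assuming a textbook additive-normalization recurrence with digits in {0, 1}, a split after N/2 stages, and double-precision floats in place of fixed-point hardware. The function name `exp2_mac` and its parameters are hypothetical, and the stage details of the proposed unit may differ.

```python
import math

def exp2_mac(x, n_bits=32):
    """Approximate 2**x for 0 <= x < 1 using n_bits/2 shift-and-add stages
    followed by one multiply-accumulate (assumed scheme, not the exact design)."""
    y = x * math.log(2.0)            # operand multiplier: 2**x = e**(x ln 2)
    e = 1.0
    for j in range(1, n_bits // 2 + 1):
        c = math.log1p(2.0 ** -j)    # table entry ln(1 + 2**-j)
        if y >= c:                   # stage digit is 1
            y -= c                   # reduce the residual argument
            e += e * 2.0 ** -j       # e *= (1 + 2**-j), a shift and an add
    # The residual y is now below 2**-(n_bits/2), so e**y is close to 1 + y
    # and the remaining stages collapse into one MAC: e * (1 + y) = e + e*y.
    return e + e * y

if __name__ == "__main__":
    for x in (0.0, 0.25, 0.5, 0.99):
        print(x, exp2_mac(x), 2.0 ** x)
```

The truncation error of the final step is on the order of the squared residual, which is why roughly half of the stages are enough before the MAC takes over.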
The discretization algorithm
j j j j E E Y 2 ) (. 5 . 0 2 ) ( 2 if 1 5 . 0 2 ) ( 2 0.5 if 0 5 . 0 2 ) ( 2 if 1 1 1 1 1 1 1 j j j j j j j j j j j j j E E Y E E Y E E Y z Finally, the discretization term j Y is computed as: j j j j j j z E E Y Y 2 ) ( 2
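The three-way digit selection with thresholds at 0.5 and -0.5 can be illustrated with a generic recoding sketch. This is not the recurrence of the discretization stage itself, whose update also involves the exponential terms. It assumes a simplified residual update w <- 2w - z, and the choice of which side of each threshold is inclusive is also an assumption. The names `select_digit` and `recode` are hypothetical.

```python
def select_digit(t):
    """Three-way selection: 1, 0, or -1 by comparison with +0.5 and -0.5.
    The inclusive side of each threshold is an assumption."""
    if t >= 0.5:
        return 1
    if t < -0.5:
        return -1
    return 0

def recode(x, n):
    """Recode x in [-1, 1] into n signed digits z_1..z_n in {-1, 0, 1}
    such that x = sum(z_j * 2**-j) + w * 2**-n with |w| <= 1."""
    w = x
    digits = []
    for _ in range(n):
        t = 2.0 * w                  # scaled residual examined by the rule
        z = select_digit(t)
        digits.append(z)
        w = t - z                    # simplified residual update (assumed)
    return digits, w

if __name__ == "__main__":
    digits, w = recode(0.3, 12)
    approx = sum(z * 2.0 ** -(j + 1) for j, z in enumerate(digits))
    print(digits, approx, w)
```

Because the selected digit always keeps the residual within [-1, 1], the truncation error after n digits is at most 2^-n.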
Digit on-line logarithmic algorithm
In the digit on-line logarithmic algorithm, the argument 
, where
$l$ denotes the on-line delay, whose value is derived to be two. In the $j$th stage, the digit on-line logarithmic computation includes the following three steps:
$Q_j$ is defined as
Step 2. Determine the value of $q_j$ according to the following rule:
Step 3 
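The multiplicative-normalization principle behind the logarithmic stages can be sketched as follows. The sketch assumes a digit-parallel argument rather than on-line digit input, a greedy choice of the digit $q_j$ in {-1, 0, 1}, a natural-logarithm result, and floating-point arithmetic, so it does not reproduce the selection rule or the on-line delay of the algorithm described here. The name `log_mult_norm` is hypothetical.

```python
import math

def log_mult_norm(x, n=40):
    """Approximate ln(x) for 0.5 <= x <= 2 by driving x toward 1 with
    factors (1 + q * 2**-j) and accumulating -ln(1 + q * 2**-j)."""
    acc = 0.0
    for j in range(1, n + 1):
        step = 2.0 ** -j
        # Greedy digit choice (assumed): pick q that brings x closest to 1.
        q = min((-1, 0, 1), key=lambda d: abs(x * (1.0 + d * step) - 1.0))
        if q != 0:
            x *= 1.0 + q * step              # shift-and-add in hardware
            acc -= math.log1p(q * step)      # table entry for this stage
    return acc

if __name__ == "__main__":
    for x in (0.5, 0.75, 1.3, 2.0):
        print(x, log_mult_norm(x), math.log(x))
```

In a hardware pipeline each iteration would correspond to one stage with a small table holding the two logarithm constants for that stage.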
The hardware-reduced logarithmic algorithm
In the logarithmic pipeline of the hardware-reduced LNS unit, there are only 
The value of the on-line digit 
We can show that
Architectural design of the 32-bit hardware-reduced LNS unit
To design a 32-bit LNS unit that has its precision performance comparable to IEEE single-precision FLP operations, we require that are the word lengths of the tables in the exponential and logarithmic stages, respectively. Fig. 1 shows the architecture of the 32-bit hardware-reduced LNS unit.
In Fig. 1 , the Operand Multiplier in stage 1 is used to generate the operand 2 ln 
Signed-Digit algorithms for LNS addition/subtraction computation
In this section, the SD exponential, SD discretization, and SD on-line logarithmic algorithms that are proposed to compute the LNS addition/subtraction are described.
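All three SD algorithms rely on carry-propagation-free addition of signed-digit operands. A minimal sketch of this property is given below, assuming the textbook radix-2 scheme with digit set {-1, 0, 1}, in which each sum digit depends only on the digits of the current position and the next lower position. It is not necessarily the adder cell of [10]; the names `sd_add`, `sd_value`, and `exhaustive_check` are hypothetical.

```python
from itertools import product

def sd_value(digits):
    """Integer value of a little-endian radix-2 signed-digit list."""
    return sum(d * (1 << i) for i, d in enumerate(digits))

def sd_add(x, y):
    """Carry-free addition of two equal-length little-endian SD numbers."""
    n = len(x)
    p = [a + b for a, b in zip(x, y)]        # position sums in -2..2
    w = [0] * n                              # interim sum digits
    t = [0] * (n + 1)                        # transfer digits
    for i in range(n):
        # A nonnegative lower position can only send a transfer >= 0.
        lower_nonneg = (i == 0) or (p[i - 1] >= 0)
        if p[i] == 2:
            t[i + 1], w[i] = 1, 0
        elif p[i] == 1:
            t[i + 1], w[i] = (1, -1) if lower_nonneg else (0, 1)
        elif p[i] == -1:
            t[i + 1], w[i] = (0, -1) if lower_nonneg else (-1, 1)
        elif p[i] == -2:
            t[i + 1], w[i] = -1, 0
    # Each final digit is w[i] + t[i]; no carry ripples any further.
    return [w[i] + t[i] for i in range(n)] + [t[n]]

def exhaustive_check(n):
    """Verify value and digit range for all pairs of n-digit SD operands."""
    for x in product((-1, 0, 1), repeat=n):
        for y in product((-1, 0, 1), repeat=n):
            s = sd_add(list(x), list(y))
            if sd_value(s) != sd_value(x) + sd_value(y):
                return False
            if any(d not in (-1, 0, 1) for d in s):
                return False
    return True

if __name__ == "__main__":
    print(sd_add([1, 1, 0, -1], [1, 0, 1, 1]), exhaustive_check(4))
```

Since every output digit is determined by a fixed-size neighborhood, the addition delay does not grow with the word length, which is the property exploited to shorten each pipeline stage.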
SD exponential algorithm
We first assume that 
Signed-digit discretization algorithm
In our proposed SD discretization algorithm, only a few leading digits of the term
are needed to determine the value of $z_j$. However, the digit on-line delay of this SD algorithm is increased to three. The SD discretization algorithm includes the following steps:
Step 1. Compute the value of
Step 2. Determine the value of the digit $z_j$ according to the following rule:
SD on-line logarithmic algorithm
The proposed SD on-line logarithmic algorithm follows the same three steps described in Section 2.3, except that its $q_j$ selection rule differs from that in (1) and its digit on-line delay is three rather than two. From (1), we denote
It has been proved that
, and the value of B is derived to be 1.5. The convergence of this SD on-line algorithm is thus ensured and the algorithm is proved.
Design of the SD hardware-reduced LNS unit
For a 64-bit LNS unit whose precision is comparable to that of an IEEE double-precision floating-point unit, the numbers of integer bits and fractional bits should be $I = 11$ and $F = 52$, respectively. The architecture of an SD hardware-reduced LNS unit is similar to that of the hardware-reduced LNS unit shown in Fig. 1, with several differences. First, the arithmetic operations in the exponential, discretization, and logarithmic stages of the SD LNS unit are all in redundant SD format. Second, in the SD LNS unit, the digit on-line delay in the discretization pipeline is three rather than one, and the on-line delay of the logarithmic pipeline is three rather than two. A method similar to that used for the hardware-reduced LNS unit is applied to analyze the error of the SD hardware-reduced LNS unit. For a 64-bit LNS unit, the minimum values of
Comparison and discussion
This section compares the following six kinds of LNS units: the 32-bit and 64-bit basic LNS units in [7] , the 32-bit and the 64-bit hardware-reduced (HR) LNS units, and the 32-bit and 64-bit SD hardware-reduced (SD HR) LNS units.
Circuit area comparison
The circuits within the proposed LNS units are mainly composed of the following components: small PLAs for lookup tables, adders, comparators, shifters, and multiplexers. The circuit area of an LNS unit is estimated from the number of these components, which is listed for each LNS unit in Table 1. To estimate the area of the whole LNS unit, we use the gate counts of these components obtained from Synopsys synthesis with a 0.35 µm cell library. In addition, these components are designed with four-input logic gates wherever possible. The SD adder cells with four-input logic gates introduced in [10] are used for the design of the SD adders. Carry-lookahead adders (CLAs) with 8-bit group generate and group propagate signals are used. Based on these gate counts and the results in Table 1, we estimated the circuit areas of the six kinds of LNS units, which are listed in Table 2.
Delay and pipeline latency comparison
The critical path of each stage of the proposed pipelined LNS unit is in the logarithmic stage, whose delay is the sum of the delays of two additions, one comparison, and one table lookup. The delay of the lookup table in the basic and the hardware-reduced LNS units is estimated to be two gate delays, while the delay of the lookup table in the SD hardware-reduced LNS unit is estimated to be three gate delays. In Table 3 , the estimated delays of the critical paths in the six LNS units are listed. The pipeline latencies of the LNS units are estimated as the sum of the delays in the critical path from the first stage to the last stage without considering the delay in the latches. The multipliers within the MAC components are designed to be binary-tree SD multipliers. The estimated pipeline latencies of the six LNS units are listed in Table 4 .
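The delay model described above can be written down as a small sketch. The composition of the critical path (two additions, one comparison, and one table lookup) and the lookup delays of two and three gate delays come from the text; the adder and comparator delays used in the example call are placeholders chosen only for illustration and are not taken from Table 3. The function names are hypothetical.

```python
def stage_delay(add_delay, compare_delay, lookup_delay):
    """Critical-path delay of one logarithmic stage, in gate delays:
    two additions, one comparison, and one table lookup."""
    return 2 * add_delay + compare_delay + lookup_delay

def pipeline_latency(stage_delays):
    """Latency as the sum of the per-stage critical-path delays,
    ignoring latch delay as in the text."""
    return sum(stage_delays)

def throughput_ratio(reference_stage_delay, improved_stage_delay):
    """Throughput gain when the slowest stage sets the clock period."""
    return reference_stage_delay / improved_stage_delay

if __name__ == "__main__":
    # Placeholder adder/comparator delays (assumed, not from Table 3).
    basic = stage_delay(add_delay=10, compare_delay=8, lookup_delay=2)
    sd_hr = stage_delay(add_delay=2, compare_delay=3, lookup_delay=3)
    print(basic, sd_hr, throughput_ratio(basic, sd_hr))
```

The model makes explicit that the SD unit pays one extra gate delay in the lookup but can gain on both additions, since a carry-free adder has a delay that is independent of the word length.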
Discussion
From Tables 1 and 2, we can derive that, for the 64-bit unit, the table size and the circuit area of the basic LNS unit can be reduced by 54.4% and 52.2%, respectively, by using the proposed hardware-reduced algorithms. However, the area reduction for the 64-bit unit is only 27.6% if the SD approach is also used in designing the LNS unit. From Table 3, we can derive that the throughput of the 64-bit SD hardware-reduced LNS unit is 4.62 times that of the 64-bit basic LNS unit.
Finally, from Table 4 , we can find that the pipeline latency of the 64-bit SD hardware-reduced LNS unit is 14.12% of that of the 64-bit basic LNS unit. From the above discussion, we conclude that if only the hardware-reduced algorithms, including the base-e exponential and the hardware-reduced logarithmic algorithms, are applied in developing the LNS unit, the hardware cost can be reduced by half. However, the throughput improvement is not significant. If the SD algorithms are further applied, the throughput of the LNS unit is significantly improved. However, the saving in the hardware cost becomes smaller. There is a tradeoff between the hardware saving and the throughput improvement.
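The latency figures quoted in this paper can be cross-checked with a few lines of arithmetic. The sketch below uses only the 14.12% latency ratio stated above; the function names are hypothetical.

```python
def latency_reduction_percent(latency_ratio):
    """Percentage reduction implied by a new/old latency ratio."""
    return (1.0 - latency_ratio) * 100.0

def latency_speedup(latency_ratio):
    """Factor by which the latency shrinks (old latency / new latency)."""
    return 1.0 / latency_ratio

if __name__ == "__main__":
    ratio = 0.1412   # 64-bit SD hardware-reduced vs. 64-bit basic unit
    print(latency_reduction_percent(ratio), latency_speedup(ratio))
```

A ratio of 14.12% corresponds to the 85.88% reduction and to the reduction by a factor of about seven reported elsewhere in the paper.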
Conclusions
In this paper, the base-e exponential algorithm is first proposed to reduce the hardware cost and the pipeline latency of the pipelined LNS computation proposed in [7]. From our analysis, the hardware cost of the 64-bit basic LNS addition/subtraction unit can be reduced by more than 50% by using this hardware-reduced approach. We also developed the SD algorithms to further enhance the performance of LNS addition/subtraction computation. The throughput of the 64-bit basic LNS unit can thus be increased by a factor of 4.62 and its pipeline latency can be reduced by a factor of seven. The hardware cost of the 64-bit SD hardware-reduced LNS unit is still 27.6% less than that of the basic LNS unit. The two proposed approaches, the hardware-reduced algorithms and the SD algorithms, have been verified by simulations of the 32-bit hardware-reduced and the 32-bit SD hardware-reduced LNS units. From the results of our simulations and analysis, we conclude that the proposed approaches can significantly improve the performance of LNS addition/subtraction computation at very large word lengths.
Acknowledgement

