Abstract: A digit-recurrence algorithm is proposed for computing the Euclidean norm of a three-dimensional (3D) vector, an operation that often appears in 3D computer graphics. One of the three squarings required for the usual computation is removed and the other two squarings, as well as the two additions, are overlapped with the square rooting. The Euclidean norm is computed by iteration of carry-propagation-free additions, shifts, and multiplications by one digit. Different specific versions of the algorithm are possible, depending on the radix, the redundancy factor of the digit set, and so on. Each version of the algorithm can be implemented as a sequential (folded) circuit or a combinational (unfolded) circuit, which has a regular array structure suitable for VLSI.
INTRODUCTION
Advances in VLSI technologies make it attractive to accelerate important complex operations with special hardware. By providing special hardware for such an operation, we can release the other arithmetic units, especially a multiplier, from it and, hence, greatly increase the performance of a system. It is also desirable to avoid intermediate rounding errors between atomic floating-point operations and to have only a bounded error in the final result of a complex operation.
Computation of the Euclidean norm of a three-dimensional (3D) vector often appears in 3D computer graphics, e.g., for normalizing the vector. "The number of norm computations per second" is one of the criteria of the performance of a graphics engine. Note that normalizing a vector requires the Euclidean norm itself, not its square. In this paper, we propose an algorithm for computing the Euclidean norm of a 3D vector, which is suitable for VLSI implementation. Usually, the Euclidean norm is computed through three squarings, two additions, and one square rooting. We combine these operations into one digit-recurrence and compute the Euclidean norm by iteration of additions/subtractions, shifts, and multiplications by one digit. We remove one of the three squarings required for the usual computation and overlap the other two squarings, as well as the two additions, with the square rooting.
Starting with the vector component with the highest order of magnitude among the three as the initial value of the partial result, we add correcting-digits to it step-by-step to obtain the norm. We add partial products of the squares of the other two vector components to the residual, step-by-step. We select each correcting-digit from a redundant digit set by estimating the residual and the partial result. We perform the additions/subtractions appearing in the recurrence equation for the residual without carry/borrow propagation by keeping the residual in a redundant representation, such as the carry-save form or the signed-digit representation [2]. We extend the on-the-fly conversion algorithm [3] for calculating the recurrence equation for the partial result.
We can design different specific versions of the algorithm, depending on the radix, the redundancy factor of the correcting-digit set, the type of representation of the residual, and the correcting-digit selection function, as with digit-recurrence algorithms for division or square rooting [4]. We can implement each version as a sequential (folded) circuit or a combinational (unfolded) circuit, which has a regular array structure suitable for VLSI.
In the rest of this paper, we will first define the computation of the Euclidean norm of a 3D vector to be considered. Then, we will show a general algorithm in Section 3 and show a radix-2 and a radix-4 version of the algorithm in Section 4. We will consider implementation of the algorithm in Section 5.
COMPUTATION OF THE EUCLIDEAN NORM
We consider the computation of the mantissa part of the Euclidean norm of a 3D vector represented as a triple of floating-point numbers, $F_X$, $F_Y$, $F_Z$, where
$$F_* = (-1)^{S_*} \cdot M_* \cdot 2^{E_*}.$$
Here, $S_*$ ($\in \{0, 1\}$), $E_*$, and $M_*$ ($\frac{1}{2} \le M_* < 1$) are the sign bit, the exponent part, and the mantissa part of $F_*$, respectively.
Without loss of generality, we assume that $F_X$ has the highest order of magnitude among the three components. Namely, $E_X \ge E_Y$ and $E_X \ge E_Z$. Then, the Euclidean norm of the vector is
$$2^{E_X} \sqrt{X^2 + Y^2 + Z^2},$$
where $X = M_X$, $Y = M_Y \cdot 2^{E_Y - E_X}$, and $Z = M_Z \cdot 2^{E_Z - E_X}$.
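The alignment step described above can be illustrated in software. The following is only a sketch under stated assumptions: it uses Python's `math.frexp`, which happens to return a mantissa in $[\frac{1}{2}, 1)$ like the format above, and the function names `aligned_mantissas` and `norm_via_alignment` are hypothetical helpers, not part of the proposed hardware.

```python
import math

def aligned_mantissas(fx, fy, fz):
    """Return (E_X, X, Y, Z) with the largest-magnitude component first.

    X is the mantissa of the largest component (1/2 <= X < 1 for nonzero input);
    Y and Z are the other mantissas shifted right by the exponent differences.
    """
    comps = sorted((abs(fx), abs(fy), abs(fz)), reverse=True)
    (mx, ex), (my, ey), (mz, ez) = (math.frexp(c) for c in comps)
    x = mx
    y = math.ldexp(my, ey - ex)  # M_Y * 2^(E_Y - E_X)
    z = math.ldexp(mz, ez - ex)  # M_Z * 2^(E_Z - E_X)
    return ex, x, y, z

def norm_via_alignment(fx, fy, fz):
    """Reference norm: 2^E_X * sqrt(X^2 + Y^2 + Z^2), using the usual operations."""
    ex, x, y, z = aligned_mantissas(fx, fy, fz)
    return math.ldexp(math.sqrt(x * x + y * y + z * z), ex)
```

For the vector (3, 4, 12), the aligned operands are $X = 0.75$, $Y = 0.25$, $Z = 0.1875$ with $E_X = 4$, and the norm is 13.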
Therefore, in this paper, we consider the computation of $P = \sqrt{X^2 + Y^2 + Z^2}$, where $\frac{1}{2} \le X < 1$, $0 \le Y < 1$, and $0 \le Z < 1$ hold. We assume that $X$, $Y$, and $Z$ are represented as $n$-digit, $(n+h)$-digit, and $(n+h)$-digit $r$-ary fractions, respectively, where $r = 2^b$. Namely, $X$, $Y$, and $Z$ are represented as $.x_1 x_2 \cdots x_n$, $.y_1 y_2 \cdots y_{n+h}$, and $.z_1 z_2 \cdots z_{n+h}$, respectively, and $X = \sum_{i=1}^{n} x_i r^{-i}$, $Y = \sum_{i=1}^{n+h} y_i r^{-i}$, and $Z = \sum_{i=1}^{n+h} z_i r^{-i}$. $h$ is a nonnegative integer determined for each version of the proposed algorithm. We will discuss $h$ in Section 3.2. We intend to obtain $P$ with $n$-digit precision. We assume that $P$ is represented as an $(n+1)$-digit $r$-ary number, $p_0.p_1 p_2 \cdots p_n$.
GENERAL ALGORITHM

Recurrence
Here, we derive a general digit-recurrence equation for radix $r$, where $r$ is a power of 2, i.e., $r = 2^b$. Different from digit-recurrence algorithms for division or square rooting [4], the result is obtained by adding the correcting-digits $q_j$ produced in the recurrence to the initial value $P[0]$ of the partial result. Furthermore, partial products of $Y^2$ and $Z^2$, i.e., $Y \cdot y_j$ and $Z \cdot z_j$, are added to the residual, step-by-step. Let $P[j]$ be the partial result after $j$ iterations. Then, $P[j] = P[0] + \sum_{i=1}^{j} q_i r^{-i}$. The recurrence equation for the partial result is:
$$P[j+1] = P[j] + q_{j+1} r^{-j-1}. \quad (1)$$
We select the correcting-digit $q_{j+1}$ from a redundant digit set $\{-a, \cdots, -1, 0, 1, \cdots, a\}$, where $\frac{r}{2} \le a < r$. The final result is $P[n] = P[0] + \sum_{i=1}^{n} q_i r^{-i}$. The result has to be computed for $n$-digit precision, i.e., $P[n]$ has to satisfy the precision condition (2).
We define a residual (or scaled partial remainder) $W[j]$ as
$$W[j] = r^j \left( X^2 + Y \cdot Y[j+h] + Z \cdot Z[j+h] - P[j]^2 \right), \quad (3)$$
where $Y[j+h] = .y_1 y_2 \cdots y_{j+h}$ and $Z[j+h] = .z_1 z_2 \cdots z_{j+h}$. Subtracting $r$ times (3) from the equation for $j+1$, we get the recurrence equation for the residual as:
$$W[j+1] = rW[j] + Y y_{j+h+1} r^{-h} + Z z_{j+h+1} r^{-h} - 2P[j] q_{j+1} - q_{j+1}^2 r^{-j-1}. \quad (4)$$
We will discuss $h$ and the selection of the correcting-digit $q_{j+1}$ in the next subsection. In order to obtain $P$ that satisfies (2), we have to bound $W[j]$ within a certain range. Rewriting (2) in terms of the residual gives the bounds (5) for $W[j]$.
Now, we determine the initial values of the residual and the partial result. Initially, (5) has to be satisfied for $j = 0$ and, hence, the corresponding condition (6) on $W[0]$ has to hold. We can satisfy (6) by letting
$$P[0] = X + 2^{-1}, \quad W[0] = Y \cdot Y[h] + Z \cdot Z[h] - X - 2^{-2}.$$
Note that, when $\rho = 1$, we can also satisfy (6) by letting
$$P[0] = X, \quad W[0] = Y \cdot Y[h] + Z \cdot Z[h].$$
In both cases, the larger $h$ is, the more computation is required for obtaining $W[0]$. The algorithm for computing the Euclidean norm consists of performing $n$ iterations of the recurrence equations (1) and (4). The general algorithm is summarized as follows:
Step 1:
$W[j+1] := rW[j] + Y y_{j+h+1} r^{-h} + Z z_{j+h+1} r^{-h} - 2P[j] q_{j+1} - q_{j+1}^2 r^{-j-1}$; }
Note that, when $\rho = 1$, we can replace Step 1 by "$P[0] := X$; $W[0] := Y \cdot Y[h] + Z \cdot Z[h]$;". We will discuss the selection of the correcting-digit $q_{j+1}$ in the next subsection.
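To make the recurrence concrete, here is a minimal software sketch of the radix-2 case with digit set $\{-1, 0, 1\}$ and initial values $P[0] = X$, $W[0] = Y \cdot Y[h] + Z \cdot Z[h]$. It uses exact rational arithmetic and a simplified selection rule that compares the exact residual with $0$ and $-\frac{1}{2}$; that rule is an assumption of this sketch (it relies on $h \ge 2$), not the truncated-estimate selection function of the paper. The names `digits` and `norm_radix2` are hypothetical.

```python
from fractions import Fraction
import math

def digits(v, count, r=2):
    """First `count` radix-r fractional digits of v (0 <= v < 1), MSB first."""
    out = []
    for _ in range(count):
        v *= r
        d = int(v)  # floor, since v >= 0
        out.append(d)
        v -= d
    return out

def norm_radix2(x, y, z, n, h=2):
    """Iterate the residual recurrence for r = 2; return (P[n], W[n])."""
    x, y, z = Fraction(x), Fraction(y), Fraction(z)
    yd, zd = digits(y, n + h), digits(z, n + h)
    y_h = sum(Fraction(d, 2 ** (i + 1)) for i, d in enumerate(yd[:h]))  # Y[h]
    z_h = sum(Fraction(d, 2 ** (i + 1)) for i, d in enumerate(zd[:h]))  # Z[h]
    p = x                      # P[0] = X
    w = y * y_h + z * z_h      # W[0] = Y*Y[h] + Z*Z[h]
    scale = Fraction(1, 2 ** h)
    for j in range(n):
        # Simplified exact selection (assumption of this sketch, needs h >= 2).
        q = 1 if w >= 0 else (0 if w >= Fraction(-1, 2) else -1)
        # y_{j+h+1} in the 1-based notation is yd[j + h] here.
        w = (2 * w + y * yd[j + h] * scale + z * zd[j + h] * scale
             - 2 * p * q - Fraction(q * q, 2 ** (j + 1)))
        p += Fraction(q, 2 ** (j + 1))
    return p, w
```

Whatever digits are selected, the returned residual equals $2^n (X^2 + Y \cdot Y[n+h] + Z \cdot Z[n+h] - P[n]^2)$, which is a convenient check on an implementation of the recurrence.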
We can increase the speed of the implementation with a small increase in hardware complexity by performing the additions/subtractions in the recurrence equation for the residual without carry/borrow propagation, using a redundant representation. Therefore, in this paper, we concentrate on this type of implementation. Namely, we represent the residual $W[j]$ in a redundant representation, such as the carry-save form or the (binary) signed-digit representation [2], and perform the additions/subtractions without carry/borrow propagation. Since $-4 < W[j] < 4$, we can represent $W[j]$ by either a two's complement carry-save form with 3-bit integer part (including the sign bit) or a binary signed-digit representation with 3-bit integer part.
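As an illustration of addition without carry propagation, the sketch below models one level of a carry-save adder on fixed-width two's complement words. It is a generic 3-to-2 compressor, not the specific adder structure of the proposed circuit; the function name `carry_save_add` and the `width` parameter are assumptions of this example.

```python
def carry_save_add(a, b, c, width):
    """Compress three width-bit words into a (sum, carry) pair.

    Each output bit depends only on the input bits of one position, so no
    carry ripples across the word. The pair satisfies
    (sum + carry) mod 2^width == (a + b + c) mod 2^width.
    """
    mask = (1 << width) - 1
    s = (a ^ b ^ c) & mask                                 # bitwise sum
    carry = (((a & b) | (a & c) | (b & c)) << 1) & mask    # majority, shifted left
    return s, carry
```

A carry-propagate addition of the two output words is needed only when a nonredundant value is required, e.g., for the few most significant bits used in digit selection.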
Although we may represent the partial result $P[j]$ in a redundant representation as well, we keep the nonredundant representation of it by an extension of the on-the-fly conversion. We have to extend the on-the-fly conversion algorithm [3] because the initial value of the partial result is not $0$ but $X + 2^{-1}$ or $X$. We will discuss this in Section 3.3.
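One way to see why $X + 2^{-1}$ (or $X$ when $\rho = 1$) is a workable starting point is the short bound below. It is a sketch that uses only the operand ranges $\frac{1}{2} \le X < 1$, $0 \le Y, Z < 1$, the digit-set constraint $a \ge \frac{r}{2}$, and the usual definition $\rho = a/(r-1)$ of the redundancy factor (assumed here); it is not the condition on $W[0]$ used in the text.

```latex
\begin{align*}
P^2 - X^2 &= Y^2 + Z^2 \ge 0
  &&\Rightarrow\; P \ge X, \\
Y^2 + Z^2 &< 2 \le 2X + 1
  &&\Rightarrow\; P^2 < (X + 1)^2 \;\Rightarrow\; P < X + 1, \\
0 \le P - X &< 1
  &&\Rightarrow\; \left| P - \left( X + 2^{-1} \right) \right| \le 2^{-1}, \\
\sum_{i \ge 1} a\, r^{-i} &= \frac{a}{r - 1} = \rho
  \ge \frac{r}{2(r - 1)} > 2^{-1}.
\end{align*}
```

The last line says that the correcting digits can move the partial result by up to $\rho > 2^{-1}$ in either direction, so starting from $X + 2^{-1}$ the target is always within reach. Starting from $X$ requires covering the whole interval $[0, 1)$, which is possible only with the maximally redundant digit set, $\rho = 1$.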
Correcting-Digit Selection
We have to select the correcting-digit $q_{j+1}$ from $\{-a, \cdots, -1, 0, 1, \cdots, a\}$ so that $W[j+1]$ stays within its bounds.
$q_{j+1}$ depends on the shifted residual $rW[j]$, the partial products $Y y_{j+h+1} r^{-h}$ and $Z z_{j+h+1} r^{-h}$, and the partial result $P[j]$. We can reduce the effect of $Y y_{j+h+1} r^{-h}$ and $Z z_{j+h+1} r^{-h}$ on the selection by making $h$ larger. We define $V[j]$ as $V[j] = rW[j] + Y y_{j+h+1} r^{-h} + Z z_{j+h+1} r^{-h}$. Let the interval of $V[j]$ where $k$ can be selected as $q_{j+1}$ be $[L_k[j], U_k[j]]$.
Note that the lower bound of the interval for $k = -a$ and the upper bound of the interval for $k = a$ are equal to the lower bound and the upper bound of $V[j]$, respectively. The continuity condition, i.e., $U_{k-1}[j] \ge L_k[j]$, yields condition (8).
$h$ must be determined so that (8) is satisfied. The smaller $\rho$ is, the larger $h$ must be. The left-hand side of (8) gives the overlap between successive selection intervals. Using the overlap, we can select $q_{j+1}$ by estimates of $V[j]$ and $P[j]$, i.e., $\hat{V}[j]$ and $\hat{P}[j]$, respectively. The correcting-digit selection function is described by a set of selection constants $m_k(\hat{P}[j])$.
We determine $h$ so that the partial products have no effect on the correcting-digit selection. Note that $0 \le Y y_{j+h+1} r^{-h} < (r-1) r^{-h}$ and that $0 \le Z z_{j+h+1} r^{-h} < (r-1) r^{-h}$. We obtain $\hat{V}[j]$ by truncating $rW[j]$, which is in a redundant representation, to $t$ fractional bits and obtain $\hat{P}[j]$ by truncating $P[j]$ to $d$ fractional bits. (Note that these are bits, not $r$-ary digits.) Since $P[j]$ is in the nonredundant representation, to be shown in the next subsection, $\hat{P}[j] \le P[j] < \hat{P}[j] + 2^{-d}$. When $W[j]$ is in the carry-save form, $\hat{V}[j] \le rW[j] < \hat{V}[j] + 2^{-t+1}$. Therefore, the conditions in (9) on the selection constants have to be satisfied. Here, $\max(L_k[j])$ denotes the maximum value of the lower bound of the interval of $V[j]$ where $k$ can be selected as $q_{j+1}$ when the estimate of $P[j]$ is $\hat{P}[j]$. Note that the maximum value of $\hat{V}[j]$ for which $k-1$ is selected as $q_{j+1}$ is $m_k(\hat{P}[j]) - 2^{-t}$. $m_k(\hat{P}[j])$ must be a multiple of $2^{-t}$ that satisfies (9). When $W[j]$ is in the binary signed-digit representation, $\hat{V}[j] - 2^{-t} < rW[j] < \hat{V}[j] + 2^{-t}$ and, hence, the corresponding conditions have to be satisfied. In both cases, a minimum overlap is required for a feasible correcting-digit selection.
These expressions are used to determine the correcting-digit selection function, i.e., the selection constants. However, since they depend on j, a different selection function might result for different j.
If we want to have a single selection function, we need to develop expressions that are independent of $j$. For $\min(U_{k-1}[j])$, the term depending on $j$ is always positive and approaches 0 for large $j$ and, therefore, can be neglected. On the other hand, for $\max(L_k[j])$, the term depending on $j$ is always positive and, therefore, cannot be neglected and we have to use its maximum value (for $j = 0$).
To examine whether a single selection function exists, we use (10). The worst case is $\hat{P}[j] = 2^{-1}$ and $k = -a + 1$, which yields condition (13).
When $r = 2$ and $a = 1$ ($\rho = 1$), this results in $2^{-1} \ge 2^{-t} + 2^{-h+2}$ and, therefore, we can obtain a single selection function, which does not depend on $j$. On the other hand, when $r \ge 4$, the left-hand side of (13) becomes negative, which means that no single selection function for all $j$ exists, as in the case of square rooting [5].
A possible alternative is to find $J$ so that a single selection function can be used for $j \ge J$ and consider the cases for $j < J$ separately. For the case $j \ge J$, condition (14) has to be satisfied. For specific values of $r$ and $a$ ($\rho$), we can determine the values of $J$, $d$, $t$, and $h$.
Representation of the Partial Result
Here, we consider the representation of the partial result $P[j]$. We intend to have the ordinary nonredundant $r$-ary representation (with digit set $\{0, 1, \cdots, r-1\}$) of $P[j]$ by an extension of the on-the-fly conversion. Different from digit-recurrence algorithms for division and square rooting, where the partial result after $j$ iterations is a $j$-digit number [4], $P[j]$ is an $(n+1)$-digit number. Since the $(j+1)$-th digit of (the nonredundant representation of) $P[j]$ is in the range from $0$ through $r-1$ and $q_{j+1}$ is selected from $\{-a, \cdots, -1, 0, 1, \cdots, a\}$, there are three possible carry values, i.e., $0$, $-1$, and $1$, into the $j$-th place of $P[j+1]$. Note that, in digit-recurrence algorithms for division and square rooting, there are only two possible carry values, i.e., $0$ and $-1$.
Therefore, we extend the on-the-fly conversion algorithm [3]. We keep the nonredundant representations of $P^{+}[j] = P[j] + r^{-j}$ and $P^{-}[j] = P[j] - r^{-j}$. Initially, when we let $P[0] = X + 2^{-1}$, $P^{+}[0] = X + 2^{-1} + 1$ and $P^{-}[0] = X + 2^{-1} - 1$.
We can obtain $P^{+}[j+1]$ and $P^{-}[j+1]$ by selecting $P^{+}[j]$ or $P^{-}[j]$ and rewriting the $j$-th and the $(j+1)$-th digits of the selected one according to the rule shown in Table 1. In the table, $C$ denotes $p^{+}[j]_{j+1} + q_{j+1}$. ($p^{+}[j]_{j+1}$ and $p^{-}[j]_{j+1}$ are identical and, for $j \ge 1$, they are equal to $x_{j+1}$.) Note that, when $r = 2$, there are only two groups, i.e., $C$ is $-1$ or $0$ and $1$ ($= r - 1$) or $2$ ($= r$). The rule for $r = 2$ is shown in Table 3 in the next section.
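For background, the sketch below shows the classical on-the-fly conversion used in division and square rooting [3], where only the carries $0$ and $-1$ occur. It is not the extended rule of Table 1, which also handles a carry of $+1$ and a nonzero initial value. Integers stand for the digit strings, and the function name `on_the_fly` is a hypothetical helper.

```python
def on_the_fly(signed_digits, r):
    """Convert signed radix-r digits (MSB first) without carry propagation.

    Two forms are kept: q (the value so far) and qm = q - 1 (one unit in the
    last place less). Each step appends one digit in 0..r-1 to one of the two
    forms, so no carry or borrow ever ripples through earlier digits.
    """
    q, qm = 0, -1
    for d in signed_digits:
        if d >= 0:
            new_q = q * r + d             # append d to Q
        else:
            new_q = qm * r + (r + d)      # borrow: append r - |d| to QM
        if d > 0:
            new_qm = q * r + (d - 1)      # append d - 1 to Q
        else:
            new_qm = qm * r + (r - 1 + d) # append r - 1 - |d| to QM
        q, qm = new_q, new_qm
    return q, qm
```

In this integer model, `q * r + digit` plays the role of concatenating a digit to a register; the extension described above additionally rewrites the previous digit so that a carry of $+1$ can be absorbed.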
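Recall the recurrence equation (4) for the residual. We have to produce $-2P[j] q_{j+1} - q_{j+1}^2 r^{-j-1} = -2 q_{j+1} \left( P[j] + \frac{1}{2} q_{j+1} r^{-j-1} \right)$ as the adder input. We can make the computation of (4) simpler by providing the nonredundant representation of $P[j] + \frac{1}{2} q_{j+1} r^{-j-1}$.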
SPECIFIC VERSIONS
We can design different specific versions of the algorithm, depending on the radix $r$, the redundancy factor $\rho$ of the correcting-digit set, the type of representation of the residual (carry-save or signed-digit), and the correcting-digit selection function. In this section, we show the details of a radix-2 and a radix-4 version of the algorithm.
A Radix-2 Version
Here, we consider the case that the radix $r$ is 2, the correcting-digit set is $\{-1, 0, 1\}$ (i.e., $a = 1$, $\rho = 1$), and the residual $W[j]$ is represented in the carry-save form (with 3-bit integer part).
The recurrence equations are:
$$P[j+1] = P[j] + q_{j+1} 2^{-j-1},$$
$$W[j+1] = 2W[j] + Y y_{j+h+1} 2^{-h} + Z z_{j+h+1} 2^{-h} - 2P[j] q_{j+1} - q_{j+1}^2 2^{-j-1}.$$
Since $\rho = 1$, initially, we can let $P[0] = X$ and $W[0] = Y \cdot Y[h] + Z \cdot Z[h]$. As stated in Section 3.2, a single correcting-digit selection function is possible for all $j$. To obtain the function, we get the values of $\max(L_k[j])$ and $\min(U_k[j])$ from (11) and (12). Then, the function has to satisfy (9) with these values.
Since $\hat{P}[j] \ge 2^{-1}$, we can choose $m_0 = -2^{-1}$ and $m_1 = 0$, which do not depend on $\hat{P}[j]$, by letting $t = 2$ and $h = 4$. We can reduce $h$ through detailed consideration of the case $j = 0$.
Since $2W[0] \ge 0$ and $t = 2$, then $\hat{V}[0] \ge -2^{-2}$. Therefore, when $j = 0$, $m_0$ can be any value as long as it is not larger than $-2^{-2}$.
Considering the case $j \ge 1$ separately, we can choose $m_0 = -\frac{3}{4}$ and $m_1 = 0$ by letting $t = 2$ and $h = 3$. The correcting-digit selection function depends on the most significant 6 bits of the shifted residual $2W[j]$.
The rule for the extended on-the-fly conversion for the radix-2 case is shown in Table 3. The rule for generation of $P[j] + 2^{-j-2} q_{j+1}$ is obtained by substituting 2 for $r$ in the rule shown in Table 2. (See Table 4 in [1].)
Algorithm [3DNORM_R2]
($\hat{V}[j]$: truncation of $2W[j]$ to 2 fractional bits.)
$W[j+1] := 2W[j] + Y y_{j+4} 2^{-3} + Z z_{j+4} 2^{-3} - 2 q_{j+1} \left( P[j] + 2^{-j-2} q_{j+1} \right)$;
(Carry-save additions. $P[j] + 2^{-j-2} q_{j+1}$ is generated by Table 2.)
Obtain $P^{+}[j+1]$ and $P^{-}[j+1]$ by Table 3; }
Step 3: Obtain $P$ from $P^{+}[n]$ and $P^{-}[n]$;
In Step 3, we obtain $P$ by either rewriting the least significant bit of $P^{-}[n]$ to 1 or rewriting the least significant bit of $P^{+}[n]$ to 0, according to whether the least significant bit of $P[n]$, i.e., $p[n]_n$, is 1 or 0.
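The final rewriting step can be checked with a small integer model. The sketch assumes that the two stored forms are $P[n] + 2^{-n}$ and $P[n] - 2^{-n}$, scaled by $2^n$ so that one unit in the last place is the integer 1; deciding the case by inspecting the least significant bit of the plus form is also an assumption of this example, as is the name `recover_p`.

```python
def recover_p(p_plus, p_minus):
    """Recover P from P + 1 and P - 1 by rewriting one least significant bit.

    If P is even, P + 1 is odd and clearing its LSB yields P.
    If P is odd, P - 1 is even and setting its LSB yields P.
    No carry or borrow propagation is needed in either case.
    """
    if p_plus & 1:
        return p_plus & ~1   # rewrite the LSB of the plus form to 0
    return p_minus | 1       # rewrite the LSB of the minus form to 1
```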
A Radix-4 Version
Here, we consider the case that the radix $r$ is 4, the correcting-digit set is $\{-2, -1, 0, 1, 2\}$ (i.e., $a = 2$, $\rho = \frac{2}{3}$), and the residual $W[j]$ is represented in the carry-save form (with 3-bit integer part).
Initially, we let $P[0] = X + 2^{-1}$ and $W[0] = Y \cdot Y[h] + Z \cdot Z[h] - X - 2^{-2}$.
As stated in Section 3.2, no single correcting-digit selection function for all $j$ exists. Therefore, we determine a single selection function for $j \ge J$ and consider the cases for $j < J$ separately. For the case $j \ge J$, condition (14) has to hold. We can make the left-hand side positive by letting $J = 1$ and $d = 5$.
To obtain the single function for $j \ge 1$, we get the values of $\max(L_k[j])$ and $\min(U_k[j])$ from (11) and (12). Then, the function has to satisfy (9) with these values.
($\hat{P}[j]$: truncation of $P[j]$ to 5 fractional bits.)
($\hat{V}[j]$: truncation of $4W[j]$ to 5 fractional bits.)
$W[j+1] := 4W[j] + Y y_{j+6} 4^{-5} + Z z_{j+6} 4^{-5} - 2 q_{j+1} \left( P[j] + \frac{1}{2} q_{j+1} 4^{-j-1} \right)$;
(Carry-save additions. $P[j] + \frac{1}{2} q_{j+1} 4^{-j-1}$ is generated by Table 2.)
Obtain $P^{+}[j+1]$ and $P^{-}[j+1]$ by Table 1; }
Step 3: Obtain $P$ from $P^{+}[n]$ and $P^{-}[n]$;
IMPLEMENTATION
We can implement each version of the algorithm as a sequential (folded) circuit or a combinational (unfolded) circuit. We can also use pipelining. Here, we consider implementation of the algorithm as a sequential circuit which performs one iteration of Step 2 in each clock cycle. Of course, we can construct a sequential circuit which performs more than one iteration of Step 2 per clock cycle.
General Consideration
First, we consider a circuit for performing one iteration of Step 2. It consists of a combinational circuit part and registers. The combinational circuit part consists of the following modules: the PP generators, the CSA array, the CD selector, the AI generator, the CSA, and the EOTF converter. We assume that the residual $W[j]$ is represented in the carry-save form.
Initially, we set REG-PP to $X + 2^{-1} + 1$, REG-PM to $X + 2^{-1} - 1$, REG-WC to $111.1 \cdots 1 \bar{x}_1 \bar{x}_2 \cdots \bar{x}_n$ (($b \times h$) 1s at the right of the binary point), where $\bar{x}_i$ is the $b$-bit binary representation of $r - 1 - x_i$, i.e., the bit-wise complement of $x_i$, and REG-WS to $111.1 \cdots 10 \cdots 01$ (($b \times h + 2$) 1s at the right of the binary point followed by ($b \times n - 3$) 0s). Recall that $r = 2^b$. Note that a carry-save form of $r^{-h}(-X - 2^{-2})$ is stored in REG-WC and REG-WS. (When $\rho = 1$ and we let $P[0] = X$ and $W[0] = Y \cdot Y[h] + Z \cdot Z[h]$, we initialize REG-PP to $X + 1$, REG-PM to $X - 1$, REG-WC to 0 and REG-WS to 0.) Then, we execute $h$ cycles (for $j = -h, \cdots, -1$) with $q_{j+1}$ set to $0$, keeping SREG-J, REG-PP, and REG-PM unchanged.
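The initial register contents can be checked numerically. The sketch below builds the two bit patterns described above (3 integer bits followed by $b(n+h)$ fractional bits) and verifies that their two's complement sum equals $r^{-h}(-X - 2^{-2})$. The helper names `initial_registers` and `twos_complement_value` are hypothetical, and the construction assumes $b \times n \ge 3$.

```python
from fractions import Fraction

def initial_registers(x_digits, b, h):
    """Bit strings for REG-WC and REG-WS, given the radix-2^b digits of X."""
    n = len(x_digits)
    r = 1 << b
    # Digit-wise complement r - 1 - x_i, each written with b bits.
    x_bar = "".join(format(r - 1 - d, f"0{b}b") for d in x_digits)
    wc = "111" + "1" * (b * h) + x_bar
    ws = "111" + "1" * (b * h + 2) + "0" * (b * n - 3) + "1"
    return wc, ws

def twos_complement_value(bits, frac_bits):
    """Value of a two's complement bit string with `frac_bits` fractional bits."""
    v = int(bits, 2)
    if bits[0] == "1":
        v -= 1 << len(bits)
    return Fraction(v, 1 << frac_bits)

# Radix-2 example: X = 0.1011 (binary) = 11/16, h = 3.
wc, ws = initial_registers([1, 0, 1, 1], b=1, h=3)
total = twos_complement_value(wc, 7) + twos_complement_value(ws, 7)
```

Here `total` is $-\frac{15}{128} = 2^{-3}\left(-\frac{11}{16} - \frac{1}{4}\right)$: the complemented digits contribute $-X$ minus one unit in the last place, and the trailing 1 in REG-WS restores that unit.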
We can also perform Step 3 using the circuit. We execute one cycle (for $j = n$) with $q_{n+1}$ set to $0$. Then, we can find $P$ in REG-PP (down to the $n$-th place). Therefore, using the circuit, we can perform $n$-digit Euclidean norm computation in $n + h + 1$ clock cycles. As shown in Fig. 2, the cycle time is the maximum of $t_{PPgen} + t_{CSAarray} + t_{CSA}$, $t_{CDsel} + t_{AIgen} + t_{CSA}$, and $t_{CDsel} + t_{EOTFconv}$, together with the time for loading data to registers. The amount of hardware is proportional to $n$. The circuit has a regular linear cellular array structure with a bit-slice feature suitable for VLSI implementation.
The circuit has many similarities with widely used dividers and square rooting circuits based on digit-recurrence algorithms, especially with the latter. The CD selector, the EOTF converter, and the AI generator can be obtained by modifying the result-digit selector, the on-the-fly converter, and the adder-input generator of a square rooting circuit, respectively. The CSA is the same as that of a square rooting circuit. We have to attach the PP generators and the CSA array. Thus, we can construct a norm computation circuit by modifying a square rooting circuit and attaching simple hardware for generating and adding partial products. It may be possible to implement a combined unit that performs both norm computation and square rooting as well as division.
For floating-point norm computation, besides the circuitry shown above, we need circuitry for finding the vector component with the highest order of magnitude and for aligning the mantissas. We can find such circuitry in a floating-point addition unit. Note that we need two floating-point additions in the usual floating-point norm computation.
Sequential Implementations of the Specific Versions
Here, we consider sequential implementations of the radix-2 and the radix-4 versions shown in Section 4 in a little more detail. We denote the number of bits of the operand by $N$. For $r = 2$, $n = N$. For $r = 4$, $n = N/2$, assuming $N$ to be even. First, we consider an implementation of the radix-2 version. The CD selector is a combination of a 6-bit carry-propagate adder and simple constant value comparators. Each of the PP generators consists of a row of $(N+3)$ 2-input AND gates. Buffers for driving $y_{j+h+1}$ and $z_{j+h+1}$ are also required. The CSA array is an $(N+6)$-bit 4-2 carry-save adder. The AI generator consists of a 5-input combinational circuit and an $(N+1)$-bit 5-to-1 selector followed by an $(N+1)$-bit 2-input multiplexer. Buffers for driving the rewriting values are also required. The CSA is an $(N+6)$-bit carry-save adder. The EOTF converter consists of a 4-input combinational circuit and a pair of $(N+1)$-bit 4-to-1 selectors. Buffers for driving the rewriting values are also required. The cycle time is $t_{CDsel} + t_{AIgen} + t_{CSA} + t_{load}$, which is about $t_{6CPA} + t_{5comb} + t_{buf} + t_{5SEL} + t_{2MUX} + t_{FA} + t_{load}$. Here, $t_{6CPA}$, $t_{5comb}$, $t_{buf}$, $t_{5SEL}$, $t_{2MUX}$, $t_{FA}$, and $t_{load}$ are the delays of a 6-bit carry-propagate adder, a 5-input combinational circuit, a buffer for driving a signal over about $N$-bit length, a 5-to-1 selector, a 2-input multiplexer, a full adder, and register loading, respectively. The cycle time is slightly longer than that of a radix-2 square rooting circuit.
Next, we consider an implementation of the radix-4 version. The CD selector is a combination of a 10-bit carry-propagate adder and comparators. Each of the PP generators consists of two rows of $(N+10)$ 2-input AND gates. The CSA array is an $(N+10)$-bit 4-2 carry-save adder followed by an $(N+13)$-bit 4-2 carry-save adder. The AI generator consists of a 9-input combinational circuit and an $(N+1)$-bit 5-to-1 selector followed by an $(N+2)$-bit 4-input multiplexer. The CSA is an $(N+13)$-bit carry-save adder. The EOTF converter consists of a 7-input combinational circuit and a pair of $(N+1)$-bit 4-to-1 selectors. Several buffers are required for driving long lines. The cycle time is again $t_{CDsel} + t_{AIgen} + t_{CSA} + t_{load}$, which is about $t_{10CPA} + t_{9comb} + t_{buf} + t_{5SEL} + t_{4MUX} + t_{FA} + t_{load}$. It is slightly longer than that of a radix-4 square rooting circuit.
CONCLUSION
We have proposed a digit-recurrence algorithm for computing the Euclidean norm of a 3D vector, an operation that often appears in 3D computer graphics. We have removed one of the three squarings required for the usual computation and overlapped the other two squarings, as well as the two additions, with the square rooting. We have also extended the on-the-fly conversion algorithm; the extension seems useful for other applications, e.g., on-the-fly rounding [6]. We can design different specific versions of the algorithm, depending on the radix, the redundancy factor of the correcting-digit set, the type of representation of the residual, and the correcting-digit selection function. We can implement each version as a sequential circuit or a combinational circuit, which has a regular array structure suitable for VLSI. By providing a norm computation circuit, we can release multipliers from norm computation and increase the performance of a graphics engine.
