Composite Iterative Algorithm and Architecture for q-th Root Calculation by Vazquez, Alvaro & Bruguera, Javier
Composite Iterative Algorithm and Architecture for
q-th Root Calculation
Alvaro Vazquez, Javier Bruguera
To cite this version:
Alvaro Vazquez, Javier Bruguera. Composite Iterative Algorithm and Architecture for q-th
Root Calculation. [Research Report] RR-7564, INRIA. 2011, pp.30. <inria-00575573>
HAL Id: inria-00575573
https://hal.inria.fr/inria-00575573
Submitted on 10 Mar 2011
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of sci-
entific research documents, whether they are pub-
lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destine´e au de´poˆt et a` la diffusion de documents
scientifiques de niveau recherche, publie´s ou non,
e´manant des e´tablissements d’enseignement et de
recherche franc¸ais ou e´trangers, des laboratoires
publics ou prive´s.
appor t  

de  r ech er ch e
IS
S
N
0
2
4
9
-6
3
9
9
IS
R
N
IN
R
IA
/R
R
--
7
5
6
4
--
F
R
+
E
N
G
INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE
Composite Iterative Algorithm and Architecture for
q-th Root Calculation
Álvaro Vázquez — Javier D. Bruguera
N° 7564
March 2011

Centre de recherche INRIA Grenoble – Rhône-Alpes
655, avenue de l’Europe, 38334 Montbonnot Saint Ismier
Téléphone : +33 4 76 61 52 00 — Télécopie +33 4 76 61 52 52
Composite Iterative Algorithm and Architecture
for q-th Root Calculation
Álvaro Vázquez ∗, Javier D. Bruguera †
Thème : Algorithms, Certification, and Cryptography
Équipe-Projet Arénaire
Rapport de recherche n° 7564 — March 2011 — 30 pages
Abstract: An algorithm for the q-th root extraction, being q any integer, is
presented in this paper. The algorithm is based on an optimized implementa-
tion of X1/q = 2(1/q) log2(X) by a sequence of parallel and/or overlapped opera-
tions: (1) reciprocal, (2) digit–recurrence logarithm, (3) left–to–right carry–free
multiplication and (4) on–line exponential. A detailed error analysis and two
architectures are proposed, for low precision q and for higher precision q. The
execution time and hardware requirements are estimated for single and double
precision floating–point computations for several radices; this helps to determine
which radices result in the most efficient implementations. The architectures
proposed improve the features of other architectures for q–th root extraction.
Key-words: Integer rooting, high-radix digit-by-digit algorithms, on-line al-
gorithms, elementary function evaluation
∗ INRIA, LIP (UMR 5668 CNRS - ENS de Lyon - INRIA - UCBL), Université de Lyon,
France, Alvaro.Vazquez-Alvarez@inrialpes.fr
† Universidade de Santiago de Compostela, Departamento de Electrónica e Computación,
Spain, jd.bruguera@usc.es. Work supported in part by Ministry of Education and Science
of Spain under contract TIN 2007-67537-C03. J. D. Bruguera is a member of the European
Network of Excellence on High Performance and Embedded Architecture and Compilation
(HiPEAC)
Algorithme itératif composé et architecture pour
le calcul des racines q-ièmes
Résumé : Dans cet article, nous présentons un algorithme matériel pour
l’extraction de la racine q-ième d’un nombre X, où q est un entier naturel
non nul. Cet algorithme est basé sur une implantation optimisée de la fonc-
tion X1/q = 2(1/q) log2(X) par une séquence d’opérations parallèles et/ou super-
posées: (1) réciproque, (2) logarithme chiffre par chiffre, (3) multiplication de
gauche-à-droite sans propagation de retenue et (4) exponentielle en ligne. Une
analyse détaillée des erreurs et deux architectures sont proposées, pour q de
basse précision et pour q de précision plus haute. Le temps d’exécution et les
composants matériels à utiliser sont estimés pour des calculs en virgule flot-
tante simple et double précision et pour plusieurs bases. Cette étude aide à
déterminer quelles bases mènent aux implantations les plus efficaces. Les ar-
chitectures proposées améliorent les caractéristiques d’architectures précédentes
destinées à l’extraction des racines.
Mots-clés : Calcul des racines entières, méthodes de calcul chiffre par chiffre,
grande base, algorithmes en ligne, évaluation des fonctions élémentaires
Composite Iterative Algorithm and Architecture for q-th Root Calculation 3
1 Introduction
The design of functional units for the computation of q–th roots (X1/q) has
been a challenging task for years. The q–th root extraction includes some very
frequent operations in computer graphics, digital signal processing and scientific
computation, such as square root (X1/2), reciprocal (X−1), inverse square root
(X−1/2), cubic root (X1/3) and inverse cubic root (X−1/3), and some other less
frequent but also important functions.
The traditional approximation to q–th roots extraction has been the de-
velopment of functional units for the computation of a given root. This way,
there is a number of algorithms and implementations for the two most fre-
quent roots, the square root and the inverse square root calculation, includ-
ing linear convergence digit–recurrence algorithms and quadratic convergence
multiplicative-based methods, such as Newton-Raphson and Goldschmidt algo-
rithms [5]. There are also several approaches for the calculation of other roots
derived from the application of general methods for function evaluation to the
case of root extraction.
In general, for the calculation of a q–th root with very low precision, it is
possible to employ direct table look–up, but its high memory requirements make
it an inefficient method for single– or double–precision floating–point formats.
Polynomial and rational approximations are another way of implementing the
q–th root extraction [9]. However, one of the most efficient methods in floating–
point representation is table–driven algorithms, which are halfway between di-
rect table look–up and polynomial and rational approximations. The use of
a polynomial approximation allows the table size to be reduced and the table
look–up allows us to reduce the degree of the polynomial.
A first order piecewise linear approximation based on a Taylor expansion for
the calculation of Xp with p = ±2k or p = ±2k1 ± 2−k2 , being k and k1 any
integer and k2 any nonnegative integer, is presented in [15]. A limited number
of roots, square root, reciprocal square root, fourth root, etc., are included.
This implementation requires, besides the table to store the coefficients, just
one multiplier. Alternatively, second order polynomial approximations for the
calculation of Xy, being y any rational, are presented in [2, 10]. In these cases,
besides the look–up table, one or two multipliers and several adders are required.
On the other hand, a digit–recurrence method for the q–th root extraction
is presented in [8] and particularized to the radix 2 cube root computation in
[13]. The complexity of the resulting architecture depends on q, such as the
larger q the larger the complexity, in such a way that the architecture for the
computation of large q–th roots seems difficult to implement. Other digit–
recurrence implementations for both square and cube root computations are
described in [6, 16].
It has to be pointed out that all the methods outlined above for the extraction
of a q–root are targeted for a given q. That means that the resulting architecture
cannot be used for the calculation of a root different to that it has been designed
for. To adapt the architecture to a different q–th root requires to change the
look–up tables in the case of table–driven polynomial approximations, or to de-
sign a completely new architecture, in the case of the digit–recurrence method.
Of course, the table–driven polynomial approximations can be adapted to com-
pute more than just one q–th root, but this needs the replication of the look–up
RR n° 7564
4 Vázquez & Bruguera
Symbol Definition
X = Mx × 2
Ex floating-point radicand
n precision bits of radicand and result
nEx precision bits of the exponent of X
r radix r = 2b (b ∈ N+, b ≥ 3)
q root degree (q ∈ N , q 6=0)
Mq normalized significand of q, Mq ⊂ [0.5, 1)
Eq Normalized exponent of q, Eq ≥ 0
X1/q = 2(1/q) log2(X) q-th root function
nq precision bits of |q| or Mq
Mz Significand of the result
Ez Exponent of the result
T = (1/q) log2(X) argument of the q-th root function
I integer part of T
F fractional part of T
D = 1/q reciprocal of the root degree
S = log2(X) = Ex+log2(Mx) logarithm of the radicand
S∗ logarithm of the radicand shifted by 2−Eq
L = log2(Mx) logarithm of the radicand significand
Dj , Lj , S∗j , Tj , Ej high-radix digits of intermediate operands
D[j], L[j], T [j], E[j] intermediate results at iteration j
wd[j], wl[j], wm[j], we[j] intermediate residuals at iteration j
nd, nl, ns, nm, ne precision bits of intermediate operands
gd, gl, gm, ge guard bits of intermediate operans
Nd, Nl, Ns, Nm, Ne latencies of intermediate operations
Notation for operations: d or D for reciprocal, l or L for logarithm,
m or T for multiplication, s or S∗ for shifting, e or E for exponential.
Table 1: Symbols and notation
tables. In any case, the methods above cannot be considered as general methods
for the calculation of any q–th root.
The only architecture in the literature for the q–th root extraction for any q,
except the naive implementation of X1/q into a cascaded reciprocal-logarithm-
multiplication-exponential chain, has been presented in [12]. This architecture
was designed for the computation of the powering function Xp, with p any
integer, based on a logarithm-multiplication-exponential chain implementation
speeded–up by using redundancy and on–line arithmetic, and extended to the
computation of X1/q. However, the extended architecture for the q–th root
extraction is hard to implement, because in addition to the operations in the
chain, it includes an integer division and requires the calculation of the modulus
of the division.
In this paper we give a detailed description of an optimized composite it-
erative algorithm for the computation of X1/q, for a floating–point operand
X = (−1)sx ×Mx × 2
Ex and an integer operand q. The algorithm is based
on the architecture presented in [12] for the computation of Xp, being p any
integer, and the final result is computed as X1/q = 2(1/q)×(Ex+log2 Mx) through
a sequence of overlapped operations: high–radix digit–recurrence logarithm,
high–radix left–to–right carry–free multiplication and on–line high–radix expo-
INRIA
Composite Iterative Algorithm and Architecture for q-th Root Calculation 5
nential. This formulation avoids the integer division and the modulus operation
of the extension of the algorithm in [12] to the calculation of the q–th root.
Besides, the proposed q-th root computation could be easily integrated into the
dataflow of the powering architecture [12], allowing the evaluation of numerous
powering and q-th root operations, or even logarithms and exponentials.
We propose two different implementations of the algorithm. First, we pro-
pose an algorithm for the computation of X1/q with low precision values of q,
in such a way that 1/q can be obtained from a look–up table. Later, we modify
the algorithm to overcome this limitation in the precision of q. We perform a
detailed error analysis to determine the size of the intermediate operands and an
analysis of the tradeoffs between area and speed for determining which radices
result in the most efficient implementations.
The rest of the paper is structured as follows: we first focus on the algorithm
for q–th root extraction for low precision q, giving a detailed description of the
algorithm and its error analysis in Section 2. The architecture implementing
the algorithm is presented in Section 3. In Section 4, we present the modifi-
cation of the algorithm and the architecture to larger precision q values. The
evaluation of the architecture for several different radices and the comparison
with other implementations is given in Section 5. Finally, the main conclusions
are summarized in Section 6.
2 Algorithm for q-th root Calculation
In this Section we present a new composite algorithm for the computation of
the q-th root function X1/q and its error analysis, being X a floating-point
number and q = (−1)sqabs(q) a nq + 1-bit signed integer exponent (see Table
1 for symbols and notation used). This algorithm is based on the algorithm
presented in [12] for the computation of the powering function Xq, although
there are important differences between both algorithms.
The algorithm presented in this Section is intended for low precision values
of q, or when only a reduced subset of q values is interesting for the application,
such that 1/q can be obtained efficiently from a look-up table. In Section 4
we propose a modification of the algorithm that avoids this limitation on the
precision of q.
2.1 Overview
The algorithm for the q-th root calculation (q 6= 0) is derived as follows
X1/q = 2log2(X
1/q) = 2(1/q) log2(X) (1)
Considering a floating–point operand X = Mx × 2Ex , being Mx the n-bit sig-
nificand and Ex the nEx-bit signed exponent,
X1/q = 2(1/q) log2(Mx×2
Ex )
= 2(1/q)(log2(Mx)+log2(2
Ex ))
= 2(1/q) log2(Mx) × 2log2(2
Ex/q)
= 2(1/q) log2(Mx) × 2Ex/q (2)
RR n° 7564
6 Vázquez & Bruguera
Equation (2) could be taken as the floating–point result for the q-th root calcu-
lation, being 2(1/q) log2(Mx) the significand and Ex/q the exponent; however, the
main problem for this interpretation is that the exponent is not an integer. In
[12], it has been suggested that this situation can be handled by decomposing
the integer division Ex/q into integer and fractional part, in such a way that
Ex/q = ⌊Ex/q⌋+(1/q)×Ex%q being % the modulus of the division. This way,
equation (2) could be rewritten as follows
X1/q = 2(1/q) log2(Mx) × 2(1/q)×Ex%q × 2⌊Ex/q⌋ (3)
being 2(1/q) log2(Mx)×2(1/q)×Ex%q and ⌊Ex/q⌋ the significand and the exponent,
respectively. However, equation (3) is hard to implement due to integer division
and modulus operations.
Therefore, another transformation to equation (2) must be used. To avoid
the integer division Ex/q equation (2) can be rewritten as
X1/q = 2(1/q) log2(Mx×2
Ex )
= 2(1/q)(Ex+log2(Mx))
= 2(1/q)×S
where S = Ex+log2(Mx) is the concatenation of the digits of Ex (integer value)
and log2(Mx) ⊂ [0, 1).
Therefore,
X1/q = 2S/q (4)
According equation (4) the q-th root can be calculated as a sequence of
operations: logarithm of the significand Mx (log2(Mx) ⊂ [0, 1)), addition of Ex
and log2(Mx) (concatenation of binary strings), division by q and exponential
of the result of the division. For an efficient implementation, the computation of
the operations involved must be overlapped. This requires a left–to–right most–
significant digit first (MSDF) mode of operation and the use of a redundant
representation.
A problem of the algorithm above is the range of the exponential function
2S/q. Digit–recurrence exponential algorithms require the argument to be in
the interval (−1, 1), while S/q is out of the range. To extend the range of
convergence and guarantee the convergence of the algorithm, the integer and
fractional parts of the argument of the exponential must be extracted serially
and equation (4) must be rewritten.
If T = S/q is computed and its integer I and fractional F parts are extracted,
2T becomes
2T = 2I × 2F (5)
so that the argument of the exponential 2F is now in (−1, 1).
Then, replacing in equation (4), the q-th root is obtained as follows
X1/q = 2F × 2I (6)
with
Mz = 2
F , Ez = I (7)
being Mz and Ez the significand and exponent of X1/q. This results in a
bounded argument for the exponential.
INRIA
Composite Iterative Algorithm and Architecture for q-th Root Calculation 7
Mx
log2 Mx
q
1/q
E
x
D=1/q
I1
F1 F2 F3 F4
I2
DxS DxS DxS
log2 Mx
DxS DxS DxS
2F 2F 2F 2F
Ez
To on−the−fly conversion
Mz
S1 S2 S3
log2 Mx
S4
log2 Mx
S
−1 S0
F[0] = F1+F2
LU TABLES
on−line delay = 2
SD radix−128 conversion
LU TABLE
logarithm
LRCF mult
exponential
reciprocal
Figure 1: Sequence of operations of the algorithm.
We have first implemented the q-th root algorithm for low precision values of
q and a generic radix r = 2b. An example is shown in Fig. 1 for a single-precision
operand X and radix r = 128. The sequence of operations is as follows:
1. Evaluation of D = (−1)sq×abs(1/q) using a lookup table of nq inputs (low
precision) and nd outputs (1 integer bit and nd − 1 fractional bits), with
abs(1/q) ⊂ (0, 1] and Eq > 0. Operand D is obtained in a non-redundant
binary form (Section 3.1).
2. Evaluation of the logarithm L = log2(Mx) ⊂ [0, 1) to a precision of nl bits
using a high-radix algorithm (see Section 3.2).
3. Multiplication T = D × S (see Section 3.3). Operand
S =
⌈nl/b⌉−1∑
i=−⌈nEx/b⌉+1
Si r
−i = Ex + L
is obtained serially by concatenating the digits of Ex and L, with Ex ⊂
[−2nEx−1, 2nEx−1 − 1] and L expressed in a signed-digit radix-r form. We
evaluate this multiplication using a LRCF (left-to-right carry-free) multi-
plier [3]. The product of this fixed point multiplication has at most nEx
significant integer bits. The maximum number of accurate fractional bits
required is calculated in Section 2.2.
4. Serial extraction of the integer I and fractional F parts of T , and on-the-
fly conversion of I to a non-redundant representation. The integer part I
corresponds to the γ = ⌈nEx/b⌉ leading digits of T .
5. Online high-radix exponential 2F ⊂ (0.5, 2) with argument F = frac(T ) ⊂
(−1, 1), precision of ne bits, and online delay δ = 2 (see Section 3.4). The
redundant result is normalized and rounded to n bits using an on-the-fly
rounding unit [4].
The latency of the algorithm for r = 2b ≥ 8 is given by:
Nq−root = 1 + γ + (δ + 1) +Ne
RR n° 7564
8 Vázquez & Bruguera
where δ = 2 and Ne = ⌈ne/b⌉ are respectively the online delay and the latency
of the exponential 2F , and γ = ⌈nEx/b⌉ is the number of radix-r integer digits
of T . We have performed an error analysis to obtain an estimation of the
precisions and latencies for the intermediate operations. For the example of
Fig. 1, the number of cycles of the logarithm, multiplication and exponential
are respectively Nl = 4, Nm = 6 and Ne = 4, while the latency of the q-th root
algorithm is Nq−root = 10 cycles.
2.2 Error Analysis
If we evaluate X1/q using the previous sequence of operations, we get an ap-
proximation X1/qα (we use a subindex α to indicate an approximated value) with
the following contributions to the error, represented by the different ε’s:
1. Reciprocal D = 1/q. The output of the reciprocal unit is D + εrec
2. Logarithm L = log2(Mx): The output of the module is L+ εlog
3. Multiplication T = D× S with S = Ex +L. The output of the multiplier
is given by
Tα = T +D × εlog + S × εrec + εlog × εrec + εmul
In order to simplify the previous expression we use
εf = D × εlog + S × εrec + εlog × εrec + εmul (8)
so
Tα = T + εf
4. Extraction of the integer I and fractional F parts of T = D × S. The
integer part is of the form Iα = ⌊Tα⌋ = I + ⌊F + εf⌋. The fractional part
is given by
Fα = frac(Tα) = F + εf − ⌊F + εf⌋
5. Operation 2F . The output of the exponential computation is 2Fα + εexp.
So the approximation of X1/q is as follows
X1/qα = (2
Fα + εexp)× 2
Iα
= (2F+εf−⌊F+εf⌋ + εexp)× 2
I+⌊F+εf⌋
= (2F+εf + εexp)× 2
I
= X1/q × 2εf + εexp × 2
I (9)
with X1/q = 2F × 2I .
Since F ⊂ (−1, 1) is expressed in signed-digit, then 2F ⊂ (0.5, 2) and we
are considering n precision bits for the final normalized result, this implies that
1 ulp = 2n. Assuming rounding to the nearest even, the error bound for the
exponential is 2−(n+1). Then, it must be verified that
|X1/q −X1/qα | ≤ 2
−(n+1) × 2I
INRIA
Composite Iterative Algorithm and Architecture for q-th Root Calculation 9
Replacing X1/qα by expression (9), the previous condition is transformed into
|X1/q × (1− 2εf )− εexp × 2
I | ≤ 2−(n+1) × 2I
Using X1/q = 2F × 2I and taking out the common factor 2I , the previous
condition can be expressed as
|2F × (1− 2εf )− εexp| ≤ 2
−(n+1)
Next, we compute an upper bound for the left term of this inequality. Since
2F ⊂ (0.5, 2), we replace it by 2, and considering εf << 1 we use the approxi-
mation 2εf ≈ 1 + (ln(2)/2)εf + O(ε2f ) ≤ 1 + εf/2. We introduce these bounds
in the previous expression obtaining
|εf + εexp| ≤ 2
−(n+1)
The critical parameter to minimize first is εexp, since it is directly related to
the latency of the algorithm. The minimum possible value for the upper bound
for the exponential error is:
|εexp| ≤ 2
−(n+2)
Then we have
|εf | ≤ 2
−(n+2)
To obtain the highest contribution to the error εf given by expression (8)
we replace D = 1/q ≤ 1 by 1 and S = Ex + log2(Mx) ≤ 2
nEx by 2nEx . Then,
|εlog + 2
nEx × εrec + εlog × εrec + εmul| ≤ 2
−(n+2)
To simplify we put the same upper bound ε for the errors of the logarithm and
multiplication, that is, εlog ≤ ε, εmul ≤ ε, while for the error of the reciprocal
we consider εrec ≤ 2−nEx × ε. Then we get
|(3 + 2−nEx)× ε| ≤ 2−(n+2)
so that an upper bound for the error of the logarithm and multiplication is
|ε| ≤ 2−(n+4)
For the reciprocal we get
|εrec| ≤ 2
−(n+nEx+4)
Considering the bits required for representing the integer parts (1 bit for the
reciprocal abs(1/q), nEx bits for the multiplication), the required precision and
minumum latency values for each intermediate operation (nd and Nd for exact
rounded reciprocal, nl and Nl for logarithm, nm and Nm for multiplication, and
ne and Ne for exponential) are shown in Table 2 parametrized as a function
of n and nEx. We also show the latency for the q–root computation and the
corresponding values for single (SP) and double (DP) precision with r = 128.
These values will be used to determine the number of guard bits gl, gm and ge
to guarantee the required precisions nl, nm, ne in the corresponding units.
RR n° 7564
10 Vázquez & Bruguera
Target Precision (bits)
Precision nd nl nm ne nq−root
(n, nEx) nEx+n+4 n+4 nEx+n+4 n+2 n
SP (24,8) 36 28 36 26 24
DP (53,11) 68 57 68 55 53
Target Latency (cycles)
Precision Nd Nl Nm Ne Nq−root
(n, nEx) 1 ⌈(n+4)/b⌉ γ+⌈(n+4)/b⌉ ⌈(n+2)/b⌉ 2+δ+γ+Ne
SP (24,8) 1 4 2+4 4 10
DP (53,11) 1 9 2+9 8 14
Table 2: Example of Parameters for q-th root Computation (r = 128, δ = 2, γ =
⌈nEx/b⌉)
Mx
High−Radix 
Logarithm
q Ex
Ez
Mz
On−line 
High−Radix 
Exponential
LRCF Multiplier 
(1/q) xSj
Reciprocal 
(Look−up 
table)
On−the−fly 
conversion
Sj
LjExj
Fj
Ij
1/q
Figure 2: Block diagram of the architecture.
3 Implementation
We propose a sequential architecture for the implementation of the algorithm
described in Section 2. Fig. 2 shows the general block diagram of the architec-
ture. Single thick lines represent long-word operands (around n bits), single thin
lines represent short-word operands (around b or nEx bits), and double lines rep-
resent redundant signed radix-r digits in a borrow-save format (or signed-digit
radix 2). The high-radix logarithm, LRCF multiplication, and on-line high-
radix exponential units are similar to those implemented in [12]. A detailed
description of an on-the-fly conversion unit can be found in [4].
To allow faster execution of iterations in these units we opted for repre-
senting all variables in a redundant borrow-save representation. An advantage
of borrow-save over carry-save representation is an easier conversion of signed
radix-r digits. Moreover, a borrow-save adder can be implemented as a carry-
save adder with some inverted inputs and outputs. Next, we outline the main
computations involved.
INRIA
Composite Iterative Algorithm and Architecture for q-th Root Calculation 11
nq
(n, nEx) 5 6 7 8 9 10 11 12
SP (24,8) 1.12 2.25 4.5 9 18 36 72 144
DP (53,11) 2.12 4.25 8.5 17 34 68 136 272
Table 3: Size (in Kbits) of look-up tables for reciprocal computation.
3.1 Table Lookup for Reciprocal
If q is a low-precision operand (precision of abs(q) nq ≤ 12 bits), we can use
a look-up table to compute efficiently its reciprocal abs(1/q) ⊂ (0, 1]. As we
have detailed in Section 2.2, we need n+nEx +4 precision bits to represent the
correctly rounded reciprocal abs(1/q). Therefore, the size of the look-up table
is 2nq × (n+ nEx + 4) bits. We assume a 1-cycle latency for the look-up table.
In Table 3 we show the size of the corresponding look-up table for single and
double precision results and several values of nq.
3.2 High-Radix Logarithm
For the computation of log2(Mx) we use an optimized high-radix digit-recurrence
algorithm similar to that implemented in [12]. This algorithm is based on the
identity
log2(Mx) = log2(Mx
∏
fj)−
∑
log2(fj)
such that ifMx
∏
fj ← 1 then −
∑
log2(fj) ← log2(Mx), where constant factors
of the form fj = (1 + ljr−j) allow the use of a shift-and-add implementation.
For radix r = 2b the recurrences for the logarithm L[j] and the residual
wl[j + 1] result in
L[j + 1] = L[j]− log2(1 + ljr
−j)
wl[j + 1] = r(wl[j] + lj + ljwl[j]r
−j)
The number of iterations required to obtain the logarithm with nl bits of accu-
racy is Nl = ⌈nl/b⌉.
The block diagram of an implementation of the previous recurrences is shown
in Fig. 3. We use selection by rounding to obtain the high-radix digits lj , with
|lj | ≤ (r − 1), except for the first digit l1, which is obtained by a lookup table.
The rounding is performed on an estimate ̂wl[j + 1] obtained by truncating
to t fractional bits the borrow-save residual wl[j + 1]. The constants log2(1 +
ljr
−j) are stored in lookup tables. In order to simplify these tables, the loga-
rithm constants are approximated by −lj/ln(2) from iteration 1 + ⌈Nl/2⌉.
At the end of each iteration j ≥ 1 we obtain a high-radix digit Lj of the
logarithm in a borrow-save form. The recurrence L[j] is scaled by r so that
each digit Lj is obtained by the extraction of the b leading bits of the scaled
borrow-save L[j].
3.3 LRCF Multiplication
The left-to-right carry-free (LRCF) multiplication [3] produces the product dig-
its Tj from a redundant set in a most-significant-digit-first (MSDF) manner.
RR n° 7564
12 Vázquez & Bruguera
BSA 3:2
log[(nl/2b)−1]
nl+gl
Round & 
assim.
Mux−3
TABLE 
−rj−1log (1+l j r−j)
b+1
Mult−Add
Mux−2
TABLE 
−log (1+l1 r−1)
j
−lj
b+1
TABLE 
−lj /ln(2)
b+1
MxBSA 4:2
Mux−2
*
>>r−1
Round & 
recod.
nl+gl
nl+gl
nl+gl
nl+gl
Mux−2
r2(1−Mx)
* +
nl+gl
b+1
b
b+1
nl+gl
nl+gl
b
b
b+t
b
nl+gl−b
nl+gl
rwl[j]Mx
n
Lj
nl+gl−b
nl+gl−b
nl+gl
b
L[j]
rL[j]
to the LRCF_mult
BARREL 
SHIFTER
l1
lj
lj
TABLE 
l1
n
b+1
nl+gl
b+tb+t
nl+gl
j nl+gl
wl[j]
r−j+1wl[j]
b+1
Figure 3: Block diagram of the logarithm stage.
We use a LRCF multiplier to perform the intermediate multiplication in the
low precision q-th root algorithm. We adapt the original architecture to fit our
requirements as in [12]. This stage computes the following two recurrences for
the residual wm[j + 1] and the product T [j]:
wm[j + 1] = r(frac(wm[j] +D × Sj+1)
Kj = ⌊wm[j] +D × Sj+1⌋
T [j + 1] = T [j] +Kjr
−j−1
where the maximum value of |Kj | < 3r/2 can be larger than (r − 1). Before
starting the computation of iterations, the radix points of operands D and S
are adjusted such that |D| < 1 and |S| < 1.
The block diagram of the LRCF multiplication stage is shown in Fig. 4.
The LRCF multiplier consists of a multiply-add unit which computes wm[j]+
D × Sj+1 and a recoding block to obtain Tj from Kj and Kj−1. The inputs
to the multiply-add unit are the operand D = 1/q, the high-radix digits Si (in
borrow-save format) of the operand S = Ex + log2(Mx) and the residual wm[j].
The digits Sj come either from the high-radix representation of Ex or from
the evaluation of the logarithm (digits Lj). In each iteration the operand D
is multiplied by a high-radix digit Si and accumulated to the previous residual
wm[j]. The next residual is obtained by scaling by r the fractional part of the
result of the multiply-add operation.
The product T has integer I and fractional F parts. Since we have consider
similar error bounds for the logarithm and multiplication (see Section 2.2),
we need to compute the result within nm = nEx + nl bits of accuracy. The
integer part consists of the γ = ⌈nEx/b⌉ most significant digits Tj obtained as
Tj = recod(Kj ,Kj−1). These digits are passed to a on-the-fly conversion unit.
INRIA
Composite Iterative Algorithm and Architecture for q-th Root Calculation 13
nm+gm
Mult−Add
*
Recoding
1/q
* +
b
wm[j]
Sj
nd+gd
Kj−1
b
wm[j]
Tj
b
nm+gm
K
K
T
j
j
j
to on−the−fly_CONV or 
OL_HR_EXP
from HR_LOG  or Exfrom RECP_LUT  
Figure 4: Block diagram of the LRCF multiplier.
Since the fractional part is obtained with nl bits of accuracy, in the remaining
iterations we compute Nl = ⌈nl/b⌉ digits of the fractional part of T , which are
used by the exponential unit without conversion. The total number of iterations
of the LRCF multiplication is Nm = γ +Nl.
3.4 On-line high-radix exponential
The online high-radix algorithm for the computation of 2F is detailed in [12].
This algorithm is based on the identity
2F = (
∏
hj)2
F−
∑
log
2
(hj)
with hj = (1 + ejr−j). The input operand F is only available up to the j + δ
digit at iteration j, as F [0] =
∑δ
i=1 Fjr
−j , with δ the online delay and Fj the
digits of F . The recurrences for the residual and the exponential are given by:
E[j + 1] = E[j](1 + ejr
−j)
we[j + 1] = r(we[j]− r
j log2(1 + ejr
−j) + Fj+δr
−δ)
with j ≥ 1, E[1] = 1 and we[1] = rF [0]. The number of iterations required to
obtain the exponential with ne bits of accuracy is Ne = ⌈ne/b⌉.
The block diagram of the exponential unit is shown in Fig. 5. We use
selection by rounding to obtain the high-radix digits ej , with |ej | ≤ (r − 1),
except for the first digit e1, which is obtained by a lookup table. The lookup
tables used in the first iteration are addresses one cycle in advance, since F [0]
is known while F1+δ is being computed. The rounding is performed on an
estimate ̂we[j + 1] obtained by truncating to t fractional bits the borrow-save
residual we[j + 1].
RR n° 7564
14 Vázquez & Bruguera
BSA3:2
log[(ne/2b)−1]
ne+ge
Round & 
assim.
Mux−3
TABLE 
−rj+1log (1+e j r−j)
b+1
Mult−Add
TABLE 
−log (1+e1 r−1)
j
−ej
b+1
TABLE 
−ej /ln(2)
b+1
Mux−2
*
<<r1
Round & 
recod.
*
+
b+1
b
b+1
b+t
rwe[j]
Mz
ne+ge−b
r2F[0]
from the LRCF_mult
BARREL 
SHIFTER
ej
TABLE 
e1
b+tb+t
ne+ge
j
r−j+1E[j]
F1
b+1
e1
Fj+d r−d+1
ne+ge
rE[j]
On−the−fly 
Conversion
rE[j+1]
we[j]
BSA 4:2
Mux−2
ne+ge ne+ge
ne+ge
ne+ge
ne+ge
ne+ge
b
ne+ge
ne+ge
from the 
LRCF_mult
<<r2
F1
b−bit CLA
b+1
b
n
Ej+1
b
ne+ge−b
ne+ge
E[j]
Figure 5: Block diagram of the exponential stage.
The constants log2(1+ejr
−j) for iterations 2 to ⌈Ne/2⌉ are stored in lookup
tables. The exponential constants are approximated by −rej/ln(2) from itera-
tion 1 + ⌈Ne/2⌉. At the end of each iteration we obtain a high-radix digit Ej of
the logarithm in a borrow-save form. The recurrence E[j] is scaled by r so that
each digit Ej is obtained by the extraction of the b leading bits of the scaled
borrow-save E[j].
The conversion of the high-radix digits Ej of the exponential to the non
redundant binary result is performed in an on-the-fly rounding unit [4], avoiding
an increase in neither the latency of the algorithm nor the cycle time.
4 Implementation for higher precision q
The main limitation of the algorithm proposed in Section 2 is the range of q.
If the precision nq of abs(q) is significantly high (nq > 12), a direct extraction
of abs(1/q) from a look-up table is not efficient. In this case, the algorithm of
Section 2 needs to be modified.
For the computation of the reciprocal we propose a high-radix iterative al-
gorithm described in Section 4.1. The divisor is required to be in the conver-
gence range of the algorithm, which in our case is [0.5, 1). Therefore abs(q) ⊂
[1, 2nq − 1] is normalized into Mq ⊂ [0.5, 1), such that abs(q) = Mq2Eq , with
Eq > 0. Thus, the computation of 1/q is replaced by the evaluation of D×2−Eq
with D = (−1)sq × (1/Mq), so that the algorithm performs 2D×2
−Eq×S with
INRIA
Composite Iterative Algorithm and Architecture for q-th Root Calculation 15
Mx
log 2 Mx
q
Ex
I1
F1 F2
F3 F4
I2
DxS* DxS* DxS*
log 2 Mx
DxS* DxS*
2F 2F 2F 2F
Ez
To on−the−fly conversion
Mz
S1 S2 S3
log2 Mx
S4
log2 Mx
S
−1
F[0] = F1+F2
LU TABLES
on−line exp. delay = 2
SD radix−128 conversion
logarithm
LRCF mult
exponential
reciprocal
x2−Eq
S*1 S*2 S*3 S*4
S*5
digit shifter
1/Mq 1/Mq 1/M q 1/M q 1/M q 1/M qNormlz
LZD
D1 D3 D4 D5
S*[0] = S*
−1+S*0
lzq
(Eq/b, Eq%b)
D6
on−line mult. delay = 2
Mq
x2−Eq x2−Eq x2−Eq x2−Eq x2−Eq x2−Eq
S0
DxS*
S*
−1 S*0
D2
S[0](1+D 1)
INIT MULT 
TABLES
Figure 6: Sequence of operations of the modified algorithm.
S = Ex + log2(Mx). In order to reduce the latency of the algorithm, the factor
2−Eq is first multiplied by S and the result by D.
An example of the operation flow of the modified q-th root algorithm for
single precision and r = 128 is shown in Fig. 6. The sequence of operations of
the algorithm for nq > 12 is as follows:
1. Evaluation of the number of leading zeros lzq ⊂ [0, nq − 1] of q using a
leading zero detector (LZD). The LZD is enhanced (see Section 4.2) to
allow the computation of ⌊Eq/b⌋ and Eq%b in parallel, with the exponent
Eq given by Eq = nq − lzq.
2. Normalization of abs(q) into the range [0.5, 1). A nq-bit barrel shifter
is used to obtain the normalized divisor Mq ⊂ [0.5, 1) by shifting q an
amount of lzq bits to the left.
3. Evaluation of D = (−1)sq × (1/Mq) ⊂ (−2, 2) using a high-radix iterative
algorithm with multiplicative decomposition (see Section 4.1). The digits
Dj of the reciprocal are in a signed-digit radix-r redundant form.
4. Evaluation of the logarithm L = log2(Mx) ⊂ [0, 1) to a precision of nl bits
using the high-radix algorithm described in Section 3.2.
5. Multiplication of S∗ = S × 2−Eq , with S = Ex + log2(Mx). Since S is
obtained digit by digit in a signed-digit radix-r form, this multiplication
is implemented as a composition of two different right shifts: a first shift
of S by ⌊Eq/b⌋ radix-r digits and a second shift of each digit Si of S by
Eq%b bits. A signed-digit shifter that implements this operation serially
is described in Section 4.2.
6. Multiplication T = D × S∗ (see Section 4.3). Since the digits of the
reciprocal D (Di) and the operand S∗ (S∗j) are obtained serially, we
implement this multiplication using an online multiplier [17] with online
delay δ = 2.
7. Serial extraction of the integer I and fractional F parts of T , and on-the-fly
conversion of I to a non-redundant representation.
RR n° 7564
16 Vázquez & Bruguera
Precision of intermediate operations (bits)
Target Precision nd nl ns nm ne
(n, nEx, nq) nEx+n+4 n+4 nEx+n+5 nEx+n+4 n+2
SP (24,8,32) 36 28 37 36 26
DP (53,11,64) 68 57 69 68 55
Latency of intermediate operations (cycles)
Target Precision Nd Nl Ns Nm Ne
(n, nEx, nq) ⌈nd/b⌉ ⌈nl/b⌉ γ+⌈(nl+1)/b⌉ γ+⌈nl/b⌉ ⌈ne/b⌉
SP (24,8,32) 6 4 2+5 2+4 4
DP (53,11,64) 10 9 2+9 2+9 8
Target Precision nq−root (bits) q–th root latency
(n, nEx, nq) n 3+2δ+γ+⌈ne/b⌉
SP (24,8,32) 24 13
DP (53,11,64) 53 17
Table 4: Example of Parameters for q-th root Computation (r = 128, δ = 2, γ =
⌈nEx/b⌉)
8. On-line high-radix exponential 2F ⊂ (−2, 2) with argument F = frac(T ) ⊂
(−1, 1), precision of ne bits, and online delay δ = 2 (see Section 3.4). The
redundant result is normalized and rounded to n bits using an on-the-fly
rounding unit [4].
To obtain the required precisions for the different intermediate operations,
we have performed an error analysis similar to the analysis in Section 2.2. The
required precisions for each operation (reciprocal, logarithm, signed-digit shift-
ing, multiplication and exponential) are shown in Table 4. These parameters
determine the latencies of the intermediate operations for radix r = 2b. Table
4 also shows the required precision and latency of each operation for single and
double precision and r = 128. Besides, we consider a 1-cycle latency for the LZD
and coding of shifting amounts Eq/b and Eq%b, one cycle for the normalization
of q and Nd = ⌈nd/b⌉ iterations for the high-radix fixed-point reciprocal unit.
The precision of q only has direct impact in the latency and area of the
LZD and the barrel shifter used for normalization, but not in the latency of the
high-radix iterative reciprocal unit.
The latency of the algorithm for r = 2b ≥ 8 is given by:
Nq−root = 1 + γ + 2× (δ + 1) +Ne
where δ = 2 and Ne = ⌈ne/b⌉ are respectively the on-line delay and the latency
of the exponential 2F , and γ = ⌈nEx/b⌉ is the maximum number of radix-r
integer digits of the multiplication product.
A sequential architecture for the implementation of the modified algorithm
is shown in Fig. 7.
Next, we only describe the new units, that is, high-radix reciprocal unit,
the LZD, the signed-digit shifter and the on-line multiplier. The high-radix
logarithm and on-line high-radix exponential units were detailed in Section 3.
INRIA
Composite Iterative Algorithm and Architecture for q-th Root Calculation 17
Mx
High−Radix 
Logarithm
q Ex
Ez
Mz
On−line 
High−Radix 
Exponential
On−line High−Radix 
Multiplier
High−Radix 
Reciprocal
On−the−fly 
conversion
S*j
LjExj
Fj
Ij
Dj
LZD
SD radix−r shifter
Word−length 
barrel shifter
Sj
Mq
Eq%b
lzq
Eq/b
Figure 7: Block diagram of the modified architecture.
4.1 High-radix digit-recurrence reciprocal
High-radix SRT methods [5] use an additive decomposition for the reciprocal.
To allow selection by rounding for the quotient digits in these type of methods,
the initial residual needs to be scaled, which introduces additional delay. When
a correct rounded reciprocal is not a requirement, we can use a multiplicative
decomposition of 1/Mq instead of a sum of terms to avoid the initial scaling [5].
Thus, we decompose the reciprocal in a multiplication of factors determined by
a digit, that is, 1/Mq =
∏∞
i=1(1 + dir
−i), such that at iteration j the reciprocal
is approximated by Q[j] =
∏j
i=1(1 + dir
−i).
Observing that MqQ[j] → 1 when j → ∞, we define a scaled residual as
wd[j] = r
j(1 −MqD[j]), obtaining the following recurrences for the reciprocal
and the residual:
Q[j + 1] = Q[j](1 + dj+1r
−j−1)
wd[j + 1] = rwd[j]− dj+1 + dj+1wd[j]r
−j
with Q[0] = 1. As in the SRT algorithms, the convergence is linear, obtaining
(roughly) one digit of the result in each iteration. For Nd = ⌈nd/b⌉ iterations
we obtain nd precision bits of the reciprocal (|εd| ≤ 2−nd). Instead of Q[j], a
scaled reciprocal D[j] = Q[j]rj is computed in order to extract a radix-r digit
Dj per iteration from the leading position in all iterations j ≥ 1 (digit D0 = 1
is implicit).
The selection of digits dj+1 ∈ {−(r − 1), . . . , r − 1} is performed as dj+1 =
round( ̂rw[j]), where ̂rw[j] denotes the r-shifted residual w[j] truncated to t
fractional bits. To ensure convergence with selection by rounding for j > 1
and Mq ⊂ [0.5, 1), d1 is selected from the over-redundant digit set {−(2r −
1), . . . , 2r − 1} by table look-up. This table is addressed by the b + 2 most
significant bits of Mq.
The block diagram of this unit is shown in Fig. 8. A more detailed descrip-
tion of this unit can be found in [18].
RR n° 7564
18 Vázquez & Bruguera
Mult−Add
Mux−2
Mux−2
*
Round & 
recod.
2(1−Mq)
*
+
b+1 b
b+1
b+t
rwd[j]
Mq
nq
BARREL 
SHIFTER d1
dj
TABLE 
d1
b+2
b+t
j
wd[j]
r−j+1wd[j]
nd+gd
wd[j]
nd+gd
nd+gd
nd+gd
nd+gd
nd+gd
nq
b+1
Mult−Add
Mux−2
*
Mux−2
*
+
b
rD[j]
BARREL 
SHIFTER
j
r−j+1D[j]
D[j]
nd+gd
nd+gd
nd+gd−b
nd+gd
b+1
nd+gd
nd+gd
Dj
D[0]r−1
D[0]=1
nd+gd−b
BSA 4:2
Mux−2
nq
b
b
b nd+gd−b
dj
to OL_HR_mult
from NORM_q
Figure 8: Block diagram of the reciprocal unit.
4.2 Enhanced LZD and signed-digit shifter
Since S = Ex+log2(Mx) is obtained in the form of a sequence of γ+Nl radix-2
b
digits, the multiplication S∗ = S × 2−Eq can only be performed as a digit right
shift of S when Eq is an integer multiple of b. A more general solution is to
decompose Eq in two terms as
Eq = ⌊Eq/b⌋ × b+ Eq%b
so that the operation is computed as a compound right shift of ⌊Eq/b⌋ radix-r
digits plus Eq%b bits. Provided that the maximally redundant radix-r digits
Si are represented in a borrow-save (or carry-save) form and the radix is an
integer power of 2 (r = 2b), the right shift of Eq%b bits can be implemented as
a regular binary shift 1. The serial architecture proposed is shown in Fig. 9.
The unit processes a digit S∗i per cycle (0 ≤ i < Ns). First, a right shift of
Eq%b < b bits is performed over Sj (−γ < j < Nl) using two barrel shifters of
2b bits. Since this binary shift crosses the digit boundaries, the previous input
digit Sj−1 is latched and then concatenated at cycle i to the left of Sj . The
initial value of this latch is 0. The b most significant bits of the two barrel shifter
output are passed to a level of latched 2:1 multiplexers controlled by a signal2
rsa = ⌊Eq/b⌋ < Ns.
This signal represents the number of radix-r digit shifted to the right in a
hot-one code. The output of the barrel shifters is stored in the latch which
corresponds to the number of shifted positions counted from the right (from
1That is, a bit shifted out to the right of a radix-r digit halves its value.
2This bound is ensured by error analysis, so that S∗ is computed with enough precision
bits ns.
INRIA
Composite Iterative Algorithm and Architecture for q-th Root Calculation 19
Sj
Eq%b
b
log2(b)
1
0
b
2b−bit barrel shifters>>
Mux−2
b
S*j
Mux−2
b
Mux−2
b
0 Mux−2
b b b
1 0 1 0 1 0 1
rsa0rsaNs−1 1 1 1
b
bb b
b
Sj−1
Figure 9: Block diagram of the serial signed-digit shifter.
q
log log
22
(b)(n )
E%b
E%b
lz rsa
N
n
n
q
f qn −2
q
q
Binary encoder
q
mod(b) encoder
s
radix−r encoder
Hot−one code
Hot−one code fn  −1f q0 q
Prefix Tree of OR gates
q
f0 f fn  −1q
n  −1q q1
1
qn −2
qq
0
q
q= 0000100101001111
f= 0000100000000000
not(OR)= 1111100000000000
OR= 0000011111111111
q
 = 0 (00)
Example: qn=16 b=4
lz= 4 (0100)
rsa= 3 (00010)
Figure 10: Enhanced leading zero detector.
Ns − 1 to 0). The latch outputs are shifted one radix-r digit to the right each
cycle, while S∗i corresponds to the value stored in the rightmost latch.
The shifting amounts signals rsa and Eq%b are obtained with a slight mod-
ification of the decoding logic in the LZD using the relation Eq = nq − lzq.
The architecture of this enhanced LZD is shown in Fig. 10. First, a hot-
one code string f0 . . . f1f2fnq−1 indicating the position of the leading one of
q =
∑nq−1
i=0 qi 2
nq−1−i is obtained using a tree of OR gates (prefix tree, carry
lookahead...) and a level of AND-2 gates with an inverted input. These signals
are computed as
fi = qi ·OR
i−1
j=0 qj
Then, lzq , rsa = ⌊Eq/b⌋ and Eq%b are computed as
lzq =
nq−1∑
i=0
i× fi
rsa =
nq−1∑
i=0
⌊(nq − i)/b⌋ × fi
RR n° 7564
20 Vázquez & Bruguera
(Eq%b) =
nq−1∑
i=0
(nq − i)mod(b)× fi
with an appropriate decoding of the hot-one code signals fi using OR gates.
Namely, the signal lzq is decoded to a binary operand of ⌈log2 nq⌉ bits width,
the signal rsa into a hot-one radix-r operand of Ns bits width, and the signal
Ns into a binary (modulo b) operand of ⌈log2 b⌉ bits width. We show a toy
example for nq = 16 q = 2383 and b = 4 (radix-16) in the left side of Fig. 10.
We obtain lzq = 4, rsa = 3 and (Eq%b) = 0 as follows:
1. We compute signals f0 to f15 using the prefix tree of OR gates (⌈log2(16)⌉-
1=3 levels of two-input gates), and the level of NOT-AND2 gates, obtain-
ing f4 = 1.
2. We obtain lzq = 4 (01002) from the binary encoder, implemented using
the following boolean expressions:
lzq(3) = f8 ∨ f9 ∨ f10 ∨ f11 ∨ f12 ∨ f13 ∨ f14 ∨ f15
lzq(2) = f4 ∨ f5 ∨ f6 ∨ f7 ∨ f12 ∨ f13 ∨ f14 ∨ f15
lzq(1) = f2 ∨ f3 ∨ f5 ∨ f7 ∨ f10 ∨ f11 ∨ f14 ∨ f15
lzq(0) = f1 ∨ f3 ∨ f5 ∨ f7 ∨ f9 ∨ f11 ∨ f13 ∨ f15
3. The hot-one code radix-16 encoder computes the value rsa = ⌊(16 −
lzq)/4⌋ ∈ [0, 4] , producing an output rsa = 3 expressed in a hot-one
code (00010). The logical expressions of this block are given by
rsa(4) = f0
rsa(3) = f1 ∨ f2 ∨ f3 ∨ f4
rsa(2) = f5 ∨ f6 ∨ f7 ∨ f8
rsa(1) = f9 ∨ f10 ∨ f11 ∨ f12
rsa(0) = f13 ∨ f14 ∨ f15
4. The modulo b=4 encoder produces a binary string of log2(4) = 2 bits
which represents the value (16 − lzq)%4 ∈ [0, 3]. The logical expressions
of this block are given by
Eq%4(1) = f1 ∨ f2 ∨ f5 ∨ f6 ∨ f9 ∨ f10 ∨ f13 ∨ f14
Eq%4(0) = f1 ∨ f3 ∨ f5 ∨ f7 ∨ f9 ∨ f11 ∨ f13 ∨ f15
For the example of Fig. 10, since lzq = 4, then f4 = 1 and (16−lzq)%4 = 0.
4.3 On-line multiplier
On-line multiplication [17] of P = X × Y is defined by the following recurrence
equation for the scaled partial product wm[j] = rjX[j]Y [j]:
wm[j + 1] = r(wm[j]− Pj) + r
−δ(Xj+δ+1Y [j] + Yj+δ+1X[j + 1]) (10)
with δ = 2 for r > 2, X[j] =
∑j+δ
i=1 Xir
−i, Y [j] =
∑j+δ
i=1 Yir
−i. The initial
condition for the algorithm is wm[0] = X[0] × Y [0] and P0 = 0, where the
INRIA
Composite Iterative Algorithm and Architecture for q-th Root Calculation 21
product digits Pj ⊂ {−(r − 1) . . . , 0, . . . , r − 1} are obtained in our case by a
selection function Pj+1 = round( ̂wm[j + 1]), with wm[j+1] truncated to t bits.
After iteration j ≥ 0, the product is given by
P = P [j + 1] + (wm[j + 1]− Pj+1)r
−j−1
with P [j + 1] =
∑j+1
i=1 Pir
−i. For convergence we have to verify the following
conditions:
• General condition of convergence for the iteration: |wm[j + 1] − Pj+1| ≤
ri+1(P − P [j + 1]). Since |Pj | ≤ (r − 1), and
P − P [j + 1] ≤
∞∑
i=j+2
(r − 1)r−i ≤ r−j−1
this condition can be formulated as |wm[j+1]| ≤ (r−1)+ri+1r−j−1, that
is |wm[j + 1]| ≤ r.
• Condition of convergence for selection by rounding: |wm[j + 1]| < (r −
1) + 1/2, that is |wm[j + 1]| < r − 1/2.
To obtain upper bounds for the values of X and Y we check the second condi-
tion (more restrictive) for j = 0 . The scaled partial product wm[1] given by
recurrence (10) with j = 0 can be expressed as
wm[1] = rwm[0] + r
−δ(Xδ+1Y [0] + Yδ+1X[1])
= r(X[0]Y [0]) + r−δ(Xδ+1Y [0] + Yδ+1X[1])
= Y [0]r(X[0] +Xδ+1r
−δ−1) +X[1]r(Yδ+1r
−δ−1)
= rY [0]X[1] +X[1]r(Yδ+1r
−δ−1)
= rX[1](Y [0] + Yδ+1r
−δ−1) = rX[1]Y [1]
(11)
Using r|X[1]Y [1]| ≤ r(|X × Y |) in the previous expression, the condition of
convergence for selection by rounding limits the values of the input operands to
|X × Y | < 1− 1/2r
To obtain a value for the minimum number of fractional bits t required for
the estimate ̂wm[j + 1], we use the condition of convergence |wm[j+1]| < r−1/2
for j > 0. In this case, a bound for wm[j + 1] is given by
wm[j + 1] ≤ r(1/2 + 2
−t) + r−δ2(r − 1)(1− 1/2r)1/2 (12)
obtained by introducing the following bounds in expression (10): |wm[j]−Pj | ≤
1/2 + 2−t, |Xj+δ+1| ≤ (r − 1), |Yj+δ+1| ≤ (r − 1), |Y [j − 1]| ≤ (1 − 1/2r)1/2,
and |X[j]| ≤ (1− 1/2r)1/2.
With the previous bound for |wm[j+1]|, the condition of convergence results
in
r(1/2 + 2−t) + r−δ2(r − 1)(1− 1/2r)1/2 < r − 1/2 (13)
obtaining a minimum value of t = 1 for r ≥ 4 and δ = 2.
RR n° 7564
22 Vázquez & Bruguera
nm+gm
Mult−Add
*
Round & 
recod.
*
+
rwm[j]
b
wm[0]
Tj
nm+gm
to on−the−fly_CONV or OL_HR_EXP
from SD_SHIFTERfrom HR_RECP
Mult−Add
Dj
Mux−2
b
nd+gd
D[j]
S*j
Mux−2
b
ns+gs
S*[j+1]
S*[j]
D[j+1]
* *
nd+gd
nm+gm
BSA 4:2
b+t
nm+gm
nm+gm b
−rTj
wm[j+1]
Tj+1
+
Mux−2
Mux−2
nm+gm
Figure 11: Block diagram of the on-line multiplier.
We perform the operation T = D × S∗ using the online multiplier of Fig.
11. Since operands D and S∗ are of the form D =
∑⌈(nd+gd)/b⌉
i=0 Dir
−i with
D0 = 1, and S∗ =
∑⌈nEx/b⌉−1
i=0 S
∗
ir
i +
∑⌈(ns+gs)/b⌉
i=1 S
∗
ir
−i, they need to be
scaled by a constant to verify X × Y < 1− 1/2r. Thus, we scale both operands
as follows, X = Dr−1 (X1 = 1), and Y = S∗r−⌈nEx/b⌉, so that X×Y ≥ 2/r and
X×Y < 1−1/2r is verified for r ≥ 4. The product is given by T = Pr⌈nEx/b⌉+1.
We need an initial cycle to compute wm[0] = D[0]S∗[0] = S∗[0] + D1S∗[0],
and Nm = γ + Nl iterations of the recurrence (10) followed by the selection
of Pj+1 = round( ̂wm[j + 1]) to get an accuracy of nm bits (nEx integer, nl
fractional) of the product T .
5 Evaluation and Comparison
In this Section, we present estimates of the execution time and hardware cost
for the proposed low and high precision q architectures described in Section 3
and 4. First, we describe the evaluation model used to obtain the area and
delay estimates. Next, we particularize for single (n = 24, nEx = 8) and double
precision (n = 53, nEx = 11) formats with radix values r = 2b ranging from
r = 8 to r = 1024. Finally, we present a comparison with other representative
implementations in Section 5.1.
INRIA
Composite Iterative Algorithm and Architecture for q-th Root Calculation 23
Component Delay Area
(FO4) (NAND2)
Nand2 0.7 1
Xor2 1.3 2.5
FA 3.2 7.5
1-bit 4:2 CSA 5 15
4-bit CLA 5.5 35
1-bit D-latch 2.5 3.5
Table 5: Area and delay values for CMOS components.
nq
5 6 7 8 9 10 11 12
Single Precision (n, nEx)=(24,8)
Size (Kbits) 1.12 2.25 4.5 9 18 36 72 144
Area (♯Nand2) 335 675 1180 2360 4725 9450 18900 32400
Delay (♯FO4) 6.4 8 9.6 11.2 12.8 14.4 14.4 16
Double Precision (n, nEx) =(53, 11)
Size (Kbits) 2.12 4.25 8.5 17 34 68 136 272
Area (♯Nand2) 635 1275 2230 4460 8925 17850 35700 61200
Delay (♯FO4) 6.4 8 9.6 11.2 12.8 14.4 14.4 16
Table 6: Size, area and delay of look-up tables for reciprocal computation
We use an area and delay evaluation model based on a simplification logical
effort method [14] that allows for faster hand calculations. It considers the
different input and output gate loads, but neither interconnections nor gate
sizing optimizations. Instead, we use other optimization techniques, such as
buffering or cloning (gate replication) to drive high loads. The total stage delay
is obtained as the sum of the delays of the gates on the critical path. The delays
are expressed in FO4 units (delay of an 1x inverter with a fanout of 4 inverters),
and the area in number of equivalent minimum size NAND2 gates.
We do not expect this rough model to give very precise area-delay figures, but
it provides good first-order area and delay estimations to be used in technology-
independent comparisons. In Table 5 we detail the delay and area of some
common CMOS gates and logic components.
The hardware complexity and delay figures for look-up tables shown in Table
6 were extracted from synthesis evaluations as in [12] (see Apendix A in [11] for
more details).
Tables 7 and 8 show estimates of the latency (in number of cycles), cycle
time and execution time (in FO4 units) of the q-th root computation, area of
the high radix-r logarithm and on–line exponential units (in NAND2 units), and
total area of the proposed low precision q architecture with nq = 8, for single
and double precision computations respectively. The implemented radix values
go from r = 8 to r = 1024.
The latency of a q–root computation for the low precision q implementation
was calculated in accordance with the formula 2 + γ + δ + Ne, with γ = 2 and
δ = 2. The cycle time corresponds to the critical path delay of the logarithm
unit, and is the sum of the delays of the round unit, a multiplexer and the
RR n° 7564
24 Vázquez & Bruguera
Radix Latency Cycle T. Exec. T. LOG Area Exp. Area Total Area
(♯cycles) (♯FO4) (♯FO4) (NAND2) (NAND2) (NAND2)
8 16 30 480 4500 4200 13650
16 13 32 416 5700 5400 16450
32 12 32 384 6600 6200 18200
64 11 34 374 9000 8500 23300
128 10 34 340 11400 10800 28250
256 9 36 324 18700 17600 42800
512 8 36 288 32600 30700 69900
1024 8 37 296 60500 57000 124650
Table 7: Area and delay of the low precision q architecture with nq = 8 and single
precision (n = 24, nEx = 8).
Radix Latency Cycle T. Exec. T. LOG Area Exp. Area Total Area
(♯cycles) (♯FO4) (♯FO4) (NAND2) (NAND2) (NAND2)
8 27 31 837 11350 11000 31400
16 21 33 693 12700 12300 34800
32 18 33 594 18100 17500 45700
64 16 34 544 20600 20000 51400
128 14 34 476 32900 31800 75700
256 13 36 468 58000 56100 125800
512 13 36 468 98000 94800 204800
1024 12 37 444 133300 129100 275200
Table 8: Area and delay of the low precision q architecture with nq = 8 and double
precision (n = 53, nEx = 11).
multiply-add unit. The main contribution to the total area of the q-root unit
comes from both the high-radix logarithm and on–line exponential units, and
this is significantly high for radix values r > 128. In addition, we observe that,
very little advantage in execution time is obtained from using very high radix
values (over r = 128). On the other hand, the estimated area look–up table for
the reciprocal with nq = 8 is only 2360 NAND2 gates for single precision and
4460 NAND2 gates for double precision. However, as we show later, for higher
precision values of q, such as nq > 10, the contribution of this look–up table to
the total area is very significant, and the use of a look–up table to compute the
reciprocal may be not justified.
Tables 9 and 10 show the corresponding estimates for the high–precision q
architecture, with nq = 32 and nq = 64 for the single and double precision
computations respectively.
In addition to the previous estimates, we show the area of the reciprocal unit
for the different radix values. We include in this area estimation the fixed point
high-radix iterative unit, the leading zero detector and the related shifters. The
contribution of this unit to the total area is more significant in percentage for
low radix values. Thus, for a given radix, the area of this architecture is higher
than the area of the low precision q architecture for nq. Besides, there is a
latency overhead of 3 cycles. On the other hand, a variation in the precision nq
has a negligible impact on the total area or execution time.
INRIA
Composite Iterative Algorithm and Architecture for q-th Root Calculation 25
Radix Latency Cycle T. Exec. T. Area (NAND2)
(♯cycles) (♯FO4) (♯FO4) Recip. Log. Exp. Total
8 19 30 570 6200 4500 4200 21100
16 16 32 512 7050 5700 5400 25200
32 15 32 480 7500 6600 6200 27700
64 14 34 476 8700 9000 8500 34400
128 13 34 442 9700 11400 10800 40500
256 12 36 432 12000 18700 17600 57800
512 11 36 396 15100 32600 30700 88400
1024 11 37 407 20650 60500 57000 149100
Table 9: Area and delay of the high precision q architecture with nq = 32 and single
precision (n = 24, nEx = 8).
Radix Latency Cycle T. Exec. T. Area (NAND2)
(♯cycles) (♯FO4) (♯FO4) Recip. Log. Exp. Total
8 30 31 930 11200 11350 11000 44700
16 24 33 792 12500 12700 12300 50250
32 21 33 693 13200 18100 17500 62100
64 19 34 646 15000 20600 20000 70400
128 17 34 578 16200 32900 31800 96100
256 16 36 576 18800 58000 56100 149700
512 16 36 576 22300 98000 94800 232500
1024 15 37 555 28500 133300 129100 310000
Table 10: Area and delay of the high precision q architecture with nq = 64 and double
precision (n = 53, nEx = 11).
Thus, we want to determine a threshold value for nq such that the high pre-
cision q architecture presents a hardware cost advantage over the low precision
q architecture. We present in Table 11 area estimations of the low precision
q architecture, for single and double precision computations, and for different
values of nq from 5 up to 12. We choose a radix r = 128 since it seems to be
the most advantageous in terms of the product execution time ×area. We also
present the area estimations for the high-precision architecture for a radix value
r = 128. We observe that for both single and double precision computations,
the area of the low precision q architecture with nq = 11 is slightly higher than
the area of the higher precision q architecture, although the latency is 3 cycles
lower.
5.1 Comparison
The comparison of our q-th root computation method with previous alterna-
tives is not easy. As far as we know, no other previously proposed algorithm
and architecture, but a naive implementation of X1/q = 2(1/q) log2(X) and the
extension of the powering architecture in [12], allows the computation of the
q-th root for any value of q; that is, the other architectures in the literature are
derived for a given value of q and changing q implies making a different imple-
mentation. On the contrary, the proposed architecture allows the computation
RR n° 7564
26 Vázquez & Bruguera
q-th root unit with reciprocal by lookup table: Area (♯NAND2)
Single precision (n = 24, nEx = 8, 10 cycles latency)
nq 5 6 7 8 9 10 11 12
Recip. 335 675 1180 2360 4725 9450 18900 32400
Total 26200 26600 27100 28250 30600 35350 44800 58300
Double precision (n = 53, nEx = 11, 14 cycles latency)
nq 5 6 7 8 9 10 11 12
Recip. 635 1275 2230 4460 8925 17850 35700 61200
Total 71900 72500 73500 75700 80200 89100 107000 132500
q-th root unit with reciprocal by high-radix digit-recurrence
Latency (cycles) Area 1/q (NAND2) Total Area (NAND2)
SP 13 10850 40500
DP 17 18600 96100
Table 11: Determining the best unit in terms of cost for r = 128 as a function of nq.
SP (n=24, nEx=8, nq=32), DP (n=53, nEx=11, nq=64).
of any q–th root without any additional modification. It has to be pointed out
that the complexity of the extended algorithm in [12] which implements Equa-
tion (3), makes it very hard and inefficient to implement more than a small set
of q–th roots.
Basically, there are two types of algorithms for the computation of the q-th
root. Algorithms based on table-driven polynomial approximations [2, 10, 15]
and digit–recurrence algorithms [8]. Among the former, in [15] a method for
generating Xp for a given p is proposed, applicable to values of p = ±2k or
p = ±2k1 ±2k2 being k1 an integer and k2 a non-negative integer. This includes
a limited number of roots, such as square root, fourth root, eighth root, etc.
The powering function is computed by a piecewise linear approximation based
on modified first–order Taylor expansion. The first–order Taylor expansion is
rewritten as Xp = C×X ′ being X ′ = X1 +2−m−1 + p× (X2− 2−m−1), and X1
and X2 the upper m-bit part and the lower part of X, respectively. C can be
read through a table look-up addressed by X1 and, for special p’s, X ′ is easily
obtained by modifying X. Only one multiplication is required to evaluate the
modified Taylor expansion.
A second–order minimax approximation is presented in [10], which allows
the computation of Xp for any given p. This includes every q-th root. Xp is
approximated as C2 × X22 + C1 × X2 + C0. The three coefficients C2, C1 and
C0 are stored in look–up tables and selected by X1. The size of the tables is
optimized by carefully minimizing the coefficients wordlength for the required
precision. The evaluation of the powering requires, besides the look–up tables,
a squaring unit and a fused accumulation tree.
Another second–order interpolation for the evaluation of elementary func-
tions, including q-th roots, is presented in [2]. The table sizes are reduced by
storing the function values and one coefficient for each interpolation subinterval,
instead of storing all the three coefficients as in the proposal above. The two
remaining coefficients are computed from the function values. This way, the
memory requirements are reduced by one third. Additionally, some multipliers
and adders are need to complete the powering computation.
INRIA
Composite Iterative Algorithm and Architecture for q-th Root Calculation 27
Architecture Latency cycle time Delay Area
(cycles) (FO4) (FO4) (NAND2)
Composite LOG-MUL-EXP Algorithms (r = 128)
Our architecture (nq = 8) 10 34 340 28250
Our architecture (nq = 32) 13 34 442 40500
Naive (nq = 8) 15 34 510 33015
Xp with p integer [12] 9 34 306 29827
X1/q (nq = 8) [12] 9 34 306 > 500000
Linear approx. [15] 1 51 51 26122
2nd–order interp. [10] 3 18.7 56.1 10170
2nd–order interp. [2] 2 54.4 108.8 10612
Digit–recurr. (q = 3, r = 2) [13] 52 119 6188 9035
Table 12: Architecture features comparison for single–precision floating–point
representation and low precision q.
Note that the three algorithms and architectures above for the computation
of X1/q are targeted for a given q. To adapt the architecture to other different
q, when possible, requires changing the look–up tables.
On the other hand, a general digit–recurrence algorithm for the computation
of the q-th root has been presented in [8]. The result is a general algorithm
that must be particularized for each different q, and the larger q the larger the
complexity. This general algorithm has been used to implement a cube root
unit [13].
In order to evaluate our architecture, we compare the algorithms based on a
table-driven polynomial approximation and the digit–recurrence algorithm out-
lined above. However, it has to be kept in perspective that, unlike our algorithm,
all these algorithms have to be particularized for a given q when implementing
the q-th root unit. Moreover, we include in the comparison the naive imple-
mentation of X1/q = e(1/q)ln(X) and the architecture for the computation of Xp
and its extension to X1/q, with p and q integers, presented in [12].
Table 12 shows the latency, delay and area estimate of every algorithm and
implementation. We have considered a single–precision floating–point repre-
sentation for X and low–precision q; in particular, we consider nq = 8 for
low precision q and nq = 32 for our architecture with higher precision q. For
the architectures based on composite logarithm-multiplication-exponential al-
gorithms, we have used a radix r = 128.
As shown in Tables 7-10, a good tradeoff between latency, cycle time and
area can be achieved for specific values of the radix. Although it is difficult
to select just one radix for every implementation, in applications demanding
high-speed processing one of the most efficient implementations corresponds to
radix r = 128.
Note that, although the table-driven polynomial approximation are imple-
mented for a given root, the latency, delay and area are roughly the same for
any root. However, the features of the digit–recurrence architecture are closely
related to the value of q; so, we show the features for the computation of the
radix 2 cube root described in [13].
RR n° 7564
28 Vázquez & Bruguera
As expected, the latency, total delay and area of the table–driven polynomial
approximation architectures is significantly smaller than in the architecture we
have proposed in this paper, because those architectures can compute only a
given root. Of course, those architectures could be extended to compute more
than just one given root. This means that tables to store the coefficients or the
function values need to be replicated to include one set of tables for each root.
This affects the total area and the cycle time.
Similarly, the digit–recurrence architecture is tailored to a given root, as
well. The computation of a different root implies the development of completely
different architecture. The larger the q the larger the complexity.
Finally, the features of our architecture and the architecture in [12] are quite
similar. Both architectures are based on similar optimizations of the naive im-
plementation of X1/q = 2(1/q) log2(X); however, the extension of the architecture
in [12] to the computation of a q-th root, is quite inefficient. This is mainly due
to the huge table required to obtain 2(Ex%q)/q once the modulus of the integer
division Ex%q has been evaluated, even for low precision q.
6 Conclusion
An algorithm for q–th root extraction has been presented, based on a high–
radix composite algorithm. It consists of computingX1/q as 2(1/q)×(Ex+log2(Mx))
through a sequence of parallel and /or overlapped operations: reciprocal, high–
radix digit–recurrence logarithm, high–radix left–to–right carry–free multipli-
cation and on–line high–radix exponential. A detailed error analysis has been
carried out to determine the intermediate wordlengths.
The algorithm is based on a previous algorithm for the computation of the
powering function Xp, with p any integer, which was extended for the computa-
tion of q–th roots. However, the extended algorithm seems hard to implement
since it is necessary to compute an integer division and a modulus operation.
Our algorithm avoids these two operations, resulting in a much simpler algo-
rithm.
Two architectures have been proposed. First, an architecture for low pre-
cision q values, less than 12 bits, where the reciprocal 1/q is obtained directly
from a look–up table; after that, an alternative architecture for higher precision
values of q, where a high–radix iterative algorithm has been used for the compu-
tation of the reciprocal. Both architectures have been evaluated and estimates
of the execution time, the latency and the area have been obtained, based on an
approximated model for the delay and the area of the main building blocks, for
single and double precision floating–point representations and several radices.
The analysis of the tradeoffs between area and speed allows us to determine the
better radix for every implementation: radix r = 128 might be suitable for high
speed implementations. Larger radices result in similar execution times with
much larger area requirements.
The comparison with other previous algorithms is not easy. As far as we
know, no other previously proposed methods, except the extension of the ar-
chitecture for the computation of Xp, allow the computation of the q–th root
for any value of q. Even so, we have discussed the area and time figures for
several non–general implementations for q–th root calculation and the powering
architecture to determine the suitability of our implementation. The conclusion
INRIA
Composite Iterative Algorithm and Architecture for q-th Root Calculation 29
is that the execution times and hardware requirements are better than those
of the powering function calculation and, although obviously are worse than
those of the non–general q–th root extraction architectures, the flexibility of
our implementation makes it an interesting alternative.
Work is in progress to integrate the proposed q–th root computation with
the previous powering function algorithm into a combined unit which would be
able to compute any function of type Xp/q.
References
[1] J. Cao and B.W.Y Wei, High-performance hardware for function generation,
Proc. 13th IEEE Symposium on Computer Arithmetic, pp.184–188, Jul.
1997.
[2] J. Cao, B. W. Y. Wei and J. Cheng, High-performance architectures for
elementary function generation, Proc. 15th IEEE Symposium on Computer
Arithmetic, pp. 136–144, Jun. 2001.
[3] M. D. Ercegovac and T. Lang, Fast Multiplication Without Carry-Propagate
Addition, IEEE Transactions on Computers, vol. 39 no. 11, pp. 1385–1390,
Nov. 1990.
[4] M.D. Ercegovac and T. Lang, On-the-Fly Rounding, IEEE Transactions on
Computers, vol. 41, no. 12, pp. 1497–1503, Dec. 1992.
[5] M.D. Ercegovac and T. Lang, Digital Arithmetic, Morgan Kaufmann, 2004.
[6] M.D. Ercegovac, Digit-by-Digit Methods for Computing Certain Functions,
41st Asilomar Conference on Signals, Systems and Computers, pp. 338–342,
Nov. 2007.
[7] IEEE Std 754(TM)-2008, IEEE Standard for Floating-Point Arithmetic,
IEEE Computer Society, Aug. 2008.
[8] P. Montuschi, J.D. Bruguera, L. Ciminiera and J.-A. Piñeiro, A Digit-by-
Digit Algorithm for mth Root Extraction, IEEE Transactions on Computers,
vol. 56, no. 12, pp. 1696–1706, Dec. 2007.
[9] J.-M. Mueller, Elementary Functions, Algorithms and Implementation,
Birkhäuser, 1997.
[10] J.-A. Piñeiro, J.D. Bruguera and J.-M. Mueller, Faithful Powering Compu-
tation Using Table Lookup and Fused Accumulation Tree, Proc. 15th IEEE
Symposium on Computer Arithmetic, pp. 40–47, Jun. 2001.
[11] J.-A. Piñeiro, Algorithms and Architectures for Elementary Function Com-
putation, Ph.D. Dissertation, Dept. of Electronics and Computer Engineer-
ing, University of Santiago de Compostela, Spain, Jun. 2003. Available at
http://www.ac.usc.es.
[12] J.-A. Piñeiro, M.D. Ercegovac and J.D. Bruguera, Algorithm and Architec-
ture for Logarithm, Exponential and Powering Computation, IEEE Transac-
tions on Computers, vol. 53, no. 9, pp. 1085–1096, Sep. 2004.
RR n° 7564
30 Vázquez & Bruguera
[13] A. Piñeiro, J.D. Bruguera, F. Lamberti, P. Montuschi, A Radix-2 Digit-by-
Digit Architecture for Cube Root, IEEE Transactions on Computers, vol. 57,
no. 4, pp. 562–566, Apr. 2008.
[14] I.E. Sutherland, R.F. Sproull and D. Harris, Logical Effort: Designing Fast
CMOS Circuits, Morgan Kaufmann, 1999.
[15] N. Takagi, Powering by a Table Look-Up and a Multiplication with Operand
Modification, IEEE Transactions on Computers, vol. 47, no. 11, pp. 1216–
1222, Nov. 1998.
[16] N. Takagi, A Digit-Recurrence Algorithm for Cube Rooting, IEICE Trans-
actions on Fundamentals of Electronics, Communications and Computer
Sciences, vol. E84-A, no. 5, pp. 1309–1314, May 2001.
[17] K.S. Trivedi and M.D. Ercegovac, On-Line Algorithms for Division and
Multiplication, IEEE Transactions on Computers, vol. C-26, no. 7, pp. 681–
687, Jul. 1977.
[18] A. Vazquez, E. Antelo and T. Lang, Combined Division/Square-
Root/Reciprocal Square-Root Unit for 3D Graphics Geometry Process-
ing, Tech. Report, Department of Electronic and Computer Science.
University of Santiago de Compostela, Spain. Sep. 2000. Available at
http://www.ac.usc.es.
INRIA
Centre de recherche INRIA Grenoble – Rhône-Alpes
655, avenue de l’Europe - 38334 Montbonnot Saint-Ismier (France)
Centre de recherche INRIA Bordeaux – Sud Ouest : Domaine Universitaire - 351, cours de la Libération - 33405 Talence Cedex
Centre de recherche INRIA Lille – Nord Europe : Parc Scientifique de la Haute Borne - 40, avenue Halley - 59650 Villeneuve d’Ascq
Centre de recherche INRIA Nancy – Grand Est : LORIA, Technopôle de Nancy-Brabois - Campus scientifique
615, rue du Jardin Botanique - BP 101 - 54602 Villers-lès-Nancy Cedex
Centre de recherche INRIA Paris – Rocquencourt : Domaine de Voluceau - Rocquencourt - BP 105 - 78153 Le Chesnay Cedex
Centre de recherche INRIA Rennes – Bretagne Atlantique : IRISA, Campus universitaire de Beaulieu - 35042 Rennes Cedex
Centre de recherche INRIA Saclay – Île-de-France : Parc Orsay Université - ZAC des Vignes : 4, rue Jacques Monod - 91893 Orsay Cedex
Centre de recherche INRIA Sophia Antipolis – Méditerranée : 2004, route des Lucioles - BP 93 - 06902 Sophia Antipolis Cedex
Éditeur
INRIA - Domaine de Voluceau - Rocquencourt, BP 105 - 78153 Le Chesnay Cedex (France)
http://www.inria.fr
ISSN 0249-6399
