Area-Time Efficient Evaluation of Elementary Functions by OKABE, Yasuo & YAJIMA, Shuzo
Title: Area-Time Efficient Evaluation of Elementary Functions
Author(s): OKABE, Yasuo; YAJIMA, Shuzo




Type: Departmental Bulletin Paper
Text version: publisher
Kyoto University
Area-Time Efficient Evaluation of Elementary Functions
Yasuo OKABE and Shuzo YAJIMA
Faculty of Engineering, Kyoto University
ABSTRACT
This paper describes area-time efficient VLSI algorithms for evaluating elementary
functions. For square rooting, VLSI implementations of $AT^{2}$-optimal, i.e., $AT^{2}=O(n^{2})$,
circuits are presented, where $A$ is the chip area and $T$ is the computation time in the
range $[\Omega((\log n)^{1+\epsilon}), O(\sqrt{n})]$ for arbitrary $\epsilon>0$. For logarithms and exponentials, an
upper bound $AT^{2}=O(n^{2}(\log n)^{2})$ is given for any $T\in[\Omega((\log n)^{2+\epsilon}), O(\sqrt{n}\log n)]$. VLSI
circuits for these functions with minimum computation time $O(\log n)$ and almost
optimal $AT^{2}$-performance, i.e., $AT^{2}=O(n^{2+\epsilon})$, are also exhibited. These are achieved by
using an extension of the Beame-Cook-Hoover method which we have proposed before.
1. Introduction
Much research has long been conducted on arithmetic operations, such as addition,
multiplication, division, square rooting and evaluation of elementary functions. Recent
advances in large-scale integration technology of circuits have especially motivated the
research on hardware algorithms for arithmetic operations, suitable for VLSI implemen-
tations.
In VLSI circuits, the chip area is a more reasonable cost measure than the number
of gates. With this in mind, VLSI models have been proposed as more practical computation
models [1][2], and much attention has been paid to area-time tradeoffs for many
basic operations such as arithmetic operations.
For multiplication, $AT^{2}$-optimal, i.e., $AT^{2}=O(n^{2})$, multipliers are known for all
computation times in the range $[\Omega(\log n), O(\sqrt{n})]$ [3]. For division, Mehlhorn and
Preparata exhibited an $AT^{2}$-optimal design of dividers with computation times in the
range $[\Omega((\log n)^{1+\epsilon}),O(\sqrt{n})]$ for arbitrary constant $\epsilon>0$ [4], utilizing the
Beame-Cook-Hoover division technique. Mehlhorn also exhibited $AT^{2}$-optimal square rooting circuits
for all computation times in the range $[\Omega((\log n)^{3}),O(\sqrt{n})]$ [5].
In this paper, we present several fast and area-time efficient algorithms for evaluating
elementary functions, and give some new upper bounds on the $AT^{2}$-complexity of
these functions. First we describe $AT^{2}$-optimal square rooting circuits for all computation
times in the range $[\Omega((\log n)^{1+\epsilon}),O(\sqrt{n})]$ for arbitrary $\epsilon>0$. Next we exhibit
666 1988 115-124
$O(n^{2+\epsilon})$ for arbitrary $\epsilon>0$ .
value for $x$ in the above domains. We adopt the VLSI model developed by Brent and
Kung [2] as our computation model, and consider time and area as our cost measures.
2. Area-Time Optimal Square Rooting Circuits for $T=\Omega((\log n)^{1+\epsilon})$
In this section, we present an area-time optimal square rooting circuit for
$T=\Omega((\log n)^{1+\epsilon})$. Our square rooting algorithm is a modification of Mehlhorn-
Preparata's area-time optimal division [4]. In our algorithm, $1/\sqrt{x}$ is first evaluated by
the Newton iteration as a root of the equation $u^{-2}-x=0$, and $\sqrt{x}$ is then calculated by the
equation $\sqrt{x}=x\cdot(1/\sqrt{x})$. The iteration rule is
$u_{i+1}= \frac{1}{2}u_{i}(3-xu_{i}^{2})$ .
Since the Newton iteration has a self-correcting property, it is sufficient to compute
with $2^{i+1}$-bit precision at the i-th stage of the iteration. Using this rule, it is immediately
shown that there exists an $AT^{2}$-optimal, i.e., $AT^{2}=O(n^{2})$, n-bit square rooting
circuit for any $T\in[\Omega((\log n)^{2}), O(\sqrt{n})]$. (See [5].)
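As an illustration of the iteration rule and its self-correcting property, the following Python sketch runs the Newton iteration for $1/\sqrt{x}$ in exact rational arithmetic and truncates every iterate to a working precision that doubles from stage to stage. The name `newton_rsqrt`, the guard bits, the starting precision and the stopping test are assumptions made for this illustration; they are not taken from the circuit construction itself.

```python
import math
from fractions import Fraction

def newton_rsqrt(x: Fraction, bits: int, guard: int = 4) -> Fraction:
    """Approximate 1/sqrt(x) for x in [1/4, 1) to about `bits` fraction bits."""
    target = bits + guard             # final working precision (guard bits are an assumption)
    prec = 2                          # assumed starting precision
    u = Fraction(1)                   # 1/sqrt(x) lies in (1, 2], so 1 is a safe start
    for _ in range(64):               # safety cap, far more stages than ever needed
        prec = min(2 * prec, target)  # working precision doubles per stage
        scale = 1 << prec
        nxt = u * (3 - x * u * u) / 2                   # u_{i+1} = u_i (3 - x u_i^2) / 2
        nxt = Fraction(math.floor(nxt * scale), scale)  # keep only `prec` fraction bits
        if prec == target and abs(nxt - u) <= Fraction(2, scale):
            return nxt                # iterates agree at full precision
        u = nxt
    return u
```

Early iterates carry only a few correct bits, yet the truncation errors introduced at those stages are wiped out by later stages, which is why low precision suffices early on.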
We now describe an area-time optimal square rooting circuit with computation
time $T$ in the range $[\Omega((\log n)^{1+\epsilon}),O((\log n)^{2})]$ for any $\epsilon>0$ , which is a modification of
Mehlhorn-Preparata’s area-time optimal divider with computation time in the same
range [4]. It is easily shown that almost all techniques for division are also applicable to
square rooting, except for the polynomial approximation of inverses; the approximation,
however, essentially depends on the peculiarity of the Taylor expansion of $1/x$ .
Instead of the polynomial approximation, we present a new technique for approxi-





Given: an integer $s\in[2,l]$ and an integer $k=\lceil\log_{4}s\rceil$
Input: an l-bit number $x\in[1/4,1)$
Output: an $(l+3)$-bit number $v\in(1,2]$ s.t. $v$ gives the leading $(l+3)$ bits of $1/\sqrt{x}$
In line (3) of Algorithm INSQRT2,
$f_{x}^{k}(u)=f_{x}(f_{x}(\cdots f_{x}(u)\cdots))$
is evaluated in one step, instead of by $k$ successive evaluations of $f_{x}$.
The following proposition gives the area-time performance of computing $f_{x}^{k}(u)$ in
lines (f1)-(f2) [7]. It is proved by using an extension of Beame-Cook-
Hoover's method [8].
PROPOSITION 1. [Okabe, Takagi and Yajima, 1987] There exists a circuit which
computes the n-bit approximate value of a polynomial of degree $k$ in an n-bit number,
in time $O(\log(nk))$ and with area $O((nk)^{2})$ if $k\geq O(\log n)$ or $O(n^{2}k\log n)$ if $k<O(\log n)$.
Since $f_{x}^{k}(u)$ is a polynomial of degree $4^{k}(\approx s)$ in $u$ and $x$, $f_{x}^{k}(u)$ can be computed
in time $O(\log l)$ and with area $O(l^{2}s(s+\log l))$. The number of iterations required in
lines (2)-(4) is $\lceil\log_{2}l\rceil/k$, and thus the total computation time of the square rooting
circuit is $O(\log^{2}l/\log s)$ and the chip area is $O(l^{2}s(s+\log l))$. We now obtain the following
lemma:
LEMMA 1. For any $s\in[2,l]$, there exists a circuit which computes the l-bit inverse of
the square root of an l-bit number in time $O((\log l)^{2}/\log s)$ and has area $O((ls)^{2})$ if
$s\geq O(\log l)$ or $O(l^{2}s\log l)$ if $s<O(\log l)$.
We are now ready to describe our square rooting algorithm with optimal $AT^{2}$-
performance for the computation time $T=O(\log^{1+\epsilon}n)$ for any $\epsilon>0$.
The successive refinement technique [4] for square rooting can be written as:
Algorithm INSQRT3$(x)$
Given: an integer sequence $l_{1}<l_{2}<\ldots<l_{J}=l$
Input: an l-bit number $x\in[1/4,1)$
Output: an l-bit number $v\in(1,2]$ s.t. $v^{2}x=1+\epsilon,$ $\epsilon<2^{-l}$
(1) begin $v:=1$ ;
(2) for $i:=1$ to $J$ do
(3) begin $t_{i}:=$ leftmost $(l_{i}+1)$ bits of $x$ ;
(4) $z_{i}:=v^{2}t_{i}$ ;
(5) $x_{i}:=$ leftmost $(l_{i}+1)$ bits of $z_{i}$ ;
(6) $v_{i}:=$ (leftmost $(l_{i}+1)$-bit overinverse of square root of $x_{i}$); {i.e., $\lceil 2^{l_{i}+1}/\sqrt{x_{i}}\rceil\cdot 2^{-(l_{i}+1)}$}
subroutine. To carry out the same argument as in the case of division, we need the following lemma.
LEMMA 2. If an l-bit number $x\in[1/4,1)$ has $(l'-1)$ zeros immediately to the right of
the leading 1, the l-bit inverse of the square root of $x$ can be computed in time
$T=O(\log(l/l')\cdot\log l/\log s)$ and with area $A=O((ls)^{2})$, for any $s\in[O(\log l),l/l']$.





Proof. In Algorithm INSQRT2, the first approximation $u_{0}=1$ has a precision of at
least $l'$ bits. This implies that $O(\log(l/l'))$ iterations are sufficient to compute $1/\sqrt{x}$.
For $i=2,\ldots,J$, it is easily verified that $x_{i}$ in Algorithm INSQRT3 satisfies the
condition on $x$ in the above lemma for $l=l_{i}$ and $l'=l_{i-1}$.
Only a straightforward discussion remains. Choose
where $J$ is the largest value of $i$ for which $s_{i}>2$. (Indeed $J=\Theta(1/\epsilon)$.) Then it can be proved
that the successive refinement modules based on Algorithm INSQRT3 can be implemented
as a VLSI circuit with $T=O((\log n)^{1+\epsilon}),$ $A=O(n^{2}/((\log n)^{1+\epsilon})^{2})$, though we
omit the proof.
The Newton iteration techniques utilized here are exactly the same as those of
Mehlhorn-Preparata for their division algorithms [4]. Thus we have:
evaluated with optimal $AT^{2}$-performance $O(n^{2})$ for any $T\in[\Omega((\log n)^{1+\epsilon}),O(\sqrt{n})]$.





Note that similar results can also be derived for the computation of $\sqrt[k]{x}$ (the k-th
root of $x$) for any fixed integer $k$.
On the other hand, by choosing $s$ as $s=l^{\epsilon}$ $(1\geq\epsilon>0)$ in Algorithm INSQRT2, the
resulting circuit achieves $T=O((1/\epsilon)\log l)$ and $A=O(l^{2(1+\epsilon)})$. Thus we get the following
result:
THEOREM 2. For any $\epsilon>0$, there exists a circuit which computes the n-bit square
root of an n-bit number in time $O(\log n)$ and with area $A=O(n^{2+\epsilon})$.
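The bounds quoted for the choice $s=l^{\epsilon}$ can be checked by substituting into Lemma 1. The short derivation below is our own worked step, not part of the original argument; it assumes $l$ is large enough that $l^{\epsilon}\geq\log l$, so that the first area bound of Lemma 1 applies.

```latex
\begin{align*}
s = l^{\epsilon} \;&\Longrightarrow\; \log s = \epsilon \log l, \\
T &= O\!\left(\frac{(\log l)^{2}}{\log s}\right)
   = O\!\left(\frac{1}{\epsilon}\,\log l\right), \\
A &= O\!\left((l s)^{2}\right)
   = O\!\left(l^{2(1+\epsilon)}\right) = O\!\left(l^{2+2\epsilon}\right).
\end{align*}
```

For a fixed $\epsilon$ the factor $1/\epsilon$ is a constant, and applying the same substitution with $\epsilon/2$ in place of $\epsilon$ turns the area exponent $2+2\epsilon$ into $2+\epsilon$, which matches the form stated in the theorem.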
3. Area-Time Efficient Evaluation of Exponentials and Logarithms
In this section, we will consider the area-time efficient implementation of circuits
for exponentials and logarithms.
First we present an algorithm for evaluating n-bit logarithms with
$AT^{2}=O(n^{2}\log^{2}n)$ based on the arithmetic-geometric mean iteration of Gauss [6]. The
algorithm is as follows:
Algorithm LOG1$(x)$
Input: an n-bit number $x\in[1,2)$
Output: an n-bit number $y\in[0,\ln 2)$ s.t. $|y-\ln x|<2^{-n}$
(1) begin $a_{0}:=1;b_{0}:=2^{2-n}\cdot x^{-1};i:=0$ ;






(6) $y:=\pi/(2a_{i})-n\ln 2$
end.
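The arithmetic-geometric mean evaluation can be tried out numerically. The Python sketch below uses the initial values of line (1) and the final formula of line (6); the loop body in between is not legible in this copy, so the standard AGM step $a\leftarrow(a+b)/2$, $b\leftarrow\sqrt{ab}$ is assumed, and a fixed iteration count replaces whatever termination test the original uses. Double-precision floats stand in for n-bit fixed-point numbers, and the name `log_agm` is hypothetical.

```python
import math

def log_agm(x: float, n: int = 30) -> float:
    """Approximate ln(x) for x in [1, 2) via the arithmetic-geometric mean."""
    a = 1.0                       # a_0 := 1
    b = 2.0 ** (2 - n) / x        # b_0 := 2^(2-n) / x
    for _ in range(40):           # assumed: fixed count, well past convergence
        a, b = (a + b) / 2, math.sqrt(a * b)   # assumed standard AGM step
    # pi / (2 * AGM) approximates ln(x * 2^n); subtracting n*ln 2 leaves ln x.
    # ln 2 is treated as a stored constant here.
    return math.pi / (2 * a) - n * math.log(2)
```

Each AGM step costs one addition and one square root, so the cost of the whole evaluation is dominated by the square rooting circuit, which is where the results of Section 2 enter.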
Using the result on area-time optimal square rooting circuits, the following theorem
follows immediately.
THEOREM 3. For any fixed $\epsilon>0$ , the n-bit logarithm of an n-bit number can be cal-
culated with $AT^{2}=O(n^{2}(\log n)^{2})$ for any computation time
$T\in[\Omega((\log n)^{2+\epsilon}), O(\sqrt{n}\log n)]$ .
Once an efficient evaluation of logarithms is established, the exponential function
$y=\exp x$ can be evaluated as the root of the equation
$f(y)=\ln y-x=0$
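Applying Newton's method to $f(y)=\ln y-x$ gives $y_{i+1}=y_{i}-f(y_{i})/f'(y_{i})=y_{i}(1+x-\ln y_{i})$, since $f'(y)=1/y$. The Python sketch below iterates this rule in floating point; the starting value 1, the tolerance and the name `exp_newton` are assumptions for illustration, and `math.log` stands in for the logarithm circuit.

```python
import math

def exp_newton(x: float, tol: float = 1e-12) -> float:
    """Approximate exp(x) for x in [0, ln 2) as the root of ln(y) - x = 0."""
    y = 1.0                           # assumed initial approximation
    for _ in range(60):               # safety cap on the number of iterations
        residual = math.log(y) - x    # f(y_i) = ln y_i - x
        if abs(residual) <= tol:
            break
        y = y * (1.0 - residual)      # y_{i+1} = y_i (1 + x - ln y_i)
    return y
```

Each step needs one logarithm and one multiplication, so the quality of the initial approximation directly controls how many logarithm evaluations are required.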




Input: an n-bit number $x\in[0,\ln 2)$
Output: an n-bit number $y\in[1,2)$ s.t. $|y-\exp x|<2^{-n}$
begin
(1) $y_{0}:=$ ($l$-bit approximation of $\exp x$); $i:=0$ ;
(2) while $|\ln y_{i}-x|>2^{-n}$ do
begin





We now consider the efficient implementation of this algorithm for each computa-
tion time $T$ in the range $T\in[\Omega((\log n)^{2+\epsilon}), O(\sqrt{n}\log n)]$ .
THEOREM 4. For any fixed $\epsilon>0$, the n-bit exponential of an n-bit number can be
evaluated with $AT^{2}=O(n^{2}(\log n)^{2})$ for any computation time
$T\in[\Omega((\log n)^{2+\epsilon}), O(\sqrt{n}\log n)]$.
The theorem is easily proved for $T$ in the range $[\Omega((\log n)^{4}),O(\sqrt{n}\log n)]$, by
choosing $y_{0}:=1$ as an initial approximate value in line (1), utilizing the efficient Newton
iteration technique in [5] and evaluating logarithms in line (3) by Algorithm LOG1.
Thus we consider the implementation of Algorithm EXP1 for
$T\in[\Omega((\log n)^{2+\epsilon}), O((\log n)^{4})]$.
First we propose a new algorithm for fast evaluation of a good initial l-bit approximate
value of $\exp x$ in line (1). Let $x$ be an l-bit binary number, and suppose $l=s^{k}$. Let
$x_{0}$ be the leftmost bit of $x$, and $x_{i}$ be the $(s^{i-1}+1)$-th to $s^{i}$-th bits of $x$ $(i=1,\ldots,k)$. Then
$x=x_{0}+x_{1}+x_{2}+\cdots+x_{k}$ and therefore
$\exp x=\exp(x_{0}+x_{1}+x_{2}+\cdots+x_{k})=(\exp x_{0})(\exp x_{1})(\exp x_{2})\cdots(\exp x_{k})$ .
This leads to the algorithm shown below:
Algorithm EXP2$(x)$
Given: integers $s,k$ s.t. $s\leq l$ and $k=\lceil\log_{s}l\rceil$
Input: an l-bit number $x\in[0,\ln 2)$
Output: an l-bit number $y\in[1,2)$ s.t. $|y-\exp x|<2^{-l}$
(1) begin
(2) for $i:=1$ to $k$ {in parallel} do
begin
(3) $x_{i}:=$ the $(s^{i-1}+1)$-th to $s^{i}$-th bits of $x$ ;





Consider the Taylor expansion of $\exp x_{i}$,
$\exp x_{i}=1+x_{i}+(1/2!)x_{i}^{2}+\cdots+(1/n!)x_{i}^{n}+\cdots$
where $x_{i}<2^{-s^{i-1}}$.
This means that the sum of the first $2l/s^{i-1}$ terms is an approximate value of
$\exp x_{i}$ with a precision of $2l$ bits, since $x_{i}^{2l/s^{i-1}}<2^{-2l}$. Thus we may assume that $y_{i}$ in
line (4) is a polynomial in $x_{i}$ of degree $2l/s^{i-1}$.
Since $x_{i}$ is an $(s^{i}-s^{i-1})$-bit number, Step (4) is carried out in time $O(\log l)$ and
with area $O(l^{2}s(s+\log l))$ by Proposition 1. (Note that $(s^{i}-s^{i-1})\cdot(2l/s^{i-1})=O(ls)$.)
All $y_{i}$'s can be calculated in parallel (lines (2)-(4)), and multiplied pairwise in the
form of a binary tree (line (5)). Thus the computation time required in lines (2)-(4) is $O(\log l)$
and the area is $O(k\cdot l^{2}s(s+\log l))$. Using $(2k-1)$ multipliers with time $O(\log l)$ and area $O(l^{2})$
in line (5), their product can be obtained in time $O((\log k)\cdot(\log l))$ and with area
$O(k\cdot l^{2})$. Hence the total computation time is $O((\log k)\cdot(\log l))$ and the total chip area is
$O(k\cdot l^{2}s(s+\log l))$. Thus we have the following lemma:
LEMMA 3. For any $s\in[2,l]$, there exists a circuit which computes the l-bit exponential
of an l-bit number in time $O((\log l)\log(\log l/\log s))$ and has area
$O((ls)^{2}\log l/\log s)$ if $s\geq O(\log l)$ or $O(l^{2}s\log^{2}l/\log s)$ if $s<O(\log l)$.
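The block decomposition behind this lemma can be sketched in software. The Python code below splits an l-bit fraction into the bit fields $x_{0},x_{1},\ldots,x_{k}$ described above, evaluates each factor $\exp x_{i}$ by a Taylor polynomial with about $2l/s^{i-1}$ terms, and multiplies the factors. It is a sequential sketch in exact rational arithmetic, not a model of the parallel circuit; the name `exp_blocks`, the integer encoding `x_bits` and the use of a Taylor sum for $\exp x_{0}$ (rather than a stored constant) are assumptions.

```python
import math
from fractions import Fraction

def exp_blocks(x_bits: int, l: int, s: int) -> Fraction:
    """Approximate exp(x) for the l-bit fraction x = x_bits / 2**l."""

    def taylor_exp(z: Fraction, terms: int) -> Fraction:
        # 1 + z + z^2/2! + ... + z^terms/terms!
        total, term = Fraction(1), Fraction(1)
        for n in range(1, terms + 1):
            term = term * z / n
            total += term
        return total

    def field(lo: int, hi: int) -> Fraction:
        # Value of bits lo..hi of x (bit j has weight 2^-j), all other bits zeroed.
        hi = min(hi, l)
        width = hi - lo + 1
        bits = (x_bits >> (l - hi)) & ((1 << width) - 1)
        return Fraction(bits, 1 << hi)

    y = taylor_exp(field(1, 1), 2 * l)        # factor for x_0, the leftmost bit
    i = 1
    while s ** (i - 1) < l:                   # blocks i = 1, ..., k
        x_i = field(s ** (i - 1) + 1, s ** i) # x_i < 2^(-s^(i-1))
        terms = math.ceil(2 * l / s ** (i - 1))
        y *= taylor_exp(x_i, terms)           # exp x = product of exp x_i
        i += 1
    return y
```

Later blocks are longer but numerically smaller, so they need fewer Taylor terms; the product (number of bits) times (number of terms) stays $O(ls)$ for every block, which is the quantity that Proposition 1 charges for.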
Let us continue the proof of the theorem for $T\in[\Omega(\log^{2+\epsilon}n), O(\log^{4}n)]$. Let $l=n/T$,
and consider Algorithm EXP1 for this $l$. From the above lemma for $s=2$, it follows
that there is a circuit, say $F_{E}$, which computes the initial l-bit approximation of $\exp x$
in line (1) in time






feedback, and the later $m-i_{0}$ iterations on their own circuits $F_{i_{0}+1},\ldots,F_{m}$ respectively.
former $i_{0}$ iterations in time $T_{F}=O(\log^{2+c}n)$ and has area
$A_{F}=O((l\prime 2\eta 2,(i_{0}+\log_{2}l)^{2}/(\log^{2+c}n)^{2})=O(n^{2}\log^{2}n/T^{2})$ .
The circuit $F$
choose the computation time of $F_{m-j}$ as
$T_{m-j}=T/2^{j/2}$
for each $j=0,\ldots,m-i_{0}-1$. Note that
$T_{i_{0}+1}=T/2^{(m-i_{0}-1)/2}=O((T\log^{2+\epsilon}n)^{1/2})>O(\log^{2+\epsilon}n)$ .
From the result of Theorem 3, there exists a circuit $F_{m-j}$ which has area
$A_{m-j}=O((2^{m-j})^{2}(m-j)^{2}/T_{m-j}^{2})=O(((2^{m})^{2}m^{2}/T^{2})2^{-j})=O((n^{2}\log^{2}n/T^{2})2^{-j})$
for any $j=0,\ldots,m-i_{0}-1$.




and computes the n-bit logarithm in time
$T_{r}=T_{E}+T_{F} \cdot i_{0}+\sum_{j}T_{m-j}=O((\log n)\log\log n)+O(\log^{1+\epsilon}n)\cdot O(\log n)+\sum_{j}T/2^{j/2}=O(T)$ ,
since $T\geq O(\log^{2+\epsilon}n)$. Thus the theorem follows.
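The reason the later iterations do not inflate the totals is that both the times $T_{m-j}=T/2^{j/2}$ and the areas $A_{m-j}$ form geometric series. The worked sums below are our own illustration of this step, using only the per-circuit bounds stated above.

```latex
\begin{align*}
\sum_{j=0}^{m-i_{0}-1} T_{m-j}
  &= T \sum_{j=0}^{m-i_{0}-1} 2^{-j/2}
  \;<\; T \sum_{j=0}^{\infty} 2^{-j/2}
  = \frac{T}{1-2^{-1/2}} \approx 3.41\,T = O(T), \\
\sum_{j=0}^{m-i_{0}-1} A_{m-j}
  &= O\!\left(\frac{n^{2}\log^{2} n}{T^{2}} \sum_{j=0}^{\infty} 2^{-j}\right)
  = O\!\left(\frac{n^{2}\log^{2} n}{T^{2}}\right).
\end{align*}
```

So the last circuit $F_{m}$ dominates both sums up to a constant factor, and the product of total area and squared total time stays $O(n^{2}\log^{2}n)$.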
Next, let us consider the time-optimal, i.e. $T=O(\log n)$ , circuits for exponentials
and logarithms, with suboptimal $A$ $T^{2}$-performance.
Consider Algorithm EXP2 in the previous subsection, and choose $s=l^{\epsilon/3}$ for any
$1\geq\epsilon>0$. The circuit in Lemma 3 operates in time $T=O(\log l)$ and has area
$A=O(l^{2+(2/3)\epsilon})$. Thus we have:
THEOREM 5. For any $\epsilon>0$, there exists a circuit which computes the n-bit exponential
of an n-bit number in time $O(\log n)$ and with area $O(n^{2+\epsilon})$.
We will show the corresponding result for logarithms. Let $x$ be an l-bit binary
number and suppose $l=s^{k}$. It is easily verified that $x$ can be written as $x\simeq\eta_{1}\cdot\eta_{2}\cdots\eta_{k}$,
where $\eta_{i}$ is an $s^{i}$-bit number which has at least $s^{i-1}$ consecutive 0's immediately to the
right of the leading 1. Since
$\ln x=\ln(\eta_{1}\cdot\eta_{2}\cdots\eta_{k})=\ln\eta_{1}+\ln\eta_{2}+\cdots+\ln\eta_{k}$ ,
$\ln x$ is obtained as:
Algorithm LOG2$(x)$
Given: integers $s,$ $k$ s.t. $s\leq l$ and $k=\lceil\log_{s}l\rceil$
Input: an l-bit number $x\in[1,2)$
Output: an l-bit number $y\in[0,\ln 2)$ s.t. $|y-\ln x|<2^{-l}$
(1) begin $x_{0}:=x;y:=0$ ;
(2) for $i:=1$ to $k$ do
begin
(3) $\eta_{i}:=$ leftmost $s^{i}$ bits of $x_{i-1}$ ;
(4) $x_{i}:=x_{i-1}/\eta_{i}$ ;




Consider the Taylor expansion of $\ln\eta_{i}$,
$\ln\eta_{i}=\ln(1+h_{i})=h_{i}-(1/2)h_{i}^{2}+\cdots+(-1)^{n-1}(1/n)h_{i}^{n}+\cdots$
As is obvious from lines (3)-(4), $h_{i}<2^{-s^{i-1}}$. This means that the sum of the first $2l/s^{i-1}$ terms is
an approximate value of $\ln\eta_{i}$ with a precision of $2l$ bits. Thus we may assume that $y_{i}$
is a polynomial in $h_{i}$ of degree $2l/s^{i-1}$. Since $h_{i}$ is an $(s^{i}-s^{i-1})$-bit number, the l-bit
approximation of $\ln\eta_{i}$ can be computed in time $O(\log(s^{i}\cdot(2l/s^{i-1})))=O(\log(ls))$ and with
area $O((s^{i}\cdot(2l/s^{i-1}))^{2})=O((ls)^{2})$ (Proposition 1).
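The factorization $x\simeq\eta_{1}\cdot\eta_{2}\cdots\eta_{k}$ and the term-by-term summation of logarithms can be sketched as follows. The Python code peels off $\eta_{i}$ as a truncation of the current remainder, divides it out, and adds a Taylor polynomial for $\ln(1+h_{i})$ with about $2l/s^{i-1}$ terms. Several points are assumptions of this sketch: "leftmost $s^{i}$ bits" is read as $s^{i}$ fraction bits after the leading 1, the remainder is updated as $x_{i}=x_{i-1}/\eta_{i}$, exact rational arithmetic replaces l-bit arithmetic, and the name `log_blocks` is hypothetical.

```python
import math
from fractions import Fraction

def log_blocks(x: Fraction, l: int, s: int) -> Fraction:
    """Approximate ln(x) for x in [1, 2) to roughly l bits."""

    def taylor_log1p(h: Fraction, terms: int) -> Fraction:
        # h - h^2/2 + h^3/3 - ... up to the given number of terms
        total, power = Fraction(0), Fraction(1)
        for n in range(1, terms + 1):
            power *= h
            total += power / n if n % 2 else -power / n
        return total

    y = Fraction(0)
    rest = x                                   # x_0 := x
    i = 1
    while s ** (i - 1) < l:                    # i = 1, ..., k
        frac_bits = min(s ** i, l)
        scale = 1 << frac_bits
        eta = Fraction(math.floor(rest * scale), scale)  # truncate to s^i fraction bits
        rest = rest / eta                      # x_i := x_{i-1} / eta_i, now very close to 1
        terms = math.ceil(2 * l / s ** (i - 1))
        y += taylor_log1p(eta - 1, terms)      # add ln(eta_i) = ln(1 + h_i)
        i += 1
    return y                                   # ln of the final remainder is below 2^-l
```

After the last block the remainder lies in $[1,1+2^{-l})$, so ignoring its logarithm costs less than $2^{-l}$; this is the only error in the sketch apart from the truncated Taylor tails.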
Choosing $s=l^{\epsilon/3}$ for any $1\geq\epsilon>0$, we have $k=\Theta(1/\epsilon)$. Step (4) can be carried
out in time $O(\log l)$ using a divider with area $O(l^{2+(2/3)\epsilon})$ (see [4]). For each $i=1,\ldots,k$,
$y_{i}$ can be evaluated (line (5)) and added up (line (6)) in time $O(\log(ls))=O(\log l)$ and
with area $O((ls)^{2})=O(l^{2+(2/3)\epsilon})$. Thus the total computation time of our circuit is
$O(k\log l)=O(\log l)$ and the chip area is $O(l^{2+(2/3)\epsilon})$. We get:
THEOREM 6. For any $\epsilon>0,$ there exists a circuit which computes the n-bit loga-




Trigonometric functions such as sines, cosines and arctangents can be evaluated




comments and criticism on parallel and VLSI algorithms.
References
Sci., Carnegie-Mellon Univ., Pittsburgh, Pa. (Jan. 1979).
521-534 (1981).
Computation Time", Inform. and Control, 58, 137-156 (1983).
Comput. 72, 270-282 (1987).
163-167 (1984).
Press, New York, 151-176 (1976).
Number System", 1987 LA Symposium in Summer, (July 1987), in Japanese.
lems", SIAM J. Comput., 15-4, 994-1003 (1986).
