A fast floating-point square-rooting routine for the 8080/8085 microprocessors by Chamrád, Valentin
Kybernetika
Valentin Chamrád
A fast floating-point square-rooting routine for the 8080/8085 microprocessors
Kybernetika, Vol. 19 (1983), No. 4, 335--344
Persistent URL: http://dml.cz/dmlcz/125096
Terms of use:
© Institute of Information Theory and Automation AS CR, 1983
Institute of Mathematics of the Academy of Sciences of the Czech Republic provides access to digitized
documents strictly for personal use. Each copy of any part of this document must contain these
Terms of use.
This paper has been digitized, optimized for electronic delivery and stamped with
digital signature within the project DML-CZ: The Czech Digital Mathematics Library
http://project.dml.cz
KYBERNETIKA-VOLUME 19 (1983), NUMBER 4 
A FAST FLOATING-POINT SQUARE-ROOTING ROUTINE 
FOR THE 8080/8085 MICROPROCESSORS 
VALENTIN CHAMRÂD 
A speed-oriented implementation of the Newton-Raphson algorithm is described, reducing 
the worst-case execution time to the level of standard floating-point multiplication and thus 
supporting a wider use of square-root filters in microprocessor-based self-tuning controllers. 
1. INTRODUCTION 
Square-root is a function for which numerous numerical methods have been 
developed. In most math packages for microprocessors, simple iterative methods 
have been used, as no special demands for speed — nor even for accuracy in some 
cases - are expected: e.g. in [ l ] and [2] the execution time of square-rooting is 
approx. 2-5 times longer than that of multiplication, and in [3] a quintuple error 
limit compared with other operations is accepted. Some floating-point packages 
do not support this function at all leaving its evaluation to user defined programs 
(e.g. W). 
In the floating-point subroutine package for the Intel 8080/8085 microprocessors 
[6] developed in the Institute of Information Theory and Automation in Prague, 
the speed of operation has been strongly emphasized; and as a fiequent use in soft-
ware for self-tuning controllers with square-root filters has been expected, there was 
a special demand for a fast square-rooting subroutine with the same accuracy 
( + 1 LSB1) as with all the other operations. It took a considerable effort to match 
this condition, and every promising method of accelerating the calculation was 
tested — even empirical or intuitive; special testing programs were developed for 
this purpose, checking the real deviation of the square-root returned by the sub-
routine under test for all the 32 k significantly different input values and printing 
those values yielding results with an error exceeding a preset limit of 0-5, 0-75 or 
1 LSB only. 
1 The abbreviations MS and LS will be used for most significant and least significant respec-
tively in this paper; in connection with them, B will be reserved for bit, while byte will not be 
abbreviated. 
335 
2. NUMBER FORMAT 
The first important step to improve the performance of a floating-point subroutine 
is to realize some special properties of the number format used; in our case, this is 
defined for every representable number as 
(1) x = a.2b, 0-5 < \a\ < 1 > . 
where a (mantissa) is a 16 bit FRACTION number in two's complement form, 
and b (exponent) is a 7 bit INTEGER in the "excess-64" code, i.e. with an added 
offset of 64 to avoid negative values, so that the real value stored in the exponent 
byte is 
(la) b' = b + 64, 0 g V <. 127 
which enables us to use the MSB for overflow/underflow detection. As explained in 
[6], the precision of 0003% and the range of representable numbers approx. 
+ 3 . 10" 1 9 to + 10 + 1 9 proved to be quite sufficient for most engineering applications; 
on the other hand, the achievable execution speed of arithmetic subroutines is much 
higher compared to longer formats due to the possibility of using register instructions 
only for most operations. 
With respect to square-rooting, the first important thing to realize is that we deal 
with a product of two numbers, the second of them being a power of two; the opera-
tion thus can be simplified by a conversion — may be fictive only — to an unnorma-
lized format with the next higher even exponent, which can be square-rooted by an 
integer division by 2. If we denote the input operand x and the result y, then 
(2) b, = I N T [ i ( 6 , + l)] 
where INT denotes integer part of the expression in square brackets. In the format 
used, the result exponent will be computed as 
(2a) b'y = INT \}{b'x + 65)] 
Division by the shift right (RAR) instruction yields the Carry bit (LSB of the sum 
in parentheses) representing the directive for denormalizing the mantissa; we shall 
see later that a different treatment of mantissa instead of real denormalization will 
be more useful. A simple analysis of the limit values of b'y shows that neither overflow 
nor underflow can occur; no final testing of exponent will therefore be needed. 
3. THE ALGORITHM AND ITS CODING 
The square-rooting algorithm proper will then operate on numbers in the range 
(3) 0-25 <. ax < 1 
336 
only; the results shall lie within the range of 
(4) 0-5 _ ay < 1 
and that means that they will be automatically normalized; consequently, no final 
normalization will be needed. 
Halving of the result range, together with the same resolution of 15 bits for both 
the input and result values and with the nonlinearity of the function, causes that we 
shall get the same results for two or even three adjacent input values, which should 
not be considered erroneous. 
For square-rooting of mantissa, we have adopted the Newton-Raphson iteration 
formula, used in [2] and [3] as well, which in the ith iteration computes the new 
approximation yi+l as 
(5) yi+i = l(yi + j 
i.e. as the mean value of the old approximation y, and the quotient of the input 
value and the old approximation. For the known deviation of the ith approximation 
(6) Ay, = y - yt 
where y is the correct value of the square root, the deviation of the next step can be 
estimated as 
W *,,„ = =&*-. 
2(y - Ayi) 
As a rule, in conventional computers the iteration cycle starts for simplicity with 
y0 = x, and the iteration process is stopped when the difference between two succes-
sive values of y, is lower than the accuracy required. A similar method — used in our 
testing programs — determines the accuracy of the estimate using the difference 
between yt and the quotient computed when evaluating formula (5); using (6), this 
difference d{ equals 
(8) d-, = yt- - = y + Ayt -I— - + Aqt 
yt \y + Ayt 
where Aqt is the truncation error of the quotient. For Ayt < v, we can approximate 
(8a) di - _ _ _ _ _ + A y , ) - A g . ( y + A y , ) ^ 2 ^ _ ^ 
y + Ayt 
and if we assume Aq, to be small enough (which is satisfied by extended precision 
in our test programs), we can take 
(8b) Ayt m id, 
337 
Note that the last but one approximation is tested here, so that an accuracy exceeding 
the precision of the format used can theoretically be achieved with the final result. 
The possibilities of reducing the overall execution time of this iterative process 
comprise both reducing the execution time of a single iteration cycle and reducing 
the total number of iterations. For the latter way, the only means of reduction is 
a better original estimate, which could be constructed simply enough. Practically 
the choice is limited to a linear function, as any more complicated function (e.g. 
polynomial) would consume more time than another iteration cycle. Let us remind 
that this estimate should be constructed using a still normalized mantissa and the 
Carry bit representing an even or odd exponent; in other words, the Carry bit tells 
us whether the mantissa belongs to the "lower octave" of operands described by 
(9) Carry = 0 , 0-25 ^ xL < 0-5, xL = 0-5aL 
or to the "higher octave" with 
(10) Carry = 1 , 0-5 ^ xH < 1 , xH = aH 
We found advantageous to choose different estimates for each octave: in fact, we 
used the results of the first iteration, taking the known correct values in the end 
points of interval (3) as primitive estimates, but we used a direct method to construct 
them. 
For the upper octave, we used y0H = xH (correct for x = 1), and from (5) and (10) 
we obtained 
(11) yia = i (xB + -5£\ = 0-5a„ + 0-5 
Similarly, we used y0L = 2xL (correct for x = 0-25) for the lower octave and obtained 
from (5) and (9) 
(12) y1L = i (lxL + -—) = 0-5aL + 0-25 
Equations (11) and (12) can be interpreted geometrically as equations of tangents 
touching the square root curve in the end points of the interval (3). Thus we get an 
approximation by a broken line (Fig. 1) with a maximum deviation in the breaking 
point between both octaves 
Almax = 0-75 - 7(0-5) = 0-0428 
i.e. approx. 6 1 % of the correct value. 
The construction of the estimate is extremely simple, as the additive constant only 
depends on the value of the Carry bit after the calculation of result exponent; 
a simple logic function has been adopted for the realisation (see the listing at the end, 
lines 27 through 35). 
338 
An advantageous side effect of this estimate is that it always holds 
(13) ylL ^ aL and yiH > aH 
so that the information on the real exponent (or "octave affiliation") of the input 
operand need not be stored for the calculation of the iteration formula (see later). 
0.3 0.4 0.5 0.6 0.7 0.8 
Lower octave Upper octave 
Fig. 1. Square root and its approximations. 
For this approximation, we can estimate the maximum error after first iteration 
cycle using (7) as 
A2max = 0-0014 « 0-2% 
and after the second cycle 
A3max = 0-0000013 tts 0-0002% 
which exceeds already the precision of our floating-point format; a constant number 
of two iteration cycles is fully acceptable and helps to reduce the execution time of 
each cycle by omitting the test for the accuracy of the result. From this point of view, 
the choice of a better first estimate can be considered useless, as even the best linear 
approximation — by an intersecting line with symmetrical deviations - might 
reduce the max. error At to appr. 1-5% only, which would yield an accuracy of 
0-012% after the first iteration cycle, and consequently would not enable any further 
reduction of the number of iterations; the construction of any nonlinear approxima-
tion would evidently consume more time than one iteration cycle saved. 
By this important modification we entered the second group of methods, i.e. 
reducing the execution time of a single iteration cycle, with our next attention concen-
trated on the division in (5) as the most time-consuming operation. However queer it 
339 
may sound, the most important decision was not to handle the iterations as a loop and 
not to use standard division subroutines, and to reduce instead the precision of com-
puting according to the expected accuracy in the respective iteration cycle. In the first 
cycle, 10 bits are sufficient for the accuracy calculated above, and — on conditions given 
later — even 8 bits (with the precision of 0-4%) will maintain the accuracy of 0-0016% 
in the next iteration, which still exceeds almost twice the accuracy needed for the 
final result. A special division routine was therefore adopted for the first division: 
the MS bytes only enter the operation, the LS byte of dividend being replaced by 
its MSB followed by the mean of all possible values of the remainder (i.e. 07FH); 
in the program, this is realized by shifting in trailing ones into the remainder instead 
of zeros except of the fiist cycle. Full 8 bits are calculated and shifted right before 
addition of the first estimate. 
For both divisions, due to (13) the end-of-loop is tested for the normalized format 
of result only, i.e. one left shift of dividend is added if the starting value of divisor 
is greater; as explained before, this can — and always will — occur with the upper 
octave operands only. This simplification requires an added precaution for the first 
iteration: as the difference may not appear in the MS bytes, a test for zero result 
of the first subtraction is unavoidable, causing a skip of the whole first iteration if 
true. In this case, the argument lies very close to 0-25 or 1 (within 2"6), and the 
corresponding error of the first estimate is less than 0-05%, so that one iteration is 
fully sufficient. 
For the second iteration, the kernel of the standard division subroutine only was 
adopted, thus enabling to omit all the unnecessary parts (such as testing of signs and 
zero values of operands, exponent operations etc.); two calls to the internal division 
loop FTDSR of the FTAR.LIB package (appended to the program listing for 
reference) reduce the extra memory requirement to an acceptable extent. This enabled 
us a different testing of operands to be incorporated; with the second division, the 
equivalence of operands means that the estimate is correct, and even the second 
iteration is skipped. 
The final result of the second iteration is rounded using the shifted-out bit of the 
final division by two. This is important not only to maintain the accuracy (the limits 
of ± 1 LSB would be met even with mere truncation), but to ensure the stability 
of a test loop invoking square root and square in turn: due to the not unique assign-
ment of values mentioned above, the loop will reach a stable pair of values not later 
than in the second repetition, while with the final truncation it would produce a se-
quence of continuously decreasing values. An analysis of this type of numerical 
instability exceeds the scope of this paper. 
The flow diagram of the subroutine described here is given in Fig. 2 and the 
complete listing of the program in Fig. 3. 
340 
4. COMPARISON WITH OTHER METHODS 
The effect of matching the algorithm, its coding and the instruction set of the given 
microprocessor can be demonstrated by comparison of the worst-case execution 
times and memory requirements of four types of programs: 
a) A user program created by mechanical coding of the Newton-Raphson formula, 
using standard floating-point arithmetic subroutines, the primitive initial estimate 
y0 = x and the condition yi + 1 = yt for end of iteration, would need 42 bytes 
of memory and execute in 1-8 ms per iteration2, i.e. in 6 to 650 ms approx. de­
pending on the size of input numbers. 
Г RET ) 
У l=0.5a+Ooг5 
=0 s ' ( b x + 6 5 ) ^ 
v MÒD г ^ 
4 = 1 
У l =0.5a+0.5 
ROUNDING УЗ = У2 
SET FLAGS 
( RET ") 
Fig. 2. Flow diagram for square root. 
2 MHz clock frequency assumed in all compared cases. 
341 


















0000 47 17 
0001 EB 18 
0002 ftF 19 
0003 B2 20 
0004 Fft7200 C 21 
0007 C8 22 
0008 78 23 
0009 C641 24 
OOOB 1F 25 
0000 F5 26 
OOOD 9F 27 
OOOE EE40 28 
0010 E6C0 29 
001? 82 30 
0013 1F 31 
0014 47 32 
0015 7B 33 
0016 1F 34 
0017 4F 35 
0018 7A 36 
0019 90 37 
OOIЛ FA2300 C 38 
001D CftЗEOO C 39 
0020 C32800 C 40 
0023 7B 41 
0024 17 42 
0025 7A 43 
0026 17 44 
0027 90 45 
0028 2Ю2F8 46 
002B 17 47 
002C B8 48 
002D FA3200 l : 49 
0030 23 50 
0031 90 51 
0032 29 52 
0033 DA2B00 C 53 
0036 7D 54 
0037 OF 55 
0038 80 56 
0039 1F 57 
OÓЗA 47 58 
003B 79 59 
OOЗC 1F 60 
THIS SUBROUTINE ACCEPTS THREE-BYTE F-P INPUT OPERANDS 
IN D-E-B (ENTRY FTSRX) OR H-L-A (ENTRY FTSRY) REGISTERS, 
RETURNS SQUARE ROOT IN D-E-B REGISTERS? MAX.ERROR < 1 LSB, 
MAXIMUM (WORST CASE) EXECUTING TIME 1479 CLOCK PERIODS. 
VftLENTIN CHAMRftD, MICROELECTRONIC SYSTEMS DEFT.. 
INSTITUTE OF INFORMATION AND AUTOMATION THEORY, 
































































ENTRY FOR INPUT OP IN H-L-A REGS 
ENTRY FOR INPUT OP IN D-E-B REGS 
SET FLAGS ACCORDING TO INPUT OP 
BRANCH FOR NEGATIVE OPERANDS 
EXIT FOR ZERO OPERAND 
ATUi BIAS TO EXPONENT 
DIVIDE BY 2, LSB TO CARRY 
STORE RESULT EXPONENT ON STACK 
REPEAT CARRY IN ALL ACCU BITS 
INVERT 6-TH BIT 
CLEAR ALL LOWER BITS 
ADD HI BYTE OF INPUT OPERAND 
DIVIDE BY 2 AND 
MOVE FIRS! ESTIMATE TO B-C. REGS 
SUBTRACT HI BYTES OF 
INPUT OPERAND AND ESTIMATE 
SKIP 2 LINES IF RESULT NEGATIVE 
SKIP FIRST ITERATION IF ZERO 
GO TO FIRST ITERATION IF POSITIVE 
SHIFT INPUT OPERAND LEFT 
ftiD 
B ! AND REPEAT SUBTRACTION OF HI BYTES 
H,OF802H; INITIALIZE RESULT REGISTERS 
; FIRST DIVISION LOOPrSHIFT REMAINDER 
h ! TRY IF SUBTRACTION POSSIBLE 
S+5 J SKIP NEXT 2 INSTRUCTIONS IF NOT 
H ! SET RESULT BIT 
B J SUBTRACT ESTIMATE 
H i SHIFT RESULT, TEST BIT TO CARRY 
TEST FOR END OF LOOP 
RESULT TO A 
SHIFT BACK 
ADD FIRST ESTIMATE 
SHIFT TO DIVIDE BY 2 AND 
REPLACE FIRST ESTIMATE BY THE 





Fig. 3a. Program listing for square root. 
ISIS-II 8080/8085 MACRO ASSEMBLERr V3.0 FTSRT PAGE 
























































61 мov CrA 
62 FTSR2: XCHG 
63 XRA ft 64 SUB C 
65 MOV ErA 
66 SBB A 
67 SUB B 
68 MOV DrA 
69 DAП n 
70 JNC FTSRЗ 
71 мov ArH 
72 ORA l. 
73 JNZ FTSR4 
74 ПAП B 
75 XCHB 
76 JMP FTSR5 
77 FTSRЗ: ПAD B 
78 ПAП H 
79 ПAП n 
80 FTSR4: MVI ArOSH 
81 CЛLl. FTПSR 
82 PUSH PSU 
83 MVI ftrOlH 
84 CALL FTDSR 
85 POP H 
86 MOУ Lrft 
87 ПAD B 
88 MÜV ArH 
89 RAR 
90 MOУ DrA 
91 MOУ ArL 
92 RAR 
93 ACI 0 
94 MOV ErA 
95 FTSR5: MVI ArO 
96 AПC n 
97 MOУ ПrA 
98 PÛP B 
99 RET 
100 
101 FTSNH: MVI ftr42H 
102 CALL FTECH 
103 FTSUR: XRA A 
104 MOУ BrA 
105 MOV П,A 
106 MOУ ErA 
107 RET 
108 
109 ПHTERNftL ШVISIÜN 
110 
111 FTПl: RAL 
112 RC 
113 FTDSR: ПAD H 
114 ПAD D 
115 JC FTПl 
116 ПAП B 
117 AПD A 
118 JNC FTDSR 
119 RET 
MOVE INPUT OP TO H-L 
PREPARE I.DE3 := -CBCI 
SUBTRACT 2-ND ESTIMATE FROM IN.OP 
BRANCH IF RESULT NEGATIVE 
GO TO SECOND DIVISION IF POSITIVE 
RESTORE ESTIMATF IF CORRECTr 
MOVE TO [i-E 
AND GO TO FINAL FLA6 SETTING 
RESTORE INPUT OPERAND 
SHIFT LEFT 
REPEAT SUBTRACTION OF ESTIMATE 
INITIALIZE RESULT REGISTER 
FIRST PART OF SECOND DIVISION 
PUSH FIRST BYTE OF RESULT ON STACK 
INITIALIZE RESULT REGISTER 
SECOND PART OF SECOND IHVISION 
POP FIRST BYTE OF RESULT 
APPEND SECOND BYTE FROM ACCUMULATOR 
ABU SECONH ESTIMATE 
SHIFT RIGHT TO DIVIDE BY 2 AND 
MOVE THIRH ESTIMATE TO D-E 
Aim SHIFTEH OUT BIT TO ROUND 
ADD CftRRY FROM LO BYTE AND SET 
FLAGS ACCORDING TO FINAL RESULT 
POP RESULT EXPONENT FROM STACK 
ERROR CODE FOR NEC.INPUT OPERANDS 
INVOKE USER DEFINED ERROR HANDLER 
SET DEFAULT RESULT ZERO AND 
CORRESPONDING FLAGS 
OF FTDIV SUBROUTINE INVOKED ABOVE ! 
r SHIFT IN RESULT BIT FROM CARRY 
i TEST FOR END OF DIVISION LOOP 
r SHIFT DIVIDEND (REMAINDER) LEFT 
r SUBTRACT DIVISOR 
r BRANCH IF RESULT POSITIVE OR ZERO 
r RESTORE DIVIDEND IF RESULT NEGATIVE 
r SHIFT RESULT LEFT ADDING ZERO LSD 
i TEST FOR END OF DIVISION LOOP 
Fig. Зb. Program listinp for square root (conťd) 
b) The same program, but with a better initial estimate (halving the exponent) and 
testing the difference (8) for end of loop would need appr. 60 bytes and execute 
in appr. 5 ms. 
c) A BASIC-oriented subroutine published in [2], using separate treatment of ex-
ponent and mantissa, with the same initial estimate and fixed number of iterations 
as described above, but standard arithmetic subroutines for mantissa operations, 
needs 68 bytes and executes in 4099 clock periods, i.e. slightly more than 2 ms. 
d) The subroutine described here needs 117 bytes (FTDSR not included) and executes 
in 1479 clock periods, i.e. 0-74 ms, which is 8-5% only more than needed for the 
speed-oriented multiplication subroutine described in [6], and even 14% less than 
for a standard multiplication (such as that described in [2] with a maximum 
of 0-861 ms). 
5. CONCLUSION 
This subroutine seems to be attractive for use in self-tuning controllers with 
square-root filters, because its execution time lies near the geometric average and 
thus fills the gap between that of the standard software solution and of a peripheral 
hardware unit (such as iSBC 310 with 0-205 ms), the price of which is much higher 
than that of the necessary extension of memory and seems to be unaffordable for 
usual controller applications. 
(Received September 27, 1982.) 
R E F E R E N C E S ' 
[1] S. N. Cope: Floating-point arithmetic routines and macros for an Intel 8080 microprocessor. 
Oxford University Engineering Laboratory Report No. 1123/75. 
[2] D. W. Clarke, S. N. Cope and P. J. Gawthrop: Feasibility study of the application of micro-
processors to self-tuning controllers. O.U.E.L. Report No. 1137/75. 
[3] KIMath Subroutines Programming Manual (KIM Mathematics Subroutines). MCDS 
Microcomputer Datensysteme GmbH, Darmstadt 1977. 
[4] C. B. Falconer: Falconer floating point arithmetic. Dr. Dobb's Journal of Computer Calisthe-
nics & Orthodontia, Vol. 4, No. 33, 4—14 and No. 34, 16—25. 
[5] D. E. Knuth: The Art of Computer Programming II — Seminumerical Algorithms. Addison-
Wesley, Reading, Mass. 1971. 
[6] V. Chamrad: A speed-oriented floating-point subroutine package for the Intel 8080 micro-
processors. In": Preprints SOCOCO '79, The 2nd IFAC/IFIP Symposium on Software for 
Computer Control, Vol. I., Prague 1979. 
Ing. Valentin Chamrad, CSc, Ustav teorie informace a automatizace CSAV {Institute of Informa-
tion Theory and Automation — Czechoslovak Academy of Sciences), Pod voddrenskou v&zi 4, 
182 08 Praha 8. Czechoslovakia. 
344 
