In this paper we investigate the design of macro-cell generators for floating-point division and square root operators. The number representation used in our operators is the IEEE-754-1985 standard for binary floating-point numbers. The design and implementation of the generators rely on a powerful multiview layout synthesis tool called GenOptim. This CAD tool is able to output a set of different descriptions for several VLSI technologies as well as FPGAs. The division and square root operators described in this paper use the signed-binary-digit representation. We first describe the operators for the significand, then we investigate the IEEE floating-point operators. Throughout the paper, wherever appropriate, we present the implementation results obtained with the GenOptim environment.
INTRODUCTION
Portability and design time are crucial issues in any system design. The procedural generator concept has proved to be of great interest since it allows designers to encapsulate their design knowledge into well-defined packages and to retrieve this information without difficulty or time penalty. Procedural generators therefore tend to become an essential part of state-of-the-art CAD frameworks. The generators are built from two distinct parts. The algorithm to be realised is specific to each generator and is described logically using virtual cells. The other part, called GenOptim, is common to all generators: it maps the virtual cells onto the target library and automatically solves the electrical and placement problems by analysing the circuit and the target cell library. Generators of addition and multiplication arithmetic operators have already been written using the GenOptim environment [1]. This methodology is now used to realise generators for division and square root operators. These operations are not so frequent in common applications, but as processor speeds increase their latency may become a penalising factor. The use of the borrow save notation and the similarity of the division and square root operations, in both fixed-point and floating-point schemes, motivated us to combine them in a single operator. This combination is appealing for infrequent operations since it saves area without compromising speed. The paper is organised as follows: section 2 describes the generator concept and the GenOptim environment. Section 3 treats the fixed-point operators and the theory behind the division, the square root and their combination, respectively. In section 4 we investigate the IEEE floating-point operators. The paper ends by presenting the implementation results and drawing conclusions from them.
METHODOLOGY
A methodology to design portable standard-cell generators has been developed [2]. This methodology is intended to ease the task of designing and implementing IEEE floating point arithmetic operators. The generator concept is supported by two main ideas. First, when designing floating point arithmetic operators, the designer faces a large set of number formats and precisions. As far as the IEEE standard [3, 12] is concerned, beyond the single precision representation over 32 bits and the double precision representation over 64 bits, there exists a huge set of number formats in which the word lengths differ. These formats, called extended formats, increase the precision of the numbers represented. The second idea deals with the approach to floating point operator design. In order to realise these operators, three main alternatives may be considered [13, 14, 15, 16]. The first is a full-custom realisation. This solution is ideal in terms of area and propagation time, as well as other factors, but it leads to huge amounts of design time and, moreover, the resulting designs are not portable. The second alternative is the realisation of the operators by means of logic synthesis from their behavioral descriptions. This solution is extremely fast and portable, but complex operators may lead to an explosion of the size of the circuit. Finally, the operators can be realised with generators using data-path or standard-cell libraries. This alternative gives the best compromise between logic synthesis and full-custom realisations.
1 GenOptim CAD Environment
Experience has shown that generator programs are quite often written by VLSI designers, as they hold the empirical knowledge better than anyone. However, their skills do not necessarily include programming and debugging: these designers have to focus on the problem at hand, not on the tools or the language they use to solve it. GenOptim has been created to quickly design efficient IEEE floating-point macro-cell generators that do not rely on particular target technologies. These macro-functions (shifters, multiplexors, priority encoders, ...) are very common in the VLSI design world and may be used in many different contexts. As depicted in Fig. 1, the circuit generation starts with a virtual description (netlist and layout of virtual cells) of the circuit. Then, GenOptim maps the virtual description to the target library and optimises the circuit.
2 GenOptim: the Tool

3 Generators Description
From simple parameters (the name of the circuit and the number of bits), the generators automatically create a netlist view, a placement view, a behavioral view, test vectors and finally a file summarising all the circuit characteristics.
The generators are built from two distinct parts. The first is specific to each generator and describes logically, using virtual cells, the algorithm to be realised. The second, called GenOptim, is common to all generators: it maps the virtual cells onto the target library and automatically solves the electrical and placement problems by analysing the circuit and the logic cell library used. GenOptim allows the designer to concentrate exclusively on the algorithm to be realised. Furthermore, it makes modifications possible without any intervention on the source code of the generator.
Consequently the developer of generators is never bothered by the electrical, placement and timing constraints.
We list below some of the most important descriptions furnished by the generators.
3.1 Placement View
Several GenOptim directives, such as row compression or slice reconfiguration, modify the initial placement in order to insert the generator in any specific topological context and to obtain homogeneous physical structures.
3.2 Netlist View
In order to enhance portability, GenOptim uses multiple drivers and is able to generate netlist and layout views for various CAD systems and technologies. One of the essential motivations behind the generator approach is the portability. So the generators use a library of virtual cells. A virtual cell is defined as a cell having a unique behavior but having many possible constructions. This allows the mapping of the generators over many different technologies and libraries.
3.3 The Datasheet View
The datasheet is a report of all timings and critical paths of the generated circuit. It is very useful when designing a generator: the user can easily and quickly see where the critical paths are and how performance and area evolve as optimisations are activated.
FIXED POINT OPERATORS
1 Division
The table below shows the notations and variables used along this part: ⊕, ∨, • denote XOR, OR and AND operators ; +, -, * , ÷ denote add, subtract, multiply and divide. i, j, k, n represent integers.
In common applications the division operation is considerably slower than addition or multiplication operations [4]. There are two approaches for the hardware implementation of division Q = A ÷ D. The Newton-Raphson method uses a series of multiplications and additions to develop an increasingly accurate approximation of the desired quotient Q. Its main interest is the use of readily available adders and multipliers [4, 5, 6, 7, 8, 9, 10, 18, 19]. The digit-recurrence approach relies on subtraction and multiplication by the radix $b$ and by the quotient digits $q_j$.
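The digit-recurrence idea can be illustrated with a minimal radix-2 sketch. It is not the borrow save array described in this paper: the partial remainder is kept as an exact rational, and the digit selection thresholds (compare the shifted remainder with +1 and -1) are an assumption chosen for operands in [1, 2). The function name `srt_divide` is hypothetical.

```python
from fractions import Fraction

def srt_divide(a, d, n):
    """Radix-2 digit recurrence with quotient digits in {-1, 0, +1}.

    a, d: Fractions in [1, 2). Returns (digits, quotient, remainder).
    Assumed selection rule: compare the shifted remainder with +1 and -1.
    """
    r = a / 2                      # initial partial remainder, |r| < 1 <= d
    digits = []
    for _ in range(n):
        shifted = 2 * r            # multiplication by the radix b = 2
        if shifted >= 1:
            q = 1
        elif shifted <= -1:
            q = -1
        else:
            q = 0
        r = shifted - q * d        # subtraction of q_j * D; keeps |r| <= d
        digits.append(q)
    # The factor 2 undoes the initial halving of a.
    quotient = 2 * sum(Fraction(q, 2**j) for j, q in enumerate(digits, start=1))
    return digits, quotient, r

if __name__ == "__main__":
    digits, q, r = srt_divide(Fraction(3, 2), Fraction(5, 4), 16)
    print(digits, float(q))
```

After n steps the identity a = d * quotient + r * 2^(1-n) holds exactly, so the quotient error is bounded by 2^(1-n). The signed digits still have to be converted to a standard binary number, which is the role of the conversion circuit.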
1.1 Cellular Array Division: Borrow Save Division
In the borrow save radix 2 notation, each signed digit $q_j \in \{-1, 0, +1\}$ is the difference of two bits, $q_j^+$ and $q_j^-$. This redundant notation [5] can be used for quotient digit selection through the test of only 3 digits of the partial remainder [6, 7] and for carry-propagation-free addition/subtraction as well. However, the quotient has to be translated from the redundant notation into a standard one.
1.2 Slice of the Divider
To simplify the reading of Fig. 2, and since we are interested in the jth slice only, let us put $R = R^{(j)} * 2^{j}$ and $S = R^{(j+1)} * 2^{j}$. The role of the head cells (Fig. 3) is to:
1. select $q_j$ and the operation to execute
2. execute this operation on the head digits
The role of the tail cells (Fig. 3) is to execute the operation dictated by the head cells on the tail digits.
1.3 Equations of Head and Tail Cells
An improvement over [8] is to perform the quotient digit selection in parallel with the processing of the three most significant digits of the remainder [9].
Since $D = 1 + \sum_{j=1}^{n} d_j * 2^{-j}$, we have $-D = -2 + \sum_{j=1}^{n} \overline{d_j} * 2^{-j} + 2^{-n}$. So the iteration $S = R - q_j * D$ can be written:
The head cell preserves the identity:
and the tail cells preserve the identity :
The bit $s_n^+$, of weight $+2^{-n}$, is connected to $q_j^+$.
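The expression for $-D$ used in this section can be checked in two lines. The derivation below is a sketch that relies only on the definition of $D$ and on $\overline{d_j} = 1 - d_j$.

```latex
\begin{align*}
\sum_{j=1}^{n} \overline{d_j}\, 2^{-j}
  &= \sum_{j=1}^{n} 2^{-j} - \sum_{j=1}^{n} d_j\, 2^{-j}
   = (1 - 2^{-n}) - (D - 1), \\
-2 + \sum_{j=1}^{n} \overline{d_j}\, 2^{-j} + 2^{-n}
  &= -2 + (2 - 2^{-n} - D) + 2^{-n} = -D .
\end{align*}
```

In hardware terms, negating $D$ amounts to complementing its fractional bits and injecting one unit in the last place, which is the role of the $+2^{-n}$ term.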
1.4 Overlapped Conversion
The quotient
In an n-bit binary subtractor, let $T_i^j$ denote the group transmit and $G_i^j$ the group generate, with $n \geq i \geq j \geq 0$.
We note $c_i$ the borrow.
$T_i^j$ means that the borrow propagates from position $j$ up to $i$, that is, $c_{i+1}$ is equal to $c_j$.
$G_i^j$ means that a borrow is generated somewhere between positions $j$ and $i$ and propagated from this location up to position $i$, and thus $c_{i+1} = 1$.
We have
The n bits of the converted quotient are given by $p_i = c_i \oplus t_i$.
For any k such that 0
as follows:
We note ∆ the operator introduced by Brent and Kung [11] :
The icon used for the ∆-cells is shown in Fig. 4.
Each wire carries two bits (T, G). It is easy to check that ∆ is associative, so the ∆-cells can be connected in many different ways and still give the same result.
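The following sketch illustrates the associativity claim and the use of ∆ to compute borrows. It assumes the standard Brent and Kung definition, (T, G) ∆ (T', G') = (T • T', G ∨ T • G'), with the left operand covering the more significant positions. The per-bit signals (transmit when the two bits are equal, generate when the minuend bit is 0 and the subtrahend bit is 1) and the function names are assumptions made for illustration, and the scan is serial rather than organised as in Fig. 4.

```python
from itertools import product

def delta(hi, lo):
    """Brent-Kung operator on (T, G) pairs; hi covers the upper positions."""
    t_hi, g_hi = hi
    t_lo, g_lo = lo
    return (t_hi & t_lo, g_hi | (t_hi & g_lo))

def is_associative():
    """Exhaustively check (x ∆ y) ∆ z == x ∆ (y ∆ z) on all bit pairs."""
    pairs = list(product((0, 1), repeat=2))
    return all(delta(delta(x, y), z) == delta(x, delta(y, z))
               for x in pairs for y in pairs for z in pairs)

def subtract_with_prefix(a, b, n):
    """Return ((a - b) mod 2**n, borrow_out), with borrows obtained from ∆."""
    def bit(x, i):
        return (x >> i) & 1
    # Assumed per-bit signals: transmit if a_i == b_i, generate if a_i=0, b_i=1.
    pairs = [(1 ^ bit(a, i) ^ bit(b, i), (1 ^ bit(a, i)) & bit(b, i))
             for i in range(n)]
    borrows = [0]                  # c_0 = 0: no borrow enters the LSB
    group = None
    for pair in pairs:             # group = (T, G) over positions i..0
        group = pair if group is None else delta(pair, group)
        borrows.append(group[1])   # c_{i+1} = G since c_0 = 0
    diff = sum((bit(a, i) ^ bit(b, i) ^ borrows[i]) << i for i in range(n))
    return diff, borrows[n]

if __name__ == "__main__":
    print(is_associative(), subtract_with_prefix(9, 12, 5))
```

Because the operator is associative, the serial loop could be replaced by any tree of ∆-cells producing the same group signals, which is what allows speed to be traded against area.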
1.5 Speed-Area Trade-Off of the Converter
The quotient is produced digit by digit, starting from the most significant one, and has to be transformed into standard notation by a subtraction. Fig. 4 shows two ways of organising the ∆-cells for the generation of carries in a 10-bit subtraction. We want all the result digits at the same time, so the slice number gives the time (expressed in ∆-cell delays) when the inputs must be ready.
2 Square Root
The principle of the square root operation does not differ much from that of division [4, 17, 18]. It only supposes that the divisor is identical to the quotient: Q = A ÷ Q. As for division, there are two classes of algorithms: the Newton-Raphson method, which uses a series of multiplications and additions, and the digit-recurrence method.
2.1 Cellular Array Square Root
The digit-recurrence square root approach relies on subtraction and multiplication by the radix $b$ and by the quotient digits $q_j$. A borrow save square root is composed of an array of carry-propagation-free adders/subtractors, and a conversion circuit that transforms the quotient from a borrow save notation to a standard binary representation. In radix 2 the iteration becomes:
Note that in borrow-save we have:
The overall skeleton of a borrow save square root operator is the same as that of the non-restoring operator. The difference is that in the former no propagation exists and the next operation is determined by examining the MSBs of the partial sums (Fig. 6).
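As an illustration of the non-restoring skeleton mentioned above, here is a minimal sketch of a radix-2 non-restoring square root on exact rationals. It is not the borrow save operator of this paper: the residual is kept in non-redundant form, the digit is chosen from its exact sign, and the names and the operand range [1/4, 1) are assumptions.

```python
from fractions import Fraction

def nonrestoring_sqrt(x, n):
    """Radix-2 non-restoring square root with digits in {-1, +1}.

    x: Fraction in [1/4, 1). Returns (digits, q, w) where q approximates
    sqrt(x) to within 2**-n and w = 2**n * (x - q*q) is the final residual.
    """
    q = Fraction(1)                # Q^(0), within 1 of the true root
    w = x - q * q                  # residual w^(0) = x - Q^(0)^2
    digits = []
    for j in range(1, n + 1):
        digit = 1 if w >= 0 else -1            # sign test selects the digit
        step = Fraction(digit, 2**j)
        # w^(j) = 2*w^(j-1) - digit * (2*Q^(j-1) + digit * 2**-j)
        w = 2 * w - digit * (2 * q + step)
        q += step
        digits.append(digit)
    return digits, q, w

if __name__ == "__main__":
    digits, q, w = nonrestoring_sqrt(Fraction(1, 2), 20)
    print(float(q))
```

The recurrence has the same shape as the division one, with the divisor replaced by a term built from the partial root, which is why the intermediate values of Q are needed at every step.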
2.2 The TRC Cell
The TRC cell carries out the construction of Q:
Assuming that $Q^{(j)} = \sum_{i=0}^{j} p_i * 2^{-i}$ at step $j$, then
Figure 7. The ∆-cell and TRC cell equations
The square-root converter has the same structure as in Fig. 4, the cells being slightly different. This converter cannot be optimised as proposed in section 3.1.5 since, for the square root, not only the final result $Q^{(n)}$ is needed but the intermediate results $Q^{(i)}$ are used as well.
2.3 Implementation Results
A fixed point square root operator has been implemented using the GenOptim tool. Fig. 16 and Fig. 17 show two versions of the layout of fixed point operators of the same size. In the first one no optimisation has been activated. We can notice the odd triangular shape of the layout, which is due to the arrangement of the cells that build up the complete operator. The second figure shows a more compact version of the layout, obtained by activating placement optimisations when generating the operator. The optimisation proves to be of great interest since it compacts the circuit.
3 Square Root and Division Combined
Based on the architectures of borrow save division and square root, we designed a combined operator for both of them. The circuit comprises a regular array of cells, fundamentally identical to the cells of the separate operators, so only a few multiplexors are introduced. The similarities of the recurrences for division and square root are apparent, so their combination turns out to be an easy task [17]. Considering the frequencies of occurrence of a division or a square root and the probability of soliciting the system to perform the two operations at the same time, we are convinced that the combination of the two operators is of considerable value. The combination will reduce the silicon area, and thus the price, and enhance the speed performance of the system.
IEEE FLOATING-POINT OPERATORS
The IEEE floating-point operators must be created with generators because the precision (the width of the exponent and significand in bits) is not fixed. The precision will be chosen following the needs of the application. According to Fig. 8, an IEEE operator is performed in four stages, which are the prenormalisation stage, the operation stage, the renormalisation stage, and the exception computation stage.
The prenormalisation stage consists of the effective preparation of operands A and B before the operation is executed. The renormalisation stage takes the result of the operation and codes it in the IEEE standard format. This stage takes into account the rounding mode. Finally, during the exception handling stage, some flags may be set whenever an exception condition has been detected.
Figure 8. Main stages of an IEEE operator.
1 IEEE Division
1.1 Architecture
An IEEE division is more complex than a fixed point division, since it includes operand prenormalisation and renormalisation, exponent calculation and exception handling. The prenormalisation stage involves the alignment of the significands before the division operation is carried out. The exponent stage computes the non normalised result exponent and delivers some other signals used by the other stages. The normalisation operator adjusts the exponent and significand according to the rounding mode to fit the IEEE standard. Finally, the exception handling stage sets some flags whenever a null value, an invalid operation or operand, or overflow or underflow conditions have been met. See Fig. 9.
Figure 11. Architecture of an IEEE floating point square root operator.
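The stages of an IEEE division can be sketched in software for the single precision format (8-bit exponent, 23-bit fraction). The sketch below is an illustration under stated assumptions, not the generated operator: it handles normal operands and normal results only, uses round to nearest even, replaces the cellular array by an integer division, and signals every other case with a Python exception instead of IEEE flags. All function names are hypothetical.

```python
import struct

BIAS = 127  # exponent bias of the single precision format

def f32_bits(x):
    """Bit pattern of x rounded to single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_f32(b):
    """Single precision value encoded by the 32-bit pattern b."""
    return struct.unpack(">f", struct.pack(">I", b))[0]

def fdiv32(a_bits, b_bits):
    """Divide two normal single precision numbers given as bit patterns."""
    sa, ea, fa = a_bits >> 31, (a_bits >> 23) & 0xFF, a_bits & 0x7FFFFF
    sb, eb, fb = b_bits >> 31, (b_bits >> 23) & 0xFF, b_bits & 0x7FFFFF
    if not (0 < ea < 255 and 0 < eb < 255):
        raise ValueError("this sketch only handles normal operands")
    # Prenormalisation: restore the hidden bit of both significands.
    ma, mb = (1 << 23) | fa, (1 << 23) | fb
    # Exponent stage: non normalised result exponent.
    e = ea - eb + BIAS
    if ma < mb:                    # quotient would fall in (1/2, 1)
        ma <<= 1
        e -= 1
    # Operation stage: 24 result bits plus one guard bit; the remainder
    # plays the role of the sticky bit.
    q, rem = divmod(ma << 24, mb)
    sig, guard, sticky = q >> 1, q & 1, rem != 0
    # Renormalisation: round to nearest, ties to even.
    if guard and (sticky or (sig & 1)):
        sig += 1
        if sig == 1 << 24:         # rounding overflowed the significand
            sig >>= 1
            e += 1
    # Exception stage: overflow and underflow are out of scope here.
    if e <= 0 or e >= 255:
        raise OverflowError("result is not a normal number")
    return ((sa ^ sb) << 31) | (e << 23) | (sig & 0x7FFFFF)

if __name__ == "__main__":
    print(bits_f32(fdiv32(f32_bits(1.0), f32_bits(3.0))))
```

The same skeleton (unpack, operate on significands, adjust the exponent, round, check exceptions) applies to other precisions by changing the field widths and the bias.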
An IEEE square root is more complex than a fixed point square root, since it includes operator prenormalisation and renormalisation, exponent calculation and exception handling. The prenormalisation stage involves the alignment of the significand before the square root operation is carried out. The exponent stage computes the non normalised result
2.2 Practical Results
A single precision IEEE floating-point operator has been generated and validated. The table below summarises some of its characteristics. (Fig. 12 
3.1 Architecture
The realization of the separate operators for division and square root encouraged us to consider the design of a combined IEEE floating point operator. This is of course a harder challenge. A careful study of the separate operators shows that the same hardware can be reused for both operations. Only a very few multiplexors (taking very little space) have been introduced in the combined operator.
We think that there is no need to lay down every single detail of the overall architecture of the IEEE combined operator. The main idea to keep in mind is that the skeleton of an IEEE operator is the same for any operation: we always encounter the now traditional stages of the IEEE operators, i.e., prenormalization, exponent, normalization and exception handling (Fig. 13). The signal divrac selects the operation to be carried out (1 for division and 0 for square root).
Figure 13. Architecture of an IEEE floating point combined operator for division and square root.
3.2 Practical Results
A single precision IEEE floating-point operator has been generated and validated. The table below summarizes some of its characteristics (Fig. 14).

operation                     transistors nb.   size (µm²)    propagation delay
8:23 divider & square root    60000             14000x5000    340 ns

Figure 14. Characteristics of a single precision combined operator for division and square root
A layout of the IEEE floating point operator for combined division and square root is depicted in Fig. 18.
CONCLUSION
This paper has considered the design and implementation, under the GenOptim environment, of portable standard-cell generators of arithmetic operators satisfying the IEEE 754 standard for binary floating-point arithmetic. The GenOptim VLSI CAD tool is of great value as it allows the rapid and efficient design of macro-functions that can be reused in larger designs. We believe that, in light of the production of Intellectual Property (IP) cores, this tool will be of great interest. The motivation behind the realisation of the combined operator is the similarity of the division and square root operations and the borrow save notation used in both of the cellular arrays that implement the two operators. The combined operator offers the advantages of speeding up the overall system and a considerable gain in area, and thus cost.
Figure 15. Implementation results
Our contributions in the field of computer arithmetic and VLSI design and architectures are the GenOptim tool, the implementation of algorithms that had so far remained theoretical, and the combination of the division and square root operations in a single operator, in both fixed-point and floating-point schemes.
