This paper proposes a divider structure that combines a novel self-timed ring structure with a carry-propagation-free division algorithm. The self-timed ring structure enables the divider to compute at a speed comparable to that of previously designed dividers while using less silicon area. Exploiting the carry-propagation-free division algorithm improves performance further. We designed a layout of a 54-bit divider using 1.2 μm CMOS technology and measured its area and speed. The divider takes 135 ns for a worst-case division and occupies 5.7 mm² of silicon area.
I. Introduction
There are two commonly used approaches to implementing a hardware divider: sequential and combinational [1], [2]. In general, the sequential approach requires less silicon area at the expense of slower operation and possible performance limitations caused by employing a clock. A totally combinational array is faster but requires a large silicon area. The divider proposed in this paper adopts the sequential approach for area efficiency. Even with the sequential approach, we can achieve high speed by exploiting self-timed logic and the redundant signed digit (RSD) number system [3] at the same time.
Self-timed logic has recently been used to reduce malfunctions caused by clock skew and to increase system throughput [4], [5]. A divider that exploits self-timed logic has already been proposed in [6]. Its authors used a self-timed ring structure to obtain a significant reduction in chip area while reaching a speed comparable to that of a fully combinational array structure. They implemented an SRT (named after Sweeney, Robertson, and Tocher) division [7] chip that generates 54-bit floating-point mantissas with a 5-stage ring.
The RSD number system has often proved useful for speeding up arithmetic operations [3], [8], [7]. It has also been applied to division algorithms [9], [10]. These algorithms adopt the RSD number system to exploit carry-propagation-free arithmetic, thereby obtaining high-speed addition and subtraction operations [9]. In particular, the algorithm in [10] uses only one full adder as the processing element for each bit, which lets us use minimum hardware per bit.
In this paper, we propose a ring-structured self-timed divider that combines the above two techniques. By employing a self-timed ring structure, it can be implemented on a significantly smaller silicon area than an array-type divider. With carry-propagation-free arithmetic in the RSD number system, the addition/subtraction required in each iteration of the division can be faster than in the SRT division adopted in [6]. Moreover, the structure we adopted allows the number of stages in the ring to be reduced, thereby reducing the area further without performance degradation [11], [12].
II. Division Algorithm and Divider Structure
The algorithm adopted here is an SRT division algorithm that carries out conventional binary nonrestoring division through repeated add/subtract and shift-left operations. The division algorithm can be described by
$$R_{j+1} = 2R_j - q_j D \tag{1}$$
where $R_j$ ($j = 1, 2, \ldots$) is the $j$-th partial remainder, $R_0$ is the dividend, $D$ is the divisor, and $q_j$ is the $j$-th quotient bit. As described in [10], the value of $q_j$ is determined uniquely by examining the three most significant digits (r where `0' denotes ones' complement.
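As a rough illustration of the add/subtract-and-shift recurrence described above, the following Python sketch performs radix-2 SRT division on exact fractions. Several things here are assumptions made for illustration and are not taken from the paper: the recurrence is written as `r = 2*r - q*divisor`, the divisor is normalized to [1/2, 1), and the quotient digit is chosen by the classical comparison of the shifted remainder against ±1/2 rather than by examining the three most significant RSD digits as in [10]. The function names `srt_divide` and `digits_to_value` are hypothetical.

```python
from fractions import Fraction


def srt_divide(dividend, divisor, n_digits):
    """Radix-2 SRT division sketch with quotient digits in {-1, 0, 1}.

    Assumes 1/2 <= divisor < 1 and |dividend| <= divisor (illustrative
    normalization, not the paper's operand format).
    Returns (digits, final_remainder).
    """
    half = Fraction(1, 2)
    if not (half <= divisor < 1) or abs(dividend) > divisor:
        raise ValueError("operands are outside the assumed normalized range")
    r = Fraction(dividend)
    digits = []
    for _ in range(n_digits):
        shifted = 2 * r              # shift-left step
        if shifted >= half:
            q = 1                    # subtract the divisor
        elif shifted < -half:
            q = -1                   # add the divisor
        else:
            q = 0                    # shift only
        r = shifted - q * divisor    # next partial remainder
        digits.append(q)
    return digits, r


def digits_to_value(digits):
    """Interpret signed digits as the fraction sum(q_j * 2^-(j+1))."""
    return sum(Fraction(q, 2 ** (j + 1)) for j, q in enumerate(digits))
```

After `n` iterations the signed-digit quotient differs from the exact quotient by the final remainder scaled by `2**-n / divisor`, so the error is bounded by `2**-n` under the assumed normalization.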
Carry propagation in the conventional addition/subtraction used to compute equation (1) would be a bottleneck that limits the divider's speed. One solution to this problem is to use a redundant representation for the operands and the results [13], [3]. In this paper, we use the RSD number system for the dividend and partial remainders. For the divisor, we use the ordinary nonredundant binary representation. The RSD addition uses only type-1 full adders [14] configured in parallel and requires no carry propagation. The operation of the type-1 full adder is shown in Fig. 1. The figure also shows the binary encoding ($sx$, $ax$) of the RSD number $x$. Note the redundant representation of 0 in the RSD number system. is computed using the type-1 full adder by the following equations (refer to equation (3) We put these two cases in a group because they generate the same output. Proceeding this way, we can obtain a grouping as shown in the table.
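The carry-propagation-free addition of a binary operand to an RSD operand, as described earlier in this section, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: each RSD digit is modeled as a pair of bits `(p, n)` with digit value `p - n`, so 0 has the two encodings `(0, 0)` and `(1, 1)`. This encoding is an assumption for illustration; the paper's own encoding ($sx$, $ax$) is defined in Fig. 1 and may differ. Python integers serve as bit vectors, and the names `rsd_value` and `rsd_add_binary` are hypothetical.

```python
def rsd_value(p, n):
    """Value of an RSD number stored as two bit vectors: digit_i = p_i - n_i."""
    return p - n


def rsd_add_binary(p, n, b):
    """Add the non-negative binary number b to the RSD number (p, n).

    Every bit position is processed independently by one adder cell
    computing p_i + b_i - n_i = 2*c_i - s_i, so no carry ripples:
      s_i (weight -1) stays at position i,
      c_i (weight +2) moves to position i + 1.
    Returns the new (p, n) pair.
    """
    s = p ^ b ^ n                          # parity of p_i + b_i - n_i
    c = (p & b) | (p & ~n) | (b & ~n)      # 1 when p_i + b_i - n_i >= 1
    return c << 1, s
```

For instance, `rsd_add_binary(0b0101, 0b0011, 0b0110)` returns `(8, 0)`, which represents 5 - 3 + 6 = 8. The result digits are formed from one cell's `s` output and the neighboring cell's `c` output only, which is why the delay does not grow with the word length.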
Once the group is determined, each bit of the current stage's partial remainder can be computed by selecting one from $aq_j$, $sq_j$, $\overline{aq_j}$, $\overline{sq_j}$, 0, and 1, as shown in Table II. For example, if $sr^{j+1}_{i-1}$ is in group3, we take $aq_j$ as its value because they have the same value. In this scheme, we no longer need adders. Instead, we need grouping logic (GL) and selection logic (SL), as shown in Fig. 2. SL is a simple multiplexer that selects one of the values computed by the previous stage's quotient determination logic (QDL), or selects 0 or 1. The selection is controlled by the group values computed by GL.
The divider proposed in this paper has a ring structure similar to those proposed in [6], [11], [12]. However, our structure can be implemented with much simpler hardware. It no longer needs 3-bit adders, multiplexers, and C elements. Other logic is also simplified. Nevertheless, the new structure still exploits overlapped execution. While the current stage's GL is computing the group value from the previous stage's partial remainder, the previous stage's QDL computes the encoded quotient bit values concurrently. Then the SL selects values for the current stage's partial remainder. The critical path in each stage consists of one QDL and one SL. Assuming that the delay through SL is no longer than the delay through a multiplexer used in the other structures, we expect no performance degradation compared with the previously proposed structures.
The structure in Fig. 2 has two substages. The first substage consists of a QDL, a GL, and a completion detector. The second substage consists of an SL and a completion detector. Because each stage consists of two substages, we need four stages (eight substages in total) in the ring. Refer to Section III for details.
Note, in Fig. 2, that the delay of QDL is always longer than the delay of GL, and therefore the first substage needs only one completion detector at the output of QDL.
III. Implementation and Experimental Results
In general, self-timed logic employs a dual-rail encoding to generate a completion signal when the functional block finishes its evaluation. QDL can be designed using DCVSL (Differential Cascode Voltage Switch Logic) [15]. However, the circuit is relatively complex and therefore is not easy to optimize. As mentioned above, each quotient digit needs two values, $aq/\overline{aq}$ and $sq/\overline{sq}$, for binary RSD number representation.
The two values are generated by QDL's two circuits, the AQ generator and the SQ generator, respectively. Refer to [12] for details.
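The idea of deriving a completion signal from dual-rail outputs, mentioned at the start of this section, can be illustrated with a small behavioral Python sketch. It assumes the common convention in which each signal is a pair (true rail, false rail), both rails are low before evaluation, and exactly one rail goes high when the value is valid. This is a behavioral model only, not the paper's circuit, and the names `is_complete` and `decode` are hypothetical.

```python
def is_complete(pairs):
    """Completion signal: every dual-rail pair has exactly one rail high."""
    return all(t ^ f for t, f in pairs)


def decode(pairs):
    """Return the logical bits once evaluation is complete."""
    if not is_complete(pairs):
        raise ValueError("outputs are still evaluating")
    return [t for t, _ in pairs]
```

With this convention, `is_complete([(1, 0), (0, 0)])` is false because the second signal has not yet evaluated, while `decode([(1, 0), (0, 1)])` yields the bits 1 and 0.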
Since GL does not need a dual-rail implementation, we implemented it with domino logic to reduce the hardware size. Moreover, the group5 and group6 signals are not generated explicitly in GL because they can be generated implicitly in SL without performance degradation. The domino logic implementation of GL and the DCVSL implementation of SL are shown in Fig. 3 and Fig. 4, respectively.
Employing the proposed structure, a full-custom layout of a 54-bit divider was designed using MOSIS 1.2 μm CMOS design rules and tested through SPICE simulation. Table III compares the data obtained for our implementation with those in [6] and [12]. Note that we employed the same technology as in [6] and [12]. With the proposed structure, we obtained a 30% speedup with a 25% area increase over the divider employing nonoverlapped execution [12]. Compared with the divider employing overlapped execution [6], which has five stages in the ring, our divider, which has four simplified stages, performs 15% faster with 16% less area.
The minimum number ($N$) of stages in the ring without performance degradation depends on the forward latency ($L_f$) and the reverse latency ($L_r$), as shown in the following equation [6]:
The ring structure proposed in [6] has five stages because $L_r/L_f$ is 1.1. Because the ratio of our design for non-overlapped execution is 2.3, we need eight or more substages. Therefore, we put four stages (eight substages in total) in the ring.
IV. Conclusion
The self-timed divider structure proposed in this paper requires less chip area and maintains higher hardware utilization than the previous implementations employing array or ring structures. The structure is more efficient in terms of execution time and chip area due to the adoption of a novel self-timed ring structure and a carry-propagation-free addition/subtraction scheme. A layout was designed using MOSIS 1.2 μm CMOS design rules. The design occupies 5.7 mm² of silicon area and takes 135 ns for a worst-case division operation.
