Abstract -A 440 000-transistor second-generation RISC floating-point chip is described. The pipeline latency is only two cycles, and a double-precision result is produced every cycle. System throughput and accuracy are increased by using a floating-point multiply-add-fused (MAF) unit.
I. INTRODUCTION
With the second-generation RISC floating-point unit (FPU), the essential RISC concept of attacking the most frequently used functions by building simple, self-contained, low-latency hardware was extended to floating-point operations [1]-[3]. It greatly increases floating-point performance and accuracy by including, as the key feature of the FPU, a unified floating-point multiply-add-fused (MAF) unit, which executes the double-precision accumulate instruction D = (A × B) + C as an indivisible operation, with no intermediate rounding [1], [4]. By fusing multiplication with addition, the MAF building block allows one-cycle throughput and two-cycle latency, producing a floating-point accumulate with one rounding error. This single functional unit, which requires an instruction that other designs only emulate, also reduces the hardware overhead associated with adders/normalizers by combining the various operations necessary for fast multiplication with accumulation. In other words, the MAF primitive provides major improvements in floating-point execution, provided that the unit can be designed without any substantial impact on the cost, cycle time, or other VLSI requirements.
This paper describes the MAF data flow implemented using high-speed CMOS circuits. It attains a peak performance comparable to that of bipolar RISC systems, and this is achieved with a cycle time comparable to CMOS RISC systems. The next section discusses the MAF concept, explaining in some detail why it is an appropriate addition to the floating-point architecture in VLSI. Section III describes the interaction of logical and physical design required to incorporate several advances in VLSI arithmetic while accommodating various delay and technological (physical) constraints. The chip data are presented in Section IV, followed by the conclusions in the final section.
II. THE MULTIPLY-ADD-FUSED (MAF) CONCEPT
A short introduction to the MAF concept is given in this section. The MAF implementation should be consistent with the basic RISC philosophy [3] of heavily optimizing units in order to carry out the most frequently expected functions as fast as possible. In a significant majority of the SPEC floating-point benchmarks, most of the floating-point multiply and add operations were subsumed by the MAF, i.e., concatenating multiply and add. Therefore, a single self-contained unit which forms the multiply-accumulate operation D = (A × B) + C would produce a considerable enhancement in the floating-point performance. The individual operations necessary would be:
1) a parallel effort of the true mantissa and exponent calculations for multiplication;
2) a prenormalization stage for the bit alignment of the values to be added;
3) addition and post-normalization followed by rounding, which are needed for all calculations.
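As a minimal sketch of why a single rounding at the end of step 3) matters, the following Python fragment compares an unfused multiply followed by an add (two roundings) with an exactly computed product-sum that is rounded once. It does not model the MAF hardware: the function names are hypothetical, exact rational arithmetic stands in for the wide internal data path, and the operands are chosen only to make the effect visible.

```python
from fractions import Fraction


def unfused_multiply_add(a, b, c):
    """Multiply, round to double, then add and round again (two roundings)."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Form a*b + c exactly, then round once to double precision.

    Exact rationals are an assumption used here to imitate a data path
    that keeps the full product before the final rounding.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Illustrative operands: the exact product is 1 - 2**-60, which rounds to 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print(unfused_multiply_add(a, b, c))  # the low-order product bits are lost
print(fused_multiply_add(a, b, c))    # the small negative result survives
```

With these operands the intermediate rounding discards the entire result, while the single final rounding preserves it.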
Overlapping the data alignment with the early phases of multiplication would be the first step, since the partial-product compression is a time-consuming operation involving multiple stages of summation and a final addition. This approach leads to a first cycle of the pipeline shared by the multiplication, the exponent calculation, and product alignment, i.e., shifting the addend in either direction. The price to be paid for combining these operations, however, is an increment of the mantissa length resulting in a 1/2 increase of the add/normalize range compared to a conventional multiplier/adder pair (see Fig. 1). Hiding the normalization delay completely and making the add/normalize path comparable in time to the multiplication/shifter path requires some novel techniques. A leading-zero anticipator was added to the second pipeline stage in order to predict the shift amount for post-normalization before the addition is finished [5].
It should be pointed out that the two-stage pipeline MAF unit would be feasible only under the conditions of building a very fast shifter, which eases overlap of the multiplication and prenormalization, and a leading-zero anticipator (LZA), which allows post-normalization and addition in the same cycle by running in parallel with a carry-lookahead adder/incrementer. The resulting pipeline structure is shown in Fig. 2.
The architectural trade-offs made during MAF design were focused on keeping the balance between overall performance and the circuits put into a chip. That was achieved by eliminating some excess hardware needs from the conventional designs, in particular I/O ports, buses, multiplexers, etc., and replacing them, if necessary, by new circuitry (like the LZA) which improves system throughput and decreases control complexity.
III. MAF DATA-FLOW IMPLEMENTATION
As described in the preceding section, the two-cycle MAF concept is a very ambitious project. Since various operations of the dot product share a common time slot, the desired logical realization of the MAF subsections and their interaction need special attention. Therefore, this section is devoted to a detailed description of the logical and physical implementation for the critical segments of this activity. We will discuss various design aspects in detail for the separate subunits of the data flow. In the trade-offs made during the data-flow implementation, particular emphasis is on two critical aspects: 1) throughput and 2) area occupied using the two levels of metal available at the time. A key objective in VLSI is to achieve an efficient design within the space allowed by the top layer of wiring for the circuit [6]. For simplicity and uniformity, a tool/design discipline called macro layout generator (MLG) was developed and used throughout this effort [7]. MLG is a grid-based symbolic design tool that conveniently aligns available metal layers with the gates, including drain/source and gate contacts, in order to accommodate the capabilities of the technology, i.e., one metal 2 wire for every contacted gate, with the gates parallel to metal 2, allowing easy contact to metal 1. An image file for MLG defines all restrictions in terms of integration layers shared by various design parts.
Since the largest share of the MAF circuitry is in the multiplier, we will begin our discussion with that unit.
A. Multiplier
The multiply array utilizes two basic techniques: modified Booth encoding and an extended Wallace (carry-save) tree [8], [9]. The major extension from the literature of the carry-save or (3,2) addition is that of (7,3) counting (seven input bits reduced to their 3-b binary sum) [1], [2]. In order to be efficient, the number of significantly loaded stages must be kept to a minimum, since the long wiring that occurs in a Wallace tree implementation causes substantial delay in the CMOS technology used. This is due to the high sensitivity of the CMOS technology to capacitive loading. A reduction in stages with long wires yields a considerable improvement in performance. The seven-input three-output cell built using carry-save adders, depicted in Fig. 3, is more complicated and slower than a single carry-save adder. With a careful physical design, however, the number of long wires that must be driven is reduced by a factor of 2.5, resulting in a timing-efficient design. Another important factor in VLSI design is connectivity, which defines both cost and performance. The (7,3) adder, eliminating 4 b from the partial-product compression, requires only half as many connections as the (3,2) adder to produce a final result. Also, this cell organization reduces by a factor of 1.6 the number of stages needed, since each stage reduces the number of remaining terms to be summed by a factor of 7/3, in comparison to 3/2 for the (3,2) adder. The first stage consisted of a cell with seven 4-input multiplexers (for the Booth method), requiring 28 control and data inputs, all using polysilicon, which fit into the 36-track-wide cell. Thus, a combination of Booth encoding and (7,3) compression produces a very efficient multiplier, but it still must be implemented in CMOS VLSI. The wiring requirements for (7,3) counters were shown to occupy a width of 18 transistor locations or metal 2 (M2) wires per cell. The most efficient cell organization produced a mirrored pair of cells with twice that width, which shared a common control bay.
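The (7,3) counting described above can be illustrated with a small behavioral model. The sketch below builds a seven-input, three-output counter from four carry-save (full) adders. This arrangement is a common textbook construction and is assumed here for illustration; it is not claimed to be the exact cell of Fig. 3, and the function names are hypothetical.

```python
from itertools import product


def full_adder(x, y, z):
    """A (3,2) carry-save adder: three input bits to a sum bit and a carry bit."""
    sum_bit = x ^ y ^ z
    carry_bit = (x & y) | (x & z) | (y & z)
    return sum_bit, carry_bit


def counter_7_3(bits):
    """Reduce seven equally weighted bits to their 3-b binary sum.

    Returns (weight-4 bit, weight-2 bit, weight-1 bit).
    """
    b0, b1, b2, b3, b4, b5, b6 = bits
    s1, c1 = full_adder(b0, b1, b2)      # first triple
    s2, c2 = full_adder(b3, b4, b5)      # second triple
    out1, c3 = full_adder(s1, s2, b6)    # weight-1 output, plus a weight-2 carry
    out2, out4 = full_adder(c1, c2, c3)  # add the three weight-2 carries
    return out4, out2, out1


# Exhaustive check over all 128 input combinations.
for bits in product((0, 1), repeat=7):
    out4, out2, out1 = counter_7_3(bits)
    assert 4 * out4 + 2 * out2 + out1 == sum(bits)
```

Each counter removes four bits from a column of partial-product bits, which is the source of the reduced stage count and wiring discussed in this subsection.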
B. Shifters
Parallel execution of the various fundamental operations of the dot product is based on the assumption that wide-range shifting/rotating of the data can be carried out without dramatically impairing the cycle time and area constraints. This is achieved by applying a novel approach, called the partial-decode or modulo shifter. It is designated partial decode because of its multistage structure with partial-shift groups, or modulo shifter since each nested shift amount can be calculated by modulo arithmetic. An illustrative example clarifies how this technique works. A 160+ bit shifter, for instance, is accomplished by shifting by multiples of 16 bit positions (0, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160), then by multiples of four bit positions (0, 4, 8, 12), and finally by a binary shifter (0, 1, 2, 3). Note that, although the input data have to be moved over a broad range of first-stage shift positions, a four-way multiplexer is enough because there are fewer than 64 data bits. A typical shifter stage is shown in Fig. 5. Its NMOS-like shift circuits also allow very wide ORing within the function. The wide OR required for the IEEE sticky bit calculation is carried out by ORing the control signals. This simple but powerful technique minimized the cell height, owing to fewer wires for data and control, and attained an extremely fast shift time (for our example above, 6 ns after the shift amount is calculated). It should be noted that the partial-shift steps, their sequences, and control signals can be optimized according to each application. Also, having the shift groups subdivided facilitates other bit manipulations.
To avoid performance degradation during the two-cycle accumulation, a similar implementation with comparable performance is used for post-normalization, shifting the adder output by the amount obtained from the completely overlapped zero detection.
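The three-stage decomposition used in the 160+ bit example can be sketched behaviorally as follows. This is only an arithmetic model under stated assumptions: the function names are hypothetical, the data are treated as a non-negative integer shifted to the right, and the sticky bit is formed here by ORing the discarded data bits, whereas the text describes obtaining the wide OR through the control signals.

```python
def split_shift(shift):
    """Split a shift amount into the three partial shifts of the example.

    Returns (multiple of 16, multiple of 4 in 0..12, binary shift in 0..3).
    """
    if not 0 <= shift <= 175:
        raise ValueError("shift amount outside the range of this sketch")
    coarse = (shift // 16) * 16        # 0, 16, 32, ..., 160
    middle = ((shift % 16) // 4) * 4   # 0, 4, 8, 12
    fine = shift % 4                   # 0, 1, 2, 3
    return coarse, middle, fine


def staged_right_shift(value, shift):
    """Apply the three partial shifts in sequence.

    Returns (shifted value, sticky bit), where the sticky bit is 1 if any
    nonzero bit was shifted out in any stage.
    """
    sticky = 0
    for amount in split_shift(shift):
        lost = value & ((1 << amount) - 1)  # bits dropped by this stage
        sticky |= int(lost != 0)
        value >>= amount
    return value, sticky


print(split_shift(91))  # (80, 8, 3)
```

Because the three partial amounts always add up to the requested shift, the staged result matches a single wide shift while each stage needs only a small multiplexer.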
C. Logarithmic Adders
For the very large adder (114 b) concatenated with an incrementer (55 b), one of the fastest implementations is known to be logarithmic [10], [11]. This operation is performed in one's complement form, which involves an end-around carry. Its classical carry-lookahead block is built using a binary tree of true followed by complementary propagate/generate stages. In a lookahead scheme, only the most significant bit groups are being evaluated while doubling the number of carry signals at each stage to be computed. The well-known fan-out problem can be bounded by using the least-significant-bit locations to buffer the increasing load, leaving a fan-out of about 3 for every stage. This logarithmic buffering scheme allows the end-around carry to be generated in only a single gate delay. Also, if the sign of the result is negative, the bits of the word are complemented so that the sign-magnitude result is correctly output in one more gate delay.
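The one's complement arithmetic with end-around carry, and the final complement that yields a sign-magnitude result, can be sketched as below. The model is purely arithmetic and makes no attempt to reproduce the logarithmic carry tree or its buffering; the function name and the small word width in the usage line are assumptions made for illustration.

```python
def ones_complement_subtract(a, b, n):
    """Compute a - b for unsigned n-bit a and b, returning (sign, magnitude).

    The subtrahend is complemented, the carry out of the top bit is added
    back into the low end (end-around carry), and a negative result is
    complemented to give its magnitude.
    """
    mask = (1 << n) - 1
    total = a + (~b & mask)               # a plus the one's complement of b
    carry_out = total >> n
    total = (total & mask) + carry_out    # end-around carry
    if carry_out:
        return 0, total & mask            # positive result, already a magnitude
    # No carry out: the result is negative (or negative zero when a == b),
    # so complementing the word gives the magnitude.
    return 1, ~total & mask


print(ones_complement_subtract(5, 3, 4))  # (0, 2)
print(ones_complement_subtract(3, 5, 4))  # (1, 2)
```

Note that equal operands produce the one's complement negative zero in this model, that is, a sign of 1 with a magnitude of 0.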
D. Leading Zero/One Anticipation
A minimum two-cycle latency and a second pipeline stage delay comparable in time to the multiplication/shifting path of the MAF could be achieved only by overlapping the normalization and addition. This novel approach, called leading 0/1 anticipation, requires that the shift amount for the post-normalization be computed by using the two operands.
The LZA takes as inputs the generate (G = both inputs are on), propagate (P = exactly one input is on), and zero (Z = no inputs are on) signals that are already created for the carry-lookahead adder, and outputs the normalize shift amount. Since the post-normalization is based on counting the consecutive bits matching the sign of the final result, the first bit of significance can be determined by examining the bits from the high-order end of the addition until the first one (or, in the case of a negative result, the first zero) is predicted. In fact, the sign token carries enough information to start the leading 0/1 anticipation, namely: zero (Z): adding two positive numbers, positive result; generate (G): adding two negative numbers, negative result; and propagate (P): subtraction. Note that a string of propagate tokens implies that the sign of the word is unresolved; the first Z labels the result negative, while the first G labels the result positive. Subsequently, after the word becomes negative, the next non-G signals the first bit of significance, since a string of G symbols produces all "1" symbols. Similarly, in the case of a positive result, the next non-Z represents the first nonzero in the word, and thus stops the shift count. The undecided state (P) will always go back to either the Z or the G case, each of which is defined entirely. Since the low-order bits are not monitored, a carry can cause a single-bit-position overshift, which is corrected during the binary normalization. Fig. 6 represents the general state diagram along with the adjustments for each output case. The carry propagation is known to take log(n) time; therefore, it is desired that the leading zero count be computed in the same period. This would mean processing the string of P, G, and Z inputs as fast as the adder, e.g., by using a parallel algorithm similar to the carry-lookahead structure. The leading 0/1 lookahead construction is more complicated than carry calculation.
Hence the first stage is a four-step process, to allow the more complex LZA parallel cell to fit in 4 b of the adder cell. Therefore, for logarithmic leading 0/1 anticipation, the state diagram of Fig. 6 should be generated for all hexadecimal subgroups of the P, G, and Z tokens. It can easily be shown that, in general, all possible segment 
$$\cdots\, G_3 Z_4) + (P_1 G_2 + G_1 Z_2)\, Z_3 Z_4 \qquad (4)$$
$$PG = P_1 P_2 (P_3 Z_4 + Z_3 G_4) + (P_1 Z_2 + Z_1 G_2)\, G_3 G_4. \qquad (5)$$
The subsequent prediction step is recursively combining the individual Iookahead blocks. At the nth stage, the logical equations for general state anticipation can be given as follows:
$$PG_{(n+1)} = PP_{(n-1)}\, PG_{n} + PG_{(n-1)}\, GG_{n}. \qquad (10)$$
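The group signal of (5) and the combining step of (10) can be checked with a short sketch. It rests on an interpretation that should be treated as an assumption: PG is read as flagging a segment that consists of a run of P tokens, a single Z, and then a run of G tokens (the negative-result pattern of the state description), PP as an all-P segment, and GG as an all-G segment. All function names are hypothetical.

```python
import re
from itertools import product

# Assumed meaning of the PG state: P...P Z G...G (either run may be empty).
PG_PATTERN = re.compile(r"P*ZG*")


def is_pg(tokens):
    """True if the token string matches the assumed PG pattern."""
    return PG_PATTERN.fullmatch(tokens) is not None


def pg_group_of_four(tokens):
    """Evaluate the four-token group expression of (5) term by term."""
    p = [t == "P" for t in tokens]
    g = [t == "G" for t in tokens]
    z = [t == "Z" for t in tokens]
    first = p[0] and p[1] and ((p[2] and z[3]) or (z[2] and g[3]))
    second = ((p[0] and z[1]) or (z[0] and g[1])) and g[2] and g[3]
    return first or second


def pg_combine(left, right):
    """Combine two neighboring segments in the manner of (10).

    The joined segment is PG if the left part is all P and the right part
    is PG, or if the left part is PG and the right part is all G.
    """
    pp_left = set(left) == {"P"}
    gg_right = set(right) == {"G"}
    return (pp_left and is_pg(right)) or (is_pg(left) and gg_right)


# Exhaustive consistency checks for 4-token groups and 8-token pairs.
for group in product("GPZ", repeat=4):
    text = "".join(group)
    assert pg_group_of_four(text) == is_pg(text)
for pair in product("GPZ", repeat=8):
    text = "".join(pair)
    assert pg_combine(text[:4], text[4:]) == is_pg(text)
```

Under this reading, the same two-input combining rule can be applied again at each doubling of the segment width, which is what gives the anticipation a logarithmic depth.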
A typical circuit for state generation (PG) and recursion is depicted in Fig. 7. Each cell at the following iterations will have the intermediate prediction signals as inputs (to receive the information from its neighbors on the left and the top) and outputs (to supply the anticipated states to its neighbors on the right and the bottom as well as to the network outputs). The doubling and buffering procedure mentioned above is abstracted in Fig. 8. In this implementation, the delays for hexadecimal shifting are tuned to the delays for add/complement output, thus allowing the control and data to arrive nearly simultaneously at the shifter.
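The token-based state behavior described in this subsection can be sketched as a serial scan. Several simplifications are assumed and should not be read as the chip's implementation: the sketch uses two's complement operands without an end-around carry, it requires each operand to carry one extra sign bit so that the sum cannot overflow, it scans the tokens one at a time instead of using the parallel lookahead network, and all names are hypothetical.

```python
def tokens_msb_first(a, b, n):
    """Form the G/P/Z token for each bit position, most significant first."""
    tokens = []
    for i in range(n - 1, -1, -1):
        x = (a >> i) & 1
        y = (b >> i) & 1
        tokens.append("G" if x & y else "P" if x ^ y else "Z")
    return tokens


def lza_predict(a, b, n):
    """Predict the number of leading sign-matching bits of a + b.

    Only the tokens are examined, never the sum, so the prediction may
    exceed the true count by one bit position.
    """
    tokens = tokens_msb_first(a, b, n)
    i = 0
    while i < n and tokens[i] == "P":   # undecided state: sign unresolved
        i += 1
    if i == n:
        return n                        # all propagate
    if i > 0:
        # The first Z after the P run means negative (a run of G follows);
        # the first G means positive (a run of Z follows).
        run = "G" if tokens[i] == "Z" else "Z"
        i += 1
    else:
        run = tokens[0]                 # same-sign operands: count Z or G run
    while i < n and tokens[i] == run:
        i += 1
    return i


def leading_sign_bits(s, n):
    """Count the leading bits of the n-bit word s that equal its top bit."""
    msb = (s >> (n - 1)) & 1
    count = 0
    for i in range(n - 1, -1, -1):
        if (s >> i) & 1 != msb:
            break
        count += 1
    return count


# 10 + (-3) in 8 bits: tokens PPPPGPPP, sum 00000111, five leading zeros.
print(lza_predict(10, -3 & 0xFF, 8), leading_sign_bits(7, 8))
```

In this model the predicted count is either exact or one position too large, which corresponds to the single-bit overshift that the text says is corrected during the binary normalization.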
IV. CHIP PHYSICAL DATA
The FPU was fabricated on a 12.7-mm × 12.7-mm die using triple-level metal, single-polysilicon, 1-μm minimum feature technology [12]. Fig. 9 shows a photomicrograph of the RISC FPU chip. This unit attains a peak execution rate of 50 MFLOPS with a 25-MHz clock frequency and is capable of sustaining nearly that rate in complex programs such as graphics and Livermore loops. It operates with a 40-ns cycle time under worst-case conditions and dissipates 4 W of power.
The complete multiplier array utilizes all the M2 and all the polysilicon available, and occupies only 4 mm × 5 mm of area in a 1-μm minimum feature technology. Table I summarizes the number of transistors, area, and speed properties of the subunits described so far.
V. CONCLUSION
The second-generation FPU chip, which stretches the fundamentals underlying the RISC philosophy beyond CPU implementations, has been described. Its MAF (multiply-add-fused) unit increases throughput and accuracy of floating-point computations over other existing designs. Other designs emulate the single accumulate instruction, but do not maintain its accuracy or latency. The cycle time is competitive with other CMOS RISC systems, while the floating-point performance is considerably above other CMOS units.
Leading zero anticipation made the two-cycle pipeline possible by nearly eliminating the additional post-normalization time and allowing for reduced overall system latency. Partial decode shifters enable complete time sharing for the multiply and data alignment. Improved design techniques for logarithmic addition and higher order counters for multiplication complete this second-generation RISC FPU design. Its tightly coupled logical/physical correlation allows for reduced wiring area and delay in the logarithmic multiplication and probably logarithmic addition necessary for the massive 169-b adder/accumulator.
Finally, the state of the art in floating-point/VLSI arithmetic has been enhanced in many ways by the RISC FPU. We believe that the MAF concept will provide a solid basis for the future RISC floating-point architectures.
ACKNOWLEDGMENT
First of all, there would be no second-generation RISC machine without J. Cocke, who is utterly inspirational. Acknowledgment is also due to research and development teams, with particular thanks to E. Kronstadt for his encouragement and support to the work described in this paper.
