The Data-Intensive Architecture (DIVA) system incorporates Processing-In-Memory (PIM)
Introduction
The Data-Intensive Architecture (DIVA) project [1] [2] is building a workstation-class system using embedded memory technology to replace the memory system of a conventional workstation with "smart memories" capable of very large amounts of processing. The goal of the project is to significantly reduce the ever-increasing processor-memory bandwidth bottleneck in conventional systems. System bandwidth limitations are overcome in three ways: (1) tightly coupling a single processing-in-memory (PIM) processor with an on-chip memory bank; (2) distributing multiple processor-memory nodes per PIM chip; and (3) using a separate chip-to-chip interconnect, which bypasses the host system bus, for direct communication between nodes on different chips, as illustrated in Figure 1. Based on our first PIM implementation, the workstation system architecture that incorporates these PIM devices is projected to achieve speedups ranging from 8.8 to 38.3 over conventional workstations for a number of applications [2].
One of the important target applications of DIVA is multimedia. Multimedia applications perform repeated computations on streams of data, often with little temporal data reuse. As processors exploit increased parallelism, multimedia applications become memory bound. Since the data type of these applications is often floating-point, floating-point capability is crucial for further performance improvement of DIVA systems. This paper describes the design of the DIVA floating-point unit (FPU), which is the key improvement over the first implementation of the DIVA PIM chip [2]. The DIVA FPU supports the IEEE-754 single-precision data type, four rounding modes, and most precise exceptions. To support WideWord floating-point capability, eight FPUs are integrated in each PIM chip. Significant design effort was therefore spent on tradeoffs that minimize the area and maximize the performance of the FPU. In particular, the division algorithm was carefully chosen and implemented to minimize the area overhead while achieving high throughput. The DIVA FPU is implemented using Artisan standard cells and ROM in TSMC 0.18µm CMOS technology. Post-layout simulation shows that the FPU can run at clock frequencies of up to 300MHz. With a nominal chip clock frequency of 250MHz at 1.8V, the simulated average power dissipation for each FPU is 152mW.
The remainder of this paper is organized as follows. Section 2 presents a brief description of the DIVA PIM node architecture, followed by a detailed description of the FPU microarchitecture in Section 3. Section 4 presents the implementation details. Section 5 presents the simulation results, followed by a brief summary and conclusion in Section 6.
DIVA Architecture Overview 
DIVA FPU Microarchitecture
The DIVA FPU implements a subset of the IEEE-754 floating-point standard [5]. Since the target applications are mostly from the multimedia realm, only single-precision numbers are supported. To achieve a better area-performance solution, operations on denormalized numbers are not supported and cause exceptions. In addition, whenever a result is a denormalized number, an underflow exception is raised and the minimum normalized number is produced as the output. The inexact exception flag on division operations is not IEEE-754 compliant, which is common for multiplicative division algorithms; additional operations would be necessary to correct this. The other exception flags (Invalid, Divide by Zero, Overflow, Underflow, and Inexact, except for divide) are generated as specified by the IEEE-754 standard. All four rounding modes are implemented.
Figure 3 depicts the microarchitecture of the FPU. The FPU has two main blocks: ALU and Mul/Div. The exponent computation functions for both blocks are combined in one datapath to reduce area. Similarly, the logic that converts to and from the internal number format and the rounding logic are shared by both datapaths. As only one instruction can be issued per cycle, combining the common datapaths incurs no performance penalty. Input registers for the ALU and the Mul/Div [...]. Table I shows the supported floating-point instructions and their pipeline latency and throughput. Detailed pipelining issues are discussed in Section 4.
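The handling of denormalized results described above can be sketched in a few lines of Python. This is an illustrative model, not the DIVA RTL: the helper names (`f32_bits`, `is_denormal`, `flush_result`) are hypothetical, and keeping the sign of the flushed result is an assumption, since the text only states that the minimum normalized number is produced.

```python
import struct

MIN_NORMAL_BITS = 0x00800000  # smallest positive normalized single, 2**-126


def f32_bits(x: float) -> int:
    """Return the IEEE-754 single-precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def is_denormal(bits: int) -> bool:
    """A single is denormalized when its exponent field is 0 and its fraction is not."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return exponent == 0 and fraction != 0


def flush_result(bits: int):
    """Return (result_bits, underflow_flag) for a candidate result.

    Denormalized results raise underflow and are replaced by the minimum
    normalized number. Preserving the sign bit is an assumption.
    """
    if is_denormal(bits):
        sign = bits & 0x80000000
        return sign | MIN_NORMAL_BITS, True
    return bits, False
```

For example, `flush_result(0x00000001)` yields the bit pattern of 2**-126 with the underflow flag set, while a normalized value such as 1.0 passes through unchanged.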
ALU
A block diagram of the ALU is shown in Figure 4. Add/Sub instructions proceed by swapping the operands if necessary, aligning the fraction of the smaller operand, computing the fraction, normalizing the fraction with adjustment of the exponent, rounding, and generating the exception flags, if any. The exponent datapath includes three exponent adders that are also used for multiply and divide instructions. For Absolute/Negate instructions, OprB is preset to zero by the operand formatter and then added to OprA in both the fraction and exponent datapaths. The controller determines the sign bit of the result based on the sign bit of OprA. For the Fp2Int (floating-point to integer) instruction, the fraction is shifted right by an amount that depends on the exponent (157 - ExpA), forming a 31-bit unsigned integer. If the floating-point number is negative, the fraction is inverted for the two's complement representation. Rounding and overflow detection are carried out thereafter. For the Int2Fp (integer to floating-point) instruction, the fraction is first converted to sign-magnitude format by conditionally inverting it if the sign is negative. The result is then shifted left to remove leading zeros, and the exponent is decreased by the number of leading zeros. Note that the exponent value of OprA is preset to 157, which corresponds to 2^30 in integer form.
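The two conversion instructions can be modeled on bit patterns as a rough sketch. The function names below are hypothetical, and the model makes simplifying assumptions that are not taken from the design: it truncates instead of applying the four rounding modes, reports every exponent above 157 (including NaN and infinity) as overflow, and excludes the input -2^31 from `int2fp`.

```python
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE-754 single-precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits: int) -> float:
    """Return the float encoded by a single-precision bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fp2int(bits: int):
    """Return (integer, overflow_flag); truncating model of Fp2Int."""
    sign = bits >> 31
    exp_a = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp_a == 0:
        return 0, False            # zero (denormals are not supported)
    if exp_a > 157:
        return None, True          # magnitude does not fit in 31 bits
    # Left-align the 24-bit significand in a 31-bit field, so that
    # bit 30 has weight 2**30 when the biased exponent is 157.
    mag31 = ((1 << 23) | frac) << 7
    magnitude = mag31 >> (157 - exp_a)
    return (-magnitude if sign else magnitude), False


def int2fp(n: int) -> int:
    """Return single-precision bits for integer n; truncating model of Int2Fp."""
    assert -(1 << 31) < n < (1 << 31)
    if n == 0:
        return 0
    sign = 1 if n < 0 else 0
    magnitude = -n if sign else n           # sign-magnitude form
    lz = 31 - magnitude.bit_length()        # leading zeros in the 31-bit field
    normalized = magnitude << lz            # hidden one now sits at bit 30
    exponent = 157 - lz                     # exponent preset to 157, then adjusted
    frac = (normalized >> 7) & 0x7FFFFF     # drop the low 7 bits (truncation)
    return (sign << 31) | (exponent << 23) | frac
```

As a usage example, `fp2int(f32_bits(-6.75))` gives `(-6, False)` under truncation, and `bits_f32(int2fp(-6))` gives `-6.0`.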
Multiplier/Divider Fused Unit
Modern scientific applications such as 3D graphics rendering demand high performance from division as well as multiplication. High-radix SRT dividers based on the digit recurrence algorithm are widely used in modern microprocessors. However, this type of divider requires a hardware block that would substantially increase the area of the datapath and communication buses. Since eight copies of the DIVA FPU are to be implemented, a good area-performance solution is the primary design goal. To achieve this, we adapted the multiplicative division algorithm proposed by Liddicoat and Flynn [6] [7], which is based on a Taylor series expansion, as shown in Figure 5.
This algorithm achieves fast computation by using parallel squaring and cubing units, which compute the higher-order terms significantly faster than traditional serial multipliers and with a relatively small hardware overhead. Three major multiply operations are needed to produce a quotient with 0.5 ulp (unit in the last place) error, as shown in Figure 5. One additional multiply operation is required for exact rounding. To maximize area efficiency, all of these multiply operations are executed by one multiplier. Sharing the multiplier increases the pipeline latency by a factor of four. However, through careful pipeline scheduling, we were able to achieve high throughput for consecutive divide instructions. A lookup table for the initial seed value is implemented using a 128x7-bit ROM. A two-stage pipelined multiplier is used for better synthesis results under the timing specification of a 250MHz clock frequency. Table II [...].
Figure 7 shows the pipeline diagram for three consecutive divide instructions. Although 15 clock cycles are required to complete one divide instruction, the pipeline is designed such that divide instructions can be issued every five clock cycles. If any other type of instruction follows a divide instruction, a pipeline stall of seven clock cycles is required to ensure in-order completion, as shown in Figure 8. All other combinations of instructions run without pipeline stalls.
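To make the Taylor-series approach concrete, the sketch below divides two significands in [1, 2) using a 128-entry seed table and the truncated series 1 + e + e^2 + e^3, where e = 1 - X0*b. It is a numerical illustration only, not the DIVA datapath. Several details are assumptions: the ROM contents (reciprocal of each interval midpoint, stored as 7 bits with an implied leading one), the mapping of the three multiplies onto the steps below, and the use of Python floats in place of fixed-point hardware. The extra multiply for exact rounding is omitted.

```python
def build_seed_rom():
    """Hypothetical 128x7-bit seed table for 1/b, with b in [1, 2).

    Entry i covers b in [1 + i/128, 1 + (i+1)/128) and stores the low 7 bits
    of round(256 / midpoint); the leading one is implied, not stored.
    """
    rom = []
    for i in range(128):
        midpoint = 1.0 + (2 * i + 1) / 256.0
        rom.append(round(256.0 / midpoint) - 128)
    return rom


SEED_ROM = build_seed_rom()


def divide_significands(a: float, b: float) -> float:
    """Approximate a / b for significands a, b in [1, 2)."""
    index = int((b - 1.0) * 128)             # top 7 fraction bits of b
    x0 = (SEED_ROM[index] + 128) / 256.0     # seed reciprocal, roughly 1/b
    e = 1.0 - x0 * b                         # multiply 1: seed error term
    # In hardware, e*e and e*e*e would come from parallel squaring and
    # cubing units instead of the shared multiplier.
    series = 1.0 + e + e * e + e * e * e
    reciprocal = x0 * series                 # multiply 2: refined 1/b
    return a * reciprocal                    # multiply 3: quotient
```

With this table the seed error satisfies |e| < about 2^-7, so dropping the e^4 and higher terms leaves a relative error of roughly 2^-28, which is below the 2^-24 resolution of a single-precision significand. For instance, `divide_significands(1.5, 1.25)` agrees with 1.2 to better than one part in 2^27.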
Pipelining
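The divide timing quoted for Figures 7 and 8 (15-cycle latency, one divide issued every five cycles, and a seven-cycle stall when a different instruction follows a divide) can be turned into a small issue-cycle model. The function name and the instruction labels are hypothetical, and the interpretation of the stall as seven bubble cycles on top of the normal one-cycle issue slot is an assumption, since the text does not state from which cycle the stall is counted.

```python
DIV_LATENCY = 15         # cycles to complete one divide (from the text)
DIV_ISSUE_INTERVAL = 5   # cycles between consecutive divides (from the text)
STALL_AFTER_DIV = 7      # stall when a non-divide follows a divide (from the text)


def issue_cycles(ops):
    """Return the issue cycle of each instruction in program order.

    ops is a list of labels; "div" marks a divide, any other label is
    treated as a single-issue instruction such as add or multiply.
    """
    cycles = []
    cycle = 0
    previous = None
    for op in ops:
        if previous is not None:
            if previous == "div" and op == "div":
                cycle += DIV_ISSUE_INTERVAL
            elif previous == "div":
                # Assumption: seven bubbles, then the usual next-cycle issue.
                cycle += 1 + STALL_AFTER_DIV
            else:
                cycle += 1
        cycles.append(cycle)
        previous = op
    return cycles
```

Under this model, three back-to-back divides issue at cycles 0, 5, and 10, so the last one completes at cycle 10 + 15 = 25, and at the nominal 250MHz clock the sustained rate is one divide every 20ns.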

Implementation
The FPU design was described in Verilog, with the exception of the two-stage multiplier, for which the netlist was generated using a special synthesis tool [8]. These netlists were combined with the ROM needed for the divider. To balance the pipeline stages, register retiming techniques were applied in the logic synthesis step. The resulting netlist was then placed and routed using timing-driven algorithms. The layout of the prototype FPU is presented in Figure 9. Features of the prototype FPU are summarized in Table 3.
Experimental Results
Area
In Figure 10, the area occupied by each logic block is presented. The FPU is synthesized under the timing constraints of a 250MHz clock frequency. To show the area overhead of the divide instruction support, the area of each sub-block of the Mul/Div block is also presented. The proposed Mul/Div block requires approximately 2.3 times the area of the two-stage multiplier. As the Mul/Div block occupies 54.4% of the total area of the [...] instructions. This is because 40% of the pipeline stages are idle while consecutive divide instructions are processed. At the nominal supply voltage of 1.8V, the maximum operating clock frequency is 300MHz. A shmoo plot is presented in Figure 12. Performance is significantly degraded when the supply voltage is less than 1.0V, mainly due to the operating range of the ROM.
Conclusion
This paper presented the design and implementation of the floating-point unit for the DIVA PIM processor. A standard cell implementation in TSMC 0.18µm CMOS technology has shown that area and performance are well balanced. This balance is achieved through an efficient divider implementation and sub-block sharing among different functions. This design will be incorporated in the next DIVA PIM chip, which will tape out in mid-2003.
