FPGAs have reached densities that can implement floating-point applications, but floating-point operations still require a large amount of FPGA resources. One major component of IEEE compliant floating-point computations is the variable length shifter. Shifters account for over 30% of a double-precision floating-point adder and 25% of a double-precision multiplier. This paper introduces two alternatives for implementing these shifters. One alternative is a coarse-grained approach: embedding variable length shifters in the FPGA fabric. These units provide significant area savings with a modest clock rate improvement over existing architectures.
INTRODUCTION
While modern supercomputers depend almost exclusively on a collection of traditional microprocessors, these microprocessors have poor sustained performance on many modern scientific applications [1]. FPGAs may provide an alternative, but scientific applications depend on IEEE compliant floating-point computations for numerical stability and reproducibility of results. Increases in FPGA density, and optimized floating-point unit designs, have made it possible to implement a variety of scientific algorithms with FPGAs [2], [3], [4]. In spite of this, there are still significant opportunities to improve the performance of FPGAs on scientific applications by optimizing the device architecture.
Floating-point units (FPUs) can be embedded in the FPGA fabric to provide a large area savings and increased clock rates for floating-point based kernels, but they also consume 17.6% of the chip [5] . An alternative is to focus on less area intensive enhancements to the FPGA fabric that improve floating-point units.
A fundamental component in floating-point arithmetic is a variable length and direction shifter. In floating-point addition, the mantissas of the operands must be aligned before the computation. For full IEEE compliance, floating-point multiplication and division require normalization of the mantissa before and after the calculation [6]. Shifters require a series of multiplexers, which are currently implemented using LUTs. In our double-precision floating-point cores, the shifter accounts for almost a third of the adder and a quarter of the multiplier. Thus, better support for variable length shifters can noticeably improve floating-point performance.
We consider two approaches that optimize the FPGA hardware for variable length shifters. The first approach is to embed shifters in the FPGA logic. Recent FPGAs have included embedded units including embedded multipliers, block RAMs, and even full microprocessors. Embedding variable length shifters allows configurable logic in the FPGA to be used for other purposes and yields a large area savings. The trade-off is an increase in the silicon area that is used for dedicated functionality that may not be used by all applications. The second approach is to modify the general purpose logic of the FPGA by adding a 4:1 multiplexer in parallel with the traditional LUT. This approach decreases the area required to implement shifters in the general purpose logic of the FPGA without wasting a significant amount of silicon area.
To test these concepts, we modified VPR to support embedded functional units and high-performance carry-chains. VPR was used to place and route benchmarks that use double-precision floating-point operations.
The benchmarks included matrix multiply, matrix vector multiply, vector dot product, FFT, and LU decomposition. Each benchmark was tested using three versions of the FPGA. The first is similar to the Xilinx Virtex II Pro [7] , and is representative of current commercial devices. The second adds embedded shifters, while the third uses modified CLBs that have the additional 4:1 multiplexer.
BACKGROUND
The IEEE-754 standard [8] specifies the floating-point numbers used on most computing platforms. Floating-point numbers consist of sign, mantissa, and exponent. The mantissa, f, is multiplied by the base number (two) to an exponent, e, as shown in Equation 1 (double-precision).
$$X = (-1)^{s} \times 1.f \times 2^{e - 1023} \qquad (1)$$
Compliance with the IEEE double-precision format is important for cross-platform portability and verifiability; double-precision also improves numerical stability.
Double-precision floating-point has a sign bit, an 11-bit exponent and a 52-bit mantissa. Since the mantissa is normalized to the range [1, 2), there will always be a leading one on the mantissa. The implicit leading one gains a single bit of precision, but raises the complexity of floating-point implementations. The exponent is represented in a biased notation. All stored numbers are "positive", but have been "biased" by half the exponent range. This representation simplifies floating-point comparators.
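As an illustration of the double-precision layout described above, the following minimal Python sketch splits a double into its three fields and rebuilds a normalized value as $(-1)^{s} \times 1.f \times 2^{e-1023}$. The function names (`decode_double`, `normal_value`, `raw_bits`) are hypothetical and chosen only for this example; zeros, denormals, infinities, and NaNs are deliberately not handled.

```python
import struct

def raw_bits(x: float) -> int:
    """Return the 64-bit pattern of a double as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def decode_double(x: float):
    """Split a double into (sign, biased 11-bit exponent, 52-bit fraction)."""
    bits = raw_bits(x)
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF          # stored with a bias of 1023
    fraction = bits & ((1 << 52) - 1)        # implicit leading one not stored
    return sign, exponent, fraction

def normal_value(sign: int, exponent: int, fraction: int) -> float:
    """Rebuild a normalized double (assumes 1 <= exponent <= 2046)."""
    significand = 1.0 + fraction / float(1 << 52)  # restore the leading one
    return (-1.0) ** sign * significand * 2.0 ** (exponent - 1023)
```

For example, `decode_double(1.0)` returns `(0, 1023, 0)`: the true exponent of zero is stored as 1023. Because the biased exponent sits above the fraction in the bit pattern, two positive doubles compare in the same order as their raw bit patterns compared as unsigned integers, which is one way to see why the biased notation simplifies comparators.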
The dominant style of FPGAs is the island-style FPGA consisting of a two dimensional lattice of CLBs (Configurable Logic Blocks). Connecting the CLBs are regular horizontal and vertical routing structures that allow configurable connections at the intersections. In recent years, embedded RAMs, DSP blocks, and even microprocessors have been added to the island-style FPGAs. However, floating-point applications still require a large number of CLBs to perform basic operations.
VPR
VPR [9] is the leading public-domain FPGA place and route tool. It uses simulated annealing, a timing based semi-perimeter routing estimate for placement, and a timing driven router. VPR was used to determine the feasibility of the proposed techniques.
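The semi-perimeter estimate mentioned above can be sketched as follows. This is a simplified illustration, not VPR's actual cost function: it computes only the half-perimeter of each net's bounding box and omits the timing terms and other refinements. The names `semi_perimeter` and `wiring_cost`, and the dictionary-based placement, are assumptions made for this example.

```python
def semi_perimeter(pins):
    """Half-perimeter of the bounding box around a net's (x, y) pin locations."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def wiring_cost(placement, nets):
    """Sum of semi-perimeter estimates over all nets.

    placement maps a block name to its (x, y) grid location;
    nets is a list of lists of block names.
    """
    return sum(semi_perimeter([placement[b] for b in net]) for net in nets)

def swap_delta(placement, nets, a, b):
    """Change in wiring cost if blocks a and b exchange locations."""
    trial = dict(placement)
    trial[a], trial[b] = placement[b], placement[a]
    return wiring_cost(trial, nets) - wiring_cost(placement, nets)
```

A simulated-annealing placer evaluates many such swaps, always accepting those with a negative delta and accepting some with a positive delta with a probability that falls as the annealing temperature drops.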
A similar version of VPR was used to test the feasibility of coarse-grained embedded floating-point units [5]. In addition to the block types that VPR previously supported (input pads, output pads, and CLBs), the modified version of VPR supports multiple embedded blocks. These embedded blocks have parameterizable heights and widths that are quantized by the size of the CLB. Horizontal routing is allowed to cross the embedded units, but vertical routing only exists at the border of the embedded blocks. As in previous work [5], fast carry-chains were added to ensure a reasonable comparison. The synthesis and technology mapping approach is covered in the methodology section.
The baseline FPGA architecture was modeled after the Xilinx Virtex-II Pro family of FPGAs and includes most of the major elements of current FPGAs (CLBs, 18Kb block RAMs, and embedded 18-bit x 18-bit multiplier blocks). The CLBs include 4-input function generators, storage elements, arithmetic logic gates, and a fast carry-chain. In addition to the standard Xilinx Virtex-II Pro features, the architecture incorporates embedded shifters and a modified CLB with a 4:1 multiplexer in parallel with the LUT. The various blocks were arranged in a column based architecture (Fig. 1) similar to modern FPGAs (e.g. [10]).
The ratio of the number of columns of each component type was based on the average requirement for the benchmarks and is shown in Table 1. Because these ratios are based on the average resource requirements, each benchmark may be constrained by a different limiting resource. The percentage of each resource used for each benchmark is given in Table 2 through Table 4, where the limiting components for each benchmark are shaded.
The embedded units and fast carry chain are dedicated resources that use their own timing parameters. Embedded units can have optional registered inputs and/or registered outputs, and are characterized by two timing parameters: sequential setup time and sequential clock-to-q. The dedicated route of the carry chain also has its own timing.
Component Latency and Area
The CLBs that were used are comparable to the Xilinx slice. Each CLB has two 4-input function generators, two D flip-flops, arithmetic logic gates, and a fast carry-chain. VPR uses subblocks to specify the internal contents of the CLB. Representing the timing of an unmodified CLB required twenty VPR subblocks. The 4:1 multiplexer modification added two VPR subblocks. Each subblock can specify a combinational and sequential logic element and has three timing parameters, similar to the embedded units: sequential setup time, sequential clock-to-q, and maximum combinational delay. These timing parameters were found in Xilinx data sheets or using Xilinx design tools and are shown in Table 5 . We also used two embedded units that were modeled after the Xilinx Virtex-II Pro: an 18x18 bit multiplier and an 18 Kb RAM. However, like the Xilinx Virtex-4 [11] , these units are independent of each other. The timing parameters of the embedded multipliers and RAMs are based on the Xilinx Virtex-II Pro -6 and are shown in Table 5 . The area (including routing) of the CLB, embedded multiplier, and embedded RAM were approximated using a die photo of a Xilinx Virtex-II 1000 courtesy of Chipworks, Inc. The areas were normalized by the process gate length. All areas are referenced to the smallest component (the CLB) and are shown in Table 5 .
Track Length & Delay
Four different routing track lengths were used: single, double, quad, and long, where long tracks spanned the entire chip. The ratio of routing tracks (11:14:21:4) was modeled after the Xilinx Virtex-II Pro. The delay of routing in VPR is calculated based on a resistive and capacitive model. Appropriate values for the routing track segments were found experimentally by laying out and extracting them using Cadence IC design tools.
Embedded Shifter
For floating-point operations, the mantissa can be shifted by any distance up to the full length of the mantissa. Thus, up to 53 bits of shift can be required for IEEE double-precision, but shifters tend to be implemented in powers of two. Therefore, shifters of length 32 (for single precision) and 64 bits were implemented as shown in Fig. 6.
The embedded shifter was designed with five modes (shift left, rotate left, shift right logical, shift right arithmetic, and rotate right) to increase versatility. In addition, the normalization shifting in floating-point units requires calculating a sticky bit. The sticky bit is the logical OR of all of the bits that are lost during a logical right shift. The logic to calculate the sticky bit is included in each shifter as it adds less than 1% to the shifter size.
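The five modes and the sticky bit can be modeled behaviorally as below. This is a software sketch of the functionality only, not of the hardware structure; the mode names (`"sll"`, `"rol"`, `"srl"`, `"sra"`, `"ror"`) and the function `shift` are hypothetical labels introduced for this example, and the sticky bit is computed only for the logical right shift, as described above.

```python
def shift(value, amount, mode, width=64):
    """Behavioral model of a variable length shifter.

    Returns (result, sticky). The shift amount is taken modulo the width,
    and sticky is the OR of the bits lost in a logical right shift.
    """
    mask = (1 << width) - 1
    value &= mask
    amount %= width
    sticky = 0
    if mode == "sll":        # shift left
        result = (value << amount) & mask
    elif mode == "rol":      # rotate left
        result = ((value << amount) | (value >> (width - amount))) & mask
    elif mode == "srl":      # shift right logical, with sticky bit
        result = value >> amount
        sticky = int((value & ((1 << amount) - 1)) != 0)
    elif mode == "sra":      # shift right arithmetic (replicate the sign bit)
        result = value >> amount
        if value >> (width - 1):
            result |= mask & ~(mask >> amount)
    elif mode == "ror":      # rotate right
        result = ((value >> amount) | (value << (width - amount))) & mask
    else:
        raise ValueError("unknown mode: " + mode)
    return result, sticky
```

For instance, `shift(0b1011, 2, "srl")` returns `(0b10, 1)` because the two discarded bits `11` are not both zero, while `shift(0b1000, 2, "srl")` returns `(0b10, 0)`.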
The embedded shifter has a total of 83 inputs and 66 outputs. The 83 inputs include 16 control bits, 64 data bits, and 3 register control bits (clock, reset, and enable). The 66 outputs include 64 data bits and 2 sticky bits (two independent sticky bit outputs are needed when the shifter is used as two independent 32-bit shifters).
The I/O connections are evenly distributed around the periphery of the shifter and connect to CLB-like connection blocks.
The benchmark circuits use the embedded shifters in the fully registered mode, so only the sequential setup time (300 ps) and sequential clock-to-q (700 ps) were needed. Internally, the combinational delay of the shifter was only 1.52 ns. The sequential times were derived from similar registered embedded components of the Xilinx Virtex-II Pro -6, while the combinational time and the area ($0.843 \times 10^{6} L^{2}$) were derived by doing a layout in a 130 nm process. The area is 1.27 times the size of the CLB and its associated routing, but it does not take into account the area needed for additional connections (relative to a CLB) or the area needed for connections to the routing structure. Because this area overhead is difficult to estimate, three different shifter sizes (two, four, and eight equivalent CLBs) were considered. There were only trivial differences in the area and clock rate results (data not shown), so this analysis uses a size of four equivalent CLBs, which together have more connections than one shifter.
Multiplexer
The fine-grained optimization attempts to enhance shifting without impacting the general routing. To accomplish this, the only change that was made to the CLB was to add a single 4:1 multiplexer in parallel with each 4-LUT as shown in Fig. 2 . The multiplexer and LUT share the same four data inputs. The select lines for the multiplexer are the BX and BY inputs to the CLB. Since each logic block using the unmodified Xilinx Virtex II Pro slice has two LUTs, each CLB would have two 4:1 multiplexers that share their select lines. For shifters and other large datapath elements it is easy to find muxes with shared select inputs. The BX and BY inputs are normally used as the independent inputs for the D flip-flops. This new usage prohibits that and requires that the input to the D flip-flops be from the logic within the CLB. This trade-off prevents increasing the number of inputs to the CLB.
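To illustrate why a 4:1 multiplexer suits shifters, the sketch below builds a left shifter from stages of 4:1 multiplexers. Each stage consumes two bits of the shift amount, and every multiplexer in a stage uses the same two select bits, which matches the shared-select arrangement described above. This is a behavioral sketch under our own assumptions (bit lists with index 0 as the least significant bit, hypothetical names `mux4` and `shift_left_mux4`), not the mapping used in the benchmarks.

```python
def mux4(d0, d1, d2, d3, sel):
    """4:1 multiplexer: four data inputs, one 2-bit select value (0..3)."""
    return (d0, d1, d2, d3)[sel]

def shift_left_mux4(bits, amount):
    """Logical left shift built from stages of 4:1 multiplexers.

    bits is a list of 0/1 values with index 0 as the least significant bit.
    Returns (shifted bits, number of multiplexer stages used).
    """
    width = len(bits)
    stride, stages = 1, 0
    while stride < width:
        sel = (amount // stride) % 4          # two select bits for this stage
        prev = bits

        def tap(i, k, prev=prev, stride=stride):
            j = i - k * stride                # source bit for a shift of k * stride
            return prev[j] if j >= 0 else 0   # zeros are shifted in from the right

        bits = [mux4(tap(i, 0), tap(i, 1), tap(i, 2), tap(i, 3), sel)
                for i in range(width)]
        stride *= 4
        stages += 1
    return bits, stages

def to_bits(value, width=64):
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))
```

A 64-bit shifter built this way needs three stages (strides of 1, 4, and 16), compared with six stages of 2:1 multiplexers. A 4:1 multiplexer has six inputs in total, so it does not fit in a single 4-LUT, whereas the dedicated multiplexer reuses the four data inputs of the LUT and takes its two select lines from BX and BY.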
To test the impact of adding the 4:1 multiplexer, a 4-LUT and associated logic was laid out and simulated with and without the capacitive load of the 4:1 multiplexer. Adding the 4:1 multiplexer increased the delay of the 4-LUT by only 1.83%. The delay of the 4:1 multiplexer was 253 ps, which is less than the 270 ps for the 4-LUT from the Xilinx Virtex-II Pro -6 datasheet. The area of the 4:1 multiplexer was $1.58 \times 10^{9} L^{2}$, and adding two of them to each CLB increases the size of the CLB by less than 0.5%.
METHODOLOGY
Five benchmarks were used to test the feasibility of the proposed modifications. They were matrix multiply, matrix vector multiply, vector dot product, FFT, and an LU decomposition datapath. All of the benchmarks use double-precision floating-point addition and multiplication, and LU decomposition includes floating-point division.
Each benchmark was tested in three FPGA versions. The first version is representative of a modern FPGA and includes a combination of CLBs and embedded 18-bit x 18-bit multipliers.
The second version adds an embedded variable length shifter to the baseline, and the third version augments the baseline with a 4:1 multiplexer in parallel with the LUT.
The floating-point benchmarks were written in a hardware description language, either VHDL or JHDL [12] . The benchmarks were synthesized using Synplicity's Synplify 7.6 into an EDIF file. Technology mapping was performed with Xilinx ISE 6.3. While these are slightly older tools, the floating-point units were already hand mapped and so only small parts of the design were synthesized and/or mapped. The Xilinx NGDBuild and the Xilinx map tool were used to reduce the design from gates to slices (which map one-to-one with our CLBs). The Xilinx NCDRead was used to convert the design to a text format. A custom program converted the mapping of the NCD file to the NET format used by VPR.
The benchmarks vary in size and complexity. Table 6 gives the number of components for the benchmarks in the baseline architecture. The number of IO, block RAMs, and embedded multipliers remains constant for all three versions of the benchmarks. Table 7 gives the number of CLBs and embedded shifters for the benchmark versions that use the embedded shifters. Table 7 also shows the number of CLBs for each benchmark version that uses the modified CLBs and the percentage of the CLBs that make use of the 4:1 multiplexer modification. Using embedded shifters reduces the average number of CLBs by 17.3%. Similarly, the 4:1 multiplexer provides an 8.4% reduction.
TESTING & ANALYSIS
Even with an extremely conservative estimate of the embedded shifter size, adding embedded shifters to modern FPGAs significantly reduced circuit size. As seen in Fig. 3 and Fig. 4, adding embedded shifters reduces average area by 14.6% and increases the average clock rate by 3.3%. Only the floating-point units were optimized with the embedded shifters; the control and the remainder of the data path remained unchanged. If we consider only the floating-point units, the embedded shifters reduced the number of CLBs for each double-precision floating-point addition by 31% and required two embedded shifters. For the double-precision floating-point multiplication, the number of CLBs decreased by 22% and two embedded shifters were used, as shown in Fig. 5.
Use of the 4:1 multiplexer modification to the CLB also showed significant improvements. Even though only the floating-point cores were optimized, there was an area savings of 7.3% over the reference benchmarks. In addition to the area savings, there was a speed increase of 11.6%, as seen in Fig. 3 and Fig. 4. Numerous multiplexers are known to exist in VHDL datapaths outside of the floating-point units, so the size of this advantage should grow if this modification were exposed to the synthesis flow. If we consider only the floating-point units, the addition of the multiplexer reduced the size of the double-precision floating-point adder by 17% and reduced the size of the double-precision multiplier by 10%, as shown in Fig. 5.
RELATED WORK
While little work has been dedicated to increasing the efficiency of floating-point operations on FPGAs, some existing work might benefit them. Ye showed the benefits of bus-based routing for datapath circuits [13]. Because IEEE floating-point numbers have 32 or 64 bits (single or double-precision), and these signals will generally follow the same routing path, floating-point datapaths naturally lend themselves to bus-based routing.
Xilinx recently announced their next generation of FPGAs; the Virtex-5 replaces the 4-LUTs with 6-LUTs [15] .
These 6-LUTs would clearly offer the same advantage as using a dedicated 4:1 mux, but would also consume somewhat more area. It is likely that the 6-LUTs would be more flexible than the dedicated 4:1 mux.
The embedded multipliers in the Xilinx architectures can also implement shifters, but this approach is infeasible in modern designs where the multipliers are consumed by the floating-point units to do multiplication. Xilinx AppNote 195 also implies that a 56 bit shift would be an inefficient technique with regards to silicon area.
CONCLUSION
The results indicated that adding shifters to the fabric or 4:1 multiplexers to the CLBs will significantly reduce circuit size for floating-point applications with an increase in circuit frequency. The embedded shifter provided an average area savings of 14.6% and a clock rate increase of 3.3%. The 4:1 multiplexer provided an average area savings of 7.3% while achieving an average speed increase of 11.6%. Neither modification significantly increased track count. The embedded shifters are only 1.5% of the 
