A polynomial accelerator implemented with a custom high-dynamic-range number representation operates at up to 534MHz in the slowest speed grade on a 28nm FPGA, a clock rate that a typical FPGA tool flow cannot achieve. This design tutorial shows how to achieve a physically scalable, high-speed numerical design by partitioning it into a cascade of identical stages and balancing the LUT-to-DSP ratio within each stage to match the available resources on the FPGA.
INTRODUCTION
An $N$th-degree polynomial $\sum_i a_i x^i$ can be evaluated with Horner's method [1] as the cascade of multiply-add stages shown in Figure 1. The authors of [2] show that this recursion is bounded given bounded coefficients and $|x| < 1$. For this design tutorial, each input sample $x$ can change every clock cycle and is represented by an 18-bit fixed-point number in the range $[-(1 - 2^{-17}),\ 1 - 2^{-17}]$. The coefficients $a_i$ are known ahead of time, have up to 17 significant bits, and lie in a bounded range that is symmetric about zero. Mapping the Horner structure efficiently to Xilinx FPGA resources is the subject of this design tutorial.
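As a software-only illustration of the Horner recursion described above, the following Python sketch evaluates a polynomial with one multiply-add per loop iteration and checks the result against direct summation. The function name `horner`, the coefficient ordering (constant term first), and the example coefficients are assumptions made for illustration; they are not taken from Figure 1, and exact rationals stand in for the hardware number formats.

```python
from fractions import Fraction


def horner(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) using repeated multiply-add.

    coeffs[i] is the coefficient of x**i, so the highest-degree
    coefficient is the last element of the list.
    """
    acc = coeffs[-1]                 # start from the highest-degree coefficient
    for a in reversed(coeffs[:-1]):  # one multiply-add per remaining coefficient
        acc = acc * x + a
    return acc


# Hypothetical 5th-degree example, evaluated exactly with rationals.
coeffs = [Fraction(1), Fraction(-1, 2), Fraction(1, 4),
          Fraction(0), Fraction(3), Fraction(-2)]
x = Fraction(1, 2)
direct = sum(a * x**i for i, a in enumerate(coeffs))
print(horner(coeffs, x) == direct)
```

Each loop iteration corresponds to one multiply-add operation of the kind the cascade implements; the sketch says nothing about stage numbering, bit widths, or scaling.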
Four design aspects are considered.
1. Even though the input can be represented as an 18-bit fixed-point quantity, the dynamic range of the coefficients and of the intermediate values can exceed this range. The number representation in Section 2 facilitates efficient implementation of addition and multiplication of high-dynamic-range numbers.
2. The Xilinx DSP48E1 slice [3] consists of a 25 × 18 two's complement multiplier followed by a 48-bit adder. The multiplier products in Figure 1 are not directly accessible for scaling. A method to pre-scale the coefficients to prevent overflow in the multiply-add unit is developed in Section 3.
3. To achieve maximum clock frequency and to ease scaling, each stage in Figure 1 is physically mapped to the FPGA to pitch-match the regular layout of the resources, the subject of Section 4.
4. Once one stage is mapped, multiple stages are cascaded to implement the structure in Figure 1. Section 5 presents the placement options beyond those provided by the standard tool flow.
NUMBER REPRESENTATION
2.1 Overview
Evaluating a polynomial requires a high dynamic range. A custom number representation called XDR is outlined here for efficient use of two's-complement hardware for high-dynamic-range operations. An XDR($s$, $e$) number is represented by the function $\mathcal{X}(m, k)$ such that $\mathcal{X}(m, k) = m \cdot 2^{k}$, where $m$ is an $s$-bit two's-complement significand excluding the most negative number $-2^{s-1}$, i.e. $|m| < 2^{s-1}$, and $k$ is an $e$-bit sign-magnitude integer exponent. Although normalization is optional in XDR and thus multiple representations of the same number are possible [4], the exponent is chosen so that the significand can be placed directly on the wires in two's complement arithmetic hardware.

For instance, the XDR(9,7) number $\mathcal{X}(68, -63)$ represents $68 \times 2^{-63} \approx 7.37 \times 10^{-18}$. The significand 68 is placed on an input port, at least 9 bits wide, of a two's complement arithmetic circuit, and the 7-bit sign-magnitude exponent -63 is kept track of elsewhere. One can also write the same number as $\mathcal{X}(136, -64)$, effectively normalizing the significand. Note that by excluding the most negative $s$-bit two's complement number, the zero significand is represented uniquely and there are as many negative significands as positive ones. As a result, the absolute value of any $s$-bit XDR significand still fits within $s$ bits, which is useful for normalization.
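The XDR(9,7) example can be checked numerically. The sketch below assumes only that the value of an XDR number is its significand multiplied by two raised to its exponent, and that the most negative two's complement significand is excluded; `xdr_value` and `is_valid_significand` are hypothetical helper names, not part of any library.

```python
from fractions import Fraction


def xdr_value(significand, exponent):
    """Exact value significand * 2**exponent of an XDR number."""
    return Fraction(significand) * Fraction(2) ** exponent


def is_valid_significand(significand, width):
    """True if the significand fits in `width`-bit two's complement,
    with the most negative value -2**(width - 1) excluded."""
    return abs(significand) < 2 ** (width - 1)


# The two representations quoted in the text denote the same number.
print(xdr_value(68, -63) == xdr_value(136, -64))
print(f"{float(xdr_value(68, -63)):.2e}")

# A 9-bit significand may range from -255 to 255; -256 is excluded.
print(is_valid_significand(-255, 9), is_valid_significand(-256, 9))
```

Using exact rationals avoids any floating-point rounding when comparing two representations of the same value.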
Specific to This Design Tutorial
Because the input sample $x$ is an 18-bit fixed-point number and $|x| < 1$, it can be written as $\mathcal{X}(x \cdot 2^{17}, -17)$. This value will be written as $\mathcal{X}(m_x, k_x)$, or sometimes just $(m_x, k_x)$, where the significand $m_x = x \cdot 2^{17}$ and the exponent $k_x$ is a constant $-17$. The significand is meant to be placed directly on the wires of an 18-bit port of a two's complement multiplier.
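A minimal sketch of this mapping follows, assuming the input arrives as a real number that must first be quantized to 18 bits. The rounding and clamping steps are assumptions added for illustration (in the design the sample is already an 18-bit fixed-point number), and `sample_to_xdr` is a hypothetical name.

```python
def sample_to_xdr(x):
    """Return (significand, exponent) for a sample with |x| < 1.

    The exponent is the constant -17, so the significand is x * 2**17
    rounded to an integer (rounding is an assumption of this sketch).
    """
    significand = round(x * 2**17)
    limit = 2**17 - 1                      # most negative value -2**17 is excluded
    significand = max(-limit, min(limit, significand))
    return significand, -17


m, k = sample_to_xdr(0.5)
print(m, k)
print(m * 2.0**k)                          # recovers the original sample
```

The integer `m` is the bit pattern that would sit on the 18-bit multiplier port; the exponent is never stored with it because it is the same for every sample.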
A REPEATING STRUCTURE
3.1 DSP48E1 as a Multiply-Add Unit
Horner's method evaluates a polynomial as a series of multiply-add operations. Each multiply-add unit maps to one Xilinx DSP48E1 slice (Figure 2) and forms the compute core of a repeating structure. The multiply-add unit implements $P = A \times B + C$ as two's complement integers. Specifically, the internal product $M = A \times B$ is added to $C$ to produce $P$. The bit widths of $A$, $B$, $C$, $M$, and $P$ are 25, 18, 48, 43, and 48 respectively.
Note that the XDR(18,5) significand of the input sample is placed on the wires of the $B$ port of the DSP48E1 slice.
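The integer behavior of the multiply-add unit can be modeled in a few lines. The sketch below is a behavioral model under the bit widths stated above (25- and 18-bit multiplier operands, a 43-bit product, and a 48-bit addend and result); the function names are hypothetical, and raising an error on overflow is a modeling choice of this sketch rather than a description of the hardware.

```python
def fits_signed(value, bits):
    """True if value is representable in `bits`-bit two's complement."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def dsp_multiply_add(a, b, c):
    """Integer model of a multiply-add: product of a and b, plus c."""
    if not (fits_signed(a, 25) and fits_signed(b, 18) and fits_signed(c, 48)):
        raise ValueError("operand exceeds its port width")
    product = a * b                      # internal product
    assert fits_signed(product, 43)      # always holds for 25 x 18 operands
    result = product + c
    if not fits_signed(result, 48):
        raise OverflowError("result does not fit in 48 bits")
    return result


print(dsp_multiply_add(3, 4, 5))
```

A model like this makes it easy to check whether a proposed pre-scaling of the addend keeps the sum inside 48 bits for the worst-case operands.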
Input Pre-processing
The DSP48E1 slice is an integer multiply-add unit with neither its own leading-one detector nor arithmetic shifters. To minimize the area of the polynomial accelerator, pre-scaling inputs is necessary to fuse XDR multiplication and addition within one DSP48E1. Figure 3 shows a repeating structure that pre-processes the inputs to the multiply-add unit to avoid overflow while retaining as many significant digits as possible. A cascade of 1 stages of this structure implements a th -degree polynomial. The stages are labeled according to the indices of the polynomial coefficients attached to the adders. As a result, the leftmost stage in Figure 1 is stage 1, the stage immediately to its right is stage 2, and so on until the rightmost stage, which is stage 1.
, where the unsigned significand is a shifted version of the magnitude of the previous-stage significand and fits into the $A$ port of the DSP48E1 multiplier¹. The shift amount applied to this magnitude is chosen to minimize loss of significant digits, and is coordinated with the shift amount for the coefficient. The bit-loss minimization algorithm, which solves a recurrence formula across all stages, is beyond the scope of this paper. It suffices to say that the shift amount generator E in Figure 3 instructs the significand of the previous-stage input to shift by the appropriate amount.
2. The second addend is some version of the coefficient, and again is shifted by the same shift amount generator that shifts the previous-stage input.

Figure 5: Balanced LUT-to-DSP ratio.

¹ An unsigned significand is used purely for implementation reasons and not for mathematical reasons.
REDUCING THE LUT COUNT
Converting some LUTs to DSPs results in a lower LUT-to-DSP ratio and is beneficial for pitch-matching each stage in the Horner form to rows of DSP48E1 slices. The absolute value block (abs) is a good candidate to map to a DSP48E1 slice. Taking the absolute value of an XDR significand is a bit more work than taking that of a sign-magnitude value, or even that of a two's complement value. This is because the most negative two's complement value is not allowed, so its absolute value is saturated to the most positive value. For instance, a four-bit significand ranges from -7 to 7, but a two's complement arithmetic unit can still place -8 on the four wires. The absolute value of -8 in this case is saturated to 7.
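Let $m$ be an $s$-bit XDR significand. The absolute value of $m$ is

$$\mathrm{abs}(m) = \begin{cases} m, & m \ge 0 \\ -m, & -2^{s-1} < m < 0 \\ 2^{s-1} - 1, & m = -2^{s-1} \end{cases}$$

The DSP48E1 slice can be configured to implement the XDR absolute function by setting CARRYIN to $\mathrm{sgn}(m)\ \&\ (m \neq -2^{s-1})$ and ALUMODE[0] to $\mathrm{sgn}(m)$, where $\mathrm{sgn}(m)$ is the sign bit of $m$.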
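The saturating absolute value can be sketched as an invert-and-add-carry operation. This is a behavioral model only: it assumes that the sign bit selects bitwise inversion of the operand (the role played by ALUMODE[0] in the text) and that the carry input is suppressed for the most negative pattern. The name `xdr_abs` and the variable names are hypothetical.

```python
def xdr_abs(m, width):
    """Saturating absolute value of a `width`-bit two's complement pattern."""
    most_negative = -(1 << (width - 1))
    if not most_negative <= m < (1 << (width - 1)):
        raise ValueError("pattern does not fit in the given width")
    sgn = 1 if m < 0 else 0
    # The carry is applied only for negative values other than the most negative.
    carry_in = sgn & (1 if m != most_negative else 0)
    if sgn:
        return ~m + carry_in      # ~m equals -m - 1, so the carry completes the negation
    return m


# Four-bit example from the text: -8 saturates to 7.
print([xdr_abs(m, 4) for m in range(-8, 8)])
```

For the most negative pattern the inversion alone already yields the most positive value, which is why withholding the carry implements the saturation without any extra comparison on the result.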
MULTI-STAGE PLACEMENT
5.1 The Intuition
Now that there are two DSP48E1 slices per stage, a floorplan like that in Figure 5 can be achieved. For instance, the absolute value block occupies the DSP48E1 slice in the left column while the multiply-add block occupies the DSP48E1 slice in the right column. Since the LUTs to the left and the right of the two DSP48E1 slices may not be sufficient, not all DSP48E1 slices in the DSP48E1 columns may be utilized, resulting in a floorplan similar to that in Figure 6, in which a 5th-degree polynomial requires seven DSP48E1 slots across two rows.

A 5th-degree polynomial is chosen because it is the most common polynomial degree for digital predistortion. The maximum operating frequency is 519MHz, which leaves sufficient margin above 491.52MHz for wireless base station applications. Table 1 summarizes the resource usage. The per-degree results in Table 1 show fewer resources and a higher speed compared to the single-precision fused multiply-add unit in [5], which consumes 4 DSP48E1s, 802 LUTs, and 1233 FFs with a maximum frequency of 488MHz. The two designs are not functionally equivalent: the polynomial accelerator presented here is created specifically for a hybrid fixed-point and high-dynamic-range application. These resource usage and frequency values nonetheless show that it is worth the effort to analyze a high-dynamic-range design with XDR and to architect the Horner polynomial chain with only one normalizer and two shifters.
Expanding the Placement Search Space
Given a rectangular region in Figure 6 with two columns of seven DSP48E1 slices each, how good is the DSP48E1 placement in Because there are two DSP48E1 slices per stage, this large search space can be pruned so that exactly five DSP48E1 slices are used in each of the two DSP columns. This pruning generates only 441 placements. Along the dataflow of the polynomial, the absolute-value DSP and the multiply-add DSP are visited alternately, resulting in three reasonable orderings of these DSPs across the two columns: ping-pong, U-shaped, and S-shaped.
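The figure of 441 is consistent with choosing five of the seven slots independently in each column. The sketch below assumes that a placement is identified only by which slots are occupied; the unconstrained count (any 10 of the 14 slots) is computed for comparison and is not a number quoted from the text. The function name is hypothetical.

```python
from itertools import combinations
from math import comb


def count_slot_choices(slots_per_column=7, used_per_column=5, columns=2):
    """Number of ways to pick the occupied slots, column by column."""
    return comb(slots_per_column, used_per_column) ** columns


pruned = count_slot_choices()        # five of seven slots in each of two columns
unconstrained = comb(14, 10)         # any ten of the fourteen slots
print(pruned, unconstrained)

# Enumerating the occupied-slot patterns for one column (slot indices 0..6).
one_column = list(combinations(range(7), 5))
print(len(one_column))
```

Enumerating `one_column` twice and taking the Cartesian product gives every pruned slot pattern, which is small enough to place and route exhaustively.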
All placements are done stage-by-stage, starting with stage 5 of the 5th-degree polynomial.
1. The ping-pong placement has all the absolute-value DSPs in one column and the multiply-add DSPs in the other (Figure 7a).
2. The U-shaped placement requires one column of DSPs to be filled first before the other DSP column is used (Figure 7b). For instance, the absolute-value blocks and the multiply-add units of stages 5 and 4 plus the absolute-value block of stage 3 occupy one DSP column, leaving the other column to the multiply-add unit of stage 3 and all of stages 2 and 1.
3. Like the ping-pong placement, the S-shaped placement places the absolute-value DSP and the multiply-add DSP of the same stage in different DSP columns, but the absolute-value DSP of one stage and the multiply-add DSP of its adjacent stage must be in the same DSP column (Figure 7c).
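The three orderings above can be expressed as column assignments along the dataflow. The sketch below assumes that, within each stage, the absolute-value DSP precedes the multiply-add DSP in the dataflow (as the U-shaped example suggests) and assigns only a column, not a vertical slot. The names `dataflow_order`, `column_assignment`, and the labels "abs" and "mac" are hypothetical.

```python
def dataflow_order(stages=5):
    """DSPs in dataflow order: abs then multiply-add, from the top stage down to 1."""
    return [(kind, stage)
            for stage in range(stages, 0, -1)
            for kind in ("abs", "mac")]


def column_assignment(strategy, stages=5):
    """Map each (kind, stage) DSP to column 0 or 1 for the given strategy."""
    order = dataflow_order(stages)
    columns = {}
    for i, dsp in enumerate(order):
        if strategy == "ping-pong":
            col = i % 2                          # alternate columns at every DSP
        elif strategy == "U":
            col = 0 if i < len(order) // 2 else 1  # fill one column, then the other
        elif strategy == "S":
            col = ((i + 1) // 2) % 2             # switch columns inside each stage only
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        columns[dsp] = col
    return columns


for name in ("ping-pong", "U", "S"):
    cols = column_assignment(name)
    print(name, [cols[dsp] for dsp in dataflow_order()])
```

All three strategies place five DSPs in each column for a 5th-degree polynomial, so each is compatible with the pruned search space of five used slots per column.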
RESULTS
The distributions of the three placements are plotted in Figure 6. The ping-pong placement generally yields faster designs than the other two placement strategies. All final placed-and-routed results are plotted in a histogram in Figure 9; the 46.5% of results that exceed the 500MHz requirement are represented by the green bars. With only a rectangular region constraint from Figure 6 specifying where all cells of the netlist should be placed, and without specifying which 10 of the 14 DSP48E1 slots should be used, Vivado™ 2013.1 produced a design at 473MHz on the Xilinx Kintex™-7 XC7K410T device. The fastest design achieved 534MHz, and this last 13% of speed improvement was not straightforward to obtain. The initial manual placement from Section 5.1 achieved a respectable 519MHz.
CONCLUSIONS
A polynomial accelerator has been shown to operate up to 534MHz at the slowest speed grade on a 28nm FPGA. It accepts an 18-bit fixed-point input and high-dynamic-range coefficients. The FPGA polynomial accelerator has been architected with a custom number representation known as XDR. Because the two's complement significand of an XDR number models bits that are put directly on the wires of two's complement integer arithmetic circuits, XDR is useful for analyzing high-dynamic-range designs using two's complement integer arithmetic operators.
The polynomial accelerator is implemented with Horner's method and is physically partitioned into a cascade of identical stages. To balance LUT and DSP48E1 usage, two DSP48E1s are used in each stage. One DSP48E1 slice serves as the XDR absolute value evaluator and the other DSP48E1 slice is the multiply-add unit. Place-and-route experiments of a 5th-degree polynomial including three DSP48E1 placement strategies reveal a design space with a range of maximum clock frequencies from 460MHz to 534MHz using Vivado 2013.1 on a -1I (slowest) grade of the Xilinx Kintex™-7 XC7K410T device. These experiments provide space for placing 10 DSP48E1s into 14 DSP48E1 slots, which are partitioned into two 7-slot DSP48E1 columns. An intuitive placement (the ping-pong placement) with unused DSP48E1 slots in strategic locations to give more room for LUTs proves to be effective since it yields a maximum frequency of 519MHz, 13% above what automatic placement can achieve and very close to the high end of the 460MHz-to-534MHz design space.
