Multiple constant multiplication (MCM) operation multiplies an input variable with multiple constants. MCM operations are widely used in many applications such as video processing and compression. In this paper, a method is proposed for efficient implementation of MCM operations using DSP blocks in Xilinx FPGAs. The proposed method reduces the number of DSP blocks used for implementing a given MCM operation by manipulating the multiple constants used in this MCM operation. A high level synthesis tool implementing the proposed method is also proposed. The proposed tool takes the input variable bit length and multiple constants as inputs, and generates Verilog RTL code which efficiently implements this MCM operation using DSP blocks. The proposed method and tool are used for one of the most complex video compression algorithms, HEVC 2D DCT. They reduce the number of DSP blocks used in the FPGA implementation of the HEVC 2D DCT algorithm by 35.8%.
I. INTRODUCTION
Modern FPGAs provide different full-custom built-in blocks in addition to look up tables (LUTs) and registers. FPGA implementations using these built-in blocks can perform operations faster, with fewer resources and lower power consumption, than FPGA implementations using LUTs. Therefore, they are used in a wide range of applications. DSP blocks, which have full-custom multiplier hardware, are one of these built-in blocks in FPGAs. They are used in FPGA implementations of many applications such as video processing and compression, and machine learning [1] - [2] .
Multiple constant multiplication (MCM) operation multiplies a single variable with multiple constants. It is used in many digital signal processing (DSP) applications such as finite impulse response (FIR) filter, discrete cosine transform (DCT) and fast Fourier transform (FFT).
DSP blocks perform constant multiplications faster and with less energy than adders and shifters. A DSP block can be used to perform different constant multiplications by providing the proper constant value to its input. Therefore, it is more efficient to implement constant multiplications using DSP blocks instead of using adders and shifters in an FPGA implementation.
In this paper, a method is proposed for efficient implementation of MCM operations using DSP blocks in Xilinx FPGAs. The proposed method reduces the number of DSP blocks used for implementing a given MCM operation by manipulating the multiple constants used in this MCM operation. A high level synthesis tool implementing the proposed method is also proposed. The proposed tool takes the input variable bit length and multiple constants as inputs, and generates Verilog RTL code which efficiently implements this MCM operation using DSP blocks.
To demonstrate the effectiveness of the proposed method and tool, one of the most complex video compression algorithms, HEVC Two Dimensional (2D) DCT, is implemented on an FPGA using them. They reduce the number of DSP blocks used in the FPGA implementation of the HEVC 2D DCT algorithm by 35.8%.
Many techniques are proposed in the literature to optimize multiple constant multiplication operations [3] - [5] . These techniques try to find common sub-expressions between multiple constants, and they implement MCM operations using adders and shifters. They provide efficient solutions for ASIC implementations. However, since their FPGA implementations use LUTs instead of DSP blocks, they are less efficient than FPGA implementations using DSP blocks.
The techniques proposed in [6] - [7] use pipeline registers in DSP blocks to schedule operations efficiently. They also use different scheduling techniques to increase the resource sharing on DSP blocks. In this way, they can increase the throughput. The method proposed in [8] can map two constant multiplication operations into one DSP block by concatenating two constants and assigning them to one input of DSP block. Therefore, it reduces the number of DSP blocks used for MCM operations. The method proposed in [8] uses only two inputs (A, B) of a DSP block. It works only for unsigned input variables. However, the method proposed in this paper can map more constant multiplication operations to one DSP block by manipulating the constants and using three inputs (A, B, C) of a DSP block. In addition, it works for both unsigned and signed input variables.
II. XILINX DSP BLOCK ARCHITECTURE
The simplified architecture of the Xilinx DSP48E1 block is shown in Fig. 1 . A DSP48E1 block has a signed 25 bit preadder, a signed 25x18 bit multiplier and an ALU which has a 48 bit adder/subtractor and a pattern detector. Since these sub-blocks are implemented as full-custom hardware on the FPGA, they provide higher speed and lower power consumption than equivalent LUT implementations of the same operations. These sub-blocks can be configured to implement different operations in each clock cycle. In this way, a DSP48E1 block can perform different operations such as A×B and A×B+C.
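The A×B+C operation mentioned above can be illustrated with a small behavioral model. This is a minimal sketch, not a description of the actual DSP48E1 primitive: the function names are hypothetical, the preadder and pattern detector are ignored, and the port widths (signed 25 bit A, signed 18 bit B, signed 48 bit C and result) are assumptions taken from the multiplier and adder sizes stated in the text.

```python
def wrap_signed(value, bits):
    """Wrap an integer to a signed two's complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def dsp_multiply_add(a, b, c=0):
    """Behavioral model of P = A x B + C.

    Assumed widths: signed 25 bit A, signed 18 bit B, signed 48 bit C and P.
    """
    for name, value, bits in (("A", a, 25), ("B", b, 18), ("C", c, 48)):
        if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
            raise ValueError(f"{name} does not fit in {bits} signed bits")
    # The 48 bit adder wraps around on overflow in this model.
    return wrap_signed(a * b + c, 48)
```

With C left at zero the model reduces to a plain A×B multiplication, which is the configuration used when one constant is multiplied per DSP block.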
III. PROPOSED METHOD
The proposed method manipulates the constants in a multiple constant multiplication operation to map more constant multiplication operations to one DSP block.
Multiplication of an input variable with two constants can be performed using two DSP blocks, where each DSP block performs multiplication with one constant. The method proposed in [8] concatenates two constants to perform two constant multiplications using one DSP block as shown in (1) . A DSP block has a 25x18 bit signed multiplier. Therefore, the bit length of the input variable and the sum of the bit lengths of the input variable and the two constants should be less than 18 and 25, respectively, to map two constant multiplications to one DSP block.
The symbols "×", "{ , }", "|", "≪" and "≫" represent multiplication, concatenation, bit-wise or, left shift and right shift operations, respectively. This method can be used for k constant multiplications if the bit lengths of input variable and constants satisfy the condition in (2) .
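The concatenation idea of [8] can be illustrated with plain integer arithmetic. The following is a minimal sketch for unsigned values only; the function names and the exact placement of the zero padding are assumptions for illustration and are not taken from equations (1) and (2).

```python
def pack_two_constants(c1, c2, n):
    """Pack unsigned constants c1 and c2 into one multiplier operand.

    n is the bit length of the unsigned input variable. Returns the packed
    constant and the bit offset at which c1 is placed.
    """
    offset = c2.bit_length() + n        # c2 * v needs at most this many bits
    packed = (c1 << offset) | c2        # {c1, zero padding, c2}
    return packed, offset


def fits_one_dsp(c1, c2, n):
    """Bit-length condition stated in the text for a signed 25x18 multiplier."""
    return n < 18 and c1.bit_length() + c2.bit_length() + n < 25


def multiply_both(c1, c2, v, n):
    """Compute c1 * v and c2 * v with a single multiplication."""
    packed, offset = pack_two_constants(c1, c2, n)
    product = packed * v                    # the only multiplication
    low = product & ((1 << offset) - 1)     # lower field holds c2 * v
    high = product >> offset                # upper field holds c1 * v
    return high, low
```

For example, with an 8 bit input, `multiply_both(83, 36, 200, 8)` yields both products from one multiplication. The two fields cannot overlap because c2 * v always fits in the `offset` low bits.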
The proposed method reduces the bit length of a constant by manipulating the constant and utilizing the C input of the DSP block. The manipulation of an m bit constant multiplied with the input variable V is shown in (3)-(10). After the manipulation, the manipulated constant, V and the term that is added to the multiplication result are assigned to the A, B and C inputs of the DSP block, respectively. The bit length of the manipulated constant is smaller than the bit length of the original constant. Therefore, more constant multiplications can be mapped to one DSP block by manipulating the constants before multiplication.
If the input variable is a signed number, the sign extension of the term assigned to the C input should also be added to the multiplication result for each constant multiplication using the C input of the DSP block. The sign extension calculation for an m bit constant and a signed input variable is shown in (11). After sign extension, the concatenation of the sign extension and this term should be assigned to the C input of the DSP block.
Multiplication of an input variable with two constants can be performed using the proposed method as shown in (12)-(14). The final concatenation and shift operations are not shown for simplicity.
The proposed method can be used for k constant multiplications if the bit lengths of input variable and manipulated constants satisfy the condition in (15).
An MCM example using the proposed method is shown in Fig. 2 . In this example, a 9 bit signed input variable is multiplied with two constants: 78913 and 100663360. Since a DSP block has a 25x18 bit multiplier, multiplication with the 27 bit constant 100663360 cannot be mapped to one DSP block without the proposed constant manipulation. Therefore, the concatenation method proposed in [8] cannot map this constant multiplication to a DSP block. However, the proposed method can map these two constant multiplications to one DSP block.
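The bit lengths in this example can be checked directly. The sketch below only verifies the sizes of the two constants and applies the bit-length condition given earlier for direct concatenation; the helper name is hypothetical, and the constant manipulation itself is not reproduced here.

```python
INPUT_BITS = 9
CONSTANTS = (78913, 100663360)


def direct_concatenation_fits(constants, n):
    """Bit-length condition from the text for mapping two constants
    to a signed 25x18 multiplier without any manipulation."""
    return n < 18 and sum(c.bit_length() for c in constants) + n < 25


bit_lengths = [c.bit_length() for c in CONSTANTS]
print(bit_lengths)
print(direct_concatenation_fits(CONSTANTS, INPUT_BITS))
```

The larger constant alone is wider than the 25 bit multiplier input, so it cannot be applied directly, and the pair is far from satisfying the concatenation condition.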
IV. PROPOSED HIGH-LEVEL SYNTHESIS (HLS) TOOL
In this paper, a high level synthesis tool implementing the proposed algorithm is also proposed. As shown in Fig. 3 , the proposed tool takes the input variable bit length and multiple constants as inputs, and generates Verilog RTL code which efficiently implements this MCM operation using DSP blocks. If a constant is a power of 2, this constant multiplication is implemented with a shift operation. If a constant is a power of 2 multiple of another constant in the input constants, this constant multiplication is also implemented with a shift operation. The remaining constant multiplications are mapped to DSP blocks using the proposed DSP mapping algorithm. Finally, Verilog RTL code is generated for this MCM operation.
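The shift-based pre-processing step described above can be sketched as follows. This is an illustrative reading of the text rather than the tool's actual code: the function name and return format are hypothetical, and positive integer constants are assumed.

```python
def classify_constants(constants):
    """Split constants into shift-only ones and ones that need a DSP block.

    Returns (shifts, remaining). shifts maps a constant to (base, shift)
    with constant == base << shift, where base is 1 or a remaining constant.
    """
    shifts, remaining = {}, []
    for c in sorted(set(constants)):          # smallest first, so bases come early
        if c <= 0:
            raise ValueError("this sketch assumes positive constants")
        if c & (c - 1) == 0:                  # c is a power of 2
            shifts[c] = (1, c.bit_length() - 1)
            continue
        # Look for a remaining constant r with c == r * 2**s
        base = next((r for r in remaining
                     if c % r == 0 and (c // r) & (c // r - 1) == 0), None)
        if base is not None:
            shifts[c] = (base, (c // base).bit_length() - 1)
        else:
            remaining.append(c)
    return shifts, remaining
```

For instance, `classify_constants([64, 83, 36])` marks 64 as a shift by 6 and leaves 83 and 36 for the DSP mapping step.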
The flow chart of the proposed DSP mapping algorithm is shown in Fig. 4 . The proposed algorithm takes the bit length of the input variable and the multiple constants as input. It groups and maps the multiple constant multiplications to the minimum number of DSP blocks. The proposed DSP mapping algorithm is an iterative algorithm. It starts with Iteration 0 and Level 0. It puts all the constants that will be multiplied with the input variable into Constants_List. It sorts the constants in Constants_List, generates all possible combinations of these constants and puts them into Combinations_List. Each combination has the same number of constants, and the number of constants in a combination is determined by the input variable bit length. Table I shows the maximum number of constant multiplications that can be mapped to a DSP block (MaxConstDSP) for different input variable bit lengths. MaxConstDSP can be calculated by using (15) .
Counter keeps the number of combinations tried to be mapped to a DSP block at each level. Initially, Counter is set to zero for each level. The proposed algorithm takes the first combination in Combinations_List of the current level and determines whether it can be mapped to one DSP block or not by using the cost calculation function shown in Fig. 5 .
If the current combination cannot be mapped to a DSP block, Counter for the current level is incremented by one, the next combination in Combinations_List is taken and the cost calculation function shown in Fig. 5 is used to determine whether it can be mapped to one DSP block or not. This process continues until either a combination that can be mapped to one DSP block is found or it is determined that none of the combinations in Combinations_List can be mapped to a DSP block.
If the current combination can be mapped to one DSP block, then it is added to DSP_List and the constants in the current combination are removed from Constants_List. DSP_List contains the combinations of constants that are mapped to DSP blocks. Counter for the current level is incremented by one. Level is incremented by one and the proposed algorithm continues with the next level. Level keeps the recursion depth of the current iteration.
If there are no constants left in Constants_List, this means all the constant multiplications are mapped to DSP blocks and the proposed algorithm terminates successfully.
If none of the combinations in Combinations_List of the current level can be mapped to a DSP block and Level is greater than 0, Counter for the current level is reset to 0, Level is decremented by one, the last combination is removed from DSP_List, the constants in that combination are added to Constants_List, and the proposed algorithm continues with the previous level.
If none of the combinations in Combinations_List of the current level can be mapped to a DSP block and Level is 0, this means the proposed algorithm cannot map the constant multiplications to DSP blocks in groups of MaxConstDSP and the proposed algorithm terminates the current iteration.
In that case, the proposed algorithm adds constant 0 to Constants_List and it starts the next iteration with this new Constants_List. In these iterations, if a combination contains 0 as a constant, it means that only nonzero constants in this combination are mapped to a DSP block. Therefore, some combinations with fewer nonzero constants than MaxConstDSP can be mapped to DSP blocks.
The proposed algorithm continues with next iterations by adding constant 0 to Constants_List until all the constant multiplications are mapped to DSP blocks successfully.
The proposed algorithm sorts the constants in Constants_List before generating Combinations_List. In this way, it generates combinations of constants in the same order as it goes back and forth between different levels.
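The iteration and level structure described above can be sketched as a backtracking search. This is a simplified illustration, not the authors' implementation: all names are hypothetical, recursion replaces the explicit Level and Counter bookkeeping, and `fits` is only a stand-in for the cost calculation function of Fig. 5, assumed here to be a bit-length check for concatenated constants.

```python
from itertools import combinations


def fits(group, n, width=25):
    """Assumed stand-in for the cost function: the nonzero constants of the
    group, separated by n bit gaps, must fit in the multiplier input."""
    nonzero = [c for c in group if c]
    if not nonzero:
        return False                    # a block holding only zeros is wasted
    bits = sum(c.bit_length() for c in nonzero) + (len(nonzero) - 1) * n
    return bits < width


def map_level(constants, k, n):
    """Try to split constants into groups of k. Returns the groups or None."""
    if not constants:
        return []                       # all constants are mapped
    for combo in combinations(constants, k):
        if not fits(combo, n):
            continue                    # try the next combination
        rest = list(constants)
        for c in combo:
            rest.remove(c)
        deeper = map_level(rest, k, n)  # continue with the next level
        if deeper is not None:
            return [combo] + deeper
    return None                         # go back to the previous level


def map_constants(constants, k, n):
    """Add constant 0 until a mapping in groups of k is found."""
    if any(not fits((c,), n) for c in constants):
        raise ValueError("a constant does not fit in a DSP block on its own")
    padded = sorted(constants)
    while True:
        groups = map_level(padded, k, n)
        if groups is not None:
            return [tuple(c for c in g if c) for g in groups]
        padded = sorted(padded + [0])   # start the next iteration
```

As a usage example, `map_constants([3, 5, 40000, 60000], k=2, n=8)` fails with zero and one added zeros and succeeds with two, returning three groups. This mirrors the pattern in which four constant multiplications end up in three DSP blocks, although the constants here are made up.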
An example of mapping constant multiplications to DSP blocks using the proposed algorithm is shown in Fig. 6 . In iteration 0, the proposed algorithm cannot map the four constant multiplications to two DSP blocks. Constant 0 is added to Constants_List in iteration 1 and iteration 2. The proposed algorithm maps the four constant multiplications to three DSP blocks successfully in iteration 2.
V. CASE STUDY: HEVC 2D DCT
HEVC uses DCT-II for DCT operations. It uses 4x4, 8x8, 16x16 and 32x32 Transform Unit (TU) sizes. HEVC performs the 2D transform operation by applying 1D transforms in vertical and horizontal directions. The coefficients in the HEVC 1D transform matrices are derived from the DCT-II basis functions. However, integer coefficients are used for simplicity.
In this paper, three different HEVC 2D DCT hardware for all TU sizes are designed and implemented. The first (baseline) hardware uses DSP blocks for multiple constant multiplications. In this hardware, each multiplication is implemented using one DSP block. The second (concatenate) hardware uses concatenation method proposed in [8] . In the baseline and concatenate hardware, if a constant is a power of 2 or a power of 2 multiple of another constant, it is implemented using shift operation instead of DSP block. Finally, the third (proposed) hardware uses the proposed DSP mapping algorithm to reduce number of DSP blocks.
The proposed hardware performs 2D DCT by first performing 1D DCT on the columns of a TU, and then performing 1D DCT on the rows of the TU. After 1D column DCT, the resulting coefficients are stored in a transpose memory, and they are used as input for 1D row DCT. One 4x4 datapath is used for 4x4 TU size. Two 4x4 datapaths are used for 8x8 TU size. Two 4x4 datapaths and one 8x8 datapath are used for 16x16 TU size. Two 4x4, one 8x8 and one 16x16 datapaths are used for 32x32 TU size [9] .
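The column-then-row structure can be illustrated for the 4x4 TU size. The sketch below is a software model only: it uses the standard HEVC 4x4 integer transform matrix, omits the intermediate scaling and rounding shifts of HEVC, and all function names are hypothetical.

```python
# HEVC 4x4 integer DCT matrix (the distinct magnitudes are 64, 83 and 36)
HEVC_DCT4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]


def transpose(matrix):
    """Model of the transpose memory: swap rows and columns."""
    return [list(row) for row in zip(*matrix)]


def dct_1d(vector, matrix=HEVC_DCT4):
    """1D transform of one column or row (matrix-vector product)."""
    return [sum(c * x for c, x in zip(row, vector)) for row in matrix]


def dct_2d(block):
    """2D transform: 1D transform on the columns, then on the rows."""
    columns = [dct_1d(col) for col in transpose(block)]  # column pass
    rows = transpose(columns)                            # transpose memory
    return [dct_1d(row) for row in rows]                 # row pass
```

For a constant block such as `[[1] * 4 for _ in range(4)]`, only the top-left (DC) coefficient of the result is nonzero. In each 1D transform every input sample is multiplied with the same small set of constants, which is the MCM operation that is mapped to DSP blocks.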
Since different constants are used in HEVC 2D DCT for 4x4, 8x8, 16x16 and 32x32 TU sizes, four different multiplier blocks are used in the proposed hardware. Multiplier blocks in the first 4x4, second 4x4, 8x8 and 16x16 datapaths multiply a single input with 3, 4, 8 and 16 different constants, respectively. The proposed DSP mapping algorithm is used for MCM operations implemented in these datapaths. As shown in Table II , the proposed DSP mapping algorithm performs 27 different constant multiplications in HEVC 2D DCT using only 14 and 21 DSP blocks in the column (DCT_Column) and row (DCT_Row) transforms, respectively.
The Verilog RTL codes are synthesized and mapped to a Xilinx XC6VLX550T FF1760 FPGA with speed grade 2 using Xilinx ISE 14.7. FPGA implementations are verified with post place and route simulations. The implementation results are shown in Table III . The proposed FPGA implementation uses 35.8% and 13% fewer DSP blocks than the baseline and concatenate FPGA implementations, respectively.
VI. CONCLUSIONS
In this paper, a method is proposed for efficient implementation of MCM operations using DSP blocks in FPGAs. The proposed method reduces the number of DSP blocks used for implementing a given MCM operation by manipulating the multiple constants used in this MCM operation. A high level synthesis tool implementing the proposed method is also proposed. The proposed tool takes the input variable bit length and multiple constants as inputs, and generates Verilog RTL code which efficiently implements this MCM operation using DSP blocks. The proposed method and tool reduce the number of DSP blocks used in the FPGA implementation of the HEVC 2D DCT algorithm by 35.8%.
