Abstract-To aid the hardware/software partitioning of reconfigurable computing systems, fast and accurate FPGA-based delay estimation must be carried out before partitioning. Most previous works predict the delay with a high-level delay estimation based on empirical formulae. In such methods, the empirical formulae are usually obtained by a regression analysis on the real values reported by the FPGA synthesis and place-and-route tools. When the tool properties change or a different FPGA device is targeted, the empirical formulae must be re-derived. This is time-consuming because the synthesis and place-and-route tasks have to be run repeatedly, which makes the estimation slow and often pushes it beyond the tolerable estimation time. To address this problem, we present an improved high-level delay-estimation method in this article. We derive theoretical formulae, called increasing formulae, for HLL (High-Level Language) operations from the basic principles of hardware circuit design. These increasing formulae fit most FPGAs. Building on the proposed formulae, the paper also proposes a rapid estimation algorithm. The algorithm obtains the hardware delay of different hardware versions and thus greatly reduces the number of runs of the time-consuming tasks. Experimental results show that our method achieves an error within 2.69% of the real values for the Virtex-5 FPGA.
I. INTRODUCTION
One of the consequences of the increase in circuit densities over the past decade is the emerging viability of FPGA-based reconfigurable computing systems (RCSs), which can achieve speed-ups of several orders of magnitude [1, 2] over conventional processors. A general RCS often consists of a co-processor and one or more hardware accelerators. The software running on the co-processor is more flexible and much easier to develop. Hardware accelerators, however, provide better performance, although they are expensive in terms of cost and development time. In the high-level-synthesis design method for RCSs, hardware/software partitioning divides the HLL program into a software part and a hardware part, which are executed on the co-processor and the accelerators, respectively. In order to achieve a performance-area optimal RCS, efficient hardware/software partitioning strategies [3] must be incorporated in the emerging design flows of RCSs.
The metrics according to which designers decide which part ought to be mapped onto accelerators, especially the hardware delay of the HLL programs, should be determined before the hardware/software partitioning. At present, accurate hardware delays are obtained by translating the entire HLL program to a hardware description and then running synthesis and place-and-route [4]. However, this FPGA design flow is time-consuming and may take hours or days for all possible partition options. Obtaining the hardware delays before hardware/software partitioning therefore forms the main bottleneck in the RCS design process based on HLL programs.
To solve this problem, researchers proposed high-level delay estimation based on empirical formulae. However, it is time-consuming because the synthesis and place-and-route tasks have to be run repeatedly, which makes the estimation slow and often pushes it beyond the tolerable estimation time. An improved high-level delay-estimation method is presented in this article. We derive theoretical formulae, called increasing formulae, for HLL operations from the basic principles of hardware circuit design. These increasing formulae fit most FPGAs. The implementation of loops on the FPGA has multi-version features, and different versions have different hardware delays. This paper therefore also proposes a fast, general estimation algorithm. Building on the proposed formulae, the algorithm obtains the hardware delay of different hardware versions.
The outline of the paper is as follows: Section 2 gives an overview of the related work on high-level delay estimation. This is followed by the proposed increasing formulae in Section 3. The proposed estimation algorithm is described in Section 4. The experiments and results are presented in Section 5, and the conclusion is in Section 6.
II. RELATED WORKS
Hardware delay estimation has received considerable interest in the research community for many years. An excellent survey of hardware delay estimation techniques is presented in [5]. This paper details the related work on high-level delay estimation for HLL programs and groups similar estimation methods into categories. There are about four categories: (1) the synthesis-like method [19], (2) the method based on neural networks [20], (3) the method based on correlation [21], and (4) the method based on a weighted sum [6] [7] [8] [9] [10] [11] [12]. The time complexity of category (4) is lower, and this method is suitable for hardware/software partitioning. In such a method, each operational unit of the HLL (e.g. an arithmetic operation or a data flow graph node) is assigned an abstract-weight that gives the delay of that operational unit. By combining all abstract-weights, the total delay of the hardware part is obtained.
[6] [7] [8] regard the sum of logic delays and routing (interconnection) delays as the abstract-weight of an operation. The logic delays come from the device description, and the routing delays are obtained from the proposed planning and routing algorithm. [6] performed a floor-planning process and [7] performed a manual planning algorithm. However, accurately estimating the routing delay is very difficult because the routing process depends on the synthesis tool and on the algorithms used in the planning and routing procedures, which are largely difficult to model [9]. [10] [11] [12] combined logic and routing delays in a single model and used the empirical formulae of this single delay model, namely a function of the bit-width, as the abstract-weight of operations. In these works, the empirical formulae were usually obtained by a regression analysis on the real values reported by the FPGA synthesis and place-and-route tools. The shortcoming is that the hardware delay depends on the tool properties in effect when the operation is mapped to a hardware circuit. For example, the project properties of the ISE tool from Xilinx include many device properties, such as package and speed; modifying any of these properties changes the corresponding hardware delay. Besides tool-chain properties, process properties also influence the size of the hardware delay. When the tool properties change or a different FPGA device is targeted, the empirical formulae must be re-derived. This is time-consuming because the synthesis and place-and-route tasks have to be run repeatedly, which makes the estimation slow and often pushes it beyond the tolerable estimation time.
In addition, a high-level estimation method based on a weighted sum needs to consider the influence of hardware-oriented compiler optimization technology. For example, for a loop structure, the loop unrolling optimization improves the loop parallelism, but it also consumes more hardware resources. Besides loop unrolling, loop pipelining and the systolic array both have a great influence on hardware delay. Therefore, a complete high-level estimation method needs to estimate the hardware delay of the different hardware versions generated by different hardware-oriented compiler optimization technologies. To the authors' knowledge, the estimation method in [22] is the only work that considers the effect of loop unrolling on hardware results. However, common compiler optimizations such as pipelining have not been covered by any existing method for deciding hardware partition parameters.
III. THE PROPOSED INCREASING FORMULAE
The hardware delay of the HLL operations is obtained by transforming the HLL to HDL (Hardware Description Language) and then mapping the HDL to the FPGA with the synthesis and place-and-route tools. The hardware delay changes with the properties of the tools and with the FPGA device. In different FPGAs, the delay and the number of basic elements differ. However, the basic idea of the hardware design does not change, that is, the simplest logical expression of each operation stays the same. Following this idea, the paper defines an increasing formula for each HLL operation, which describes how the delay grows with the bit-width of the operation. The results of synthesis alone are very close to the results reported after synthesis and place-and-route, while synthesis obviates the very long place-and-route step. Therefore, a series of synthesis experiments is used to revise the increasing formulae.
The following steps are involved in building the increasing formulae:
(1) Derive an increasing formula for each operation from the basic idea of the hardware circuit design.
(2) Vary the bit-width of the operation and create the corresponding VHDL instances.
(3) Synthesize the VHDL files and record the real values reported by the synthesis tool.
(4) Do a curve fitting analysis on the real values to generate a closed formula.
(5) Repeat steps (3-4) on different FPGA devices, including Spartan-3E, Virtex-4, Virtex-5, Virtex-7 and Kintex-7 FPGAs.
(6) Revise the proposed increasing formulae according to the generated closed formulae.
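The curve-fitting step (4) can be sketched for the linear case as follows. This is only a minimal illustration in Python: the function name `fit_linear` is an assumption, and the sample delays are generated from a known line as placeholders; they are not values reported by any synthesis tool.

```python
def fit_linear(bit_widths, delays):
    """Ordinary least-squares fit of delay = k * b + c.

    bit_widths and delays are equally long sequences of samples,
    e.g. one (bit-width, reported delay) pair per synthesized instance.
    """
    n = len(bit_widths)
    if n < 2 or n != len(delays):
        raise ValueError("need at least two (bit-width, delay) pairs")
    mean_b = sum(bit_widths) / n
    mean_t = sum(delays) / n
    sxx = sum((b - mean_b) ** 2 for b in bit_widths)
    if sxx == 0:
        raise ValueError("bit-widths must not all be equal")
    sxy = sum((b - mean_b) * (t - mean_t)
              for b, t in zip(bit_widths, delays))
    k = sxy / sxx              # slope: delay added per extra bit
    c = mean_t - k * mean_b    # intercept: width-independent part
    return k, c


# Placeholder samples generated from t = 0.05 * b + 1.0 (NOT measured data).
widths = [4, 8, 16, 32]
samples = [0.05 * b + 1.0 for b in widths]
k, c = fit_linear(widths, samples)
```

In the flow described above, the fitted coefficients of each device would then be compared with the derived increasing formula in step (6).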
In this paper, the increasing formulae are categorized as follows. The function t(x) gives the estimated delay and is generally a function of the bit-width (b). When the inputs of an operation have different bit-widths, b is the smaller one. The input operands of the operations considered in this paper are variables.
• Constant Function: t(x) = C

In the hardware circuits of these operations, each path from any input to any output consumes a fixed amount of resource on the FPGA, so the delay is a constant denoted by C. Examples of such operations include the logical operations BIT-AND, BIT-OR and BIT-XOR. A logical operation applied to two operands A and B provides a result Z that depends on the bit pattern of the operands. The formal description is given in (1):

$$Z_i = A_i \circ B_i, \quad \circ \in \{\mathrm{AND}, \mathrm{OR}, \mathrm{XOR}\} \tag{1}$$

Each bit of the result is implemented by a LUT (Look-Up Table) on FPGA devices, and all bits execute in parallel. The delay of a logical operation is the delay of the critical path of its corresponding hardware circuit. The critical path contains only one LUT, so the delay can be computed as (2):

$$t(x) = T_{LUT} \tag{2}$$
The LUT delay is 0.195 ns in the Virtex-4 (XC4VFX20) FPGA and 0.086 ns in the Virtex-5 (XC5VFX100T) FPGA when all properties of the synthesis tool are left at their defaults. The net delay between different operations is also constant.
• Linear Function: The hardware circuit of the addition is usually composed of a cascade of full-adders. A full-adder [16] adds binary numbers and accounts for values carried in as well as out. The full-adder implementation can be denoted by (3), (4) and (5):

$$HE_i = A_i \oplus B_i \tag{3}$$

$$C_i = A_i B_i + HE_i \, C_{i-1} \tag{4}$$

$$E_i = HE_i \oplus C_{i-1} \tag{5}$$

where A_i and B_i are the operand bits, and C_{i-1} is the bit carried in from the next less significant stage. i is the bit number, which ranges from 1 to n.
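The behaviour of a cascade of full-adders can be illustrated with a small bit-level simulation. The sketch below is written in Python and assumes the standard full-adder decomposition into a half sum, a carry select and a final sum; the names `ripple_carry_add`, `he_i` and `e_i` are illustrative, and the code indexes bits from 0 rather than from 1.

```python
def ripple_carry_add(a, b, n):
    """Add two n-bit unsigned integers one full-adder stage at a time.

    Returns (sum modulo 2**n, carry out, number of carry stages traversed).
    """
    carry = 0      # carry into the least significant stage
    result = 0
    stages = 0
    for i in range(n):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        he_i = a_i ^ b_i                  # half sum (a LUT on the FPGA)
        e_i = he_i ^ carry                # final sum (XORCY-style XOR)
        # Carry select (MUXCY-style): propagate the incoming carry when the
        # half sum is 1, otherwise generate a_i (which then equals b_i).
        carry = carry if he_i else a_i
        result |= e_i << i
        stages += 1                       # each bit adds one carry stage
    return result, carry, stages


total, carry_out, stages = ripple_carry_add(0b1011, 0b0110, 4)
```

Each stage must wait for the carry of the stage before it, so the number of carry stages on the longest path equals the bit-width n, which is the reason the delay is modelled as growing linearly with the bit-width.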
In the FPGA, the sum part (3), the carry part (4) and the final sum part (5) are usually implemented by LUT, MUXCY and XORCY blocks, respectively. The operands of HE_i are only the i-th bits of the inputs and are not related to the other bits, so the HE_i of all bits are computed in parallel. The E_i of each bit is parallel too. The carry C_{i-1}, however, needs to wait for the carry bit C_{i-2} to be calculated by the previous full-adder, so the carry part forms a ripple-carry chain. In this light, the longest path of the adder is the ripple-carry chain. When the bit-width of the addition increases, the carry chain elongates. So the delay of the addition is a linear function of the bit-width and can be written as (6):

$$t(x) = k \cdot b + C \tag{6}$$

where k and C are device-dependent coefficients.

On modern FPGAs, the multiplication can be constructed from two kinds of components: full-adders, and dedicated multiplier and/or DSP blocks. If the bit-width is less than 8, the multiplication is implemented only with programmable building blocks; otherwise the multiplication can be implemented in either way. The number of DSP blocks is limited. When the DSP blocks are used up, the multiplications are built solely from the full-adders of the FPGA.
For the implementation with full-adders, the implementing structure is an array multiplier composed of a large array of full-adders. In terms of adders, the longest path of an n-bit array multiplier includes n - 1 CSAs (Carry Save Adders) and n RCAs (Ripple Carry Adders) [14]. The delay of the longest path grows linearly with the bit-width n. However, as described above, the adder is built from LUT, MUX, XOR and similar elements, and the delay of each element differs when its inputs differ [10]. So the critical path of the array multiplier may not be the longest path. Figure 2 shows the delay when the bit-width is varied from 2 to 32 in the Xilinx FPGAs. The trend is generally linear growth, but the delay changes little within each small interval. For all FPGAs in Figure 2, the intervals are [2, 7], [8, 15] and [16, 31].
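The interval structure described for the array multiplier suggests a piecewise lookup. The following Python sketch assumes one (slope, intercept) pair per interval; the helper names and the coefficient values are hypothetical placeholders, not fitted results, and the same helper could be given any other list of intervals.

```python
import bisect

# Bit-width intervals reported for the array multiplier (inclusive bounds).
ARRAY_MULT_INTERVALS = [(2, 7), (8, 15), (16, 31)]


def interval_index(bit_width, intervals):
    """Return the index of the interval that contains bit_width."""
    starts = [lo for lo, _ in intervals]
    idx = bisect.bisect_right(starts, bit_width) - 1
    if idx < 0 or bit_width > intervals[idx][1]:
        raise ValueError(f"bit-width {bit_width} is outside all intervals")
    return idx


def piecewise_delay(bit_width, intervals, coeffs):
    """Evaluate delay = k * b + c with the (k, c) pair of the matching interval.

    coeffs[i] holds the hypothetical (k, c) coefficients of intervals[i].
    """
    k, c = coeffs[interval_index(bit_width, intervals)]
    return k * bit_width + c


# Placeholder coefficients: a flat delay inside each interval.
demo_coeffs = [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]
demo_delay = piecewise_delay(8, ARRAY_MULT_INTERVALS, demo_coeffs)
```

Setting a slope to zero models a delay that stays flat inside an interval, while a non-zero slope allows a small drift within it.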
The DSP blocks are dedicated multipliers, and their input bit-width differs between FPGAs. The input is 18x18 in Virtex-4 FPGAs and 25x18 in Virtex-5, Virtex-7 and Kintex-7 FPGAs. Figure 3 shows the delay of the multiplication when the bit-width is varied from 2 to 32. All curves are piecewise functions. The bit-width pieces are [2, 18] and [19, 31] for the Virtex-4 FPGAs, and [2, 18], [19, 25] and [26, 31] for the other FPGAs.

In summary, the delay of HLL operations is a constant or a linear function of the bit-width (b). The parameters of the formulae can be obtained from the feedback information of our ASCRA compiler [16]. Figure 4 shows an overview of the feedback framework. The starting point is the functional description given as a C-language algorithm, because the C language is extensively used in various systems and most researchers are familiar with it.
The LLVM-GCC is used as the front-end compiler that transforms the C code to the intermediate representation (IR) of the LLVM compiler infrastructure [15]. The software information, such as the execution time and the communication, and the hardware information are estimated and then passed to the hardware/software partitioning as inputs. The framework also incorporates a process that automatically performs the hardware/software partitioning, generates RTL (Register Transfer Level) code for one hardware implementation, and finally implements it on the target device. We then obtain the real delay information of the hardware-part IRs, which is fed back to the delay estimation algorithm to adjust the formulae of the operations.
IV. THE PROPOSED ESTIMATION ALGORITHM
In view of the multi-version features of the hardware, our algorithm uses the weighted-sum estimation method: it analyzes each version and obtains the number of operations included in it. Then, using the estimation formulae as the "weights" and combining them by "summation", the proposed algorithm estimates the hardware delay of multiple hardware versions. The description of the estimation algorithm is divided into the interface design and the algorithm description.
A. The Interface Design
At present, loop pipelining, loop unrolling EP_i and the systolic array are the common hardware compile optimization technologies, and the corresponding generated hardware versions are called the pipeline version, the unrolling EP_i version and the systolic array version. EP_i indicates the number of times the loop is unrolled. With different EP_i, the hardware delay of the corresponding hardware version is different. In addition, different compiler optimization technologies are not mutually exclusive: a hardware version can be generated by using multiple compiler optimization technologies. For example, the pipeline unrolling EP_i version is generated by loop pipelining and loop unrolling EP_i. There are sequenced, unrolling EP_i, pipeline unrolling EP_i and systolic array versions in our project ASCRA [16]. Among these versions, the sequenced version is the hardware version that uses no optimization technology. Combining the compiling technology and the target architecture, the automatically generated RTL description files of the hardware versions consist of an operation module, a memory module, a control module and an address generation module. By analyzing the automatic mapping process of a hardware version, the estimation method obtains the operation information of each version, which provides the input to the estimation algorithm.

In order to use a unified estimation algorithm to estimate the hardware delay of different hardware versions, this paper uses operations as the unit and designs a general algorithm interface, combined with the analysis of the information of the hardware version. The algorithm interface takes the form of a two-tuple <Mtype, BIs>. Mtype stands for the version type (pipeline, unrolling, sequenced, etc.), and BIs is the set of operation blocks, each of which can be expressed as a two-tuple <Insts, Nclk>. Nclk stands for the number of clock cycles. Each operation block needs one clock cycle to complete its computation. Insts is the instruction set. Every instruction can be expressed as a two-tuple <OPCode, BitW>, where OPCode and BitW indicate the arithmetic type and the bit-width of the operation, respectively.
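The nested two-tuples of the interface map naturally onto record types. The Python sketch below is one possible encoding, not the ASCRA data structures themselves: the class names, the field names and the example version are assumptions, and Nclk is interpreted here as the number of clock cycles spent in a block.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instruction:
    """<OPCode, BitW>: arithmetic type and bit-width of one operation."""
    opcode: str
    bit_width: int


@dataclass
class OperationBlock:
    """<Insts, Nclk>: the instructions of a block and its clock-cycle count."""
    insts: List[Instruction]
    n_clk: int = 1


@dataclass
class HardwareVersion:
    """<Mtype, BIs>: the version type and its set of operation blocks."""
    mtype: str
    blocks: List[OperationBlock]

    def total_cycles(self) -> int:
        # Assumption: the cycle count of a version is the sum over its blocks.
        return sum(block.n_clk for block in self.blocks)


# Hypothetical sequenced version with two operation blocks.
version = HardwareVersion(
    mtype="sequenced",
    blocks=[
        OperationBlock([Instruction("mul", 16), Instruction("add", 32)]),
        OperationBlock([Instruction("and", 8)]),
    ],
)
```

Keeping the instruction record immutable lets the same <OPCode, BitW> pair be shared between blocks or used as a dictionary key when looking up its estimation formula.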
B. Algorithm Description
Hardware delay is determined by the clock cycle and the number of clock cycles. Since the number of clock cycles can be obtained accurately from the analysis of the hardware version, the paper mainly estimates the clock cycle. The clock cycle is the hardware delay of the critical path between registers in the hardware circuit. The estimation algorithm therefore first establishes a data flow graph (DFG) for each operation set BIs; then, taking into account the relationship between the two ways of constructing multiplication, it traverses all paths in the DFG and calculates the hardware delay of each path; finally, it sorts all paths by hardware delay, and the clock cycle is the delay of the execution path with the longest delay. The process of the hardware delay estimation algorithm is described below, and Figure 6 is the pseudo code for the algorithm:

1) Establish a DFG for each operation set

The DFG reflects the data dependencies between operations and depicts the calculation as information flows and data moves from the input ports to the output ports. It can be regarded as a graph extracted from the hardware circuit without timing information, and it is a directed acyclic graph with vertices and edges. In this section, the vertices represent the operations and the low-level control (the operations supporting condition calculations), and the edges represent the data paths showing the flow of data streams. Establishing the DFG for each operation block set consists of two steps: first, each instruction set Insts in the operation block set BIs is traversed in order to find all the input data; then the operations in the instruction set Insts are traversed again and, according to the data flow, the data paths between the instructions are established. Taking the program in Figure 5 as an example, its corresponding DFG is shown in Figure 6.

2) Compute the delay of each path in the DFG

In the DFG, there are many paths from the input data to the output data, and each path contains a plurality of operations. The hardware delay of an operation can be obtained from its estimation formula. A net delay also exists between two operations, and all the net delays are treated as the same in this paper. Moreover, the critical path includes not only the operation delays and the net delays between operations, but also the register delay. So the hardware delay formula of a path is shown in (7):

$$T_{path} = T_{ff} + \sum_{i=1}^{n} T_{op_i} + \sum_{i} T_{net_i} \tag{7}$$

where n represents the number of operations on the path, T_ff is the register delay, T_opi is the delay of the i-th operation (1 <= i <= n), and T_neti is the net delay between registers and operations or between two operations.

In addition, there are two methods to build the multiplication: the array multiplier and the dedicated multipliers. Different implementations lead to different hardware delays, and the following rules hold between the implementations:

i) When the synthesis and place-and-route tools construct a multiplication, they give priority to the dedicated multipliers by default. When no dedicated multipliers are left, array multipliers are used instead; that is, when the DSP blocks are used up, LUT resources are consumed to construct the multiplication.

ii) When there are many multiplications of different bit-widths, the multiplications with the higher bit-widths are constructed with dedicated multipliers first by default.

iii) Extensive experiments show that not all DSP blocks can be used to construct multiplications: one quarter of the DSP resources are used to construct the connections between the DSP blocks.

3) Calculate the clock cycle and the hardware delay

All the path delays in the operation set are calculated and sorted in descending order. The maximum value is the clock cycle T_clk. The hardware delay is determined by two parts, the clock cycle and the number of cycles, as shown in (8):

$$T_{all} = T_{clk} \times N_{clk} \tag{8}$$

where T_all is the hardware delay, T_clk is the clock cycle, and N_clk is the number of cycles.
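The three steps can be sketched in Python as follows. This is a simplified illustration rather than the pseudo code of the paper: the DFG is assumed to be given as a successor dictionary, every net segment is given the same delay, a path with n operations is assumed to contain n + 1 net segments, each multiplication is assumed to occupy one DSP block, and all names and numeric values are hypothetical.

```python
def path_delay(op_delays, t_net, t_ff):
    """Path delay: register delay + operation delays + net delays.

    Assumption: n operations are joined by n + 1 identical net segments
    (register to first operation, between operations, last one to register).
    """
    return t_ff + sum(op_delays) + (len(op_delays) + 1) * t_net


def clock_cycle(dfg, op_delay, t_net, t_ff):
    """Longest path delay over all source-to-sink paths of an acyclic DFG.

    dfg maps each node to the list of its successors; op_delay maps each
    node to the delay given by its estimation formula.
    """
    has_pred = {s for succs in dfg.values() for s in succs}
    sources = [node for node in dfg if node not in has_pred]
    best = 0.0

    def walk(node, delays):
        nonlocal best
        delays = delays + [op_delay[node]]
        succs = dfg.get(node, [])
        if not succs:                      # sink reached: one complete path
            best = max(best, path_delay(delays, t_net, t_ff))
        for succ in succs:
            walk(succ, delays)

    for src in sources:
        walk(src, [])
    return best


def hardware_delay(t_clk, n_clk):
    """Total hardware delay: clock cycle times number of cycles."""
    return t_clk * n_clk


def assign_multipliers(bit_widths, n_dsp):
    """Apply rules i)-iii): widest multiplications get DSP blocks first.

    Only three quarters of the DSP blocks are treated as usable, and
    multiplications narrower than 8 bits always use LUT resources.
    """
    usable = (3 * n_dsp) // 4
    kinds = ["lut"] * len(bit_widths)
    order = sorted(range(len(bit_widths)),
                   key=lambda i: bit_widths[i], reverse=True)
    for i in order:
        if usable == 0 or bit_widths[i] < 8:
            break                          # the remaining ones stay on LUTs
        kinds[i] = "dsp"
        usable -= 1
    return kinds


# Hypothetical DFG: mul and and both feed add; delays are placeholders.
dfg = {"mul": ["add"], "and": ["add"], "add": []}
op_delay = {"mul": 4.0, "and": 1.0, "add": 2.0}
t_clk = clock_cycle(dfg, op_delay, t_net=0.5, t_ff=0.25)
t_all = hardware_delay(t_clk, n_clk=10)
```

Enumerating every path is exponential in the worst case; a longest-path pass in topological order would give the same clock cycle in linear time, but the exhaustive form mirrors the traverse-then-sort description above more directly.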
V. EXPERIMENTS AND RESULTS
We present two sets of experiments in this paper. In the first set, we apply our estimation to the single basic operations supported in the C language; the aim of this experiment is to determine the accuracy of the proposed increasing formulae. The second set addresses larger benchmarks, for which we discuss the impact of hardware compiler optimizations on the delay. All benchmarks used in the experiments are compiled for the Xilinx Virtex-5 (XC5VLX110T) FPGA [17]. In this FPGA, the DSP block is called DSP48E and there are 64 of them, the number of LUTs is 69120, and each LUT has 6 inputs.
A. Single Basic Operations
The basic operations in Table 1 are representative of the simple operations that can be used to form more complex applications. In this experiment, we wrote the primitives in the C language, compiled them as individual programs with the LLVM-based ASCRA compiler [16] to generate the corresponding VHDL, and then used the ISE tool 10.1 to get the actual delay. We then used our delay estimator to estimate the delay. Table 1 shows the comparison between the delay estimation results and the synthesis results of the basic operations. The estimation error ranges from 1.2% to 5.09%, and the average is 3.27%. The average execution time of the estimator is only 1 millisecond, while obtaining the actual value takes about 3 seconds.
B. Larger Benchmark
This set of experiments demonstrates the performance of the estimator on more complicated algorithms, which are written using the basic operations. The following benchmarks were used in the experiment: Intra Qualification (IntraQ), the seventh algorithm in the Livermore Fortran kernels (Kernel-7), 5-tap-FIR and Demodulation (Demod). Each of these benchmarks includes one loop structure, which gives a greater opportunity to perform parallel optimizations, since the loop bodies are restructured by the optimizations and transformed into different hardware versions [18]. In this set, the delay estimation of such larger applications covers two kinds: the delay of the sequenced hardware version and the delay of the pipeline version. Table 2 shows the results of the estimation tool versus synthesis. The estimation error ranges from 0.74% to 3.93%, and the average is 2.1%. The sequenced version is obtained by directly mapping the basic blocks of the loop structure to the processes of the VHDL code. This kind of hardware version is the most time-consuming of all the hardware versions of the loop structures. The synthesis tool takes 72 seconds on average to do the synthesis, whereas our estimator takes only 1 millisecond to do the estimation. For a partition that contains many loop structures, the synthesis task may take hours or even days.
The pipeline version is a high-level parallel computing model. The pipeline compiling technique partitions the pipeline stages according to the delay of the basic operations. On the premise of not changing the maximum frequency of the hardware, the pipeline stages with small delays are compacted. Among the basic operations, the most time-consuming is the multiplication, whose delay is double that of the other operations at the same bit-width. So in Table 2, the delay of most of the applications is the same and equal to the delay of the multiplication.
VI. CONCLUSIONS
In this paper we have presented an improved high-level delay-estimation method for FPGA-based RCSs. The estimation method is mainly developed to aid hardware/software partitioning. We have demonstrated our estimation technique on a variety of both simple and more complex benchmarks. Experimental results show that our technique achieves an average error of around 3.27% for simple basic operations and 2.1% for larger benchmarks. The worst-case error we have observed is 5.09%. The time required for our estimates is no more than seconds, as compared to hours for a synthesis tool. Even though the estimation is specific to the Xilinx Virtex-5 (XC5VLX110T) FPGA, it can easily be modified to work for a variety of other FPGAs.
