In reconfigurable platform, before convert and download program into real hardware, reliable estimation of speedup factor is of great importance for task schedulers. In this paper, a novel technique for speedup factor estimation is proposed. From the event patterns collected by hardware counters built in modern processors, a formula is given to estimate speedup factor of target process. Experiments on programs from SPEC2006 show that the speedup feature is able to be estimated at an acceptable cost.
Introduction
With the progress in Reconfigurable Computing technology, CPU+FPGA hybrid platforms become increasingly popular in both academic and industry world. Various architectures, algorithms and tools are proposed to accelerate software programs with FPGA accelerators [1] . Different speedup factors are reported, ranging from tens to hundreds, according to the application and program being transformed and the hardware platform they used.
It is widely accepted that transforming existing program into FPGA hardware code is time consuming. Before convert and download program into real hardware, reliable estimation of speedup factor will save much work by eliminating unattractive ones.
Speedup factor is defined as the execution time ratio between modified hybridplatform and traditional Von's platform. Published researches [2] [3] focus on afterward speed-up factor report. Little has been reported about speed-up factor estimation before implementing a hardware version of targeted algorithm.
Event counter is nearly a standard feature for modern processors and exist on most major processors today, such as Intel Pentium, Core, IA-64 [4] and AMD Opteron [5] . Different kinds of events can be recorded by event counters while processors are executing programs, including the clock cycles elapsed, instructions ever executed, cache misses, branch prediction failures, and so on [3] . Previous researches using event counters mainly focus on gathering performance statistics to evaluate the underlying hardware architectures or help programmers in code optimization and system management [6] [7] [8] [9] , leaving speed-up factor estimation as a blank section.
In this paper, a technique using event counters built in modern processors to estimate speedup factor is proposed. Per-formance event information of the program is collected at run time. By processing the records of the events, metrics of the event series are calculated and compared with the reference values to sort out the features of the monitored process.
The rest of the paper is organized as follows: Section 2 discusses the related work about determining speedup factor and the technique and applications about build-in event counters in modern CPUs. Section 3 gives a detailed design on speedup factors estimation by collecting event patterns online. Section 4 introduces the experiment including the test platform, performance counter tools, test programs and the results. Discussion and conclusion are given in section 5 with some hints to future work.
Speedup factor
Though there are number of reports about successful accelerating hardware / software co-designs [1] [3] , few have explored the potential about a hybrid-platform before implementing certain algorithm in hardware [2] . This vacuum may be explained by the following two reasons.
The main reason is wide variety of hybrid platforms and target program. Since reconfigurable computing is an emerging technique, it has a long way to become standardized. Nearly each team interested in this realm, has their own hardware layout, from integrating FPGA cells into general purpose CPU, to attach a FPGA card to the slow ISA slot. Parameters of reconfigurable resource is much more varied, including number of RCUs (Reconfigurable Units), build-in memory capacity and the data width of IO channel. The algorithm under scrutiny focuses on multimedia compression / decompression, data encryption / decryption, biology and wireless application. These application share one common characteristic -data flow is much heavier than instruction flow.
It is very common for published online scheduling research to assume that the conversion from software algorithm to hardware specification is finished beforehand. Some even assume the execution time can be deduced at the time when a hardware task is arriving. This assumption is acceptable as long as experimental platform is designed for special usage, such as data compression and or encryption [2] . But for a general purpose hybridplatform system, online converting and scheduling shall consider the conversion cost and the performance gain before take any substantial action. In a performancecritical system, online services cannot be stopped. The kernel loop must be identified, located, converted and finally deployed into FPGA-based co-processing components. In this environment, managing module will be able to pick the most promising one, so as to improve the overall performance, when the beneficial-cost rate of a certain task can be estimated with enough accuracy.
Speedup factor estimation

Hardware and software Support
We setup the experimental a system with Pentium 4 processor (family 15, model 2, stepping 4) running Fedora Core 5 Linux. The processor has a working frequency of 1.8 GHz, 512KB L1 cache. It contains 18 available performance data registers, 65 performance control registers, supporting 46 types of events to be counted.
The kernel of the Linux is of version 2.6.22.9 downloaded from kernel.org [10] and patched with perfmon performance counter driver [11] . On this platform, a background daemon is designed to collect events from certain running process. Performance events, such as CPU cycles, instructions completed and memory access can be simultaneously stored in hardware registers while the CPU processing its instruction flows. The monitoring daemon only need to periodically checkout the value and reset the counter. The monitoring and logging overhead is limited.
Dynamic behavior observation
Cycles spent on traditional general CPUs includes instruction code memory access, memory access on data, and algorithm on data processing. Time needed for a task run in FPGA hardware consists of hardware setting up, data transfer, and calculation on FPGA. Statistics show that data transfer codes have taken up to 80% [3] . When a program was translated into hardwire circuit on a FPGA chip, the performance gain comes from the following reasons.
Calculation steps on traditional CPU are integrated into one single step in FPGA. Temporary variables that cannot be handled by registers with in CPU core, can be handled by hardwire within FPGA, for its unlimited register number. When the hardware resource to implement algorithm is enough, less IPC, may yields more speedup factor.
Data transfers are handled in a pipeline mode in traditional CPU. FPGA hardware often suffers from slower interface to memory. So, those algorithms with heavy data transfer will gain less than those without too much data access.
A sample program have three types of kernel loop, namely non dependant loop, carry dependant loop and a mixed loop, which loop through an array randomly. In Figure 1 , the blurred dots are actually clusters of samples, which are collected periodically through the build-in hardware-based Performance Counters. It can be observed that non-dependant loop, represented by the lower cluster of sampled points, enjoys better performance, or instructions completed per CPU clock, because current CPU pipelining technique suite the kind of task better. Carry dependant loop suffers from the worst performance because the pipeline is more frequently stalled for it have to wait for the output of former iteration to start a new one. In the clock to memory view, carry dependant loop have lest memory access because its iteration need only a small area to store the variables and thus encountered much less memory access. The non-dependant loop, simply iterate through a big array, have more memory access than dependant loop. At the same time, sequential access enabled the processor to handle cache more efficiently than the random version, the mixed loop, which fires event more memory access than non-dependant loop.
Speedup factor estimation
Based on these observations, we give the following formula to estimate the speedup factor of a target program.
(1) In the formula above, F is the friendly level. Friendly Level can be defined as the expectable speedup factor. A, B make up a gradient in IPC-MPI space, and C set up a reference line. These parameters are simply estimated by experiments and may vary a little, since different hardware configurations may yield different speedup result.
Experiment
Test on programs from CPU SPEC 2006 benchmarks
Testing is carried on CPU SPEC 2006 benchmark programs [12] to further verify our estimation, and the results go promising as Table 1 show the event matrix from spec benchmarks. Estimation here is calculated with parameter set as A = 9, B=1 and C = 1, which is an empirical value set.
From the results in Table 1 , we can safely point out that, kernel loops within programs like astar and bzip have complex dependent map, not very suitable for existing pipelines and FPGA accelerators; while kernel loops within programs like bwaves are loop carry dependent, in which most computations are restrained in local variables, and are suitable for FPGA accelerators.
It can be seen that from the experimental data collected by performance counters, the character of the target program deduced is consistent to statistical analyzes. The potential to accelerate can be estimated. 
Discussion and conclusion
To estimate the speedup factor of running task on running system, we use performance events data collected by PMU to pinpoint the most used codes, to determine the feature of that sequence of codes, and to estimate the potential of target process to be accelerated by means of binary modification. Experiments show that the speedup feature can be estimated with the performance event counters at an ac-ceptable cost. From our experiment, the conclusion can be drawn that from performance pattern observed on a process, its feature can be safely estimated. More experiment on various type programs can make this conclusion more convincing and concrete. Currently we use overall dynamic behavior as the estimation base, but actually program have phases, i.e. stages that focus on deferent operations. Dynamic behavior analysis can be implemented so as to make more accurate estimation on phrases. Detailed restrains, such as hardware capacity for algorithm, memory interface bottleneck, are simply assumed to be irrelevant in this paper. Further research effort can focus on these limits and yield a more concrete conclusion.
This work is supported by NSFC No. 90607001 and the EPSRC EP/C544706/1, EP/C544692/1.
