This paper presents experimental measurements of the power consumption of the core logic of a 65-nm Cyclone III FPGA and compares them with the values predicted by the power estimation tool. The laboratory work is described, including the measurement setup, the benchmark circuits, and the CAD flows used to obtain the power estimations. The circuits selected as benchmarks were different types of multipliers implemented in LUTs and in embedded blocks, both with and without pipeline stages. Three types of results are presented: first, the error between power measurements and power estimations; second, the power savings obtained by using pipeline stages; and third, the quantification of the power savings obtained by using embedded blocks.
INTRODUCTION
Compared with ASICs (application-specific integrated circuits), FPGAs offer many advantages, including reprogrammability, tolerance to design errors, reduced nonrecurring engineering costs, and shorter time to market. This flexibility, however, comes at the cost of a significant amount of additional silicon area, mainly configuration SRAM, abundant routing tracks, and programmable switches. As a result, FPGAs are roughly 4 times slower, 35 times larger, and 14 times less power-efficient than ASICs [1].
Since FPGAs were first introduced in the mid-1980s, research on FPGA architecture and CAD tools has focused on improving area and speed. In recent years, however, with the growth of portable applications, power efficiency has become increasingly important. Power analysis tools have been integrated into commercial and academic CAD tools: Xilinx announced Xpower in December 2000 [2], Altera introduced PowerPlay in 2004 [3], and in 2002 a power model was integrated into the VPR CAD tool, which is commonly employed by the research community [4]. This paper describes a set of experiments designed to compare the results of the PowerPlay Power Analyzer with real measurements on a 65-nm Cyclone III device.
The remainder of the paper is organized as follows. Section 2 describes previous work comparing power measurements with power estimations. The experimental work is detailed in Section 3, including the measurement setup, the benchmark circuits, and the CAD flows used to obtain the power estimations. In Section 4, the results are presented and analyzed. Finally, the conclusions are given in Section 5.
RELATED WORKS
Several previous works performed on-board power measurements and compared them with low-level or high-level estimation tools. For example, [5] and [6] report early measurements on the Xilinx XC4000 families, made before the most widely used power analysis tools were developed. In [7], dynamic power consumption is analyzed in Virtex II devices, and the power distribution is broken down into three factors that contribute to total power dissipation: capacitance, resource utilization, and switching activity. Comparisons between power measurements and Xpower estimations for different dynamically reconfigured applications in Virtex devices are presented in [8], and a comparison between Xpower and PowerPlay is presented in [9].
A method for early estimation of FPGA dynamic power consumption was presented in [10]. When this methodology was applied to a Spartan-3 device, the error found was 18% with respect to the measured value.
The work in [11] reports differences between measured and estimated power consumption that ranged from 15.52% to 208% for the Xilinx devices (Virtex II-Pro and Spartan 3), and from 5.64% to 32.15% for the Altera device (Cyclone II).
An interesting approach based on a switched capacitor, which provides cycle-by-cycle energy measurements in FPGAs, is presented in [12]. The advantage of this method is that it makes it possible to determine the static and dynamic energy per cycle. The authors also report that Xpower considerably overestimates power compared with the measured values. Finally, a recent work presents a dynamic power estimation methodology for the embedded multipliers in Xilinx Virtex-II Pro chips [13], and [14] proposes a methodology for power measurements in FPGA devices.
This paper presents several measurements performed on an Altera Cyclone III device using multipliers with different word sizes and different numbers of pipeline stages, in design cases targeted either at embedded multiplier blocks or at an implementation entirely in LUTs. The measured values are compared with low-level, simulation-based estimations of power consumption.
EXPERIMENTAL WORK
As mentioned in the previous section, the main objective of this work is to analyze the errors of low-level power estimations. The estimations were made directly with the PowerPlay Power Analyzer.
The circuits selected as benchmarks were different types of multipliers. They were simulated to generate signal activity files, and this information was used to feed the PowerPlay Power Analyzer.
A fixed clock frequency of 50MHz was used in all cases.
All measurements were performed using a Terasic DE0 board, with an Altera Cyclone III 3C16 FPGA device.
Measurement setup
The DE0 board (Figure 1) is not specifically designed for power measurements. Some modifications were therefore necessary in order to measure the internal core power consumption (the I/O power is largely independent of the technology and was not measured). The on-board 1.2 V regulator was removed and replaced by a circuit that includes an external regulator and a series shunt resistor. A calibration procedure was then performed for the shunt resistor and the measurement probes.
The voltage across the shunt resistor was measured with a Tektronix TDS3052C oscilloscope and with a Fluke 45 multimeter. The RMS values of the waveforms recorded by the oscilloscope were compared with the voltages read by the multimeter, and the difference between the two instruments was less than 0.5% over a large number of cases. Consequently, in order to simplify the measurement procedure, we finally decided to use only the multimeter.
When each measurement was repeated several times for the same circuit, the maximum observed variation was ± 1 in the third significant digit. The relative error of the measurements is less than 1.5%.
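As an illustration of how a shunt reading can be turned into a core current, the following minimal Python sketch applies Ohm's law to the shunt voltage and multiplies by the 1.2 V rail voltage. The shunt value, the readings, and all function names are hypothetical and are not taken from the measurements reported here; the spread metric (half the peak-to-peak range over the mean) is only one possible way to express repeatability.

```python
from statistics import mean

def core_current_ma(v_shunt_mv, r_shunt_ohm):
    """Ohm's law: millivolts across the shunt divided by ohms gives milliamps."""
    return v_shunt_mv / r_shunt_ohm

def core_power_mw(current_ma, v_core=1.2):
    """Power drawn from the core rail in mW, assuming a constant 1.2 V rail."""
    return current_ma * v_core

def spread_percent(readings):
    """Half of the peak-to-peak spread relative to the mean, in percent."""
    return 100.0 * (max(readings) - min(readings)) / (2.0 * mean(readings))

# Hypothetical shunt value and repeated readings (mV); not data from the paper.
R_SHUNT_OHM = 1.0
readings_mv = [50.0, 50.1, 49.9]

current = core_current_ma(mean(readings_mv), R_SHUNT_OHM)
print(f"core current: {current:.2f} mA, core power: {core_power_mw(current):.2f} mW")
print(f"repeatability: +/-{spread_percent(readings_mv):.2f} %")
```

In practice the calibrated shunt resistance would replace `R_SHUNT_OHM`, and the readings would come from the multimeter.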
Benchmark circuits and power estimation
Three unsigned integer multipliers with different word sizes were used as benchmark circuits: 32x32, 54x54 and 64x64.
For each of these multipliers, several implementation alternatives were tested: implementation in LUTs without pipelining, implementation in LUTs with different numbers of pipeline stages, implementation with embedded multipliers without pipelining, and implementation with embedded multipliers with one to three pipeline stages.
The inputs to the multipliers were generated inside the FPGA by an auxiliary circuit, based on a linear feedback shift register, that was used in all the cases studied. The only external input to the FPGA was the 50 MHz clock signal.
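To make the idea of an LFSR-based stimulus generator concrete, the following Python sketch models one in software. The register width, the tap positions, and the way operands are assembled are assumptions: the text above does not specify them. The sketch uses a 16-bit Fibonacci LFSR with taps at bits 16, 14, 13 and 11, a commonly cited maximal-length configuration, and the helper names are hypothetical.

```python
from typing import Tuple

def lfsr16_step(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one clock."""
    # XOR of the tapped bits becomes the new most significant bit.
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def next_word(state: int, width: int) -> Tuple[int, int]:
    """Collect `width` serial output bits into one operand.

    Returns the operand and the LFSR state after `width` clocks.
    """
    word = 0
    for _ in range(width):
        word = (word << 1) | (state & 1)  # shift in the current output bit
        state = lfsr16_step(state)
    return word, state

# Hypothetical seed (any nonzero value works); two 32-bit operands are drawn.
state = 0xACE1
a, state = next_word(state, 32)
b, state = next_word(state, 32)
print(f"a = 0x{a:08X}, b = 0x{b:08X}, product = 0x{a * b:016X}")
```

A hardware generator would typically produce a new operand every clock cycle from a wider or parallel LFSR; the serial collection used here only keeps the model short.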
All the circuits were simulated, and the generated signal activity files were used to feed the PowerPlay Power Analyzer. This tool gives a detailed power estimation and the currents drawn from the different power supplies: VCCINT, VCCIO, VCCA and VCCD. The measurement setup reads the current from the 1.2 V power supply; that is, it includes the current drawn by both VCCINT and VCCD. VCCINT supplies the internal core logic, and VCCD supplies the digital circuitry in the PLL.
RESULTS
Three types of results are analyzed: first, the comparison between current measurements and current estimations; second, the power reduction obtained with pipelining; and third, the power trade-off between intensive LUT utilization and embedded-block implementations. Table 1, Table 2 and Table 3 show the measured core current, the estimated current, and the error for the 32x32, 54x54 and 64x64 multipliers, respectively. The worst-case error is 13.0%. The PowerPlay tool overestimates the power in some cases and underestimates it in others. According to the Altera documentation, PowerPlay usually provides ± 10 percent accuracy when used with accurate design information [15].
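The comparison between estimated and measured current can be sketched as follows. This is a minimal illustration under stated assumptions: the estimated 1.2 V rail current is taken as the sum of the VCCINT and VCCD estimates, and the error is expressed relative to the measured value, with a positive sign meaning overestimation. The function names and the numeric values are hypothetical and do not reproduce the table entries.

```python
def rail_estimate_ma(i_vccint_ma, i_vccd_ma):
    """Estimated current on the 1.2 V rail: VCCINT plus VCCD (assumption)."""
    return i_vccint_ma + i_vccd_ma

def error_percent(estimated_ma, measured_ma):
    """Signed error relative to the measurement; positive means overestimation."""
    return 100.0 * (estimated_ma - measured_ma) / measured_ma

def within_accuracy(err_percent, tolerance_percent=10.0):
    """Check the error against the +/-10 percent accuracy quoted for the tool."""
    return abs(err_percent) <= tolerance_percent

# Hypothetical values in mA; not taken from Tables 1 to 3.
estimated = rail_estimate_ma(i_vccint_ma=90.0, i_vccd_ma=5.0)
measured = 100.0
err = error_percent(estimated, measured)
print(f"error: {err:+.1f} %, within +/-10 %: {within_accuracy(err)}")
```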
The same tables confirm the well-known result that adding pipeline stages reduces power consumption. In order to compare only the multiplier blocks, it is necessary to subtract the current consumed by the auxiliary circuit (12.11 mA). The power reduction is very noticeable in the LUT implementations, but much less evident in the implementations with embedded blocks. Figure 2 shows that the minimum power is achieved with 2 pipeline stages for the 32x32 multiplier, 5 stages for the 54x54 multiplier, and 3 stages for the 64x64 multiplier.
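The subtraction of the auxiliary-circuit current and the search for the best pipeline depth can be sketched as below. Only the 12.11 mA auxiliary current comes from the text; the measured currents per pipeline depth and the function names are hypothetical placeholders used to show the arithmetic.

```python
AUX_CURRENT_MA = 12.11  # auxiliary stimulus circuit current, from the text

def multiplier_current_ma(measured_ma):
    """Net current attributed to the multiplier alone."""
    return measured_ma - AUX_CURRENT_MA

def best_stage_count(measured_by_stages):
    """Pipeline depth with the lowest net multiplier current."""
    return min(measured_by_stages,
               key=lambda stages: multiplier_current_ma(measured_by_stages[stages]))

def reduction_percent(baseline_ma, pipelined_ma):
    """Net current reduction of a pipelined design versus the unpipelined one."""
    base = multiplier_current_ma(baseline_ma)
    piped = multiplier_current_ma(pipelined_ma)
    return 100.0 * (base - piped) / base

# Hypothetical measured rail currents (mA), keyed by number of pipeline stages.
measured = {0: 112.11, 1: 62.11, 2: 52.11, 3: 57.11}
best = best_stage_count(measured)
print(f"best depth: {best} stages, "
      f"reduction: {reduction_percent(measured[0], measured[best]):.1f} %")
```

Because the auxiliary current is a fixed offset, it does not change which depth is best, but it does change the size of the relative reduction.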
The last result of interest is the comparison between the power consumption of multipliers implemented in LUTs and that of multipliers implemented in embedded blocks. Embedded blocks are expected to consume less power, but it is of interest to quantify the difference. The ratio varies with the word size: the embedded implementation consumes 8.7 times less for the 32x32 multiplier, 7 times less for the 54x54 multiplier, and 5.6 times less for the 64x64 multiplier.
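A ratio of this kind can be computed as in the following sketch. It assumes that the auxiliary-circuit current is subtracted from both measurements before dividing, which the text does not state explicitly for this comparison. Since both designs are fed from the same 1.2 V rail, the ratio of net currents equals the ratio of net powers. The input values and the function name are hypothetical.

```python
def lut_to_embedded_ratio(lut_measured_ma, embedded_measured_ma, aux_ma=12.11):
    """Ratio of net LUT current to net embedded-block current.

    Both currents are drawn from the same rail, so this is also the power ratio.
    """
    return (lut_measured_ma - aux_ma) / (embedded_measured_ma - aux_ma)

# Hypothetical measured rail currents in mA; not taken from the tables.
ratio = lut_to_embedded_ratio(lut_measured_ma=99.11, embedded_measured_ma=22.11)
print(f"the LUT implementation draws {ratio:.1f} times the embedded-block current")
```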
CONCLUSION
This paper presents a comparison between low-level, simulation-based power estimations obtained with PowerPlay and real measurements. The maximum error between the estimated and measured values is 13.0%. The results also confirm that placing pipeline stages in large combinational circuits reduces power consumption. Finally, the power savings obtained by using embedded multiplier blocks instead of LUT implementations have been quantified as a factor of 5.6 to 8.7.
Future work will extend the benchmark suite to study dispersion in power consumption.
