Abstract-The ability to scale power and performance at run-time enables the creation of computing systems in which energy is consumed in proportion to the work to be done and the time available to do it.
Introduction
Energy and power efficiency in FPGAs has been estimated to be up to one order of magnitude worse than in ASICs [1], which limits their applicability in energy-constrained applications. According to device vendors, recent 28 nm FPGAs consume 50% less power than previous generations [2], which helps to close this power gap. Additional power savings are possible if FPGAs can make use of techniques such as Adaptive Voltage Scaling (AVS), which significantly reduces dynamic and static power by adjusting voltage and frequency at run-time in a closed-loop configuration. AVS is a popular power-saving technique that enables a device to regulate its own voltage and frequency based on workload, fabrication and operating conditions, and it compares favourably with open-loop DVFS (Dynamic Voltage and Frequency Scaling).
Our previous work [3] presented a novel design flow and IP library that enable the integration of closed-loop, variation-aware adaptive voltage scaling in commercial FPGAs. This approach adapts the operating point over a wide range of voltage and frequency levels at run-time, responding automatically to temperature, process and workload changes. The results were obtained on a 65 nm Virtex-5 device and reveal that, although the device has not been validated by the manufacturer at operating points below nominal voltage, savings approaching one order of magnitude are possible by exploiting the margins available in the chip. For this adaptive voltage scaling system to be beneficial, there must be performance and voltage margins in the device that can be exploited. In this paper we investigate the presence of these margins in state-of-the-art FPGA devices manufactured in a 28 nm process, keeping the other aspects of the system as described in our previous work [3]. The contributions of this work can be summarised as follows:
1. We introduce a low-overhead IP core that controls the system voltages using the PMBus (Power Management Bus) standard and which can be employed in an adaptive voltage scaling system.
2. To the best of our knowledge, this is the first work that investigates the run-time power and performance scaling capabilities of 28 nm FPGAs and shows their benefits.
This work could be applied to high-performance computing systems based on FPGAs that do not require, or cannot tolerate, working constantly at maximum levels of performance. This is similar to modern microprocessors whose Turbo mode must ensure that thermal limits are not exceeded. In this case, the technology could use data from temperature sensors to locate frequency and voltage points that ensure safe and stable operation. The concept of trading performance for energy, as demonstrated in this work, can benefit many applications. For example, financial computing for low-latency trading requires responses within fractions of a second, and a configuration set at maximum voltage and frequency is the most suitable in this scenario. Clock gating could be used to reduce temperatures when new operations are not required, while transitions to active states remain possible in a single clock cycle. On the other hand, background calculations that take place while the market is closed, or that are based on medium-frequency trading approaches, will benefit from different configuration points focused on energy efficiency at a reduced voltage and frequency.
The rest of the paper is structured as follows. Section 2 describes related work. Section 3 presents the voltage and frequency scaling IP cores and the test platform architecture. Section 4 explores the performance and power margins available in 28 nm FPGAs. Finally, Section 5 presents the conclusions and future work.
Previous works
In this section we review the related work in the area of FPGA power optimization. In order to identify ways of reducing the power consumption in FPGAs, some research has focused on developing new FPGA architectures that implement multi-threshold voltage techniques, multi-Vdd techniques and power gating techniques [4] [5] [6] [7] [8].
Other strategies have proposed modifying the map and place-and-route algorithms to provide power-aware implementations [9] [10] [11]. This related work is targeted at FPGA manufacturers and tool designers for adoption in new platforms and design environments. On the other hand, a user-level approach is proposed in [12], which presents a dynamic voltage scaling strategy for commercial FPGAs that aims to minimise power consumption for a given task. In this methodology, the internal voltage of the FPGA is controlled by a variable power supply. For a given task, the lowest supply voltage of operation is derived experimentally and, at run-time, the voltage is adjusted to operate at this critical point. A logic delay measurement circuit is used with an external computer as a feedback control input to adjust the internal voltage of the FPGA (VCCINT) at intervals of 200 ms. With this approach, the authors demonstrate power savings from 4% to 54% on the VCCINT supply. The experiments are performed on the Xilinx Virtex 300E-8 device fabricated in a 180 nm process technology. The logic delay measurement circuit (LDMC) is an essential part of the system because it measures the device and environmental variation of the critical path of the functionality implemented in the FPGA; it is therefore used to characterise the effects of voltage scaling and to provide feedback to the control system. This work is mainly presented as a proof of concept of the power-saving capabilities of dynamic voltage scaling on readily available commercial FPGAs and therefore does not focus on efficient implementation strategies that minimise energy and overheads.
A comparable approach, also based on delay lines, is demonstrated by the authors in [13]. A dynamic voltage scaling strategy is proposed to minimise the energy consumption of an FPGA-based processing element by first adjusting the voltage and then searching for a suitable frequency at which to operate. Again, in this approach, the critical path of the task under test is identified first, and then a logic delay measurement circuit is used to track the critical point of operation as voltage and frequency are scaled. Significant savings in power and energy are measured as the voltage is scaled from its nominal value of 1.0 V down to its limit of 0.6 V. Beyond this point, the system fails.
Xilinx has also investigated the possibility of using lower voltage levels to save power in its latest family by implementing a type of static voltage scaling [14]. The voltage identification bit available in Virtex-7 allows some devices to operate at 0.9 V instead of the nominal 1 V while maintaining nominal performance. During testing, devices that can maintain nominal performance at 0.9 V are programmed with the voltage identification bit set to 1. A board capable of using this feature can read the voltage identification bit and, if it is active, lower the supply to 0.9 V, reducing power by around 30%. This is a static configuration that maintains the original level of performance and takes place at boot time, in contrast with the dynamic approach investigated in this paper.
In-situ detectors located at the end of the critical paths remove the need for delay lines. This technology has been demonstrated in custom processor designs such as those based around ARM Razor [15]. Razor allows timing errors to occur in the main circuit; these are detected and corrected by re-executing the failed instructions. The latest incarnation of Razor uses an optimized flip-flop structure able to detect late transitions that could lead to errors in the flip-flops located in the critical paths. The supply voltage is lowered from a nominal value of 1.2 V (0.13 μm CMOS) for a processor design based on the Alpha microarchitecture, giving an approximately 33% reduction in energy dissipation with a constant error rate of 0.04%. The Razor technology requires changes in the microarchitecture of the processor and cannot be easily applied to non-processor designs. It also uses a specialized flip-flop. Our work in [3] presents the application of in-situ detectors to commercial FPGAs that deploy arbitrary user designs. The presented approach removes the need for delay lines, as used previously by the authors in [13], increasing the robustness and efficiency of the system. Additionally, it only uses the technology primitives already available in the FPGA and does not require chip fabrication or redesign.
In this paper we extend the work of [3] by presenting the additional blocks required to regulate voltage and frequency at run-time using state-of-the-art devices and leveraging the availability of the PMBus in off-the-shelf FPGA boards. In addition, we investigate the run-time power and performance scaling in 28 nm devices and compare it with the work in [3] based on 65 nm FPGAs.
[...] to guide the designer in this task. We have selected the second method because we need to access the PMBus interface internally to scale the voltage dynamically and autonomously.
IP Cores and test platform architecture
We have created two hardware units to have full control of the voltage and frequency in the system. These are described in the next two sections.
A. Dynamic Voltage Scaling unit
Figure 1 shows the architecture of the Dynamic Voltage Scaling (DVS) unit. The DVS unit has three main components: a MicroBlaze processor (MB), a register file implemented using a Dual-Port RAM (DPRAM) and an I2C IP core. These components are connected to a local AXI bus. The DVS unit has full configuration and monitoring capabilities over the power rails connected to the PMBus. The DPRAM is used to receive the commands from the system processor. The commands control and record power and voltage values. The MB is responsible for executing the commands, communicating with the PMBus via the I2C IP core and writing the results to the DPRAM. An MB processor is needed mainly because of the relative complexity of I2C communications: a state machine implementation would be complex to design and to maintain for different boards with slight differences in their PMBus implementations. Although a simpler core such as a PicoBlaze could be an alternative, its code size limitation could be a problem, since it is possible to monitor and configure many parameters related to the main core in the processing subsystem, the FPGA fabric and the external DDR memories. The initialization, configuration and monitoring code is written in C and compiled into a .elf file using the standard MicroBlaze compiler.
The DVS core is controlled with commands issued by the system processor. A command has 32 bits and contains six parameters, as can be seen in Figure 2. Table 1 presents the details of the commands and parameters. Setting Action0 to 1 indicates that there is a new task for the DVS IP core. The Read/Write field indicates whether the task is a monitoring task or a voltage scaling task.
When the task is monitoring, the Read (PL (programmable logic), MEM) and Read (V, I, P) fields determine which power supply (PL or memory) and which parameter (voltage, current or consumed power) is monitored.
The voltage, current and power readings are recorded at address offsets 0x1, 0x2 of the DPRAM. The parameters that are read and their address offsets in the DPRAM can be changed depending on the user requirements.
When the task is voltage scaling, the DVS IP core scales the voltage to the value written in the Voltage value field. The voltage can be scaled in both directions between 650 mV and 1 V. The IP core is designed to keep the voltage within this range to avoid damaging the board or cutting off its power supply. This means that the IP core automatically rejects commands that indicate a voltage value outside this range.
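The range check described above can be sketched as follows. This is a minimal illustration in Python rather than the C firmware that runs on the MB; the constant and function names are hypothetical, and only the 650 mV and 1 V limits are taken from the text.

```python
# Limits quoted in the text; all names below are hypothetical.
V_MIN_MV = 650
V_MAX_MV = 1000


def accept_voltage_request(target_mv: int) -> bool:
    """Return True if the requested VCCINT value lies inside the allowed range."""
    return V_MIN_MV <= target_mv <= V_MAX_MV


def apply_voltage_request(current_mv: int, target_mv: int) -> int:
    """Return the new set point in mV.

    Out-of-range requests are rejected, so the current set point is kept.
    """
    if not accept_voltage_request(target_mv):
        return current_mv
    return target_mv
```

With this sketch, a request for 700 mV is accepted, while requests for 600 mV or 1050 mV leave the set point unchanged.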
When a monitoring or voltage scaling task completes, the MB clears the command in the DPRAM and sets Action1 to 1 to inform the system processor that the task has finished.
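The command handshake can be illustrated with a small sketch. The bit positions used here are assumptions made purely for illustration, since the actual layout is defined in Figure 2 and is not reproduced in the text; the helper names are hypothetical as well. Only the field names, the 32-bit width and the meaning of Action0 and Action1 follow the description above.

```python
# Hypothetical field positions inside the 32-bit command word (NOT from Figure 2).
ACTION0_BIT = 31                       # 1 = new task for the DVS IP core
ACTION1_SHIFT, ACTION1_MASK = 28, 0x7  # 1 = task finished, other values = status codes
RW_BIT = 27                            # 0 = monitoring task, 1 = voltage scaling task
RAIL_BIT = 26                          # 0 = PL, 1 = memory
PARAM_SHIFT, PARAM_MASK = 24, 0x3      # selects voltage, current or power
VOLT_MASK = 0xFFFF                     # requested voltage in mV


def pack_command(write: bool, rail_mem: int = 0, param: int = 0, voltage_mv: int = 0) -> int:
    """Build the word the system processor would write to the DPRAM."""
    word = 1 << ACTION0_BIT
    word |= (1 if write else 0) << RW_BIT
    word |= (rail_mem & 1) << RAIL_BIT
    word |= (param & PARAM_MASK) << PARAM_SHIFT
    word |= voltage_mv & VOLT_MASK
    return word


def complete_command(status: int = 1) -> int:
    """Word left by the MB after a task: command cleared, Action1 set to status."""
    return (status & ACTION1_MASK) << ACTION1_SHIFT


def is_pending(word: int) -> bool:
    """True while Action0 is set, i.e. the task has not been consumed yet."""
    return bool((word >> ACTION0_BIT) & 1)


def action1(word: int) -> int:
    """Extract the Action1 field."""
    return (word >> ACTION1_SHIFT) & ACTION1_MASK
```

In this sketch the system processor would write `pack_command(write=True, voltage_mv=800)` and then poll the same word until `is_pending` returns False and `action1` returns 1.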
In this work we have used a Xilinx VC707 evaluation board, which is based on a Xilinx Virtex-7 XC7VX485T device. Table 2 shows the complexity of the DVS unit components after implementation in the XC7VX485T device. As can be seen in this table, the unit is area efficient and consumes only a small fraction of the available resources.
To help with debugging the system, five error report codes have been defined for the DVS unit. The list of error codes can be seen in Table 3. When one of the errors is detected, the MB clears the command in the DPRAM and sets Action1 to the corresponding error code in this table to inform the system processor that an error has occurred.
Table 1 (excerpt). Command parameters of the DVS unit.
Read/Write = 0: the IP core reads the voltage, current or consumed power of the PL or memory.
Read/Write = 1: the IP core scales the PL voltage (VCCINT).
Read(PL,MEM) = 0: the PL is selected for monitoring of its voltage, current or power.
Read(PL,MEM) = 1: the memory is selected for monitoring of its voltage, current or power.
16-Bit Accumulator: connected to the 16-bit LFSR counter.
In addition, each PCASTM includes a simple 'speed test' (ST) circuit to evaluate the performance of the chain.
The ST circuit of each module has an 8-bit LFSR counter and an 8-bit comparator. Each module is connected to its neighbours in the chain of PCASTM and compares the value of its own counter with the value of the counter in the previous module. Failure will be detected and reported as soon as any pair of counter values do not match.
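The behaviour of the speed test can be modelled in software as a rough sketch. The feedback taps of the 8-bit LFSR are not given in the text, so the taps below are an assumption chosen to give a maximal-length sequence; the function names and the way a timing failure is modelled (one module missing a clock edge) are hypothetical too.

```python
def lfsr8_next(state: int) -> int:
    """Advance an 8-bit Fibonacci LFSR by one step.

    Feedback is the XOR of bits 0, 2, 3 and 4 (an assumed tap choice).
    """
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 4)) & 1
    return (state >> 1) | (bit << 7)


def lfsr8_period(seed: int = 0x01) -> int:
    """Count the steps needed for the LFSR to return to its seed."""
    state, steps = lfsr8_next(seed), 1
    while state != seed:
        state = lfsr8_next(state)
        steps += 1
    return steps


def chain_passes(counters) -> bool:
    """Comparator check: each module must match the counter of the previous module."""
    return all(a == b for a, b in zip(counters, counters[1:]))


def run_chain(n_modules, n_cycles, faulty=None, fault_cycle=None, seed=0x01):
    """Clock the chain and return the first cycle with a mismatch, or None."""
    counters = [seed] * n_modules
    for cycle in range(n_cycles):
        for i in range(n_modules):
            if i == faulty and cycle == fault_cycle:
                continue  # this module misses a clock edge (models a timing failure)
            counters[i] = lfsr8_next(counters[i])
        if not chain_passes(counters):
            return cycle
    return None
```

With these taps the counter visits all 255 non-zero states before repeating, so a module that falls one step behind always disagrees with its neighbour and the failure is reported on the cycle in which it occurs.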
We have implemented the test systems with an initial clock frequency of 100 MHz, and the PicoBlaze increases the frequency to find the maximum operational frequency and performance. We measured a latency of 1.63 ms between issuing a monitoring command and the completion of its execution. Commands that request a voltage scaling operation need approximately 8.64 ms to complete. These read and write latencies should be taken into account when developing energy-proportional systems based on these devices and boards.
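As a rough illustration of how these latencies constrain a control loop, the sketch below estimates the time needed to step the voltage through a range with one monitoring read after each step. The step size, the number of reads per step and the assumption that commands are issued back to back with no other overhead are illustrative choices, not values from this work; only the two latencies are taken from the measurements above.

```python
# Measured command latencies quoted in the text (milliseconds).
MONITOR_LATENCY_MS = 1.63
SCALE_LATENCY_MS = 8.64


def sweep_time_ms(v_start_mv: int, v_end_mv: int, step_mv: int, reads_per_step: int = 1) -> float:
    """Estimated time to sweep the voltage, assuming back-to-back commands."""
    steps = abs(v_start_mv - v_end_mv) // step_mv
    return steps * (SCALE_LATENCY_MS + reads_per_step * MONITOR_LATENCY_MS)


def max_command_rate_hz(latency_ms: float) -> float:
    """Upper bound on the number of commands per second for a given latency."""
    return 1000.0 / latency_ms
```

Under these assumptions, a sweep from 1 V down to 700 mV in hypothetical 10 mV steps takes 30 x (8.64 + 1.63) ms, which is about 308 ms, and the voltage set point cannot be updated more than roughly 115 times per second.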
We have also measured that the minimum safe voltage is 700 mV.
Power and performance analysis
In this section we implement different test systems with a varying number of test modules to evaluate the run-time power and performance scaling of the systems.
4-1-Area
4-2-Analysis at a fixed frequency of 100 MHz
[...] PCASTM modules. Figure 6 shows the monitored power consumption at the nominal voltage (i.e. 1 V) compared to the power estimated by the Xilinx power tool (XPower Analyzer) for different test modules. Figure 6 shows that the measured power is approximately 30% higher than the values estimated by XPower Analyzer. Figure 7 displays the temperatures reached by each of the configurations. As expected, more complex configurations increase the temperatures measured in the device, but in all cases the temperatures remain below dangerous levels.
4-3-Analysis at the maximum frequency
We have increased the clock frequency with the DFS IP core to find the maximum clock frequency for each configuration and to measure the power consumption and temperature at that frequency. The maximum allowed operating temperature for the device is 85°C according to . In all these experiments an FPGA cooling fan is active at a constant rate. Figure 10 displays the temperature of the device when it operates at the maximum frequency. Although a higher temperature is expected at the maximum frequency, the FPGA cooling fan keeps the temperature close to that of the 100 MHz case shown in Figure 7, and it stays well below the recommended 85°C value.
4-4-Static Power
We have implemented systems of different complexities to measure the static power. The clock generator is stopped using a user switch available on the board so that only static power remains. We switched the monitoring method to the TI monitoring tool to measure the static power, since the DVS core does not operate without clocks. Figure 11 shows the static power and voltage analysis. As it can be seen in this figure 
4-5-Margins analysis
We have created timing constraints to analyse the maximum frequency of a single PCASTM module for each configuration, with varying numbers of modules, using the Xilinx timing analyzer software available in the ISE package. We compare these frequencies with the maximum frequencies achieved in the physical prototype at nominal voltage to investigate the existing margins. Figure 16 displays the software reports and the maximum frequencies achieved. This figure shows that the static timing analysis reports a maximum frequency of around 200 MHz, which is consistent with the value reported by the manufacturer in [23]. The figure also shows that there is a large margin compared with the measured performance. We have verified that the test circuits exercise the critical paths in the design, which validates this result.
Conclusion and future work
Our previous work in [3] investigated the capability of standard FPGA devices to operate outside their nominal ranges with over- and under-scaling of frequency and voltage. The work presented in [3] was based on older Virtex-5 devices fabricated using a 65 nm process. In this paper we investigate if these margins are still present 
