In the ultra-low-power era, one of the most effective ways to reduce power consumption is to lower the supply voltage. When programs execute on processors, voltage fluctuations can occur due to sudden changes in current draw between successive instructions. Such fluctuations can push the supply voltage below acceptable levels and cause unreliable operation in microprocessors. Voltage droops due to di/dt effects have been studied in the past; however, no prior work studies the effect of compiler optimizations on voltage droops. Past work has studied the impact of compiler optimizations on performance and power, but not on reliability. In this paper, we analyze voltage droops at different compiler optimization levels. We also report the corresponding performance, power, and energy results to put the droop results into perspective. No clear trend could be observed in the effect of compiler optimizations on voltage droops. We conclude that dynamic voltage noise mitigation is necessary, because static compiler optimization cannot guarantee a reduction in voltage noise.
INTRODUCTION
Increases in CPU clock frequency for higher performance have been limited by power constraints. One of the most effective ways to decrease power is to scale down the supply voltage. However, circuits become more susceptible to supply voltage noise when operating near the threshold voltage, and even a small supply voltage fluctuation may cause reliability problems at lower supply voltages. Designers now need to analyze supply voltage noise and devise solutions that guarantee reliable processor behavior at very low supply voltages. Low-power goals and techniques need to be managed in tandem with the reliability goals of the processor.
Supply voltage fluctuation is caused by sudden changes in the current drawn by the microprocessor through the power distribution network. Parasitic inductance on the die, package, and board disturbs the current flow from the on-board voltage regulator to the processor components on the die. Such a disturbance causes a temporary shortage of the electric charge needed to power the processor components. Decoupling capacitance can store charge and supply it to the processor components when a voltage emergency arises. However, the capacitance can also induce voltage fluctuation because of the characteristics of the resulting RLC circuit.
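As a first-order illustration of this mechanism, the power distribution network can be approximated by a single lumped series resistance R and inductance L between the regulator and the die, with decoupling capacitance C across the load. The lumped model and all symbols here are assumptions made for illustration; they are not the measured network of the system studied in this paper.

```latex
V_{\mathrm{die}}(t) \approx V_{\mathrm{reg}} - R\, i(t) - L\, \frac{di(t)}{dt},
\qquad
f_{\mathrm{res}} = \frac{1}{2\pi\sqrt{LC}}
```

Under this simplification, a fast rise in the load current i(t) makes the L di/dt term large, which appears as a droop in the die voltage. Current swings that recur near the resonant frequency excite the RLC network, which is one way decoupling capacitance can contribute to fluctuation rather than only suppress it.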
The rate of change of current draw is affected by program behavior. As an instruction sequence flows through the microprocessor, internal components are turned on and off, which changes the current draw. It is difficult to predict the current draw cycle by cycle, because many instructions are in flight at once across different pipeline stages and different paths.
The order of instructions in a sequence, that is, the instruction schedule, is important for both performance and power. Compilers usually optimize the schedule for high performance but not for low power, because it is difficult to provide compilers with detailed power information. Compilers also affect the choice of instructions used to accomplish a task.
This paper is motivated by Valluri and John [1], who studied the impact of compiler optimizations on performance and power and concluded that (1) improving performance by reducing the number of instructions reduces energy, and (2) improving performance by increasing the overlap in the program increases average power. We extend their discussion by adding a new perspective: we study reliability, in the form of voltage droops, while running various programs on real SMT multi-core processor hardware. This paper is based on measurements on actual hardware. In contrast to Valluri and John [1], we compare only the cases with no optimization and with full optimization, because there were only slight differences among O1 to O4.
We measure performance, power, and reliability at different compiler optimization levels. We then compute energy from the measured performance and power numbers.
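The energy computation can be sketched as follows, assuming that energy is taken as average power multiplied by runtime. The function names and the example ratios are hypothetical and are not measurements from this paper.

```python
def energy_joules(avg_power_watts: float, runtime_seconds: float) -> float:
    """Energy as average power times runtime (assumed definition)."""
    return avg_power_watts * runtime_seconds


def relative_energy(power_ratio: float, runtime_ratio: float) -> float:
    """Energy of one build relative to a baseline build.

    Both arguments are ratios to the same baseline (for example -O3
    relative to -O0), so the result is also a ratio to that baseline.
    """
    return power_ratio * runtime_ratio


# Hypothetical example: 10% more power but half the runtime.
ratio = relative_energy(power_ratio=1.10, runtime_ratio=0.50)
```

With these made-up ratios the optimized build would use about 55% of the baseline energy, which illustrates how a shorter runtime can outweigh a higher average power.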
RELATED WORK
Valluri and John [1] studied compiler optimization effects on performance and power. The conclusions are that (1) performance improvement by reducing the number of instructions brings energy reduction and that (2) performance improvement by increasing the overlap in program increases average power dissipation.
However, in that work, power is represented as the average power over the whole execution. This is problematic because a voltage emergency occurs over a much shorter period than the whole program execution time: tens of nanoseconds to microseconds, compared to several minutes.
Reddi et al. proposed a dynamic scheduling workflow based on a checkpoint-and-recovery mechanism to suppress voltage emergencies [3] . Once a piece of code causes a voltage margin violation, it is registered as a hotspot, and the dynamic compiler applies NOP injection and/or code rescheduling. This flow is independent of architecture and workload. However, users must set the initial voltage margin carefully so that voltage emergencies do not become too frequent.
COMPILER SCHEDULING IMPACT ON PERFORMANCE, POWER, AND SUPPLY VOLTAGE VARIATION
In [1] , Valluri and John used a data dependence graph (DDG) to show that peak power can vary even when the execution time (1/performance) is the same. In Figure 1 , we expand the DDG to explain voltage fluctuation due to the switching activity of execution units and registers. To keep the example simple, we assume that (1) each operation takes one cycle to execute, except operation E, which takes two cycles; (2) operations do not consume extra cycles for register reads and writebacks; and (3) each operation, register read, and writeback consumes one unit of power per cycle. For example, when operation A executes, two register reads for its operands, the execution of A, and one writeback all complete in one cycle and consume four units of power. In Table 1 , each column contains the usage counts of the hardware resources, and the total power is calculated in the last row. Both (b) and (e) have the largest peak power, 11, at cycle 2 and cycle 1, respectively. However, the largest voltage droop may occur at cycle 1 of (e) if an idle state, whose instantaneous power is nearly 0, precedes it at cycle 0.
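The unit-power model above can be sketched in a few lines. The schedules below are hypothetical and do not reproduce Figure 1 or Table 1. The sketch also assumes that a multi-cycle operation reads its registers in its first cycle and writes back in its last cycle, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    reads: int = 2     # register reads, charged to the first cycle (assumption)
    latency: int = 1   # execution cycles, one unit of power each
    writes: int = 1    # writebacks, charged to the last cycle (assumption)


def power_profile(schedule):
    """Return unit power per cycle for a list of (start_cycle, Op) pairs."""
    n_cycles = max(start + op.latency for start, op in schedule)
    profile = [0] * n_cycles
    for start, op in schedule:
        profile[start] += op.reads
        for cycle in range(start, start + op.latency):
            profile[cycle] += 1
        profile[start + op.latency - 1] += op.writes
    return profile


def largest_rise(profile, initial=0):
    """Largest cycle-to-cycle power increase, a crude proxy for di/dt.

    `initial` is the power in the cycle before the schedule starts;
    0 models an idle state preceding the first cycle.
    """
    levels = [initial] + list(profile)
    return max(b - a for a, b in zip(levels, levels[1:]))


a, b, c = Op("A"), Op("B"), Op("C")
serial = power_profile([(0, a), (1, b), (2, c)])    # [4, 4, 4]
parallel = power_profile([(0, a), (0, b), (1, c)])  # [8, 4]
```

In this toy case the parallel schedule finishes one cycle earlier, but its largest rise from idle is 8 units instead of 4, which is the kind of trade-off between overlap and current swing that the DDG example describes.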
EXPERIMENTAL RESULTS
In this section, we show our experimental methodology and the results.
To analyze performance, power, and voltage droops at different compiler optimization levels, we run a range of benchmarks: a small but highly scalable program, a high-performance program, and a standard benchmark suite. We use a real hardware system rather than a simulator. By measuring on silicon, we expect the study to avoid much of the error and uncertainty introduced by the abstraction and modeling steps that simulation of a processor requires.
Experimental Setup
In our experimental hardware system, we use a state-of-the-art x86-64 multi-core SMT processor. Our AMD Bulldozer processor consists of four Bulldozer modules on a single chip, and each Bulldozer module supports two simultaneous multithreading (SMT) hardware threads. One Bulldozer module has one 64 KB I-Cache, two 16 KB D-Caches (one per hardware thread), and 2 MB of L2 cache. The four Bulldozer modules share an 8 MB L3 cache. The Bulldozer architecture is described in more detail in [6] . We use several metrics to analyze the impact of compiler optimization levels on programs. Performance is measured as runtime or as the inverse of runtime, and is reported by the benchmark program itself. Power is measured in watts and is calculated from the supply voltage and the current, where the current is measured as the voltage drop across a unit resistor. Voltage droop is measured with an oscilloscope and differential probes attached to the main supply voltage pins on the processor package.
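The power and droop calculations described above can be sketched as follows, assuming sampled traces of the supply voltage and of the voltage across the sense resistor. The function names, the default resistance of 1 ohm (one reading of "unit resistor"), and the sample values are assumptions for illustration, not the actual measurement flow.

```python
def power_trace(v_supply, v_sense, r_sense_ohms=1.0):
    """Instantaneous power per sample: P = V_supply * (V_sense / R_sense)."""
    if len(v_supply) != len(v_sense):
        raise ValueError("traces must have the same number of samples")
    return [vs * (vr / r_sense_ohms) for vs, vr in zip(v_supply, v_sense)]


def average_power(v_supply, v_sense, r_sense_ohms=1.0):
    """Mean of the instantaneous power over the whole trace."""
    trace = power_trace(v_supply, v_sense, r_sense_ohms)
    return sum(trace) / len(trace)


def worst_droop(v_supply, v_nominal):
    """Largest dip of the supply voltage below its nominal level, in volts."""
    return v_nominal - min(v_supply)


# Hypothetical samples, not measured data.
volts = [1.00, 0.90, 0.95, 1.00]
sense = [2.0, 4.0, 3.0, 2.0]
droop = worst_droop(volts, v_nominal=1.00)
```

The droop depends on a single worst sample while the average power depends on the whole trace, so two runs with similar average power can still show different droops.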
miniFE
miniFE [2] is a mini-application that mimics finite element generation, assembly, and solution for an unstructured grid problem. Table 2 shows the runtime, power, and voltage droop values as a function of the number of threads. Each value is normalized to the 1-thread (1T) case. We observe the following:

• At 8T or 16T, the runtime is reduced to one fifth of the 1T runtime.
• Voltage scaling will be needed for 8T and 16T if there is a power constraint. However, if the voltage margin is insufficient, frequency scaling is required despite the performance degradation.
• Energy starts to saturate at 4T. Therefore, if a thermal constraint must be considered, 4T is the best choice both for sustaining the same battery life and for keeping good performance.
• Voltage droops in miniFE are not seriously affected by the compiler optimization level, because miniFE depends heavily on the OpenMPI library, which is already optimized with -O3. The high level of compiler optimization used in the library makes the difference between unoptimized and optimized code fairly small.
For Table 2 and Figure 2 , we set the problem dimensions to nx=150, ny=150, and nz=150.
High-Performance Linpack
The High-Performance (HP) Linpack benchmark is widely used to measure the performance of supercomputers (Top 500). We first compiled and ran the benchmark at two optimization levels, -O0 and -O3, but saw no difference in performance, power, or droop between them. We found that the benchmark depends heavily on Basic Linear Algebra Subprograms (BLAS) library routines such as daxpy and dgemm. The library is necessary for multi-threading and is usually provided by a processor vendor for a specific architecture. The processor vendor's BLAS library on its developer site [9] is pre-compiled, and we could not obtain the source code needed to recompile it at different compiler optimization levels.
We therefore obtained the original BLAS library from the national lab's web page [10] and used it for the following experiments. The performance of the original BLAS library is much worse than that of the processor vendor's BLAS, but it lets us see clear changes in performance, power, and voltage droop across compiler optimization levels.
As Table 3 shows, HP Linpack's performance is strongly affected by the optimization method. Compiler optimization gives a fivefold performance improvement over no optimization, and the processor vendor's library optimization improves performance by more than a further factor of five beyond the compiler-optimized case. These results also show the importance of library optimization for a multi-threaded application.
Increases in performance are usually accompanied by increases in power, but here power is reduced by 35% from -O0 to -O3.
SPEC CPU2006
We run all the SPECint and SPECfp benchmarks in the SPEC CPU2006 suite to observe performance, power, and voltage variation in a multi-programmed setting. Each SPEC benchmark is a single-threaded program, so multiple copies of the same program run on the Bulldozer modules to calculate SPECrate. First, we run each benchmark as a single thread on one of the four Bulldozer modules, with its affinity fixed to avoid thread migration effects that could distort the voltage droop measurements. Then, we run four copies of the same program, one on each Bulldozer module (there are four Bulldozer modules in the current processor), and compare the 4T cases to the 1T cases.
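One way to fix affinity on Linux is sketched below using the standard `os.sched_setaffinity` call. This is an assumption about tooling, since the mechanism used in the experiments is not stated, and the mapping from logical CPU numbers to Bulldozer modules is system specific and must be checked on the target machine.

```python
import os


def pin_to_cpu(cpu: int) -> set:
    """Restrict the calling process to one logical CPU (Linux only).

    Returns the affinity set reported by the OS after the change.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity control is not available here")
    os.sched_setaffinity(0, {cpu})  # pid 0 means the calling process
    return set(os.sched_getaffinity(0))
```

Child processes inherit the affinity mask, so a launcher script can call `pin_to_cpu` once and then start the benchmark; for the 4T runs, each of the four copies would be pinned to a logical CPU on a different module.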
We discuss single-thread performance first. Table 4 presents the values for -O3 relative to -O0, with the -O0 value set to 1.00. Every SPECrate value in Table 4 is greater than 1.00, meaning that compiler optimization always improves performance. However, power does not show any uniform trend with compiler optimization. The change in voltage droop ranges from -15% to +15% depending on the benchmark, and it shows no trend either.
Next, we study the impact of compiler optimization on the 4T cases (Table 5) , and compare the effects to the 1T cases.
The performance improvement (SPECrate) due to optimization is smaller in the 4T case than in the single-thread case. This is because multiple threads contend for shared resources such as the L3 cache. Even though no two of the four threads run on the same Bulldozer module, contention for L3 and memory accesses is unavoidable.
Most benchmarks consume less power with -O3 than with -O0, as indicated by the ratios in column 2; the exceptions are gamess, gromacs, and tonto. This could be because the fraction of idle time in the total runtime increases due to resource contention.
The change in voltage droop ranges from -15% to +15% depending on the benchmark, but it shows no clear trend with optimization. In some benchmarks, such as h264ref, voltage droop increases with compiler optimization in the 4T case but decreases with compiler optimization in the 1T case.
With compiler optimization, energy is reduced by 30% to 90% in 1T and by 7% to 89% in 4T. Because performance degrades in 4T, the energy reduction from compiler optimization is smaller in 4T than in 1T.
The following summarizes our observations:
• Compiler optimization and performance: programs compiled with -O3 are faster than those compiled with -O0 in most cases. However, lbm did not follow this trend with 4T.
• Multithreading, voltage fluctuation, and hence reliability: when SPEC programs are run in SPECrate mode, the droops increase as we go from the 1T to the 4T cases.
SUMMARY AND CONCLUSION
We conducted experiments to study the impact of compiler optimizations on voltage fluctuations during program execution. Several programs were run as optimized and unoptimized versions built from the same source, and their performance, power, energy, and voltage fluctuations were measured.
We can conclude the following:

• Energy can be dramatically reduced by increasing the number of threads and performing compiler optimization.

ACKNOWLEDGMENTS
Lizy John and Youngtaek Kim are partially supported by NSF grant 1117895 and by AMD unrestricted research funding. This work was conducted while Youngtaek Kim was an intern at AMD Research. We thank the many people at AMD who provided help and comments during this research.
