Processing embedded applications is essentially a tradeoff between power and performance. The increasing complexity of present-day microprocessors, which comes at the expense of more power, calls for different optimization methodologies at the architecture level. The study of the general characteristics of program execution phases gives insight into how to dynamically reconfigure, or enable and disable, additional resources on demand. This leads to a significant power saving with negligible or tolerable performance degradation. In this paper we characterize the execution of such programs into execution phases based on their dynamic IPC profile. We show that the execution of selected phases (based on the IPC profile) can be dynamically boosted by activating additional standby functional units which are otherwise powered down to save energy. Through simulation we show that a speedup ranging from 1.1 to 1.25 can be achieved while reducing the energy-delay product (EDP) for most of the media benchmarks evaluated.
INTRODUCTION
Power dissipation and energy consumption are becoming primary design concerns in both embedded System on a Chip (SoC) solutions and high performance processors. This is primarily due to clock rates and die sizes, which are constantly increasing with advances in technology scaling. Power translates directly to heat, which may lead to thermal runaway, junction fatigue, localized hot spots and other reliability problems. New packaging techniques and separate cooling solutions may not be cost effective in SoC platforms consisting of multiple (possibly heterogeneous) processors. The design of complex systems involves analyzing the interaction between different components of the system (both hardware and software) and their impact on power, performance and area. It is becoming increasingly necessary to evaluate power and energy consumption at different stages of the design and to make important decisions early in the design process. Research in architectural level power optimization aims at analyzing the impact of microarchitectural parameters (both resources and data dependencies) on power and performance, and at designing architectures that are power aware. Runtime optimization of processor resources is becoming increasingly important from the perspective of power aware processor architecture. With the increasing complexity of application programs, the processing requirement varies at runtime. An analysis of basic block distribution in [1] leads to an automated approach to identify a smaller subset of code which represents the overall program characteristics. The periodic behavior of the program observed in [1] actually reflects the behavior of basic blocks that are frequently executed in a section of the code. In our work we identify different states of processor utilization using the dynamic IPC profile and issue rate, and the resource allocation decision is taken at every phase.
Marculescu [6] proposes a mechanism to dynamically adapt the fetch and execution bandwidth based on profile information at the basic block level using a compiler based scheme. Brooks et al. [7] observe that many modern processors are over-provisioned, and minimize power consumption in functional units by exploiting the fact that the sizes of operands are usually small for many programs. In this work, we borrow the concept of dynamically turning resources on/off to compensate for the throttling at the fetch stage, which tends to inhibit performance.

Table: Processor Model Parameters

INTERACTION BETWEEN PROCESSOR
It is practically impossible to simulate all possible combinations of a set of parameters to estimate the interactions between them. To evaluate the interactions between parameters we chose the Plackett and Burman design, which limits the number of simulations to about the number of parameters considered [4]. This design is a two-level fractional factorial design for studying k = N - 1 variables in N runs, where N is a multiple of 4 [5]. In the P-B (Plackett-Burman) design all k parameters are varied simultaneously over N simulations. An improvement over the basic P-B design is the fold-over P-B design, which uses 2N simulations. Table 2 shows a smaller version of the P-B design matrix with fold-over for k = 7, N = 8. A +1 value for a parameter indicates its high value and -1 indicates its low value. The first row is a sequence which is fixed for a given value of k. Rows 2 to N-1 are each generated by a circular right shift of the previous row.

In the previous section we identify the phase or state of the processor based upon dynamic IPC (averaged over 10,000 cycles).
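The fold-over P-B construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used for the experiments: the function names are hypothetical, the generating row is the commonly tabulated one for N = 8, and the final all -1 row comes from the standard P-B construction rather than from the text. The `effects` helper shows the usual way such a matrix is used, summing each response weighted by the sign of the parameter in that run.

```python
from typing import List, Sequence

# Commonly tabulated Plackett-Burman generating row for N = 8 (k = 7).
# Assumption: this is the row behind Table 2; the text does not list it.
PB8_FIRST_ROW = [+1, +1, +1, -1, +1, -1, -1]


def pb_design(first_row: Sequence[int]) -> List[List[int]]:
    """Build the N x k base design from the fixed first row (k = N - 1)."""
    k = len(first_row)
    rows = [list(first_row)]
    for _ in range(k - 1):
        prev = rows[-1]
        rows.append([prev[-1]] + prev[:-1])  # circular right shift
    rows.append([-1] * k)  # standard construction: last row is all low
    return rows


def fold_over(design: List[List[int]]) -> List[List[int]]:
    """Append the sign-flipped copy of every run, giving 2N runs."""
    return design + [[-v for v in row] for row in design]


def effects(design: List[List[int]], responses: Sequence[float]) -> List[float]:
    """Signed sum of the responses for each parameter (one per column)."""
    k = len(design[0])
    return [sum(row[j] * y for row, y in zip(design, responses)) for j in range(k)]


if __name__ == "__main__":
    for run in fold_over(pb_design(PB8_FIRST_ROW)):
        print(" ".join(f"{v:+d}" for v in run))
```

Ranking the parameters by the magnitude of their effect then indicates which ones matter most; the responses would be measured values such as IPC or power from each simulation run.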
Additional resources are powered on when the processor attains a state where it records an IPC greater than P. Our study of the behavior of different programs based on IPC, average RoB occupancy and average issue rate shows that these parameters exhibit discontinuity in their profile (see Figure 1). With this justification we select Q as 4. We set P equal to 2.5 based on the fact that most of the benchmarks show ILP around 2. Figure 2 shows the normalized IPC, power and EDP with dynamic resource increase (extended configuration) with and without fetch throttling. Bar (a) gives the base value with respect to which normalization is done. All benchmarks except adpcm show an average 1% to 3% power savings with negligible performance degradation. The EDP also decreases with throttling and resource allocation, indicating that introducing stalls is beneficial for most benchmarks considered. With the extended configuration, the performance improves in all programs (very little improvement in adpcm due to limited parallelism) with increased power dissipation. JPEG decoding yields the best improvement in the EDP (15% reduction) with a 25% improvement in speedup and 35% extra power dissipation for the extended configuration.
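The threshold policy and the EDP normalization discussed above can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the simulator logic: the function names are hypothetical, the decision for each window is assumed to be taken from the IPC measured in the preceding window, and EDP is taken as energy times delay, that is, power times delay squared. Only the 10,000-cycle window and P = 2.5 come from the text.

```python
from typing import List, Sequence

WINDOW_CYCLES = 10_000   # averaging window given in the text
IPC_THRESHOLD_P = 2.5    # threshold P given in the text


def window_ipc(committed: Sequence[int],
               window_cycles: int = WINDOW_CYCLES) -> List[float]:
    """IPC of each window, from instructions committed in that window."""
    return [n / window_cycles for n in committed]


def plan_configs(ipc_per_window: Sequence[float],
                 p: float = IPC_THRESHOLD_P) -> List[str]:
    """Configuration used in each window.

    Assumption: the controller reacts one window late, so window i + 1
    runs "extended" only if window i recorded an IPC greater than p.
    """
    if not ipc_per_window:
        return []
    configs = ["base"]  # nothing has been measured before the first window
    for ipc in ipc_per_window[:-1]:
        configs.append("extended" if ipc > p else "base")
    return configs


def normalized_edp(speedup: float, power_ratio: float) -> float:
    """EDP relative to the base run: power ratio times delay ratio squared."""
    delay_ratio = 1.0 / speedup
    return power_ratio * delay_ratio ** 2


if __name__ == "__main__":
    ipcs = window_ipc([18_000, 27_000, 31_000, 12_000])
    print(plan_configs(ipcs))
    print(round(normalized_edp(1.25, 1.35), 3))
```

Under these assumptions, a speedup of 1.25 with 35% extra power gives a normalized EDP of about 0.864, which is roughly in line with the reduction reported for JPEG decoding.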
CONCLUSIONS AND FUTURE WORK
In this paper we analyzed the phased behavior of programs to determine the execution state of the processor and to quantify it in terms of IPC and issue rate. Optimizations that combine resource scaling with stalls at pipeline stages produced an improvement in EDP with a speedup of 1.15 on average. Runtime optimization of execution speed and processing power depends on efficient detection of program phases. Power, which is an important design parameter, depends on the instruction density in the pipeline stages. Effective utilization of resources minimizes the occupancy of instructions in the pipeline. In our scheme additional stalls clear the pipeline congestion, and on average 10% power can be saved with negligible reduction in performance. Through this work we establish the need for automatic identification of program phases and for corresponding micro-architectural support to optimize these program phases at runtime.
