INTRODUCTION
The rapid growth of multimedia functions on handheld devices poses a challenge to embedded application developers in terms of many diverse goals, e.g., reduced development cost, low execution time, and optimal battery usage under hard real-time constraints [1, 2, 3]. Reducing energy, both on a per-cycle basis and as the total energy used over the lifetime of an application, has become more important as small embedded devices become increasingly available. The latter is implicitly connected to a demand for expensive packaging and cooling technology, an increase in product cost, and a decrease in product reliability in all segments of the computing domain.
Traditional DSP compilers do not meet the efficiency goals mentioned above [4]. In fact, the sophistication of current compilation techniques can be characterized as 'manufacturing implementation' of software, more or less like the design and implementation of embedded hardware boards in their early development stage [4, 5]. Though compilers have employed computation and data reordering to improve locality, this still requires expert analysis because parallelism and communication patterns are obscured in traditional languages. In line with this problem is the ambiguity between informal design methods and manual coding techniques. Software applications are developed with partial or complete unawareness of the underlying hardware architecture. The crux of the problem to be addressed in this context is to consider the application's static and dynamic profile at all stages of the build from application source to binary [6, 9]. A framework can be defined as a contiguous set of executable models relevant to system architectural aspects and behavior within a given application domain. Embedded systems are inherently component based: the components interact, process, and port data, and are later mapped onto real-time systems. We achieve this objective to some extent using a vertical profiling approach that profiles the application at the layers shown in Figure 1. Multilevel profiling leads to a huge search space for an optimal solution [8]. Seeking an optimal solution for each subtask is known to be an NP-hard optimization problem. This paper approximates the solution with the help of a genetic algorithm, which belongs to the class of stochastic optimization methods [26]. Although it does not guarantee finding the global optimum, the result is a good approximation of it. Different optimization schemes are arranged based on the application's static profile, and a fitness function is then defined to identify a good solution.
The key contribution of this paper is to demonstrate that significant potential exists for energy savings in multimedia applications without an undue increase in execution time. It therefore shows that code restructuring during iterative compilation is feasible and advantageous. This paper is organized as follows: first we approach the problem of profiling from an optimization viewpoint, and then we apply code restructuring algorithms in the context of energy-cycle optimization.
FORMALIZING PLATFORM-BASED FRAMEWORK
A software application can be expressed at different levels of abstraction, starting from plain C code through binary generation and eventually execution on a real-time platform. Section 2.1 describes our profile monitoring methodology. In Section 2.2 we formally define our objective function.
Methodology Flow and Performance Monitors
An embedded system can be viewed as a finite set of state machines in which data-dominated operations trigger state transfers, e.g., control transfer through if-else conditions. Such transfers may lead to cache misses, so state transfer events need careful consideration. In this work we extend the state transfer monitors of [7], which we call performance monitors here; the methodology flow of our framework is also described in [7]. The VLIW C source code is processed to extract profile monitoring data at different layers of the framework (shown in Figure 2).
An application is divided into functional sub-blocks and further into atomic blocks. Atomic blocks are independent of each other, and each performs only load/store, logical/arithmetic, or transfer operations. These transfer operations do not involve any conditional or logical operations. Conditional and unconditional branches are each considered one functional block. Following the static code analysis, E-C (energy-cycle) profiling is done at two layers of code. First, it is estimated for basic code blocks, each of which has the characteristics mentioned above; all sub-blocks are then assembled to restore their functional-level form, and the estimate is made again. Second, the inter-procedural effect is considered to account for implicit cache misses, which lead to off-chip traffic and hence to more CPU cycles and, inevitably, more energy consumption.
A pre-/post-compiler analysis is scheduled later to optimize usage of the VLIW architecture. The architectural description is provided as a generic text file following a predefined header that consists of CPU, memory, and on-chip cache signatures, e.g., a list of pseudo and machine operations, the latency of the operations, the opcodes, the slot assignment schemes, the processor operating frequency, and the instruction cache and main memory features.
For the optimization space search we use a genetic algorithm (GA). The application profile monitor capture session can be predefined in the user constraints as endless and then stopped manually. Additionally, the framework is equipped with automatic optimization termination, which adjusts the GA search space to suit the desired objectives.
Current is measured at the target platform (the setup is explained in [7]). All these parameters are fed back to the transformation cost analyzer. After each successive transformation, it is decided whether energy-cycle performance has improved. The source code is optimized by applying the code restructuring schemes known as loop unrolling, decision tree grafting, and loop tiling.
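The GA search with automatic termination described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's implementation: the bit-string encoding (one bit per candidate transformation), the toy fitness, the function name `run_ga`, and the stagnation-based stopping rule (`patience`) are all hypothetical choices made for the example.

```python
import random

def run_ga(fitness, n_genes, pop_size=20, max_generations=200,
           patience=15, mutation_rate=0.05, seed=0):
    """Maximize `fitness` over bit strings of length n_genes (n_genes >= 2).

    Assumed stopping rule: stop after `patience` generations without
    improvement, or after `max_generations`, whichever comes first.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    best_score = fitness(best)
    stale = 0
    generations = 0

    def pick():
        # Binary tournament selection: the fitter of two random individuals.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    while generations < max_generations and stale < patience:
        generations += 1
        children = [best[:]]  # elitism: carry the best individual forward
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randint(1, n_genes - 1)      # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]               # bit-flip mutation
            children.append(child)
        population = children
        candidate = max(population, key=fitness)
        score = fitness(candidate)
        if score > best_score:
            best, best_score, stale = candidate, score, 0
        else:
            stale += 1
    return best, best_score, generations

def toy_fitness(bits):
    # Hypothetical stand-in for a measured energy-cycle score.
    return sum(bits)

best, score, generations = run_ga(toy_fitness, n_genes=8)
```

In a real setting the fitness call would trigger a compile-and-measure step, so the number of evaluations, not the GA bookkeeping, dominates the cost; the stagnation rule is one simple way to bound it.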
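Two of the restructuring schemes named above, loop unrolling and loop tiling, can be illustrated with a short sketch. The framework applies such transformations to C source; Python is used here only to show how the iteration order changes while the result stays the same. The function names, the unroll factor of 4, and the tile size are assumptions for the example, and decision tree grafting is not shown.

```python
def dot_rolled(a, b):
    """Reference loop: one multiply-accumulate per iteration."""
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

def dot_unrolled4(a, b):
    """Same computation, unrolled by a factor of 4 with a cleanup loop."""
    n = len(a)
    total = 0
    i = 0
    while i + 4 <= n:
        # Four independent products per iteration expose more
        # instruction-level parallelism to a VLIW scheduler.
        total += (a[i] * b[i] + a[i + 1] * b[i + 1]
                  + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3])
        i += 4
    while i < n:  # remainder iterations
        total += a[i] * b[i]
        i += 1
    return total

def transpose_tiled(m, tile=2):
    """Matrix transpose visiting the data tile by tile (loop tiling)."""
    rows, cols = len(m), len(m[0])
    out = [[0] * rows for _ in range(cols)]
    for ii in range(0, rows, tile):
        for jj in range(0, cols, tile):
            # Work inside one tile so the touched data stays cache resident.
            for i in range(ii, min(ii + tile, rows)):
                for j in range(jj, min(jj + tile, cols)):
                    out[j][i] = m[i][j]
    return out
```

Both transformations preserve the result; only the cycle count and cache behavior are expected to change, and that is what the transformation cost analyzer would measure.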
Objective Function
For every point lying on the Pareto front in the transformation space, we seek to optimize both the CPU cycles for code execution and the energy consumption per process. To meet this objective, the architecture utilization of each successive transformation (in terms of functional units, internal register usage, and best cache fit) must be greater than a predefined, system-dependent limit (the execution cycle and energy threshold). In the same vein, the predecessor transformation scheme must overlap its successor so that the optimization proceeds smoothly. Smooth optimization over two samples of code is defined by the minimum and maximum limits of the transformed code: if the output profile of the code lies between these limits, the point lies on the smooth optimization curve.
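The notion of a Pareto front over the two objectives can be made concrete with a small sketch. It assumes each transformed code variant is summarized by a hypothetical pair (cycles, energy), both to be minimized; the sample values are invented for illustration and are not measurements from this work.

```python
def dominates(p, q):
    """True if p is no worse than q in both objectives and better in one."""
    return (p[0] <= q[0] and p[1] <= q[1]
            and (p[0] < q[0] or p[1] < q[1]))

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]

# Hypothetical (cycles, energy) profiles of five code variants.
variants = [(100, 5.0), (80, 6.0), (120, 4.0), (110, 5.5), (80, 7.0)]
front = pareto_front(variants)
```

Here `(110, 5.5)` is dropped because `(100, 5.0)` is better in both objectives, and `(80, 7.0)` is dropped because `(80, 6.0)` uses less energy for the same cycle count.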
Fitness Function and Optimization Methodology
The competing energy and cycle objectives are optimized by introducing the objective function as a single task, with all constraints modeled as linearly dependent on each other. This is a valid assumption because our results [6] show that architectural utilization, energy consumption, and CPU cycle efficiency can be approximated by linear relations. The desired aims of architecture-based energy-cycle optimization are formulated as penalty terms of this objective function. The objective function is maximized using a genetic algorithm (GA) [6, 12].
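One way to read the penalty formulation above is sketched below. The exact objective is not given in the text, so everything here is an assumption: the function name `penalized_fitness`, the use of utilization as the term to maximize, the linear penalties for exceeding the cycle and energy limits, and the weights.

```python
def penalized_fitness(utilization, cycles, energy,
                      cycle_limit, energy_limit,
                      w_cycles=1.0, w_energy=1.0):
    """Single-objective score to maximize (hypothetical form).

    utilization   -- fraction of architectural resources used, in [0, 1]
    cycles/energy -- measured cost of the candidate code variant
    *_limit       -- system-dependent thresholds; exceeding one is penalized
                     linearly, in proportion to the relative overshoot
    """
    cycle_penalty = w_cycles * max(0.0, cycles - cycle_limit) / cycle_limit
    energy_penalty = w_energy * max(0.0, energy - energy_limit) / energy_limit
    return utilization - cycle_penalty - energy_penalty

# A variant within both limits keeps its full utilization score;
# one that overshoots the cycle limit by 10% loses 0.1.
within = penalized_fitness(0.6, 900, 4.0, cycle_limit=1000, energy_limit=5.0)
over = penalized_fitness(0.6, 1100, 4.0, cycle_limit=1000, energy_limit=5.0)
```

Folding the constraints into one scalar lets a standard GA rank candidates directly, at the cost of having to choose the penalty weights.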
EXPERIMENTAL METHODS
The new method of application profile monitoring has been investigated with respect to different architectural attributes to show the variability of applications and hence the optimization potential for architecture-efficient code generation.
Benchmarks
Multimedia applications use DSP algorithms and streaming data schemes to compute and then deliver high throughput for real-time video or audio. The required throughput depends on the application domain; e.g., the bandwidth and frame rate of a typical MPEG-2 application differ between a mobile device and a set-top box. We chose the applications for their importance in real systems and to be representative enough to support the inferences in this study. The application set contains an MPEG-1 transcodec, an MPEG-2 transcodec, a G-728 transcodec, and generic DSP algorithms (iir, fir, dct, idct, etc.). We obtained code for these applications from various public-domain sources. For lack of space, we report only their names; details may be found on public websites.
RESULTS AND DISCUSSION
The main focus of this research is optimizing multimedia applications for an efficient energy-cycle response on handheld or mobile devices. Therefore, binary code for different multimedia application domains was executed on the same target platform [13]. Both energy consumption and cycle efficiency were successfully measured with respect to optimal architecture utilization. A major drawback of these performance indices, however, is their high architectural dependency. For brevity, we show optimization results only for the speech codec G-728enc, a high-bit-rate speech coding scheme based on the G.728 standard. It uses low-delay CELP (code-excited linear prediction) coding at 16 kbps, compressing each frame of five 16-bit samples into 10 bits. The code is inherently dominated by branches between its blocks, which restrains its effective use of the parallelism offered by the underlying hardware architecture. The low cache hit rate at the beginning leads to a higher number of execution cycles. An efficient search for appropriate loop fusion improves the code and hence the cache hit rate, which lowers the execution cycles shown in Figure 4.2. At first glance, cache hits improve by 78% while cycles are reduced by 25%; viewed together with Figure 4.1, however, a different picture emerges. At a 25% cycle reduction, only 40% of the benefit offered by architectural parallelism is utilized, while, as we said earlier, 20% more parallelism exists in the architecture that G-728enc can exploit. A similar conclusion can be drawn from Figure 4.3, which shows scheduling factor versus execution cycles. We obtain these results with the help of the GA; one of its salient features is that it does not get stuck in local minima and always approximates a good solution, rather than the optimal solution, based on multiple constraints (such as functional units, cache misses, scheduling factor, and the code's static profile).
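The G.728 frame figures quoted above can be checked with a few lines of arithmetic. The 8 kHz sampling rate is not stated in the text; it is assumed here as the narrowband telephony rate that G.728 operates at. The constant names are illustrative only.

```python
SAMPLE_RATE_HZ = 8000        # assumption: narrowband telephony sampling rate
SAMPLES_PER_FRAME = 5        # from the text: five samples per frame
BITS_PER_SAMPLE = 16         # from the text: 16-bit input samples
BITS_PER_CODED_FRAME = 10    # from the text: 10 bits per coded frame

frames_per_second = SAMPLE_RATE_HZ // SAMPLES_PER_FRAME
coded_bit_rate = frames_per_second * BITS_PER_CODED_FRAME   # bits per second
raw_bit_rate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE             # bits per second
compression_ratio = raw_bit_rate / coded_bit_rate
```

Under this assumption there are 1600 frames per second, which at 10 bits each gives the 16 kbps quoted above, and the 80 input bits per frame are reduced by a factor of 8.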
CONCLUDING REMARKS
In this paper, we address how multimedia applications can be optimized for a given hardware architecture by employing various techniques at the compiler level. We demonstrate that it is necessary to profile the application's expression at all layers to understand existing performance problems such as poor architecture usage, increased execution time, and high energy consumption. In particular, we presented how a GA can be used to converge to energy-cycle-efficient code generation. We demonstrate the proposed framework on a commercial multimedia application with a significant performance gain. In the case of the G-728 speech codec, carefully restructuring the code and the control flow in the source code can achieve energy savings of as much as 20% and cycle efficiency gains of up to 50%. More work needs to be done to accurately estimate the energy cost of software, including operating systems, API interfaces, and applications written in embedded C++ or embedded Java.
ACKNOWLEDGMENT

