Abstract-In this paper we address the architectural exploration of a VLIW processor for embedded systems from the energy/performance point of view. We also consider an architectural modification that extends the reference processor so that it can exploit both instruction level parallelism and thread level parallelism. A power model obtained by applying an instruction-level power estimation technique is presented and validated with experimental results. This power model was plugged into a parametric cycle-accurate simulator in order to support architectural exploration. Experimental results derived from the proposed framework compare different implementations of the reference processor: single and dual cluster implementations, and dual cluster with multithreaded extension.
I. INTRODUCTION
Power dissipation, which was previously considered an issue only in portable devices, is rapidly becoming a significant design constraint in many system designs. Since power estimation at low abstraction levels is very slow because of its high complexity, high level power estimation techniques are becoming more and more important.
In this paper we address the problem of the architectural exploration of an extended version of a VLIW processor from the power/performance point of view. The extension to traditional VLIW processors is the possibility to flexibly exploit parallelism at different levels: instruction level only, or instruction level and thread level jointly. This extension allows higher performance with respect to traditional VLIW processors, especially for multimedia applications. We also present a technique used to build a power model of different implementations of the target embedded processor for multimedia applications in order to make fast power estimation. One of the most attractive advantages offered by the adopted methodology is that it allows an early virtual prototyping of a modified target processor. In fact, using this technique we are able to estimate the power dissipated by a VLIW processor in single cluster, multicluster and multicluster/multithreaded versions using banking and interleaved management of the instruction cache. The solution has been validated with reference to an industrial core processor.
Accurate information about power consumption for a possible implementation of the reference architecture is used to set up the parameters for a model that can be used to estimate power consumption of other implementations of the same architecture. The model we derive is based on the decomposition of instruction-level power consumption in different components based on the microarchitectural blocks used by instructions. Estimation of new instructions may be obtained by summing the components due to the blocks activated by the instructions. These components are known for already implemented microarchitectural blocks, and must be otherwise estimated during the design of the new blocks.
In Section II we discuss some previous work in the relevant area, while in Section III we describe the reference architecture and some of its possible implementations. In Section IV we present our power consumption model. Experimental results are presented in Section V.
II. RELATED WORKS
With few exceptions, complex applications do not exhibit a constantly predominant type of parallelism throughout the set of procedures of which they are composed, and massive ILP can be found only in some segments of the application, leading to low usage of VLIW CPU resources. This fact limits the efficiency of highly parallel architectures when executing such an application and is the reason for the interest in flexible architectures that can exploit different types of parallelism in order to increase the overall performance. Some examples of this trend based on superscalar architectures are [1], [2], [3], [4].
When designing a new implementation of an embedded processor, power consumption is one of the key issues to be considered. In order to speed up the exploration of different architectural solutions, it is necessary to be able to quickly estimate the power budget for the processor. This is the reason why high level models for power dissipation are very useful, especially in the first phases of the design flow.
Since efficient techniques for power estimation at the highest levels of abstraction are of primary importance for successful system-level design, many solutions have been proposed in the literature [5], [6], [7], [8]. Many recent contributions consider the problem of raising the level of abstraction of the power estimation process. Some of these works have addressed the problem of power estimation for high performance microprocessors, where pipelining and instruction-level parallelism must be considered from the early phases of the design process. The pioneering works in this field are [9], [10], [11]. These papers propose an empirical approach based on physical measurements of the current drawn by the processor during the execution of embedded software routines, in order to build an instruction-level model of the energy dissipated by a target processor. Other approaches proposed in the literature for the power estimation of a microprocessor include Wattch [12] and SimplePower [13], two different frameworks for analyzing and optimizing microprocessor power dissipation at the architecture level.
In [14] the authors propose a methodology aimed at deriving the power model of a microprocessor knowing the power consumption of a particular subset of the instruction set, called learning set. Power consumption of characterized instructions is decomposed based on functionalities involved in the instruction execution, allowing the estimation of instructions not yet characterized.
III. THE REFERENCE ARCHITECTURE
The reference architecture was considered in three different implementations. The first one is the standard ST220 VLIW processor, the second one is a dual cluster implementation of the processor, and the third one is a modified version of the reference processor that can also exploit thread level parallelism.
The reference architecture is based on the ST200 VLIW cores family, jointly developed by STMicroelectronics and Hewlett-Packard Laboratories for embedded systems [15] [16]. The ST200 family ensures scalability and customizability. Scalability is granted by an architecture allowing an organization of the CPU over multiple "clusters", each provided with its own register file and load/store unit. Customizability can be attained with application-specific extensions to the instruction set, or with the introduction of ad-hoc clusters.
Since ST200 is a family of processors, in this section we have considered a possible instance of ST200, named ST220¹ and dedicated to the consumer market of DVD recorders [17].
The main features of a multicluster ST200 are illustrated in Fig. 1. It can be noted:
• an intercluster bus where communication between clusters is based on instructions that copy the content of a register from the register file of a cluster to the register file of another cluster;
• a shared unit for instruction fetch and issue, composed of an instruction issue unit and a first level instruction cache;
• a first-level data cache for each cluster.
¹ Other implementations of ST200 may differ in many aspects from the one we are considering, so the presented architecture is not meant to be a complete description of the ST200 processor family. Some of the possible differences among ST200 implementations are the maximum number of clusters allowed, first-level cache organization, number and type of functional units present in each cluster, and issue width for each cluster.
The microarchitecture of a typical ST200 cluster is depicted in figure 2 and is composed of:
• four integer ALUs;
• two 32x32 bits multipliers;
• one load/store unit;
• one register file with 64 general purpose 32-bit registers;
• one register file with 8 1-bit branch registers.
Since the register file has eight read ports and four write ports, each cluster can issue up to four operations in a single cycle. The instruction level parallelism degree of a multicluster ST200 is therefore four times the number of clusters.
The multithreaded version of the reference processor is described in [18] and allows each cluster to execute a different thread. Since this approach is very scalable, we will consider for simplicity a processor that is composed of two clusters. The processor can switch at run time between two different computational models:
• ILP mode: all clusters execute one single thread using long instructions (bundles) composed of up to 2M possible operations (where M is the issue width);
• MT mode: each cluster executes a different thread using bundles composed of up to M operations.
This implementation mainly differs from the standard ST220 processor in the presence of a branch unit per cluster and in the different design of the instruction issue unit. Different organizations are being considered as far as instruction cache organization and instruction issue unit design are concerned.
The first organization is based on an interleaved instruction cache, composed of eight banks, with one read port. In this case, at each cycle we fetch eight operations from one thread and store them in a buffer; at the following cycle we fetch eight operations from the other thread. This allows the execution at each clock cycle of four operations per thread.
The alternative organization provides a partitioning of the banks composing the instruction cache, allowing us to read in parallel four instructions per thread, provided that the instructions are in different banks. When this condition does not hold, we must stall the execution of a cluster to resolve the conflict. This solution introduces stall cycles due to conflicts, but has a lower branch penalty with respect to the former.
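The trade-off of the banked organization can be illustrated with a small Python sketch that counts conflict cycles when two threads fetch from a banked instruction cache. It is a simplified illustration and not the simulator used in this work: the mapping of an operation address to a bank (address modulo number of banks), the function names, and the fetch addresses in the example are all assumptions.

```python
from typing import List, Set

OPS_PER_THREAD = 4  # operations fetched per thread per cycle (one cluster's issue width)


def banks_touched(start_op: int, num_banks: int) -> Set[int]:
    """Banks accessed when fetching OPS_PER_THREAD consecutive operations.

    Assumption: the operation at address i is stored in bank i % num_banks.
    """
    return {(start_op + i) % num_banks for i in range(OPS_PER_THREAD)}


def count_conflict_stalls(fetch_a: List[int], fetch_b: List[int], num_banks: int) -> int:
    """Count cycles in which the two threads need at least one common bank.

    Assumption: each such cycle costs one stall for one cluster; the retry
    of the stalled fetch is not modeled.
    """
    stalls = 0
    for a, b in zip(fetch_a, fetch_b):
        if banks_touched(a, num_banks) & banks_touched(b, num_banks):
            stalls += 1
    return stalls


# Hypothetical fetch start addresses (in operations) for the two threads.
thread_a = [0, 4, 8, 12]
thread_b = [2, 18, 40, 100]

stalls_8 = count_conflict_stalls(thread_a, thread_b, num_banks=8)
stalls_64 = count_conflict_stalls(thread_a, thread_b, num_banks=64)
```

Under these assumptions the same address streams collide less often as the number of banks grows, which is the qualitative behavior expected from a banked organization.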
IV. POWER MODELING METHODOLOGY
The goal of obtaining a fast evaluation of the energy performance of the modified architecture was attained by creating a power model based on a functional decomposition of the processor activity while it is running a particular instruction. The energy consumed by each instruction in each part of the pipeline may be determined by dividing the total energy consumed by the instruction into a set of disjoint components.
The first step is to identify a functional decomposition for the instruction set in order to build an n × k matrix A, where n is the number of instructions and k is the number of the identified functionalities. The functionalities considered as sources of energy consumption include:
• Fetching and decoding the instruction;
• Reading from a register file;
• Writing to a register file;
• Using a given part of an ALU (adder, shifter, . . . ).
Element $a_{i,j}$ of the matrix A, called activation coefficient, will be 1 if the functionality j is used by instruction i. For example, if we consider the functionalities shown before, for each instruction the elements corresponding to Fetch and Decode are equal to one.
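As a minimal sketch of this first step, the following Python fragment builds an activation matrix from a table of functionalities used by each instruction. The instruction names and the mapping from instructions to functionalities are hypothetical and chosen only for illustration; they are not the decomposition used for the ST200.

```python
# Columns of A: the identified functionalities (hypothetical selection).
FUNCTIONALITIES = ["fetch_decode", "rf_read", "rf_write", "alu_add", "alu_shift"]

# Rows of A: hypothetical instructions and the functionalities each one activates.
INSTRUCTION_USES = {
    "add": {"fetch_decode", "rf_read", "rf_write", "alu_add"},
    "shl": {"fetch_decode", "rf_read", "rf_write", "alu_shift"},
    "nop": {"fetch_decode"},
}


def build_activation_matrix(uses, functionalities):
    """Return A as a list of rows, with a[i][j] = 1 if instruction i uses functionality j."""
    return [
        [1 if f in uses[instr] else 0 for f in functionalities]
        for instr in uses
    ]


A = build_activation_matrix(INSTRUCTION_USES, FUNCTIONALITIES)
```

Every row has a 1 in the fetch and decode column, matching the observation that all instructions activate that functionality.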
More in detail the model we derived is formalized as follows.
Let $k$ be the number of identified functionalities and let $n_L > k$ be the number of power-characterized instructions in the learning set $S_L$. Then, let $A_L$ be the $n_L \times k$ matrix whose entries are the activation coefficients $a^{L}_{s,j}$ relative to the selected learning set, $P_F$ be the $k \times 1$ column vector whose entries are the unknown power consumptions $p_{f,j}$, and $P_L$ be the $n_L \times 1$ column vector whose elements $p_{L,i}$ are the known values of power dissipated by the instructions in the learning set. The linear system:
$$A_L P_F = P_L$$
represents the available knowledge on the variables that have to be estimated. Let $\hat{p}_{L,i}$ be an estimate of $p_{L,i}$ and $\hat{P}_F$ be an estimate of the real parameters $P_F$. The minimization of the square error $\| P_L - \hat{P}_L \|^2$, with $\hat{P}_L = A_L \hat{P}_F$, yields:
$$\hat{P}_F = (A_L^T A_L)^{-1} A_L^T P_L$$
To estimate the model parameters $P_F$, the columns of matrix $A_L$ must be linearly independent, otherwise the problem has infinitely many solutions and the model is not identifiable with respect to the measurements available.
In our case the learning set consists of the power-characterized instructions of the single-cluster ISA. We obtain the power model of the single-cluster and multi-cluster multi-threaded architectures presented in Section III by applying our methodology with some modifications.
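The least-squares estimation of the functionality powers can be sketched in plain Python by solving the normal equations. This is an illustrative implementation under stated assumptions, not the tool used in this work: the function names, the three functionalities and the power values of the learning set are hypothetical.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(m, rhs):
    """Solve m x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [[float(v) for v in m[i]] + [float(rhs[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            # Singular normal matrix: the columns of A_L are linearly dependent.
            raise ValueError("model not identifiable: dependent columns in A_L")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def estimate_functional_power(A_L, P_L):
    """Least-squares estimate of P_F from the normal equations (A_L^T A_L) P_F = A_L^T P_L."""
    At = transpose(A_L)
    AtA = matmul(At, A_L)
    AtP = [sum(a * p for a, p in zip(row, P_L)) for row in At]
    return solve(AtA, AtP)


# Hypothetical learning set: 4 instructions, 3 functionalities
# (fetch/decode, register file access, ALU), power in arbitrary units.
A_L = [[1, 0, 0], [1, 1, 0], [1, 1, 1], [1, 0, 1]]
P_L = [10.0, 14.0, 20.0, 16.0]
P_F_hat = estimate_functional_power(A_L, P_L)
```

For larger learning sets a numerical library routine such as `numpy.linalg.lstsq` would normally be preferred over explicitly forming the normal equations, which can be poorly conditioned.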
To reach our goal we add in the second step three functionalities not derivable from the basic instruction set: multicluster and multithreaded fetch and decode, and intercluster communication. The power consumed by these functionalities is estimated using the techniques presented in [6] and is stored in the column vector $P_{NF}$. Now we can define a new $n_I \times k_I$ matrix $A_M$, where $n_I$ is the number of instructions in the instruction set of the target processor while $k_I$ is the number of disjoint functionalities obtained by concatenating the original ones with the new ones. The power model of the target processor, $P_T$, is obtained as follows:
$$P_T = A_M \begin{bmatrix} \hat{P}_F \\ P_{NF} \end{bmatrix}$$
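A minimal Python sketch of this second step is given below: the per-instruction power of the target processor is computed by multiplying the extended activation matrix by the concatenation of the estimated and the newly added functionality powers. All numerical values, the ordering of the columns and the two example instructions are hypothetical, and the sketch assumes that the new functionalities simply add their contribution to the original ones.

```python
def target_instruction_power(A_M, P_F, P_NF):
    """Return P_T, one power value per row of A_M.

    A_M  : n_I x k_I activation matrix of the target processor
    P_F  : powers of the original functionalities (estimated from the learning set)
    P_NF : powers of the new functionalities (estimated separately)
    """
    p_all = list(P_F) + list(P_NF)  # concatenated k_I x 1 vector
    for row in A_M:
        if len(row) != len(p_all):
            raise ValueError("each row of A_M must have k_I entries")
    return [sum(a * p for a, p in zip(row, p_all)) for row in A_M]


# Hypothetical values in arbitrary units.
P_F = [10.0, 4.0, 6.0]   # fetch/decode, register file access, ALU
P_NF = [3.0, 2.0, 5.0]   # multicluster fetch/decode, multithreaded fetch/decode,
                         # intercluster communication

A_M = [
    [1, 1, 1, 1, 0, 0],  # hypothetical ALU operation issued in multicluster mode
    [1, 1, 0, 1, 0, 1],  # hypothetical intercluster register copy
]
P_T = target_instruction_power(A_M, P_F, P_NF)
```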
For on-chip L1 caches, we used an analytical energy model, derived from [19] , that accounts for: (a) Technological parameters (such as capacitances and power supplies); (b) Architectural parameters (such as block size and cache size); (c) Switching activity parameters (such as number of bit line transitions). The switching activity parameters in the cache energy models have been computed by directly importing the actual values of hit/miss rates and bit transitions on cache components, that have been derived by the system-level simulation environment to account for the actual profiling information depending on the execution of embedded applications.
V. EXPERIMENTAL RESULTS
In this section we present the results obtained by applying the power model, derived with the proposed methodology, to the reference architecture. These results were used to add, to an existing cycle-accurate simulator for the reference architecture [20] [21], a module that keeps track of power consumption cycle by cycle. This module may be used either to estimate the average power consumption during the execution of the simulated program, or to dump the power consumption cycle by cycle. In this section we show both kinds of results: first, the performance and the average power consumption for a set of benchmarks (see Table I); then, the instantaneous power consumption during the execution of a benchmark (AES) in multithreaded mode. All the power results refer to STMicroelectronics 0.18µm technology.
[Table I, benchmark set (fragment): image conversion from jpeg to ppm; Matrix: matrix multiplication.]
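The two uses of the power-tracking module, average power and cycle-by-cycle dump, can be sketched as follows. This is only an illustration of the idea: the class name, its interface and the per-operation power values are hypothetical, and the sketch assumes that the power of a cycle is the sum of the powers of the operations issued in that cycle.

```python
class PowerTracker:
    """Accumulates a per-cycle power trace from the operations issued each cycle."""

    def __init__(self, op_power):
        self.op_power = op_power  # operation name -> power (arbitrary units)
        self.trace = []           # one entry per simulated cycle

    def record_cycle(self, ops):
        """Record one cycle given the list of operations issued in it."""
        self.trace.append(sum(self.op_power[op] for op in ops))

    def average_power(self):
        """Average power over the simulated cycles (0.0 if nothing was recorded)."""
        return sum(self.trace) / len(self.trace) if self.trace else 0.0


# Hypothetical per-operation power values and a three-cycle execution.
tracker = PowerTracker({"add": 20.0, "mov": 14.0, "nop": 10.0})
tracker.record_cycle(["add", "mov"])
tracker.record_cycle(["nop"])
tracker.record_cycle(["add", "add", "mov", "nop"])

cycle_dump = tracker.trace          # cycle-by-cycle power
average = tracker.average_power()   # average power over the run
```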
The energy model was validated by comparing our results for a single cluster architecture with the results obtained in [22], which reports an error with a standard deviation of less than 5%. Validation was performed using some benchmarks chosen among different benchmark suites such as Mediabench [23] and MiBench [24]. The proposed model has a good accuracy with respect to the reference results: the maximum error was lower than 1%.
Table II presents the simulation results obtained by applying our framework to different benchmarks. The results are shown for all the implementations of the VLIW architecture presented in Section III in terms of simulation cycles (Delay), energy consumption (Energy) and energy-delay product (E*D). The multithreaded version of the processor was considered with either interleaved instruction cache (MT-I) or banked instruction cache. For the latter case, we considered different numbers of banks from 8 to 64 (MT-Bx, where x is the number of banks of the instruction cache). All results are compared with the original dual cluster architecture.
The multithreaded version, in each of its different organizations, always performs better than the original dual cluster architecture in terms of clock cycles. As far as performance is concerned, the best instruction cache organization is a banked memory with 64 banks. In fact, in Table II it can be noted that, as the number of banks of the banked-MT architecture increases, the performance increases too. This is due to the fact that increasing the number of banks decreases the number of access conflicts between threads.
From the energy consumption point of view, most of the benchmarks show that the multithreaded architectures dissipate more than the original dual cluster one. As we expected, the architecture with 64 banks is the most expensive. This is a very common situation: increasing the performance of a processor very often leads to an increase in the energy consumption. This is the reason why embedded processors for multimedia applications are often evaluated in terms of energy delay product.
When we consider the energy delay product, the multithreaded version of the processor performs better than the original dual cluster processor for all benchmarks but AES. The best instruction cache organization, as far as the energy delay product is concerned, varies from benchmark to benchmark.
In figure 3 we show the decomposition of the energy consumption for the selected set of benchmarks in three components: core, instruction cache and data cache. Observing all the subfigures, it can be noted that, as we expected, the single cluster dissipates less energy than the other implementations, especially for the core. When increasing the number of banks for the MT architecture, it can be observed that typically the energy consumed by the core decreases because the number of conflicts decreases. On the other hand, a different trend can be observed for the energy dissipated by the instruction cache with respect to the number of banks. Finally, it can be noted that the energy consumed by the data cache is quite similar for all the architectural implementations.
Figure 4 shows the instantaneous power consumption of the multithreaded architecture with an interleaved instruction cache when executing the AES benchmark. It is possible to notice that when the processor is in MT mode it dissipates more power than in ILP mode. As stated above, this is mostly due to the higher resource usage that stems from the better exploitation of parallelism.
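For clarity, the way the energy-delay product is derived from Delay and Energy and compared against the dual cluster baseline can be sketched in Python as follows. The numbers in the example are made up for illustration and are not taken from Table II; the function and configuration names are likewise assumptions.

```python
def energy_delay_product(energy, delay_cycles):
    """E*D metric: energy consumption multiplied by the number of simulation cycles."""
    return energy * delay_cycles


def normalize_to_baseline(results, baseline):
    """Return E*D of each configuration divided by the E*D of the baseline.

    results : configuration name -> (energy, delay_cycles)
    Values below 1.0 mean a better (lower) E*D than the baseline.
    """
    base_energy, base_delay = results[baseline]
    base_edp = energy_delay_product(base_energy, base_delay)
    return {
        name: energy_delay_product(energy, delay) / base_edp
        for name, (energy, delay) in results.items()
    }


# Hypothetical results (energy in arbitrary units, delay in cycles).
results = {
    "dual": (50.0, 1000),   # original dual cluster baseline
    "MT-I": (55.0, 800),    # multithreaded, interleaved instruction cache
}
relative_edp = normalize_to_baseline(results, baseline="dual")
```

In this made-up example the multithreaded configuration consumes more energy but finishes earlier, so its E*D relative to the baseline is below one.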
VI. CONCLUSION AND FUTURE WORK
In this paper we addressed the problem of the exploration of an extended version of a VLIW processor for embedded systems from the power/performance point of view. We also showed a technique used to build a power model in order to make fast power estimation of different implementations of the target VLIW processor, the ST200. One of the most attractive advantages offered by the adopted methodology is that it allows an early virtual prototyping of a modified target processor. Using this technique we are able to estimate the power dissipated by a VLIW processor in single cluster, multicluster and multicluster/multithreaded versions. The use of the implemented power framework for design space exploration and its integration in a multiprocessor system on chip are two future directions.
