It is envisioned that future system-on-chip hardware platform designs will be based on reuse of a customizable processor core. Consequently, being 
Introduction
With growing complexity, System-on-Chip (SoC) developers have to resort to platform-based design. One of the key components in SoC hardware platform design is a customizable processor core. System-level design methodologies often rely on logic synthesis of reusable blocks. Because synthesis can drastically alter the physical-level design, this approach makes physical-level performance metrics such as clock frequency, power dissipation, and device area hard to predict. It therefore becomes essential to be able to evaluate performance metrics quickly at different design points, while recognizing the design space limits of the block at hand.
Accurate knowledge of the actual design, architecture, and technology parameters is needed to realize the benefits of early design phase performance estimation models. Inaccurate parameters are likely to cause significant errors in the estimation results. The existing processor performance estimation models yield good results if used for certain types of designs with similar architecture, logic style, and technology generation. These models include SUSPENS by Bakoglu [1], the models of Sai-Halasz and Mii [2] [3], BACPAC [4], and RIPE [5]. They are commonly based on a globally defined clock cycle time and a random logic representation of sub-blocks. Average on-chip wire length is estimated using Donath's wiring statistics [6], and organization is described by the well-known Rent's rule [7] using only one global Rent's exponent. The rule states that if a block contains $N_{gates}$ logic gates, has $\#I/O$ input and output pins/connections between the block and its environment, and an individual gate/instance inside the block has on average $K_p$ I/O connections, then these variables are linked by the following equation [7]:

$$\#I/O = K_p \, N_{gates}^{\,p} \qquad (1)$$

where $K_p$ is Rent's constant and $p$ is Rent's exponent, which is derived experimentally.
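As a minimal sketch of how Rent's rule can be used in both directions, the Python fragment below evaluates the rule in its usual form, #I/O = K_p · N_gates^p, and solves it for p when the gate count, I/O count, and Rent's constant are known. The function names and the numeric inputs are illustrative assumptions, not values from this study.

```python
import math


def rent_io(n_gates: int, k_p: float, p: float) -> float:
    """Rent's rule: expected number of block I/O connections."""
    return k_p * n_gates ** p


def rent_exponent(n_gates: int, n_io: float, k_p: float) -> float:
    """Solve Rent's rule for p: p = ln(#I/O / K_p) / ln(N_gates)."""
    if n_gates <= 1 or n_io <= 0 or k_p <= 0:
        raise ValueError("need n_gates > 1, n_io > 0 and k_p > 0")
    return math.log(n_io / k_p) / math.log(n_gates)


# Hypothetical block (assumed numbers): 10 000 gates, 200 I/O pins,
# and an average of 3 connections per gate.
p = rent_exponent(10_000, 200, 3.0)
print(f"extracted Rent's exponent: {p:.3f}")
print(f"I/O count recovered from the rule: {rent_io(10_000, 3.0, p):.1f}")
```

Solving for p in this way only requires quantities that a synthesis report typically exposes, which is what makes a block-wise extraction practical.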
If a logic synthesis tool is used to generate varying block implementations, a common Rent's exponent fails to accurately predict performance characteristics. In order to solve this problem we have extracted Rent's exponent as a function of delay separately for each block, and developed a method for estimating block performance characteristics when the delay varies. The analysis was made for XIRISC [8] , an extensible reduced instruction set computer (RISC) developed at the University of Bologna.
The Extensible RISC Core
Hardware platforms for SoC development need to be customizable through either a reconfigurable interconnection scheme or programmable functionality. The major shortcomings of reconfigurable interconnections are unpredictable delays caused by variable-length interconnections, high implementation costs due to inevitably complex communications, slow reconfiguration, and, possibly, the need to adopt new data-transport-oriented software languages. Approaching SoC platform design from the programmable functionality point of view pushes complexity towards less expensive implementation features and allows the use of legacy software. Customizable functionality can be achieved at low cost by using an extensible RISC processor core like XIRISC as a basic building block. Extensible means that the instruction set architecture (ISA) is easily modifiable for specific needs, and thus the hardware resources can be selected accordingly.
XIRISC is provided as an open source synthesizable Very High Speed Integrated Circuit Hardware Description Language (VHDL) description accompanied by a GNU Compiler Collection (GCC) based C/C++ software development environment. Modifications to the ISA and the word widths used can be easily made through single VHDL files. XIRISC is based on the Harvard architecture with five pipeline stages, as depicted in figure 1. The Harvard architecture is a processor architecture that has separate buses for instruction and data memories. In addition to the traditional single data path implementation, a double data path Very Long Instruction Word (VLIW) implementation is supported. The top-level block structure of the XIRISC core consists of roughly seven parts. The System Control Coprocessor (SCC) is an entity responsible for handling exceptions and interrupts, maintaining processor state, and controlling context switches. The remaining control logic consists of address generation, program counter (PC) value calculation, instruction decoding, and hazard handling. Data memory is accessed in a load-store fashion using the register file (RF) to store the operands and results of operations. Selectable Functional Units (FUs) include an Arithmetic Logic Unit (ALU), a shifter, a multiplier, Multiply and ACcumulate (MAC) logic, and even a divider, which is not considered in this study. The multiplier can be implemented as a single-cycle device producing results with the same precision as the operands, or as a two-cycle device producing full-precision results (double the width of the operands). This two-cycle device implements the multiply part of the MAC logic.
Performance of XIRISC can be significantly affected by modifications to the ISA, data word width, or block-level architecture. Because the core is described at a high abstraction level, the constraints given to the synthesis tool also have a dramatic effect. A typical 32-bit implementation of the XIRISC core on a 0.18 µm CMOS process occupies less than half a square millimeter of silicon area, dissipates approximately 0.4 watts of power, and operates at or above 100 MHz.
The scope of this study had to be tightly restricted due to the extensiveness of the XIRISC design space. To cover as wide design space as possible, focus was concentrated on the block level architecture and a simple system level case study. When the extensibility of XIRISC is exploited and modifications to the ISA are made, the set of needed FUs may change while the design space of a specific FU may not. Of course, modifying the ISA results in altered performance metrics for the control logic, and may affect the SCC. The scope was further refined to block-level design space exploration of 32-bit implementations emphasizing timing constraints. The standard ISA of the XIRISC distribution was not modified for this study.
Rent's Exponent Extraction Methodology
To predict system-level performance metrics early in the design cycle without executing many time-consuming synthesis runs, we have developed a method that links synthesis runs with given timing constraints to an early estimation method. The estimation method needs only a few organizational parameters, so the performance metrics of a processor can be estimated accurately enough in the early design phase without knowing too many details of the design itself. The key component in early estimation analysis is Rent's rule [7], presented in the first section. In this paper we have used two organizational parameters: Rent's constant ($K_p$) and Rent's exponent ($p$).
A systematic workflow was used in our method to derive Rent's exponent p and a regression curve for the exponent as a function of delay. First, one block was taken under examination. Using (1), Rent's exponent p was calculated from the synthesized values. Rent's exponent p was then plotted as a function of delay, and a regression analysis was made for the curve. We decided to use linear regression for curve fitting, although some of the curves appear to follow a polynomial rather than a linear expression. The regression process is explained in more detail in section 5.
Block-Level Design Space Exploration
Before the actual design space exploration of individual blocks, a technology exploration was carried out. The focus was on performance metric deviation with changes in operating environment. Environment parameters include the actual operating voltage and chip temperature. For a best-case environment, assumed operating voltage is higher than nominal and chip temperature is at or below freezing. For a worst-case environment, assumed operating voltage is lower than nominal and chip temperature is somewhere around boiling point. Different 0.18 µm static CMOS processes with either 1.8 V or 1.3 V nominal operating voltage were examined using the XIRISC multiplier as a case study.
It was observed that the multiplier delay degraded from the best-case environment to the worst-case environment by a factor ranging from 2.3 to 3.5 for the 1.8 V nominal voltage, and by a factor of up to 4.2 for the 1.3 V nominal voltage. Power consumption is also affected by the lower assumed operating voltage and by the higher resistance that follows from the higher assumed temperature. In this case study, power consumption decreased by a factor ranging from 1.6 to 1.7 for the 1.8 V nominal voltage, and by a factor ranging from 1.8 to 1.9 for the 1.3 V nominal voltage. The differences between technologies with the same nominal voltage are caused by slight variations in the characterized best-case and worst-case environments, in addition to differences in sensitivity to environmental changes. The higher factors for the 1.3 V nominal voltage are due to a higher relative deviation between the best-case and worst-case operating voltage assumptions compared to the 1.8 V nominal voltage.
Nominal voltage of 1.8 V and best-case operating environment were chosen for the block syntheses. This yields the highest power consumption and lowest delay. Hence, the delay figures from block syntheses should be used for comparison purposes only, not to determine the actual clock frequency as the assumed environment parameters are not realistic in the vast majority of operating environments. Referring to the technology exploration results, a rough estimation of the delay in a typical environment could be obtained by multiplying the best-case delay by two.
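The rule of thumb above can be written down as a small Python sketch: a best-case synthesis delay is multiplied by a derating factor of two to approximate a typical environment, and the result is converted to a clock frequency. The function names and the 4.0 ns input are hypothetical; only the factor of two comes from the text.

```python
def typical_delay_ns(best_case_delay_ns: float, derating: float = 2.0) -> float:
    """Rough typical-environment delay: best-case delay times a derating factor.

    The default factor of 2 is the rough estimate suggested by the
    technology exploration; other factors are the caller's assumption.
    """
    if best_case_delay_ns <= 0 or derating <= 0:
        raise ValueError("delay and derating factor must be positive")
    return best_case_delay_ns * derating


def clock_mhz(delay_ns: float) -> float:
    """Clock frequency in MHz for a cycle time given in nanoseconds."""
    return 1000.0 / delay_ns


# Hypothetical best-case block delay of 4.0 ns (not a figure from the tables).
typical = typical_delay_ns(4.0)
print(f"typical delay: {typical:.1f} ns, about {clock_mhz(typical):.0f} MHz")
```

Passing a derating factor between 2.3 and 3.5 instead would approximate the worst-case environment observed for the 1.8 V processes.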
The design space exploration was realized by synthesizing the individual blocks using seven different delay constraints for each. In addition to that, synthesis runs for minimized power consumption and silicon area were performed. Initial synthesis runs for all blocks were executed with absolutely no constraints set for the synthesis tool. For the following six runs, the delay constraint was lowered in equal sized steps. The magnitude of these steps was chosen to be 15 percent of the delay figure obtained from the initial synthesis run. All of the synthesis runs were performed with highest possible mapping effort, i.e. maximum number of heuristic optimization cycles. Optimization over subblock boundaries was allowed for the synthesis tool.
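The constraint schedule described above can be sketched in Python as follows. The unconstrained initial run defines the 100% point, and each of the six following runs tightens the constraint by 15 percent of that initial delay. The function name and the 10.0 ns initial delay are illustrative assumptions.

```python
def delay_constraints(initial_delay_ns: float, steps: int = 6, step_percent: int = 15):
    """Return (percent of initial delay, constraint in ns) for every run.

    The first entry (100%) stands for the unconstrained initial run;
    each following entry lowers the constraint by `step_percent` of
    the initial delay.
    """
    if steps * step_percent >= 100:
        raise ValueError("schedule would reach a zero or negative constraint")
    schedule = []
    for k in range(steps + 1):
        percent = 100 - k * step_percent
        schedule.append((percent, initial_delay_ns * percent / 100))
    return schedule


# Hypothetical initial (unconstrained) delay of 10.0 ns.
for percent, constraint_ns in delay_constraints(10.0):
    print(f"{percent:3d}% of initial delay -> {constraint_ns:.2f} ns")
```

Working with integer percentages keeps the column labels exact and avoids accumulating floating-point error across steps.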
XIRISC block synthesis results are gathered together into tables 1, 2, and 3 presenting delay, area, and power dissipation figures respectively. The figures have been evaluated from the gate level designs by the synthesis tool. Performance metrics obtained from the initial runs without constraints are organized under the label "initial result". The following column labels describe the magnitude of the given delay constraint as a percentage of the delay figure in the "initial result" column. The results for power consumption and silicon area minimization runs are given in the last two columns. Power dissipation figures in table 3 are normalized for a clock frequency of 100 MHz to ease comparison between implementations. When interpreting the power consumption figures of individual blocks in table 3, one has to bear in mind that these figures are dominated by internal power consumption of logic cells inside the block, whereas the overall power consumption of the processor core is mostly determined by interconnect switching power.
Some general synthesis features are evident in the results. Because area is the highest priority parameter in design optimization, forcing the synthesis tool to minimize occupied silicon area yields practically the same results as giving no constraints at all. This can be verified by comparing the "initial result" and "minimum area" columns of tables 1, 2, and 3. Minimum area implementations are generally considered to be close to minimum power implementations. In this study, the observed difference in power consumption between the minimum area and minimum power implementations ranged from 15% to 26%, averaging 19%. The area penalty due to minimization of power consumption ranged from 8% to 30%, averaging 22%, while the delay penalty ranged from 0% to 27%, averaging 13%. These figures exclude the RF, which exhibited a 51% increase in delay and an 11% increase in area for a power consumption drop of 13%. The exceptional behavior of the RF is caused by the dominance of a single standard cell, the flip-flop register, which has only a few discrete implementations. The results also indicate that considerable differences in performance metrics are incurred by synthesis alone. The lowest delay figures differ from the highest by a factor ranging from 2.5 to 5.8, averaging 3.7. The corresponding factor for area figures ranges from 1.3 to 3.5, averaging 2.3, and for power consumption figures from 1.6 to 3.9, averaging 2.5.
Rent's Exponents and Regression Analysis
The values of Rent's exponent p found in the literature very often refer to a specific design case implemented with specific technology and specific logic design style. Hence, there is a need to derive Rent's exponent for individual blocks separately. Because in the logic synthesis process a CAD tool changes the organization and type of standard cell components according to various delay, area and power consumption constraints, we need to define our own Rent's exponent for each individual block as a function of a specific constraint. In this paper we have used only seven different delay constraints as explained earlier in section 4. We have used synthesis results to define a specific Rent's exponent p for each block used in our XIRISC processor case study as a function of delay constraint.
Linear regression analysis is made for the variation of Rent's exponent as a function of the delay constraint [9]. The error function is given by

$$J = \frac{1}{2N} \sum_{i=1}^{N} \left( d_i - (w x_i + b) \right)^2 \qquad (2)$$

where $N$ is the number of design points, $d_i$ is the real exponent value, and $x_i$ is here the delay constraint. This is a mean square error (MSE) function. Our aim is to minimize the error function by setting the partial derivatives of $J$ with respect to an intercept value $b$ and a slope $w$ to zero. After doing this, the following equations are derived [9]:

$$b = \frac{\sum_i x_i^2 \sum_i d_i - \sum_i x_i \sum_i x_i d_i}{N \left[ \sum_i (x_i - x_{avg})^2 \right]}$$

$$w = \frac{\sum_i (x_i - x_{avg})(d_i - d_{avg})}{\sum_i (x_i - x_{avg})^2}$$

where $x_i$ is here the delay constraint and $d_i$ is the real exponent value. Symbols $x_{avg}$ and $d_{avg}$ describe the average (mean) values of the delay constraint and Rent's exponent, respectively.

Additionally, there are two separately optimized cases for deriving Rent's exponent: the area-optimized case and the power consumption-optimized case. For both cases we use the information received from the synthesis report and assume that a standard 2-input NAND gate with normal drive strength, a fan-out of 2, and an input rise time of 17 ps represents an average gate. In the first case we use the area value extracted from the synthesis report, and in the latter case we use the power consumption value from the report. These two cases give slightly different Rent's exponent values and also different regression lines, which can then be applied separately in area and power consumption estimation. Note that these values and regression lines apply only to this specific design style (static CMOS gates) and this specific technology (0.18 µm in our case). The advantage is that we can vary the block delays as long as we stay inside the boundary values. Thus the analysis presented here applies equally to a totally synchronous (Sync), totally asynchronous (Async), or globally asynchronous, locally synchronous (GALS) design scheme of future SoC platforms. Tables 4 and 5 give the real area-optimized and power-optimized Rent's exponents calculated by using Rent's rule and the synthesis reports for the blocks. Those values are then used in linear regression analysis, which finally yields a formula for Rent's exponent as a linear function of the delay constraint. Some linear graphs are presented in section 6.
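The least-squares fit described above can be sketched in a few lines of Python. The slope is the ratio of the centered cross-products to the centered squares of the delay constraint, and the intercept is written in the algebraically equivalent form b = d_avg - w · x_avg. The function name and the sample data are illustrative assumptions, not values from tables 4 and 5.

```python
def fit_line(x, d):
    """Closed-form least-squares fit d ~ w * x + b.

    x: delay constraints, d: extracted Rent's exponents.
    Returns (w, b), i.e. slope and intercept.
    """
    n = len(x)
    if n < 2 or n != len(d):
        raise ValueError("need two or more points and equally long inputs")
    x_avg = sum(x) / n
    d_avg = sum(d) / n
    sxx = sum((xi - x_avg) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all delay constraints are identical")
    w = sum((xi - x_avg) * (di - d_avg) for xi, di in zip(x, d)) / sxx
    b = d_avg - w * x_avg  # the fitted line passes through the mean point
    return w, b


# Hypothetical design points lying exactly on p = 0.6 - 0.02 * delay.
delays_ns = [10.0, 8.5, 7.0, 5.5]
exponents = [0.40, 0.43, 0.46, 0.49]
w, b = fit_line(delays_ns, exponents)
print(f"slope w = {w:.4f}, intercept b = {b:.4f}")
```

All points receive equal weight in this form, so a cluster of design points near the minimum delay pulls the line towards that region.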
Analysis of the Estimation Method
After performing linear regression for calculated Rent's exponent values of each block, the obtained regression line can be used to estimate the value of Rent's exponent as a function of delay. Figures 2 and 3 illustrate the regression lines derived for the area-optimized Rent's exponent values of the ALU and RF blocks respectively. Figures 4 and 5 depict regression lines for the power consumption-optimized Rent's exponent values of the control logic and two-cycle multiplier. An arrow is used to mark the point of greatest deviation from the regression line. For all blocks, the curve shape is very similar for the area-optimized and power consumption-optimized Rent's exponents. In the regression procedure, all design points have been given equal weight. In general, there are more evaluated design points in the neighborhood of the minimum achievable delay due to the synthesis procedure. Hence, the designs close to the minimum delay have higher weight in the derivation of the regression line.
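To make the "point of greatest deviation" concrete, the following Python sketch takes a fitted line and a set of design points, finds the point with the largest absolute residual, and reports the relative error of the regression estimate at that point. The function name, the line coefficients, and the data are hypothetical and serve only to illustrate the procedure.

```python
def greatest_deviation(x, d, w, b):
    """Locate the design point farthest from the line d = w * x + b.

    Returns (index, estimated exponent, error in percent of the real value).
    """
    if not x or len(x) != len(d):
        raise ValueError("need non-empty, equally long inputs")
    residuals = [di - (w * xi + b) for xi, di in zip(x, d)]
    i = max(range(len(x)), key=lambda k: abs(residuals[k]))
    estimate = w * x[i] + b
    error_pct = 100.0 * (estimate - d[i]) / d[i]
    return i, estimate, error_pct


# Hypothetical regression line and design points (assumed values).
w, b = -0.02, 0.6
delays_ns = [10.0, 8.0, 6.0, 4.0]
exponents = [0.40, 0.44, 0.50, 0.52]
i, est, err = greatest_deviation(delays_ns, exponents, w, b)
print(f"worst point: delay {delays_ns[i]} ns, estimate {est:.3f}, error {err:.1f}%")
```

The error is measured only at evaluated design points, so a larger deviation between two of them would go unnoticed by this check.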
To evaluate the highest possible error in estimation results, the point on the regression line corresponding to the point of greatest deviation was used. The obtained results were then compared to the synthesis results associated with the point. Results of this comparison are gathered together into table 6. In addition to the error percentages of area and power consumption estimates, the error percentages of the respective Rent's exponents are shown to illustrate the fact that a small change in the value of Rent's exponent results in a considerable alteration of the estimation result. Note that the absolute maximum error might be found in between some of the evaluated design points.

[...] figure 3, to appear inside their physical-level design spaces. For the group of multiplier-based blocks, moderate modeling accuracy was achieved. Referring to figure 5, this group exhibits abrupt changes that are almost orthogonal to the regression line. This can be explained by topology changes being allowed for the synthesis tool. Also characteristic of, but not unique to, this group was the tendency of Rent's exponent to grow towards the low-delay implementations. It was observed that whenever Rent's exponent exhibited an abrupt change, Rent's constant was also dramatically altered. Because Rent's constant is defined by the number of connections between a logic gate and its surroundings, this abruptness is evidently due to the synthesis tool being forced to use smaller gates to meet the timing constraint. For example, the most dramatic increase of Rent's exponent visible in figure 5 is accompanied by a 34% drop in Rent's constant, indicating considerably smaller logic gates in the design.
Finally, a cycle time estimate was made for the processor based on the assumption that the cycle time is defined by the execution stage. On-chip memory was not taken into account in these calculations. The cycle time was assumed to be the sum of the delays in the logic block itself (based on the synthesis results, we assumed that the MAC has the longest delay), in the global wire, and in a pipeline register. We used a 0.32 µm wide and 0.565 µm thick metal 4 wire for global signaling. The RLC delay equation presented in [10] was used for the global wire delay, synthesis results for the logic delay, and standard cell library information for the pipeline register. For the 85% delay constraint (see tables 1-3), the cycle time of the XIRISC processor was 8.2359 ns (logic 95.3%, global wire 0.1%, pipeline register 4.6%). For the 70% and 55% delay constraints, the cycle times were 5.8961 ns and 5.1671 ns, respectively. The relative delay of logic was still 93.5% and 92.5%, respectively. These relatively high logic delay percentages suggest that global wire delay is not so critical if proper pipelining is used and the design contains bottleneck logic blocks that cause high cycle times for the processor.
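The cycle time model above is a plain sum of three delay components, which the following Python sketch reproduces together with the relative shares and the resulting clock frequency. The component delays used here are back-calculated from the reported 8.2359 ns total and its percentages for the 85% constraint; they are approximations for illustration, not figures taken from the synthesis reports.

```python
def cycle_time_breakdown(logic_ns: float, wire_ns: float, register_ns: float):
    """Sum the delay components and report each one's share of the cycle.

    Returns (cycle time in ns, shares in percent, clock frequency in MHz).
    """
    parts = {"logic": logic_ns, "global wire": wire_ns, "pipeline register": register_ns}
    if any(value < 0 for value in parts.values()):
        raise ValueError("delays must be non-negative")
    total = sum(parts.values())
    if total == 0:
        raise ValueError("cycle time must be positive")
    shares = {name: 100.0 * value / total for name, value in parts.items()}
    return total, shares, 1000.0 / total


# Assumed component delays, back-calculated from the reported percentages.
total, shares, f_mhz = cycle_time_breakdown(7.8488, 0.0082, 0.3789)
print(f"cycle time {total:.4f} ns, about {f_mhz:.1f} MHz")
for name, pct in shares.items():
    print(f"  {name}: {pct:.1f}%")
```

With these inputs the logic block accounts for nearly all of the cycle, which illustrates why shortening the slowest functional unit matters more here than optimizing the global wire.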
Conclusion
The development of an early design phase performance estimation method was described for logic blocks of the XIRISC processor core. Compared to traditional approaches, which use a single average exponent for all logic, we used block-wise exponents and took the variation caused by varying timing constraints into account. We observed a large variation in Rent's exponent between different blocks: the exponent value ranged from 0.29 to 0.68. The biggest estimation inaccuracies originate from the tendency of the synthesis tool to abruptly change the average logic gate size to meet a tighter delay constraint. Restricting the set of logic gates allowed for the synthesis tool, e.g. to gates having a maximum of 3 inputs, would not affect the delay or power optimization results very much, but would facilitate more accurate performance modeling. Future research will focus on the neighborhood of the abrupt changes in average gate size, that is, abrupt changes in Rent's constant. It is anticipated that using polynomial regression instead of linear regression would result in more accurate estimates of the performance metrics.
