Abstract-We evaluate the validity of the fundamental assumption behind application-specific programmable processors: that applications differ from each other in key parameters which are exploitable, such as the available instruction-level parallelism (ILP), demand on various hardware resources, and the desired mix of function units. Following the tradition of the CAD community, we develop an accurate chip area estimate and a set of aggressive hardware optimization algorithms. We follow the tradition of the architecture community by using comprehensive real-life benchmarks and production quality tools. This combination enables us to build a unique framework for system-level synthesis and to gain valuable insights about design and use of application-specific programmable processors for modern applications. We explore the application-specific programmable processor (ASSP) design space to understand the relationship between performance and area. The architecture model we used is the Hewlett Packard PA-RISC [1] with single-level caches. The system, including all memory and bus latencies, is simulated, and no other specialized ALU or memory structures are used. The experimental results reveal a number of important characteristics of the ASSP design space. For example, we found that in most cases a single programmable architecture performs similarly to a set of architectures that are tuned to individual applications. A notable exception is highly cost-sensitive designs, which we observe need a small number of specialized architectures that require smaller areas. Also, it is clear that there is enough parallelism in typical media and communication applications to justify the use of a large number of function units.
We found that the framework introduced in this paper can be very valuable in making early design decisions such as area and architectural configuration tradeoff, cache and issue width tradeoff under area constraint, and the number of branch units and issue width.
I. INTRODUCTION
It has been predicted that the "micro-brain boom" (sic) will greatly increase demand for application-specific microprocessors for media applications [2]. Sales of handheld computers and personal digital assistants grew almost sixfold from 1994's total, to 5.6 million units in 1999. The market for programmable DSP chips increased 20% in 1998 to the $3.9 billion level. The new DSP markets that are beginning to emerge, including digital cameras, satellite phones, smart antennas, voice over IP, ac motor control, and even digital TV, are forecast to grow at a 33% compound rate to the $13.4 billion level in 2002 [3].
This market growth coincides with an interesting technological advance that will change both the semiconductor business and microprocessor design. In 1992, microprocessors accounted for 23% of total semiconductor sales. In 1998, these chips accounted for 30% of the total value of semiconductor production. The increasing share of microprocessors in the semiconductor market is due to a new phase of silicon integration enabled by deep submicron fabrication technology.
For example, SA-1100 from Intel [4] incorporates many functions such as a memory controller, color LCD driver, PCMCIA interface, IrDA and USB communication channels, and extensive power management into a single chip along with its core logic, previously available only through "glue logic" chips. One implication of this technology is that almost all semiconductor manufacturers are entering the microprocessor business.
As a consequence of this trend, the market will be more crowded and competitive in spite of increasing demand. This pressure will force manufacturers to focus on microprocessors that are cheaper and more aggressively optimized for specific applications. A challenge to microprocessor designers will be to design a microprocessor that executes a targeted application very well yet can achieve economy of scale. For example, video-game players such as the PlayStation from Sony and the Nintendo 64 from Nintendo need to employ ever-more powerful processors for the application and yet remain cheap enough to sell for under $300.
On the technical side, recent advances in compiler technology and microprocessor architecture for instruction-level parallelism (ILP) have significantly increased the ability of a microprocessor to exploit the opportunities for parallel execution that exist in various programs. Key ILP compiler technologies, such as trace scheduling [4] , superblock scheduling [5] , treegion-scheduling [6] , hyperblock scheduling [8] , and software pipelining [9] are in the process of migrating from research labs to product groups.
At the same time, a number of new microprocessor architectures have been introduced. These designs present hardware structures that are well matched to most ILP compilers. Architectural enhancements found in commercial products include predicated instruction execution, VLIW execution, and split register files. One of the best examples that has these features is TMS320C6X from Texas Instruments [9] . Although TI considers the TMS320C6X to be a DSP, the architecture is almost a copy of the Multiflow Trace [10] . Multi-gauge arithmetic (or variable-width SIMD) is found in the family of MPACT architectures from Chromatic [11] and the designs from MicroUnity [12] . Most of the multimedia extensions of programmable processors also adopt this architectural enhancement [14] .
The arrival of production quality ILP compilers and commercial DSPs with VLIW and SIMD architectures stimulated the idea of custom-fit processors [15]. The premise of such an approach is that applications differ from each other in exploitable measures, for example the available ILP, demand on various hardware components (e.g., cache memory units, register files), and the number of function units. The presumption is that a microprocessor can be designed by adding hardware components tailored to a specific application so that it can execute the single application extremely well. Of course, an obvious drawback of this approach is that it provides no guarantee that other applications will run as well as the targeted application. While the current microprocessors for media applications (mediaprocessors) are claimed to target general applications in a domain [13], a custom-fit processor targets a single application (although it remains programmable).
We report on a method of system-level synthesis of single or multiple application programmable processors. We use a benchmark suite consisting of complete applications written in a high-level language [16]. We use the IMPACT tool suite [18] to collect performance measurements of benchmarks on various machine configurations. The IMPACT C compiler is a retargetable compiler with code optimization components supporting multiple-instruction-issue processors. The target machine is described using the high-level machine description language. A high-level machine description supplied by a user is compiled by the IMPACT machine description language compiler. IMPACT provides cycle-level simulation tools.
This paper is organized as follows. The next section briefly surveys related works and summarizes the contributions of this work. Section III presents the background materials including the machine model, benchmarks, experiment platform (such as tools), and an example set of results obtained using the tools. Our approach in this project is explained in detail in Section IV. Section V formulates the search problem defined in the previous section in formal terms. The solution space exploration strategy and algorithm are described in Section VI. Extensive experimental results are reported in Section VII. Finally, Section VIII draws conclusions.
II. RELATED WORKS AND OUR CONTRIBUTIONS
The work on synthesis and evaluation of application-specific programmable processors has been conducted independently in two research communities, computer-aided design and architecture. There is, however, a strong converging trend of the two areas due to recent technological advances and application trends. In this section we survey the related works in these two fields.
There have been a number of efforts related to the design of application-specific programmable processors and application-specific instruction sets. Comprehensive surveys of the works on computer-aided design of application-specific programmable processors have been conducted by Goosens [18], Paulin [19], and Marwedel [20]. In particular, a great deal of effort has been made in combining retargetable compilation technologies and design of instruction sets [22] - [26]. Several research groups have published results on the topic of selecting and designing instruction sets and processor architectures for particular application domains [27], [28].
Early work in the area of processor architecture synthesis tended to employ ad hoc methods on small code kernels, in large part due to the lack of good retargetable compiler technology. Conte and Mangione-Smith [29] presented one of the first efforts that focused on large application codes (i.e., SPEC) written in a high-level language. While they had a similar goal to ours, i.e., evaluating performance efficiency by including hardware cost, their evaluation approach was substantially different. Conte et al. [29] further refined this approach to consider power consumption. Both of these efforts were limited by available compiler technology and used a single application binary scheduled for a scalar machine. The work on custom-fit processors [15] studied the variability of application-specific VLIW processors using a highly advanced and retargetable compiler. However, that study considered small program kernels rather than complete applications. It also focused on finding the best possible architecture for a specific application or workload, rather than understanding the difference among attractive architectures across a set of applications.
We adopt a methodology of system synthesis combining the key paradigms of both communities. Following the tradition of the CAD community, we develop an accurate area estimate and aggressive optimization algorithms. We follow the tradition of the architecture community by using comprehensive real-life benchmarks and production quality compilation and simulation tools. This combination enables us to build a unique framework of system-level synthesis and to gain valuable insights about design and use of application-specific programmable processors for modern applications.
Unlike previous works, we use a set of complete applications written in a high-level language as benchmarks. We incorporate the role of cache memory units in machine performance into the machine model, which is essential for producing meaningful results. We focus on the number of machine configurations that should be developed in order to maximize performance for all of the benchmarks given an area constraint. We understand that it is in the best interest of a processor designer to know which architecture, how many functional units, and what cache size are best for one particular application. However, our first goal is to develop a framework for managers to understand how big the chip portfolio should be in one particular domain. It is not intended for a single designer to find the best application-specific system. The objective function of the optimizer is minimization of the number of selected machine configurations, thereby maximizing the number of benchmarks that can be run on a processor as though it is optimized for each individual benchmark. In one extreme case, we end up with as many machine configurations as the number of individual benchmarks. On the other extreme, we need only one machine. Clearly, the most interesting solutions lie somewhere in the middle.
Power consumption evaluation and optimization is very often an important aspect in multimedia processors; however, it is beyond the scope of this paper. We have published a thorough investigation of power consumption using a similar framework and tools in another paper [30].
III. PRELIMINARY DISCUSSION
In this section we discuss the experimental environment that has been adapted and developed for the investigation. First, we describe the machine model used to estimate the area of a machine configuration. The benchmark suite is introduced along with the characteristics of its components. Finally, we explain the experimental platform, including tools and their example outputs.
A. Machine Model
To estimate the cost of a machine configuration, we adopt a simple model developed by Argyres [31]. Given the area of the issue unit, the cost of any scalar machine configuration is a linear function of the numbers of branch, memory, and arithmetic units. A machine may include any number of each function unit. For a superscalar machine, the issue unit area cannot be estimated using a simple linear model since it requires more complex logic for runtime code scheduling. We assume that the issue unit area will take $O(n^2)$ space since the complexity of the dependency checking algorithm is $O(n^2)$. When a VLIW machine is considered, the issue unit area is known to be of complexity $O(n)$. In the area model, $A_{dc}$ denotes the data cache area and $A_{ic}$ the instruction cache area.
The baseline architecture chosen for the analysis is the PowerPC 604 [32], a four-issue processor. The 604 has two simple integer ALUs and one complex integer ALU, one floating-point unit, one branch unit, and one memory unit. We assume that machine configurations that have an issue unit smaller than the baseline machine have at least one complex integer ALU. The area of the complex integer unit is assumed to be half of the baseline integer unit (two simple integer units and one complex integer unit). The area of the issue unit is scaled based on the area complexity ($O(n^2)$). We did not include floating-point units in any machine configurations because the benchmarks we used have mostly integer (or fixed-point) operations. Finally, we scaled the area for 0.35 μm technology rather than the original 0.5 μm technology used by Argyres. A set of example machine configurations and their respective area estimates are shown in Table I.
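As an illustration of the area model described in this subsection, the following minimal sketch combines a linear function-unit term, an issue-unit term scaled from the four-issue baseline, and the two cache areas. It is not the model of [31]: every per-unit area below is a hypothetical placeholder, the linear dependence of cache area on capacity is our assumption, and the names `MachineConfig`, `issue_unit_area`, and `machine_area` are introduced only for this example.

```python
from dataclasses import dataclass

# Hypothetical component areas in mm^2 (placeholders, NOT values from [31]).
BRANCH_UNIT_AREA = 1.0
MEMORY_UNIT_AREA = 2.0
ALU_AREA = 1.5
BASELINE_ISSUE_AREA = 4.0   # issue unit of the four-issue baseline machine
BASELINE_ISSUE_WIDTH = 4
CACHE_AREA_PER_KB = 0.5     # assumption: cache area grows linearly with size


@dataclass(frozen=True)
class MachineConfig:
    issue_width: int
    branch_units: int
    memory_units: int
    alus: int
    icache_kb: float
    dcache_kb: float


def issue_unit_area(width, exponent=2):
    """Scale the baseline issue-unit area as O(n^exponent).

    exponent=2 mirrors the quadratic (superscalar) assumption,
    exponent=1 the linear (VLIW) assumption.
    """
    return BASELINE_ISSUE_AREA * (width / BASELINE_ISSUE_WIDTH) ** exponent


def machine_area(cfg, exponent=2):
    """Total area: issue unit + linear function-unit cost + caches."""
    function_units = (
        cfg.branch_units * BRANCH_UNIT_AREA
        + cfg.memory_units * MEMORY_UNIT_AREA
        + cfg.alus * ALU_AREA
    )
    caches = (cfg.icache_kb + cfg.dcache_kb) * CACHE_AREA_PER_KB  # A_ic + A_dc
    return issue_unit_area(cfg.issue_width, exponent) + function_units + caches
```

With these placeholder numbers, doubling the issue width from four to eight quadruples the issue-unit term under the quadratic assumption but only doubles it under the linear one, which is the difference the two issue-unit models are meant to capture.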
B. Benchmarks
The set of benchmarks used in this work is composed of complete applications which are publicly available and coded in a high-level language. The collection is composed of 21 applications culled from available image processing, communications, cryptography, and DSP applications. Brief summaries of benchmarks and data used are shown in Tables II and III, respectively. More detailed descriptions of the benchmarks can be found in a previous publication [16].
As discussed in the Introduction, the idea that a programmable processor can be tuned to a target application is based on the assumption that applications differ from each other in exploitable features. As an illustration, Table IV shows measured characteristics of the benchmarks used in the experiment. Note that the combination of the instructions per cycle (IPC), bus utilization, branch issue, and ALU issue exhibits distinctive characteristics for each benchmark. Although the target was a single-issue machine, we found strong evidence that performance tuning for an individual application could be beneficial. Note that in order to reduce the effect of memory operations on other measurements, the target machine has a 32 KB instruction cache and a 32 KB data cache, resulting in high cache hit rates.
C. Experimental Tools
We use the IMPACT tool suite [18] to automatically tune application codes and collect performance measurements of benchmarks on various machine configurations. The IMPACT C compiler is a retargetable compiler with code optimization components developed for multiple-instruction-issue processors. It incorporates code improving techniques such as function inline expansion, instruction placement, loop unrolling, loop peeling, memory disambiguation, register renaming, branch prediction, critical path depth reduction, and an integrated register allocation and code scheduling algorithm for both VLIW and superscalar architectures. The target machine for IMPACT C is described using a high-level machine description (HMDES) (see Section IV for an example) supplied by a user. IMPACT provides cycle-level simulation of the processor implementation. Fig. 1 shows the flow of simulation using the IMPACT tools.
We collect run-times (expressed as a number of cycles) of the benchmarks on 175 different machine configurations. First we build executables of the benchmarks for seven different processor configurations. They are machines with a single branch unit and one of the one-, two-, four-, and eight-issue units, machines with two branch units and one of the four- and eight-issue units, and machines with four branch units and an eight-issue unit. The IMPACT compiler generates aggressively optimized code to increase ILP for each configuration. All the machine configurations have the same number of ALU and memory units as the issue width. The optimized code is consumed by the Lsim simulator. We simulate the benchmarks for a number of different cache configurations. For each executable of a benchmark, we simulate 25 combinations of instruction cache and data cache ranging from (512 bytes, 512 bytes) to (8 KB, 8 KB).
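The count of 175 configurations follows from seven processor configurations times 25 cache combinations. The sketch below enumerates them. The seven (issue width, branch units) pairs are taken from the text; the assumption that the cache sizes double from 512 bytes to 8 KB (five sizes per cache, hence 25 pairs) is ours, as are the dictionary keys and the function name.

```python
from itertools import product

# (issue width, number of branch units) pairs listed in the text.
PROCESSOR_CONFIGS = [
    (1, 1), (2, 1), (4, 1), (8, 1),  # single branch unit
    (4, 2), (8, 2),                  # two branch units
    (8, 4),                          # four branch units
]

# Assumption: cache sizes double from 512 bytes up to 8 KB.
CACHE_SIZES = [512 * 2 ** i for i in range(5)]  # 512 ... 8192 bytes


def enumerate_configurations():
    """Return one dict per (processor, instruction cache, data cache) choice."""
    configs = []
    cache_pairs = product(CACHE_SIZES, repeat=2)
    for (issue, branch), (icache, dcache) in product(PROCESSOR_CONFIGS, cache_pairs):
        configs.append({
            "issue_width": issue,
            "branch_units": branch,
            "alus": issue,          # same number of ALUs as the issue width
            "memory_units": issue,  # same number of memory units as well
            "icache_bytes": icache,
            "dcache_bytes": dcache,
        })
    return configs


if __name__ == "__main__":
    print(len(enumerate_configurations()))
```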
Run-times measured through simulation are normalized with respect to a baseline machine. We selected as a baseline configuration a machine with one branch unit, a one-issue unit, 512 bytes of instruction cache, and 512 bytes of data cache. An example set of results is shown in Table V. There are 128 different machine configurations that satisfy the illustrated area constraint, 16 mm². After run-times are measured, we eliminate machines that are dominated by at least one other machine. By dominated, we mean that a machine is no faster than some other machine on every benchmark. In this particular example, there are seven machines left after dominated machine configurations are eliminated. The areas of the machine configurations are shown in Table VI.
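The normalization and elimination steps can be sketched as follows. This is an illustrative sketch rather than the tool used in the experiments: the data layout (a dictionary of cycle counts per machine and benchmark) and all function names are assumptions made for the example, and the tie-breaking rule for identical machines is our own choice, since a literal reading of the definition would discard both.

```python
def normalize(cycles, baseline):
    """Convert cycle counts {machine: {benchmark: cycles}} into speed-ups
    relative to the baseline machine (larger is better)."""
    base = cycles[baseline]
    return {
        machine: {bench: base[bench] / count for bench, count in runs.items()}
        for machine, runs in cycles.items()
    }


def dominates(a, b):
    """True if a machine with speed-ups `a` is at least as fast as one
    with speed-ups `b` on every benchmark."""
    return all(a[bench] >= b[bench] for bench in b)


def eliminate_dominated(speedups):
    """Keep only machines that no other machine dominates."""
    kept = {}
    for name, own in speedups.items():
        dominated = any(
            other != name
            and dominates(theirs, own)
            # Tie-break (our choice): of two identical machines, keep the
            # one whose name sorts first instead of dropping both.
            and (theirs != own or other < name)
            for other, theirs in speedups.items()
        )
        if not dominated:
            kept[name] = own
    return kept
```

In practice the area filter would be applied first, so that only machines within the area bound take part in the pairwise comparison.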
IV. APPROACH
Our hypothesis is that a set of machine configurations that run a given set of benchmarks equally well with respect to a baseline machine can be found. In other words, there is at least one machine from the set that can be used to build an application-specific system. Effectively, we can say that the machine is optimized to run the specific application. In this section, we describe our approach to the selection problem. First we show the global flow of the design process. We describe the combinatorial nature of the search space by showing an example search tree.
A. Global Design Flow
The experiment is carried out by first selecting a set of machine models. Machines are described using the high-level machine description (HMDES) language; a portion of an example HMDES file used by the IMPACT tool suite is shown in Fig. 3. The HMDES files are compiled by the HMDES compiler [34], [35]. Detailed and precise descriptions of the execution constraints for the HP PA7100, PLAYDOH, Intel Pentium, EPIC, and Sun SPARC have been ported and widely used by the IMPACT C compiler and the Lsim simulator. The benchmarks are compiled by the IMPACT C compiler for the machines described in HMDES. The architecture model we used in the experiment is the Hewlett Packard PA-RISC (HPPA) [1]. All the executables are simulated using the IMPACT simulator, Lsim. At simulation time, we specify cache configurations for the simulator. Lsim simulates the branch prediction scheme, reorder buffer, caches, and memory. Memory latency, misprediction penalty, and ALU latency are specified as Lsim parameters (Fig. 2) in the system model being simulated. We did not incorporate a second-level cache or specialized ALU; however, adding one and changing the ALU latency is possible in Lsim. Through simulations we measure run-times of all the benchmarks. The run-times are normalized with respect to a baseline machine run-time for each benchmark to obtain speed-up numbers. After all the simulations are completed, we begin searching for the best machine configuration sets for specific area constraints. For each area limit, we eliminate all machines that do not satisfy the area bound. From the machines that satisfy the area bound, we eliminate all the dominated machines (refer to Section III). Finally, we apply the K-selection algorithm (see Section V) to select a set of machine configurations that run the benchmark set best. Fig. 4 shows the global flow of the design process.
B. Selection Problem Search Space
The search space is relatively small due to the area constraint and the number of dominated implementations. Nevertheless, there is a possibility that the search space explodes due to its combinatorial nature. The likelihood of this phenomenon occurring appears to be a strong function of the area models used. Fig. 6 shows an example search space with three machine configurations. The search starts with an empty set (indicated by the root node in the diagram) and follows one of the possible paths in the search tree. Each node in the search tree is an instance of a completed selection. Fig. 5 shows an example measurement of performance versus number of selected processors. We are looking for the break point of diminishing returns for adding configurations to a set. This goal will be elaborated in more detail in the next section.
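To make the combinatorial growth concrete, the sketch below enumerates the nodes of such a search tree under the assumption that each node corresponds to one unordered subset of machines (the empty set being the root). The function name and the machine labels are hypothetical.

```python
from itertools import combinations
from math import comb


def selection_nodes(machines):
    """Yield every candidate selection (subset of machines), smallest first.

    Assumption: a node of the search tree is an unordered subset, so m
    machines give 2**m nodes in total and comb(m, w) nodes of size w.
    """
    for size in range(len(machines) + 1):
        yield from combinations(machines, size)


if __name__ == "__main__":
    nodes = list(selection_nodes(["m1", "m2", "m3"]))
    print(len(nodes))   # number of nodes for three machines
    print(comb(7, 3))   # size-3 selections among seven machines
```

For the three-machine example there are eight nodes; for the seven non-dominated machines of Section III there would be 128, which is still easy to enumerate but doubles with every additional machine.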
V. SELECTION PROBLEM FORMULATION
In this section we formulate the problem of finding the minimum number of machine configurations for a given set of benchmarks in such a way that all the benchmarks execute well on at least one of the selected machines.
A. Informal Description of Problem
Informally, the problem can be stated as follows: Given an area constraint and speed-up numbers of benchmarks on machines that fit into the given area, we want to select a subset of the machine configurations in such a way that the geometric mean of speed-ups across all the benchmarks is maximized and the subset size is kept small.
We normalize the run time with respect to a baseline since we are not interested in the sum of run-times [36] . The sum of run-times does not reflect the performance effect of shorter benchmarks in the presence of longer benchmarks. In some cases, a benchmark that takes a long time to complete due to large data sets dominates the sum of run-times.
We use the geometric mean to summarize the selected machines since we normalize the measurements [36]. In general, the geometric mean is not a good method of summarizing performance numbers [37] as it does not show the nature of the workload. For example, consider a workload consisting of two applications. On a baseline machine one application takes 5000 s to complete and the other 2 s. We compare two machines based on a set of normalized numbers. Assume that one machine improves the performance of the first application by a factor of two and the other improves the second application by the same factor. Then the geometric mean indicates that the performance of the two machines is the same, although the first machine cuts the running time by 2500 s while the second machine cuts it by only 1 s. This is a problem when we summarize normalized performance numbers for a workload mix. As indicated earlier, however, we are not interested in the machine performance on a workload mix. Instead, we are interested in predicting the performance of one of the selected machines on each individual benchmark. As an illustration, consider the speed-up numbers given in Fig. 7. While the arithmetic mean suggests that there is no difference between m1 and m2, the geometric mean provides more useful insight.
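The two-application example above can be checked numerically. The short sketch below uses only the run-times stated in the text (5000 s and 2 s on the baseline); the variable and function names are ours.

```python
from math import prod


def geometric_mean(values):
    """Geometric mean of a sequence of positive numbers."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))


# Baseline run-times in seconds, as in the example in the text.
baseline = {"app1": 5000.0, "app2": 2.0}
# Machine A halves the first application, machine B halves the second.
machine_a = {"app1": 2500.0, "app2": 2.0}
machine_b = {"app1": 5000.0, "app2": 1.0}


def speedups(run_times):
    """Per-application speed-up with respect to the baseline."""
    return [baseline[app] / run_times[app] for app in baseline]


gm_a = geometric_mean(speedups(machine_a))
gm_b = geometric_mean(speedups(machine_b))
saved_a = sum(baseline.values()) - sum(machine_a.values())
saved_b = sum(baseline.values()) - sum(machine_b.values())

if __name__ == "__main__":
    print(gm_a, gm_b)        # identical geometric means
    print(saved_a, saved_b)  # very different absolute savings
```

Both machines obtain the same geometric mean of speed-ups (the square root of two), while the absolute time saved differs by a factor of 2500, which is exactly why the measure suits per-benchmark prediction but not a workload mix.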
We want to see how many machine configurations are necessary in order to achieve high performance for all the benchmarks. The objective function of the optimization problem is minimization of the number of selected machine configurations, thereby, on average, maximizing the number of benchmarks that can be run on a processor as though it is optimized for each individual benchmark. In one extreme case, we might end up with machine configurations for each individual benchmark. On the other extreme, we might need only one processor solution for all applications.
B. Formal Description of Problem
We now define the problem using the more formal Garey-Johnson format [39].
Selection problem:

Instance: Given a set of $n$ benchmarks $a_i$, $i = 1, 2, \ldots, n$; $k$ machine configurations $m_j$, $j = 1, 2, \ldots, k$; the speed-up factors $E_{ij}$ of the benchmarks $a_i$, $i = 1, 2, \ldots, n$, on the machines $m_j$, $j = 1, 2, \ldots, k$, with respect to a baseline machine; and constants $K$ and $C$.

Question: Is there a set $M$ of $K$ machine configurations $c_p$, $p = 1, 2, \ldots, K$, such that

$$\prod_{i=1}^{n} \max_{j \in M} E_{ij} \geq C?$$

To determine the constant $K$ we divide the problem into two subproblems, namely, an $\omega$-selection problem and a $K$-selection problem. Starting from $\omega = 1$ we iteratively increase $\omega$ until the benefit of increasing $\omega$ is less than a given threshold. Formally the subproblems are stated as follows.

$\omega$-selection problem: Given a set of $n$ benchmarks $a_i$, $i = 1, 2, \ldots, n$; $k$ machine configurations $m_j$, $j = 1, 2, \ldots, k$; the speed-up factors $E_{ij}$ of the benchmarks $a_i$ on the machines $m_j$ with respect to a baseline machine; and a constant $\omega$, find

$$D_\omega = \max_{P} \prod_{i=1}^{n} \max_{j \in P} E_{ij} \qquad (2)$$

where $P$ is the selected machine set of size $\omega$. The size of the machine set is determined by an iterative test comparing $D_\omega$ and $D_{\omega+1}$. Since $D_\omega$ is monotonic in $\omega$, we continue to evaluate $D_\omega$ and compare successive values using (3) until we reach a point where the benefit of the set size increase drops below a certain limit.

$K$-selection problem: Find the smallest $\omega$, denoted $K$, for which the benefit of increasing $\omega$ to $\omega + 1$, measured by comparing $D_\omega$ and $D_{\omega+1}$ in (3), falls below the cutoff, where $D_\omega$ is given by (2) and the threshold is a cutoff ratio.
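The iterative procedure described above can be sketched as an exhaustive search over subsets combined with a stopping test. Two points in the sketch are assumptions rather than the paper's definitions: the score is taken to be the geometric mean of the per-benchmark best speed-ups (as in the informal description), and the cutoff test is taken to be the relative improvement between consecutive set sizes. The function names and the data layout (a list of speed-ups per machine, indexed by benchmark) are likewise hypothetical.

```python
from itertools import combinations
from math import prod


def score(speedups, subset):
    """Geometric mean, over benchmarks, of the best speed-up in `subset`."""
    n = len(next(iter(speedups.values())))
    best = [max(speedups[m][i] for m in subset) for i in range(n)]
    return prod(best) ** (1.0 / n)


def omega_selection(speedups, omega):
    """Exhaustively find the size-omega subset with the highest score."""
    candidates = combinations(sorted(speedups), omega)
    return max(((score(speedups, s), s) for s in candidates),
               key=lambda pair: pair[0])


def k_selection(speedups, cutoff):
    """Grow the set size until the relative gain drops below `cutoff`.

    Assumed stopping test: (D_next - D) / D < cutoff.
    """
    omega = 1
    d, subset = omega_selection(speedups, omega)
    while omega < len(speedups):
        d_next, subset_next = omega_selection(speedups, omega + 1)
        if (d_next - d) / d < cutoff:
            break
        omega, d, subset = omega + 1, d_next, subset_next
    return omega, subset, d


if __name__ == "__main__":
    example = {"m1": [4.0, 1.0], "m2": [1.0, 4.0], "m3": [2.0, 2.0]}
    print(k_selection(example, cutoff=0.05))
```

In the toy example, two machines suffice: adding the third machine to the pair (m1, m2) does not improve any benchmark, so the relative gain is zero and the search stops.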
VI. SOLUTION SPACE EXPLORATION: STRATEGY AND ALGORITHMS
The algorithm for system-level synthesis of application-specific programmable processors is given in Fig. 8. Considering that the run-time of simulations for 20 benchmarks on 175 machine configurations is about a week, we can tolerate a longer search time to find the optimal result. Generally, the size of the search problem is dramatically reduced by eliminating machine configurations that do not satisfy a given area constraint and those that are dominated by at least one other machine. Consequently, a smaller number of machines needs to be considered. Machine configuration A dominates configuration B if no benchmark has a longer execution time on machine A than on machine B.
The search for an optimum solution is organized using an implicit enumeration method. In particular, we adopt a branch-and-bound algorithm shown in Fig. 8 to speed up the selection.
The branch-and-bound algorithm consists of two major components: branching and evaluation. The branching step takes the current state of selection (a node in the search tree) and generates a number of new nodes by adding an available machine (one not yet considered on the current path of the search tree) to the current state of selection (refer to Fig. 6). As shown in Fig. 8, it checks whether adding a machine to the current state of selection can result in a better solution than the current best solution found. Initially, the current best solution is set to the previous best solution. The previous best solution is the best solution found for the machine set size one less than the current search size. The branching is bounded by the bounding function. The bounding function compares the current node and a candidate processor with the best node of the same size found so far. The node size is the number of processors. If the current node and the candidate are dominated by the best node, then we cut the path off from the search. We compute the lower bound of the geometric mean of the maximum speed-up factors of each benchmark. The lower bound is obtained by using a steepest descent algorithm. The steepest descent algorithm selects machines in the order that achieves the biggest improvement at each step. If the estimate is greater than or equal to the current best solution, we have an opportunity to find a better solution than the current best solution by exploring the search path. Otherwise, there is less of a chance of obtaining a better solution.
We sort the search order based on the lower bound so as to increase the bounding rate.
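The steepest descent estimate used for bounding can be sketched as a greedy completion of a partial selection. This is a minimal sketch under assumptions, not the algorithm of Fig. 8: the score is the geometric mean of the per-benchmark best speed-ups, and the names `score` and `greedy_lower_bound` and the data layout are hypothetical. Because the greedy result is itself a feasible selection, its score can never exceed that of the best selection of the same size, which is what makes it usable as a lower bound.

```python
from math import prod


def score(speedups, subset):
    """Geometric mean, over benchmarks, of the best speed-up in `subset`."""
    n = len(next(iter(speedups.values())))
    best = [max(speedups[m][i] for m in subset) for i in range(n)]
    return prod(best) ** (1.0 / n)


def greedy_lower_bound(speedups, size, start=()):
    """Extend `start` to `size` machines, each time adding the machine
    that raises the score the most, and return (score, selection)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    chosen = list(start)
    remaining = sorted(set(speedups) - set(chosen))
    while len(chosen) < size and remaining:
        best = max(remaining, key=lambda m: score(speedups, chosen + [m]))
        chosen.append(best)
        remaining.remove(best)
    return score(speedups, chosen), tuple(chosen)


if __name__ == "__main__":
    example = {"m1": [4.0, 1.0], "m2": [1.0, 4.0], "m3": [2.0, 2.0]}
    print(greedy_lower_bound(example, 2))
    print(greedy_lower_bound(example, 2, start=("m3",)))
```

Passing a partial selection through `start` corresponds to evaluating a node deeper in the search tree: the path below it is worth exploring only if its estimate is competitive with the best solution found so far.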
VII. EXPERIMENTAL RESULTS
We evaluated the tools and algorithms by running extensive experiments with area constraints ranging from 30 to 200 mm². Fig. 9(a) shows an experimental result using the cutoff value of 0.05. The thicker line shows the number of machines that are left after eliminations. The thinner line in the figure indicates the number of selected machines needed to cover all the benchmarks under the area constraints. We clearly see that we need more machine configurations when less area is available. On the other hand, the more area we have, the more general the processor we can design. The results suggest that when more than 100 mm² of area is available, there is no advantage in having more than one architecture to be able to build application-specific systems for all the benchmarks. Moreover, for the given compiler technology and benchmarks, there is no need to have more than 100 mm² of area since the speed-up increase achieved by machines larger than 100 mm² is minimal. The overall performance comparison between all configurations and selected configurations is shown in Fig. 9(b). There are three distinctive points where the speed-up increase rate changes. Up to an area of 57 mm², we see a rapid performance increase, which is mainly due to the increased amount of cache memory. From 57 to 101 mm² the measurement shows a modest increase in performance. The performance increase shown in this interval is mainly due to increased issue width. For processors larger than 101 mm², the performance increase is minimal.
One of the underlying reasons for this phenomenon is that the ILP found by the compiler and hardware scheduler is fully exploited once a certain amount of hardware is available, so the potential for further performance increase is exhausted. The limitation of performance increase in the face of increased area illustrates either the limitation of the current compiler technology or the inherent lack of ILP in the benchmarks. Note, however, that the measurement is not for a single processor. Smaller area cases tend to have more than one architecture, each of which is more application specific.
Experimental results for the cutoff values 0.1, 0.05, 0.01, and 0.005 are given in Fig. 10 . Smaller cutoff values result in machine configuration sets that are more tuned to each application. In general, however, a smaller cutoff value does not result in dramatic performance increase. In most cases, the cutoff value of 0.05 appears to give a good tradeoff between the number of machine configurations and performance.
Speed-up factors of each benchmark are shown in Table VII. They are snapshots of experimental results summarized by the line graphs in Fig. 10. The table contains maximum speed-up factors for three cutoff values (0.05, 0.01, and 0.005) and three area constraints (85, 100, and 169 mm²). Note that the area constraints are not actual areas but rather bounds. We consider machines under the given area constraints. Table VIII gives the number of machine configurations selected and the best performing machine configuration for each benchmark. The actual areas of the selected machines are given in column 5 of the table. The combinations of components for the selected machines are shown in column 4. Fig. 11 shows the results when the linear-complexity issue unit area model is assumed. The results suggest that the machine configuration selection problem has no strong dependence on the issue area model used. Although we observe that there is a shift to smaller areas, the results are essentially identical to those based on the quadratic-complexity issue unit area model.
In summary, we found that under the machine models and machine configuration choices described in this paper, when more than 100 mm² of area is available, there is little advantage in having more than one architecture to be able to build application-specific systems for all the benchmarks. Moreover, for the given compiler technology and benchmarks, there is little need to have more than 100 mm² of area since the speed-up increase achieved by machines larger than 100 mm² is minimal. One notable exception is that for highly cost-sensitive designs we observe a need for a small number of specialized architectures which achieve smaller areas.
VIII. CONCLUSION
The arrival of production quality ILP compilers and commercial DSPs with VLIW architecture stimulated the idea of programmable processors that are aggressively tuned to specific applications. The assumption behind the idea is that there are ways of designing programmable processors that can exploit the run-time characteristics of specific applications. The run-time characteristics include the available ILP, demand on various hardware components such as cache memory units and register files, and the number of function units. It is assumed that a microprocessor can be designed by adding hardware components tailored to a specific application so that it can execute the single application extremely well. We ran extensive experiments on a framework based on the key paradigms of the CAD and architecture communities. This combination enabled us to gain valuable insights about design and use of application-specific programmable processors for modern applications. We evaluated 175 machine configurations on 20 benchmarks under area constraints ranging from 30 to 200 mm². For each area constraint, we obtained an optimum set of machine configurations for a number of cutoff values. The run time of the entire synthesis process was about a week. It is well known that when the area constraint is tight, more machine configurations are needed for application-specific designs. The figures show that even with more area, there still exists a fair number of different configurations due to the introduction of different functional units (branch unit and ALU) traded off against cache size. In the system-level integration market, we believe a standard design solution means a quick time to market and guaranteed functionality. We developed this framework to help design managers determine their chip portfolio for their particular domains of interest.
We have found that the framework introduced in this paper can be very valuable in making early design decisions such as area and architectural configuration tradeoff, cache and issue width tradeoff under area constraint, and the number of branch units and issue width.
