Previous work in hardware-software co-design has addressed issues in system modeling, partitioning, and mixed module simulation and integration.
Introduction
System-level design is a set of tasks which convert a system-level specification into a set of interconnected modules implementing the specification. Each module could be implemented in hardware or as software executing on a processor. A hardware implementation has better performance, whereas a software implementation has lower cost, shorter development time and allows changes late in the design cycle.
Two opposite approaches have been proposed for hardware-software partitioning to determine which part of a specification should be implemented in software and which in hardware. Gupta and De Micheli [1] use a partitioning algorithm that starts with an initial partition where all operations, except for the unbounded delay operations, are assigned to hardware. The partition is refined by migrating operations from hardware to software in search of a lower cost feasible partition. The approach used by Ernst and Henkel [2] starts with a complete software implementation from which those portions that violate the performance constraints are extracted for the hardware implementation. These two approaches start from different directions but work towards the same goal of minimizing the amount of application-specific hardware required. Hardware-software partitioning requires a software estimator that will predict the execution time of the software implementation in order to identify which portions of the specification can be migrated from hardware to software without violating the constraints, or which portions need to be implemented in hardware to satisfy the timing constraints.
The most accurate method to obtain the execution time of a software implementation would be to compile the given specification to each target processor and measure it, both statically [3] and dynamically [4]. However, such a processor-specific approach is very time-consuming and requires extensive infrastructure such as various compilers, estimators and target processors, which is often not available. Few design groups will spend a large amount of money to acquire various compilers and target processors just to determine the software performance, and the processor-specific approach is too time-consuming to allow designers to consider more than a few design alternatives. It is therefore essential to generate reasonably accurate estimates using a cheaper and faster approach.
In this paper, we propose a generic-processor model which generates reasonably accurate software estimates in a cheaper and faster way. In contrast to using different compilers and estimators for different target processors in the processor-specific approach, the generic model proposed in our approach uses only one compiler and one estimator in conjunction with different processor technology files which characterize the corresponding processors' instruction sets. This makes our estimator fast and easy to extend to different target processors. Besides the performance, our estimator also generates estimates of program-memory size for a given specification and a given target processor. The input to our software estimator is a system-level specification in SpecChart [5]. The target processors we consider in this paper are those used mostly in embedded systems. Processors with cache memory and/or instruction pipelining are beyond the scope of this paper.
In the next section, we present the underlying model used for software estimation. Performance and memory size estimation for system-level specifications are discussed in Section 3 and Section 4 respectively. The results of our experiments are presented in Section 5, followed by conclusions and future work in Section 6.
Model for Estimation
The estimation model we propose is targeted towards system-level specifications which consist of hierarchical concurrent/sequential behaviors. A behavior, which is a set of actions and a set of conditions describing when each action is to occur, can in turn contain sequential or concurrent sub-behaviors. For example, in Figure 1 control reaches a stop dot. Behaviors without stop dots will never finish executing. Our estimator is intended to estimate the software metrics for any given leaf/non-leaf behavior of the specification as well as any given partition (a set of behaviors) in the specification. P1 in Figure 1 is a partition which contains two behaviors Q and R.

Estimation Model for Leaf Behaviors
In order to obtain the estimates for leaf behaviors, we may need to compile the code in the leaf behaviors into the instruction set of the target processor. For example, if a leaf behavior will be implemented on an Intel 8086 processor, it may need to be compiled into the 8086 instruction set. Using the timing and size information associated with each type of instruction, such as how many clock cycles each 8086 instruction executes and how many bytes it takes, we can obtain the performance and program size of the behavior. Similarly, if the leaf behavior is going to be implemented on a Motorola 68000 processor, it may need to be compiled into the 68000 instruction set. Based on the 68000 instruction timing and size information, the estimator can obtain the software metrics for the behavior. We call this model the processor-specific model, shown in Figure 2(a). Instead of using different compilers and estimators for different target processors as in the processor-specific model, we propose a generic-processor model (Figure 2(b)) in which the leaf behavior specification is converted into a set of generic three-address instructions described in [6].
After that the estimator computes the software metrics for the leaf behavior based on the generic instructions and the technology files for the target processors. For example, if the leaf behavior is going to be implemented on an Intel 80286 processor, then the technology file for the 80286 processor will be used during the estimation. The technology file for a target processor supplies information about how many clock cycles and bytes each type of generic instruction requires for that processor.
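As a minimal sketch of how a technology file could drive the estimate, the Python fragment below sums per-instruction cycles and bytes for a list of generic instructions. The opcode names (`load`, `add`, `store`), the processor labels (`proc_a`, `proc_b`) and every number are hypothetical placeholders chosen for illustration; they are not taken from the technology files in [6].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    cycles: int  # clock cycles for one generic instruction
    size: int    # bytes of program memory for one generic instruction


# Hypothetical technology files: opcode names and all numbers are
# illustrative placeholders, not values from the real files in [6].
TECH_FILES = {
    "proc_a": {"load": Cost(12, 4), "add": Cost(3, 2), "store": Cost(13, 4)},
    "proc_b": {"load": Cost(8, 4), "add": Cost(2, 2), "store": Cost(8, 4)},
}


def estimate(generic_code, processor):
    """Return (total cycles, total bytes) for straight-line generic code."""
    tech = TECH_FILES[processor]
    cycles = sum(tech[op].cycles for op in generic_code)
    size = sum(tech[op].size for op in generic_code)
    return cycles, size


# The same generic code is estimated for two processors by swapping
# the technology file only; no processor-specific compiler is involved.
code = ["load", "load", "add", "store"]
print(estimate(code, "proc_a"))
print(estimate(code, "proc_b"))
```

Retargeting in this sketch amounts to adding one more dictionary entry, which mirrors the role the technology file plays in the generic model.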
Figure 3: Deriving technology files for generic instructions (panels: generic instruction; technology file for 68020).
The technology file for each target processor can be derived from the timing and size information of the processor's instruction set. EA1 and EA2 in Figure 3 are the effective address calculation times used for the displacement memory addressing mode, which are 6 and 8 clock cycles on the 8086 and 68020 respectively. Thus, the generic instruction will take 35 and 22 clock cycles on the 8086 and 68020 processors respectively. Using a similar approach, we can derive the number of bytes each type of generic instruction will take if it is executed on the 8086 or 68020 processor. Presently, technology files for the 8086, 80286, 68000 and 68020 processors are supported in our estimator. They are derived from the timing and size information of the corresponding instruction sets given in [7, 8, 9, 10]. All technology files can be found in the appendix of [6].
Compared with the processor-specific model, the generic model has several advantages. First, the generic model does not require different compilers and estimators for different target processors. Instead, only a single compiler, a single estimator and a set of technology files are required for software estimation. Second, the generic model makes retargeting the estimator to a new processor much easier. Retargeting consists of providing a technology file for the new processor. In the processor-specific model, we would require a compiler for the new processor in addition to the timing and size information of the processor's instruction set. Therefore, the generic model is more flexible and can be extended to perform estimation for new processors/microcontrollers even if compilers are not available on the systems running the estimators. In other words, estimators based on the generic model are more portable since they can run on any machine, not just those with the necessary compilers. Finally, it is much faster to compile the specification into the generic instruction set than into those associated with specific processors, since the generic three-address instructions are free of instruction idiosyncrasies.
A disadvantage of the generic model is the lower accuracy of its estimates, largely because the generic instruction set represents only a portion of the processor's entire instruction set.
Estimation Model for Non-leaf Behaviors and Partitions
To evaluate the software implementation of a given non-leaf behavior or a partition on a specific processor, we must first flatten the hierarchy and sequentialize the specification to remove the concurrency, since our target machine is a uniprocessor. In other words, the specification needs to be mapped (flattened/sequentialized) into a program written in a language which can be directly compiled to the instruction set of the given processor. Based on the machine instructions generated, software metrics such as performance and memory size can then be computed for the specification. The software metrics obtained in such a way are accurate since they are computed from the actual implementation of the specification on the given processor. However, automatic partitioning tools will evaluate hundreds or even thousands of partitions, so this approach is too costly and time-consuming: we would have to actually implement each partition on the given processor through flattening, sequentializing and compiling in order to get the software estimates for that partition.
To get fast estimates while not sacrificing too much accuracy, the estimation model we propose combines two different approaches: an accurate approach for estimating leaf behaviors and a fast approach for estimating non-leaf behaviors and partitions. Prior to the partitioning process, each leaf behavior is compiled and estimated using the approach described in the previous section. During the partitioning process, the software estimates for each partition are constructively computed bottom-up from the estimates of the leaf behaviors. Such a combined approach is very fast because it does not involve flattening, sequentializing and compiling for each partition during the design process. It only requires some computation based on the pre-obtained estimates for the leaf behavior specifications. Therefore, this model allows rapid evaluation of different design alternatives.
Performance Estimation
There are two different ways of obtaining performance metrics: dynamic simulation and static estimation. Given a set of input data, dynamic simulation actually executes the program and records the clock cycles used in each execution. Static estimation, on the other hand, is insensitive to input data. It just computes the average number of clock cycles required to execute the program. Static estimation can yield good results if the number of loop iterations is known and the conditional branching probability can be predicted correctly. Besides, static estimation takes much less time and space than dynamic simulation. In this section, we will describe a standard static estimation technique and its application to the performance estimation for our system-level specifications.
Flow Analysis
Flow analysis is a technique used in the static estimation of performance for designs with conditional branching (including loops). Given a control flow graph $G = (V, E)$, where $V$ is the set of vertices $v_i$ and $E$ is the set of directed edges $e_{ij}$ connecting vertex $v_i$ to $v_j$ and indicating sequencing between $v_i$ and $v_j$, we wish to determine the execution frequency of each of its nodes based on the branching probabilities. By determining the execution frequencies of the nodes, we can obtain useful information about the design by associating with each node in the graph a weight representing some design parameter.
Determining the Branching Probabilities
Branching probabilities are associated with the edges in the control flow graph. The following ways have been used in our estimator: (1) Equal Probabilities: When there are $n$ edges branching out from a node, all of them are assigned a probability of $1/n$. (2) Loop-Related Probabilities: When the number of loop iterations is known, say $n$, the exit edge is assigned a probability of $1/n$ while the back edge is assigned a probability of $(n-1)/n$. (3) User-Defined Probabilities: The user can specify the branch probabilities using annotations in the input specification.
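The three assignment rules can be sketched in Python as follows. The function names and the precedence order (user-defined first, then loop-related, then equal) are assumptions made for this illustration; the text does not state how the estimator chooses among the rules.

```python
from fractions import Fraction


def equal_probabilities(successors):
    """Rule (1): each of the n outgoing edges gets probability 1/n."""
    n = len(successors)
    return {s: Fraction(1, n) for s in successors}


def loop_probabilities(exit_node, back_node, iterations):
    """Rule (2): exit edge gets 1/n, back edge gets (n-1)/n."""
    return {
        exit_node: Fraction(1, iterations),
        back_node: Fraction(iterations - 1, iterations),
    }


def branch_probabilities(successors, user_defined=None, loop=None):
    """Pick a rule for one node (precedence order is an assumption).

    user_defined: dict successor -> probability, rule (3).
    loop: tuple (exit_node, back_node, iterations), rule (2).
    """
    if user_defined is not None:
        return dict(user_defined)
    if loop is not None:
        return loop_probabilities(*loop)
    return equal_probabilities(successors)


print(branch_probabilities(["then", "else"]))
print(branch_probabilities(["exit", "body"], loop=("exit", "body", 10)))
```

Exact fractions are used so that the probabilities leaving a node always sum to exactly 1.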
Determining the Node Execution Frequencies
The execution frequency of a node is defined as the average number of times that the node will be executed in a single execution of the graph. It is determined by the following procedure: (1) Determine the branch probabilities using one of the methods outlined above. (2) Add a start node, $s$, preceding the first node in the graph. Its execution frequency, $F(s)$, is set to 1 since this node is executed exactly once whenever the control flow graph is executed. (3) Let $P(e_{ij})$ be the branch probability of the edge between $v_i$ and $v_j$. The execution frequency $F(v_j)$ for any node $v_j$ is formulated in the following equation:

$$F(v_j) = \sum_{v_i} P(e_{ij}) \, F(v_i) \qquad (1)$$

where the sum runs over all immediate predecessor nodes $v_i$ of $v_j$. (4) Solve the set of equations formulated in step 3 to obtain the individual node execution frequencies. There are a variety of methods, such as Gaussian elimination, LU decomposition and Cholesky's method, which can be used for solving a set of linear equations. We have selected the Gaussian elimination method in our estimator.
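The procedure can be illustrated with a short Python sketch that builds the linear system from equation (1) and solves it by Gauss-Jordan elimination over exact fractions. The graph encoding, the function names and the three-node example are assumptions for illustration, not the estimator's actual data structures.

```python
from fractions import Fraction


def gaussian_solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with row swaps.

    Raises StopIteration if the system is singular (no pivot found).
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        pivot_val = m[col][col]
        m[col] = [x / pivot_val for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]


def execution_frequencies(nodes, edges, start):
    """nodes: list of names; edges: {(src, dst): probability}.

    Encodes F(start) = 1 and, for every other node v_j,
    F(v_j) - sum_i P(e_ij) F(v_i) = 0.
    """
    index = {v: k for k, v in enumerate(nodes)}
    n = len(nodes)
    a = [[Fraction(0)] * n for _ in range(n)]
    b = [Fraction(0)] * n
    for v in nodes:
        a[index[v]][index[v]] = Fraction(1)
    b[index[start]] = Fraction(1)
    for (src, dst), p in edges.items():
        a[index[dst]][index[src]] -= Fraction(p)
    return dict(zip(nodes, gaussian_solve(a, b)))


# A loop body iterated n = 4 times: back edge 3/4, exit edge 1/4.
nodes = ["s", "body", "exit"]
edges = {
    ("s", "body"): Fraction(1),
    ("body", "body"): Fraction(3, 4),
    ("body", "exit"): Fraction(1, 4),
}
freq = execution_frequencies(nodes, edges, "s")
print(freq)
```

In this example the loop body receives frequency 4 and the exit node frequency 1, which matches the intuition behind the loop-related probabilities.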
In situations where an edge has an associated weight (as is the case when conditions exist on arcs between nodes), we may need to know the execution frequency of each edge, $F(e_{ij})$. The execution frequency of an edge is the same as that of its target node; therefore we have $F(e_{ij}) = F(v_j)$.
The VHDL code in each leaf behavior is divided into basic blocks. The details of obtaining basic blocks from VHDL code and compiling VHDL code to the generic instruction set are described in [6]. By using the execution times of generic instructions specified in the technology file, the execution time (i.e. weight) of each basic block is computed by summing that of each generic instruction in that basic block.
Determining the
The basic block structure of a leaf behavior is mapped to an equivalent control flow graph $G$. The weight of each edge in $G$ is the same as the weight of the condition associated with the corresponding edge in the basic block structure. By applying flow analysis, the execution time for the leaf behavior can be computed using equation 2 in section 3.1.3.
Once execution times have been estimated for each of the leaf behaviors, we can merge the performance estimates of the leaf behaviors to yield the performance estimate of the next higher behavior in the hierarchy. To estimate the performance for $B_{parent}$, a non-leaf behavior with sequential sub-behaviors, we create a control flow graph $G = (V, E)$ for its sub-behaviors, whose performance estimates are already known. For each of the sub-behaviors $B_i$ of $B_{parent}$, there exists a corresponding vertex $v_i$ in the graph $G$. For every transition arc between two sub-behaviors $B_i$ and $B_j$, the set $E$ has a directed edge $e_{ij}$ from vertex $v_i$ to vertex $v_j$ in $G$. After the control flow graph model has been constructed for the sub-behaviors, we can apply flow analysis (section 3.1.3) to obtain the performance of the parent behavior $B_{parent}$.
If a behavior at any level of the hierarchy has concurrent sub-behaviors, the execution time of that behavior is computed as the sum of the execution times of its sub-behaviors, since the concurrent sub-behaviors have to be sequentialized to execute on a uniprocessor. It must be mentioned here that a non-leaf behavior may have a descendant sub-behavior which does not have a stop dot in SpecChart. In this case the behavior will never finish executing, and consequently the execution time returned for that behavior is an arbitrarily large number.
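The bottom-up merge can be sketched in Python as below. Several points are assumptions of this illustration rather than statements from the text: the `Behavior` class and its fields are hypothetical, the time of a sequential parent is taken as the frequency-weighted sum of its children's times (the usual flow-analysis form, standing in for the equation that is not reproduced here), edge weights are ignored, and `math.inf` stands in for the "arbitrarily large number".

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Behavior:
    name: str
    kind: str = "leaf"   # "leaf", "sequential", or "concurrent"
    time: float = 0.0    # pre-computed estimate, used for leaves only
    children: List["Behavior"] = field(default_factory=list)
    # child name -> execution frequency obtained from flow analysis
    freq: Dict[str, float] = field(default_factory=dict)


def execution_time(b):
    """Merge estimates bottom-up through the behavior hierarchy."""
    if b.kind == "leaf":
        return b.time
    times = [execution_time(c) for c in b.children]
    # A descendant that never finishes makes the parent never finish.
    if any(math.isinf(t) for t in times):
        return math.inf
    if b.kind == "concurrent":
        # Concurrent children are sequentialized on a uniprocessor.
        return sum(times)
    # Sequential: weight each child by its execution frequency
    # (frequency defaults to 1 when none is given).
    return sum(b.freq.get(c.name, 1.0) * t for c, t in zip(b.children, times))


a = Behavior("A", time=100.0)
b = Behavior("B", time=50.0)
q = Behavior("Q", kind="sequential", children=[a, b], freq={"A": 1.0, "B": 4.0})
r = Behavior("R", time=70.0)
p = Behavior("P", kind="concurrent", children=[q, r])
print(execution_time(q), execution_time(p))
```

Because only the pre-computed leaf estimates and the frequencies are touched, evaluating a new partition costs a tree traversal rather than a recompilation.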
Memory Size Estimation
Given a behavior, the goal of memory size estimation is to determine how much program memory (i.e. the number of bytes used to store the compiled program representing the behavior) is needed.
The size of each type of generic instruction is specified in the technology file for the target processor. Based on the size of each generic instruction, the program-memory size of each basic block is computed as the sum of the sizes of the generic instructions in that basic block.
Generally, compilers optimize the object code by using different optimization techniques such as global optimization, loop optimization and register allocation. Users can invoke those optimizations by passing special flags to the compiler. In the experiments of Figure 4, we have disabled those optimizations during the C compilation since our generic compiler does not use optimization heuristics.
The next design we experimented with is a real-time embedded medical system used to measure a patient's bladder volume. Its SpecChart description is given in [6].
There are two timing constraints imposed on the system. One is associated with the leaf behavior DATAACQUISITION, which requires that the acquisition and conversion of 1000 data points take place in less than 1 ms. The other is associated with the non-leaf behavior ONESCAN, which requires that the maximum time between two scans, i.e. the time used to execute MOTOR-CONTROL, DATAACQUISITION, VOLUME-COMPUTATION and DATASTORAGE, is 1 second. We have estimated behavior DATAACQUISITION and behavior ONESCAN using our estimator. The estimates are compared with the actual results obtained from the (non-optimized) target machine instructions (Figure 5).
instructions to some extent, there is always a difference between the estimates obtained from our estimator and the results obtained directly by compiling to target machine instructions. We can expect more accurate estimation by enhancing the technology files to include more information about the target machine instruction sets. Currently our generic instruction set has limited formats, especially in terms of memory addressing modes. If we can incorporate more memory addressing modes in our compiler to close the gap between the generic instructions and the target machine instructions, we can expect better estimation results. However, this may increase the complexity of the generic instructions. Increasing the complexity of the generic instructions may increase the compiling time and hence increase the total estimation time. Therefore, more studies are needed to determine what constitutes a suitable generic instruction set.
