To aid in the functional partitioning of a system into interacting hardware and software components, fast yet accurate estimations of hardware size are crucial. We introduce a technique for obtaining such estimates in two orders of magnitude less time than previous approaches without sacrificing substantial accuracy, by incrementally updating a design model for a changed partition rather than reestimating entirely.
Introduction
The designer of an embedded system is often faced with the challenge of partitioning the system functionality for implementation among hardware and software components, such as among ASICs and processors. New approaches for such partitioning start with a simulatable specification of system functionality, and then explore numerous possible partitions of functions from that specification among the hardware and software components [1]. We therefore need a method to determine, among other things, the hardware size of a set of functions, to see if that set will meet constraints.
There are several possible methods. The most accurate would be to synthesize a design for the set of functions, but such an approach requires too much time if we wish to examine more than a few possible partitions, as is usually the case. To overcome this limitation, several research efforts incorporate a hardware size estimator [2, 3, 4, 5]. In essence, those estimators roughly synthesize a design for the given functions, while omitting time-consuming synthesis tasks such as logic optimization. Such estimators based on a design model therefore have the advantage of obtaining fairly accurate estimates in just a few seconds. At times, though, we wish to examine hundreds or thousands of partitions using an iterative-improvement partitioning algorithm such as simulated annealing, thus requiring faster estimators. Approaches that use iterative-improvement algorithms have until now used abstract-weight-based estimators, in which an abstract weight is assigned to each function, and then a hardware "cost" for a given partition is obtained quickly just by adding all the weights of functions in hardware [6, 7, 8]. Alternatively, they assume an already scheduled input and estimate hardware size as the size of required functional units [9]. These approaches have the advantage of obtaining very rapid estimations.
In developing our system that partitions an unscheduled specification among hardware and software components, we desired to use an estimator based on a design model in order to obtain accuracy, but we also wanted to use iterative-improvement algorithms to explore many possibilities. Since previous estimation methods had not addressed both goals, we needed to develop a new method. Towards this end, we observed that iterative-improvement algorithms make only a few changes between iterations, so the change between one partition's design and the next one is incremental. For example, Figure 1(a) shows two functions and Figure 1(b) shows a partial datapath for one of those functions. When we add the other function, the datapath only requires one additional multiplexer, as shown in Figure 1(c).
We took advantage of this incremental change by developing a data structure (representing an incrementally modifiable design model) and an algorithm that can quickly provide the basic design parameters needed by a hardware-size estimator. As we shall see, we were able to do this by assuming that the granularity at which we partition the specification is at the procedural level (sometimes called the process or task level), as is the case in many new functional partitioning techniques [7, 8, 10, 11, 12, 13, 14, 15]. Our contribution is the development of this incremental hardware-size estimation method, consisting of a new data structure and algorithm, that achieves the advantages of both classes of previous approaches, namely accuracy and speed.
This paper is organized as follows. In Section 2, we describe the design-model that we use for hardware-size estimation, a model adopted from previous design-based estimators. In Section 3, we describe a new data structure that captures not only the design model, but also the contribution of each function to the design. In Section 4, we detail an algorithm for updating, in constant time, this data structure when a function is moved. In Section 5, we summarize our results that show the speed of our method.
Estimation design model
The design model we use to obtain hardware-size estimates for a set of functions is a control-unit/datapath (CU/DP) model [2, 16], as shown in Figure 2. The size for the model can be computed as the sum of the following: functional-unit and storage-unit size (including registers, register files and memories), multiplexer size, state-register size, control-logic size, and wiring size. Each is a function of one or more basic design parameters. For example, the functional-unit and storage-unit size may be a function of all size_list values, the multiplexer size may be a function of srcs_list, the state-register size may be a function of states, the control-logic size may be a function of states, ctrl, and active_list, and the wiring size may be a function of units, size_list, and wires. The details of these functions are beyond the scope of this paper; any function that uses the above design parameters could be used in conjunction with our method, and more than one form of each function may exist to support estimation for various technologies. To avoid going into specific details of those functions, in this article we assume that a function HwSize exists, which uses the above parameters and which returns a hardware size with the appropriate units for the particular hardware technology, such as square microns, transistors, gates, or combinational logic blocks.
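The decomposition described above can be written compactly as a sum of five terms. The symbols $S_{\mathrm{fu/su}}$, $S_{\mathrm{mux}}$, $S_{\mathrm{sreg}}$, $S_{\mathrm{ctrl}}$ and $S_{\mathrm{wire}}$ are labels introduced here only for illustration, and the argument lists simply follow the "may be a function of" examples in the text; the paper leaves the actual form of each term open.

```latex
\begin{aligned}
\mathit{HwSize} ={}& S_{\mathrm{fu/su}}(\mathit{size\_list})
  + S_{\mathrm{mux}}(\mathit{srcs\_list})
  + S_{\mathrm{sreg}}(\mathit{states}) \\
  &+ S_{\mathrm{ctrl}}(\mathit{states}, \mathit{ctrl}, \mathit{active\_list})
  + S_{\mathrm{wire}}(\mathit{units}, \mathit{size\_list}, \mathit{wires})
\end{aligned}
```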
Incrementally-updatable data structure
We now describe the data structure that allows us to represent a roughly-synthesized design using the above model, while at the same time allowing us to incrementally modify that design in constant time when a functional object is added or deleted. We assume that the specification consists of a single process with hundreds or thousands of sequential statements, including loops, branches, and procedure calls. We later describe a simple extension for multiple processes. We shall hereon refer to the specification pieces to be distributed among hardware and software components as functional objects.
Preprocessed information
Our first task will be to create a hardware design that implements the entire set of functional objects, and to determine the contribution that each functional object makes to that design (in order to support incremental change). Since this information can be obtained before creating a partition, we call it preprocessed information.
To obtain the hardware design, we must allocate a set of functional units (FUs) and storage units (SUs), bind operations and data values to FUs and SUs, and schedule operations into control steps (not necessarily in the given order). The algorithms that we use to do these tasks should match the algorithms that will be used to synthesize the final hardware, in order to obtain the highest accuracy; if the algorithms are not known, then we can use default algorithms instead.
To determine the contribution of each functional object to the design, we first consider the datapath. We create a list of FUs for each functional object. For example, if a functional object uses two adder units, then we append two adder units to that object's FU list. We create a similar list of SUs for each functional object. Turning to multiplexers, we note that the size of a multiplexer in front of an FU, SU, or datapath output is determined by how many possible sources (i.e., SU or FU outputs, or datapath inputs) may need to be input to that FU, SU, or datapath output. Thus, for each functional object, we associate a list of sources contributed by that object to each FU, SU and datapath output. Turning our attention to the control unit, we record the number of possible states for each functional object, and the number of states that each FU, SU and datapath output is active.
At this point, an assumption that we wish to make explicit is that a functional object represents a coarse-grained computation, such as a process, procedure, or a large basic block, as also assumed in many new functional partitioning techniques (see Section 1). The larger the number of statements in each object, the more accurate the estimations will be, since inter-object synthesis optimizations would then play a smaller role in the overall design. The reason is that we assume that the tasks of scheduling, allocation and binding for two functional objects will be roughly the same whether we consider each object independently or together, because in our approach, we perform those tasks on each functional object independently. On the other hand, lower levels of granularity, such as small basic blocks, would result in less accurate estimates since current synthesis techniques (such as path-based scheduling and percolation scheduling) optimize across basic block boundaries. Figure 3 shows the preprocessed information created for each procedure of the example in Figure 1(a). Note that this example is trivially small, but that it sufficiently demonstrates our technique.
More formally, the data structure of preprocessed information, or PP, is a four-tuple $\langle O, DPI, DPO, U \rangle$. DPI is a set of datapath inputs $\{dpi_1, dpi_2, \ldots\}$, and DPO is a set of datapath outputs $\{dpo_1, dpo_2, \ldots\}$. U is a set of available functional and storage units $\{u_1, u_2, \ldots\}$. Each unit $u_i$ is a pair $\langle size, ctrl \rangle$, where size is a natural number representing the size of the unit (in transistors, gates, or whatever unit is assumed by the estimation functions), and ctrl is a natural number representing the number of control lines on that unit.
O is a set of functional objects $\{o_1, o_2, \ldots, o_n\}$. Each functional object $o_i$ is a pair $\langle states, dsts \rangle$. states is a natural number representing the number of possible control states for the functional object. dsts is a set of destinations, $\{dst_1, dst_2, \ldots\}$, written to by the object. A destination $dst_i$ is a three-tuple $\langle id, srcs, active \rangle$. The destination identifier id is the particular FU, SU or DP-output that $dst_i$ represents, so $id \in DPO \cup U$.
active is a natural number representing the number of states for which the destination is active for this object. srcs is a set of sources, $\{src_1, src_2, \ldots\}$, that the object assigns to this destination. Each $src_i$ is either a datapath input or a unit, so $src_i \in DPI \cup U$.
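The tuple definitions above map naturally onto record types. The following is a minimal sketch, assuming Python dataclasses; the class names, the helper `is_well_formed`, and all example values (unit names, sizes, state counts) are hypothetical and are not taken from Figure 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A functional or storage unit u_i = <size, ctrl>."""
    size: int  # size in whatever unit the estimation functions assume
    ctrl: int  # number of control lines on the unit


@dataclass(frozen=True)
class ObjDst:
    """A destination <id, srcs, active> written by one functional object."""
    id: str          # name of an FU, SU or datapath output
    srcs: frozenset  # names of datapath inputs or units assigned to it
    active: int      # number of states in which the object activates it


@dataclass(frozen=True)
class FuncObj:
    """A functional object o_i = <states, dsts>."""
    states: int
    dsts: tuple


@dataclass(frozen=True)
class PP:
    """Preprocessed information <O, DPI, DPO, U>."""
    O: dict          # object name -> FuncObj
    DPI: frozenset   # datapath input names
    DPO: frozenset   # datapath output names
    U: dict          # unit name -> Unit


def is_well_formed(pp):
    """Check id in DPO ∪ U and src in DPI ∪ U for every destination."""
    dst_ids = pp.DPO | set(pp.U)
    src_ids = pp.DPI | set(pp.U)
    return all(
        dst.id in dst_ids and dst.srcs <= src_ids
        for obj in pp.O.values()
        for dst in obj.dsts
    )


# Hypothetical example values, for illustration only.
pp = PP(
    O={"Procedure1": FuncObj(states=3, dsts=(
        ObjDst("add", frozenset({"in1", "in2"}), 3),
        ObjDst("A", frozenset({"add"}), 3),
    ))},
    DPI=frozenset({"in1", "in2"}),
    DPO=frozenset({"out1"}),
    U={"add": Unit(size=100, ctrl=0), "A": Unit(size=50, ctrl=1)},
)
```

Keeping the per-object records immutable reflects the fact that PP is computed once, before partitioning, and only read afterwards.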
Design information
Given the preprocessed information PP, we can focus on creating a design for the subset of functional objects that have been mapped to hardware. We need to assemble the datapath and controller. Specifically, the datapath FUs required to implement the hardware objects are determined as the union of the FUs needed by each object. For example, if one object requires units u1 and u2, and another requires units u1 and u3, then the datapath FUs will be u1, u2 and u3. The datapath SUs are determined similarly. The multiplexer sizes are determined for each destination by taking the union of the sources contributed to that destination by each object. The number of states in the controller is simply the sum of the number of states of the functional objects (remember that this is the number of possible states, rather than a measure of the start-to-finish performance), and the number of states that each datapath control line is active is the sum of those contributed by each object. We store the information in a table. For example, Figure 4 shows this information for the case when Procedure1 from Figure 1(a) is the only functional object mapped to hardware.
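The union-and-sum rules above can be illustrated with a from-scratch (non-incremental) assembly. This is only a sketch under assumed conventions: the dictionary layout, the function name `assemble`, and the state and source values are hypothetical; only the u1/u2/u3 unit example comes from the text.

```python
def assemble(hw_objects):
    """Build design parameters for the objects mapped to hardware.

    Units and per-destination sources are combined by set union;
    controller states and per-destination active counts are summed.
    """
    units = set()
    srcs = {}      # destination -> union of contributed sources
    active = {}    # destination -> total number of active states
    states = 0
    for obj in hw_objects:
        units |= obj["units"]
        states += obj["states"]
        for dst, (dst_srcs, dst_active) in obj["dsts"].items():
            srcs.setdefault(dst, set()).update(dst_srcs)
            active[dst] = active.get(dst, 0) + dst_active
    return units, srcs, states, active


# One object needs u1 and u2, another needs u1 and u3 (as in the text);
# the state counts and sources are made-up illustration values.
obj_a = {"units": {"u1", "u2"}, "states": 3,
         "dsts": {"u1": ({"in1", "in2"}, 3)}}
obj_b = {"units": {"u1", "u3"}, "states": 1,
         "dsts": {"u1": ({"in1", "in3"}, 1)}}

units, srcs, states, active = assemble([obj_a, obj_b])
```

Rebuilding the design this way costs time proportional to the number of hardware objects for every partition, which is the cost the incremental structure is meant to avoid.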
From the above discussion, we see that values for the basic parameters for the hardware size functions have been determined, so the size can now be computed by calling HwSize, as shown in Figure 4. We will now define our data structure that maintains the design information in an incrementally updatable manner. The design information data structure D is a five-tuple $\langle usize, units, ctrl, wires, dsts \rangle$. The first four items are natural numbers. usize represents the total size of all the FUs, SUs and multiplexers. units represents the total number of all FUs, SUs and multiplexers. ctrl represents the total number of control lines between the controller and the datapath. wires represents the number of wires in the datapath.
The fifth item, dsts, is a set of all destinations in the design, $\{dst_1, dst_2, \ldots\}$. Each destination $dst_i$ is a three-tuple $\langle id, src\_cons, active \rangle$. The identifier id indicates the unit or DP output that this destination represents, so $id \in DPO \cup U$. active is a natural number that indicates the total number of states that this destination is active. src_cons is a set $\{src\_con_1, src\_con_2, \ldots\}$, where each $src\_con_i$ is a pair $\langle src, con \rangle$. src is a source, from the preprocessed information PP, that must be input to the destination $dst_i$. con is a set of functional objects (i.e., $con \subseteq PP.O$), where each functional object requires a path from the source to the destination. In other words, the objects are the contributors of the source to the destination.
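As a rough illustration of the design information D, the sketch below stores each destination's source/contributor pairs as a mapping from source name to the set of contributing object names. The class names, the helper `contributors`, and the example values are assumptions made for illustration; they are not the contents of Figure 4.

```python
from dataclasses import dataclass, field


@dataclass
class DesignDst:
    """A design destination <id, src_cons, active>."""
    id: str
    # source name -> set of functional objects that contribute that source
    src_cons: dict = field(default_factory=dict)
    active: int = 0


@dataclass
class Design:
    """Design information D = <usize, units, ctrl, wires, dsts>."""
    usize: int = 0   # total size of FUs, SUs and multiplexers
    units: int = 0   # total number of FUs, SUs and multiplexers
    ctrl: int = 0    # control lines between controller and datapath
    wires: int = 0   # wires in the datapath
    dsts: dict = field(default_factory=dict)  # destination id -> DesignDst


def contributors(design, dst_id, src):
    """Return the objects that need a path from src to destination dst_id."""
    dst = design.dsts.get(dst_id)
    if dst is None:
        return set()
    return set(dst.src_cons.get(src, ()))


# Hypothetical design with a single object in hardware.
D = Design(usize=150, units=2, ctrl=1, wires=3)
D.dsts["add"] = DesignDst(
    "add", {"in1": {"Procedure1"}, "in2": {"Procedure1"}}, active=3)
D.dsts["A"] = DesignDst("A", {"add": {"Procedure1"}}, active=3)
```

Recording contributors per source is what makes deletion possible later: a source can be dropped from a destination only when its contributor set becomes empty.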
Relative to the number n of functional objects, the complexity of building PP and D is O(n). For the industry examples that we have examined, n has ranged from 15 to 120.
The complexity is usually dominated by the scheduling algorithm, whose complexity may range from $O(c^2 \log c)$ to $O(c^3)$, where there are c nodes in the functional object's dataflow graph.
Constant-time update algorithm
We now turn our attention to the movement of functional objects between the hardware and software components, or more specifically, to the addition or deletion of a functional object to or from hardware. We define an algorithm to update the design information D for an addition of a functional object o to hardware. The algorithm uses a procedure SeekDesignDst which returns the design destination that refers to the same unit as the given object destination. A procedure NewDesignDst creates a new design destination for the given object destination. Procedures Size and Ctrl return the size and number of control lines, respectively, for the given object destination's unit, returning 0 if the destination corresponds to a datapath output. A procedure SeekSrc_con returns the design's source/contributors item that corresponds to the given source. A procedure NewSrc_con creates a new source/contributors item for the given source. A procedure GetMuxSize determines the size of the multiplexer(s) needed in front of a particular destination for the given sources. The size is dependent on the number of sources and on whether there are one or two inputs on the destination (e.g., an adder has two inputs so no multiplexer is needed for two sources, whereas an incrementer with two sources does need a multiplexer since it has only one input). If there is more than one input on the destination, we assume the sources are uniformly distributed among those inputs. The algorithm performs the following for each destination written in o. First, it adds that destination to the design if it doesn't already exist. Such an addition requires updating the number and size of DP units, and the number of control lines between the CU and DP. Second, it unions the sources of that destination with the corresponding design destination's sources. If such a union adds sources, then we must update the number of DP wires and the size of the destination's multiplexer.
If previously no multiplexer was needed, but after adding a source a multiplexer is needed, then the number of DP units is incremented. Third, the algorithm increases the number of states for which the destination must be asserted by the number of states for which o asserts that destination. After repeating the above three steps for all destinations, the algorithm updates the number of possible controller states by the number of states for o.
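The three steps described above can be sketched as follows. This is an illustrative reading of the prose, not the paper's own listing: the dictionary layout, the multiplexer cost model in `get_mux_size`, the choice of one wire per new source, the separate `states` counter, and all example values are assumptions.

```python
def get_mux_size(num_srcs, num_inputs, cost_per_mux_input=4):
    """Hypothetical GetMuxSize: sources are spread uniformly over the
    destination's inputs; an input with more than one source needs a mux."""
    base, extra = divmod(num_srcs, num_inputs)
    total = 0
    for i in range(num_inputs):
        fan_in = base + (1 if i < extra else 0)
        if fan_in > 1:
            total += fan_in * cost_per_mux_input  # placeholder cost model
    return total


def add_object(D, obj, units, num_inputs):
    """Update design information D in place when obj moves to hardware."""
    for dst in obj["dsts"]:
        d = D["dsts"].get(dst["id"])                   # SeekDesignDst
        if d is None:                                  # NewDesignDst
            d = {"src_cons": {}, "active": 0}
            D["dsts"][dst["id"]] = d
            size, ctrl = units.get(dst["id"], (0, 0))  # 0 for DP outputs
            D["usize"] += size
            D["ctrl"] += ctrl
            if dst["id"] in units:
                D["units"] += 1
        k = num_inputs.get(dst["id"], 1)
        old_mux = get_mux_size(len(d["src_cons"]), k)
        for src in dst["srcs"]:
            if src not in d["src_cons"]:               # NewSrc_con
                d["src_cons"][src] = set()
                D["wires"] += 1          # assumption: one wire per source
            d["src_cons"][src].add(obj["name"])
        new_mux = get_mux_size(len(d["src_cons"]), k)
        D["usize"] += new_mux - old_mux
        if old_mux == 0 and new_mux > 0:
            D["units"] += 1              # a multiplexer is now required
        d["active"] += dst["active"]
    D["states"] += obj["states"]


# Hypothetical example data: unit name -> (size, ctrl), and input counts.
units = {"add": (100, 0), "A": (50, 1), "B": (50, 1)}
num_inputs = {"add": 2, "A": 1, "B": 1}
proc1 = {"name": "Procedure1", "states": 3, "dsts": [
    {"id": "add", "srcs": {"in1", "in2"}, "active": 3},
    {"id": "A", "srcs": {"add"}, "active": 3}]}
proc2 = {"name": "Procedure2", "states": 1, "dsts": [
    {"id": "add", "srcs": {"in1", "in3"}, "active": 1},
    {"id": "B", "srcs": {"add"}, "active": 1}]}

D = {"usize": 0, "units": 0, "ctrl": 0, "wires": 0, "states": 0, "dsts": {}}
add_object(D, proc1, units, num_inputs)
after_proc1 = {k: D[k] for k in ("usize", "units", "ctrl", "wires", "states")}
add_object(D, proc2, units, num_inputs)
```

The work done per call depends only on the number of destinations and sources of the moved object, not on how many objects are already in hardware, which is the property the constant-time argument relies on.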
The algorithm for deleting a functional object is complementary to that for adding an object; we have omitted it for brevity. Figure 5 illustrates several changes we make to the design information when adding Procedure2 to the hardware. First, we create a new destination B. Second, we increase the adder's active states from 3 to 4. Third, we associate a new source with the adder, resulting in the need for another multiplexer. We then update the parameters to the HwSize function accordingly.
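Since the deletion algorithm is omitted, the following is one possible complementary sketch rather than the authors' version. It assumes the same hypothetical dictionary layout and multiplexer cost model as an addition routine would use, assumes every destination has at least one source, and removes a destination once no object contributes any source to it.

```python
def get_mux_size(num_srcs, num_inputs, cost_per_mux_input=4):
    """Hypothetical GetMuxSize with sources spread uniformly over inputs."""
    base, extra = divmod(num_srcs, num_inputs)
    total = 0
    for i in range(num_inputs):
        fan_in = base + (1 if i < extra else 0)
        if fan_in > 1:
            total += fan_in * cost_per_mux_input  # placeholder cost model
    return total


def remove_object(D, obj, units, num_inputs):
    """Update design information D in place when obj leaves hardware."""
    for dst in obj["dsts"]:
        d = D["dsts"][dst["id"]]
        k = num_inputs.get(dst["id"], 1)
        old_mux = get_mux_size(len(d["src_cons"]), k)
        for src in dst["srcs"]:
            cons = d["src_cons"][src]
            cons.discard(obj["name"])
            if not cons:                 # no contributor left for this source
                del d["src_cons"][src]
                D["wires"] -= 1
        new_mux = get_mux_size(len(d["src_cons"]), k)
        D["usize"] -= old_mux - new_mux
        if old_mux > 0 and new_mux == 0:
            D["units"] -= 1              # the multiplexer is no longer needed
        d["active"] -= dst["active"]
        if not d["src_cons"]:            # destination is now unused
            size, ctrl = units.get(dst["id"], (0, 0))
            D["usize"] -= size
            D["ctrl"] -= ctrl
            if dst["id"] in units:
                D["units"] -= 1
            del D["dsts"][dst["id"]]
    D["states"] -= obj["states"]


# Hypothetical design state with two procedures in hardware.
units = {"add": (100, 0), "A": (50, 1), "B": (50, 1)}
num_inputs = {"add": 2, "A": 1, "B": 1}
proc2 = {"name": "Procedure2", "states": 1, "dsts": [
    {"id": "add", "srcs": {"in1", "in3"}, "active": 1},
    {"id": "B", "srcs": {"add"}, "active": 1}]}
D = {"usize": 208, "units": 4, "ctrl": 2, "wires": 5, "states": 4, "dsts": {
    "add": {"src_cons": {"in1": {"Procedure1", "Procedure2"},
                         "in2": {"Procedure1"},
                         "in3": {"Procedure2"}}, "active": 4},
    "A": {"src_cons": {"add": {"Procedure1"}}, "active": 3},
    "B": {"src_cons": {"add": {"Procedure2"}}, "active": 1}}}

remove_object(D, proc2, units, num_inputs)
```

The contributor sets are what make this safe: the source in1 survives the removal because Procedure1 still needs it, while in3 and destination B disappear.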
The algorithm executes in constant time, if we assume that the number of destinations per object is roughly constant for a given example. This assumption holds unless each functional object accesses every data item and external port. However, since functional objects (such as procedures) serve to modularize a specification, such a situation is highly unlikely. Instead, each object will likely access a small (constant) number of data items and ports.
Multiple processes can be handled with a straightforward extension. Since we assume each process will use its own controller and datapath, we simply keep separate design information for each process, and we then add the sizes of all CU/DPs in hardware. The additional processes therefore do not affect the constant-time characteristics of the estimation. We could also handle partitioning among multiple hardware components (such as among ASICs or among blocks on an ASIC) simply by maintaining separate design information for each ASIC.
Results
We have implemented a design-based incremental hardware-size estimator using the previously described data structure and algorithm, and have incorporated it into a functional partitioning tool. The input is a VHDL behavioral description, and the output a refined description containing partition detail. The implementation consists of approximately 16,000 lines of C code. The functional partitioning tool has been released to over 20 companies as part of the SpecSyn system-design environment, and has been used in an industry design (a fuzzy-logic controller) involving five ASICs, and tested on numerous other industry examples including an interactive TV processor and a missile-detection system. The tool is presently being applied to several industry examples in various companies.
The speed of our incremental estimation data structure and algorithm on several examples is illustrated in Figure 6. Examples include a microwave-transmitter controller (mwt), a telephone answering machine (ans), the DRACO peripheral interface (draco), and an Ethernet coprocessor (ether). To provide a notion for the size of each example, we indicate the number of functional objects to be partitioned, the number of specification lines, and the final size of one hardware ASIC (in gates) after partitioning, as estimated by our HwSize function. Incidentally, the first three examples consisted of one process, while the Ethernet coprocessor example contained 14 processes. For each example, we first measured the time to build the preprocessed information. We then applied the group migration algorithm [17], using the cost function specified in [10]. Shown in the table are the number of moves that the algorithm examined, and the CPU time (in seconds on a Sparc1) required to update the estimation information and obtain a new hardware size estimate for each move. Note that the time-per-move is roughly the same across all four examples, demonstrating that computation is indeed done in constant time. More importantly, note the extremely fast time-per-move shown. The last two columns demonstrate the increased speed compared with a previous design-based estimator [16]. That estimator requires roughly 3 seconds for a given partition, which is the same magnitude of time required by several other design-based estimators [2, 3]. Multiplying by the number of moves yields a predicted estimation time; note the unacceptably long times for the large number of moves examined. The last column shows the speedup of our estimator over those previous ones, ranging from 426 to 755; such speedup is obtained while using the same design model.
We also conducted experiments to determine the effect of performing scheduling and allocation on each behavior individually, rather than considering all behaviors at the same time as in previous, slower design-based estimators. For the ether and ans examples, we inlined all subroutines; for the mwt example, such inlining generated an enormous output due to the many nested levels of subroutine calls, so we instead considered a subset of the specification consisting of four subroutines. We then applied the same scheduling and allocation tool to those inlined versions. Results of estimating all-hardware implementations are summarized in Figure 7; since we are considering all behaviors, the numbers are likely the worst case. Note that the number of states States, the number of control lines Ctrl, and the functional unit and multiplexer component areas Comp_area are quite close, and the total sizes computed by the HwSize function have an average error of only 7%. We also compared these estimates with what would have been obtained using previous weight-based techniques: we performed scheduling and allocation for each behavior, computed the size of each behavior, and then summed those sizes over the entire design. Note that the weight-based estimates are extremely inaccurate, with an average error of 80%. Those estimates greatly underestimate the control and routing area, while overestimating the total component area. Weight-based techniques assume that the behaviors combine in a linear manner, but the behaviors in fact share many components, and the PLA and routing sizes grow non-linearly (hence, there is no simple factor by which we can multiply the weights to improve the accuracy over all cases).
It is difficult to compare our estimates with implementation values. The reason is that there are many possible implementations for a given set of functions that trade off speed and size, so choosing the implementation to compare with is hard. A second difficulty is that because we are dealing with large, industry examples, obtaining a real implementation takes many months. A third difficulty lies in the fact that there are many possible HwSize functions that can be used in conjunction with our design parameters. Nonetheless, we compared our size estimations for part of the answering machine example with an implementation. The implementation was developed by a designer who hand-designed the datapath and hand-specified the controlling state-machine; the state-machine was then implemented with the KISS synthesis tool. We estimated 7804 gates, while the implementation consisted of 5372 gates. A second rough comparison can be made with an industry design of a fuzzy-logic controller. We estimated 129,000 gates, whereas the actual implementation consisted of five 20,000-gate FPGAs. We hope to obtain more comparisons as the tool is used in more designs.
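The relationship between the last two columns can be written out as follows. The symbols $m$ (number of moves), $t_{\mathrm{prev}}$ (time per partition for the previous estimator) and $t_{\mathrm{inc}}$ (our time per move) are introduced here for illustration, and the per-move times derived at the end are only what the reported speedups would imply if the 3-second figure were applied uniformly; they are not values read from Figure 6.

```latex
\begin{aligned}
T_{\mathrm{prev}} &\approx t_{\mathrm{prev}} \cdot m, \qquad t_{\mathrm{prev}} \approx 3\ \mathrm{s}, \\
\mathit{speedup} &= \frac{T_{\mathrm{prev}}}{t_{\mathrm{inc}} \cdot m} = \frac{t_{\mathrm{prev}}}{t_{\mathrm{inc}}}, \\
426 \le \mathit{speedup} \le 755 \;&\Longrightarrow\;
\frac{3\ \mathrm{s}}{755} \approx 4.0\ \mathrm{ms} \;\lesssim\; t_{\mathrm{inc}} \;\lesssim\; \frac{3\ \mathrm{s}}{426} \approx 7.0\ \mathrm{ms}.
\end{aligned}
```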
Conclusions
We have introduced a method to rapidly estimate hardware size during functional partitioning. The method includes a data structure representing a design model, and an algorithm that incrementally updates that data structure during functional partitioning, thus yielding rapidly-computed design parameters that can be input to any number of hardware estimation functions. The method is the first to achieve both advantages of being based on a design model, and of computing estimates in constant time; previous approaches achieved one advantage or the other, but not both. The method therefore enhances the usefulness of hardware as well as hardware/software functional partitioning tools in real design environments. The general method of developing an incrementally-updatable design model for estimation purposes may be applicable to many other estimation problems, such as estimation of hardware or software power consumption, hardware or software execution time, and bus bitrates. Thus, the method may become increasingly significant as design effort shifts towards system-level design exploration.
