In order to satisfy cost and performance requirements, digital signal processing and telecommunication systems are generally implemented with a combination of different components, from custom-designed chips to off-the-shelf processors. These components vary in their area, performance, programmability and so on, and the system functionality is partitioned amongst the components to best utilize this tradeoff. However, for performance-critical designs, it is not sufficient to implement only the critical sections as custom-designed high-performance hardware; it is also necessary to pipeline the system at several levels of granularity. We present a design flow and an algorithm to first allocate software and hardware components, and then partition and pipeline a throughput-constrained specification amongst the selected components. This is performed to best satisfy the throughput constraint at minimal hardware cost. Our ability to incorporate partitioning with pipelining at several levels of granularity enables us to attain high-throughput designs, and also distinguishes our work from previously proposed hardware/software partitioning algorithms.
Introduction
Digital systems, especially in the domain of digital processing and telecommunications, are immensely complex. In order to deal with the high complexity, increased time-to-market pressures, and a set of possibly conflicting constraints, it is now imperative to involve design automation at the highest possible level. This "highest possible level" may vary with the design, but given that an increasing number of designs now contain a combination of different component types, such as general-purpose processors, DSP (digital signal processing) cores, and custom-designed ASICs (application-specific integrated circuits), we consider the highest level of design to be the one involving the selection and interconnection of such components. We refer to this as system-level design.
A system is best composed of different component types because the components have different characteristics, which may be targeted at satisfying different constraints. Off-the-shelf processors offer high programmability, lower design time, and a comparatively lower cost and lower performance than an equivalent ASIC implementation. On the other hand, ASICs are more expensive to design and fabricate, but offer comparatively higher performance. Thus, for a given system, these components are selected such that the performance-critical sections are performed rapidly on ASICs, and the less critical sections, or the sections that require higher programmability, are performed on the processors.
Our work addresses the synthesis of throughput-constrained systems. In order to meet the throughput constraints of these systems, it is not sufficient to perform only the critical sections in hardware; it is also necessary to pipeline the design. Pipelining divides the design into concurrently executing stages, thus increasing its effective data rate. However, the increased concurrency can only be handled with an increased number of resources (or with faster resources). Hence, in designing a "cost-optimal" pipeline using hardware and software components, it is necessary to integrate the selection of the components, the functional partitioning of the system amongst the components, and lastly, the division of the system into pipe stages.
Given a specification of such a throughput-constrained system, we present a design flow and a set of algorithms to determine (1) an allocation of system-level components, (2) a functional partition, and (3) a pipeline to implement the system at minimal hardware cost. Our work supports pipelining at four different levels of the design, namely the system, behavior, loop and operation level. The integrated component selection, partitioning, and multi-level pipelining enable us to design high-throughput hardware/software systems.
The rest of the paper is organized as follows. In the next section, we describe previous work in the area of hardware/software codesign and position it with respect to our work in hardware/software pipelining. In Section 3, we define the problem of hardware/software partitioning and pipelining, and in Sections 4 and 5 we describe our model and algorithm, respectively. We provide experimental results in Section 7 and end with our conclusions in Section 8.
Previous work
Over the past five years, several hardware/software codesign systems have been developed, some in industry but mainly in academia. These systems are based on different design methodologies and hence concentrate on different aspects of the hardware/software codesign problem. For instance, SpecSyn [6], Cosyma [5], and Vulcan II [9] are codesign tools that focus on hardware and software estimation and on the functional partitioning of a given specification amongst hardware and software components, such as ASICs and processors. They differ in the level of granularity of partitioning (process vs. statement blocks) and in their partitioning approach (starting with all software vs. all hardware). Other synthesis systems such as Ptolemy [14] [13] and Chinook [4] focus on different aspects of the hardware/software codesign problem. Ptolemy provides an environment for specifying, simulating and prototyping DSP applications. It concentrates on hardware/software co-simulation and on code generation, rather than on hardware/software partitioning and estimation. Chinook concentrates on the synthesis of hardware and software interfaces, and like Ptolemy, it requires the user to manually specify the hardware and software partition.
Our work in hardware/software codesign focuses on performance-constrained functional partitioning for dataflow-dominated systems. However, in addition to performing spatial partitioning, that is, dividing the specification into a hardware and a software space, we also perform temporal partitioning, in which we divide the specification into pipe stages that execute concurrently, so as to achieve high data rates. Spatial and temporal partitioning are interdependent problems, since the division of tasks into pipe stages depends on their execution delay, which, in turn, depends on the resources (hardware and/or software) used to implement the tasks. Hence, our algorithms consider both problems together, and determine a hardware/software partition as well as the number of pipe stages and an implementation for each pipe stage.
In comparing our work with the systems mentioned above, we have, in general, extended the design space explored by these systems by allowing designs to be pipelined at the system and the task level. Systems such as SpecSyn, Cosyma, and Vulcan II assume that the tasks in the system execute sequentially, in a non-pipelined fashion. Furthermore, they also assume that the operations within each task execute sequentially. In our designs, not only can the system be pipelined, but the operations within each task may also be pipelined. Hence, our work extends current algorithms by performing hierarchical pipelining and by so doing, it achieves partitions with high throughput values that are unattainable by other existing partitioning and estimation algorithms.
After a specification has been pipelined and partitioned amongst hardware and software components, we still need to perform the tasks of interface refinement, code generation and co-simulation. Thus, while our work is directly related to codesign systems performing partitioning, it is orthogonal to systems such as Ptolemy and Chinook that concentrate on the tasks carried out after partitioning and estimation.
Problem definition
Our problem of hardware/software partitioning and pipelining may be defined as follows:
Given:
1. A specification of the system as a control flow graph (CFG) of behaviors or tasks.
2. A hardware library containing functional units characterized by a three-tuple <type, cost, delay>.
3. A software/processor library containing a list of processors characterized by a four-tuple <type, clock speed, dollar cost, metrics file>.
4. A clock constraint and a throughput constraint for the complete specification.
Determine:
1. An implementation type (either software or hardware) for every behavior.
2. The estimated area for each hardware behavior, as well as the total hardware area for the complete specification.
3. The processor to be used for each software behavior, as well as the total number of processors for the complete specification.
4. A division of the control flow graph into pipe stages of delay no more than the given throughput constraint.
Such that:
1. Constraints on throughput are satisfied, and
2. Total hardware area (for the given clock) is minimized.
The throughput constraint specifies the difference in the arrival time (in nanoseconds) of two consecutive input samples. We also refer to this time as the PS (pipe stage) delay, since this would be the required delay of a pipe stage in the design, if it were to be pipelined. Note that the number of stages in the design is determined by our algorithm, and it depends on the critical path of the CFG and on the PS delay constraint.
The example in Figure 1 illustrates the problem defined above. As input we have a control flow graph of behaviors, a hardware and a software library, and a PS delay (throughput) and clock constraint. The control flow graph is derived from a SpecCharts specification, details of which are provided in Section 5.1. The nodes in the control flow graph represent behaviors and the arcs represent control dependencies. Thus, behaviors B and C can begin only when behavior A has completed, and similarly behavior D can begin only when both behaviors B and C have completed. Each behavior contains a sequence of VHDL statements that represents computation done on variables. Data dependencies are not explicitly depicted in the control flow graph. The clock constraint of 10 ns indicates that each of the hardware behaviors will operate at a clock frequency of 100 MHz, and this clock value is used while estimating the area and delay of the hardware behaviors. Finally, the PS delay constraint indicates that a new sample of input data will be introduced every 4000 ns and thus, if the design is pipelined, each pipe stage should have a delay of no more than 4000 ns.
The output consists of a pipelined and partitioned CFG where every behavior has been mapped to either hardware or software and the graph has been divided into pipe stages of delay no more than 4000 ns. Every hardware behavior is associated with an estimate of its execution time and the number and type of components (selected from the hardware library) needed to obtain that execution time. For instance, behavior E has a throughput of 4000 ns and requires 3 instances of Mpy1, 1 instance of Mpy2 and 2 instances of Add2, bringing the total area to 430 gates. Similarly, every software behavior is associated with a processor from the software library, and its execution time on that processor. For instance, behavior A implemented on the Pentium processor, has an execution time of 3100 ns.
Finally, the CFG has also been partitioned into three pipe stages such that the throughput of the system is 4000 ns, that is each pipe stage has a delay of no more than 4000 ns. The hardware and software partitioning and pipelining has been done with the aim of satisfying the throughput constraint at minimal hardware cost. This scheduling and pipelining information is represented in the System Pipeline diagram, which depicts all the pipe stages and the execution schedule of behaviors within each pipe stage.
Before we describe our algorithm to solve this problem, note our assumptions on how resources are shared amongst behaviors:
Two software behaviors may share the same processor, irrespective of the pipe stages they execute in. For instance, in Figure 1, behaviors A and D are mapped to the same Pentium processor, even though they execute in different pipe stages. Behavior A executes in stage 1 from time 0 ns till time 3100 ns and behavior D executes in stage 2 from time 3100 ns till time 4000 ns. Since all pipe stages execute concurrently, the only criterion that needs to be satisfied is that the behaviors execute at non-overlapping times.
Two hardware behaviors may not share resources. Thus, even though both behaviors C and E use Mpy1, they may not share the same multiplier. Note that this assumption is not a limitation of our model, but of our algorithm at this point.

Having defined our problem, we now introduce our system architecture. The system consists of one or more processors, one or more ASICs (application-specific integrated circuits), and one or more memory chips that all communicate over one or more buses (Figure 2). The memory stores data that needs to be transferred between pipe stages as well as any globally defined data that may be accessed by multiple processors and/or ASICs. In this paper, we assume that all hardware behaviors are mapped onto one ASIC. After the pipelining and the partitioning, this ASIC may be further partitioned into smaller ASICs.

The entire system can be viewed as a sequence of communicating pipelined FSMDs (Finite State Machines with Datapath). As an example, consider the MPEG1 (Motion Pictures Expert Group) decoder system shown in Figure 3. It consists of six pipelined FSMDs, namely the Decoder, Dequantizer, IDCT (Inverse Discrete Cosine Transform), Sum, Predictor and Display. The system is pipelined both at the behavior or block level (that is, within an FSMD) and at the system level (that is, between consecutive FSMDs).

In order to maintain the flow of data in the pipeline, it is assumed that the consumption and production rates of a data stream are the same. Thus, in Figure 4, the Dequantizer block produces a sample of the 64-element Array A, say, every 4000 ns and the IDCT consumes it at the same rate. We also assume that all pipelined FSMDs require a fixed number of samples per input before they can begin processing data. After each producer FSMD has produced the required number of samples, it sends a START signal to its consumer FSMD indicating that the consumer may now begin execution. An FSMD can start after all its producers have sent START signals.
Finally, we assume that each pipelined FSMD has sufficient memory to store two sets of samples per input: one is the sample currently being used and the other is the sample currently being produced by the preceding FSMD. This approach of storing two samples per input, also known as double buffering, represents the maximum amount of memory required for communication; hence, it is an expensive solution. The size of this memory may be reduced by determining the sequence in which data is produced and consumed and hence determining the number of variables that are alive at any given time. The size may be further reduced by transforming the dataflow so that the read and write sequences match as closely as possible.
Algorithm
Having defined the problem and model, we now present the algorithm for the combined problem of hardware/software partitioning and pipelining, an overview of which is presented in Figure 5.

Figure 5: Overview of the algorithm for hardware/software partitioning and pipelining. (Flowchart of Steps 1 to 6, including the boxes "Determine initial cheapest processor allocation" and "Determine area of resources for all hardware behaviors"; it terminates either with "Minimal hw area pipeline & schedule achieved" or with "Resources not fast enough".)
Given a SpecCharts [15] specification, hardware and software libraries, a throughput constraint, and a clock constraint, the first step consists of deriving the control flow graph from the given specification. A type (hardware or software) is then determined for every behavior in the CFG. This determination is based on the assumption that a software implementation is always less costly than an equivalent hardware implementation for a given behavior; hence, our algorithm attempts to execute as many behaviors as possible in software, using as many processors as needed. A behavior will be executed in hardware only if no processor is able to satisfy the constraint.
In the next step, we determine the number and type of hardware resources to be used for each of the hardware behaviors, and similarly we determine the total number and type of processors to be used by all the software behaviors. We then schedule and pipeline the control flow graph, that is, we determine a pipe stage and a time slot within the pipe stage for each behavior to execute in. If we cannot determine a valid schedule and pipeline, we increase the speed and/or the number of processors and repeat the scheduling and pipelining step. This is repeated till the constraints are satisfied; in the worst case, it may be repeated till there are as many processors as software behaviors in the CFG.
In the next few sections we describe each of the six steps of our algorithm.
5.1
The input specification is given in SpecCharts [15], a language that supports a hierarchy of VHDL behaviors or processes. A behavior in SpecCharts may be of one of three types: concurrent, sequential or leaf. A concurrent behavior contains other behaviors that execute concurrently, a sequential behavior contains sequentially executing subbehaviors, while a leaf behavior contains only VHDL code and no other behaviors within it. Sequencing between behaviors is defined using arcs known as TOC (transition on completion) arcs.
An example of a SpecCharts specification is given in Figure 6. The TOP behavior contains three sequentially executing behaviors, A, BCDE and F. These behaviors execute one after the other as indicated by the TOC arcs. The statement A: (TOC, true, BCDE) indicates that, on completion of A, BCDE will always (or unconditionally) be executed. This is as opposed to the statement B: (TOC, value > 0, D) (TOC, value <= 0, E); which indicates that, on completion of B, if value > 0 then behavior D will be executed, else if value <= 0 behavior E will be executed. Behaviors A and F are of type "leaf" and, hence, contain only VHDL code. On the other hand, behavior BCDE is concurrent, containing behaviors BDE and C that execute concurrently. Behavior BDE is, in turn, sequential, containing behaviors B, D and E.
The equivalent CFG for this SpecCharts code is shown on the right side of the figure. The CFG is obtained by essentially flattening out the behavioral hierarchy. The circular nodes represent leaf behaviors and the arcs represent control dependencies. Thus, behavior F cannot begin until C, D and E have finished executing. The triangular node is a fork node that is used to represent the mutual exclusivity of behaviors. Thus, behaviors D and E are mutually exclusive. This information is used while scheduling and pipelining the CFG.
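The flattening described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary encoding of behaviors, the `flatten` function, and the rule that a sequential behavior starts with its first listed child are assumptions made here for illustration, and the fork node that records mutual exclusivity is omitted.

```python
from itertools import product


def leaf(name):
    # Hypothetical encoding of a leaf behavior (VHDL body not modeled).
    return {"kind": "leaf", "name": name}


def flatten(behavior):
    """Return (entry_leaves, exit_leaves, edges) of a behavior hierarchy."""
    if behavior["kind"] == "leaf":
        return {behavior["name"]}, {behavior["name"]}, set()
    parts = {c["name"]: flatten(c) for c in behavior["children"]}
    edges = set().union(*(p[2] for p in parts.values()))
    if behavior["kind"] == "concurrent":
        entries = set().union(*(p[0] for p in parts.values()))
        exits = set().union(*(p[1] for p in parts.values()))
        return entries, exits, edges
    # Sequential: each TOC arc links every exit leaf of the source
    # to every entry leaf of the destination.
    for src, dst in behavior["arcs"]:
        edges |= set(product(parts[src][1], parts[dst][0]))
    first = behavior["children"][0]["name"]  # assumed initial subbehavior
    sources = {src for src, _ in behavior["arcs"]}
    exits = set().union(*(p[1] for n, p in parts.items() if n not in sources))
    return parts[first][0], exits, edges


# Hierarchy of the Figure 6 example as described in the text.
BDE = {"kind": "sequential", "name": "BDE",
       "children": [leaf("B"), leaf("D"), leaf("E")],
       "arcs": [("B", "D"), ("B", "E")]}
BCDE = {"kind": "concurrent", "name": "BCDE", "children": [BDE, leaf("C")]}
TOP = {"kind": "sequential", "name": "TOP",
       "children": [leaf("A"), BCDE, leaf("F")],
       "arcs": [("A", "BCDE"), ("BCDE", "F")]}

entries, exits, edges = flatten(TOP)
print(sorted(edges))
```

In the resulting edge set, F has the predecessors C, D and E, which agrees with the statement that F cannot begin until those three behaviors have finished.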
5.2
Step 2: Hardware/software partition
After deriving the CFG, we determine the execution time of all the leaf behaviors on all the available processors in the software library. This gives us the PET (processor execution time) table, an example of which is shown in Figure 7 for the given CFG and software library in Figure 1. The execution time is determined by invoking a software estimator [7] [11], which translates the VHDL code of the given behavior to a list of three-address generic instructions, and then obtains the number of instruction cycles required to execute each of the three-address instructions from the metrics file associated with the processor. Summing the instruction cycles over all the generic instructions yields the total number of cycles, from which the execution time in ns is obtained by multiplying this sum by the clock cycle time.
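The arithmetic of this estimate can be sketched in a few lines of Python. The instruction names, the cycle counts in `metrics`, and the function name are hypothetical placeholders; the actual estimator of [7] [11] and the format of its metrics file are not described here.

```python
def estimate_exec_time_ns(instructions, cycles_per_instr, clock_period_ns):
    """Sum the cycles of each generic instruction and scale by the clock period."""
    total_cycles = sum(cycles_per_instr[op] for op in instructions)
    return total_cycles * clock_period_ns


# Hypothetical metrics for one processor: cycles per generic instruction.
metrics = {"load": 2, "add": 1, "mul": 4, "store": 2}
# Hypothetical three-address code for one behavior.
code = ["load", "load", "mul", "add", "store"]

# 2 + 2 + 4 + 1 + 2 = 11 cycles at a 10 ns clock period.
print(estimate_exec_time_ns(code, metrics, 10.0))  # 110.0
```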
[Figure 7: Processor Execution-Time (PET) table]
Once we have constructed the PET table, determining the hardware/software partition is a simple task since our algorithm attempts to execute as many as possible behaviors on processors. Thus any behavior that has an execution time less than the given throughput constraint on at least one processor can be executed in software and only those behaviors that have an execution time greater than the throughput constraint on all processors need be executed in hardware. For instance, for the example in Figure 7 behaviors A, B and D can be executed as software and behaviors C and E need to be executed in hardware.
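The partitioning rule can be sketched as follows. The PET values below are hypothetical stand-ins for Figure 7 (only the 6000 ns and 3100 ns entries for behavior A are quoted in the text), and treating an execution time equal to the constraint as feasible is an assumption of this sketch.

```python
# Hypothetical PET table in ns; only A on the 68000 (6000) and
# A on the Pentium (3100) are taken from the text.
PET = {
    "A": {"Pentium": 3100, "PowerPC": 3800, "68000": 6000},
    "B": {"Pentium": 1500, "PowerPC": 2000, "68000": 3000},
    "C": {"Pentium": 5200, "PowerPC": 6100, "68000": 9000},
    "D": {"Pentium": 900, "PowerPC": 1200, "68000": 2000},
    "E": {"Pentium": 7000, "PowerPC": 8000, "68000": 12000},
}


def partition(pet, ps_delay_ns):
    """Software if at least one processor meets the constraint, else hardware."""
    software, hardware = [], []
    for behavior, times in pet.items():
        # Assumption: equality with the constraint still counts as feasible.
        if min(times.values()) <= ps_delay_ns:
            software.append(behavior)
        else:
            hardware.append(behavior)
    return software, hardware


print(partition(PET, 4000))  # (['A', 'B', 'D'], ['C', 'E'])
```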
We would like to mention that though our algorithm depends heavily on the software estimator, we will not be describing it any further in this paper. This estimator has been the topic of several papers [7] [11] and it has been implemented within the same environment (SpecSyn [6]) as our tool.
Step 3: Initial processor allocation
Once we have built the PET table and have determined the hardware/software partition, our next task is to select an initial processor allocation for all the software behaviors. Our algorithm's primary goal is to perform the partitioning and pipelining at minimal hardware cost; however, its two secondary goals are to minimize the total cost of processors and to minimize the number of pipe stages in the design, with the former goal having a higher priority. We determine the final processor allocation by starting with the cheapest processor allocation and then increasing the cost of the processor allocation till the throughput constraints are satisfied. The initial processor set thus consists of the cheapest processor on which all the software behaviors have an execution time that is less than the throughput constraint.
For the PET table and CFG in Figure 7, the initial processor allocation is just one instance of a PowerPC processor. Though the 68000 processor is cheaper than the PowerPC (dollar costs are given in the software library in Figure 1), it cannot be used for behavior A, which has an execution time of 6000 ns on it (while the throughput constraint is 4000 ns). Thus, the cheapest processor that can be used for the software behaviors A, B and D is the PowerPC. This processor allocation may or may not satisfy the throughput constraint once we schedule and pipeline the behaviors, but we are assured that we are starting the search from the cheapest possible allocation.
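A minimal sketch of this selection step is given below. The dollar costs and most PET entries are hypothetical (Figures 1 and 7 are not reproduced here); only the ordering "68000 cheaper than PowerPC" and the 6000 ns entry for behavior A come from the text.

```python
# Hypothetical PET entries (ns) for the software behaviors only.
PET = {
    "A": {"Pentium": 3100, "PowerPC": 3800, "68000": 6000},
    "B": {"Pentium": 1500, "PowerPC": 2000, "68000": 3000},
    "D": {"Pentium": 900, "PowerPC": 1200, "68000": 2000},
}
# Hypothetical dollar costs; the text only states that the 68000 is cheapest.
COST = {"68000": 50, "PowerPC": 120, "Pentium": 200}


def initial_processor(pet, software, dollar_cost, ps_delay_ns):
    """Cheapest processor on which every software behavior meets the constraint."""
    feasible = [p for p in dollar_cost
                if all(pet[b][p] <= ps_delay_ns for b in software)]
    return min(feasible, key=dollar_cost.get) if feasible else None


print(initial_processor(PET, ["A", "B", "D"], COST, 4000))  # PowerPC
```

With these numbers the 68000 is rejected because behavior A takes 6000 ns on it, so the search starts from a single PowerPC, as in the example above.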
Step 4: Estimating hardware resources
By the time we come to this step we know the behaviors that are to be implemented in hardware. However, we know neither the execution time of these behaviors nor their area. Since our aim is to pipeline and schedule the CFG with the minimal hardware area, we need to know both the throughput (or execution time) and the area of all hardware behaviors. Now, in order to minimize the cost of all the hardware behaviors, we would like them to execute as slowly as possible. A slower or longer execution time results from the use of slower (hence, cheaper) and fewer components. Furthermore, we know that each hardware behavior should have a PS delay of at most the given PS delay constraint; thus, while estimating the resources for the hardware behavior, we attempt to achieve a PS delay equivalent to the constraint. For instance, for the CFG in Figure 7(b) we know that behaviors C and E are to be implemented in hardware, since their execution time on all processors exceeds 4000 ns. For these behaviors, we thus have to determine the number of pipe stages and the schedule and resources for every pipe stage required to satisfy the PS delay constraint of 4000 ns at minimal cost.
We briefly summarize the estimation algorithm here. For more details please refer to [1]. As input to the algorithm we have a behavior containing a sequence of VHDL statements, a hardware library, and a PS delay and clock constraint. Each behavior consists of a set of possibly hierarchical loops, each with a different loop body and a different number of iterations. We first distribute the throughput constraint amongst these loops, and depending on the execution time of each loop, we partition the loops into different pipe stages. For each loop body, we then determine the number of pipe stages and the resources that satisfy the throughput constraint that was propagated down to it. For this we use a combination of force-directed scheduling and list scheduling, both modified for pipelines. Details of this step are provided in [2]. It is important to note that we pipeline each behavior at two levels: across loops and within a loop body. This hierarchical pipelining enables us to achieve high throughput values.
The output of the algorithm, which is the behavior's area and throughput (which may be less than or equal to the given constraint) is then entered into a Hardware Execution Time (HET) Table, an example of which is provided in Figure 9 (c). This table is then used in the next step of the algorithm.
5.5
Step 5: Scheduling and pipelining the CFG
By the time we reach this step we have the following information:
1. A control flow graph $CFG(V, E)$ in which each behavior node has a designated type, hardware or software.
2. An execution time, $T_v$, for every behavior node $v \in V$, whether in hardware or software.
3. A processor allocation $P$ (number and type of processors) selected from the software library.
4. The PS delay constraint, $T$.
Our aim is to determine the schedule and pipeline for the CFG that will satisfy the PS delay constraint, $T$. If we think of time as a grid $Time(x, y)$, where $x$ is a continuum over $[0, T]$ representing time in ns, and $y \in \{1, \ldots, \infty\}$ represents the number of pipe stages, then the problem may be defined as follows:
Determine:
1. For every behavior $v \in V$, a pipe stage and a time slot within that pipe stage.
2. For every software behavior $v_s \in V$, a processor $p \in P$.
3. For every processor $p \in P$, a utilization list containing pairs $(x_{p1}, x_{p2})$, indicating time intervals when the processor is utilized.
Such that the precedence and processor-availability conditions hold. In simpler words, we assign each behavior, $v$, to a pipe stage and to a time slot within a pipe stage such that the predecessor behaviors of $v$ finish their execution either in a previous stage, or in the same stage before the behavior $v$ begins its execution. Furthermore, if $v$ is a software behavior, then we have to make sure that the processor we select to execute it is not used by any other behavior during the time interval in which it is executing behavior $v$. This assignment of behaviors to pipe stages is done so as to satisfy the PS delay constraint with the given processor allocation.
Algorithm overview
We use a version of the well-known list-scheduling algorithm [10], outlined in Figure 8. We start by determining the longest completion time from all nodes till all output nodes, assuming that the fastest processor is used to execute all the software nodes. This completion time gives the priority one node has over another during scheduling. The completion time of a node is a direct indication of its criticality, and hence, the higher the completion time, the higher its priority.
Next, we reset the utilization list of all processors to empty. As mentioned previously, the utilization lists contain the time slots when the processor is busy executing behaviors. We then form a list of ready nodes, that is, nodes whose predecessors have already been scheduled. These nodes are prioritized in order of decreasing delay from the node till any output. For instance, if the delay from node(a) till an output is greater than the delay from node(b) till an output, then node(a) will have a higher priority than node(b). After forming the ready nodes' list and the utilization list for the processors, we find the "best time slot" for every node in the ready list, starting with the first (or highest priority) node. For nodes that are labeled as software, the "best time slot" is also dependent on the processor selected to implement that node. The procedure to determine the "best time slot" is explained in detail in the next section. After a time slot (and processor) has been assigned for the node, we mark it as "scheduled", and then remove it from the ready list. This step is repeated for every node on the ready list, until the ready list is empty. Then, we update the ready list by going through the CFG and finding all unscheduled nodes with scheduled predecessors. Steps 6 to 17 are repeated for all nodes on the ready list, and the outermost loop (Steps 5 to 19) is repeated till all nodes in the CFG are scheduled.
The heart of the algorithm lies in Steps 7 and 15, which we now explain in greater detail. 
Determining the best time slot
Every node should ideally begin executing immediately after all its predecessors have completed execution, in the same pipe stage as its last predecessor (see footnote 1). However, this may not be feasible for two reasons. Firstly, the throughput constraint may be violated and secondly, in the case of a software node, a processor for that time slot may not be available. In both these situations it may then be necessary to schedule the node in the next pipe stage.
We explain the procedure for determining the best time slot by using the example in Figure 9. Given are a CFG with 4 nodes (3 in software, 1 in hardware), and software and hardware execution time tables, in parts (a), (b) and (c) of the figure, respectively. Also given is a PS delay constraint of 10 ns. The algorithm starts by finding the longest delay from each node till any output node, assuming execution on the fastest processor. Thus, starting from the bottom of the CFG, node D has a delay of 2 ns (we'll drop the ns from now on) assuming execution on processor P1, node C has a delay of 10 (8+2), node B of 15 (10+5), and node A of 17 (10+7), giving node A the highest priority and node D the lowest. The initial prioritized ready list, then, consists of nodes A and B in that order.
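The priority computation in this example can be reproduced with a short sketch. Figure 9 is not reproduced here, so the edge structure below (A and B feed C, which feeds D) is inferred from the quoted sums 17 = 10 + 7, 15 = 10 + 5 and 10 = 8 + 2; the names `SUCC`, `FASTEST` and `priority` are illustrative.

```python
from functools import lru_cache

# Successors, inferred from the delays quoted in the text.
SUCC = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
# Execution time of each node on its fastest resource (P1, or hardware for C).
FASTEST = {"A": 7, "B": 5, "C": 8, "D": 2}


@lru_cache(maxsize=None)
def priority(node):
    """Longest delay from the start of `node` to the end of any output node."""
    return FASTEST[node] + max((priority(s) for s in SUCC[node]), default=0)


order = sorted(SUCC, key=priority, reverse=True)
print({n: priority(n) for n in order})  # {'A': 17, 'B': 15, 'C': 10, 'D': 2}
```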
We start by finding the best time slot for node A. We have a choice of three processors and three corresponding time slots: from 0 to 7 on processor P1, 0 to 9 on processor P2, and 0 to 11 on processor P3. This is indicated in the Completion Time Table in Figure 9(d). Each entry in the table gives the completion time and the pipe stage for a specific behavior on a specific processor. Note that the completion time table is not built before we start scheduling, but an entry for a node is appended to the list when the node is at the top of the ready list, next in line to be scheduled.

Footnote 1: The last predecessor is the predecessor that finishes its execution last, in the last pipe stage, amongst all predecessors. For instance, if Predecessor 1 finishes execution in stage 1, time 100 ns, and Predecessor 2 finishes execution in stage 2, time 50 ns, then Predecessor 2 is the last one.
Of the three choices for node A we select processor P1, since that gives us the earliest completion time. Node A is thus scheduled in the first pipe stage, starting at time 0 and ending at time 7. The schedule as well as the utilization list of processor P1 are updated as shown in Figure 9(f) and (e), respectively. Next, the completion times for node B on all the processors are calculated, and processor P2 is selected as the best since it gives us the earliest completion time. Note that even though processor P1 is faster and has a lower execution time than P2, it is utilized by A from 0 to 7, and thus it offers a completion time of 12 (7 + 5), which violates the PS delay constraint of 10.
Next we come to node C, which is in hardware, and thus does not need to contend with any other node for its resources. The earliest we can schedule C is at time 0 in stage 2. In stage 1 we would get a completion time of 16 (8 + 8), which is a violation of the throughput constraint.

Next, the ready list contains node D, which cannot be scheduled prior to stage 2 since its predecessor has been scheduled in stage 2. The options for node D are: (1) on P1 from time 8 to 10 in stage 2, (2) on P2 from time 8 to 11 in stage 2, (3) on P3 from time 0 to 4 in stage 3. Of these three choices, we select the first, since it gives us the earliest completion time in the same pipe stage as its predecessor, node C. Though P3 gives us an earlier completion time, it introduces a third pipe stage, which is undesirable. Recall that one of the secondary goals of our algorithm is to minimize the total number of pipe stages. The final schedule and pipeline are shown in Figure 9(f) and (g).
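The walk-through above can be replayed with a small Python sketch of the slot search. It is a simplified reading of the procedure, not the algorithm of Figure 8: candidates that would exceed the PS delay are discarded rather than tabulated, and a node is tried in its last predecessor's stage and then at the start of the next stage. The execution time of B on P2 is inferred from the quoted 16 (8 + 8), and its time on P3 is purely hypothetical.

```python
T = 10  # PS delay constraint of the Figure 9 example

SW_TIME = {
    "A": {"P1": 7, "P2": 9, "P3": 11},
    "B": {"P1": 5, "P2": 8, "P3": 9},  # P2 inferred, P3 hypothetical
    "D": {"P1": 2, "P2": 3, "P3": 4},
}
HW_TIME = {"C": 8}
PREDS = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
ORDER = ["A", "B", "C", "D"]  # decreasing delay-to-output priority


def earliest_start(busy, ready, length):
    """Earliest start >= ready that avoids busy intervals and ends by T."""
    t = ready
    for lo, hi in sorted(busy):
        if t + length <= lo:
            break  # the node fits in the gap before this interval
        t = max(t, hi)
    return t if t + length <= T else None


def place(stage, ready, length, busy):
    """Try the last predecessor's stage first, then the start of the next one."""
    for s, r in ((stage, ready), (stage + 1, 0)):
        start = earliest_start(busy, r, length)
        if start is not None:
            return s, start, start + length
    return None


def schedule():
    busy = {p: [] for p in ("P1", "P2", "P3")}  # processor utilization lists
    slot = {}
    for v in ORDER:
        # Last predecessor: the latest (stage, finish time) pair.
        ready = max(((slot[u][0], slot[u][2]) for u in PREDS[v]), default=(1, 0))
        if v in HW_TIME:
            options = [(place(*ready, HW_TIME[v], []), None)]
        else:
            options = [(place(*ready, SW_TIME[v][p], busy[p]), p) for p in busy]
        options = [(s, p) for s, p in options if s is not None]
        if not options:
            raise RuntimeError(f"no slot for {v}: modify the processor allocation")
        # Prefer the earliest stage, then the earliest completion time.
        (stage, start, end), proc = min(options, key=lambda o: (o[0][0], o[0][2]))
        if proc is not None:
            busy[proc].append((start, end))
        slot[v] = (stage, start, end, proc)
    return slot


for node, s in schedule().items():
    print(node, s)
```

Under these assumptions the sketch places A on P1 in stage 1 (0 to 7), B on P2 in stage 1, C in hardware in stage 2 (0 to 8), and D on P1 in stage 2 (8 to 10), matching the choices described in the text. Ordering candidates by stage before completion time is what makes D prefer P1 in stage 2 over P3 in stage 3.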
5.6 Step 6: Modifying the processor allocation
We now come to the last step of our main algorithm outlined in Figure 5. When scheduling the CFG in the previous step, if we do not find a feasible time slot for a software node, then we need to modify the processor allocation to either use faster processors or increase the number of processors.
We use a simple, almost exhaustive method of modifying the processor allocation. We start with one instance of the slowest or cheapest processor, and in every iteration we replace it with the next faster (more expensive) processor in the library. When we have tried scheduling with the fastest processor, we start with 2 instances of the slowest processor and then increase the processor speed in every iteration. We stop increasing the number of processors when it equals the number of software nodes in the CFG, since one CFG node cannot be implemented by two processors. Thus, our first priority is to keep the number of processors as small as possible, and our second priority is the processor cost. For instance, we would favor 1 faster processor over 2 slower processors even if the 2 together cost less than the 1. This is because with every additional processor, the extra communication delays and interface costs could far outweigh the savings from choosing the slower processors.
As an example, let's assume that we have 3 software nodes in the CFG and 3 different processor types as shown in the library of Figure 10(a). The different processor allocations and their corresponding costs are shown in the Processor Modification Table in Figure 10(b).
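The enumeration order described in this step can be sketched as a small Python generator. The library contents (`slow`, `medium`, `fast` and their costs) are hypothetical and are not the values of Figure 10. The text fixes the order only for the processor count and for single-type allocations; trying mixed-type allocations in order of increasing total cost within each count is an assumption of this sketch, as is the premise that a cheaper processor is a slower one.

```python
from itertools import combinations_with_replacement

# Hypothetical processor library: name -> cost (cheaper is assumed slower).
LIBRARY = {"slow": 10, "medium": 25, "fast": 60}


def allocation_sequence(library, num_software_nodes):
    """Yield candidate allocations as tuples of processor names.

    First priority: fewer processors. Second priority (an assumption for
    mixed-type allocations): lower total cost.
    """
    by_cost = sorted(library, key=library.get)
    for count in range(1, num_software_nodes + 1):
        combos = combinations_with_replacement(by_cost, count)
        # sorted() is stable, so equal-cost allocations keep their order.
        for alloc in sorted(combos, key=lambda a: sum(library[p] for p in a)):
            yield alloc


sequence = list(allocation_sequence(LIBRARY, num_software_nodes=3))
for alloc in sequence[:5]:
    print(alloc, sum(LIBRARY[p] for p in alloc))
```

In the scheduling loop, the first allocation in this sequence for which every software node finds a feasible time slot would be kept; the count never exceeds the number of software nodes, matching the stopping rule above.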
Extensions
Our present model and algorithm serve as a basic framework for hardware/software partitioning and pipelining. This framework may be enhanced to incorporate a variety of other cost functions and partitioning algorithms, and we believe such extensions can be handled without significant changes in our design flow and algorithm. Important amongst our current limitations and possible extensions are:
1. Interface cost and delay. Interface cost refers to the cost of buses, bus interfaces or memory components that may be required for the selected ASICs and processors in the system to communicate with each other. Similarly, interface time refers to the delay that is incurred while sending messages or data from one system component to another.
The time taken to transfer data between behaviors (and across pipe stages) that have been implemented on different components may be comparable to the computation time of the behaviors. Clearly, this cannot be ignored, especially for DSP systems that process large amounts of data. We are currently developing algorithms to estimate interface cost and communication delays, and then to update the cost functions in the hardware/software partitioning and pipelining algorithm to account for this added cost and delay. We expect that the design flow and steps of the algorithm will remain largely unaffected by this addition.
2. Partitioning amongst multiple ASICs. It might be necessary to partition the hardware component amongst multiple ASICs when it exceeds certain area and/or pin limits. Existing algorithms [8], [17] may be invoked at every iteration of the partitioning and pipelining step of the algorithm (Step 5). Alternatively, if the pipeline structure is going to remain unchanged by the ASIC partitioning, this may be performed after our algorithm determines the hardware/software partition and pipeline.
3. Hardware/software cost function. Currently, the goal of our algorithm is to minimize cost, and the hardware/software partitioning is solely based on this. This cost function can be replaced by another, where perhaps both hardware and software cost appear as some weighted normalized measures. Once again our basic framework and approach can easily handle this change.
We would like to stress that numerous possible extensions and enhancements can be made; however, the underlying steps of the algorithm are not expected to vary.
Experimental results
We have implemented the algorithms for hardware/software partitioning and pipelining described in this paper and have integrated them within SpecSyn [6]. The experiments described in this section were conducted on a SUN Sparc 5 workstation.
We present results of two experiments. In the first, we compare manual designs of the MPEG I decoder system against those obtained by our algorithm. In the second experiment, we explore the design space of three examples and demonstrate the benefits of hierarchical pipelining.
Experiment 1: The MPEG system
The system-level SpecChart specification of the MPEG I video decoding algorithm, derived from the ISO Standards, consists of about 3000 lines of code. It contains 15 leaf behaviors representing different parts of the MPEG functionality, such as the dequantization, prediction, cosine transform and so on. In this experiment we use a reduced specification with 12 behaviors. We exclude 3 control-dominated behaviors (for variable length decoding, and for storing and displaying decoded frames) since our algorithm targets data-dominated rather than control-intensive designs. The 3 behaviors that we excluded are not performance critical and may be synthesized using currently available behavioral and logic synthesis tools.
By comparing with manual designs, we attempt to evaluate two features of our algorithm, the hardware resource estimation and the pipelining and partitioning algorithm.
Hardware resource estimation
Hardware implementations for each of the 12 leaf behaviors were obtained manually [16] and also by using our algorithm, for a PS delay/throughput constraint of 4000 ns and a clock of 25 ns. The PS delay constraint was derived assuming that the desired decoding rate is 30 frames/second, where one frame consists of 720 × 480 pixels. The clock constraint was selected based on the hardware library, which in turn was derived from the VDP300 (VLSI Technology Inc.) library [16]. It contains the area (in number of gates) and delay (in ns) of RTL-level components such as adders, multipliers, shifters, registers, multiplexers and so on.
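The constraints quoted above can be related to each other with a short calculation. The first part uses only numbers stated in the text. The second part is an assumption made purely for illustration: the text does not say how much data is processed per pipe stage period, so the sketch supposes one 8 × 8 block per period with 4:2:0 chroma subsampling, as is usual for MPEG I video.

```python
PS_DELAY_NS = 4000        # PS delay / throughput constraint from the text
CLOCK_NS = 25             # clock period from the text
FRAME_RATE = 30           # frames per second from the text
WIDTH, HEIGHT = 720, 480  # frame size in pixels from the text

# Clock cycles (states) available in one pipe stage period.
cycles_per_stage = PS_DELAY_NS // CLOCK_NS

# Time budget for one frame at the desired decoding rate.
frame_period_ns = 1_000_000_000 / FRAME_RATE

# ASSUMPTION (not stated in the text): one 8x8 block per pipe stage period,
# with 4:2:0 chroma subsampling (two chroma planes at half resolution).
BLOCK = 8
luma_blocks = (WIDTH // BLOCK) * (HEIGHT // BLOCK)
chroma_blocks = 2 * (WIDTH // 2 // BLOCK) * (HEIGHT // 2 // BLOCK)
blocks_per_frame = luma_blocks + chroma_blocks

time_per_frame_ns = blocks_per_frame * PS_DELAY_NS
meets_frame_rate = time_per_frame_ns <= frame_period_ns

print(cycles_per_stage, blocks_per_frame, time_per_frame_ns, meets_frame_rate)
```

Under this assumed block granularity, a 4000 ns period fits within the per-frame budget at 30 frames/second; with a different unit of work per period the check would have to be redone.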
In Figure 11, we present comparisons for a few behaviors representing important parts of the MPEG decoder, such as the inverse discrete cosine transform and motion prediction. The table gives the number of pipe stages, the number of states per stage, and the components used by the leaf behaviors in the manual design and in the estimated designs. The table provides the sum of the component areas (in gates) as well as the individual components for each design, making it easier to pinpoint the difference between the manual designs and the estimates. The components are represented along with their bitwidth; hence, 9-MUL refers to a 9-bit multiplier and (4) 10-MUL refers to 4 instances of a 10-bit multiplier.
Both methods, the manual and our estimation algorithm, obtained comparable PS delay values, given the constraint of 4000 ns. The value of the PS delay does not reflect on the design quality since it is a constraint that just needs to be satisfied. In most cases, the number of pipe stages and the number of states/stage were comparable. Furthermore, both methods used a comparable number and type of functional units for all behaviors. Our estimation algorithm, however, cannot estimate multi-functional units or units of multiple bitwidths. Hence, for example, in leaf deq0 and leaf motion, we used one adder and one subtractor instead of an ADD/SUB component, and in leaf idcti1 we used 5 16-bit adders as opposed to adders with 4 different bitwidths in the manual design.
Having given an idea of how our estimates compare with the manual designs, we now study their impact on design exploration.
Pipelining and partitioning algorithm
After the hardware implementations were obtained, the design space was manually explored by starting with an all-software non-pipelined solution and then moving the critical behaviors to hardware, as well as pipelining the system, until its throughput was within the constraint of 4000 ns. Similarly, we ran our algorithm for a range of PS delay constraints (700,000 ns to 3,000 ns), starting from an all-software solution and moving towards an all-hardware solution. Results of the comparison are tabulated in Figure 12 and shown graphically (in part) in Figure 13. For each design, the tables give the PS delay, the number of pipe stages at the system level, the hardware area and the number of Pentium processors required. (The Pentium processor was selected as the best from a library of about 6 processors, including the Sparc, PowerPC, and 68000 processors.) Note that in the manual designs, the hardware area is the area of the entire datapath and controller, while our algorithm estimates only the functional unit and memory area. Hence, in general, the area of the manual design is higher than the estimated area.
The results indicate that the design exploration conducted by our algorithm closely matches the manual exploration. Though the accuracy of our hardware estimates is not very high, their fidelity is extremely high. Since Figure 11 indicates that our estimation of functional units is fairly accurate, once we estimate the controller area as well, we will be able to close the gap between the two curves. The results also indicate that the designs obtained by our algorithm were in some cases able to share the same processor between 2 behaviors, hence requiring fewer processors.
Figure 14: Designs explored by partitioning and pipelining the Volume, DHRC and AR examples.
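The distinction between accuracy and fidelity in the MPEG comparison can be illustrated with a small Python sketch. It uses one common definition of fidelity, namely the fraction of design pairs that the estimates rank in the same order as the reference values; the text does not spell out its own formula, so this definition, the function names and the area figures below are assumptions for illustration and are not taken from Figure 12.

```python
from itertools import combinations


def sign(x):
    return (x > 0) - (x < 0)


def fidelity(estimated, measured):
    """Fraction of design pairs ordered the same way by both value lists."""
    pairs = list(combinations(range(len(estimated)), 2))
    agree = sum(
        sign(estimated[i] - estimated[j]) == sign(measured[i] - measured[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def mean_relative_error(estimated, measured):
    """Average of |estimate - reference| / reference over all designs."""
    return sum(abs(e - m) / m for e, m in zip(estimated, measured)) / len(measured)


# Hypothetical areas (gates): estimates are consistently 40% too low,
# yet they rank every pair of designs correctly.
manual_area = [1000, 2000, 4000, 8000]
estimated_area = [600, 1200, 2400, 4800]

fid = fidelity(estimated_area, manual_area)
err = mean_relative_error(estimated_area, manual_area)
print(fid, round(err, 3))
```

An estimator like this one is inaccurate in absolute terms but still steers design exploration toward the same choices as the reference values, which is the property that matters when comparing candidate partitions.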
In the second experiment, we present the range of designs obtained by pipelining a system and partitioning it amongst hardware and software components. We present results for the Volume System [7], the differential heat release computation (DHRC) [3] and the AR Filter [12]. The Volume system specification contains 14 leaf behaviors, while the DHRC and AR examples contain just one leaf behavior each. We place a set of different PS delay constraints on each of the examples, and our algorithm returns a set of designs, ranging from all-hardware, highly pipelined solutions to ones containing processors and possibly custom hardware (Figure 14).
For the Volume System, the fastest design has a throughput of 420 ns and 6 pipe stages with a total area of 6541 gates. At the other end of the spectrum, the design is about 100 times slower, with just 1 pipe stage of delay 52520 ns using just one 68020 processor. Similarly, large variations in performance and area were obtained for the DHRC and AR filter. The PS delay varies by as much as 6000 times for the DHRC and 3000 times for the AR filter, with the faster designs being implemented as pipelined hardware and the slower designs being implemented on processors. Once again, we have depicted some of the DHRC designs graphically, to give a feel for the design space explored.
Though these experiments do not evaluate the quality of our hardware/software pipelining algorithm, they indicate the powerful design exploration that is achieved by not only partitioning the design amongst hardware and software components, but also pipelining it at 3 different levels: the system, behavior and loop level.
Conclusion
In this paper, we have presented a design flow and a set of algorithms that partition a system specification in two domains, temporal (pipelining) and spatial (hardware/software partitioning), so as to achieve high throughput designs at reasonable costs. These algorithms have been implemented within SpecSyn, a system-level design environment. Experiments conducted on several examples indicate the feasibility of the problem definition and of our approach in solving it. Though our experiments have not been conclusive about the quality of our algorithms, we were able to demonstrate their accuracy and fidelity for the MPEG system by comparing our designs with manually obtained ones. We also demonstrated the impact of hierarchical pipelining in design space exploration.
From our experimental results we see that our design flow and algorithms may be improved in several ways. At the behavior level, our resource estimation algorithm can be improved by allowing the use of multi-functional components, by allowing multiple bitwidths of the same component type (currently, components of different types may have different bitwidths, but components of the same type have the same bitwidth), by extending our algorithms to pipeline in the presence of loop-carried dependencies, and by providing estimates for the cost of multiplexers and buses. At the system level, we can improve the pipelining and partitioning algorithms by taking interface cost and delay into account, by incorporating different cost functions and by allowing multi-ASIC partitioning. Our design model and algorithms can support these extensions, and incorporating them is part of ongoing work.
