Most digital systems used for dedicated applications consist of general-purpose processors, memory, and application-specific hardware circuits. Examples of such embedded systems appear in medical instrumentation, process control, automated vehicles, and networking and communication systems. Besides being application specific, such system designs also respect constraints related to the relative timing of their actions. For that reason we call them real-time embedded systems.
(such as the protocol for Ethernet links).
The decision to map functionalities into dedicated hardware or implement them as programs on a processor usually depends on estimates of achievable performance and the implementation cost of the respective parts. While this division impacts every stage of the design, it is largely based on the designer's experience and takes place early in the design process. As a consequence, portions of a design often are either under- or over-designed with respect to their required performance. More important, due to the ad hoc nature of the overall design process, we have no guarantee that a given implementation meets required system performance (except possibly by overdesigning).
In contrast, we can formulate a methodical approach to system implementation as a synthesis-oriented solution, a tactic that has met with enormous success in individual integrated circuit chip design (chip-level synthesis). A synthesis approach for hardware proceeds with systems described at the behavioral level, by means of an appropriate specification language.
A synthesis-oriented approach to digital circuit design starts with a behavioral description of circuit functionality. From that, it attempts to generate a gate-level implementation that can be characterized as a purely hardware implementation (Figure 2). Recent strides in high-level synthesis allow us to synthesize digital circuits from high-level specifications; several such systems are available from industry and academia. Gajski¹ and Camposano and Wolf² provide surveys of these. Synthesis produces a gate-level or geometric-level description that is implemented as single or multiple chips. As the number of gates (or logic cells) increases, such a solution requires semicustom or custom design technologies, which then leads to associated increases in cost and design turnaround time. For large system designs, synthesized hardware solutions consequently tend to be fairly expensive, depending upon the technology chosen to implement the chip.
On the other end of the system development cost and performance spectrum, one can also create a software prototype, amenable to simulation, of a system using a general-purpose programming language. (See Figure 2.) The Rapide prototyping system³ is one example. Designers can build such software prototypes rather quickly and often use them for verifying system functionality. However, software prototype performance very often falls short of what time-constrained system designs require.
Practical experience tells us that cost-effective designs use a mixture of hardware and software to accomplish their overall goals (Figure 1). This provides sufficient motivation for attempting a synthesis-oriented approach to achieve system implementations having both hardware and software components. Such an approach would benefit from a systematic analysis of design trade-offs.
One way to accomplish this task is to specify constraints on cost and performance of the resulting implementation (Figure 3). We present an approach to systematic exploration of system designs that is driven by such constraints. Our work builds upon high-level synthesis techniques for digital hardware⁴ by extending the concept of a resource needed for implementation.
As shown in Figure 4, this approach captures a behavioral specification into a system model that is partitioned for implementation into hardware and software. We then synthesize the partitioned model into interacting hardware and software components for the target architecture shown in Figure 5. The target architecture uses one processor that is embedded with an application-specific hardware component. The processor uses only one level of memory and address space for its instructions and data. Currently, to simplify the synthesis and performance estimation for the hardware component, we do not pipeline the application-specific hardware. Even with its relative simplicity, the target architecture can apply to a wide class of applications in embedded systems.
Among the related work, Woo, Wolf, and Dunlop⁵ investigate implementing hardware or software from a cospecification. Chiodo et al.⁷ discuss a methodology for generating hardware and software based on a unified finite-state-machine-based model. Given a system specification as a C program, Henkel and Ernst⁸ identify portions of the program that can be implemented in hardware to achieve a speedup of overall execution times. Srivastava and Brodersen⁹ and Buck et al.¹⁰ present frameworks for generating hardware and software components of a system. Investigators have proposed several new architectures that use field-programmable gate arrays to create special-purpose coprocessors to speed up applications (PAM¹¹, MoM¹²) or to create prototypes (QuickTurn¹³).
Capturing specification of system functionality and constraints
We capture system functionality using a hardware description language, HardwareC.¹⁴ The cosynthesis approach formulated here does not depend upon the particular choice of the HDL, and could use other HDLs such as VHDL or Verilog. However, the use of HardwareC leverages the use of Olympus tools developed for chip-level synthesis. HardwareC follows much of the syntax and semantics of the C programming language, with modifications necessary for correct and unambiguous hardware modeling. A HardwareC description consists of a set of interacting processes that are instantiated into blocks using a declarative semantics. A process model executes concurrently with other processes in the system specification. A process restarts itself on completion. Operations within a process body allow for nested concurrent and sequential operations. Figure 6 shows an example of an HDL functionality specification. This example performs two data input operations, followed by a conditional in which a counter index is generated. The specification uses counter index z to seed a down-counter indicated by the while loop. A graph-based representation as shown captures this HDL specification.
In general, the system model consists of a set of hierarchically related sequencing graphs. Within a graph, vertices represent language-level operations and edges represent dependencies between the operations. Such a representation makes explicit the concurrency inherent in the input specification, thus making it easier to reason about properties of the input description. As we shall soon see, it also allows us to analyze timing properties of the input description.
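The graph representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `SequencingGraph`, its methods, and the operation names (`read_a`, `read_b`, `cond`, `while_loop`, loosely modeled on the example of Figure 6) are hypothetical and are not taken from HardwareC or the Olympus tools. Hierarchy between graphs is omitted.

```python
from collections import defaultdict, deque


class SequencingGraph:
    """Polar acyclic graph: vertices are operations, edges are dependencies."""

    def __init__(self):
        self.vertices = {"source", "sink"}   # the two no-operation poles
        self.succ = defaultdict(set)         # operation -> operations depending on it

    def add_dependency(self, before, after):
        self.vertices.update((before, after))
        self.succ[before].add(after)

    def close(self):
        """Attach the poles so every operation lies on a source-to-sink path."""
        has_pred = {v for outs in self.succ.values() for v in outs}
        for v in sorted(self.vertices - {"source", "sink"}):
            if v not in has_pred:
                self.succ["source"].add(v)
            if not self.succ[v]:
                self.succ[v].add("sink")

    def levels(self):
        """Longest-path depth from the source, computed in topological order."""
        indeg = {v: 0 for v in self.vertices}
        for u in list(self.succ):
            for v in self.succ[u]:
                indeg[v] += 1
        level = {v: 0 for v in self.vertices}
        ready = deque(v for v in self.vertices if indeg[v] == 0)
        while ready:
            u = ready.popleft()
            for v in self.succ[u]:
                level[v] = max(level[v], level[u] + 1)
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
        return level


# Hypothetical operations: two reads feed a conditional, which seeds a loop.
g = SequencingGraph()
g.add_dependency("read_a", "cond")
g.add_dependency("read_b", "cond")
g.add_dependency("cond", "while_loop")
g.close()
depth = g.levels()
```

Operations that end up at the same depth cannot depend on one another, which is one simple way to see the concurrency the representation exposes: here the two reads share a depth and may proceed in parallel.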
Model properties.
The sequencing graph is a polar one with source and sink vertices that represent no-operations. Associated with each graph model is a set of variables that defines the shared memory between operations in the graph model. Source and sink vertices synchronize executions of operations in a graph model across multiple iterations. Thus, polarity of the graph model ensures that there is exactly one execution of an operation with respect to each execution of any other operation. This makes execution of operations within a graph single rate (Figure 7). The set of variables associated with a graph model defines the storage common to the operations; it serves to facilitate communication between operations.
Given the single-rate execution model, it is relatively straightforward to ensure ordering of operations in a graph model that preserves integrity of memory shared between operations. However, operations across graph models follow multirate execution semantics. That is, there may be variable numbers of executions of an operation for an operation in another graph model. Because of this multirate nature of execution, the operations use message-passing primitives like send and receive to implement communications across graph models. Use of these primitives simplifies specification of intermodel communications. A multirate specification is an important feature for modeling heterogeneous systems, because the processor and application-specific hardware may run on different clocks and speeds.
HDL descriptions contain operations to represent synchronization to external events, such as the receive operation, as well as data-dependent loop operations. These operations, called nondeterministic delay (ND) operations, present unknown execution delays. The ability to model ND operations is vital for reactive embedded system descriptions.
Timing constraints are of two types:
Min/max delay constraints: These provide bounds on the time interval between initiation of execution of two operations.
Execution rate constraints: These provide bounds on successive initiations of the same operation. Rate constraints on input/output operations are equivalent to constraints on throughput of respective inputs/outputs.
These two types of constraints are sufficient to capture constraints needed by most real-time systems.¹⁵ Our synthesis system captures minimum delay constraints in the graphical representation by providing weights on the edges to indicate delay of the corresponding source operation. Capturing maximum delay constraints requires additional backward edges (Figure 9).
Model analysis.
Having captured system functionality and constraints in a graphical model, we can now estimate system performance and verify the consistency of specified constraints. Performance measures require estimation of operation delays. We compute these delays separately for hardware and software implementations based on the type of hardware to be used and the processor used to run the software. A processor cost model captures processor characteristics. It consists of an execution delay function for a basic set of processor operations, a memory address calculation function, a memory access time, and processor interruption response time.
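As a rough illustration of the four ingredients of the processor cost model listed above, the sketch below bundles them into one record. The class name, field names, and cycle counts are all hypothetical. It also simplifies the address calculation function to a constant per memory operand, which is an assumption made here for brevity rather than something the article states.

```python
from dataclasses import dataclass


@dataclass
class ProcessorCostModel:
    op_cycles: dict                   # execution delay per basic operation type
    addr_calc_cycles: int             # simplified: constant cost per memory operand
    mem_access_cycles: int            # memory access time
    interrupt_response_cycles: int    # processor interruption response time

    def operation_delay(self, kind, memory_operands=0):
        """Delay of one operation, including the cost of fetching its operands."""
        per_operand = self.addr_calc_cycles + self.mem_access_cycles
        return self.op_cycles[kind] + memory_operands * per_operand


# Hypothetical numbers for an imaginary processor.
model = ProcessorCostModel({"add": 1, "mul": 4, "branch": 2}, 1, 2, 10)
add_delay = model.operation_delay("add", memory_operands=2)   # 1 + 2 * (1 + 2)
```

A software delay estimate for a whole graph model would then sum such per-operation delays along the chosen order of operations.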
Timing constraint analysis attempts to answer the following question: Can imposed constraints be satisfied for a given implementation? We indicate an implementation of a model by assigning appropriate delays to the operations with known delays (not ND) in the graph model. Constraint satisfiability relates to the structure as well as the actual delay and constraint values on the graph. Some structural properties of the graphs (relating to ND operations and their dependencies) may make a constraint unsatisfiable regardless of the actual delay values of the operations. Further, some constraints may be mutually inconsistent: for example, a maximum delay constraint between two operations that also have a larger minimum delay constraint. No assignment of nonnegative operation delay values can satisfy such constraints.
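The mutual-inconsistency case mentioned above can be checked mechanically. The following sketch assumes one common encoding of the weighted constraint graph: a minimum delay d from u to v becomes a forward edge (u, v, d), and a maximum delay m from u to v becomes a backward edge (v, u, -m). Under that assumption the constraints are consistent exactly when the graph has no cycle of positive total weight, which a Bellman-Ford style longest-path relaxation detects. The function name and the numbers are illustrative, and only operations with known delays are handled (no ND operations).

```python
def earliest_start_times(ops, edges):
    """Return earliest start times, or None if the constraints are inconsistent.

    edges holds (u, v, w) triples meaning start[v] >= start[u] + w.
    """
    start = {op: 0 for op in ops}
    for _ in range(len(ops)):
        changed = False
        for u, v, w in edges:
            if start[u] + w > start[v]:
                start[v] = start[u] + w
                changed = True
        if not changed:
            return start          # relaxation converged: a valid schedule exists
    return None                   # still changing after |ops| passes: positive cycle


ops = ["op1", "op2"]
min_delay = ("op1", "op2", 5)                 # op2 starts at least 5 after op1

# Maximum delay of 8 between op1 and op2: backward edge with weight -8.
ok = earliest_start_times(ops, [min_delay, ("op2", "op1", -8)])

# Maximum delay of 3 is smaller than the minimum delay of 5: inconsistent.
bad = earliest_start_times(ops, [min_delay, ("op2", "op1", -3)])
```

The second call corresponds to the example in the text: a maximum delay constraint paired with a larger minimum delay constraint, for which no schedule exists.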
In the presence of ND operations in a graph model, we consider a timing constraint satisfiable if it is satisfied for all possible (and maybe infinite) delay values of the ND operations. We consider a timing constraint marginally satisfiable if it can be satisfied for all possible values within specified bounds on the delay of the ND operations. Marginal satisfiability analysis is useful because it allows the use of timing constraints that can be satisfied under some implementation assumptions (acceptable bounds on ND operation delays). Without these assumptions the general timing constraint satisfiability analysis would otherwise consider these constraints ill-posed.¹⁶ We perform timing constraint analysis by graph analysis on the weighted sequencing graphs.
Note that in some cases system throughput (specified by rate constraints) can be optimized significantly with little or no impact on system latency by using a pipelined execution model and extra resources. Indeed, for deterministic and fixed-rate systems particularly used for digital signal processing applications, researchers have developed extensive transformations that determine and achieve bounds on system throughput.¹⁷ However, as noted earlier, systems modeled by the sequencing graphs generally operate at different rates. In addition, because of the presence of ND operations due to loops, the rate at which a particular operation executes may change over time. While this property is essential for modeling control-dominated embedded systems, it aggravates the problem of determining absolute bounds on achievable system throughput.
We illustrate the issue of rate constraints on graphs containing ND operations in Example A. In general, consider a process P that contains an ND operation due to an unbounded loop. The ND operation induces a bipartition of the calling process, $P = F \cup B$, such that the set of operations in F (for example, the read operation in process test) must be performed before invoking the loop body. Further, the set of operations in B can only be performed after completing executions of the loop body. We can then use functional pipelining of F, B, and the loop to improve the reaction rate of P. Since we assume nonpipelined hardware, these transformations are used only in the context of the software component.
Constraint analysis and software.
The linear execution semantics imposed by the software running on a single-processor target architecture complicates constraint analysis for a software implementation of a graph model. That is, performing delay analysis for software operations requires a complete order of operations in the graph model. In creating a complete order of operations, it is likely that unbounded cycles may be created, which would make constraints unsatisfiable.
As shown in Figure 10, any serialization that puts an ND operation between two operations op1 and op2 will make any maximum delay constraint between op1 and op2 unsatisfiable. However, note that while all computations must be performed serially in software, communication operations can proceed concurrently.
In other words, it is possible to overlap execution of ND operations (wait for synchronization or communication) with some (unrelated) computation. But such an overlap requires the ability to schedule operations dynamically in software since the simultaneously active ND operations may complete in orders that cannot be determined statically. Typically, dynamic scheduling of operations involves delay overheads due to selection and scheduling of operations. Therefore, a good model of software is to think of software as a set of fixed-latency concurrent threads (Figure 11). We define a thread as a linearized set of operations that may or may not begin with an ND operation, indicated by a circle in Figure 11. Other than the beginning ND operation, a thread does not contain any ND operations. We consider the delay of the initial ND operation part of the scheduling delay and, therefore, not included in the latency of the program thread. Use of multiple concurrent program threads instead of a single program to implement the software also avoids the need for complete serialization of all operations that may create unbounded cycles.
In this software model, we can check marginal satisfiability of constraints on operations belonging to different threads, assuming a fixed and known delay of scheduling operations associated with ND operations (context switch delay, for example).
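A minimal sketch of this thread model follows, assuming delays measured in processor cycles. The class name `ProgramThread`, the constant `OVERHEAD`, and all numbers are hypothetical. The wait on the initial ND operation is deliberately left out of the latency, while a fixed scheduling overhead is included; the maximum reaction rate is taken as the inverse of the latency, as the partitioning discussion later in the article does.

```python
from dataclasses import dataclass


@dataclass
class ProgramThread:
    name: str
    op_delays: list            # delays of the linearized operations, in cycles
    required_rate: float       # required invocations per cycle

    def latency(self, scheduling_overhead):
        """Fixed latency: scheduling overhead plus the serialized operation delays."""
        return scheduling_overhead + sum(self.op_delays)

    def max_rate(self, scheduling_overhead):
        """Maximum achievable reaction rate, the inverse of the latency."""
        return 1.0 / self.latency(scheduling_overhead)

    def meets_rate(self, scheduling_overhead):
        return self.max_rate(scheduling_overhead) >= self.required_rate


OVERHEAD = 10   # hypothetical context-switch delay in cycles
t1 = ProgramThread("T1", [4, 7, 9], required_rate=0.02)
t2 = ProgramThread("T2", [30, 40, 20], required_rate=0.02)
```

With these made-up numbers T1 has a latency of 30 cycles and comfortably meets its rate, while T2 at 100 cycles does not.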
System partitioning
The system-level partitioning problem refers to the assignment of operations to hardware or software. The assignment of an operation to hardware or software determines the delay of the operation. In addition, assignment of operations to a processor and to one or more application-specific hardware circuits involves additional delays due to communication overheads.
Any good partitioning scheme must attempt to minimize this communication. Further, as operations in software are implemented on a single processor, increasing operations in software increases processor utilization. Consequently, overall system performance depends on the effect of hardware-software partition on utilization of the processor and the bandwidth of the bus between the processor and application-specific hardware.
A partitioning scheme thus must attempt to capture and make use of a partition's effect on system performance in making trade-offs between hardware and software implementations of an operation. An efficient way to do this would be to devise a partition cost function that captures these properties. We would then use this function to direct the partitioning algorithm toward a desired solution, where an optimum solution is defined by the minimum value of the partition cost function.
Note that we need to capture not only the effects of sizes of hardware and software parts but also the effect on timing behavior of these portions in our partition cost function. In contrast, most partitioning schemes for hardware have focused on optimizing area and pinout of resulting circuits. Capturing the effect of a partition on timing performance during the partitioning stage is difficult. Part of the problem arises because the timing properties are usually global in nature, thus making it difficult to make incremental computations of the partition cost function as is essential for developing effective partition algorithms. Approximation techniques have been suggested to take into account the effect of a partition on overall latency.¹⁸ Note, however, that partitioning in the software world does make extensive use of statistical timing properties to drive the partitioning algorithms.¹⁹ We draw the distinction between these two extremes of hardware and software partitioning by the flexibility to schedule operations. Hardware partitioning attempts to divide circuits that implement scheduled operations. Conversely, the program-level partitioning addresses operations that are scheduled at runtime.
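To make the point about incremental computation concrete, here is a small sketch of a cost function whose value can be updated in constant time when one operation migrates. It is not the article's algorithm. The linear form of the cost, the weights, the per-operation load figures, the utilization limits, and the choice to start with everything in hardware are all assumptions made for illustration, and timing-constraint checks are omitted entirely.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    hw_size: float     # hardware size saved if the operation moves to software
    cpu_load: float    # processor utilization added by a software implementation
    bus_load: float    # bus utilization added by the resulting communication


class Partition:
    def __init__(self, ops, weights=(1.0, 100.0, 10.0), bus_limit=1.0):
        self.weights = weights
        self.bus_limit = bus_limit
        self.in_hw = {op.name: op for op in ops}   # simplification: all hardware
        self.in_sw = {}
        self.size = sum(op.hw_size for op in ops)
        self.cpu = 0.0
        self.bus = 0.0

    def cost(self):
        a, b, c = self.weights
        return a * self.size + b * self.bus - c * self.cpu

    def delta(self, op):
        """Change in cost if op migrates to software; constant time."""
        a, b, c = self.weights
        return -a * op.hw_size + b * op.bus_load - c * op.cpu_load

    def can_move(self, op):
        return (self.cpu + op.cpu_load <= 1.0
                and self.bus + op.bus_load <= self.bus_limit)

    def move_to_software(self, name):
        op = self.in_hw.pop(name)
        self.in_sw[name] = op
        self.size -= op.hw_size
        self.cpu += op.cpu_load
        self.bus += op.bus_load

    def improve(self):
        """Greedy iterative improvement: take the best feasible move until none helps."""
        while True:
            moves = [op for op in self.in_hw.values()
                     if self.can_move(op) and self.delta(op) < 0]
            if not moves:
                return
            self.move_to_software(min(moves, key=self.delta).name)


p = Partition([Op("filter", 40, 0.25, 0.125),
               Op("crc", 100, 0.5, 0.25),
               Op("mac", 60, 0.5, 0.5)])
initial_cost = p.cost()
p.improve()
```

In this made-up instance the third operation stays in hardware because moving it would push processor utilization above one. Because each total is a running sum, every candidate move is evaluated without revisiting the rest of the partition.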
Our approach to partitioning for hardware and software takes an intermediate approach. As shown in Figure 12, we use deterministic bounds on timing properties that are incrementally computable in the partition cost function. That is, we can compute the new partition cost function in constant time. We accomplish this by using a software model in terms of a set of program threads as shown in Figure 11.
Characterization of software using $\lambda$, $\rho$, $P$, and $B$ parameters (thread latency, thread reaction rate, processor utilization, and bus utilization, respectively) makes it possible to calculate static bounds on software performance. Use of these bounds is helpful in selecting an appropriate partition of system functionality between hardware and software. However, it also has the disadvantage of overestimating performance parameters such as processor and bus bandwidth utilization. Typically, there is a distribution of thread invocations and communications based on actual data values being transferred, which is not accounted for in these parameters.
We compute hardware size $S_H$ bottom-up from the size estimates of the resources implementing the individual operations. In addition, we characterize the interface between hardware and software by a set of communication ports (one for each variable) between hardware and software that communicate data over a common bus. The overhead due to communication between hardware and software is manifested by the utilization of bus bandwidth as described earlier.
Given the cost model for software, hardware, and interface, we can informally state the problem of partitioning a specification for implementation into hardware and software as follows:
From a given set of sequencing graph models and timing constraints between operations, create two sets of sequencing graph models such that one can be implemented in hardware and the other in software and the following is true:
• Timing constraints are satisfied for the two sets of graph models.
• Bus utilization, $B \le \bar{B}$.
• A partition cost function, $f = f(S_H, B, P^{-1}, m)$, is minimized.
An exact solution to the constrained partitioning problem, that is, a solution that minimizes the partition cost function, requires that we examine a large number of solutions. Typically, that number is exponential in the number of operations under partition. As a result, designers often use heuristics to find a "good" solution, with the objective of finding an optimal value of the cost function that is minimal for some local properties.
Most common heuristics for solving partitioning problems start with a constructive initial solution that some iterative procedure can then improve. Iterative improvement can follow, for example, from moving or exchanging operations and paths between partitions. A good heuristic is also relatively insensitive to the initial solution. Typically, exchange of a larger number of operations makes the heuristic more insensitive to the starting solution, at the cost of increasing the time complexity.
In the following, we describe the intuitive features of the partitioning algorithm. We have presented details elsewhere.²⁰ The procedure identifies operations that can be implemented in software such that the corresponding constraint graph implementation can be satisfied and the resulting software (as a set of program threads) meets required rate constraints on its inputs and outputs. As an initial partition we assume that ND operations related to data-dependent loop operations define the beginning of program threads in software, while all other operations are implemented in hardware. The rate constraints on software inputs/outputs translate into bounds on the required reaction rate $\rho_i$ of the corresponding program thread $T_i$. The maximum achievable reaction rate $\bar{\rho}_i$ of a program thread is computed as the inverse of its latency. The latency of a program thread is computed using a processor delay cost model and includes a fixed scheduling overhead delay.
From an initial solution we perform iterative improvement by migrating operations between the partitions. Migration of an operation across a partition affects its execution delay. It also affects the latency and reaction rate of the thread to which this operation is moved. We similarly compute its effect on processor and bus bandwidth utilization. At any step, we select operations for migration so that the move lowers the communication cost, while maintaining timing constraint satisfiability. In addition, we check for communication feasibility by verifying that $\bar{\rho}_i \ge \rho_i$ for each thread, and that processor and bus utilization constraints are satisfied.
System synthesis
From partitioned graph models, our next problem is to synthesize individual hardware and software components. Ku¹⁴ addresses in detail the generation of hardware circuits for sequencing graph models. Therefore, we concentrate on generation of software and interface circuitry from partitioned models. The problem of software synthesis is to generate a program from partitioned graph models that correctly implements the original system functionality. We assume that the resulting program is mapped to real memory, so the issues related to memory management are not relevant to this problem. The partitioning discussed previously identified graph models that are to be implemented in hardware and operations (organized as program threads) that are to be implemented in software. See Example B.
The program generation from a thread can either use a coroutine or subroutine scheme. Since, in general, there can be dependencies into and from the program threads, a coroutine model is more appropriate. A dependency between two operations can be either a data or a control dependency. Depending upon predecessor relationships and timing of the operations, we can make some of these redundant by inserting other dependencies such that resulting program threads are convex: all external dependencies are limited to the first and last operations. For a given subgraph corresponding to a program thread, we can move an incoming data dependency up to its first operation and move an outgoing data dependency down to its last operation. This procedure produces a potential loss of concurrency. However, it makes the task of routine implementation easier since we can implement all the routines as independent programs with statically embedded control dependencies.
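The rewriting of dependencies that makes a thread convex can be sketched as follows. The function name `make_convex` and the operation names are hypothetical; the sketch only moves external dependencies to the first and last operations of a thread and does not attempt to detect which of the resulting dependencies become redundant.

```python
def make_convex(thread_ops, deps):
    """Move external dependencies to the ends of a linearized thread.

    thread_ops is the ordered list of operations in the thread;
    deps is a list of (source, destination) dependency pairs.
    """
    inside = set(thread_ops)
    first, last = thread_ops[0], thread_ops[-1]
    rewritten = set()
    for src, dst in deps:
        if src not in inside and dst in inside:
            rewritten.add((src, first))      # incoming: move up to the first operation
        elif src in inside and dst not in inside:
            rewritten.add((last, dst))       # outgoing: move down to the last operation
        else:
            rewritten.add((src, dst))        # internal or unrelated: unchanged
    return sorted(rewritten)


# Thread a -> b -> c with an external producer x feeding b and a consumer y fed by b.
convex = make_convex(["a", "b", "c"], [("x", "b"), ("b", "y"), ("a", "b")])
```

The loss of concurrency mentioned above is visible here: the whole thread now waits for x before starting, and y waits for the whole thread to finish, even though only b was involved originally.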
Rate constraints and software.
In the presence of dependencies on ND operations, we cannot always guarantee that a given software implementation will meet the data rate constraints on its I/O ports. In case of synchronization-related ND operations, we can check for marginal satisfiability of timing constraints by assigning a context-switch delay to the respective wait operations. However, in the case of unbounded loop-related ND operations, the delay due to these operations consists of ...
Example C. Consider the threads T1 and T2 generated from process test mentioned in Example A. The overall execution time of the while loop determines the interval between successive executions of the read operation. Due to this variable-delay loop operation, the input rate at port p is variable, and we cannot always guarantee the reaction rate of T1. Since the set of operations in loop-body may alter the contents of memory in process test, thread T1 must be blocked until the completion of T2. Thus the process test can be thought of as consisting of two parallel processes, as shown in Figure B. We need the first operation of thread T2, wait1, to observe the data dependency of operations in thread T2. We need the second wait operation, wait2, to guarantee that any memory side effects of T2 for variables in T1 are correctly reflected. To obtain a deterministic bound on the reaction rate of the calling thread, it is possible to unroll the looping thread by creating a variable number of program threads. However, in this case each iteration of the looping thread would carry scheduling overhead. Dynamic creation of program threads may also lead to violation of the processor utilization constraint as described in previous sections. However, it is possible to overlap execution of loop thread T2 with execution of thread T1, and to ensure marginal timing constraint satisfiability. Note that we can remove operation ...
In this example, two data queues with 16 bits of width and 1 bit of depth, line-queue and circle-queue, and one queue with 2 bits of width and 1 bit of depth, controlFIFO, are declared. The guarded commands specify the conditions on which the number 1 or the number 2 is enqueued; here, a '+' after a signal name means a positive edge and a '-' after the signal means a negative edge. The first when condition states that when a dequeue request for the queue line-queue arrives and this queue is not empty and the queue controlFIFO is not full, then enqueue the value 2 (representing the identifier for a corresponding program thread that consumes data from the line queue) into the controlFIFO.
Data transfers from hardware to software must be explicitly synchronized. By using a polling strategy, we can design the software component to perform premeditated transfers from the hardware components based on its data requirements. This requires static scheduling of the hardware component. Where software functionality is limited by communications (that is, where the processor is busy waiting for an input-output operation most of the time), such a scheme would suffice. Further, in the absence of any unbounded-delay operations, we can simplify the software component in this scheme to a single program thread and a single data channel since all data transfers are serialized.
However, this approach would not support any branching nor any reordering of data arrivals, since the design would not support dynamic scheduling of operations in hardware.
To accommodate differing rates of execution among the hardware and software components, and due to unbounded delay operations, we look for a dynamic scheduling of different threads of execution. Availability of data forms the basis for such a scheduling. One mechanism to perform such scheduling is a control FIFO (first in, first out) buffer, which attempts to enforce the policy that data items are consumed in the order in which they are produced. As shown in Example D, the hardware-software interface consists of data queues on each channel and a control FIFO that holds the identifiers for the enabled program threads in the order in which their input data arrives. The control FIFO depth equals the number of threads of execution, since a thread execution stalls pending availability of the requested data.
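The scheduling policy described above can be modeled in software with two kinds of queues. The sketch below is a behavioral illustration only, not the interface circuit: the class `HwSwInterface`, its method names, and the thread identifiers `line` and `circle` are hypothetical, each data channel is assumed to feed exactly one thread, and queue depths are left unbounded for simplicity.

```python
from collections import deque


class HwSwInterface:
    def __init__(self, thread_ids):
        self.data = {tid: deque() for tid in thread_ids}   # one data queue per channel
        self.control_fifo = deque()                        # identifiers of enabled threads

    def hardware_enqueue(self, tid, item):
        """Hardware side: deliver data and record which thread it enables."""
        self.data[tid].append(item)
        self.control_fifo.append(tid)

    def run_software(self, threads):
        """Software side: serve enabled threads strictly in arrival order."""
        results = []
        while self.control_fifo:
            tid = self.control_fifo.popleft()
            item = self.data[tid].popleft()
            results.append(threads[tid](item))   # the thread runs to completion
        return results


iface = HwSwInterface(["line", "circle"])
iface.hardware_enqueue("line", (0, 0, 3, 4))
iface.hardware_enqueue("circle", (5, 5, 2))
iface.hardware_enqueue("line", (1, 1, 2, 2))

threads = {
    "line": lambda d: ("line", d),
    "circle": lambda d: ("circle", d),
}
order = [name for name, _ in iface.run_software(threads)]
```

The threads are invoked in exactly the order in which their input data arrived, which is the policy the control FIFO is meant to enforce.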
Note that thread scheduling by means of a control FIFO does not explicitly prioritize the program threads. This is because, for safety reasons, the control FIFO serves program threads strictly in the order in which their identifiers are enqueued. In some systems we may want to invoke a program thread as soon as its needed data becomes available. Such systems would be better served by a preemptive scheduling algorithm based on relative priorities of the threads. However, preemption comes at significant operating system overhead. In contrast, nonpreemptive prioritized scheduling of program threads is possible with relatively minor modifications to the control FIFO. Example E describes the actual interconnection schematic between hardware and software for a single data queue.
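One possible form of such a modification, offered purely as an assumption and not as the mechanism the authors had in mind, is to replace the FIFO discipline with a priority queue that breaks ties by arrival order. The class name `PriorityControlQueue` and the priority values are hypothetical. The scheme remains nonpreemptive because a thread identifier is only dequeued after the previous thread has finished.

```python
import heapq
from itertools import count


class PriorityControlQueue:
    def __init__(self, priority):
        self.priority = priority     # thread id -> priority, smaller is more urgent
        self._heap = []
        self._arrival = count()      # tie-breaker that preserves arrival order

    def enqueue(self, thread_id):
        entry = (self.priority[thread_id], next(self._arrival), thread_id)
        heapq.heappush(self._heap, entry)

    def dequeue(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


q = PriorityControlQueue({"line": 1, "circle": 0})
for tid in ["line", "circle", "line"]:
    q.enqueue(tid)
served = [q.dequeue() for _ in range(len(q))]
```

Compared with the plain control FIFO, the circle thread is now served first even though its data arrived second, while the two line invocations keep their relative order.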
We can implement the control FIFO and associated control logic either in hardware as a part of the ASIC component or in software. If we implement the control FIFO in software, the system no longer needs the FIFO control logic since the control flow is already in software. In this case, the q-rq lines from data queues connect to processor unvectored interruption lines, where the system uses respective interruption service routines to enqueue the thread identifier tags into the control FIFO. During the enqueue operations the system disables the interruptions to preserve integrity of the software control flow.
Example
As an experiment in achieving mixed system designs, we attempted synthesis of an Ethernet-based network coprocessor. The coprocessor is modeled as a set of 13 ... an input bit-rate of 10 Mbytes/s.
In case of multiple in-degree queues, the enqueue-rq is generated by OR-ing the requests of all inputs to the queues. In case of multiple-out-degree queues, the signal dequeue-rq is generated also by OR-ing all dequeue requests from the queue.
Synthesis of embedded real-time systems from behavioral specifications constitutes a challenging problem in hardware-software cosynthesis. Due to the relative simplicity of the target architecture compared to general-purpose computing systems, it also affords an opportunity in computer-aided design, by which we can automatically synthesize such systems from a unified specification. Further, the ability to perform constraint and performance analysis for such systems provides a major motivation for using the synthesis approach instead of design-oriented implementation approaches.
Even when manually designed, such systems can benefit greatly from prototypes created by a cosynthesis approach.
A cosynthesis approach lets us reduce the size of the chip-synthesis task, while meeting the performance constraints, such that we can use field- or mask-programmable hardware to provide fast turnaround on complex system designs.
For hardware-software synthesis to be effective, we need specification languages that capture and use capabilities of both hardware and software. The approach presented in this article makes use of an HDL to formulate the problem of cosynthesis as an extension of hardware synthesis. In the process, the ap-
