Hardware-Software Cosynthesis for Digital Systems by Gupta, Rajesh & De Micheli, Giovanni
MOST DIGITAL SYSTEMS used for
dedicated applications consist of gen- 
eral-purpose processors, memory, 
and application-specific hardware
circuits. Examples of such embedded 
systems appear in medical instrumen- 
tation, process control, automated 
vehicles, and networking and com- 
munication systems. Besides being 
application specific, such system de-
signs also respect constraints related
to the relative timing of their actions. 
For that reason we call them real-time 
embedded systems. 
Design and analysis of real-time 
embedded systems pose challenges 
in performance estimation, selec- 
tion of appropriate parts for system 
implementation, and verification of 
such systems for functional and tem- 
poral properties. In practice, design- 
ers implement such systems from 
their specification as a set of loosely 
defined functionalities by taking a 
design-oriented approach. For in- 
stance, consider the design shown 
in Figure 1 (next page) of a network 
processor that is connected to a serial 
RAJESH K. GUPTA
GIOVANNI DE MICHELI
Stanford University
As system design grows
increasingly complex, the use of
predesigned components, such as
general-purpose microprocessors,
can simplify synthesized hardware.
While the problems in designing
systems that contain processors and
application-specific integrated
circuit chips are not new, computer-
aided synthesis of such
heterogeneous or mixed systems
[...] demonstrate the feasibility of
synthesizing heterogeneous systems
by using timing constraints to
delegate tasks between hardware
and software so that performance
requirements can be met.
(such as the protocol for Ethernet links). 
The decision to map functionalities 
into dedicated hardware or implement 
them as programs on a processor usual- 
ly depends on estimates of achiev- 
able performance and the imple- 
mentation cost of the respective 
parts. While this division impacts 
every stage of the design, it is large-
ly based on the designer's experi-
ence and takes place early in the
design process. As a consequence, 
portions of a design often are either 
under- or over-designed with re- 
spect to their required perfor- 
mance. More important, due to the 
ad hoc nature of the overall design 
process, we have no guarantee that 
a given implementation meets re- 
quired system performance (ex- 
cept possibly by overdesigning). 
In contrast, we can formulate a 
methodical approach to system im-
plementation as a synthesis-oriented
solution, a tactic that has met with 
enormous success in individual 
integrated circuit chip design (chip-
level synthesis). A synthesis ap-
proach for hardware proceeds with
systems described at the behavioral 
level, by means of an appropriate 
SEPTEMBER 1993 0740-7475/93/0900-0029$03.00 © 1993 IEEE 29
Figure 1. A design-oriented approach to system implementation. (Diagram labels: software, memory, applications interface, packet formatting, self-test, hardware.)
Figure 2. A synthesis-oriented approach to system implementation. (Diagram labels: behavioral specification, memory, prototyping, high-level synthesis, hardware, software; axes: performance, cost.)
Figure 3. Proposed approach to system implementation. (Diagram labels: behavioral specification plus constraints, interface, analog, implementation, hardware, software, constraints; axes: performance, cost.)
[...] description languages (HDLs) to describe
integrated circuits has been gaining 
wide acceptance in recent years. 
A synthesis-oriented approach to dig- 
ital circuit design starts with a behavior- 
al description of circuit functionality. 
From that, it attempts to generate a gate-
level implementation that can be
characterized as a purely hardware im-
plementation (Figure 2). Recent strides
in high-level synthesis allow us to synthe-
size digital circuits from high-level spec-
ifications; several such systems are
available from industry and academia. 
Gajski¹ and Camposano and Wolf²
provide surveys of these. Synthesis pro-
duces a gate-level or geometric-level de-
scription that is implemented as single
or multiple chips. As the number of 
gates (or logic cells) increases, such a 
solution requires semicustom or custom 
design technologies, which then leads 
to associated increases in cost and de-
sign turnaround time. For large system
designs, synthesized hardware solutions 
consequently tend to be fairly expen- 
sive, depending upon the technology 
chosen to implement the chip. 
On the other end of the system devel-
opment cost and performance spectrum,
one can also create a software prototype,
amenable to simulation, of a system us-
ing a general-purpose programming lan-
guage. (See Figure 2.) The Rapide
prototyping system³ is one example. De-
signers can build such software proto-
types rather quickly and often use them
for verifying system functionality. Howev-
er, software prototype performance very
often falls short of what time-constrained
system designs require.
A mixed implementation
Practical experience tells us that cost-
effective designs use a mixture of hard-
ware and software to accomplish their
overall goals (Figure 1). This provides suf-
ficient motivation for attempting a synthe-
sis-oriented approach to achieve system
implementations having both hardware
and software components. Such an ap-
proach would benefit from a systematic
analysis of design trade-offs that is com-
30 IEEE DESIGN & TEST OF COMPUTERS 
mon in synthesis while also creating cost-
effective systems.
One way to accomplish this task is to
specify constraints on cost and perfor-
mance of the resulting implementation
(Figure 3). We present an approach to
systematic exploration of system designs
that is driven by such constraints. Our
work builds upon high-level synthesis
techniques for digital hardware⁴ by ex-
tending the concept of a resource need-
ed for implementation.
As shown in Figure 4, this approach
captures a behavioral specification
into a system model that is partitioned
for implementation into hardware and
software. We then synthesize the parti-
tioned model into interacting hardware
and software components for the target
architecture shown in Figure 5. The tar-
get architecture uses one processor
that is embedded with an application-
specific hardware component. The
processor uses only one level of mem-
ory and address space for its instruc-
tions and data. Currently, to simplify
the synthesis and performance estima-
tion for the hardware component, we
do not pipeline the application-specific
hardware. Even with its relative simplic-
ity, the target architecture can apply to
a wide class of applications in embed-
ded systems.

Figure 4. Synthesis approach to embedded systems. (Diagram labels: specification, performance estimation, constraint analysis, trade-offs, processor, interface, concurrency, synchronization; the specification panel shows a process with ports a, b, c and the operations read(a), write(c), detach.)

Among the related work, Woo, Wolf,
and Dunlop⁵ investigate implementing
hardware or software from a cospecifi-
cation. Chou, Ortega, and Borriello⁶ de-
scribe synthesis of hardware or software
for interface circuits. Chiodo et al.⁷ dis-
cuss a methodology for generating hard-
ware and software based on a unified
finite-state-machine-based model. Given
a system specification as a C program,
Henkel and Ernst⁸ identify portions of
the program that can be implemented
in hardware to achieve a speedup of
overall execution times. Srivastava and
Brodersen⁹ and Buck et al.¹⁰ present
frameworks for generating hardware
and software components of a system.
Investigators have proposed several new
architectures that use field-programma-
ble gate arrays to create special-purpose
coprocessors to speed up applications
(PAM¹¹, MoM¹²) or to create prototypes
(QuickTurn¹³).
Capturing specification of system
functionality and constraints
We capture system functionality us-
ing a hardware description language,
HardwareC.¹⁴ The cosynthesis approach
formulated here does not depend upon
the particular choice of the HDL, and
could use other HDLs such as VHDL or
Verilog. However, the use of HardwareC
leverages the use of Olympus tools de-
veloped for chip-level synthesis.⁴
HardwareC follows much of the syntax
and semantics of the C programming lan-
guage, with modifications necessary for
correct and unambiguous hardware
modeling. A HardwareC description con-
sists of a set of interacting processes that
are instantiated into blocks using a de-
clarative semantics. A process model ex-
ecutes concurrently with other processes
in the system specification. A process re-
starts itself on completion. Operations
within a process body allow for nested
concurrent and sequential operations.

Figure 5. Target architecture. (Diagram label: processor.)

Figure 6 shows an example of an HDL 
functionality specification. This exam- 
ple performs two data input operations, 
followed by a conditional in which a 
counter index is generated. The specifi- 
cation uses counter index z to seed a 
downcounter indicated by the while 
loop. A graph-based representation as 
shown captures this HDL specification. 
In general, the system model consists
of a set of hierarchically related se-
quencing graphs. Within a graph, verti-
ces represent language-level operations
and edges represent dependencies be-
tween the operations. Such a represen-
tation makes explicit the concurrency
inherent in the input specification, thus
making it easier to reason about proper-
ties of the input description. As we shall
soon see, it also allows us to analyze tim-
ing properties of the input description.

process counter(a, b, c)
    in port a[8];
    in channel b[8];
    out port c[8];
    boolean x[8], y[8], z[8];
{
    [...]
}
Figure 6. Example of input specification and capture. (Graph labels: G0, Gloop, False, True; legend: operation delay in cycles.)

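The sequencing-graph idea can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the authors' tool: the class `SequencingGraph`, its methods, and the operation names (`read_a`, `read_b`, `cond`, `loop`, loosely modeled on the description of Figure 6) are all hypothetical. Operations that end up on the same level have no dependency path between them, which is the concurrency the graph makes explicit.

```python
from collections import defaultdict, deque


class SequencingGraph:
    """Polar acyclic graph: vertices are operations, edges are dependencies."""

    def __init__(self):
        self.succ = defaultdict(set)          # operation -> dependent operations
        self.vertices = {"source", "sink"}    # no-operation poles

    def add_dependency(self, before, after):
        self.vertices.update((before, after))
        self.succ[before].add(after)

    def close(self):
        """Hook operations without predecessors/successors to source/sink."""
        has_pred = {v for targets in self.succ.values() for v in targets}
        for v in sorted(self.vertices - {"source", "sink"}):
            if v not in has_pred:
                self.succ["source"].add(v)
            if not self.succ[v]:
                self.succ[v].add("sink")

    def levels(self):
        """Longest-path depth of every vertex, starting from the source."""
        indeg = {v: 0 for v in self.vertices}
        for targets in list(self.succ.values()):
            for v in targets:
                indeg[v] += 1
        level = {"source": 0}
        ready = deque(["source"])
        while ready:
            u = ready.popleft()
            for v in self.succ[u]:
                level[v] = max(level.get(v, 0), level[u] + 1)
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
        return level


# Hypothetical operations: two inputs, a conditional, then a loop.
g = SequencingGraph()
g.add_dependency("read_a", "cond")
g.add_dependency("read_b", "cond")
g.add_dependency("cond", "loop")
g.close()
lv = g.levels()   # read_a and read_b share a level, so they may run concurrently
```

The two read operations share level 1 here, while `cond` and `loop` are forced onto later levels by their dependencies.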
Model properties. The sequencing 
graph is a polar one with source and 
sink vertices that represent no-opera- 
tions. Associated with each graph mod- 
el is a set of variables that defines the 
shared memory between operations in 
the graph model. Source and sink verti- 
ces synchronize executions of opera- 
tions in a graph model across multiple 
iterations. Thus, polarity of the graph 
model ensures that there is exactly one 
execution of an operation with respect 
to each execution of any other opera- 
tion. This makes execution of opera- 
tions within a graph single rate (Figure 
7). The set of variables associated with a 
graph model defines the storage common 
to the operations; it serves to facilitate com-
munication between operations.
Given the single-rate execution model,
it is relatively straightforward to ensure
ordering of operations in a graph model
that preserves integrity of memory shared
between operations. However, opera-
tions across graph models follow multi-
rate execution semantics. That is, an
operation may execute a variable number
of times for each execution of an opera-
tion in another graph model. Because of
this multirate nature of execution, the
operations use message-passing primi-
tives like send and receive to implement
communications across graph models.
Use of these primitives simplifies speci-
fication of intermodel communications.
A multirate specification is an important
feature for modeling heterogeneous sys-
tems, because the processor and appli-
cation-specific hardware may run on
different clocks and speeds.
HDL descriptions contain operations to
represent synchronization to external
events, such as the receive operation, as
well as data-dependent loop operations.
These operations, called nondeterministic 
delay (ND) operations, present unknown 
execution delays. The ability to model ND 
operations is vital for reactive embedded 
system descriptions. Figure 6 indicates ND 
operations with double circles. 
Figure 7. Properties of the graph model. (Diagram labels: graph model, single rate, messages, multirate.)
A system model may have many pos- 
sible implementations. Timing con- 
straints are important in defining specific 
performance requirements of the desired 
implementation. As shown in Figure 8, 
timing constraints are of two types: 
■ Min/max delay constraints: These
provide bounds on the time interval
between initiation of execution of
two operations.
■ Execution rate constraints: These
provide bounds on successive initi-
ations of the same operation. Rate
constraints on input/output opera-
tions are equivalent to constraints
on throughput of respective inputs/
outputs.
These two types of constraints are suf- 
ficient to capture constraints needed by 
most real-time systems.¹⁵ Our synthesis
system captures minimum delay con- 
straints in the graphical representation 
by providing weights on the edges to in- 
dicate delay of the corresponding 
source operation. Capturing maximum 
delay constraints requires additional 
backward edges (Figure 9). 
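Written as inequalities, the two edge types can be read as follows. This is a sketch in assumed notation (start time $\sigma(v)$, minimum delay $l_{ij}$, maximum delay $u_{ij}$), none of which is taken from the article. The negative weight on the backward edge is the usual convention and appears to match the $-U$ label in Figure 10.

```latex
\begin{align*}
\text{minimum delay } l_{ij}: &\quad \sigma(v_j) \ge \sigma(v_i) + l_{ij}
  && \text{forward edge } (v_i, v_j) \text{ with weight } l_{ij} \\
\text{maximum delay } u_{ij}: &\quad \sigma(v_j) \le \sigma(v_i) + u_{ij}
  \iff \sigma(v_i) \ge \sigma(v_j) - u_{ij}
  && \text{backward edge } (v_j, v_i) \text{ with weight } -u_{ij}
\end{align*}
```

Under this reading, every constraint becomes a lower bound along an edge, so a single longest-path analysis covers both constraint types.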
Model analysis. Having captured 
system functionality and constraints in a 
graphical model, we can now estimate 
system performance and verify the con- 
sistency of specified constraints. Perfor- 
mance measures require estimation of 
operation delays. We compute these 
delays separately for hardware and soft- 
ware implementations based on the 
type of hardware to be used and the pro- 
cessor used to run the software. A pro- 
cessor cost model captures processor 
characteristics. It consists of an execu- 
tion delay function for a basic set of pro- 
cessor operations, a memory address 
calculation function, a memory access 
time, and processor interruption re- 
sponse time. 
Timing constraint analysis attempts to 
answer the following question: Can im- 
posed constraints be satisfied for a given 
implementation? We indicate an imple- 
mentation of a model by assigning ap- 
propriate delays to the operations 
with known delays (not ND) in the 
graph model. Constraint satisfiability re-
lates to the structure as well as the actu-
al delay and constraint values on the
graph. Some structural properties of the 
graphs (relating to ND operations and 
their dependencies) may make a con- 
straint unsatisfiable regardless of the ac- 
tual delay values of the operations. 
Further, some constraints may be mutu- 
ally inconsistent: for example, a maxi- 
mum delay constraint between two 
operations that also have a larger mini- 
mum delay constraint. No assignment of 
nonnegative operation delay values can 
satisfy such constraints. 
In the presence of ND operations in a 
graph model, we consider a timing con-
straint satisfiable if it is satisfied for all pos-
sible (and maybe infinite) delay values of
the ND operations. We consider a timing 
constraint marginally satisfiable if it can 
be satisfied for all possible values within 
specified bounds on the delay of the ND 
operations. Marginal satisfiability analysis 
is useful because it allows the use of tim- 
ing constraints that can be satisfied under 
some implementation assumptions (ac- 
ceptable bounds on ND operation de- 
lays). Without these assumptions the 
general timing constraint satisfiability 
analysis would otherwise consider these 
constraints ill-posed.¹⁶
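The difference between satisfiability and marginal satisfiability can be sketched for the simplest case, a maximum delay constraint spanning a chain of operations. This is a deliberately simplified illustration, not the analysis of the cited work: the function `max_delay_met` and its parameters are hypothetical, and `None` is used here to mean that no bound is assumed for an ND operation.

```python
def max_delay_met(fixed_delays, nd_upper_bounds, max_delay):
    """Check a maximum delay constraint over a chain of operations.

    fixed_delays    -- delays (cycles) of operations with known delay
    nd_upper_bounds -- assumed upper bound per ND operation on the chain,
                       or None when no bound is assumed
    """
    if any(bound is None for bound in nd_upper_bounds):
        return False  # an unbounded ND delay can always exceed max_delay
    return sum(fixed_delays) + sum(nd_upper_bounds) <= max_delay


unbounded = max_delay_met([2, 3], [None], 10)   # False: not satisfiable
bounded = max_delay_met([2, 3], [4], 10)        # True: marginally satisfiable
too_slow = max_delay_met([2, 3], [6], 10)       # False: bound is too loose
```

With no bound on the ND operation the constraint cannot be guaranteed. Once a bound of 4 cycles is assumed, the same constraint holds for every delay within that bound.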
We perform timing constraint analysis
by graph analysis on the weighted se-
quencing graphs. Consider first the case
where the graph model does not contain
any ND operations. Here, we can label
every edge in the graph with a finite and
known weight. In such a graph, we can-
not satisfy a min/max delay constraint if a
positive cycle in the graph model exists.¹⁶
Next, in the presence of ND operations,
timing constraints are satisfiable if no cy-
cles containing ND operations exist. For a
cycle containing an ND operation, it is
impossible to determine satisfiability of
timing constraints, and only marginal sat-
isfiability can be guaranteed. As we will
see, it is possible to break the cycle by
graph transformations that preserve the
HDL program semantics.
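For the case without ND operations, the positive-cycle test can be sketched with a Bellman-Ford style longest-path relaxation. This is a minimal illustration, not the authors' implementation. The function name, the edge-list format, and the convention that a maximum delay constraint becomes a backward edge with negated weight are assumptions made for this example.

```python
def has_positive_cycle(vertices, edges):
    """Return True if the weighted constraint graph has a positive cycle.

    edges is a list of (u, v, weight): a forward edge carries a minimum
    delay, a backward edge carries the negated maximum delay.
    """
    longest = {v: 0 for v in vertices}        # as if a virtual source reached all
    for _ in range(len(vertices) - 1):        # Bellman-Ford style relaxation
        for u, v, w in edges:
            if longest[u] + w > longest[v]:
                longest[v] = longest[u] + w
    # Any edge that can still be relaxed lies on or after a positive cycle.
    return any(longest[u] + w > longest[v] for u, v, w in edges)


ops = ["op1", "op2"]
consistent = [("op1", "op2", 3), ("op2", "op1", -5)]    # min 3, max 5
inconsistent = [("op1", "op2", 3), ("op2", "op1", -2)]  # min 3, max 2
```

The second edge list encodes a maximum delay that is smaller than the minimum delay between the same two operations. Its cycle weight is 3 - 2 = 1, so the check reports the constraints as mutually inconsistent.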
For nonpipelined implementations, we
can treat rate constraints as min/max de-
lay constraints between corresponding
source and sink operations of the graph
model. Thus we can apply the above min/
max constraint satisfiability criterion to the
analysis of rate constraints.
Note that in some cases system
throughput (specified by rate con-
straints) can be optimized significantly
with little or no impact on system laten-
cy by using a pipelined execution mod-
el and extra resources. Indeed, for
deterministic and fixed-rate systems par-
ticularly used for digital signal process-
ing applications, researchers have
developed extensive transformations
that determine and achieve bounds on
system throughput.¹⁷ However, as noted
earlier, systems modeled by the se-
quencing graphs generally operate at
different rates. In addition, because of
the presence of ND operations due to
loops, the rate at which a particular op-
eration executes may change over time.
While this property is essential for model-
ing control-dominated embedded sys-
tems, it aggravates the problem of
determining absolute bounds on achiev-
able system throughput.
We illustrate the issue of rate con-
straints on graphs containing ND opera-
tions in Example A (next page).
In general, consider a process P that
contains an ND operation due to an un-
bounded loop. The ND operation induc-
es a bipartition of the calling process, P
= F ∪ B, such that the set of operations in
F (for example, the read operation in
process test) must be performed before
invoking the loop body. Further, the set
of operations in B can only be per-
formed after completing executions of
the loop body. We can then use func-
tional pipelining of F, B, and the loop to
improve the reaction rate of P. Since we
assume nonpipelined hardware, these
transformations are used only in the
context of the software component.

Figure 8. Timing constraints. (Diagram labels: time axis; min/max; rate.)
Figure 9. Representation of timing
constraints: min/max constraint (a), rate
constraint (b).

Constraint analysis and software.
The linear execution semantics imposed
by the software running on a single-pro-
cessor target architecture complicates
constraint analysis for a software imple-
mentation of a graph model. That is, per-
forming delay analysis for software
operations requires a complete order of
operations in the graph model. In creat-
ing a complete order of operations, it is
likely that unbounded cycles may be cre-
ated, which would make constraints
unsatisfiable.

Example A
Consider the following process
fragment:

process test (p, ...)
    in port p[SIZE];
    ...
    v = read p;
    while (v >= 0)
    {
        <loop-body>
        v = v - 1;
    }

Here, v is a Boolean array that rep-
resents an integer. In the presence of
rate constraint r on the read operation,
the constraint graph has a cycle con-
taining an ND operation relating to the
unbounded while loop operation. Note
that the rate constraint corresponds to
a directed edge from sink t to source s
in the graph of Figure A.
The overall execution time of the while
loop determines the interval between
successive executions of the read oper-
ation. Due to this variable-delay loop op-
eration, the input rate at port p is variable
and cannot always be guaranteed to
meet the required rate constraint. In gen-
eral, determining achievable throughput
at port p is difficult. As we explain next,
marginal satisfiability of the rate con-
straint can be ensured by graph transfor-
mations and by using a finite-size buffer.
Figure A shows the sequencing
graph model P corresponding to pro-
cess test. Identifier rd refers to the read
operation; lp refers to the while loop
operation. Symbols P1, P2, and so forth
in the execution trace below indicate
successive invocations of the process
test. L1, L2, L3, and L4 indicate multi-
ple invocations of the lp operation. De-
pending on the side effects produced
by the loop-body, the original graph P
can be transformed into fragments Q
and R such that executions of Q and R
can overlap to improve the throughput
of the read operation in Q. Data trans-
fers from Q to R by means of a buffer.
See Example B for a consideration of a
software implementation of P.

Figure A. Breaking ND cycle by graph transformation.

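The role of the finite-size buffer between fragments Q and R in Example A can be illustrated with a small timing model. This is a hypothetical sketch, not part of the article. It assumes that Q enqueues one value every `read_period` time units, that R handles one value at a time, and that the caller supplies the variable loop delay of each value in `service_times`. The function name and the conservative counting rule are likewise assumptions.

```python
def required_buffer_slots(service_times, read_period):
    """Largest number of values waiting between fragments Q and R.

    Q enqueues value k at time k * read_period; R handles one value at a
    time, taking service_times[k] (the variable loop delay) for value k.
    """
    starts = []          # time at which R dequeues each value
    finish = 0.0         # time at which R becomes free again
    worst = 0
    for k, service in enumerate(service_times):
        arrival = k * read_period
        start = max(arrival, finish)
        starts.append(start)
        finish = start + service
        # values (including this one) not dequeued before this arrival
        waiting = sum(1 for s in starts if s >= arrival)
        worst = max(worst, waiting)
    return worst


slots_steady = required_buffer_slots([1, 1, 1], read_period=2)
slots_bursty = required_buffer_slots([1, 5, 1, 1], read_period=2)
```

One long loop invocation makes later values queue up, so the bursty trace needs a deeper buffer than the steady one. A finite buffer is only sufficient while the loop delay stays within assumed bounds, which is why the rate constraint is marginally satisfiable rather than satisfiable.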
As shown in Figure 10, any serialization
that puts an ND operation between two
operations op1 and op2 will make any
maximum delay constraint between op1
and op2 unsatisfiable. However, note that
while all computations must be per-
formed serially in software, communica-
tion operations can proceed concurrently.
In other words, it is possible to overlap ex-
ecution of ND operations (wait for syn-
chronization or communication) with
some (unrelated) computation. But such
an overlap requires the ability to schedule
operations dynamically in software since
the simultaneously active ND operations
may complete in orders that cannot be
determined statically.

Figure 10. Linearization in software leads
to creation of unsatisfiable timing con-
straints. Constraint maxtime from op1 to
op2 = U cycles. (Diagram labels: partially ordered constraint graph; completely ordered constraint graph; op1, wait, op2; backward edge weight -U.)

Typically, dynamic scheduling of op-
erations involves delay overheads due
to selection and scheduling of opera-
tions. Therefore, a good model of soft-
ware is to think of software as a set of
fixed-latency concurrent threads (Figure
11). We define a thread as a linearized
set of operations that may or may not
begin with an ND operation, indicated by
a circle in Figure 11. Other than the be-
ginning ND operation, a thread does not
contain any ND operations. We consid-
er the delay of the initial ND operation
part of the scheduling delay and, there-
fore, not included in the latency of the
program thread. Use of multiple concur-
rent program threads instead of a single
program to implement the software also
avoids the need for complete serializa-
tion of all operations that may create un-
bounded cycles.

Figure 11. Software model to avoid cre-
ation of ND cycles.

In this software model, we can check 
marginal satisfiability of constraints on 
operations belonging to different threads, 
assuming a fixed and known delay of 
scheduling operations associated with 
ND operations (context switch delay, for 
example). 
System partitioning 
The system-level partitioning problem 
refers to the assignment of operations to 
hardware or software. The assignment 
of an operation to hardware or software 
determines the delay of the operation. In 
addition, assignment of operations to a 
processor and to one or more applica- 
tion-specific hardware circuits involves 
additional delays due to communica- 
tion overheads. 
Any good partitioning scheme must at- 
tempt to minimize this communication. 
Further, as operations in software are im-
plemented on a single processor, increas-
ing operations in software increases
processor utilization. Consequently, over-
all system performance depends on the ef-
fect of hardware-software partition on
utilization of the processor and the band-
width of the bus between the processor
and application-specific hardware.
A partitioning scheme thus must at- 
tempt to capture and make use of a par- 
tition’s effect on system performance in 
making trade-offs between hardware 
and software implementations of an op- 
eration. An efficient way to do this 
would be to devise a partition cost func-
tion that captures these properties. We
would then use this function to direct
the partitioning algorithm toward a de-
sired solution, where an optimum solu-
tion is defined by the minimum value of
the partition cost function.

Figure 12. Use of timing properties in partition cost function. (Diagram labels: statistical, deterministic bounds, hardware-software; horizontal axis: scheduling flexibility, from static through partial to dynamic.)

Note that we need to capture not only 
the effects of sizes of hardware and soft- 
ware parts but also the effect on timing 
behavior of these portions in our parti- 
tion cost function. In contrast, most par- 
titioning schemes for hardware have 
focused on optimizing area and pinout 
of resulting circuits. Capturing the effect 
of a partition on timing performance 
during the partitioning stage is difficult. 
Part of the problem arises because the 
timing properties are usually global in 
nature, thus making it difficult to make 
incremental computations of the parti- 
tion cost function as is essential for 
developing effective partition algo- 
rithms. Approximation techniques have 
been suggested to take into account the 
effect of a partition on overall latency.¹⁸
Note, however, that partitioning in the 
software world does make extensive use 
of statistical timing properties to drive the 
partitioning algorithms.¹⁹ We draw the
distinction between these two extremes 
of hardware and software partitioning by 
the flexibility to schedule operations. 
Hardware partitioning attempts to divide 
circuits that implement scheduled opera- 
tions. Conversely, the program-level parti- 
tioning addresses operations that are 
scheduled at runtime. 
Our approach to partitioning for hard- 
ware and software takes an intermediate 
approach. As shown in Figure 12, we use
deterministic bounds on timing proper- 
ties that are incrementally computable 
in the partition cost function. That is, we 
can compute the new partition cost 
function in constant time. We accom- 
plish this by using a software model in 
terms of a set of program threads as 
shown in Figure 11 and a partition cost 
function, f, that is a linear combination
of its variables. The following properties 
characterize this software component: 
■ Thread latency $\lambda_i$ (seconds) indi-
cates the execution delay of a pro-
gram thread.
■ Thread reaction rate $\rho_i$ (per sec-
ond) is the invocation rate of the
program thread.
■ Processor utilization $P$ is calculat-
ed by

$$P = \sum_{i=1}^{n} \lambda_i \cdot \rho_i$$

over the $n$ program threads.
■ Bus utilization $B$ (per second) is
the total amount of communication
taking place between the hardware
and software. For a set of $m$ vari-
ables to be transferred between
hardware and software,

$$B = \sum_{j=1}^{m} r_j$$

where $r_j$ is the inverse of the minimum
time interval (in seconds) between
two consecutive samples for vari-
able $j$, which is marked for destina-
tion to one of the program threads.

Characterization of software using the
$\lambda$, $\rho$, $P$, and $B$ parameters makes it
possible to calculate static bounds on
software performance. Use of these
bounds is helpful in selecting an appro-
priate partition of system functionality
between hardware and software. However,
it also has the disadvantage of overesti-
mating performance parameters such as
processor and bus bandwidth utilization.
Typically, there is a distribution of
thread invocations and communications
based on actual data values being
transferred, which is not accounted for
in these parameters.
We compute hardware size $S_H$ bot-
tom-up from the size estimates of the re-
sources implementing the individual
operations. In addition, we characterize
the interface between hardware and
software by a set of communication
ports (one for each variable) between
hardware and software that communi-
cate data over a common bus. The over-
head due to communication between
hardware and software is manifested by
the utilization of bus bandwidth as de-
scribed earlier.
Given the cost model for software,
hardware, and interface, we can infor-
mally state the problem of partitioning a
specification for implementation into
hardware and software as follows:
From a given set of sequencing graph
models and timing constraints between
operations, create two sets of sequenc-
ing graph models such that one can be
implemented in hardware and the other
in software and the following is true:
■ Timing constraints are satisfied for
the two sets of graph models.
■ Processor utilization, $P \le 1$.
■ Bus utilization, $B \le \bar{B}$.
■ A partition cost function, $f = f(S_H,
B, P^{-1}, m)$, is minimized.
An exact solution to the constrained
partitioning problem (a solution that
minimizes the partition cost function)
requires that we examine a large number
of solutions. Typically, that number is ex-
ponential in the number of operations
under partition. As a result, designers of-
ten use heuristics to find a "good" solu-
tion, with the objective of finding an
optimal value of the cost function that is
minimal for some local properties.
Most common heuristics for solving
partitioning problems start with a con-
structive initial solution that some itera-
tive procedure can then improve.
Iterative improvement can follow, for ex-
ample, from moving or exchanging oper-
ations and paths between partitions. A
good heuristic is also relatively insensitive
to the initial solution. Typically, exchange
of a larger number of operations makes
the heuristic more insensitive to the start-
ing solution, at the cost of increasing the
time complexity.
In the following, we describe the intu-
itive features of the partitioning algo-
rithm. We have presented details
elsewhere.²⁰ The procedure identifies
operations that can be implemented in
software such that the corresponding
constraint graph implementation can be
satisfied and the resulting software (as a
set of program threads) meets required
rate constraints on its inputs and out-
puts. As an initial partition we assume
that ND operations related to data-
dependent loop operations define the
beginning of program threads in soft-
ware, while all other operations are im-
plemented in hardware. The rate
constraints on software inputs/outputs
translate into bounds on required reac-
tion rate $\rho_i$ of corresponding program
thread $T_i$. Maximum achievable reac-
tion rate $\bar{\rho}_i$ of a program thread is com-
puted as the inverse of its latency. The
latency of a program thread is computed
using a processor delay cost model and
includes a fixed scheduling overhead
delay.
From an initial solution we perform
iterative improvement by migrating op-
erations between the partitions. Migra-
tion of an operation across a partition
affects its execution delay. It also affects
the latency and reaction rate of the
thread to which this operation is moved.
We similarly compute its effect on pro-
cessor and bus bandwidth utilization. At
any step, we select operations for migra-
tion so that the move lowers the com-
munication cost, while maintaining
timing constraint satisfiability. In addi-
tion, we check for communication feasi-
bility by verifying that $\bar{\rho}_i \ge \rho_i$ for each
thread, and that processor and bus utili-
zation constraints are satisfied.
System synthesis
From partitioned graph models, our
next problem is to synthesize individual
hardware and software components.
Ku¹⁴ and [...] address in detail the
generation of hardware circuits for se-
quencing graph models. Therefore, we
concentrate on generation of software
and interface circuitry from partitioned
models. The problem of software synthe-
sis is to generate a program from parti-
tioned graph models that correctly
implements the original system function-
ality. We assume that the resulting pro-
gram is mapped to real memory, so the
issues related to memory management
are not relevant to this problem. The par-
titioning discussed previously identified
graph models that are to be implemented
in hardware and operations (organized
as program threads) that are to be imple-
mented in software. See Example B.
Example B
We can implement the process test shown in Example A as the following two program threads in software.

    Thread T1        Thread T2
    read v           loop-synch
    detach           <loop-body>
                     v = v - 1
                     detach

In its software implementation of process test, thread T1 performs the reading operations, and thread T2 consists of operations in the body of the loop. For each execution of thread T1 there are v executions of thread T2.

[...] the variables common to S1 and S2. In such cases we can use data buffers be[...]

The program generation from a thread can either use a coroutine or subroutine scheme. Since, in general, there can be dependencies into and from the program threads, a coroutine model is more appropriate. A dependency between two operations can be either a data or a control dependency. Depending upon predecessor relationships and timing of the operations, we can make some of these redundant by inserting other dependencies such that resulting program threads are convex: all external dependencies are limited to the first and last operations.

For a given subgraph corresponding to a program thread, we can move an incoming data dependency up to its first operation and move an outgoing data dependency down to its last operation. This procedure produces a potential loss of concurrency. However, it makes the task of routine implementation easier since we can implement all the routines as independent programs with statically embedded control dependencies.
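As a rough illustration of the coroutine scheme, the two threads of Example B can be modeled with Python generators, where each `yield` plays the role of `detach`. The function names, the shared dictionary, and the trivial dispatch loop are assumptions made for this sketch; the article does not prescribe them.

```python
def thread_t1(inputs, shared, log):
    """T1: read v, then detach (give up the processor)."""
    for value in inputs:
        shared["v"] = value
        log.append(("T1", "read", value))
        yield  # detach


def thread_t2(shared, log):
    """T2: one execution of the loop body per resumption."""
    while True:
        log.append(("T2", "body", shared["v"]))  # stands in for <loop-body>
        shared["v"] -= 1
        yield  # detach


def run(inputs):
    """Hypothetical dispatcher: resume T1, then resume T2 while v > 0."""
    shared, log = {"v": 0}, []
    t1 = thread_t1(inputs, shared, log)
    t2 = thread_t2(shared, log)
    for _ in t1:
        while shared["v"] > 0:
            next(t2)
    return log


for entry in run([2, 1]):
    print(entry)
```

Because every dependency between the two threads sits at a thread boundary (the read at the start of T1, the decrement at the end of T2), each generator body is an independent routine, which is the property the convexity transformation aims for.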
Rate constraints and software. In the presence of dependencies on ND operations, we cannot always guarantee that a given software implementation will meet the data rate constraints on its I/O ports. In case of synchronization-related ND operations, we can check for marginal satisfiability of timing constraints by assigning a context-switch delay to the respective wait operations. However, in the case of unbounded loop-related ND operations, the delay due to these operations consists of active computation time. Marginal timing satisfiability analysis therefore requires that we estimate loop index values. We illustrate this in Example C.

Example C
Consider the threads T1 and T2 generated from process test mentioned in Example A. The overall execution time of the while loop determines the interval between successive executions of the read operation. Due to this variable-delay loop operation, the input rate at port p is variable, so we cannot always guarantee the reaction rate of T1. Since the set of operations in loop-body may alter the contents of memory in process test, thread T1 must be blocked until the completion of T2. Thus the process test can be thought of as consisting of two parallel processes, as shown in Figure B. We need the first operation of thread T2, wait1, to observe the data dependency of operations in thread T2. We need the second wait operation, wait2, to guarantee that any memory side effects of T2 for variables in T1 are correctly reflected. To obtain a deterministic bound on the reaction rate of the calling thread, it is possible to unroll the looping thread by creating a variable number of program threads. However, in this case each iteration of the looping thread would carry scheduling overhead. Dynamic creation of program threads may also lead to violation of processor utilization constraint as described in previous sections.

However, it is possible to overlap execution of loop thread T2 with execution of thread T1, and to ensure marginal timing constraint satisfiability. Note that we can remove operation [...]

Reference
1. R.K. Gupta, C. Coelho, and G. De Micheli, "Program Implementation Schemes for Hardware-Software Systems," Notes of Int'l Workshop Hardware-Software Codesign, Oct. 1992, and CSL Tech. Report TR-92-548, Stanford University, Stanford, Calif., 1992.
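To make the role of the loop-index estimate concrete, the following sketch bounds the interval between successive reads of T1 in Example C. The delay model (a fixed read time plus, per iteration, a loop-body time and a context-switch delay) and all parameter names are assumptions chosen for illustration only.

```python
def read_interval_cycles(read_cycles, body_cycles, switch_cycles, loop_index):
    """Estimated cycles between successive reads for a given loop index."""
    return read_cycles + loop_index * (body_cycles + switch_cycles)


def meets_rate(max_interval_cycles, read_cycles, body_cycles, switch_cycles, loop_index):
    """True if the estimated interval satisfies the input rate constraint."""
    interval = read_interval_cycles(read_cycles, body_cycles, switch_cycles, loop_index)
    return interval <= max_interval_cycles


def max_loop_index(max_interval_cycles, read_cycles, body_cycles, switch_cycles):
    """Largest loop index for which the rate constraint can still be met."""
    slack = max_interval_cycles - read_cycles
    if slack < 0:
        return -1  # the constraint cannot be met even with zero iterations
    return slack // (body_cycles + switch_cycles)


print(read_interval_cycles(10, 20, 5, 8))
print(max_loop_index(250, 10, 20, 5))
```

Under this model, the constraint is marginally satisfiable only if the estimated loop index stays at or below the value returned by `max_loop_index`.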
Hardware-software interface. Because of the serial execution of the software component, a data transfer from hardware to software must be explicitly synchronized. By using a polling strategy, we can design the software component to perform premeditated transfers from the hardware components based on its data requirements. This requires static scheduling of the hardware component. Where software functionality is limited by communications (that is, where the processor is busy waiting for an input-output operation most of the time), such a scheme would suffice. Further, in the absence of any unbounded-delay operations, we can simplify the software component in this scheme to a single program thread and a single data channel since all data transfers are serialized. However, this approach would not support any branching nor any reordering of data arrivals, since the design would not support dynamic scheduling of operations in hardware.

To accommodate differing rates of execution among the hardware and software components, and due to unbounded delay operations, we look for a dynamic scheduling of different threads of execution. Availability of data forms the basis for such a scheduling. One mechanism to perform such scheduling is a control FIFO (first in, first out) buffer, which attempts to enforce the policy that data items are consumed in the order in which they are produced. As shown in Example D, the hardware-software interface consists of data queues on each channel and a control FIFO that holds the identifiers for the enabled program threads in the order in which their input data arrives. The control FIFO depth equals the number of threads of execution, since a thread execution stalls pending availability of the requested data.

Example D
Consider the mixed implementation of a graphics controller that contains two threads for generation of line and circle coordinates in software as shown in Figure C. The interface protocol using control FIFO is specified as follows:

    queue [2] controlFIFO [1];
    queue [16] line-queue [1], circle-queue [1];
    when ((line-queue.dequeue-rq+ & !line-queue.empty) & !controlFIFO.full) do controlFIFO enqueue #2;
    when ((circle-queue.dequeue-rq+ & !circle-queue.empty) & !controlFIFO.full) do controlFIFO enqueue #1;
    when (controlFIFO.dequeue-rq+ & !controlFIFO.empty) do controlFIFO dequeue dlx.0xff000[1:0];

In this example, two data queues with 16 bits of width and 1 bit of depth, line-queue and circle-queue, and one queue with 2 bits of width and 1 bit of depth, controlFIFO, are declared. The guarded commands specify the conditions on which the number 1 or the number 2 is enqueued. Here, a '+' after a signal name means a positive edge and a '-' after the signal means a negative edge. The first when condition states that when a dequeue request for the queue line-queue arrives and this queue is not empty and the queue controlFIFO is not full, then enqueue the value 2 (representing identifier for a corresponding program thread that consumes data from the line queue) into the controlFIFO.

Figure C. Mixed implementation. (Diagram labels: ASIC hardware, processor, circle data queue, control FIFO.)
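The control FIFO policy can be illustrated with a small behavioral model. This is a sketch rather than the synthesized interface: the class, its method names, and the use of Python deques are assumptions, and the guard in `_update` only approximates the `when` conditions of Example D (request pending, data queue not empty, control FIFO not full).

```python
from collections import deque


class ControlFifoInterface:
    def __init__(self, queue_to_thread):
        # Maps each data queue name to the identifier of its consuming thread.
        self.queue_to_thread = dict(queue_to_thread)
        self.data = {q: deque() for q in self.queue_to_thread}
        self.pending = {q: False for q in self.queue_to_thread}
        self.control = deque()
        self.depth = len(self.queue_to_thread)  # one slot per thread

    def _update(self, q):
        # Guarded command: enqueue the thread identifier once its request
        # is pending, its data is available, and the control FIFO has room.
        if self.pending[q] and self.data[q] and len(self.control) < self.depth:
            self.control.append(self.queue_to_thread[q])
            self.pending[q] = False

    def request(self, q):
        """A thread stalls waiting for data on queue q."""
        self.pending[q] = True
        self._update(q)

    def produce(self, q, item):
        """Hardware writes one item into data queue q."""
        self.data[q].append(item)
        self._update(q)

    def dispatch(self):
        """Processor side: take the next enabled thread and its input item."""
        if not self.control:
            return None
        tid = self.control.popleft()
        q = next(name for name, t in self.queue_to_thread.items() if t == tid)
        return tid, self.data[q].popleft()


iface = ControlFifoInterface({"line_queue": 2, "circle_queue": 1})
iface.request("line_queue")
iface.request("circle_queue")
iface.produce("circle_queue", (3, 4))
iface.produce("line_queue", (0, 0))
print(iface.dispatch(), iface.dispatch(), iface.dispatch())
```

In this run the circle thread is dispatched first because its data arrived first, even though the line thread issued its request earlier.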
Note that thread scheduling by means 
of a control FIFO does not explicitly prior- 
itize the program threads. This is because, 
for safety reasons, the control FIFO serves 
program threads strictly in the order in 
which their identifiers are enqueued. In 
some systems we may want to invoke a 
program thread as soon as its needed 
data becomes available. Such systems 
would be better served by a preemptive 
scheduling algorithm based on relative 
priorities of the threads. However, preemption comes at significant operating system overhead. In contrast, nonpreemptive prioritized scheduling of program threads is possible with relatively
minor modifications to control FIFO. 
Example E describes the actual intercon- 
nection schematic between hardware 
and software for a single data queue. 
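The article does not spell out the modification needed for nonpreemptive prioritized scheduling. One possible reading, offered here purely as an assumption, is to replace the first-in, first-out discipline with a priority queue that is consulted only when the running thread finishes. The class name and the convention that a lower number means higher priority are hypothetical.

```python
import heapq
import itertools


class PriorityControlQueue:
    def __init__(self, priorities):
        # Maps thread identifier to priority; lower value is served first.
        self.priorities = dict(priorities)
        self._heap = []
        self._seq = itertools.count()  # preserves arrival order among equals

    def enqueue(self, tid):
        """Record that thread tid has its input data available."""
        heapq.heappush(self._heap, (self.priorities[tid], next(self._seq), tid))

    def dequeue(self):
        """Called only between thread executions, so nothing is preempted."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


q = PriorityControlQueue({1: 5, 2: 0})
q.enqueue(1)
q.enqueue(2)
print(q.dequeue(), q.dequeue(), q.dequeue())
```

Threads with equal priority are still served in arrival order, so the plain control FIFO is the special case in which all priorities are the same.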
We can implement the control FIFO and associated control logic either in hardware as a part of the ASIC component or in software. If we implement the control FIFO in software, the system no longer needs the FIFO control logic since the control flow is already in software. In this case, the q-rq lines from data queues connect to the processor's unvectored interrupt lines, and the system uses the respective interrupt service routines to enqueue the thread identifier tags into the control FIFO. During the enqueue operations the system disables interrupts to preserve the integrity of the software control flow.
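A software control FIFO with interrupts masked during queue updates might look like the following behavioral sketch. It models interrupt masking with a flag and a list of deferred requests; the class, the `interrupts_disabled` context manager, and the deferral behavior are assumptions for illustration, not a description of the DLX implementation.

```python
from collections import deque
from contextlib import contextmanager


class SoftwareControlFifo:
    def __init__(self):
        self.tags = deque()      # thread identifier tags, in arrival order
        self._masked = False
        self._pending = deque()  # interrupts raised while masked

    @contextmanager
    def interrupts_disabled(self):
        """Mask interrupts for a critical section, then deliver deferred ones."""
        self._masked = True
        try:
            yield
        finally:
            self._masked = False
            while self._pending:
                self.interrupt(self._pending.popleft())

    def interrupt(self, tag):
        """Model of a q-rq line raising an interrupt for thread `tag`."""
        if self._masked:
            self._pending.append(tag)
            return
        # Interrupt service routine: enqueue the tag with interrupts masked.
        with self.interrupts_disabled():
            self.tags.append(tag)

    def next_thread(self):
        """Scheduler side: dequeue the next enabled thread, if any."""
        with self.interrupts_disabled():
            return self.tags.popleft() if self.tags else None


fifo = SoftwareControlFifo()
fifo.interrupt(2)
with fifo.interrupts_disabled():
    fifo.interrupt(1)  # deferred until the critical section ends
print(list(fifo.tags), fifo.next_thread())
```

Deferring requests rather than dropping them keeps the order of thread tags consistent with the order of data arrival.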
Example E
Figure D shows schematic connection of the FIFO control signals for a single data queue. In this example, the data queue is memory mapped at address 0xee000 while the data queue request signal is identified by bit 0 of address 0xee004 and enable from the microprocessor (up-en) is generated from bit 0 of address 0xee008. The following describes the FIFO and microprocessor connections. cntc refers to a data queue associated with the circle drawing program threads. mp refers to a model of the microprocessor. A signal name is prefixed with the name of the associated hardware or software model, followed by a period.

    cntc.rq-line[0:0] = @ mp.0xee004[0:0];    # request
    cntc.en-line[0:0] = mp.0xee008[0:0];      # enable up-en
    cntc.ab-line[0:0] = mp.0xee000-rd;        # absorb up-ack

The control logic needed to generate the enqueue is described by a simple state transition diagram shown in Figure E. The control FIFO is ready to enqueue (indicated by gn = 1) process id if the corresponding data request (q-rq) is high and the process has enabled the thread for execution (up-en). Signal up-ab indicates completion of a control FIFO read operation by the processor.

In case of multiple in-degree queues, the enqueue-rq is generated by OR-ing the requests of all inputs to the queues. In case of multiple-out-degree queues, the signal dequeue-rq is generated also by OR-ing all dequeue requests from the queue.

Figure D. Control FIFO schematic.
Figure E. FIFO control state transition diagram. (States gn = 0 and gn = 1; transition labeled up-en & q-rq.)

Example

As an experiment in achieving mixed system designs, we attempted synthesis of an Ethernet-based network coprocessor. The coprocessor is modeled as a set of 13 concurrently executing processes that interact with each other by means of 24 send and 40 receive operations. The total description consists of 1,036 lines of HDL code. A hardware-software implementation of the coprocessor takes 8,572 bytes of program and data storage for a DLX processor [21] and 8,394 equivalent gates using an LSI Logic 10K library of gates.

We can thus build the mixed implementation using only one ASIC chip plus an off-the-shelf processor. A complete hardware implementation would require use of a custom chip or two ASIC chips. More importantly, we can guarantee that the mixed solution using a DLX processor running at 10 MHz will meet the imposed performance requirements of a maximum propagation delay of 46.4 µs, a maximum jam time of 4.8 µs, a minimum interframe spacing of 67.2 µs, and an input bit-rate of 10 Mbytes/s.

SYNTHESIS OF EMBEDDED REAL-TIME systems from behavioral specifications constitutes a challenging problem in hardware-software cosynthesis. Due to the relative simplicity of the target architecture compared to general-purpose computing systems, it also affords an opportunity in computer-aided design, by which we can automatically synthesize such systems from a unified specification. Further, the ability to perform constraint and performance analysis for such systems provides a major motivation for using the synthesis approach instead of design-oriented implementation approaches.
Even when manually designed, such systems can benefit greatly from prototypes created by a cosynthesis approach. A cosynthesis approach lets us reduce the size of the chip-synthesis task, while meeting the performance constraints, such that we can use field- or mask-programmable hardware to provide fast turnaround on complex system designs.
For hardware-software synthesis to be effective, we need specification languages that capture and use capabilities of both hardware and software. The approach presented in this article makes use of an HDL to formulate the problem of cosynthesis as an extension of hardware synthesis. In the process, the approach makes many simplifications for the generated software and leaves room for considerable optimization of the software component.

Currently, we are attempting to develop transformations to simplify control flow in the sequencing graph models, which we can use to minimize interface synchronization requirements. We also plan to investigate extensions to the target architecture to include hierarchical memory schemes and multiple processors.

Acknowledgments
We acknowledge discussions and contributions by Claudionor Coelho and David Ku. This research was sponsored by NSF-ARPA, under grant MIP 9115432, and by a fellowship provided by Philips at the Stanford Center for Integrated Systems.

References
1. Silicon Compilation, D. Gajski, ed., Addison-Wesley, Reading, Mass., 1988.
2. High-Level VLSI Synthesis, R. Camposano and W. Wolf, eds., Kluwer Academic Publishers, Norwell, Mass., 1991.
3. D.C. Luckham, "Partial Ordering of Event Sets and Their Application to Prototyping Concurrent Timed Systems," J. Systems and Software, July 1993.
4. G. De Micheli et al., "The Olympus Synthesis System for Digital Design," IEEE Design & Test of Computers, Vol. 7, No. 5, Oct. 1990, pp. 37-53.
5. N. Woo, W. Wolf, and A. Dunlop, "Compilation of a Single Specification Into Hardware and Software," Notes of Int'l Workshop Hardware-Software Codesign, Oct. 1992.
6. P. Chou, R. Ortega, and G. Borriello, "Synthesis of the Hardware/Software Interface in Microcontroller-Based Systems," Proc. Int'l Conf. Computer-Aided Design, IEEE Computer Society Press, Los Alamitos, Calif., 1992, pp. 488-495.
7. M. Chiodo et al., "Synthesis of Mixed Hardware-Software Implementations from CFSM Specifications," Memo UCB/ERL M93/49, June 1993, Univ. of California at Berkeley, and Notes of Int'l Workshop on Hardware-Software Codesign, Oct. 1992.
8. J. Henkel and R. Ernst, "Ein Softwareorientierter Ansatz zum Hardware-Software Co-Entwurf" [A Software-oriented Approach to Hardware-Software Codesign], Proc. ITG Conf., Rechnergestuetzter Entwurf und Architektur mikroelektronischer Systeme, Darmstadt, Germany, 1992, pp. 267-268.
9. M.B. Srivastava and R.W. Brodersen, "Rapid-Prototyping of Hardware and Software in a Unified Framework," Proc. Int'l Conf. Computer-Aided Design, IEEE CS Press, 1991, pp. 152-155.
10. J. Buck et al., "Ptolemy: A Framework for Simulating and Prototyping Heterogeneous Systems," to be published in Int'l J. Computer Simulations.
11. P. Bertin, D. Roncin, and J. Vuillemin, "Introduction to Programmable Active Memories," in Systolic Array Processors, J. McCanny, J. McWhirter, and E. Swartzlander, eds., Prentice Hall, New York, 1989, pp. 300-309.
12. R.W. Hartenstein, A.G. Hirschbiel, and M. Weber, "Mapping Systolic Arrays Onto the Map-oriented Machine," in Systolic Array Processors, J. McCanny, J. McWhirter, and E. Swartzlander, eds., Prentice Hall, New York, 1989, pp. 300-309.
13. S. Walters, "Reprogrammable Hardware Emulation Automates System-Level ASIC Validation," Wescon/90 Conf. Records, Electron. Conventions Mgt., Nov. 1990, pp. 140-143.
14. D. Ku and G. De Micheli, High-Level Synthesis of ASICs Under Timing and Synchronization Constraints, Kluwer Academic Publishers, Norwell, Mass., 1992.
15. B. Dasarathy, "Timing Constraints of Real-Time Systems: Constructs for Expressing Them, Method for Validating Them," IEEE Trans. Software Engineering, Vol. SE-11, No. 6, Jan. 1985, pp. 80-86.
16. D. Ku and G. De Micheli, "Relative Scheduling Under Timing Constraints: Algorithms for High-level Synthesis of Digital Circuits," IEEE Trans. CAD/ICAS, Vol. 11, No. 6, June 1992, pp. 696-718.
17. K.K. Parhi, "Algorithm Transform for Concurrent Processors," Proc. IEEE, Dec. 1989, IEEE Press, Piscataway, N.J., pp. 1879-1895.
18. R.K. Gupta and G. De Micheli, "Partitioning of Functional Models of Synchronous Digital Systems," Proc. Int'l Conf. Computer-Aided Design, IEEE CS Press, 1990, pp. 216-219.
19. V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors, MIT Press, Cambridge, Mass., 1989.
20. R.K. Gupta and G. De Micheli, "System-Level Synthesis Using Reprogrammable Components," Proc. European Design Automation Conf., IEEE CS Press, 1992, pp. 2-7.
21. J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, Palo Alto, Calif., 1990, pp. 88-137.
[...] Department of Electrical Engineering at Stanford University. His primary research interests are the design and synthesis of VLSI circuits and systems. Gupta received an MS in electrical engineering and computer science from the University of California, Berkeley, and a BTech in electrical engineering from the Indian Institute of Technology in Kanpur. Earlier he worked on VLSI design at various levels of abstraction as a member of the design teams for the 80386SX, 486, and Pentium microprocessor devices at Intel. He is coauthor of a patent on a PLL-based clock circuit, and is currently a Philips fellow at the Center for Integrated Systems at Stanford.

[...] professor of electrical engineering and computer science at Stanford University. His research interests include several aspects of the computer-aided design of integrated circuits with particular emphasis on automated synthesis, optimization, and verification of VLSI circuits. He is coeditor of Design Systems for VLSI Circuits: Logic Synthesis and Silicon Compilation, and coauthor of High-Level Synthesis of ASICs Under Timing and Synchronization Constraints. He graduated from the Politecnico di Milano with a degree in nuclear engineering and received a PhD in electrical engineering and computer science from the University of California, Berkeley. De Micheli is a senior member of the IEEE and is associate editor of the IEEE Proceedings, the IEEE Transactions on VLSI Systems, and Integration: The VLSI Journal.

Send correspondence about this article to the authors at the Center for Integrated Systems, CIS 18, Stanford University, Stanford, CA 94305; rgupta@momus.stanford.edu.
