High-level synthesis of asynchronous systems: Scheduling and process synchronization by Badia Sala, Rosa Maria & Cortadella, Jordi
High-Level Synthesis of Asynchronous Systems: 
Scheduling and Process Synchronization* 
Rosa M. Badia Jordi Cortadella 
Polytechnic University of Catalonia, Dept. of Computer Architecture. 
Campus Nord, Modul D4. Gran Capita s/num. Barcelona, E-08071 
Abstract 
Asynchronous systems are gaining acceptance as the size and
complexity of digital circuits increase. Accordingly, synthesis
tools for asynchronous systems must be developed to make the design
process easier. This paper aims at the definition of basic concepts
for scheduling algorithms and control synthesis in high-level
synthesis of asynchronous circuits. Two scheduling strategies are
presented and evaluated. Experiments on different benchmarks show
that efficient asynchronous schedules can be obtained. Control is
modelled in a distributed fashion, with Local Controllers
synchronizing with one another by means of handshaking protocols.
1 Introduction 
Asynchronous circuits present properties that meet the requirements
for large, complex systems [1, 2]: no clock skew, modular
interconnectivity, low peak currents, and performance determined by
average processing speeds. The design of asynchronous systems based
on self-timed circuits [1] has become more broadly accepted in
recent years. Most work on design automation for asynchronous
systems has focused on logic synthesis of sequential machines [3].
Currently, significant effort is being invested in the synthesis of
hazard-free circuits from Signal Transition Graphs [2], initially
proposed by Chu [4] to describe the behavior of asynchronous
sequential machines.
Other approaches synthesize asynchronous circuits from high-level
specifications by syntax-directed translation, according to
production rule sets. In [5] and [6] a similar strategy is used to
translate CSP into delay-insensitive circuits. However, we cannot
place these approaches within the category of high-level synthesis,
since no attempt is made to improve the quality of the circuit (size
and performance) by using optimization techniques like operation
scheduling and hardware allocation [7]. Syntax-directed translation
generates circuits whose size depends linearly on the size of the
input description [5].
The efforts on high-level synthesis [7] have been mainly focused on
synchronous designs. Clear evidence of this tendency is that the
proposed scheduling algorithms [8, 9] are based on the concept of
control step, i.e., time is measured in cycles, and cycle time is
determined by the worst-case delay of all the operations executed in
a control step. Only Ku and De Micheli [10] consider the possibility
of having synchronous operations with unbounded delays.

*Work funded by CYCIT TIC 91-1036 and ACiD-WG (Esprit 7225)
From the point of view of the timing model used for operation
scheduling and control synthesis, asynchronous systems present two
significant differences with respect to synchronous systems:
• Time is considered as a continuous variable, and initiation and
  completion of operations are events that can occur at any instant.
• Operations have variable, data-dependent delays.
Therefore, different scheduling strategies must be conceived if an
asynchronous timing model is considered. This paper aims at the
definition of basic concepts, data structures, and primitive
functions for asynchronous scheduling algorithms and control
synthesis.
The paper is organized as follows. Section 2 de- 
scribes the architecture model considered for the asyn- 
chronous execution of operations. Section 3 presents 
an overview of a high-level synthesis system. Sec- 
tion 4 defines the basic data structures and functions 
proposed for scheduling algorithms and presents two 
algorithms for operation scheduling. Section 5 de- 
scribes the connection between module binding and 
process synchronization and how control can be syn- 
thesized. Conclusions and future work are presented 
in section 6. 
2 Asynchronous Architecture Model 
A high-level synthesis system requires a target ar- 
chitecture model for the mapping of high-level objects 
(operations, variables, data transfers) into hardware 
modules (ALUs, registers, multiplexors). This section 
presents a short summary of the architecture model 
proposed in [11] (only details referring to operation
scheduling and control synthesis will be described). 
2.1 Data-Path 
The data-path is composed of self-timed blocks (ALUs, registers,
multiplexors, etc.) synchronized by means of a handshaking protocol
implemented with two signals: request and completion [1]. Registers
are implemented as latches (the request signal indicates when
latching must be initiated). Each hardware module is considered a
process which executes operations and synchronizes with other
processes when data transfers are required.

1066-1409/93 $03.00 © 1993 IEEE
2.2 Distributed Control 
Figure 1: Local Controllers organization
Control is completely distributed in such a way that for each
data-path block (or a group of data-path blocks) there is a local
controller (LC). A local controller has two types of handshaking
signals (see figure 1):
• Local signals: request and completion signals for the
  synchronization with the data-path block being controlled, and
  other signals such as the operation code for ALUs or the selection
  code for multiplexors.
• Global signals for the synchronization associated with data
  transfers between blocks.
The granularity of the control distribution may vary 
for the sake of the circuit performance. Hereafter and 
without loss of generality, we will consider that an LC 
exists for each computation block (or register) and its
input multiplexors. 
The execution of an operation has the following 
steps: 
1. Input data are read from registers.
2. Multiplexors transfer input data to their corresponding
   functional unit input.
3. The operation is executed in a functional unit.
4. Multiplexors transfer output data to a register.
5. Output data are latched into a register.
3 High-Level Synthesis: an Overview 
The system input is a behavioral description of the circuit, given
as a Control Data Flow Graph (CDFG). The first step performed is
Scheduling and Allocation (figure 2). Allocation selects the number
and type of hardware modules that will compose the data-path. In
operation scheduling, operations are distributed over time and are
assigned to a type of functional unit. The output is a Scheduled
Data Flow Graph (SDFG), which is a modification of the previous CDFG
containing information related to scheduling and allocation. In
Resource Binding, operations are bound to hardware modules. The
output is a Bound Data Flow Graph (BDFG) where each operation has
been bound to an FU instance and each variable to a register. After
binding, the set of operations that will be executed in each process
is totally defined and the behavior of the local controllers can be
derived as explained in section 5.
4 Scheduling 
We will represent the scheduling problem with a data flow graph
(DFG), $G(V,E)$, where vertices and edges denote operations and
dependencies respectively. For each vertex $v \in V$ we will use the
following terminology:
  $v_o$ : executed operation
  $v_s, v_c$ : start and completion time (after being scheduled)
  $v_f$ : functional unit type that executes $v_o$
  $pred(v) = \{u \mid (u,v) \in E\}$
  $succ(v) = \{u \mid (v,u) \in E\}$
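As a minimal sketch of this terminology, the fragment below models a DFG vertex and the pred/succ sets in Python. The class names `Vertex` and `DFG` and all field names are illustrative assumptions, not identifiers from the paper.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Vertex:
    """One DFG vertex v (hypothetical field names mirroring v_o, v_s, v_c, v_f)."""
    name: str
    op: str                              # v_o: executed operation
    start: Optional[float] = None        # v_s: start time, set by the scheduler
    completion: Optional[float] = None   # v_c: completion time, set by the scheduler
    fu_type: Optional[str] = None        # v_f: FU type that executes v_o


@dataclass
class DFG:
    vertices: dict = field(default_factory=dict)  # name -> Vertex
    edges: set = field(default_factory=set)       # (u, v): v depends on u

    def add_vertex(self, name, op):
        self.vertices[name] = Vertex(name, op)

    def add_edge(self, u, v):
        self.edges.add((u, v))

    def pred(self, v):
        """pred(v) = {u | (u, v) in E}"""
        return {u for (u, w) in self.edges if w == v}

    def succ(self, v):
        """succ(v) = {u | (v, u) in E}"""
        return {w for (u, w) in self.edges if u == v}


# Two multiplications feeding one addition.
g = DFG()
for name, op in [("a", "*"), ("b", "*"), ("c", "+")]:
    g.add_vertex(name, op)
g.add_edge("a", "c")
g.add_edge("b", "c")
```

The start, completion, and FU-type fields stay empty until a scheduler fills them in, which matches the "after being scheduled" qualifier above.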
From now on, the following assumptions will be considered (which
are close to reality):
• Since control is evenly distributed all over the circuit, it is
  assumed that delays introduced by the LCs are constant.
• Latching delays are equal and constant.
Thus, synchronization and latching delays can be included in the
delay of each operation.
The library of functional units (FUs) available for a given
technology is represented by the Delay Matrix $\delta$ (see figure
4). Each element $\delta_{f,o}$ indicates the delay for the
execution of operation $o$ by the FU type $f$ ($\delta_{f,o} =
\infty$ indicates that $o$ is not implemented by $f$). We will
denote by $FU(o)$ the set of FU types implementing $o$. In the
environment of asynchronous systems, where execution delays are
data-dependent, the Delay Matrix $\delta$ represents average delays.
Thus, scheduling times obtained by using $\delta$ must be considered
as estimated average processing delays¹.
A resource vector, $R = \langle |f_1|, |f_2|, \ldots, |f_n|
\rangle$, represents a set of resources available for allocation,
where $|f_i|$ indicates the number of instances of the FU type $f_i$
($|f_i| > 0$). Given a resource vector $R$, we define the average
delay for operation $o$, $\bar{\delta}_o$, as:

$$\bar{\delta}_o = \frac{\sum_{f_i \in FU(o)} |f_i| \, \delta_{f_i,o}}{\sum_{f_i \in FU(o)} |f_i|}$$
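The average delay can be illustrated with a short Python sketch that weights each FU type's delay by its number of instances. The delay values, FU type names, and function names below are hypothetical and chosen only for illustration; they are not the library of figure 4.

```python
import math

# Hypothetical delay matrix: delta[fu_type][operation]; math.inf means
# "operation not implemented by this FU type".
delta = {
    "fast_adder": {"+": 20.0, "*": math.inf},
    "slow_adder": {"+": 40.0, "*": math.inf},
    "multiplier": {"+": math.inf, "*": 100.0},
}

# Hypothetical resource vector R: number of instances of each FU type (> 0).
resources = {"fast_adder": 1, "slow_adder": 3, "multiplier": 2}


def fu_types(op, delta):
    """FU(o): the FU types with a finite delay for operation op."""
    return [f for f, row in delta.items() if row.get(op, math.inf) != math.inf]


def average_delay(op, delta, resources):
    """Instance-weighted mean delay of op over the FU types implementing it."""
    types = fu_types(op, delta)
    total_instances = sum(resources[f] for f in types)
    weighted_delay = sum(resources[f] * delta[f][op] for f in types)
    return weighted_delay / total_instances
```

With these assumed numbers, an addition averages (1·20 + 3·40)/4 = 35 time units: the three slow adders pull the estimate toward 40 because an addition is three times as likely to land on one of them.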
$\bar{\delta}_o$ is a pre-scheduling estimation of the expected
execution delay for operation $o$, assuming that any vertex $v$ can
be equiprobabilistically assigned to any $f \in FU(v_o)$.
Finding an asynchronous schedule means defining a partial ordering
of the vertices and allocating each operation to a type of FU so
that the total estimated processing delay is minimized.

¹ A Worst-Case Delay Matrix would also be required to calculate
worst-case processing delays if timing constraints were imposed in
the DFG. Dealing with timing constraints is out of the scope of this
paper.

Figure 2: High-level synthesis steps (diagram labels: SDFG, Resource
Binding, BDFG, Control Generation, Data-path)

4.1 Data Structures
Figure 3: (a) Frame Reservation Table; (b) Event List;
(c) Partially-scheduled DFG
A Frame Reservation Table ($FRT_f$) is a data structure bound to a
type of FU $f$ that reports the number of active instances of type
$f$ at each time instant. $FRT_f$ is updated each time a new
operation is scheduled and bound to an FU of type $f$. The number of
active instances can never be greater than $|f|$.
$FRT_f$ can be represented as an Event List ($EL_f$). $EL_f$ is a
list of pairs $\langle time_i, nf_i \rangle$ ordered by time, where
$nf_i$ indicates the number of available (non-active) FUs of type
$f$ from $time_i$ to $time_{i+1}$. Figures 3.a and 3.b depict the
$FRT_f$ and $EL_f$ corresponding to the scheduled operations shown
in figure 3.c.
Two functions have been defined for managing the Event List [12]:
• start_time = find_free_interval (EL, min_start_time, delay)
  This function seeks in the event list EL for the first time
  interval of duration delay with start_time ≥ min_start_time such
  that it has at least one free FU.
• reserve_interval (EL, start_time, delay)
  This function reserves a time interval of duration delay starting
  at start_time.
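The two primitives can be sketched in Python as follows. This is one possible reading of the description, not the implementation of [12]: the event list is assumed to be a time-ordered list of (time, free FUs) pairs whose last entry extends to infinity, times are assumed non-negative, and the helper names `initialize_event_list` and `_split` are hypothetical.

```python
import bisect
import math


def initialize_event_list(n_instances):
    """All n_instances FUs of one type are free from time 0 onwards."""
    return [(0, n_instances)]


def find_free_interval(el, min_start_time, delay):
    """Earliest start >= min_start_time with at least one free FU for `delay`."""
    start = None  # start of the current run of non-busy intervals, if any
    for i, (t, nfree) in enumerate(el):
        end = el[i + 1][0] if i + 1 < len(el) else math.inf
        if end <= min_start_time:
            continue          # interval lies entirely before min_start_time
        if nfree < 1:
            start = None      # a fully busy interval breaks the run
            continue
        if start is None:
            start = max(t, min_start_time)
        if end - start >= delay:
            return start
    raise ValueError("no free interval found")


def _split(el, t):
    """Make sure an event begins exactly at time t and return its index."""
    times = [event[0] for event in el]
    i = bisect.bisect_left(times, t)
    if i < len(el) and el[i][0] == t:
        return i
    el.insert(i, (t, el[i - 1][1]))  # inherit the count of the enclosing interval
    return i


def reserve_interval(el, start_time, delay):
    """Mark one FU as busy during [start_time, start_time + delay)."""
    lo = _split(el, start_time)
    hi = _split(el, start_time + delay)
    if any(el[i][1] < 1 for i in range(lo, hi)):
        raise ValueError("no free FU in the requested interval")
    for i in range(lo, hi):
        t, nfree = el[i]
        el[i] = (t, nfree - 1)


# Two FU instances; reserve two overlapping operations.
el = initialize_event_list(2)
reserve_interval(el, 0, 10)
reserve_interval(el, 5, 10)
```

Storing free counts rather than per-instance occupancy keeps both operations a single pass over the list, which is sufficient here because operations are bound to FU instances only later, during resource binding.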
4.2 Event-List-Based Scheduling 
Two algorithms for operation scheduling in asynchronous systems are
presented in this section: ELS (Event-List Scheduling) and ELLAS
(Event-List Look-Ahead Scheduling). Both algorithms select vertices
to be scheduled according to a priority function. They differ in the
calculation of the priority function: in the former the priority of
each vertex is calculated at the beginning of the algorithm, while
in the latter it is dynamically evaluated as a result of a
look-ahead scheduling function.
During the execution of a scheduling algorithm the set of nodes of
the DFG can be partitioned into three sets: the Ready Set (RS), the
Scheduled Set (SS), and the Non-Scheduled Set (NSS). SS contains all
the vertices already scheduled. A vertex $v$ belongs to RS if it has
not been scheduled yet and, for each $u \in pred(v)$, $u \in SS$.
The rest of the vertices belong to NSS. The ELS algorithm is
described next.
ELS (G(V,E), δ, R) {
  for each f in the library do initialize_event_list (EL_f, |f|);
  calculate_path_lengths_to_end (G, δ, R);
  RS = source_vertices(G); NSS = V − RS; SS = ∅;
  while RS ≠ ∅ {
    v = max_priority_vertex(RS);
    min_start = max {u_c | u ∈ pred(v)};
    for each f ∈ FU(v_o) do {
      min_start_f = find_free_interval (EL_f, min_start, δ_{f,v_o});
      completion_f = min_start_f + δ_{f,v_o};
    }
    f_min = f such that
      (completion_{f_min} = min {completion_f | f ∈ FU(v_o)});
    reserve_interval (EL_{f_min}, min_start_{f_min}, δ_{f_min,v_o});
    v_s = min_start_{f_min}; v_c = completion_{f_min}; v_f = f_min;
    SS = SS ∪ {v}; fireable = {w ∈ NSS | ∀u ∈ pred(w), u ∈ SS};
    RS = (RS − {v}) ∪ fireable; NSS = NSS − fireable;
  }
}
Similarly to list scheduling [9], this algorithm first calculates a
priority for each vertex of the DFG. Then, vertices in RS are
scheduled in order according to their priority. The priority of each
vertex $v$, $v_p$, is calculated as the path length to the end of
the DFG, assuming that each vertex $v$ is executed in
$\bar{\delta}_{v_o}$ time. The calculation of $v_p$ can be done
recursively from sink to source vertices as follows:

$$v_p = \max_{u \in succ(v)} u_p + \bar{\delta}_{v_o}$$
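The recurrence can be sketched in Python with memoized recursion. The function name and the input dictionaries are hypothetical, and the sketch assumes that the maximum over an empty successor set is 0, so that a sink vertex's priority equals its own average delay (the text does not state this base case explicitly).

```python
from functools import lru_cache


def path_lengths_to_end(succ, avg_delay):
    """Return {vertex: priority}, the longest path length to the end of the DFG.

    succ maps each vertex to its successors; avg_delay maps each vertex to
    the average delay of its operation.
    """

    @lru_cache(maxsize=None)
    def priority(v):
        # Assumption: a sink vertex (no successors) contributes a tail of 0.
        longest_tail = max((priority(u) for u in succ[v]), default=0)
        return longest_tail + avg_delay[v]

    return {v: priority(v) for v in succ}


# Hypothetical DFG: a and b feed c, and c feeds d.
succ = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
avg_delay = {"a": 100, "b": 35, "c": 35, "d": 35}
priorities = path_lengths_to_end(succ, avg_delay)
```

Memoization ensures each vertex is evaluated once, so the pass is linear in the number of vertices and edges.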
First, all the event lists are initialized with their corresponding
number of resources ($|f|$), and the priority (path length to end)
of each vertex is calculated. The main loop of the algorithm first
selects the vertex $v$ with maximum priority in the Ready Set. Then,
it calculates the completion time achievable by each FU type that
can execute $v_o$, and selects the FU type, $f_{min}$, that yields
the minimum value. Finally, $v$ is scheduled and bound to $f_{min}$.
Information about utilization of resources is kept in the event
lists and managed by the functions find_free_interval and
reserve_interval. When an operation can be executed by more than one
type of FU, ELS tends to bind vertices with higher priority to
faster FUs. The time complexity of ELS is $O(n \log n)$.
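The ELS main loop can be sketched end to end in Python as shown below. To stay short, this sketch makes several simplifying assumptions that are not part of the paper: reservations are kept as plain (start, end) lists per FU type instead of event lists, source vertices may start at time 0, ties in priority are broken by vertex name, and no attempt is made to reach the stated complexity bound. All function and parameter names are hypothetical.

```python
import math


def find_free_interval(reserved, n, min_start, delay):
    """Earliest start >= min_start where fewer than n FUs are busy for `delay`."""
    def active(t):
        return sum(1 for (a, b) in reserved if a <= t < b)
    # The earliest feasible start is min_start or the end of some reservation.
    candidates = sorted({min_start} | {b for (_, b) in reserved if b > min_start})
    for s in candidates:
        # Busy counts only rise at reservation starts, so check those points.
        points = [s] + [a for (a, _) in reserved if s < a < s + delay]
        if all(active(t) < n for t in points):
            return s


def els(ops, edges, delta, resources):
    """ops: vertex -> operation; edges: set of (u, v) dependencies.
    Returns vertex -> (start, completion, fu_type)."""
    pred = {v: {u for (u, w) in edges if w == v} for v in ops}
    succ = {v: {w for (u, w) in edges if u == v} for v in ops}

    def fu(o):  # FU(o): FU types with a finite delay for operation o
        return [f for f in delta if delta[f].get(o, math.inf) < math.inf]

    # Average delay per vertex, weighted by the number of FU instances.
    avg = {v: sum(resources[f] * delta[f][ops[v]] for f in fu(ops[v]))
              / sum(resources[f] for f in fu(ops[v])) for v in ops}
    prio = {}

    def priority(v):  # path length to the end of the DFG
        if v not in prio:
            prio[v] = max((priority(u) for u in succ[v]), default=0) + avg[v]
        return prio[v]

    reserved = {f: [] for f in delta}
    schedule = {}
    ready = {v for v in ops if not pred[v]}
    while ready:
        v = max(sorted(ready), key=priority)
        min_start = max((schedule[u][1] for u in pred[v]), default=0)
        best = None
        for f in fu(ops[v]):
            d = delta[f][ops[v]]
            s = find_free_interval(reserved[f], resources[f], min_start, d)
            if best is None or s + d < best[1]:
                best = (s, s + d, f)  # keep the FU type finishing earliest
        reserved[best[2]].append((best[0], best[1]))
        schedule[v] = best
        ready.remove(v)
        # Successors whose predecessors are all scheduled become ready.
        ready |= {w for w in succ[v] if all(u in schedule for u in pred[w])}
    return schedule
```

For instance, with one multiplier (delay 100), one fast adder (delay 20), and one slow adder (delay 40), two multiplications feeding an addition are serialized on the multiplier and the addition is bound to the fast adder, since that FU type yields the earlier completion time.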
As mentioned before, ELLAS dynamically calculates each vertex’s
priority by using a look-ahead scheduling function. The time
complexity of ELLAS is $O(n^3 \log n)$ [12].
4.3 Results 
Figure 4: Delay Matrix used for the benchmarks. 
In this section, the scheduling algorithms previously presented are
evaluated. Two benchmarks have been chosen to present the results of
the experiments: the Differential Equation Solver [8] and the
Fifth-order Wave Digital Filter [13] (hereafter, the Elliptic
Filter). The library used for both benchmarks is represented by the
Delay Matrix depicted in figure 4.
Table 1: Results for the Differential Equation Solver and for the
Elliptic Filter, with CPU times on a DECsystem 5100 (all CPU times
for ELS are 0.01 sec.)
Figure 5: Schedule for the Diff. Eq. Solver
Table 1 presents the results obtained for the two benchmarks
considering different resource constraints. For the Differential
Equation Solver, ELS and ELLAS give the same results. CPU times are
similar due to the small size of the problem. Figure 5 depicts the
resulting schedule for one of the experiments. For the Elliptic
Filter, ELLAS is superior to ELS in most cases, at the cost of
higher (but still moderate) CPU times. Figure 6 shows the schedule
obtained with ELLAS for one of the experiments. It is worth
emphasizing the ability to bind critical operations to fast FUs and
non-critical operations to slow FUs when they can be concurrently
executed.

Figure 6: Schedule for the Elliptic Filter obtained by ELLAS
5 Binding and Process Synchronization
Module binding is performed after scheduling and allocation in
order to bind operations to hardware module instances. The criteria
used for binding aim at reducing the connectivity of the circuit so
that routing area and communication delays are minimized. In that
respect, binding algorithms already proposed for synchronous
circuits can also be used for asynchronous circuits (if they do not
use control-step-based approaches) [14].
Once binding is performed, the sequence of operations to be
executed in each process is completely defined. Each data transfer
between two operations corresponds to a synchronization between
processes when the operations are bound to different hardware
modules. In figure 7 each arrow corresponds to a synchronization
between processes for the scheduling example of figure 5.a.
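The rule that a data transfer becomes a synchronization only when producer and consumer are bound to different modules can be sketched in Python as follows. The function name, the operation names, and the module names are hypothetical, and the sketch deliberately ignores the register transfers that a full derivation would also have to cover.

```python
def synchronizations(edges, binding):
    """List the data transfers that cross hardware module boundaries.

    edges: iterable of (producer, consumer) operation pairs from the DFG.
    binding: maps each operation to its hardware module instance.
    Returns (producer, consumer, producer_module, consumer_module) tuples.
    """
    return [
        (u, v, binding[u], binding[v])
        for (u, v) in edges
        if binding[u] != binding[v]  # same module: no inter-process sync
    ]


# Hypothetical bound DFG: two multiplications on MUL1, two additions on ADD1.
edges = [("m1", "a1"), ("m2", "a1"), ("a1", "a2")]
binding = {"m1": "MUL1", "m2": "MUL1", "a1": "ADD1", "a2": "ADD1"}
syncs = synchronizations(edges, binding)
```

In this example the transfer from a1 to a2 stays inside ADD1 and needs no inter-process synchronization, while both multiplier results must be handed over from MUL1 to ADD1.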
Each data transfer between a computational block and a register
requires an explicit synchronization between the processes
corresponding to each of the involved hardware modules.
Synchronizations are required to ensure the sequencing imposed by
data dependencies, as asynchronous systems do not have a global
clock that indicates the completion of operations. After the
ordering of data transfers and synchronizations has been determined,
the behavior of each local controller is derived by defining the
transitions of the handshake signals. Signal Transition Graphs
(STGs) [4] are used as the behavioral description for local
controllers. From STGs, a hazard-free circuit can be synthesized for
each controller. For more details, we refer the reader to [11],
where an approach for the synthesis of distributed asynchronous
controllers from high-level descriptions is proposed.

Figure 7: Synchronizations between processes (some register
assignments have been replicated to increase the readability of the
diagram)
6 Conclusions and future work 
As the design of asynchronous circuits is gaining acceptance, tools
for high-level synthesis are more necessary. This paper has
presented the first approach, to the knowledge of the authors, to
scheduling for the high-level synthesis of asynchronous circuits.
For an asynchronous timing model, in which no control steps exist,
scheduling means defining a partial ordering of the execution of the
operations. Basic data structures and primitive functions for the
management of initiation and completion operation events have been
defined. Two algorithms, ELS ($O(n \log n)$ time) and ELLAS
($O(n^3 \log n)$ time), have been proposed and evaluated.
Further research is required in this emerging area. Among the
issues not considered in this paper, we mention some of the most
significant: scheduling across basic blocks, pipelined functional
units, and scheduling under timing constraints. On the other hand,
further research is also required in the area of distributed control
synthesis for asynchronous systems.
References 
[1] C.L. Seitz, Introduction to VLSI Systems, Chapter 7,
Mead and Conway (Eds.), Addison Wesley, 1981.
[2] T.H. Meng, Synchronization Design for Digital Systems,
Kluwer Academic Publishers, 1991.
[3] G. Mago, “Realization Methods for Asynchronous Sequential
Circuits,” IEEE Trans. on Computers, Vol. C-20, No. 3,
pp. 290-297, March 1971.
[4] T.A. Chu, Synthesis of Self-timed VLSI Circuits from
Graph-theoretic Specifications, Ph.D. thesis, MIT, June 1987.
[5] A. J. Martin, “Compiling Communicating Processes into
Delay-insensitive VLSI Circuits,” Distributed Computing,
Vol. 1 (4), Springer-Verlag, pp. 226-234, 1986.
[6] K. van Berkel, J. Kessels, M. Roncken, R. Saeijs, and
F. Schalij, “The VLSI-programming language Tangram and its
translation into handshake circuits,” Proc. European Conference
on Design Automation, pp. 384-389, Feb. 1991.
[7] M.C. McFarland, A.C. Parker, and R. Camposano, “Tutorial on
High-Level Synthesis,” Proc. 25th ACM/IEEE Design Automation
Conference, pp. 330-336, June 1988.
[8] P.G. Paulin and J.P. Knight, “Force-Directed Scheduling in
Automatic Data Path Synthesis,” Proc. 24th ACM/IEEE Design
Automation Conference, pp. 195-202, June 1987.
[9] S. Davidson, D. Landskov, B. D. Shriver, and P. W. Mallet,
“Some experiments in local microcode compaction for horizontal
machines,” IEEE Trans. on Computers, vol. C-30, July 1981.
[10] D. Ku and G. De Micheli, “Relative Scheduling Under
Timing Constraints: Algorithms for High-Level Synthesis of
Digital Circuits,” IEEE Trans. on Computer-Aided Design,
Vol. 11, No. 6, pp. 696-718, June 1992.
[11] J. Cortadella and R.M. Badia, “An Asynchronous Ar- 
chitecture Model for Behavioral Synthesis,” Proc. Eu- 
ropean Conference on Design Automation, pp. 307-311, 
March 1992. 
[12] R.M. Badia and J. Cortadella, “High-Level Syn- 
thesis of Asynchronous Digital Circuits: Scheduling 
Strategies,” UPC/DAC Report no. RR-92/6, Novem- 
ber 1992. 
[13] P. Dewilde, E. Deprettere, and R. Nouta, “Parallel 
and Pipelined VLSI Implementation of Signal Process- 
ing Algorithms,” in VLSI and Modern Signal Process- 
ing, ed. T. Kailath, pp. 258-264, 1985. 
[14] C.J. Tseng and D.P. Siewiorek, “Automated Synthe- 
sis of Data Paths on Digital Systems,” IEEE Transac- 
tions on Computer-Aided Design of Integrated Circuits 
and Systems, vol. CAD-5, no. 3, pp. 379-395, July 1986. 
