Architectural synthesis of timed asynchronous systems by Myers, Chris J. & Bachman, Brandon M.
Architectural Synthesis of Timed Asynchronous Systems * 
Brandon M. Bachman 
Hewlett-Packard Company 




This paper describes a new method for architectural
synthesis of timed asynchronous systems. Due to the
variable delays associated with asynchronous resources,
implicit schedules are created by the addition of
supplementary constraints between resources. Since the
number of schedules grows exponentially with respect
to the size of the given data flow graph, pruning
techniques are introduced which dramatically improve
run-time without significantly affecting the quality of the
results. Using a combination of data and resource
constraints, as well as an analysis of bounded delay
information, our method determines the minimum
number of resources and registers needed to implement a
given schedule. Results are demonstrated using some
high-level synthesis benchmark circuits and an
industrial example.
1. Introduction 
Architectural-level synthesis is the process of taking
an abstract behavioral model of a desired circuit
and refining it to an optimal macroscopic structure.
Issues such as latency, area, and power must be taken
into consideration to balance trade-offs in a design.
Architectural-level synthesis is an approach to managing
these trade-offs at a macroscopic level. There has
been a plethora of methods developed to manage these
trade-offs for synchronous design (synchronous
high-level synthesis methods are surveyed in [6], and recent
work includes [14, 16, 7, 15, 11]).
* This research is supported by NSF CAREER award
MIP-9625014, SRC contract 99-TJ-694, and a grant from Intel
Corporation.
Hao Zheng & Chris J. Myers
University of Utah
Electrical Engineering Department
Salt Lake City, UT
{hao,myers}@vlsigroup.elen.utah.edu
As transistors decrease in size, the integrated circuit
industry continues to increase clock speeds and
density, making global synchronization across
large chips more difficult to maintain. As a result,
asynchronous design is being considered as an alternative
because it eliminates clocking issues and has the
potential to achieve lower power as well as average-case
performance. In [10], some resources can be asynchronous
with an unbounded delay, and a synchronous schedule
is determined relative to their completion. This
method, however, does not apply when the entire design
is asynchronous, as it neither determines a schedule
of the asynchronous resources nor supports bounded
variable delays. There has been only limited research in the
architectural-level synthesis of fully asynchronous
systems. Several automated asynchronous design methods
exist which transform high-level algorithmic
descriptions down to layout [1, 4, 12]. These methods,
however, do not consider design trade-offs such as resource
and register sharing in an automated way. Heuristic
techniques for high-level synthesis of synchronous
circuits have been extended to asynchronous circuits [3].
A graph-based algorithm for synthesis has also been
proposed, but the complexity of this technique
restricts its application to small examples [17].
This paper presents a new architectural-level synthesis
method for asynchronous systems. This method begins
with a behavioral specification, a library of
characterized asynchronous datapath resources, and optional
area and/or delay constraints. From this information,
our method determines a datapath and a schedule for
the operations. For synchronous systems, scheduling
determines when operations are executed in time. This
can be done efficiently using discrete-time intervals
based on a global clock. In an asynchronous circuit,
the absence of a global clock and the asynchronous
timing of events make scheduling difficult. The scheduling
of resources is dependent only on the availability of
the resource and its inputs. For accurate asynchronous
scheduling, resources must be modeled with variable
completion delays. It is also difficult to break time
into discrete bins because the fine grain discretization
needed for asynchronous scheduling makes traditional
synchronous scheduling algorithms computationally
infeasible. For these reasons, scheduling information is
not used here to explicitly schedule an operation to a
specific time. It is only used to determine conservative
windows of time in which an operation may occur. The
actual schedule is determined by the resource sharing
and the order of operations. To accomplish this, our
synthesis method performs timing analysis and adds
resource edges into the DFG to determine a schedule. A
number of filters are introduced which reduce the
number of possible schedules which need to be explored.
For each schedule, our synthesis method attempts to
share resources and registers whenever possible to
improve the area of the resulting datapaths. The
synthesized architectures are evaluated, and a list of potential
datapath configurations is presented to the user. The
utility of our architectural-level synthesis method is
evaluated using several high-level synthesis benchmarks,
as well as an industrial example.
Figure 1. A DFG for a differential equation
solver (a) before and (b) after adding resource
edges.
2. DFGs and the Resource Library 
An architecture is typically specified using a
high-level hardware description language such as VHDL or
Verilog. This description can then be compiled into a
data flow graph (DFG). A DFG is an
abstract representation of the functional behavior of a
circuit. The nodes are operations, such as additions
and multiplications. The data edges represent the flow
of data from one operation to the next, where each
directed edge represents a data dependency between
two operations. Figure 1(a) shows an example of a
data flow graph for a differential equation solver.
The resources in the asynchronous datapath library
can be dual-rail, bundled data, or some hybrid. They
are characterized with an area and a minimum,
maximum, and typical delay. It should be noted that even
bundled data resources have significant delay variation
due to temperature, voltage, and process variation.
3. Scheduling using Resource Edges 
The first step of architectural synthesis is to
determine the schedule in which resources are to be used.
The goal of scheduling is to restrict when certain
operations in the DFG can occur such that multiple
operations can be completed using the same resource.
Two or more operations can share the same resource if
they are of the same type and they are not in conflict
with each other. Operations are in conflict if their
execution windows overlap in time. This happens when
either operation starts before the other has completed.
Operations that are scheduled in disjoint windows of
time are guaranteed not to overlap and are, therefore,
always compatible. The conflict window can be
determined by using the best-case and worst-case ASAP
(as-soon-as-possible) schedules to determine the start
and stop time of the window.
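The window computation described above can be sketched as follows. This is a minimal illustration, not the Mercury implementation: it assumes that a window opens at the best-case ASAP start time and closes at the worst-case ASAP completion time, and the names `Op`, `asap_windows`, and `in_conflict` are hypothetical. The delay bounds are the ones the case studies use for ALU and multiply operations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str   # resource type, e.g. "mul" or "alu"
    dmin: int   # minimum (best-case) delay
    dmax: int   # maximum (worst-case) delay


def asap_windows(ops, preds):
    """Return {name: (earliest_start, latest_finish)} for an acyclic DFG.

    ops   : dict mapping operation name -> Op
    preds : dict mapping operation name -> list of predecessor names
    """
    best_finish, worst_finish, windows = {}, {}, {}

    def visit(name):
        if name in windows:
            return
        for p in preds.get(name, ()):
            visit(p)
        # Best-case ASAP: every predecessor finishes as early as possible.
        start_best = max((best_finish[p] for p in preds.get(name, ())), default=0)
        # Worst-case ASAP: every predecessor finishes as late as possible.
        start_worst = max((worst_finish[p] for p in preds.get(name, ())), default=0)
        best_finish[name] = start_best + ops[name].dmin
        worst_finish[name] = start_worst + ops[name].dmax
        # Assumed window: earliest possible start to latest possible finish.
        windows[name] = (start_best, worst_finish[name])

    for name in ops:
        visit(name)
    return windows


def in_conflict(w1, w2):
    """Windows conflict when either one starts before the other has ended."""
    return w1[0] < w2[1] and w2[0] < w1[1]


# Hypothetical five-operation graph: m1, m2 -> a1 -> m3 -> m4.
MUL = Op("mul", 4, 6)
ALU = Op("alu", 1, 3)
ops = {"m1": MUL, "m2": MUL, "a1": ALU, "m3": MUL, "m4": MUL}
preds = {"a1": ["m1", "m2"], "m3": ["a1"], "m4": ["m3"]}
windows = asap_windows(ops, preds)
print(windows)
```

In this small graph the windows of `m1` and `m3` overlap even though `m3` can never start before `m1` completes, which shows why window analysis alone is conservative.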
Another way to show that two operations are
compatible is to analyze the DFG. If there is a path from
operation i to operation j, then those two operations
are compatible regardless of their scheduled windows
of time. This is because the existence of a path
guarantees that operation i must complete before operation
j begins. Edges used to explicitly denote two sharable
operations are known as resource edges and are added
to the DFG during design space exploration. They
are distinguished from data edges, because they do not
imply the transfer of data from one operation to the
next. A resource edge forces two operations to occur
at disjoint times and denotes the ordering in which the
operations must occur.
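A path-based compatibility check and the insertion of a resource edge might look like the sketch below. It assumes the DFG is stored as a dictionary of successor sets; the function names and the example graph are hypothetical and are not taken from the paper's tool.

```python
def reachable(succs, src, dst):
    """Iterative depth-first search: is there a path from src to dst?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(succs.get(node, ()))
    return False


def compatible_by_path(succs, i, j):
    """Two distinct operations are compatible if a path orders them."""
    return i != j and (reachable(succs, i, j) or reachable(succs, j, i))


def add_resource_edge(succs, i, j):
    """Force operation i to complete before operation j begins."""
    if reachable(succs, j, i):
        # An edge i -> j would close a cycle, so the ordering is impossible.
        raise ValueError(f"edge {i} -> {j} would create a cycle")
    succs.setdefault(i, set()).add(j)


# Hypothetical graph with data edges only: m1, m2 -> a1 -> m3.
succs = {"m1": {"a1"}, "m2": {"a1"}, "a1": {"m3"}}
before = compatible_by_path(succs, "m1", "m2")   # no path in either direction
add_resource_edge(succs, "m1", "m2")             # order m1 before m2
after = compatible_by_path(succs, "m1", "m2")    # now ordered by the new edge
print(before, after)
```

Rejecting an edge that would close a cycle keeps the graph a valid partial order, so a topological sort remains possible after every insertion.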
Figure 1(a) shows a DFG with only data edges. In
this configuration, four multipliers and three ALUs
are required. With the resource edges shown in
Figure 1(b), only two multipliers and one ALU are
required. Note that there are many other ways to add
resource edges to the graph. Each resource edge added
to the graph, in essence, covers an aggregate of all the
possible discrete-time schedules that the given
operation sequencing and resource sharing would produce.
Hence, scheduling of operations is done independently
of the discretization of time. For efficiency, our tool,
Mercury, utilizes both the information from the DFG
and, where applicable, conservative ASAP scheduling
information to aid in performing resource sharing.




Figure 2. Asynchronous left-edge algorithm.
Figure 5. Combining timing information and
topology for register sharing.
graph and determining if adding a candidate resource
edge between the two operations satisfies all of the
bounding conditions. This includes not being removed
by any of the filters described above. Each time a
candidate edge is filtered, or the algorithm exceeds
constraints, the design space is pruned. If the candidate
resource edge satisfies all of the bounding conditions,
then the algorithm recurses to another level of the
exploration. The next level considers all remaining edges
with and without the candidate resource edge.
Recursion continues until all possible edges between any two
compatible operations have been explored or pruned.
Once the algorithm completes, the Pareto points
remaining in the solution set are the best solutions.
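Selecting the Pareto points from a set of explored solutions can be sketched as below. The sketch assumes each solution is summarized by a tuple of costs to be minimized, such as (area, delay); the cost values are invented for illustration and are not results from the paper.

```python
def dominates(a, b):
    """a dominates b if it is no worse in every cost and better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the unique non-dominated cost tuples, sorted."""
    unique = sorted(set(points))
    return [p for p in unique
            if not any(dominates(q, p) for q in unique)]


# Hypothetical (area, delay) pairs, including one duplicate.
solutions = [(100, 30), (80, 40), (120, 30), (80, 45), (60, 60), (100, 30)]
front = pareto_front(solutions)
print(front)  # [(60, 60), (80, 40), (100, 30)]
```

The quadratic comparison is adequate for solution sets of a few hundred entries; a larger set would call for sorting on one cost and sweeping on the other.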
Each time an edge is added to or removed from the
graph, a topological sort must be done on the graph,
and the ASAP and ALAP schedules must be updated.
In addition, the transitive closure of the system,
which determines whether a path exists between any
two operations, must be updated. For these incremental
changes, two optimizations are employed: first, a
dynamic transitive closure algorithm, and second,
dynamic computation of the ASAP and ALAP schedules.
Both optimizations take advantage of the incremental
nature of the changes by avoiding unnecessary
recalculation in areas of the graph that are not changed.
For brevity, the details of these algorithms are not
discussed here, but we refer the interested reader to [2].
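To illustrate the idea of updating a transitive closure incrementally, the sketch below maintains a descendant set per node and touches only the nodes that can reach the tail of the new edge. It is a simplified textbook-style update covering edge insertion only, and should not be read as the algorithm detailed in [2]; the function name `add_edge` is hypothetical.

```python
def add_edge(reach, u, v):
    """Update the closure in place for a new edge u -> v.

    reach : dict mapping every node to the set of nodes reachable from it
    """
    if u == v or u in reach[v]:
        raise ValueError(f"edge {u} -> {v} would create a cycle")
    if v in reach[u]:
        return  # a path u -> v already exists, so the closure is unchanged
    gained = {v} | reach[v]
    for node, descendants in reach.items():
        # Only u and the ancestors of u gain new descendants.
        if node == u or u in descendants:
            descendants |= gained


# Hypothetical four-node graph, built one edge at a time.
reach = {n: set() for n in "abcd"}
add_edge(reach, "a", "b")
add_edge(reach, "c", "d")
add_edge(reach, "b", "c")
print(reach)
```

After the three insertions, `reach["a"]` contains `b`, `c`, and `d`, and a path query between any two operations becomes a single set lookup.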
7. Case Studies 
To test the effectiveness of the filters, three common
high-level synthesis benchmarks are used: a differential
equation solver (DIFFEQ), a fifth order elliptical wave
filter (EWF), and an inverse discrete cosine transform
(IDCT). We have also applied our method to a filter
bank from an industrial application. DIFFEQ is the
smallest of the examples with a total of 11 operations
and 13 variables needing registers. EWF is larger with
32 operations and 43 variables. IDCT is the largest
with 46 operations and 56 variables. The filter bank
from the industrial application has 23 operations and
29 variables. All of the case studies were performed
using a Pentium II 400 MHz processor. The maximum
amount of memory used is only 13 megabytes, so
memory is not an issue.
For DIFFEQ, exploration is done using both the
hierarchical and non-hierarchical approaches. By
default, the infeasible edge filter is always active for each
of these tests, since exploring infeasible designs is not
useful. ALU operations are modeled with a minimum
delay of one, a typical delay of two, and a maximum
delay of three. It is assumed that they require 21 units
of area. Multiply operations have a minimum delay of
four, a typical delay of five, and a maximum delay of
six. It is assumed that they require 43 units of area.
Multiplexors are modeled with a base area of three
units, corresponding to a 2x1 multiplexor. For an (N x 1)
multiplexor, the area is modeled as base * (N - 1).
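The area model above can be written out as a short calculation. The unit areas come from the text; the function names and the list of multiplexor sizes are hypothetical, and register area is left out because its unit cost is not given here.

```python
ALU_AREA = 21       # area units per ALU (from the text)
MUL_AREA = 43       # area units per multiplier (from the text)
MUX_BASE_AREA = 3   # area units of a 2x1 multiplexor (from the text)


def mux_area(n_inputs, base=MUX_BASE_AREA):
    """Area of an (N x 1) multiplexor, modeled as base * (N - 1)."""
    if n_inputs < 1:
        raise ValueError("a multiplexor needs at least one input")
    return base * (n_inputs - 1)


def datapath_area(n_alus, n_muls, mux_inputs):
    """Functional-unit plus multiplexor area; registers are not included."""
    units = n_alus * ALU_AREA + n_muls * MUL_AREA
    return units + sum(mux_area(n) for n in mux_inputs)


# Hypothetical datapath: 2 ALUs, 2 multipliers, one 2x1 and one 4x1 mux.
total = datapath_area(2, 2, [2, 4])
print(total)  # 42 + 86 + 3 + 9 = 140
```

Under this model a single-input "multiplexor" costs nothing, which matches the case where a resource input is not shared.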
The results of exploration are shown in Table 1. The
table shows the active filters for each test, the amount
of CPU runtime required for the test, the total
number of configurations explored, and the number of
solutions in the final Pareto point set. For the
hierarchical approach, the graph is broken into two sets: ALU
operations and multiplication operations. Using this
approach, fewer solutions are found, but the quality of
the solutions is comparable. For example, comparing
the results of the non-hierarchical approach using none
of the filters with the hierarchical approach, also
using none of the filters, it is found that the first method
yielded 292 solutions, while the second method
yielded only 82 solutions. Of the 292 solutions, there are
five unique Pareto points. The 82 solutions from the
hierarchical approach contain the same five Pareto
points. When all filters, excluding the minimal-latency
filter, are used, the two approaches yield 81 and 26
solutions, respectively. Again, both methods give the same
5 Pareto points. The unique Pareto points are shown in
Table 2. Two asynchronous designs of the differential
equation solver have been reported: one using hardwired
control [18] and one using microcode [9]. Both of these
designs use 2 ALUs and 2 multipliers. Our method
finds this datapath, as well as 4 other alternative
datapaths.
The second case study uses a fifth order digital
elliptical wave filter. The DFG for the filter is taken from
[15]. The same parameters given above are used for
the functional units. For these results, the hierarchical
approach is used with a maximum block size of ten.
This means that the algorithm randomly breaks each
set of similar operations into blocks of ten. Exploration
is then done considering only resource edges between
operations in each block. Runtime grows rapidly as the
block size is increased. After exploration is done on all
sets, exploration is done again considering only
critical resource edges which are included in the individual
block Pareto point solutions. Table 3 shows the
experimental results. The fastest solution uses 4 adders, 4
multipliers, and 14 registers and has a typical delay of
37. The minimum area solution found uses 2 adders,
1 multiplier, and 13 registers with a typical delay
of 61.
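The random blocking step might be sketched as follows. This is an illustration only: the paper does not give the partitioning code, so the function names, the seeding, and the split of operations into 24 additions and 8 multiplications are all assumptions made for the example.

```python
import random


def random_blocks(names, block_size=10, seed=None):
    """Shuffle the operations and cut them into blocks of at most block_size."""
    rng = random.Random(seed)
    shuffled = list(names)
    rng.shuffle(shuffled)
    return [shuffled[i:i + block_size]
            for i in range(0, len(shuffled), block_size)]


def partition_by_kind(kinds, block_size=10, seed=None):
    """Group operations by type, then block each group independently.

    kinds : dict mapping operation name -> operation type
    """
    groups = {}
    for name, kind in kinds.items():
        groups.setdefault(kind, []).append(name)
    return {kind: random_blocks(names, block_size, seed)
            for kind, names in groups.items()}


# Hypothetical operation mix, not the actual EWF operation counts.
kinds = {f"add{i}": "add" for i in range(24)}
kinds.update({f"mul{i}": "mul" for i in range(8)})
blocks = partition_by_kind(kinds, block_size=10, seed=0)
print({kind: [len(b) for b in bs] for kind, bs in blocks.items()})
```

Resource edges would then be explored only between operations that fall in the same block, which bounds the number of candidate edges per exploration.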
The IDCT is the most difficult example to solve
because of the high degree of parallelism between
operations. The data flow graph for the IDCT is from [15].
Table 1. DIFFEQ: experimental results (I = implied,
R = redundant, M = minimal latency, H = hierarchical).
Table 2. DIFFEQ: unique Pareto points.
Table 3. EWF: experimental results using hierarchical
approach (I = implied, R = redundant, M = minimal latency).
Table 4. IDCT: experimental results using hierarchical
approach (I = implied, R = redundant, M = minimal latency).
Figure 6. Comparison with synchronous
methods.
References 
[1] V. Akella and G. Gopalakrishnan. SHILPA: A
high-level synthesis system for self-timed circuits.
In Proc. International Conf. Computer-Aided
Design (ICCAD), pages 587-591. IEEE Computer
Society Press, November 1992.
[2] Brandon M. Bachman. Architectural-level synthesis
of asynchronous systems. Master's thesis,
University of Utah, 1998.
[3] R. M. Badia and J. Cortadella. High-level
synthesis of asynchronous systems: Scheduling and
process synchronization. In Proc. European
Conference on Design Automation (EDAC), pages 70-74.
IEEE Computer Society Press, 1993.
[4] Kees van Berkel, Joep Kessels, Marly Roncken,
Ronald Saeijs, and Frits Schalij. The
VLSI-programming language Tangram and its
translation into handshake circuits. In Proc. European
Conference on Design Automation (EDAC), pages
384-389, 1991.
[5] R. Brayton and R. Spence. Sensitivity and
Optimization. Elsevier, 1980.
[6] G. De Micheli. Synthesis and Optimization of
Digital Circuits. McGraw-Hill, Inc., New York, New
York, 1994.
[7] M. K. Dhodhi, F. H. Hielscher, R. H. Storer, and
J. Bhasker. Datapath synthesis using a
problem-space genetic algorithm. IEEE Transactions on
Computer-Aided Design, August 1995.
[8] A. Hashimoto and J. Stevens. Wire routing by
optimizing channel assignment within large
apertures. In Proceedings of the 8th Design
Automation Workshop, pages 155-163. IEEE Computer
Society Press, 1971.
[9] Hans M. Jacobson and Ganesh Gopalakrishnan.
Application-specific programmable control
for high-performance asynchronous circuits. Pro-




[10] D. Ku and G. De Micheli. Relative scheduling
under timing constraints: Algorithms for high-level
synthesis of digital circuits. IEEE Transactions on
Computer-Aided Design, June 1992.
[11] G. Lakshminarayana and N. K. Jha. High-level
synthesis of power-optimized and area-optimized
circuits from hierarchical data-flow intensive
behaviors. IEEE Transactions on Computer-Aided
Design, March 1999.
[12] Alain J. Martin. Programming in VLSI: From
communicating processes to delay-insensitive
circuits. In C. A. R. Hoare, editor, Developments
in Concurrency and Communication, UT Year of





[13] P. Paulin and J. Knight. Force-directed
scheduling for the behavioral synthesis of ASIC's. In IEEE
Transactions on Computer-Aided Design, pages
661-679. IEEE Computer Society Press, 1989.
[14] J. M. Rabaey and M. Potkonjak. Estimating
implementation bounds for real time DSP application
specific circuits. IEEE Transactions on
Computer-Aided Design, June 1994.
[15] W. F. J. Verhaegh, P. E. R. Lippens, E. H. L.
Aarts, J. H. M. Korst, J. L. van Meerbergen, and
A. van der Werf. Improved force-directed
scheduling in high-throughput digital signal processing.
IEEE Transactions on Computer-Aided Design,
August 1995.
[16] C.-Y. Wang and K. K. Parhi. High-level DSP
synthesis using concurrent transformations,
scheduling, and allocation. IEEE Transactions on
Computer-Aided Design, March 1995.
[17] T.-Y. Wuu. Synthesis of asynchronous systems
from data-flow specifications. Technical report
ISI/RR-93-366, University of Southern California,
1993.
[18] Kenneth Y. Yun, Peter A. Beerel, Vida
Vakilotojar, Ayoob E. Dooply, and Julio Arceo. The
design and verification of a high-performance
low-control-overhead asynchronous differential
equation solver. IEEE Transactions on VLSI Systems,
6(4):643-655, December 1998.
