Towards behavioral synthesis of asynchronous circuits - an implementation template targeting syntax directed compilation by Nielsen, Sune Fallgaard et al.
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
Towards behavioral synthesis of asynchronous circuits - an implementation template
targeting syntax directed compilation
Nielsen, Sune Fallgaard; Sparsø, Jens; Madsen, Jan; Selvaraj, Henry
Published in:
EUROMICRO Symposium on Digital System Design
Link to article, DOI:
10.1109/DSD.2004.1333290
Publication date:
2004
Document Version
Publisher's PDF, also known as Version of record
Citation (APA):
Nielsen, S. F., Sparsø, J., Madsen, J., & Selvaraj, H. (Ed.) (2004). Towards behavioral synthesis of
asynchronous circuits - an implementation template targeting syntax directed compilation. In EUROMICRO
Symposium on Digital System Design IEEE. DOI: 10.1109/DSD.2004.1333290
Towards behavioral synthesis of asynchronous circuits – an
implementation template targeting syntax directed compilation.
S. F. Nielsen J. Sparsø J. Madsen
Technical University of Denmark, Informatics and Mathematical Modelling
Richard Petersens Plads, Bldg. 322, DK-2800 Kgs. Lyngby, Denmark
e-mail: {sfn,jsp,jan}@imm.dtu.dk
Abstract
This paper presents a method for behavioral syn-
thesis of asynchronous circuits. Our approach aims
at providing a synthesis ﬂow which is very similar to
what is found in existing synchronous design tools. We
adapt the synchronous behavioral synthesis abstraction
into the asynchronous handshake domain by introduc-
ing a computation model, which resembles the syn-
chronous datapath and control architecture, but which
is completely asynchronous. The datapath and control
architecture is then expressed in the Balsa-language,
and using syntax directed compilation a corresponding
handshake circuit implementation is produced. The
paper also reports area, speed and power ﬁgures for
a couple of benchmark circuits, which have been syn-
thesized to layout.
1 Introduction
Asynchronous circuits have a number of characteristics
that can be exploited to advantage in the design
of current and future submicron integrated circuits,
and the design and implementation of asynchronous
circuits is by now well understood [8, 10, 16, 19]. However,
in order to enable a more widespread adoption
of asynchronous design, access to efficient high level
synthesis tools is crucial, and unfortunately such tools
are largely lacking. In this paper we outline a complete
behavioral synthesis flow and present some important
steps of this flow, which uses traditional front-end
behavioral synthesis techniques and an existing
asynchronous synthesis tool as the back-end.
Figure 1 illustrates the synchronous and asyn-
chronous design ﬂows that are typical of today, and
it shows where the work presented in this paper ﬁts
in. The details will be explained below and in the
following section.
Synthesis of synchronous circuits, which is illustrated
in the left column of figure 1, has succeeded
in raising the level of abstraction to that of specifying
circuits at the behavioral level. From a behavioral
description in a language like VHDL, Verilog
or SystemC some intermediate representation is
extracted, often a control data flow graph (CDFG).
From the CDFG the classic synthesis tasks [15] of
scheduling, allocation, and binding are performed,
resulting in an RTL level circuit description which is then
synthesized into gate level circuits and eventually a
layout.
[Figure 1 (diagram not reproduced). Labels: Synchronous design;
Asynchronous design; Abstraction level (Representations);
VHDL/Verilog/SystemC; CDFG; RTL description; CSP-type program;
Netlist of components; Handshake components; Gate/Cell; Layout.
Behavioral Synthesis = This paper: computation model, scheduling
etc., implementation template.
Design flow: Behaviour -> CDFG -> CSP-type program -> Circuit.]
Figure 1: Existing synchronous and asynchronous design
flows and the design flow addressed in this paper.
Synthesis of asynchronous circuits is illustrated in
the right column of figure 1. It is less mature and several
somewhat different approaches are being pursued.
The most influential of the available synthesis tools
fall into two categories: (i) synthesis of large-scale RTL
level circuits based on syntax directed compilation
from CSP-like languages: Tangram [3, 20], OCCAM
[4], Balsa [2], ACK [14] and TAST [18], and (ii) synthesis
of small-scale sequential control circuits [9, 11].
The tools that perform syntax directed compilation
target a library of so-called handshake components;
some examples will appear in section 5. The handshake
components can in principle be designed using
any of the sequential control circuit synthesis tools.
Proceedings of the EUROMICRO Systems on Digital System Design (DSD’04) 
0-7695-2203-3/04 $ 20.00 IEEE 
The syntax directed compilation approach is radi-
cally diﬀerent from the behavioral synthesis ﬂow used
by designers of synchronous circuits; the compiler
merely performs a one-to-one mapping of the program
text into a corresponding circuit structure. Although
syntax directed compilation does allow the designer
to work at a relatively high level it does not provide
any optimizations; “what you program is what you
get”. In some situations this can be considered an ad-
vantage but in general it puts more burden on the de-
signer: exploring alternative implementations requires
actually programming these, whereas in a traditional
synchronous synthesis ﬂow, the designer can quickly
and easily experiment with diﬀerent constraints and
goals and in this way create alternative implementations
from the same program text. In our work we
use Balsa as a back-end and take advantage of the
one-to-one mapping, which allows us to describe
specific implementations at a high level.
It is interesting to note that the internal representa-
tion of circuit behavior used in synchronous behavioral
synthesis is actually based on an asynchronous model
– a CDFG, i.e., a dependency graph expressing the
control- and data-ﬂow of the application. This natu-
rally raises the question, addressed in this paper: Is
it possible to apply the transformations and optimiza-
tions used in synchronous synthesis, for asynchronous
design as well?
The design flow that we target in our work is illustrated
in figure 1, and as illustrated this paper focuses on
behavioral synthesis, i.e. transforming a CDFG representation
into a structural netlist of handshake components
(represented as a Balsa program). In this way
we leverage existing and mature tools and techniques
for both high level design of synchronous circuits and
(back end) synthesis tools for asynchronous design.
The contribution of this paper is the addition of be-
havioral synthesis to asynchronous circuit design in
the form of automatic resource sharing and constraint
based design space exploration. In particular our con-
tributions are: (1) an abstract event based compu-
tation model, (2) synthesis algorithms for scheduling,
allocation and binding and (3) a suitable target im-
plementation template. We have previously studied
scheduling algorithms usable in this context [17] and
there is nothing preventing the use of scheduling algo-
rithms developed by other researchers [5, 12].
The paper is organized as follows: Section 2 reviews
related work. Section 3 introduces the concept which
allows us to adapt the techniques from synchronous
behavioral synthesis to behavioral synthesis of
asynchronous designs. Section 4 describes details of
the asynchronous datapaths. Section 5 briefly explains
the Balsa templates, and finally section 6 presents and
discusses some results on the efficiency of the approach.
2 Related work
The introduction mentioned a number of asynchronous
high level synthesis tools. Tangram [3, 20] is
a proprietary tool of Philips. It is quite mature and
has been used to design circuits which are currently in
production. Balsa [2] is a somewhat similar tool which
has been developed by the University of Manchester
and which is available in the public domain. These
tools are based on syntax directed compilation, where
there is a one-to-one correspondence between the program
source and the resulting circuit and where the
control is highly distributed. TAST [18], and in particular
ACK [14], involve the generation of a datapath
and one or more centralized controllers. ACK is no
longer supported and TAST is not available in the
public domain.
A number of papers have presented work on synthesizing
asynchronous circuits from DFG or CDFG
representations, but they are surprisingly few and they
have a different and/or more limited scope [1, 6, 7, 13].
The first paper limits itself to DFGs and focuses mostly
on a synthesis algorithm and its runtime. The remaining
papers address synthesis from a CDFG representation
and they target solutions where a centralized
controller or a distributed structure of controllers is
specified at the level of individual signal transitions
(in the form of signal transition graphs or burst-mode
state graphs).
Our approach is diﬀerent in that it targets hand-
shake components and syntax directed compilation.
This makes it both simpler and more powerful: Sim-
pler because the controller is synthesized implicitly in
a distributed fashion whereas in the previously pub-
lished approaches it represents a major task of the
synthesis. And more powerful because Balsa allows
very large circuits to be synthesized.
Some research seems to indicate that the distributed
control and the handshake signaling, which
characterize circuits produced by syntax directed
compilation, result in poor speed. To alleviate this, a number
of low-level post-synthesis techniques are being
used. One approach is peephole optimization, which
replaces common structures of handshake components
with simpler ones [20, 12], and other approaches involve
re-synthesis from a specification of the behavior of one
or more handshake components into a more efficient
implementation [5]. In any case this work is orthogonal
to the work presented in this paper, where the focus is
on high level synthesis.
[Figure 2 (diagram not reproduced). Labels: operators i, j, k with
read events $E_{r,j}$, $E_{r,k}$ and write events $E_{w,i}$, $E_{w,j}$,
$E_{w,k}$; global events $E^{-1}$, $E^0$, $E^1$, $E^2$ along the time
axis t; "Relaxation" between the left and right parts.]
Figure 2: Adapting synchronous synthesis (left) into
the asynchronous handshake domain (right).
3 From synchronous to asynchronous
behavioral synthesis
Let us ﬁrst review and analyze the elements of syn-
chronous behavioral synthesis. The target for behav-
ioral synthesis is a hardware architecture consisting
of a datapath which is able to perform a set of op-
erations, and a controller which controls the execu-
tion sequence of these operations in order to perform
a given application. A key issue in behavioral syn-
thesis is to reuse hardware resources for the diﬀerent
operations in order to minimize area, and to explore
possible parallelism by executing several hardware re-
sources concurrently in order to increase performance.
Most behavioral synthesis tools make optimizations
based on a CDFG which is extracted from a behavioral
speciﬁcation of the circuit behavior. This speciﬁcation
may be expressed in a hardware description language
such as VHDL, Verilog or SystemC, or in a traditional
programming language such as C, C++ or Java. Be-
havioral synthesis-tools for synchronous systems use
modern compiler techniques to translate source code
into some variant of a CDFG as part of their front-
end. This process is well understood [15] and will not
be addressed in this paper.
Based on the CDFG, synchronous behavioral synthesis
tools perform three sets of transformations in
order to create a suitable hardware architecture:
• Scheduling, in which operator nodes of the CDFG
are grouped into operation-groups or time-slots,
and where the execution of the next operation-group
is handled by a synchronization event, $E^i$,
where i strictly orders the events in time. In
the case of synchronous behavioral synthesis $E^i$
is controlled by the system clock.
• Allocation, in which the minimum hardware
resources/functional units (FUs) required for execution
of the operation-groups are determined.
• Binding (or assignment), where individual operator
nodes are tied to specific hardware resources.
The synchronization events determine (i) the beginning
of the execution of an operation and (ii) the writing
of the result of an operation.
[Figure 3 (diagram not reproduced). Labels: Controller; Event
synchronizers carrying $E_{r,i}$ and $E_{w,i}$; Datapath consisting of
Storage and Computation.]
Figure 3: Computation model in the asynchronous
handshake domain, where the labeling refers to the
role the handshake components play in our model.
The CDFG extracted in synchronous behavioral
synthesis is a 1-bounded colored Petri net, where colors
represent data values, edges represent places, and
nodes represent transitions. Interestingly, the Petri
net model is based on an asynchronous execution
semantics, which should make it an obvious model for
asynchronous synthesis as well. In synchronous
synthesis, figure 2 (left), operations are ordered according
to a global synchronization event, $E^i$, i.e., read
events ($E_{r,j}$) for operator j happen at the same point
in time as the write events ($E_{w,i}$) for operator i in
the previous operation-group: $E^0_{w,i} = E^0_{r,j} = E^0$, and
furthermore all operations in an operation-group are
executed simultaneously: $E^0_{r,j} = E^0_{r,k} = E^0$.
If we relax these assumptions, i.e. no longer require
$E_{w,i} = E_{r,j}$ and $E_{r,j} = E_{r,k}$, as shown in figure 2
(right), and if we make these synchronization events
controlled by the controller, we can create a hardware
architecture consisting of a datapath and a controller
as shown in figure 3. It resembles the synchronous
architecture but it is completely asynchronous. For this
model to operate with arbitrary synchronization events,
the computation part (functional units) has to act as an
independent process, with its own local control, decoupled
from the storage. In this way we have adapted the
synchronous abstraction to the asynchronous handshake
domain.
[Figure 4 (diagram not reproduced). Labels: data x0, x1, x2, y0, y1;
constants a0 to a3; operator nodes 1 to 8 with operators *, +, −, >;
temporary data w0 to w7; schedule columns Mult and ALU along the
time axis from t=0 to t=T.]
Figure 4: (Right) Our example CDFG with labels on
temporary data. (Left) Scheduling of our CDFG.
This idea allows us to use any of, but not restricted
to, the many synchronous behavioral synthesis tech-
niques to obtain a hardware architecture (datapath
and controller) and then to implement this architec-
ture using asynchronous circuit techniques. At the
same time, this idea allows the use of behavioral syn-
thesis techniques operating in continuous time.
4 Datapath synthesis
Let us assume we are given a CDFG and that scheduling,
allocation and assignment have been performed as
shown in figure 4 using the FU library shown in table 1.
The FU library has been normalized with respect
to the ALU component. We will consider the
schedule to operate in continuous time. However, it
is of no importance whether the schedule has been
obtained using an asynchronous scheduling method or
through a synchronous method which has been relaxed
into continuous time, as discussed in the previous
section. Note that the operator nodes have been
labeled 1, 2, ..., 8 and the temporary data w0, w1, ..., w7.
The branch part of the CDFG, nodes {6, 7, 8}, gives
rise to two paths in the schedule: depending on the
result of node 4, either node 6 and then node 8 is
executed, or node 7.
The scheduling in figure 4 results in the fastest
execution of the CDFG on a datapath containing only
one Mult and one ALU component.
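As an illustration of how a continuous-time schedule under a resource constraint can be computed, the following Python sketch performs a simple greedy list scheduling. It is not the scheduling algorithm used in this work (see [17]) and it does not claim optimality. Only the normalized latencies are taken from table 1; the four-operation dependency graph, the names `Op` and `list_schedule`, and the tie-breaking rule are assumptions made for the example.

```python
from dataclasses import dataclass

# Normalized latencies from table 1 (ALU = 1, Mult = 2.6).
LATENCY = {"ALU": 1.0, "Mult": 2.6}


@dataclass(frozen=True)
class Op:
    name: str
    fu_type: str        # "ALU" or "Mult"
    preds: tuple = ()   # names of the operations whose results are read


def list_schedule(ops, fu_count):
    """Greedy continuous-time list scheduling.

    Returns {operation name: (start, finish, FU instance index)}.
    """
    # Time at which each FU instance becomes free again.
    free_at = {fu: [0.0] * n for fu, n in fu_count.items()}
    done = {}
    remaining = list(ops)
    while remaining:
        ready = [op for op in remaining if all(p in done for p in op.preds)]
        if not ready:
            raise ValueError("dependency cycle or unknown predecessor")
        best = None
        for op in ready:
            data_ready = max((done[p][1] for p in op.preds), default=0.0)
            units = free_at[op.fu_type]
            inst = min(range(len(units)), key=units.__getitem__)
            start = max(data_ready, units[inst])
            # Assumed tie-break: earliest start first, then name order.
            if best is None or (start, op.name) < (best[0], best[1].name):
                best = (start, op, inst)
        start, op, inst = best
        finish = start + LATENCY[op.fu_type]
        free_at[op.fu_type][inst] = finish
        done[op.name] = (start, finish, inst)
        remaining.remove(op)
    return done


# Hypothetical four-operation graph, not the CDFG of figure 4.
example_ops = [
    Op("m1", "Mult"),
    Op("a1", "ALU"),
    Op("a2", "ALU", ("a1",)),
    Op("a3", "ALU", ("m1", "a2")),
]
schedule = list_schedule(example_ops, {"Mult": 1, "ALU": 1})
makespan = max(finish for _, finish, _ in schedule.values())
```

With one Mult and one ALU, the multiplication runs in parallel with the first two ALU operations, and the last ALU operation starts only when the multiplier result is available, so start times are not tied to common time-slot boundaries.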
FU     σ            t     A     E
ALU    {+, −, >}    1     1     1
Mult   {∗}          2.6   10    13
Table 1: Simple example normalized FU library.
[Figure 5 (diagram not reproduced). Layers between INPUT and OUTPUT:
MUX, LATCH, MUX, FU. Labels: inputs X0, X1, ..., Xn; latches L0, L1,
..., Li, ..., Ln; functional units FU0 (*), FU1 (alu), ..., FUm (*)
with ports FU0_a, FU0_b, FU0_z, FU1_a, FU1_b, FU1_opr, FU1_z, FUm_a,
FUm_b, FUm_z; outputs Y0, Y1, ..., Yn.]
Figure 5: General structure of the datapaths of our
asynchronous circuits.
The general structure of the asynchronous datapath
is shown in figure 5 and it follows the computation
model presented in the previous section. The internal
variables (L0...Ln) in our datapath are implemented
as latches. The functional units (FU0...FUm) are
implemented as independent processing units, with local
control, wrapping the computation part with latches
on both input and output ports. The functional units
can be simple combinatorial blocks or they can be
augmented with input and output latches. This choice has
consequences for circuit area, lifetime of the variables,
speed and power consumption. In this paper we assume
that the functional units have normally opaque
latches on input and output ports. This is a somewhat
arbitrary choice and has no fundamental implications
for the approach or the synthesis algorithms.
The use of input and output latches tends to increase
speed and to reduce power consumption by preventing
spurious signal transitions from propagating beyond latch
boundaries. If input and output latches are not used,
more variable latches may be needed in the datapath
in order to accommodate the longer lifetime requirements
and in order to avoid auto assignments.
To compute the lifetimes we need to determine
how long a variable is to be kept. Since our FUs have
input latches we only need to hold the variable until
it has been read for the last time, at the start of the
last computation. This reduces the variable lifetime
requirements, leading to a possible reduction in the
number of variables needed. We estimate the overhead
for reading and writing a result to a variable latch
to be $t_\Delta = \frac{1}{3} t_{ALU}$, which is added to the variable
lifetime.
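The lifetime rule can be written down as a small Python sketch. The function name `variable_lifetime` and the example times are assumptions for illustration, and so is the choice to add the overhead at the end of the interval; the text only states that the overhead is added to the lifetime.

```python
T_ALU = 1.0            # normalized ALU latency from table 1
T_DELTA = T_ALU / 3.0  # latch read/write overhead estimated in the text


def variable_lifetime(write_time, reader_start_times, t_delta=T_DELTA):
    """Return (birth, death) of a variable held in a latch.

    The variable is born when its value is written and, because the FUs
    have input latches, it is only needed until the start of the last
    operation that reads it. Assumption: the overhead t_delta is added
    at the end of the interval.
    """
    if not reader_start_times:
        raise ValueError("a variable needs at least one reader")
    return (write_time, max(reader_start_times) + t_delta)


# Hypothetical variable: written at t = 1.0, read by operations that
# start at t = 1.0 and t = 2.6.
birth, death = variable_lifetime(1.0, [1.0, 2.6])
```

Without input latches on the FUs, the death time would instead have to extend to the end of the last reading operation, which is the longer lifetime requirement mentioned above.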
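[Figure 6 (diagram not reproduced). Lifetimes of the variables
(labels include x0, x2, y0, y1 and w0 to w7) along the time axis
from t=0 to t=T.]
Figure 6: Variable lifetime for our scheduled CDFG.
[Figure 7 (diagram not reproduced). Assignment of the variables to
the latches L0, L1, L2 and L3 along the time axis from t=0 to t=T.]
Figure 7: Latch assignments for our scheduled CDFG.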
For our example, the variable lifetime using this
latch convention is shown in ﬁgure 6. We can use the
left-edge algorithm [15] to ﬁnd the minimum number
of latches required in the datapath, which in this case
is 4 latches. The resulting variable to latch assignment
is shown in ﬁgure 7.
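A minimal Python sketch of the left-edge algorithm [15] is given below. The lifetimes are invented for illustration and are not those of figure 6, the name `left_edge` is hypothetical, and the sketch assumes that a latch can be reused as soon as the previous lifetime (including its overhead) has ended.

```python
def left_edge(lifetimes):
    """Assign variables to latches with the left-edge algorithm.

    lifetimes: {variable name: (birth, death)}
    Returns {variable name: latch index}.
    """
    # Sort once by the left edge (birth time) of each lifetime.
    pending = sorted(lifetimes.items(), key=lambda kv: kv[1][0])
    assignment = {}
    latch = 0
    while pending:
        last_death = float("-inf")
        leftover = []
        # Fill the current latch with as many non-overlapping
        # lifetimes as possible, scanning from left to right.
        for var, (birth, death) in pending:
            if birth >= last_death:
                assignment[var] = latch
                last_death = death
            else:
                leftover.append((var, (birth, death)))
        pending = leftover
        latch += 1
    return assignment


# Hypothetical lifetimes, not those of figure 6.
example_lifetimes = {
    "a": (0.0, 2.0),
    "b": (0.0, 1.0),
    "c": (1.0, 3.0),
    "d": (2.0, 4.0),
    "e": (3.0, 5.0),
}
latch_of = left_edge(example_lifetimes)
latch_count = len(set(latch_of.values()))
```

In this example at most two lifetimes overlap at any time, and the algorithm uses two latches: `a` and `d` share one latch, while `b`, `c` and `e` share the other.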
With the FU allocation, operator to FU assign-
ment and variable latch assignment the datapath can
be constructed by connecting the components through
multiplexors. The datapath for our example is shown
in ﬁgure 8. The controller to this circuit implements
the schedule and starts the FUs with the right data at
their designated times.
5 Balsa implementation
[Figure 8 (diagram not reproduced). Layers between INPUT and OUTPUT:
MUX, LATCH, MUX, FU. Labels: inputs X0, X1, X2; constants a0 to a3
and c; latches L0, L1, L2, L3; functional units FU0 (*) and FU1 (alu)
with ports FU0_a, FU0_b, FU0_z, FU1_a, FU1_b, FU1_opr, FU1_z;
outputs Y0, Y1.]
Figure 8: Datapath for our scheduled CDFG.
For implementing the controller and datapath in
asynchronous hardware, we use the Balsa
CAD framework. Figure 9 shows the Balsa handshake
circuit equivalent to our datapath from figure 8.
Such a Balsa handshake circuit is built from handshake
components which implement the equivalent
RTL operations, such as latching data, multiplexing data,
addition etc. Each of these handshake components has
its own local asynchronous control to ensure proper
asynchronous functionality and to handle the
asynchronous handshake communication protocol [19].
Besides these asynchronous handshake components,
which have equivalent RTL counterparts, there
are the demux components which handle “wire-forks”,
and more importantly the transfer handshake
components connecting the asynchronous controller
with the datapath; the latter play the role of event
synchronizers (refer to figure 3), controlling the
computation. These extra components augment the mux
layers with sublayers of demux and transfer components.
Notice that the mux components implement a merge
functionality and are not directly connected to the
controller, and neither are the latches, demuxes or FUs
(except for the opr control signal); only the transfer
components are connected to the controller. The FUs are
autonomous components which start computing when
all their input data is present. Using these components
and our computation model, there is a one-to-one
correspondence between the datapaths of figure 8
and figure 9.
In our design we use a bundled data 4-phase protocol
where signals contain a 1 bit request wire and a 1 bit
acknowledge wire in addition to the data wires.
Furthermore, the transfer components degenerate to
simple wire connections containing no logic.
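For readers unfamiliar with the protocol, the sequence of request and acknowledge transitions in one 4-phase (return-to-zero) bundled-data transfer can be sketched in Python as follows. This is an abstract model of the wire-level ordering only, not of any Balsa component; the `Channel` class and the `transfer` function are hypothetical names introduced for the example, and the point at which the receiver samples the data (between req+ and ack+) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    req: int = 0         # 1 bit request wire
    ack: int = 0         # 1 bit acknowledge wire
    data: object = None  # bundled data wires


def transfer(ch, value):
    """Perform one 4-phase handshake and return (received value, events)."""
    events = []
    # Sender: place the data on the data wires, then raise the request.
    ch.data = value
    ch.req = 1
    events.append("req+")
    # Receiver: sample the data (assumed here to happen while the
    # request is high), then raise the acknowledge.
    received = ch.data
    ch.ack = 1
    events.append("ack+")
    # Return-to-zero phase: both wires go back to their idle level.
    ch.req = 0
    events.append("req-")
    ch.ack = 0
    events.append("ack-")
    return received, events


channel = Channel()
received, events = transfer(channel, 42)
```

After the four transitions both wires are back at zero, so the channel is ready for the next transfer.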
[Figure 9 (diagram not reproduced). Layers between INPUT and OUTPUT:
MUX, LATCH, MUX, FU, with (TRANSFER) and (DEMUX) sublayers. The
transfer components carry the events {$E_{r,i}$} and {$E_{w,i}$}.
Labels: inputs X0, X1, X2; constants a0 to a3 and c; opr; latches
L0, L1, L2, L3; functional units (*) and alu with ports FU0_a,
FU0_b, FU0_z, FU1_a, FU1_b, FU1_opr, FU1_z; outputs Y0, Y1.]
Figure 9: Final datapath for our scheduled CDFG using
Balsa/Tangram handshake components.
As an example of how the datapath is constructed
using the Balsa language, consider the assignment of
subtraction operator 3 (figure 4) to ALU FU1 (figure
9). This subtraction operator has inputs w0 and w1 and
output w2 (w2 = w0 − w1), assigned to variables L0,
L1 and L0 respectively. Starting the computation is
performed by executing the following parallel Balsa
statement:
FU1_opr<-alu_sub || FU1_a<-L0 || FU1_b<-L1
This set of parallel channel assignment statements
tells FU1 to perform a subtraction and to use the
data of L0 and L1. The result w2 of the computation
is written to L0 using the following Balsa statement:
FU1_z->L0
Both statements synchronize the controller with
the ALU using the transfer components. Due to the
design of the FUs with both input and output latches,
the controller and the rest of the datapath are free to
do other work while FU1 computes. Reading
input X0 into the internal variable L0 and placing the
contents of internal variable L0 on output channel Y0
are executed in a similar way. These Balsa statements,
(i) starting a computation, (ii) writing the result of a
computation or (iii) communicating with the outside
world, implement the events described in section 3.
They are then sequenced in the right order using the
Balsa sequence operator “;”, implementing our
schedule.
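To make the mapping from binding information to Balsa statements concrete, the following Python sketch emits the two statement forms shown above as text. It is only an illustration of the idea and not the code generator of our tool; the function names `start_statement`, `write_statement` and `sequence` are hypothetical, and the channel naming (`_opr`, `_a`, `_b`, `_z`) simply follows the figures.

```python
def start_statement(fu, a_src, b_src, opr=None):
    """Statement that starts a computation on functional unit `fu`.

    a_src and b_src name the latches (or constants) feeding the two
    operand channels; opr is the ALU operation, or None for a unit
    without an operation channel (such as the multiplier).
    """
    parts = []
    if opr is not None:
        parts.append(f"{fu}_opr<-{opr}")
    parts.append(f"{fu}_a<-{a_src}")
    parts.append(f"{fu}_b<-{b_src}")
    # The operand transfers are independent, so they are composed in parallel.
    return " || ".join(parts)


def write_statement(fu, dest):
    """Statement that writes the result of `fu` into latch `dest`."""
    return f"{fu}_z->{dest}"


def sequence(steps):
    """Compose the steps of the schedule with the sequence operator ';'."""
    return " ;\n".join(steps)


# Subtraction operator 3: w2 = w0 - w1 on FU1, with w0 in L0, w1 in L1
# and the result written back to L0.
start = start_statement("FU1", "L0", "L1", "alu_sub")
write = write_statement("FU1", "L0")
program = sequence([start, write])
```

For operator 3 this reproduces the two statements given in the text; a multiplier, which has no operation channel, would be started with `start_statement("FU0", "L0", "L1")`.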
The full details of implementing the circuits in
Balsa with conditional computation, as well as of the
optimizations which can be performed to improve the
computation speed, to reduce temporary variables, and
to decouple the control circuit to take advantage of
possible variable computation times, are beyond the
scope of this paper. The full Balsa program of our
running example, implementing the controller and
datapath, is shown in figure 10.
    import [Balsa.types.basic]
    import [FU_types]
    import [FU_lib]
    procedure Ex(input X0,X1,X2:word;
                 output Y0,Y1:word) is
      variable L0,L1,L2,L3:word
      channel FU0_a,FU0_b,FU0_z:word
      channel FU1_a,FU1_b,FU1_z:word
      channel FU1_opr:alu_operation
      constant a0= 255
      constant a1= 255
      constant a2= 255
      constant a3= 255
    begin
      Mult(FU0_a,FU0_b,FU0_z) ||
      ALU(FU1_opr,FU1_a,FU1_b,FU1_z) ||
      loop
        X0->L0 || X1->L1 || X2->L2 ;
        FU0_a<-L0 || FU0_b<-L1 || FU1_a<-L0
          || FU1_b<-a0 || FU1_opr<-alu_add ;
        FU1_z->L0 || FU1_a<-L1 || FU1_b<-a2
          || FU1_opr<-alu_gre ;
        FU1_z->L3 ;
        if L3=0 then FU1_a<-L1 || FU1_b<-L2
          || FU1_opr<-alu_add
        else FU1_a<-L1 || FU1_b<-L2 ||
          FU1_opr<-alu_sub end ;
        FU0_z->L2 || FU1_z->L1 ;
        if L3=0 then FU0_a<-a3 || FU0_b<-L1
        end || FU1_a<-L0 || FU1_b<-L2 ||
          FU1_opr<-alu_add ;
        FU1_z->L0 ;
        FU1_a<-L0 || FU1_b<-a1 ||
          FU1_opr<-alu_sub ;
        if L3=0 then FU0_z->L1
        end || FU1_z->L0 ;
        Y0<-L0 || Y1<-L1
      end
    end
Figure 10: Balsa program for our example.
6 Results
In order to demonstrate the feasibility of the pro-
posed approach and in order to evaluate the eﬃciency
of the proposed implementation template we have syn-
thesized diﬀerent versions of a couple of benchmark
circuits, FIR and HAL, and we have simulated the
post place-and-route netlists. In this way we are re-
porting speed, area and energy ﬁgures for actual cir-
cuit implementations.
It is important to stress that the results do not represent
an attempt to evaluate the asynchronous implementations
against corresponding synchronous ones; our
focus is on the efficiency of the automated resource
sharing within the asynchronous domain.
id   Alg.   ∗ (Mult)   ALU   t [ns]   A [mm²]   E [nJ]
1    FIR    8          7     124.7    0.877     2.95
2    FIR    2          1     284.8    0.282     2.80
3    HAL    5          5     171.2    0.667     2.03
4    HAL    2          1     309.6    0.260     1.89
5    HAL    1          1     397.4    0.151     2.01
Table 2: Layout results.
FU     σ            t [ns]   A [mm²]   E [nJ]
ALU    {+, −, >}    25.5     0.0112    0.0266
Mult   {∗}          56.3     0.105     0.314
Table 3: FU library (16-bit) based on layout in 0.18 µm
technology, used by our synthesis algorithm.
The simulation results have been obtained using
the following steps: (1) Given a CDFG and circuit
constraints in the form of a maximum resource allocation,
our tool produces a corresponding Balsa program.
In this process we target an operator library consisting
of an ALU and a multiplier, and these operators
are themselves implemented as small Balsa program
modules. (2) The Balsa CAD tools are then used to
generate a Verilog netlist of the asynchronous circuit
(single rail 4-phase early protocol) and the Cadence
CAD tools are used to generate the corresponding layout.
We are using the 0.18 µm STM standard-cell
technology, which has been augmented with standard
cell components for implementing various special
asynchronous components such as Muller C-elements.
(3) Finally, simulation results are obtained by simulating
the Verilog netlist together with extracted layout
information in NanoSim. We simulate 200 computations,
using random numbers without any correlation.
All the circuits are implemented using 16-bit variables
and are simulated at 1.8 V and at a temperature of
25 °C.
The benchmark results are shown in table 2, where
t is the average time to do one computation, A is the
layout area and E is the average energy consumption
per computation. In a similar way we have charac-
terized the ALU and multiplier operators, see table 3.
The speed ﬁgures in table 3 have been used in the
above mentioned step 1 to calculate the schedules.
Implementations 1 and 3 in table 2 are the direct
non-resource-shared circuit implementations of
the computations. These have also been designed using
latches on the input and output of the multipliers.
Although this gives an extra area overhead, it is
insignificant compared to the area of the multiplier.
The important fact is that it reduces the combinatorial
depth of the circuit and thus reduces the power
consumption, which leads to a fairer comparison.
The speed figures in table 2 include a 20 ns handshake
delay in the testbench used to simulate the layouts.
id   Alg.   ∗ (Mult)   ALU   t [ns]   A [mm²]   E [nJ]
1    FIR    8          7     121.8    0.916     2.91
2    FIR    2          1     285.4    0.221     2.91
3    HAL    5          5     169.5    0.580     1.84
4    HAL    2          1     269.2    0.221     1.84
5    HAL    1          1     381.7    0.116     1.84
Table 4: Model results.
The results in table 2 show that resource sharing
saves area at the expense of reduced speed. This is as
could be expected. Concerning energy consumption, it
is interesting to note that it remains roughly constant.
Given that resource sharing leads to more control circuitry
for the same computation, an increase in energy
consumption could be expected. It seems that the smaller
size of the layout, and the reduced wire length which
results from it, lead to a power saving which corresponds
to the increase caused by the added control.
In order to estimate the overhead of the control circuitry
which is introduced by resource sharing, we can
use the figures in table 3 and estimate the cost of an
ideal resource shared implementation, i.e. an implementation
in which the added control has zero area,
latency and energy consumption. Such ideal figures
are shown in table 4. Comparing tables 4 and 2 it is
seen that the control circuitry introduced by resource
sharing accounts for 10-30% of the area and 0-15% of
the computation time, whereas it does not affect energy
consumption (as discussed above). It should be noted that
the area figure agrees with figures reported by the
tool Balsa-cost, which provides cost estimates at the
handshake-component level.
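The comparison between the layout and the model figures can be reproduced with a few lines of Python. The numbers are copied from tables 2 and 4 for the resource-shared implementations 2, 4 and 5; normalizing the difference to the ideal model figure is an assumption of this sketch, since the text does not state the normalization, and the function name `overhead` is hypothetical.

```python
# (t [ns], A [mm^2], E [nJ]) per implementation id, from tables 2 and 4.
LAYOUT = {
    1: (124.7, 0.877, 2.95),
    2: (284.8, 0.282, 2.80),
    3: (171.2, 0.667, 2.03),
    4: (309.6, 0.260, 1.89),
    5: (397.4, 0.151, 2.01),
}
MODEL = {
    1: (121.8, 0.916, 2.91),
    2: (285.4, 0.221, 2.91),
    3: (169.5, 0.580, 1.84),
    4: (269.2, 0.221, 1.84),
    5: (381.7, 0.116, 1.84),
}
SHARED = (2, 4, 5)  # implementations 1 and 3 are not resource shared


def overhead(actual, ideal):
    """Relative overhead of the layout figure over the ideal model figure.

    Assumption: the overhead is normalized to the ideal figure.
    """
    return (actual - ideal) / ideal


time_overhead = {i: overhead(LAYOUT[i][0], MODEL[i][0]) for i in SHARED}
area_overhead = {i: overhead(LAYOUT[i][1], MODEL[i][1]) for i in SHARED}

for i in SHARED:
    print(f"id {i}: time {time_overhead[i]:+.1%}, area {area_overhead[i]:+.1%}")
```

With this normalization the area overhead of the three resource-shared circuits lies between roughly 18% and 30%, and the time overhead between roughly 0% and 15%.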
We ﬁnd these results encouraging and in support
of the design ﬂow, the implementation template, and
the approach to resource sharing, which is proposed
by this paper.
7 Conclusion
The paper presented a design flow for behavioral
synthesis of asynchronous circuits and it makes the
following contributions: (i) A method for synthesizing
a CDFG to a Balsa description has been developed
using a methodology closely related to, but not
restricted to, traditional synchronous behavioral
synthesis. This allows the designer to perform design
space exploration by adding physical constraints to
the circuit. (ii) Using this method and the Balsa and
Cadence design tools, five layouts have been designed
and simulated. The results show that it is possible to
make tradeoffs between area and circuit delay for
asynchronous circuits. We find that there is a 10-30% area
overhead, a 0-15% time overhead and no power
overhead when implementing this method. We find these
results encouraging and in support of the design flow,
the implementation template, and the approach to
resource sharing proposed by this paper. Future work
will include automating the front-end part
of the flow, exploration and adaptation of more synthesis
algorithms, miscellaneous optimizations at the circuit level
and, last but not least, more and larger benchmarks.
References
[1] B. M. Bachman, H. Zheng, and C. J. Myers. Ar-
chitectural synthesis of timed asynchronous systems.
In Proc. ICCD’99 (IEEE International Conference on
Computer Design: VLSI in Computers and Proces-
sors), pages 354–363, October 1999.
[2] A. Bardsley and D. A. Edwards. The Balsa asyn-
chronous circuit synthesis system. In Forum on De-
sign Languages, September 2000.
[3] C. H. (Kees) van Berkel, Cees Niessen, Martin
Rem, and Ronald W. J. J. Saeijs. VLSI program-
ming and silicon compilation. In Proc. Interna-
tional Conf. Computer Design (ICCD), pages 150–
166. IEEE Computer Society Press, 1988.
[4] Erik Brunvand. Translating Concurrent Communicat-
ing Programs into Asynchronous Circuits. PhD thesis,
Carnegie Mellon University, 1991.
[5] T. Chelcea and S. M. Nowick. Resynthesis and peep-
hole transformations for the optimization of large-
scale asynchronous systems. In Proc. ACM/IEEE De-
sign Automation Conference, June 2002.
[6] J. Cortadella and R. M. Badia. An asynchronous
architecture model for behavioral synthesis. In
Proc. European Conference on Design Automation
(EDAC), pages 307–311. IEEE Computer Society
Press, 1992.
[7] J. Cortadella, R. M. Badia, E. Pastor, and A. Pardo.
Achilles: a high-level synthesis system for asyn-
chronous circuits. In D. D. Gajski, editor, Proc.
6th International Workshop on High-Level Synthesis,
pages 87–94. Univ. California, 1992.
[8] J. Cortadella, M. Kishinevsky, A. Kondratyev,
L. Lavagno, and A. Yakovlev. Logic Synthesis of
Asynchronous Controllers and Interfaces. Springer-
Verlag, 2002.
[9] Jordi Cortadella, Michael Kishinevsky, Alex Kon-
dratyev, Luciano Lavagno, and Alexandre Yakovlev.
Petrify: a tool for manipulating concurrent specifica-
tions and synthesis of asynchronous controllers. In
XI Conference on Design of Integrated Circuits and
Systems, Barcelona, November 1996.
[10] R. M. Fuhrer and S. M. Nowick. Sequential Optimiza-
tion of Asynchronous and Synchronous Finite-State
Machines Algorithms and Tools. Kluwer Academic
Publishers, June 2001. ISBN 0-7923-7425-8.
[11] R. M. Fuhrer, S. M. Nowick, M. Theobald, N. K. Jha,
B. Lin, and L. Plana. Minimalist: An environment
for the synthesis, verification and testability of burst-
mode asynchronous machines. Technical Report TR
CUCS-020-99, Columbia University, NY, July 1999.
[12] G. Gopalakrishnan, P. Kudva, and E. Brunvand.
Peephole optimization of asynchronous macromodule
networks. In Proc. International Conf. Computer De-
sign (ICCD), pages 442–446. IEEE Computer Society
Press, October 1994.
[13] Euiseok Kim, Jeong-Gun Lee, and Dong-Ik Lee. Au-
tomatic process-oriented control circuit generation for
asynchronous high-level synthesis. In Proc. Inter-
national Symposium on Advanced Research in Asyn-
chronous Circuits and Systems, pages 104–113. IEEE
Computer Society Press, April 2000.
[14] P. Kudva, G. Gopalakrishnan, and V. Akella. High
level synthesis of asynchronous circuit targeting state
machine controllers. In Asia-Pacific Conference on
Hardware Description Languages (APCHDL), pages
605–610, 1995.
[15] G. De Micheli. Synthesis and optimization of digital
circuits. McGraw-Hill, 1994.
[16] Chris J. Myers. Asynchronous Circuit Design. John
Wiley & Sons, July 2001. ISBN: 0-471-41543-X.
[17] S. Nielsen and J. Madsen. Power constrained high-
level synthesis of battery powered digital systems.
In Proc. Design, Automation and Test in Europe
(DATE), March 2003.
[18] M. Renaudin, P. Vivet, and F. Robin. A design frame-
work for asynchronous/synchronous circuits based on
CHP to HDL translation. In Proc. International Sym-
posium on Advanced Research in Asynchronous Cir-
cuits and Systems, pages 135–144, April 1999.
[19] J. Sparsø and S. Furber, editors. Principles of
asynchronous circuit design – A systems perspective.
Kluwer Academic Publishers, 2001.
[20] Kees van Berkel. Handshake Circuits: an Asyn-
chronous Architecture for VLSI Programming, vol-
ume 5 of International Series on Parallel Computa-
tion. Cambridge University Press, 1993.
