Average-case optimized technology mapping of one-hot domino circuits* by Myers, Chris J. & Chou, Wei-chun
Average-Case Optimized Technology Mapping of One-Hot Domino 
Circuits*
Wei-chun Chou†  Peter A. Beerel†  Ran Ginosar‡**  Rakefet Kol‡
Chris J. Myers§  Shai Rotem¶  Kenneth Stevens¶  Kenneth Y. Yun‖
†EE-Systems, University of Southern California, Los Angeles, CA, USA.
‡VLSI System Research Center, EE and CS Depts., The Technion, Haifa, Israel.
§EE Department, University of Utah, Salt Lake City, UT, USA.
¶Intel Corp., Hillsboro, OR, USA.
‖ECE Dept., University of California, San Diego, CA, USA.
** on sabbatical leave at Intel Corp., Hillsboro, OR.
Abstract
This paper presents a technology mapping tech­
nique for optimizing the average-case delay of asyn­
chronous combinational circuits implemented using 
domino logic and one-hot encoded outputs. The tech­
nique minimizes the critical path for common input 
patterns at the possible expense of making less com­
mon critical paths longer. To demonstrate the appli­
cation of this technique, we present a case study of a 
combinational length decoding block, an integral com­
ponent of an Asynchronous Instruction Length De­
coder (AILD) which can be used in Pentium® pro­
cessors. The experimental results demonstrate that 
the average-case delay of our mapped circuits can be 
dramatically lower than the worst-case delay of the cir­
cuits obtained using conventional worst-case mapping 
techniques.
1 Introduction
Asynchronous circuits are attractive alternatives to 
synchronous circuits because they have the poten­
tial advantages of higher average-case performance 
[12, 13, 6], lower power consumption, and freedom 
from clock-skew problems. Recent emerging asyn­
chronous designs have shown impressive results for dig­
ital signal processing [9, 23, 19] and microprocessors 
[7, 10, 4], but the lack of CAD support is still limiting 
their advances in some areas. This paper focuses on a 
CAD tool for a specific type of design, combinational 
circuits that convert data signals into control signals.
*This research is funded in part by a gift from Intel Corporation
and an NSF CAREER Grant MIP-9502386.
These circuits typically perform instruction decoding 
of some type and, due to their complexity, are often 
the bottleneck in both synchronous and asynchronous 
microprocessors. We focus on a new design style and 
an accompanying CAD tool which can remove this bot­
tleneck, offering in some cases dramatic improvements 
in average-case delay.
Traditionally, combinational circuits that convert 
data into control signals are implemented using single­
rail bundled-data techniques. This method unfortu­
nately implies that the delay of the circuit is deter­
mined by the most complex data needed to be de­
coded (rather than the most common data). Dual-rail 
techniques, in which each signal is encoded with two 
bits, can also be used to design these circuits and fa­
cilitate the optimization for average-case performance. 
Traditional dual-rail designs, however, are typically 
larger, consume more power, and are slower (due to 
the complex completion sensing structures required) 
than single-rail designs.
In this paper, we consider a different design style for 
these decoders which applies a combination of domino 
logic, dual-rail signaling, and one-hot encoded outputs. 
Chris Myers initially conceived of this design [11] and 
Benes et al. independently developed a similar tech­
nique that they used in a decompression circuit for 
embedded processors [4]. Domino logic is used for its 
well-known speed advantage over static logic and be­
cause it guarantees that the outputs are hazard-free. 
However, a single stage of domino logic can only real­
ize functions that are monotonic. Thus, to implement 
all functions, some dual-rail inputs and some dual-rail 
internal signals are sometimes needed. Moreover, the
design style uses one-hot encoded outputs to reduce 
the overhead of completion detection of the evaluation 
phase of the domino logic. The completion detection 
of the precharge phase is simply removed with a timing 
assumption on the precharge signal. The key advan­
tage of this design style is that the domino logic can be 
optimized to prioritize the computation of instructions 
depending upon the instruction frequency, potentially 
leading to dramatic improvements in average-case de­
lays. The circuits, however, can be large and complex, 
and thus could benefit substantially from supporting 
CAD tools.
In this paper, we focus on the technology mapping
problem for this class of circuits. The circuits
are specified with a set of incompletely-specified input
patterns, each associated with a probability that
reflects the input pattern's relative frequency of
occurrence. In practice, these probabilities can be derived
from architectural simulation of the design on typi­
cal data. In addition, we assume that the degree of 
sharing between cones of logic has been determined 
by technology-independent optimization. More specif­
ically, we assume that the unmapped circuit structure 
is given in the form of a NAND-decomposed graph.
The key obstacle to technology mapping of these circuits
is that the delay of a circuit for an incompletely-specified
pattern cannot be precisely determined because
the critical path is unknown when a primary
input is specified to be an "X". Fortunately, one-hot
domino circuits have a special property that allows us
to easily bound the delay for an incompletely-specified
pattern. In particular, for each incompletely-specified
pattern $c$, we identify two representative, completely-specified
patterns, $c_l$ and $c_u$, that yield lower and upper
bounds of the delay for pattern $c$.
Based on this theory, we propose to reduce the tech­
nology mapping problem of one-hot domino circuits to 
the completely-specified input-pattern dependent ap­
proach proposed in [2, 3], which is modified slightly 
to handle domino logic. Specifically, we replace each 
incompletely-specified pattern by one of its two rep­
resentative patterns. Then, we call the mapping rou­
tines described in [2, 3] to minimize the average-case 
delay. Finally, we use the representative patterns to 
derive bounds of the average-case delay of the mapped 
circuit.
We demonstrate our approach with a case study
of an asynchronous instruction length decoder (AILD)
for Pentium® processors. In particular, we describe
two combinational blocks for length decoding which
are key components of a fast asynchronous length decoder.
Our experiments support three important results:
• The range of average-case circuit delays that we
derived by our representative patterns is narrow
(within 11%), thereby illustrating the precision of
our bounds.
• The average-case delays of both our circuits are 
significantly smaller than the average-case de­
lays of the comparable circuits derived using syn­
chronous techniques, thereby illustrating the po­
tential power of our new technology mapper.
• The average-case delays of both our circuits are 
dramatically smaller than the worst-case delay 
of the comparable synchronous circuits, demon­
strating the potential performance benefit of asyn­
chronous circuits.
The remainder of the paper is organized as fol­
lows. Section 2 provides the necessary background 
on technology mapping. Section 3 describes the fea­
tures of one-hot domino logic. Section 4 describes the 
extensions to existing technology mapping techniques 
to handle incompletely-specified patterns and domino 
logic. Section 5 presents the case study in which this 
technique is applied to the design of an asynchronous 
instruction length decoder. Finally, Section 6 gives our 
conclusions.
2 Technology mapping background
For synchronous circuits, technology mapping is often
reduced to directed acyclic graph (DAG) covering,
which can be efficiently approximated by a sequence of
optimal tree coverings [14]. The optimized equations
(obtained from the technology-independent optimization)
are decomposed into a DAG where each node
is a base function. In particular, the DAG is called a
NAND-decomposed graph if the set of base functions
consists of only a NAND2 and an INVERTER [5]. The
technology mapping problem is to find a minimum-cost
covering of the decomposed graph using available
library gates. For area optimization, the cost of a cover
is defined as the sum of the gate areas. For delay
optimization, the cost of a cover is defined as the
worst-case delay of the circuit. Both Chaudhary and
Pedram [5] and Touati [17] extend these works to solve
the minimum area problem under delay constraints.
However, they consider only synchronous static circuits
and employ pessimistic static timing analysis to
determine the worst-case critical paths.
For fundamental mode designs, such as burst-mode
circuits, Siegel and De Micheli show that, with only
small modifications, synchronous technology mapping
techniques can be applied to asynchronous circuits [16].
They use Unger's result [18] to perform hazard-free
decomposition and present an algorithm to identify
library gates which might be hazardous for mapping.
Their results demonstrate that most library gates can
be used safely except some complex gates. The key
shortcoming of this technique is that the underlying
synchronous technology mappers are limited to
optimizing worst-case performance, not average-case
performance.
In [2, 3], Beerel et al. extended these works to
perform decomposition and covering that optimize
the average-case delay of burst-mode asynchronous
control circuits. The possible inputs to these circuits
are given by a set of completely-specified patterns, each
of which is associated with a frequency of occurrence.
Then, an input-pattern-dependent approach is used
to minimize the weighted sum of the delay incurred
by each input pattern, thereby minimizing average-case
delay. The techniques used include rotating the
NAND-decomposed network to push more frequent
primary inputs closer to the input, and a
dynamic-programming-based technique to explore mappings of
the optimized decomposed network that are deemed
likely to minimize average-case delay.
Here, we further extend this work to combinational
circuits implemented using domino circuits and one-hot
outputs. Unlike burst-mode circuits, the possible
inputs to these circuits are specified with a set of
incompletely-specified patterns, each of which is associated
with a frequency of occurrence, which complicates
the technology mapping problem.
3 One-hot domino logic
The basic block diagram of a one-hot domino combinational
logic block in an environment is shown in
Figure 1. This section describes the structure and operation
of the logic as well as its advantages over other
currently known approaches.
3.1 The domino core
Domino logic is widely used in high-speed circuits because
of its inherent performance advantages. It has
smaller parasitic capacitance [20] and separates the
pull-up and pull-down events to avoid the fight between
the precharge and discharge current [21], often
yielding circuits that are faster than circuits obtainable
with static CMOS.
Domino logic consists of two types of gates: static
CMOS and dynamic precharged gates, both of which
must be inverting. As illustrated in Figure 2, the type
of gates alternates along any path from inputs to outputs.
This is sometimes referred to as the domino
constraint. Notice that we allow the static gate to be
any inverting CMOS gate [21], whereas, traditionally,
the static gate is restricted to be an inverter [20].
Figure 1: A block diagram of the one-hot domino logic
design style for combinational circuits. [Diagram labels:
dual-rail inputs x1, x2; single-rail inputs y1, y2, y3.]
Figure 2: An illustration of domino logic. [Diagram labels:
precharge; alternating dynamic and static gates.]
Notice that all dynamic (static) gates precharge
(discharge) simultaneously during the precharge
phase. Thus, the precharge time is fast and essentially
data-independent. Consequently, we need only
optimize the evaluation delay of the circuit.
The gates closest to the primary inputs, referred
to as PI gates, should be dynamic rather than static.
This is because the primary inputs can be assumed
to be stable but it is not known whether they will
be stable 1 or stable 0 at the start of evaluation or
precharge. Consequently, if the PI gate is static, a
stable 0 at the primary input may cause a value of 1
to appear at the input to the subsequent dynamic gate
at the start of evaluation, possibly causing accidental
discharge.
Although we restrict our mapped circuits to the
style of domino logic depicted in Figure 2, we note
that it may be desirable to further optimize the circuits
after technology mapping. For example, in some
cases, the pull-down transistor driven by the precharge
line (shaded in Figure 2) can be removed, creating what
is sometimes called semi-controlled domino logic. This
can lead to faster evaluation times because it reduces
the stack size, but may lead to significant short-circuit
current during precharge [21]. The short-circuit current
sometimes creates reliability problems, which can
be avoided by staggering the precharge signal [23].
In addition, we note that charge-sharing problems
are always crucial to domino circuits [20]. We assume
that either the problems are minimized by precharging
each transistor of the pull-down network of each
dynamic gate or that further charge-sharing analysis
is applied to the mapped circuits.
A key feature of domino logic is that, when designed
properly, it can have only monotonic transitions [20].
Consequently, by its very nature, it is hazard-free. It
can therefore be easily used in asynchronous circuits by
controlling the precharge signal via an asynchronous
controller rather than a global clock [8, 23, 22]. Further
descriptions of the expected operation of this controller
will be given below.
One complication of domino logic is that one stage
of domino logic can implement only those functions
which are monotonic in their inputs. In particular,
binate functions cannot be implemented. Fortunately,
this is not a serious limitation because, by introducing
some dual-rail primary inputs, any function can be
implemented [20].
3.2 Completion sensing
A naive means of detecting completion of a one-hot
encoded combinational logic block is to explicitly derive
a done signal from the logical OR of all one-hot
encoded outputs. When the done signal rises, the subsequent
operation can then be initiated. This means
that the start of the next operation is delayed by at
least the delay associated with a possibly wide OR
gate. Fortunately, there are many instances in which
a much better approach can be used.
Consider, for example, the case in which each output
$O_i$ should initiate a different operation $i$, as depicted
in Figure 1. To implement this, a different
controller associated with each operation can be used.
When the $i$-th controller senses signal $O_i$ rising, it can
trigger the start of the next operation (by raising $Go_i$)
simultaneously with acknowledging the completion of
the one-hot logic (by raising $Ack_i$). The logical OR of
all acknowledgments, $Ack_i$, can trigger the precharge
phase. Thus, the completion sensing delay of the evaluation
phase can be completely hidden. We note that
this approach was recently used by Benes et al. in the
implementation of a high-speed decompression circuit
for embedded processors [4].
3.3 Precharge phase
In a purely speed-independent implementation the
precharging of the logic block must also have some type
of completion detection. In this design style, however,
timing assumptions can be used to remove the need for
an explicit completion detection mechanism. Specifically,
all dynamic gates are simultaneously precharged,
making the precharge time essentially fixed and
data-independent. Consequently, control circuitry can easily
be used to guarantee that the precharge signal
does not become de-asserted until after all gates have
been precharged. If some gates are semi-controlled,
however, a delay line may be necessary to model the
precharge delay [4]. An efficient technique to combine
the delay line with the precharge logic for an asynchronous
adder is described in [23].
3.4 Comparison to other approaches
We first contrast one-hot domino logic with traditional
single-rail, bundled-data approaches in which the output
control signals are all latched and the outputs of
the latches, which are guaranteed to be hazard-free,
are used to drive controllers. Using one-hot hazard-free
outputs completely avoids the latch overhead, including
the latch propagation delay and the set-up and
hold times. Moreover, single-rail techniques must minimize
the worst-case delay among all outputs for all input
combinations. Using one-hot techniques, each output
can be independently minimized to prioritize the
most frequently occurring input combinations which
make it fire. Our experimental results suggest that
this flexibility can lead to significant speed advantages.
The disadvantage of this technique compared
with single-rail approaches is that one-hot logic may be
larger, domino logic typically consumes more power,
and domino logic often requires careful attention to
layout to ensure correct operation.
We note that it is also possible to build these combinational
circuits using the speculative completion signaling
approaches proposed by Nowick et al. [12, 13].
In this approach, the core logic can be optimized for
the common case and side logic can be created to
identify when common input data arrives and trigger
the done signal to designate that the result is obtained.
This approach can lead to some reduction in
the average-case delay, but it is unclear how easy it
would be to generate the side logic for general functions.
The advantage of speculative completion approaches
is that they can be applied to static logic,
which is simpler to design.
We also note that the concept of using domino circuits
in asynchronous designs is not new. For example,
Williams demonstrated the power of domino circuits
very convincingly in his landmark asynchronous divider
[22]. In addition, Yun et al. used it effectively in
asynchronous adder and multiplier designs [23].
4 Technology mapping
We now describe how we can extend the technology
mapping techniques in [2, 3] to accommodate one-hot
domino logic. In particular, we show how we perform
technology mapping in the presence of incompletely-specified
input patterns and the domino constraint.
4.1 Incompletely-specified patterns
An incompletely-specified pattern is a function from
primary input variables to the set {0, 1, X}. The problem
is that the delay of the circuit for an incompletely-specified
pattern cannot be precisely defined because
the exact set of gates that will evaluate is unknown in
the presence of a primary input with the value "X".
Moreover, since the exact set of evaluating gates is unknown,
it is unclear which paths the technology mapper
should optimize.
To address these problems, we interpret an
incompletely-specified pattern as a collection of
minterms over the input variables, where each minterm
corresponds to a compatible completely-specified pattern.
Formally, a completely-specified pattern is a
function from primary input variables to the set
{0, 1}. A completely-specified pattern $i$ is compatible
with an incompletely-specified pattern $c$ if the assignments
agree on all input variables not assigned to "X"
in $c$.
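As a small illustration of the compatibility relation, the following Python sketch encodes a pattern as a string over 0, 1, and X, one character per primary input. This encoding and the function names are assumptions made here for illustration; they are not taken from the paper.

```python
from itertools import product


def is_compatible(complete, incomplete):
    """Return True if `complete` (a string over 0/1) agrees with
    `incomplete` (a string over 0/1/X) on every position not set to X."""
    return len(complete) == len(incomplete) and all(
        c == "X" or c == b for b, c in zip(complete, incomplete)
    )


def compatible_patterns(incomplete):
    """Enumerate every completely-specified pattern compatible with
    `incomplete` by expanding each X into both 0 and 1."""
    choices = [("0", "1") if c == "X" else (c,) for c in incomplete]
    return ["".join(bits) for bits in product(*choices)]


if __name__ == "__main__":
    print(compatible_patterns("1X0X"))
```

A pattern with k X's expands into 2^k compatible patterns, so this enumeration is only practical for small k.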
It is clear that the circuit delay for a completely-specified
pattern is well defined and can be established
through simulation. Consequently, we can define a
range of delays for an incompletely-specified pattern
$c$ as follows. The minimum (maximum) of the range
is the smallest (largest) circuit delay incurred by any
compatible pattern. Note that the number of compatible
patterns can be exponential in the number of circuit
variables. Thus, exhaustively simulating all compatible
patterns is computationally very expensive.
Fortunately, the special nature of domino logic can
be used to simplify this analysis. Specifically, this section
proves that two easily identifiable compatible patterns,
referred to as representative patterns, yield the
lower and upper bounds of the pattern delay for an
incompletely-specified pattern. The section then describes
how we can use these representative patterns
in technology mapping.
4.1.1 Bounding the delay of incompletely-specified
patterns: intuition
The intuition behind our theory may be described with
an analogy to the game called dominoes (which is the
origin of the name "domino logic"). In this game, rectangular
tiles are often arranged in a linear fashion (or
sometimes in more complex networks) such that the
first tile falling causes a chain reaction of falling tiles.
The delay of the chain reaction is the time in between
the first and last tile falling. Notice that more than
one tile can fall simultaneously to start the chain reaction
and that some tiles may remain standing after
the chain reaction completes.
Consider further the case where the set of tiles that
start the reaction is not fully specified. In particular,
consider the case where certain combinations of
tiles can be chosen to start the chain reaction but the
choice of which combination is unknown. In this case,
the chain reaction delay cannot be determined. However,
a lower bound on the chain reaction delay can
be obtained by tipping over any tile which is tipped
over in any combination. Similarly, an upper bound
on the chain reaction delay can be obtained by tipping
over only those tiles which are tipped over in all
combinations.
The analogy is that a domino gate is like a tile.
We say a dynamic (static) gate evaluates if its output
falls (rises). Gates that evaluate are like tiles that
fall; they cannot return to their original value until the
precharge phase. When one gate evaluates it can cause
other gates to evaluate in what is like a chain reaction.
Moreover, the evaluation delay is analogous to the delay
of the chain reaction. Finally, an incompletely-specified
input pattern is analogous to the situation
where the set of tiles that starts the chain reaction is
not fully specified.
Thus, to find a lower bound of the delay for an
incompletely-specified pattern, we force any PI gate
that evaluates under any compatible pattern to evaluate.
Similarly, to find the upper bound of the delay
we force only those PI gates that evaluate under all
compatible patterns to evaluate.
In our application, the PI gates are restricted to
be dynamic (see Section 3.1). Thus, to find the lower
bound we set all unknown inputs to one. Similarly,
to find the upper bound we set all unknown inputs to
zero.
More formally, we define two representative patterns
for an incompletely-specified pattern $c$. The lower pattern
$c_l$ is obtained by switching all X's in $c$ to 1 and
yields a lower bound of $c$'s pattern delay. Similarly, the
upper pattern $c_u$ is obtained by switching all X's in $c$
to 0 and yields an upper bound of $c$'s pattern delay.
It is important to note that the bound is loose in the
presence of dual-rail inputs aT and aF since in reality
both aT and aF cannot be set to the same value.
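The two representative patterns can be sketched in a few lines of Python. The string encoding over 0, 1, and X and all function names are illustrative assumptions, not the authors' implementation. The last helper checks, by brute force, that the set of inputs at 1 under any compatible pattern lies between the corresponding sets for the upper and lower patterns.

```python
from itertools import product


def lower_pattern(c):
    """Lower pattern: every X switched to 1 (yields the delay lower bound)."""
    return c.replace("X", "1")


def upper_pattern(c):
    """Upper pattern: every X switched to 0 (yields the delay upper bound)."""
    return c.replace("X", "0")


def ones(pattern):
    """Indices of the primary inputs whose value is 1."""
    return {k for k, bit in enumerate(pattern) if bit == "1"}


def ones_are_nested(c):
    """Brute-force check that ones(upper) <= ones(i) <= ones(lower)
    for every completely-specified pattern i compatible with c."""
    f_cu, f_cl = ones(upper_pattern(c)), ones(lower_pattern(c))
    choices = [("0", "1") if ch == "X" else (ch,) for ch in c]
    return all(f_cu <= ones(bits) <= f_cl for bits in product(*choices))


if __name__ == "__main__":
    print(lower_pattern("1X0X"), upper_pattern("1X0X"), ones_are_nested("1X0X"))
```

Note that this sketch treats every input independently, so for dual-rail pairs it produces exactly the loose, physically unreachable assignments mentioned above.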
4.1.2 Bounding the delay of incompletely-specified
patterns: theory
This section formalizes our intuition. First, we introduce
some additional terminology.
Definition 4.1 (Controlling input) An input $f$ of
a gate $g$ is a controlling input of $g$ iff $f$ has a value or
a transition which independently forces $g$ to evaluate.
An input which is not controlling is referred to as
non-controlling.
Given a pattern $i$, let $FC(i, g)$ denote the set of controlling
inputs to $g$. Similarly, $FNC(i, g)$ denotes the
set of $g$'s non-controlling inputs. Let $g_f$ denote the gate
which connects to $g$ at its input $f$. Let $d(g, f, i)$ denote
the pin-to-pin delay of $g$ for input $f$. If $g$ evaluates and
has a controlling input, the pattern arrival time of $g$
for pattern $i$, denoted $pat(i, g)$, is defined as follows:
$$pat(i, g) = \min_{f \in FC(i, g)} \left[ pat(i, g_f) + d(g, f, i) \right] \qquad (1)$$
If $g$ evaluates but has only non-controlling inputs,
$pat(i, g)$ is defined as follows:
$$pat(i, g) = \max_{f \in FNC(i, g)} \left[ pat(i, g_f) + d(g, f, i) \right] \qquad (2)$$
Each gate's pattern arrival time can be computed
by recursively applying Equations 1 and 2 in postorder
of gates in the circuit. Note that since the circuit is
one-hot encoded, any input pattern can make only one
PO gate (any gate that drives a primary output) evaluate.
Let $po(i)$ denote a function which returns the
evaluating PO when pattern $i$ is applied. The pattern
delay of the circuit for pattern $i$, denoted $p\_delay_i$, is
equal to $pat(i, po(i))$.
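The recursion in Equations 1 and 2 can be sketched as follows. This is a simplified model under stated assumptions rather than the paper's implementation: each gate is tagged either "any" (every input that has fired is controlling, so Equation 1 applies) or "all" (no single input forces evaluation, so the gate evaluates only when all inputs have fired and Equation 2 applies). A primary input at 1 is treated as fired at time 0, pin-to-pin delays are assumed pattern-independent, and the gate dictionary is assumed to be listed in topological order. The names `pattern_arrival_times` and `GATES` are hypothetical.

```python
def pattern_arrival_times(gates, pattern):
    """Compute pattern arrival times for all evaluating gates.

    gates:   dict mapping gate name -> (kind, {input name: pin delay}),
             listed in topological order; kind is "any" or "all".
    pattern: dict mapping primary input name -> 0 or 1.
    Returns a dict of arrival times for fired primary inputs and
    evaluating gates; gates that do not evaluate are absent.
    """
    # Primary inputs at 1 are the events that start the chain reaction.
    arrival = {pi: 0.0 for pi, value in pattern.items() if value == 1}
    for name, (kind, pin_delays) in gates.items():
        fired = {f: arrival[f] + d for f, d in pin_delays.items() if f in arrival}
        if kind == "any" and fired:
            # Equation (1): earliest controlling input determines the time.
            arrival[name] = min(fired.values())
        elif kind == "all" and len(fired) == len(pin_delays):
            # Equation (2): latest non-controlling input determines the time.
            arrival[name] = max(fired.values())
    return arrival


# Hypothetical three-gate network with primary inputs a, b, c and output o.
GATES = {
    "g1": ("any", {"a": 1.0, "b": 2.0}),
    "g2": ("all", {"b": 1.0, "c": 1.0}),
    "o": ("any", {"g1": 1.0, "g2": 1.5}),
}

if __name__ == "__main__":
    print(pattern_arrival_times(GATES, {"a": 0, "b": 1, "c": 1})["o"])
    print(pattern_arrival_times(GATES, {"a": 0, "b": 1, "c": 0})["o"])
```

In this toy network, raising input c from 0 to 1 lets g2 evaluate and shortens the arrival time at o, which is the behavior the bounds in this section rely on.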
In addition, let $F_i$ denote the set of all PIs whose
value is 1 when pattern $i$ is applied. Moreover, let
$FI(g)$ denote the set of all inputs of gate $g$ and let
$E(i, k)$ denote the set of all evaluating gates in level $k$
of the circuit when pattern $i$ is applied.
The following two lemmas prove our intuition that
the representative pattern $c_l$ ($c_u$) yields the lower (upper)
bound of the delay for an incompletely-specified
pattern $c$. Informally speaking, the first lemma proves
that the more PIs set to one the more gates will evaluate
and the second lemma proves that the more gates
that evaluate the smaller the resulting pattern delay.
Their proofs are given in the appendix.
Lemma 4.1 If all PI gates are dynamic and $F_i \subseteq F_j$,
then $E(i, l) \subseteq E(j, l)$ for every level $l$.
Lemma 4.2 If all PI gates are dynamic and $F_i \subseteq F_j$,
then, for every level $l$ and all $g \in E(i, l)$, we have that
$pat(j, g) \le pat(i, g)$.
The following corollary follows directly from the application
of Lemma 4.2 on the primary outputs, from
which it is easy to conclude our argument.
Corollary 4.1 If all PI gates are dynamic and $F_i \subseteq F_j$,
then $p\_delay_j \le p\_delay_i$.
Theorem 1 Let $c_l$ and $c_u$ be the lower and upper pattern
of an incompletely-specified input pattern $c$, respectively.
Assuming all PI gates are dynamic, then
for all $c$, $p\_delay_{c_l}$ ($p\_delay_{c_u}$) is a lower (upper) bound
of all pattern delays for all completely-specified patterns
that are compatible with $c$.
Proof: Consider a completely-specified pattern $i$ that
is compatible with $c$. Since pattern $c_l$ ($c_u$) is generated
by switching all X's in pattern $c$ to 1 (0), $F_{c_u} \subseteq F_i \subseteq F_{c_l}$.
Therefore, according to Corollary 4.1,
$p\_delay_{c_l} \le p\_delay_i \le p\_delay_{c_u}$. □
4.1.3 Optimizing for representative patterns
As mentioned earlier, the technology mapping algorithms
presented in [2, 3] cannot handle input combinations
described using incompletely-specified patterns.
One means of working with incompletely-specified
patterns is to optimize with respect to all
compatible patterns. However, this has two problems.
First, it is unknown how the probability of an
incompletely-specified pattern is distributed over all
of its compatible patterns. Thus, only approximate
measures of overall pattern delay could be computed.
Second, since the number of compatible patterns could
be quite large, analyzing all compatible patterns independently
can be computationally intractable.
In this paper, we propose to optimize the circuit
for one representative pattern for each incompletely-specified
pattern. The choice of representative patterns
is very important and different input representative
patterns can lead to very different results.
In this paper, we tested two sets of representative
patterns to optimize for. For a set of
incompletely-specified input patterns $C$, we define
$L = \{c_l \mid c \in C\}$ as the lower set of $C$, and
$U = \{c_u \mid c \in C\}$ as the upper set of $C$. We
run the optimization procedure twice, once optimizing
the benchmark for the lower set and once optimizing
the benchmark for the upper set. Since the average-case
delay is the weighted sum of all pattern delays for
all incompletely-specified patterns [2, 3], we can easily
conclude that the average-case delay for the lower (upper)
set is the lower (upper) bound of the average-case
delay for the original incompletely-specified patterns.
Therefore, for each of the two optimization results, we 
use the upper and lower sets again to obtain a range 
of average-case delay. Then, we let the user select the 
better result.
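The range computation reduces to two weighted sums. A minimal Python sketch, assuming each incompletely-specified pattern is summarized by its probability together with the pattern delays of its lower and upper representative patterns on a given mapped circuit; the tuple layout, the function name, and the numbers are made up for illustration.

```python
def average_case_bounds(patterns):
    """Return (lower, upper) bounds on the average-case delay.

    patterns: iterable of (probability, delay_of_lower_pattern,
              delay_of_upper_pattern) tuples, one per
              incompletely-specified pattern.
    """
    patterns = list(patterns)
    lower = sum(p * d_lower for p, d_lower, _ in patterns)
    upper = sum(p * d_upper for p, _, d_upper in patterns)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical probabilities and delays for three patterns.
    example = [(0.5, 2.0, 3.0), (0.25, 4.0, 4.0), (0.25, 6.0, 8.0)]
    print(average_case_bounds(example))
```

Running this once for the circuit optimized for the lower set and once for the circuit optimized for the upper set gives the two ranges from which the better mapping is chosen.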
4.2 Handling the domino constraint
Recall that the input to the covering is a NAND-decomposed
DAG, referred to as a subject graph. Our
goal is to cover the subject graph with a set of library
gates which are all inverting and either static or dynamic.
Let $(N, E)$ be a subject graph where $N$ is a set
of nodes and $E$ is a set of edges ($E \subseteq N \times N$).
Recall also that one stage of domino logic can implement
only monotonic logic. This limitation is manifested
in technology mapping by the fact that not all
decomposed networks can be mapped using domino
logic. Consider the decomposed network in which
there are two reconvergent fanout paths from $u$ to
$v$, where $u$ is a gate driven by a primary input. Let
the first be $u, n_1, n_2, \ldots, n_l, v$ and let the second be
$u, n'_1, n'_2, \ldots, n'_{l'}, v$. If $l$ and $l'$ are both odd (both even)
then the domino constraint demands that $v$ is implemented
with a dynamic (static) gate. If $l$ is even and
$l'$ is odd (or vice-versa) then no mapping exists. Fortunately,
this situation can be resolved by duplicating
portions of the NAND-decomposed network and introducing
dual-rail inputs [20]. The result is an altered
NAND-decomposed graph which is domino-feasible, as
defined below.
Definition 4.2 (Domino-feasible DAG) A
domino-feasible DAG is a triple $(N, E, \lambda)$, where $N$
is a set of nodes, $E$ is a set of edges ($E \subseteq N \times N$),
and $\lambda$ is a labeling function $N_l \to \{Dynamic, Static\}$
that satisfies $\lambda(u) = Dynamic$ for all $u \in I$ and
$\lambda(u) \neq \lambda(v)$ if $(u, v) \in E$ where $u, v \in N_l$.
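Whether such a labeling exists can be checked with a two-coloring pass over the graph. The Python sketch below is an illustration under assumptions, not the authors' procedure: it takes the set $I$ to be the PI gates (which Section 3.1 requires to be dynamic), labels every gate node, and assumes every node is connected to some PI gate. The function name `domino_labeling` is hypothetical.

```python
from collections import deque


def domino_labeling(nodes, edges, pi_gates):
    """Return a dict node -> "Dynamic"/"Static" in which PI gates are
    Dynamic and the two ends of every edge differ, or None if no such
    labeling exists."""
    neighbors = {n: set() for n in nodes}
    for u, v in edges:
        # The alternation constraint is symmetric, so propagate both ways.
        neighbors[u].add(v)
        neighbors[v].add(u)
    label = {u: "Dynamic" for u in pi_gates}
    queue = deque(pi_gates)
    while queue:
        u = queue.popleft()
        opposite = "Static" if label[u] == "Dynamic" else "Dynamic"
        for v in neighbors[u]:
            if v not in label:
                label[v] = opposite
                queue.append(v)
            elif label[v] == label[u]:
                return None  # Two adjacent gates would share a type.
    return label


if __name__ == "__main__":
    # Reconvergent paths u-n1-v and u-n2-v (one intermediate node each).
    ok = domino_labeling(
        ["u", "n1", "n2", "v"],
        [("u", "n1"), ("n1", "v"), ("u", "n2"), ("n2", "v")],
        ["u"],
    )
    print(ok)
    # Paths u-n1-v and u-n2-n3-v have mismatched parity: not feasible.
    bad = domino_labeling(
        ["u", "n1", "n2", "n3", "v"],
        [("u", "n1"), ("n1", "v"), ("u", "n2"), ("n2", "n3"), ("n3", "v")],
        ["u"],
    )
    print(bad)
```

A None result corresponds to the mismatched-parity case described above, where portions of the network must be duplicated and dual-rail inputs introduced before mapping.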
Then, to extend the technology mapping technique 
in [2, 3] to domino circuits we simply restrict the 
matching of static (dynamic) nodes to only static (dy­
namic) gates. The remaining parts of the algorithm 
need not be changed and we refer the reader to [2, 3] 
for more details.
5 A case study
We now describe the key combinational block of an 
asynchronous instruction length decoder (AILD). The 
overall architecture of the instruction decoder and the 
associated control circuits are outside of the scope of 
this paper and will hopefully be reported in separate 
papers.
5.1 Instruction format
Figure 3 shows the general instruction format for the
Pentium® processor [1]. Instructions consist of 4 optional
instruction prefixes, opcode bytes, an optional
address specifier consisting of the ModR/M byte and
the SIB (Scale Index Base) byte, and optional displacement
and immediate fields.
[Figure 3: general instruction format; fields in order, with sizes]
Field                  Size
Instruction prefix     0 or 1 Bytes
Address-size prefix    0 or 1 Bytes
Operand-size prefix    0 or 1 Bytes
Segment override       0 or 1 Bytes
Opcode                 1 or 2 Bytes
ModR/M                 0 or 1 Bytes
SIB                    0 or 1 Bytes
Displacement           0, 1, 2 or 4 Bytes
Immediate              0, 1, 2 or 4 Bytes
Figure 4: The ModR/M and SIB fields. [Diagram labels: bit
positions 7 to 0; SS, Index, and Base fields.]
Each prefix is one byte long. Only the operand-size
prefix and the address-size prefix affect the instruction
length. Because these are very rare, we choose to trap
and handle them using slower exception logic which
will not be discussed here. The opcode represents the
operation of the instruction. It identifies the size of
the operation, the displacement, and the immediate.
It is either one byte long or two bytes long; in a
two-byte opcode the first byte is always 0F. The ModR/M
byte identifies a special addressing form for instructions
that refer to an operand in memory. The ModR/M byte
always follows the opcode. Some ModR/M bytes are
followed by the SIB byte, a second addressing byte.
ModR/M and SIB also determine the existence and size
of the displacement and immediate. The displacement
follows the opcode, or ModR/M, or SIB (whichever is last).
The immediate, if present, is always the last field of
an instruction. Both the displacement and immediate
fields can be one, two, or four bytes long. The maximum
valid instruction length is 15 bytes. Figure 4
shows the ModR/M and SIB byte format. The details
of each field can be found in [1].
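To make the field sizes above concrete, here is a small Python sketch that sums the field widths listed in this section and rejects combinations the text rules out. The function name and its parameters are illustrative assumptions, not part of the decoder described in this paper, which computes lengths with one-hot logic rather than arithmetic.

```python
VALID_FIELD_SIZES = (0, 1, 2, 4)  # displacement / immediate bytes (0 = absent)
MAX_LENGTH = 15                   # maximum valid instruction length in bytes


def instruction_length(opcode_bytes, has_modrm=False, has_sib=False,
                       disp_bytes=0, imm_bytes=0, prefix_bytes=0):
    """Total length in bytes of one instruction (hypothetical helper)."""
    if opcode_bytes not in (1, 2):
        raise ValueError("the opcode is one or two bytes long")
    if not 0 <= prefix_bytes <= 4:
        raise ValueError("there are at most 4 optional one-byte prefixes")
    if has_sib and not has_modrm:
        raise ValueError("the SIB byte only follows a ModR/M byte")
    if disp_bytes not in VALID_FIELD_SIZES:
        raise ValueError("displacement must be 0, 1, 2, or 4 bytes")
    if imm_bytes not in VALID_FIELD_SIZES:
        raise ValueError("immediate must be 0, 1, 2, or 4 bytes")

    total = (prefix_bytes + opcode_bytes + int(has_modrm) + int(has_sib)
             + disp_bytes + imm_bytes)
    if total > MAX_LENGTH:
        raise ValueError("instruction exceeds the 15-byte maximum")
    return total


# One-byte opcode, ModR/M byte, one-byte displacement.
short_form = instruction_length(1, has_modrm=True, disp_bytes=1)
# Two-byte opcode, ModR/M, SIB, four-byte displacement and immediate.
long_form = instruction_length(2, has_modrm=True, has_sib=True,
                               disp_bytes=4, imm_bytes=4)
```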
5.2 Instruction length frequencies
The motivation of the asynchronous design stems from 
an analysis of several benchmark programs in which 
instruction lengths are monitored. This analysis led to 
the frequency histogram presented in Figure 5. This 
chart clearly shows that instructions of lengths two 
and three are very frequent, whereas others are much 
less frequent. Instructions of length greater than seven 
are extremely rare. This motivates our design to be
Figure 5: The frequency of instruction lengths.
Figure 6: The block diagram of the asynchronous
instruction length decoder.
instructions are handled separately using slower logic 
that is not discussed here.
5.3 One-hot domino logic blocks
One-hot domino logic forms the combinational block
that inputs an instruction and yields the one-hot
encoded instruction length for the instructions with
lengths less than 7. Specifically, as shown in Figure 6,
this block is decomposed into 6 one-hot domino logic
blocks: Opcode1, Opcode2, Mem1, Mem2, and two
length merging blocks, Merge1 and Merge2. The Opcode1
and Opcode2 blocks compute the length contributed
by the first and second opcode byte, respectively.
The Mem1 and Mem2 blocks compute the
length contributed by the ModR/M byte for one-byte
and two-byte opcodes, respectively. The two merging
blocks add these contributions to form the final length
outputs.
The Opcode1 block generates the following 11
one-hot encoded outputs: Op1O1NoM, Op1Oc2M1,
Op1O2NoM, Op1Oc3M1, Op1O3NoM, Op1Oc4M1,
Op1O4NoM, Op1O5NoM, Op1Oc6M1, Op1O7NoM,
and is0F. Op1Oc2M1, for example, denotes that the
first byte of the instruction is the only opcode byte,
that it contributes two bytes to the total length, and
that the ModR/M byte is present. Op1O2NoM denotes
the same information as Op1Oc2M1 except that no
ModR/M byte is present. The other outputs have similar
interpretations. Note that Op1O6NoM, for example,
is not possible. The is0F output is asserted when
the opcode consists of two bytes (in which case the
first byte must be 0F).
The Opcode2 block generates 6 one-hot encoded
outputs defined similarly. Op2Oc3M2, for example,
denotes that the second opcode byte contributes three
bytes (including the first opcode byte 0F) and the
ModR/M byte is present.
The Mem1 (Mem2) block checks the ModR/M byte
for the one-byte (two-byte) opcode to generate 5 one-hot
encoded outputs: M10, M11, M12, M14, and M15
(M20, M21, M22, M24, and M25). These indicate
that the ModR/M byte contributes 0, 1, 2, 4, or 5
bytes to the total length, respectively.
The Merge1 block combines Opcode2's outputs
(except Op2O6NoM) and Mem2's outputs to obtain
the length for the instructions having a two-byte
opcode (see Table 1). The Merge2 block then combines
the outputs of Opcode1, Mem1, and Merge1 (along
with Op2O6NoM from Opcode2) to obtain the
final one-hot length outputs, as defined in Table 2.
This configuration means that instructions having
a two-byte opcode will have a longer length computation
time than instructions having a one-byte opcode,
except the one represented by Op2O6NoM. This
improves the average-case delay of the length computation
because most one-byte-opcode instructions are
more frequent than the two-byte-opcode instructions.
Op2O6NoM is fed directly to Merge2 since it is also
frequent and it need not be ANDed with any of
Mem2's outputs.
L    Out.    Equation
3    L3_0F   Op2Oc3M2*M20 + Op2O3NoM
4    L4_0F   Op2Oc3M2*M21 + Op2Oc4M2*M20 + Op2O4NoM
5    L5_0F   Op2Oc3M2*M22 + Op2Oc4M2*M21
6    L6_0F   Op2Oc4M2*M22
7    L7_0F   Op2Oc3M2*M24
8    L8_0F   Op2Oc3M2*M25 + Op2Oc4M2*M24
9    L9_0F   Op2Oc4M2*M25
Table 1: The length equations implemented in the
Merge1 block.

L    Out.    Equation
2    L2      Op1Oc2M1*M10 + Op1O2NoM
3    L3      Op1Oc2M1*M11 + Op1Oc3M1*M10 + Op1O3NoM + is0F*L3_0F
4    L4      Op1Oc2M1*M12 + Op1Oc3M1*M11 + Op1Oc4M1*M10 + Op1O4NoM + is0F*L4_0F
5    L5      Op1Oc3M1*M12 + Op1Oc4M1*M11 + Op1O5NoM + is0F*L5_0F
6    L6      Op1Oc2M1*M14 + Op1Oc4M1*M12 + Op1Oc6M1*M10 + is0F*Op2O6NoM + is0F*L6_0F
7    L7      Op1Oc2M1*M15 + Op1Oc3M1*M14 + Op1Oc6M1*M11 + Op1O7NoM + is0F*L7_0F
8    L8      Op1Oc3M1*M15 + Op1Oc4M1*M14 + Op1Oc6M1*M12 + is0F*L8_0F
9    L9      Op1Oc4M1*M15 + is0F*L9_0F
10   L10     Op1Oc6M1*M14
11   L11     Op1Oc6M1*M15
Table 2: The length equations implemented in the
Merge2 block.
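The merge equations of Tables 1 and 2 can be exercised with a short Python sketch. The equations are transcribed from the two tables; the sum-of-products evaluator `eval_sop` and the helper `length_outputs` are hypothetical names introduced only for illustration, and the sketch models the Boolean function, not the domino implementation or its timing.

```python
MERGE1 = {  # Table 1: two-byte-opcode lengths
    "L3_0F": "Op2Oc3M2*M20 + Op2O3NoM",
    "L4_0F": "Op2Oc3M2*M21 + Op2Oc4M2*M20 + Op2O4NoM",
    "L5_0F": "Op2Oc3M2*M22 + Op2Oc4M2*M21",
    "L6_0F": "Op2Oc4M2*M22",
    "L7_0F": "Op2Oc3M2*M24",
    "L8_0F": "Op2Oc3M2*M25 + Op2Oc4M2*M24",
    "L9_0F": "Op2Oc4M2*M25",
}

MERGE2 = {  # Table 2: final one-hot length outputs
    "L2": "Op1Oc2M1*M10 + Op1O2NoM",
    "L3": "Op1Oc2M1*M11 + Op1Oc3M1*M10 + Op1O3NoM + is0F*L3_0F",
    "L4": "Op1Oc2M1*M12 + Op1Oc3M1*M11 + Op1Oc4M1*M10 + Op1O4NoM"
          " + is0F*L4_0F",
    "L5": "Op1Oc3M1*M12 + Op1Oc4M1*M11 + Op1O5NoM + is0F*L5_0F",
    "L6": "Op1Oc2M1*M14 + Op1Oc4M1*M12 + Op1Oc6M1*M10 + is0F*Op2O6NoM"
          " + is0F*L6_0F",
    "L7": "Op1Oc2M1*M15 + Op1Oc3M1*M14 + Op1Oc6M1*M11 + Op1O7NoM"
          " + is0F*L7_0F",
    "L8": "Op1Oc3M1*M15 + Op1Oc4M1*M14 + Op1Oc6M1*M12 + is0F*L8_0F",
    "L9": "Op1Oc4M1*M15 + is0F*L9_0F",
    "L10": "Op1Oc6M1*M14",
    "L11": "Op1Oc6M1*M15",
}


def eval_sop(equation, signals):
    """Evaluate a sum-of-products string; '+' is OR and '*' is AND."""
    return any(
        all(signals.get(literal.strip(), False) for literal in term.split("*"))
        for term in equation.split("+")
    )


def length_outputs(asserted):
    """Return the names of the length outputs asserted by Merge2."""
    signals = {name: True for name in asserted}
    # Merge1 feeds its results into Merge2.
    for out, equation in MERGE1.items():
        signals[out] = eval_sop(equation, signals)
    return [out for out, eq in MERGE2.items() if eval_sop(eq, signals)]


one_byte = length_outputs({"Op1Oc2M1", "M11"})          # 2 + 1 bytes
two_byte = length_outputs({"is0F", "Op2Oc3M2", "M21"})  # 3 + 1 bytes
direct = length_outputs({"is0F", "Op2O6NoM"})           # bypasses Merge1
```

For well-formed one-hot inputs exactly one length output is asserted, which is what makes the encoding usable as its own completion signal.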
5.4 Product term frequencies
For each combinational logic block, a two-level
minimizer is used to obtain an optimized set of
product terms. Then, architectural simulation is
used to obtain frequency statistics for each product
term. We then associate with each product term
an incompletely-specified pattern and use the normalized
product-term frequencies as an estimate of the
frequency of the incompletely-specified pattern. The
resulting frequency distributions of the incompletely-specified
patterns for the Op1O1NoM and Op1Oc2M1
outputs are given in Figure 7.
The distributions of all patterns for all outputs of
both the Opcode1 and Opcode2 blocks, along with
each output's optimized NAND-decomposed network, are
then input to our technology mapping program.
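The normalization step can be sketched in a few lines of Python. The pattern strings, hit counts, and per-pattern delays below are made-up placeholders, and treating the average-case delay as a frequency-weighted sum of per-pattern delays is our simplifying assumption for illustration; the mapper itself works with bounds on this quantity.

```python
def normalize(counts):
    """Turn raw product-term hit counts into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no product term was ever exercised")
    return {pattern: count / total for pattern, count in counts.items()}


def average_delay(frequencies, delays):
    """Frequency-weighted delay over incompletely-specified patterns."""
    return sum(freq * delays[pattern] for pattern, freq in frequencies.items())


# Hypothetical simulation counts for two incompletely-specified patterns
# ('-' marks a don't-care input bit).
counts = {"1000101-": 3, "0101----": 1}
freqs = normalize(counts)

# Hypothetical critical-path delays of the mapped circuit for each pattern.
delays = {"1000101-": 2.0, "0101----": 4.0}
acd = average_delay(freqs, delays)
```

Here the frequent pattern dominates the result, so shortening its path pays off even if the rare pattern's path becomes longer.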
5.5 Experimental results
This section reports the technology mapping results
for both the Opcode1 and Opcode2 blocks, which are
the shaded blocks in Figure 6. A summary of the
complexity of each output's logic is given in Table 3.
Notice that the fourth column reports the number of
incompletely-specified input patterns which cause the
output to evaluate to 1. The fifth column reports the
number of nodes in the NAND-decomposed DAG. The
sixth column reports the relative frequency of each
output evaluating to a 1.
Figure 7: The frequency distribution of product terms
of Op1O1NoM and Op1Oc2M1. The first opcode byte
and the ModR/M byte are inputs of the product terms.
For Op1O1NoM, the inputs from the ModR/M byte
are don't-cares (not shown for simplicity).
Note that all mappings are performed using the lib2
gate library (available in the tool SIS [15]),
which is modified in two ways. First, we remove all
non-inverting gates because such gates cannot be used
in domino logic. Second, for each inverting static gate,
we add a corresponding dynamic gate with the same
area and delay characteristics. This makes it possible
for us to compare our results with those obtained using
worst-case mapping techniques that do not enforce the
domino constraint [5]. All experiments were performed
on a 120-MHz Pentium® processor with manageable
CPU times.
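The two library modifications can be sketched as a simple transformation. The dictionary-based gate records, the gate names, the `_dyn` suffix, and the numeric values are hypothetical and do not reflect the actual genlib format read by SIS.

```python
def make_domino_library(gates):
    """Drop non-inverting gates and add a dynamic twin for each inverting one.

    gates -- list of dicts with keys 'name', 'inverting', 'area', 'delay'
             (a hypothetical in-memory form of a gate library)
    """
    library = []
    for gate in gates:
        if not gate["inverting"]:
            continue  # non-inverting gates are removed
        library.append({**gate, "kind": "static"})
        # The dynamic twin copies the area and delay characteristics.
        library.append({**gate, "name": gate["name"] + "_dyn",
                        "kind": "dynamic"})
    return library


# Placeholder gates, not actual library entries.
gates = [
    {"name": "nand2", "inverting": True, "area": 2.0, "delay": 1.0},
    {"name": "and2", "inverting": False, "area": 3.0, "delay": 1.5},
]
domino_lib = make_domino_library(gates)
```

During matching, a static node would then only be offered the gates of kind `static` and a dynamic node only those of kind `dynamic`.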
Table 4 reports the average-case delays obtained by 
optimizing the logic for both the lower set (the 2nd 
and 3rd columns) and the upper set (the 4th and 5th 
columns). Not surprisingly, the results indicate that 
when we optimize for the lower pattern set, the lower 
bound is typically smaller than when we optimize for 
the upper pattern set. Similarly, optimizing for the up­
per pattern set leads to smaller upper bounds. When 
comparing circuits, we always try to be conservative 
and thus report the upper bound of our circuits. Con­
sequently, it appears that optimizing the upper bound 
of our circuits generally leads to more favorable con­
servative comparisons.
Interestingly, the ranges in average-case delay ob­
tained by optimizing for the upper pattern set are al­
ways smaller than those obtained by optimizing for 
the lower pattern set. This may be because the crit­
ical path for an upper pattern with high frequency is 
typically very short because it has been highly opti­
mized. Consequently, when the corresponding lower 
pattern is applied, the path is still critical. On the 
other hand, when we optimize for the lower pattern 
set, we may optimize for a critical path that is different
from the one that is critical for the upper pattern,
thereby yielding a large range.

Output      Col. 2  Col. 3  Patterns  Nodes  Frequency
Op1O1NoM    16      1       20        574    0.202
Op1Oc2M1    16      1       10        321    0.473
Op1O2NoM    16      1       9         322    0.114
Op1Oc3M1    18      1       6         280    0.065
Op1O3NoM    18      1       7         307    0.018
Op1Oc4M1    18      1       4         231    0.004
Op1O4NoM    8       1       1         56     0.001
Op1O5NoM    19      1       8         361    0.056
Op1Oc6M1    18      1       4         227    0.015
Op1O7NoM    12      1       2         112    0.000
Op2O2NoM    16      1       16        427    0.001
Op2Oc3M2    16      1       15        379    0.025
Op2O3NoM    6       1       1         42     0.000
Op2Oc4M2    11      1       2         95     0.001
Op2O4NoM    11      1       2         74     0.003
Op2O6NoM    5       1       1         33     0.022
Table 3: Summary of complexity of each combinational
logic output.
Table 4 also presents data derived from circuits obtained
using the worst-case mapping techniques described
in [5] (columns 7-10). Using this data we can
make two major comparisons.
First, we compare the average-case delay of our best
circuits (optimized for the upper pattern set) with the
average-case delay of circuits obtained with worst-case
mapping techniques. This is of interest because it
establishes the potential benefit of explicitly optimizing
average-case delay during technology mapping. To
be conservative, we compare the upper bound of our
mapped circuits with the lower bound of the circuits
derived using worst-case mapping techniques. The results
demonstrate that our circuits are on average at
least 31% faster than the worst-case mapped circuits.
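The conservative comparison can be checked against individual rows of Table 4. The Python sketch below applies the two formulas from the Table 4 caption to the Op1Oc2M1 row; the function and variable names are ours.

```python
def improvement(ours, baseline):
    """Percentage by which 'ours' is smaller than 'baseline'."""
    return 100.0 * (1.0 - ours / baseline)


# Op1Oc2M1 row of Table 4.
acd_upper_ours = 2.186   # upper bound, average-case mapper, upper set
acd_lower_worst = 3.459  # lower bound, worst-case mapper
wcd = 4.070              # worst-case delay of the worst-case mapped circuit

aa = improvement(acd_upper_ours, acd_lower_worst)  # column AA
aw = improvement(acd_upper_ours, wcd)              # column AW
```

Rounded to whole percent these give 37 and 46, matching the AA and AW entries reported for that output.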
Second, we can compare the average-case delay of
our circuits with the worst-case delay of the comparable
synchronous circuit. This comparison can give us
an estimate of the potential benefit of asynchronous
circuits. It is important to note, however, that this
estimate assumes that the synchronous circuit adopts
the same decomposition of blocks that is described
here. Specifically, we cannot account for the possibility
that a different decomposition might be better
suited for optimizing for worst-case delay. With this
caveat stated, the results indicate that our circuits are
on average at least 54% faster than the comparable
synchronous circuits.
6 Conclusions
The paper focuses on the design of asynchronous
combinational circuits that incorporate domino logic and
one-hot logic with timing assumptions that are easily
met. In particular, we discuss a novel technology
mapping technique for this design that builds on
existing work. We apply this technique to two
combinational logic blocks that are an integral part of a fast
asynchronous instruction length decoder.
We compare our circuits with those obtained using
a more conventional synchronous technology mapper
(that optimizes for worst-case delay). Our experimental
results suggest that the average-case delay of our
mapped circuit is at least 31% smaller than the
average-case delay of the conventionally-mapped circuit,
illustrating the utility of our new technique. Moreover,
the average-case delay of our circuit is more than 50%
smaller than the (worst-case) delay of the
conventionally-mapped circuit, demonstrating the potential
advantage of asynchronous one-hot domino circuits over
both synchronous implementations and conventional
bundled-data implementations.
Appendix: Proof of lemmas
Lemma 4.1 If all PI gates are dynamic and $F_i \subseteq F_j$,
then $E(i, l) \subseteq E(j, l)$ for every level $l$.
Proof: (By induction)
Base: Let $l = 1$. $F_i \subseteq F_j$. Since all PI gates are
dynamic, $E(i, 1) \subseteq E(j, 1)$.
Inductive hypothesis: For $l = k$, $E(i, k) \subseteq E(j, k)$.
Inductive step: Let $l = k + 1$. Let $g_{k+1} \in E(i, k+1)$.
First consider the case where $g_{k+1}$ has a controlling
input $f_k \in FI(g_{k+1})$ which is driven by a gate $g_k$ that
evaluates when $i$ is applied. Since $E(i, k) \subseteq E(j, k)$, $g_k$
must evaluate in pattern $j$. Since the controlling nature
of an input is pattern-independent (because an
evaluating gate always drives its output to a value
that is independent of the pattern applied), $f_k$ must
also be a controlling input of $g_{k+1}$ when $j$ is applied.
Therefore, $g_{k+1}$ must evaluate when $j$ is applied, i.e.,
$g_{k+1} \in E(j, k+1)$. Thus, $E(i, k+1) \subseteq E(j, k+1)$.
Now consider the case where $g_{k+1}$ evaluates and all
$f_k \in FI(g_{k+1})$ are non-controlling inputs to $g_{k+1}$ in
pattern $i$. Since $E(i, k) \subseteq E(j, k)$, all $g_k$'s must evaluate
and all corresponding $f_k$'s must be non-controlling
inputs of $g_{k+1}$ when $j$ is applied. Thus, $g_{k+1}$ must
evaluate when $j$ is applied, i.e., $g_{k+1} \in E(j, k+1)$.
Therefore, $E(i, k+1) \subseteq E(j, k+1)$. □
Lemma 4.2 If all PI gates are dynamic and $F_i \subseteq F_j$,
then, for every level $l$ and all $g \in E(i, l)$, we have that
$pat(j, g) \le pat(i, g)$.
Average-case Mapping vs. Worst-case Mapping
Columns 2-6: Average-case (AC); columns 7-10: Worst-case (WC); columns 11-12: Improve
Circuit      ACD_u^{a,l}  ACD_l^{a,l}  ACD_u^{a,u}  ACD_l^{a,u}  Area^{a,u}  ACD_u^w  ACD_l^w  WCD    Area^w  AA   AW
Op1O1NoM     3.672        1.570        2.965        2.229        114144      3.802    3.279    5.130  97904   10%  42%
Op1Oc2M1     2.404        2.131        2.186        2.018        55680       3.510    3.459    4.070  55216   37%  46%
Op1O2NoM     1.775        1.698        1.811        1.800        54752       4.012    4.002    4.440  51040   55%  59%
Op1Oc3M1     2.201        2.047        2.170        2.070        43616       3.206    3.151    4.010  46400   31%  46%
Op1O3NoM     2.821        2.821        2.821        2.821        50576       3.542    3.542    4.200  49648   20%  33%
Op1Oc4M1     2.761        2.761        2.761        2.761        33872       3.104    3.104    3.370  39440   11%  18%
Op1O4NoM     1.587        1.587        1.587        1.587        6032        1.587    1.587    1.590  6032    0%   0%
Op1O5NoM     2.800        2.800        2.800        2.800        61248       3.516    3.516    4.200  61248   20%  33%
Op1Oc6M1     2.647        2.647        2.647        2.647        37584       3.181    3.181    3.370  39440   17%  21%
Op1O7NoM     2.400        2.400        2.400        2.400        14384       2.400    2.400    2.750  14384   0%   13%
Ave.(Op1)    2.620        2.017        2.364        2.114        471888      3.604    3.462    5.130  460752  32%  54%
Op2O2NoM     4.391        1.810        2.150        2.043        77024       4.270    4.181    4.790  70528   49%  55%
Op2Oc3M2     3.206        1.507        2.567        2.224        74704       3.624    3.137    4.680  65424   18%  45%
Op2O3NoM     1.400        1.400        1.400        1.400        5104        1.400    1.400    1.440  5104    0%   3%
Op2Oc4M2     1.675        1.675        1.675        1.675        12064       2.403    2.403    2.410  12064   30%  30%
Op2O4NoM     1.537        1.537        1.537        1.537        10208       1.627    1.627    2.070  9280    6%   26%
Op2O6NoM     1.335        1.335        1.335        1.335        4640        1.335    1.335    1.330  4640    0%   0%
Ave.(Op2)    2.311        1.445        1.962        1.794        183744      2.530    2.293    4.790  167040  14%  59%
Ave.(Op1+2)  2.604        1.987        2.343        2.098        655632      3.548    3.401    5.130  627792  31%  54%
Table 4: Delay and area of average-case mapping vs. delay and area of worst-case mapping. ACD denotes
the average-case delay while WCD denotes the worst-case delay. Subscripts and superscripts on ACD and Area
denote the type of optimization performed and the bound of the average-case delay reported. Specifically, the
superscript a denotes the use of our average-case mapper while w denotes the use of the worst-case mapper. The
superscripts u and l denote that the optimization is performed for the upper set and the lower set, respectively. In
contrast, the subscripts u and l denote that the numbers reported are the upper and lower bound of the average-case
delay, respectively. For the percentage improvements, the numbers in column AA are computed using
$(1 - ACD_u^{a,u} / ACD_l^{w}) \times 100\%$, and the numbers in column AW are computed using $(1 - ACD_u^{a,u} / WCD) \times 100\%$.
Proof: (By induction)
Base: Let $l = 1$. Since $F_i \subseteq F_j$, according to Lemma
4.1, $E(i, 1) \subseteq E(j, 1)$. For $g \in E(i, 1)$, the two conditions
$FC(i, g) \subseteq FC(j, g)$ and $FNC(i, g) = FNC(j, g)$
must hold. Thus, $pat(j, g) \le pat(i, g)$.
Inductive hypothesis: For $l = k$, and for all $g_k \in E(i, k)$,
$pat(j, g_k) \le pat(i, g_k)$.
Inductive step: Let $l = k + 1$. Consider $g_{k+1} \in E(i, k+1)$.
Case 1: $g_{k+1}$ has a controlling input. According to
Equation 1,
$pat(i, g_{k+1}) = \min_{f_k \in FC(i, g_{k+1})} [ pat(i, g_k) + d(i, g_{k+1}, f_k) ]$, and
$pat(j, g_{k+1}) = \min_{f_k \in FC(j, g_{k+1})} [ pat(j, g_k) + d(j, g_{k+1}, f_k) ]$.
According to Lemma 4.1, since $g_k$ evaluates when $i$ is
applied, $g_k$ must evaluate when $j$ is applied. Since $f_k$
is a controlling input for $g_{k+1}$ in $i$, we know that it
must be a controlling input for $g_{k+1}$ in $j$. Thus,
$FC(i, g_{k+1}) \subseteq FC(j, g_{k+1})$. By the inductive
hypothesis we also know that $pat(j, g_k) \le pat(i, g_k)$.
Moreover, we know that the pin-to-pin delay
of an evaluating gate is pattern independent, i.e.,
$d(i, g_{k+1}, f_k) = d(j, g_{k+1}, f_k)$. Therefore, we conclude
that $pat(j, g_{k+1}) \le pat(i, g_{k+1})$.
Case 2: $g_{k+1}$ has only non-controlling inputs. From
Equation 2,
$pat(i, g_{k+1}) = \max_{f_k \in FNC(i, g_{k+1})} [ pat(i, g_k) + d(i, g_{k+1}, f_k) ]$, and
$pat(j, g_{k+1}) = \max_{f_k \in FNC(j, g_{k+1})} [ pat(j, g_k) + d(j, g_{k+1}, f_k) ]$.
According to Lemma 4.1, all $g_k$'s that evaluate in $i$
must evaluate in $j$. Since all $f_k$'s are non-controlling
in $i$ they must be non-controlling in $j$. Therefore,
$FNC(i, g_{k+1}) = FNC(j, g_{k+1})$ must hold. Also, by the
inductive hypothesis we know that
$pat(j, g_k) \le pat(i, g_k)$. Moreover, we know that
$d(i, g_{k+1}, f_k) = d(j, g_{k+1}, f_k)$. Thus, we conclude that
$pat(j, g_{k+1}) \le pat(i, g_{k+1})$. □
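The two recurrences used in the proof can be illustrated with a short Python sketch. It follows Equations 1 and 2 as they are quoted above (minimum over controlling inputs, maximum over non-controlling inputs); the tuple-based input format and the function name are assumptions made for illustration only.

```python
def pattern_arrival_time(inputs):
    """Arrival time at a gate output for one input pattern.

    inputs -- list of (driver_arrival, pin_delay, is_controlling) tuples,
              one per evaluated fan-in (a hypothetical representation)
    """
    controlling = [arrival + delay
                   for arrival, delay, is_ctrl in inputs if is_ctrl]
    if controlling:
        # Equation 1: the earliest controlling input decides the output.
        return min(controlling)
    # Equation 2: with only non-controlling inputs, the latest one decides.
    return max(arrival + delay for arrival, delay, _ in inputs)


# Pattern i: both inputs are non-controlling.
pat_i = pattern_arrival_time([(1.0, 0.5, False), (3.0, 0.5, False)])
# A second pattern in which the early input is controlling instead.
pat_early = pattern_arrival_time([(1.0, 0.5, True), (3.0, 0.5, False)])
```

The second call shows why a controlling input can shorten the delay: the gate no longer waits for its slowest fan-in.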
Acknowledgments
We would like to acknowledge Peter Yeh, You-Pyo 
Hong, and Aiguo Xie of the University of Southern 
California for helpful comments on this paper.
References
[1] Intel architecture software developer’s manual, 
volume 2: Instruction set reference manual, 
http://developer.intel.com/design.
[2] P. A. Beerel, K. Y. Yun, and W.-C. Chou. A heuristic
covering technique for optimizing average-case delay
in the technology mapping of asynchronous burst-mode
circuits. In Proc. European Design Automation
Conference (EURO-DAC), September 1996.
[3] P. A. Beerel, K. Y. Yun, and W.-C. Chou. Optimizing
average-case delay in technology mapping of
burst-mode circuits. In Proc. International Symposium
on Advanced Research in Asynchronous Circuits
and Systems, April 1996.
[4] M. Benes, A. Wolfe, and S. M. Nowick. A high-speed
asynchronous decompression circuit for embedded
processors. In Proceedings of the 17th Conference
on Advanced Research in VLSI, Los Alamitos, CA,
September 1997. IEEE Computer Society Press.
[5] K. Chaudhary and M. Pedram. Computing the area 
versus delay trade-off curves in technology mapping. 
IEEE Transactions on Computer-Aided Design, pages 
1480-1489, December 1995.
[6] A. Davis and S. M. Nowick. Asynchronous circuit
design: Motivation, background, and methods.
In Graham Birtwistle and Al Davis, editors,
Asynchronous Digital Circuit Design, Workshops in
Computing, pages 1-49. Springer-Verlag, 1995.
[7] S. B. Furber, J. D. Garside, S. Temple, J. Liu, P. Day,
and N. C. Paver. AMULET2e: An asynchronous embedded
controller. In Proc. International Symposium
on Advanced Research in Asynchronous Circuits and
Systems. IEEE Computer Society Press, April 1997.
[8] S. B. Furber and J. Liu. Dynamic logic in four-phase 
micropipelines. In Proc. International Symposium on 
Advanced Research in Asynchronous Circuits and Sys­
tems. IEEE Computer Society Press, March 1996.
[9] J. Kessels and P. Marston. Designing asynchronous
standby circuits for a low-power pager. In Proc.
International Symposium on Advanced Research in
Asynchronous Circuits and Systems. IEEE Computer
Society Press, April 1997.
[10] A. J. Martin, S. M. Burns, T. K. Lee, D. Borkovic, 
and P. J. Hazewindus. The design of an asynchronous 
microprocessor. In Charles L. Seitz, editor, Advanced 
Research in VLSI: Proceedings of the Decennial Cal­
tech Conference on VLSI, pages 351-373. MIT Press, 
1989.
[11] C. J. Myers. Private communication, July 1995. C. 
J. Myers is an assistant professor at the University of 
Utah.
[12] S. M. Nowick. Design of a low-latency asynchronous 
adder using speculative completion. IEE Proceed­
ings, Part E, Computers and Digital Techniques, 
143(5):301-307, September 1996.
[13] S. M. Nowick, K. Y. Yun, P. A. Beerel, and A. E. 
Dooply. Speculative completion for the design of high- 
performance asynchronous dynamic adders. In Proc. 
International Symposium on Advanced Research in 
Asynchronous Circuits and Systems. IEEE Computer 
Society Press, April 1997.
[14] R. Rudell. Logic Synthesis for VLSI Design. PhD
thesis, U. C. Berkeley, April 1989. Memorandum
UCB/ERL M89/49.
[15] E. M. Sentovich, K. J. Singh, L. Lavagno, C. Moon, 
R. Murgai, A. Saldanha, H. Savoj, P. R. Stephan, 
R. K. Brayton, and A. Sangiovanni-Vincentelli. SIS: 
A system for sequential circuit synthesis. Technical 
Report UCB/ERL M92/41, University of California, 
Berkeley, May 1992.
[16] P. Siegel, G. De Micheli, and D. Dill. Automatic
technology mapping for generalized fundamental-mode
asynchronous designs. In Proc. ACM/IEEE Design
Automation Conference, pages 61-67, June 1993.
[17] H. J. Touati, C. W. Moon, R. K. Brayton, and
A. Wang. Performance-oriented technology mapping.
In W. J. Dailey, editor, 6th MIT Conference on
Advanced VLSI Conference, pages 79-97, 1995.
[18] S. H. Unger. Asynchronous Sequential Switching Cir­
cuits. Wiley-Interscience, John Wiley & Sons, Inc., 
New York, 1969.
[19] K. van Berkel, R. Burgess, J. Kessels, M. Roncken, 
F. Saeijs, and A. Peeters. Asynchronous circuits for 
low power: A DCC error corrector. IEEE Design & 
Test of Computers, pages 22-32, Summer 1994.
[20] N. H. E. Weste and K. Eshraghian. Principles of
CMOS VLSI Design. Addison-Wesley, 2nd edition,
1993.
[21] T. E. Williams. Dynamic logic: Clocked and asyn­
chronous, 1996. ISSCC Tutorial.
[22] T. E. Williams and M. A. Horowitz. A zero-overhead 
self-timed 160ns 54b CMOS divider. IEEE Journal 
of Solid-State Circuits, 26(11):1651-1661, November 
1991.
[23] K. Y. Yun, P. A. Beerel, V. Vakilotojar, A. E. Dooply,
and J. Arceo. The design and verification of a
high-performance low-control-overhead asynchronous
differential equation solver. In Proc. International
Symposium on Advanced Research in Asynchronous
Circuits and Systems. IEEE Computer Society Press,
April 1997.
