Design and analysis of up-down counters  by Segers, John & Ebergen, Jo C.
Science of Computer Programming 27 (1996) 185-204 (Elsevier)
Design and analysis of up-down counters ¹
John Segers a,*,², Jo C. Ebergen b
a Department of Mathematics and Computing Science, Eindhoven University of Technology,
P.O. Box 513, 5600 MB Eindhoven, The Netherlands
b Department of Computer Science, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1
Received October 1994; revised September 1995 
Communicated by M. Rem 
Abstract 
Several designs for an up-down N-counter are derived for any N > 0. The up-down N-counter
has a simple specification, but allows many non-trivial, efficient implementations. All designs
are represented by means of a CSP-like program and are analyzed with respect to area, response
time, and power consumption. Our final design has optimal area of Θ(log N), a bounded response
time, and a bounded power consumption.
1. Introduction 
By means of a small example we illustrate how techniques from parallel program 
design can be used in the design and analysis of a digital circuit. The example concerns 
the design of an up-down counter. The reason for choosing the up-down counter
is that it has a very simple specification but allows many non-trivial and efficient 
implementations. 
An informal description of the communication behavior of an up-down N-counter,
for N > 0, is as follows. (We often write ‘N-counter’ when referring to an up-down
N-counter.) Two operations can be performed on an up-down N-counter, namely an
up and a down. The current count - or, simply, count - is defined as the number
of up’s minus the number of down’s that have been received. Inputs to the counter
are acknowledged by one of three outputs. If an input makes the count 0, then the
message empty is sent to acknowledge the input. If an input makes the count equal
to N, then the message full is sent. If the result of an input is a count strictly between 0 and
N, the message neither is sent. Initially, the value of the count is 0. The environment
of the counter is assumed not to cause underflow or overflow, i.e., the environment
will not attempt an operation that makes the count less than 0 or greater than N.
* Corresponding author.
¹ This work is partially supported by the Natural Sciences and Engineering Research Council of Canada
under grant OGP0041920 and by a grant from the Information Technology Research Centre of Ontario.
² This work was done while this author was visiting the Department of Computer Science of the University
of Waterloo.
0167-6423/96/$15.00 © 1996 Elsevier Science B.V. All rights reserved
PII S0167-6423(96)00005-6
The task we set ourselves is to design an efficient implementation for up-down
N-counters, for any N > 0. An implementation consists of a set of communicating
components, where the size of the components is independent of the size of the
up-down counter. To measure the efficiency of an implementation, we calculate estimates
of the area, response time, and power consumption of certain hardware implementations. 
Informally, the area of an implementation is represented by a first-order approximation 
of the number of components in the implementation. The response time is represented 
by a first-order approximation of the delay that may occur between an input to the 
counter implementation and the succeeding acknowledgment. The power consumption is 
represented by a first-order approximation of the total amount of work that is performed 
per input to the counter, where work is measured in number of communications. The 
measure for power consumption is inspired by the research of Van Berkel et al. [15]. 
Up-down N-counters can be used, for instance, in hardware implementations of 
bounded queues, stacks, and semaphores. In [3], for example, an up-down counter is
used to implement a bounded FIFO-queue. Many hardware implementations of
up-down counters are commercially available or are described in the literature. See, for
example, [5, 7, 11, 12] (sometimes, the specifications of these published designs are
only slight variations on the counter described above). We found that all published
designs are described using techniques from switching theory or graphical methods. 
Our goal is to present some designs using techniques from parallel programming. In 
[8] a similar approach is taken for the design of a modulo-N counter. 
Furthermore, all published designs are clocked designs, have a logarithmic area, 
usually do not give a response time analysis, do not give a power consumption analysis, 
and only apply to very few values of N (usually only for values 2^k - 1, k > 0). We
present several designs for the up-down N-counter that also have a logarithmic area, but
unlike all previously known designs, can be implemented easily as unclocked circuits, 
have a bounded response time, have a bounded power consumption, and apply to any 
N > 0. These bounds are asymptotically optimal. 
Since the semantics of our program notation is based on directed trace structures, 
we first give a formal specification of the up-down counter in terms of a directed trace
structure. Then we present a program for the up-down counter and a first implementation.
Although our first implementations do not exhibit any parallel behavior, these
sequential implementations achieve optimal area and power consumption, but not optimal
response time. Introducing parallelism then leads to a number of implementations
with optimal growth rates for the three performance measures. Finally, we discuss some
issues in obtaining hardware implementations of up-down counters using our designs
in terms of communicating components.
2. Specification of an up-down N-counter 
Since the semantic domain of our program notation consists of directed trace structures,
we first give a specification of the up-down counter in terms of a directed trace
structure.
A directed trace structure is a triple (I, O, T). I and O are finite sets, and T is a set
of finite strings over I ∪ O. I is called the input alphabet, O the output alphabet, and
T the trace set. For a trace structure S, the input alphabet, output alphabet, and trace
set are denoted by iS, oS, and tS. The abbreviation aS is used to denote iS ∪ oS, the
alphabet of S.
In case a directed trace structure S describes a communicating process, iS is the set
of input ports, oS is the set of output ports, and tS is the set of all possible sequences
of communications on the ports. The sets of inputs and outputs are required to be
disjoint, and the trace set is required to be non-empty and prefix-closed. The
non-emptiness comes from the property that the empty trace is always a possible behavior,
and the prefix-closedness comes from the property that every prefix of a possible
behavior is a possible behavior as well. This trace semantics is based on [6]. The
only difference is that in [6] trace structures are not directed, since they have only one
alphabet.
Message passing on input or output ports is modeled by taking the Cartesian product
of port name and data type in the alphabets, instead of just the port name. Communication
on a port p is then written as a pair (p, v), where v is the communicated value.
The set of values that can be sent or received on a port is called the port’s data type.
This approach is similar to the one taken in [14].
For the specification of the up-down counter we use two data types: ud stands for 
the set {up, down} and enf stands for the set {empty, neither, full}.
The non-empty, prefix-closed trace structure specifying the behavior of the up-down 
N-counter, UDC(N), is given as follows. For any N > 0, the input alphabet of the 
trace structure consists of a port r with data type ud, and the output alphabet consists of 
a port a with data type enf. All elements of the trace set are alternations of inputs and 
outputs, starting with an input (except the empty trace). Furthermore, for every trace, 
the number of up’s minus the number of down’s on r is at least 0 and at most N. The 
value communicated on port a is empty iff the number of up’s minus the number of 
down’s in all preceding communications on r is 0. The value communicated on a is full
iff the number of up’s minus the number of down’s in all preceding communications 
on r is N. In all other cases the value communicated on a is neither. The largest trace 
set that is prefix closed and that satisfies the above requirements is the trace set for 
UDC(N).
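The trace-set requirements above can be checked mechanically for any finite trace. The following Python sketch is an illustration only and is not part of the paper's notation; the function name `is_udc_trace` and the encoding of communications as `(port, value)` string pairs are assumptions made here.

```python
def is_udc_trace(trace, n_max):
    """Return True if `trace` satisfies the requirements stated for tUDC(n_max).

    `trace` is a list of (port, value) pairs such as ("r", "up") or ("a", "full").
    This encoding is an assumption of the sketch, not the paper's notation.
    """
    count = 0                 # number of up's minus number of down's so far
    expect_input = True       # traces alternate, starting with an input on r
    for port, value in trace:
        if expect_input:
            if port != "r" or value not in ("up", "down"):
                return False
            count += 1 if value == "up" else -1
            if not 0 <= count <= n_max:     # no underflow or overflow
                return False
        else:
            if count == 0:
                required = "empty"
            elif count == n_max:
                required = "full"
            else:
                required = "neither"
            if port != "a" or value != required:
                return False
        expect_input = not expect_input
    return True
```

Because the check proceeds left to right and rejects at the first violation, every prefix of an accepted trace is accepted as well, which mirrors the prefix-closedness of the trace set.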
Our program notation is a mix of CSP (see [6]) and Dijkstra’s guarded commands 
(see [2]). We introduce the notation by means of a simple example. The example is 
a program for a component that accepts bits as inputs and copies different values to
different ports.
gc (in a : bit, out b, c : bit) =
    |[ var v : bit ::
       pref *[ a?v; if v = 0 then b!v [] v = 1 then c!v fi ]
    ]|.
The first line contains the name of the command and its parameters. The notation ‘in a’
denotes that a is an input port, ‘out b, c’ denotes that b and c are output ports. All
three ports are of type bit, which is the set {0, 1}.
The next line shows the declaration of the local variable v. Brackets |[ and ]| delineate
the scope of local variables. As usual, the result of a?v is that a value is received on a
and stored in variable v. The result of b!v is that the value of v is sent on b. If a port
is declared as an input port, it can only be used to receive values. A port declared as 
an output port can only be used to send values. Sequential composition is denoted by 
the semicolon. We use *[S] to denote unbounded repetition of S. The prefix-operator 
pref is used to construct prefix-closed trace structures. 
A guarded selection with two alternatives has the form 
if G0 then S0 [] G1 then S1 fi.
(In general there may be more than two alternatives.) The guards G0 and G1 are
Boolean expressions, S0 and S1 are guarded commands. In all our programs exactly
one of the guards evaluates to true when a guarded selection statement is executed.
The execution of the guarded selection amounts to the execution of the command with
the guard that is true.
The trace structure S associated with the program above can be given as follows:
iS = {(a, 0), (a, 1)}, oS = {(b, 0), (b, 1), (c, 0), (c, 1)},
and tS contains all finite concatenations of (a, 0)(b, 0) and (a, 1)(c, 1), and their prefixes.
It is easy to write a program for the up-down N-counter in our program notation.
A variable n is used to record the number of up’s minus the number of down’s. This
is expressed by program invariant I0:
I0 : n = r#up - r#down ∧ 0 ≤ n ≤ N.
Here r#down denotes the number of down’s received on port r, and r#up is defined
analogously. Conjunct 0 ≤ n ≤ N
can be added as a result of the assumption that no underflow or overflow is caused by
the environment.
The value of n is used to determine which value is sent on port a. If n becomes 0 as
a result of an input, then the value empty is sent. If n becomes N, then the value full
is sent. In all other cases the value neither is sent. The value to be sent is stored in
variable va, and an output a!va communicates the value of va. Invariant I1 expresses
the relation between n and va:
I1 : (va = empty ≡ n = 0) ∧
     (va = neither ≡ 0 < n < N) ∧
     (va = full ≡ n = N).
The command for the up-down N-counter is named UDC and given below.
UDC (N : int, in r : ud, out a : enf) =
    {N > 0}
    |[ var n : 0..N, var vr : ud, var va : enf ::
       pref (n, va := 0, empty
             {Inv.: I0 ∧ I1}
            ; *[ r?vr
               ; if vr = up
                 then {n < N, see below}
                      n := n + 1
                    ; if n ≠ N then va := neither [] n = N then va := full fi
                 [] vr = down
                 then {n > 0, see below}
                      n := n - 1
                    ; if n ≠ 0 then va := neither [] n = 0 then va := empty fi
                 fi
               ; a!va
               ]
            )
    ]|.
Under the assumption that the environment does not cause overflow, assertion n < N
is valid after guard vr = up. By symmetry, n > 0 is a valid assertion after guard vr =
down.
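For readers who prefer an executable model, command UDC can be transliterated into a small sequential Python class. This is a sketch under assumptions: the class name `UDC`, the method name `request`, and the use of strings for the data types ud and enf are choices made here, and a method call stands in for the pair of communications r?vr; a!va.

```python
class UDC:
    """Sequential model of the up-down N-counter (illustrative names)."""

    def __init__(self, n_max):
        assert n_max > 0              # precondition {N > 0}
        self.n_max = n_max
        self.n = 0                    # invariant I0: 0 <= n <= N
        self.va = "empty"             # invariant I1 relates va to n

    def request(self, vr):
        """Model r?vr followed by a!va; returns the acknowledgment."""
        if vr == "up":
            assert self.n < self.n_max, "environment caused overflow"
            self.n += 1
            self.va = "full" if self.n == self.n_max else "neither"
        elif vr == "down":
            assert self.n > 0, "environment caused underflow"
            self.n -= 1
            self.va = "empty" if self.n == 0 else "neither"
        else:
            raise ValueError("vr must be 'up' or 'down'")
        return self.va
```

The two assertions correspond to the assertions n < N and n > 0 in the program, which hold only because the environment is assumed not to cause overflow or underflow.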
Obviously, the program above depends on N. We would like to implement this 
program by a number of communicating components, each of which is independent of 
N. The number of components of the implementation, however, may depend on N. 
3. Implementation 
To obtain an implementation for an N-counter in terms of a set of communicating 
components, we use an inductive approach. The idea is to decompose UDC into a 
component with a bounded number of states and an M-counter, where M < N. The 
component and the M-counter communicate with each other on internal channels. 
With this in mind, we start by representing n by two variables d and m. We maintain 
the relation n = Bm + d for some B greater than 0. In the next step we try to replace
operations on m by communications with an M-counter. Therefore we want 0 ≤ m ≤ M
to be an invariant. Which values should we allow d to have? If we want to maintain
n = Bm + d for n < B, we must at least allow 0 ≤ d < B. We introduce a parameter
D for an upper bound for d, with the requirement that D ≥ B - 1.
The maximum value that Bm + d can have is BM + D. So B, M, and D have to be
chosen in such a way that N = BM + D, with B > 0, M > 0, D ≥ B - 1, and M < N.
The last requirement is necessary to make the induction work: we have to make sure
that we decompose an N-counter into a component and a smaller counter. Since we
already have B > 0 and D ≥ B - 1, requirement M < N can be replaced by D > 0.
Note that for any N ≥ 2 we can find B, M, and D satisfying the conditions (for
example, B = 1, M = N - 1, and D = 1).
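The constraints on B, M, and D can be explored by brute force. The Python sketch below enumerates all admissible triples for a given N; the function name `decompositions` is hypothetical and the enumeration is only an illustration of the conditions stated above.

```python
def decompositions(n):
    """All (B, M, D) with n = B*M + D, B > 0, M > 0, D >= B - 1, and D > 0."""
    result = []
    for b in range(1, n + 1):
        for m in range(1, n // b + 1):
            d = n - b * m
            if d > 0 and d >= b - 1:
                result.append((b, m, d))
    return result
```

For N = 1 the list is empty, which is why the 1-counter has to serve as the basis of the induction; for every N of at least 2 the list is non-empty and every triple in it has M < N.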
Now we turn our attention to the transformation of the invariants and the program.
We already have a new candidate for I0, namely
I0′ : Bm + d = r#up - r#down ∧ 0 ≤ m ≤ M ∧ 0 ≤ d ≤ D.
In order to maintain I0′ and the relation n = Bm + d, an increment n := n + 1 can be
implemented by
if d ≠ D then d := d + 1 [] d = D then d, m := D - B + 1, m + 1 fi
A decrement n := n - 1 can be implemented by
if d ≠ 0 then d := d - 1 [] d = 0 then d, m := B - 1, m - 1 fi
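That these two guarded selections really implement n := n + 1 and n := n - 1 while keeping m and d within their ranges can be confirmed exhaustively for small parameters. The following Python sketch does so; the function names are illustrative and not taken from the paper.

```python
def increment(m, d, B, D):
    """Increment of n = B*m + d, following the guarded selection in the text."""
    if d != D:
        return m, d + 1
    return m + 1, D - B + 1


def decrement(m, d, B, D):
    """Decrement of n = B*m + d, following the guarded selection in the text."""
    if d != 0:
        return m, d - 1
    return m - 1, B - 1


def preserves_value(B, M, D):
    """Check both operations on every state (m, d) with 0 <= m <= M, 0 <= d <= D."""
    N = B * M + D
    for m in range(M + 1):
        for d in range(D + 1):
            n = B * m + d
            if n < N:                         # no overflow by assumption
                m2, d2 = increment(m, d, B, D)
                if B * m2 + d2 != n + 1 or not (0 <= m2 <= M and 0 <= d2 <= D):
                    return False
            if n > 0:                         # no underflow by assumption
                m2, d2 = decrement(m, d, B, D)
                if B * m2 + d2 != n - 1 or not (0 <= m2 <= M and 0 <= d2 <= D):
                    return False
    return True
```

Note that when D > B - 1 a count n can have several representations (m, d), so the check ranges over all pairs rather than over values of n.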
Furthermore, since N = BM + D, we also have
n = 0 ≡ m = 0 ∧ d = 0
n = N ≡ m = M ∧ d = D.
This means that we can translate I1 directly into
I1′ : (va = empty ≡ m = 0 ∧ d = 0)
    ∧ (va = neither ≡ ¬(m = 0 ∧ d = 0) ∧ ¬(m = M ∧ d = D))
    ∧ (va = full ≡ m = M ∧ d = D).
After similar transformations to the program UDC we get
C (B, M, D : int, in r : ud, out a : enf) =
    {M > 0 ∧ B > 0 ∧ D > 0 ∧ D ≥ B - 1}
    |[ var m : 0..M, var d : 0..D, var vr : ud, var va : enf ::
       pref (m, d, va := 0, 0, empty
             {Inv.: I0′ ∧ I1′}
            ; *[ r?vr
               ; if vr = up
                 then {m < M ∨ d < D}
                      if d ≠ D then d := d + 1
                      [] d = D then d, m := D - B + 1, m + 1
                      fi
                    ; if m ≠ M ∨ d ≠ D then va := neither
                      [] m = M ∧ d = D then va := full
                      fi
                 [] vr = down
                 then {m > 0 ∨ d > 0}
                      if d ≠ 0 then d := d - 1 [] d = 0 then d, m := B - 1, m - 1 fi
                    ; if m ≠ 0 ∨ d ≠ 0 then va := neither
                      [] m = 0 ∧ d = 0 then va := empty
                      fi
                 fi
               ; a!va
                 {I0′ ∧ I1′}
               ]
            )
    ]|.
Let us take a closer look at this program, in particular the statements involving variable
m. The only fragments that contain variable m are the Boolean expressions m = 0,
m = M, m ≠ 0, m ≠ M, and the assignments m := m - 1 and m := m + 1. This suggests
that an up-down M-counter can be used to implement the operations on variable m.
Consequently, we introduce an M-counter with ports sr and sa and variable vsa to
store the value received at port sa. The assignment m := m + 1 can be implemented
by sr!up; sa?vsa and, similarly, m := m - 1 can be implemented by sr!down; sa?vsa.
Using the induction hypothesis for the up-down M-counter with M < N, we can
implement the expression m = 0 by vsa = empty. Similarly, expressions m ≠ 0,
m = M, and m ≠ M can be implemented by vsa ≠ empty, vsa = full, and vsa ≠ full,
respectively.
Because we made the assumption for any up-down M-counter that the environment
will not cause any overflow or underflow, we have to check whether this condition
is satisfied for the up-down M-counter in our decomposition. This is easy, since the
invariant 0 ≤ m ≤ M in the above program guarantees that no overflow or underflow
occurs in the M-counter.
Performing the substitution of sr#up - sr#down for m and the corresponding substitutions
for the expressions m = 0, m = M, m ≠ 0, and m ≠ M leads to the following
invariants
I0″ : B * (sr#up - sr#down) + d = r#up - r#down ∧ 0 ≤ d ≤ D
and
I1″ : (va = empty ≡ vsa = empty ∧ d = 0) ∧
      (va = neither ≡ ¬(vsa = empty ∧ d = 0) ∧ ¬(vsa = full ∧ d = D)) ∧
      (va = full ≡ vsa = full ∧ d = D).
After performing the substitutions in the program, we get a component C0, given
by
C0 (B, D : int, in r : ud, out a : enf, out sr : ud, in sa : enf) =
    {B > 0 ∧ D > 0 ∧ D ≥ B - 1}
    |[ var d : 0..D, var vr : ud, var va, vsa : enf ::
       pref (d, va, vsa := 0, empty, empty
             {Inv.: I0″ ∧ I1″}
            ; *[ r?vr
               ; if vr = up
                 then if d ≠ D then d := d + 1
                      [] d = D then d := D - B + 1 || (sr!up; sa?vsa)
                      fi
                    ; if vsa ≠ full ∨ d ≠ D then va := neither
                      [] vsa = full ∧ d = D then va := full
                      fi
                 [] vr = down
                 then if d ≠ 0 then d := d - 1
                      [] d = 0 then d := B - 1 || (sr!down; sa?vsa)
                      fi
                    ; if vsa ≠ empty ∨ d ≠ 0 then va := neither
                      [] vsa = empty ∧ d = 0 then va := empty
                      fi
                 fi
               ; a!va
                 {I0″ ∧ I1″}
               ]
            )
    ]|.
The execution of sr!up corresponds to a carry propagation. Similarly, sr!down corresponds
to a borrow propagation.
We use ‘||’ to denote parallel composition of statements or components. Parallel
composition is interleaving with synchronization on common ports or symbols [6].
With the derivation above we have proved that
UDC(BM + D, r, a) = |[ sr, sa :: C0(B, D, r, a, sr, sa) || UDC(M, sr, sa) ]|
for B > 0, M > 0, D ≥ B - 1, and D > 0. The brackets |[ and ]| delineate the scope
of the internal channels sr and sa.
For the basis of our inductive approach we use an up-down 1-counter. For 2 ≤ N
the inductive step applies. An up-down 1-counter is easy to specify:
UDC1 (in r : ud, out a : enf) =
    pref *[ r?up; a!full; r?down; a!empty ]
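The decomposition can be exercised with a small sequential simulation in Python. This is a sketch under assumptions: a method call `request` models a request followed by its acknowledgment, the class and function names (`UDC1`, `C0`, `build`, `capacity`, `check`) are chosen here, and the parallel composition is flattened into nested calls, which is adequate because the cells as derived so far exhibit no parallel behavior.

```python
class UDC1:
    """Up-down 1-counter: pref *[ r?up; a!full; r?down; a!empty ]."""

    def __init__(self):
        self.count = 0

    def request(self, vr):
        if vr == "up":
            assert self.count == 0, "overflow"
            self.count = 1
            return "full"
        assert vr == "down" and self.count == 1, "underflow"
        self.count = 0
        return "empty"


class C0:
    """Cell C0(B, D); `sub` plays the role of the counter behind ports sr and sa."""

    def __init__(self, B, D, sub):
        assert B > 0 and D > 0 and D >= B - 1
        self.B, self.D, self.sub = B, D, sub
        self.d, self.va, self.vsa = 0, "empty", "empty"

    def request(self, vr):
        if vr == "up":
            if self.d != self.D:
                self.d += 1
            else:                                   # carry propagation
                self.d = self.D - self.B + 1
                self.vsa = self.sub.request("up")
            full = self.vsa == "full" and self.d == self.D
            self.va = "full" if full else "neither"
        else:
            if self.d != 0:
                self.d -= 1
            else:                                   # borrow propagation
                self.d = self.B - 1
                self.vsa = self.sub.request("down")
            empty = self.vsa == "empty" and self.d == 0
            self.va = "empty" if empty else "neither"
        return self.va


def build(params):
    """Chain of C0 cells, front cell first, terminated by a UDC1."""
    counter = UDC1()
    for B, D in reversed(params):
        counter = C0(B, D, counter)
    return counter


def capacity(params):
    """N obtained by applying N = B*M + D from the 1-counter outwards."""
    n = 1
    for B, D in reversed(params):
        n = B * n + D
    return n


def check(params):
    """Compare the chain with the specification on zigzag input sequences."""
    n_max, counter, count = capacity(params), build(params), 0
    for peak in range(1, n_max + 1):
        for op in ["up"] * peak + ["down"] * peak:
            count += 1 if op == "up" else -1
            expected = ("empty" if count == 0
                        else "full" if count == n_max else "neither")
            if counter.request(op) != expected:
                return False
    return True
```

For example, `check([(1, 1), (2, 1), (1, 1)])` drives a 6-counter built from three cells and a 1-counter through all counts from 0 to 6 and back.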
4. Performance analysis 
Using the strategy described above, an N-counter can be implemented in a number 
of ways by choosing different values for B and D for the cells that implement the 
counter. How do we compare these implementations? We consider three performance 
criteria that are related to the area, response time, and power consumption of certain 
hardware implementations. For the area we consider the number of basic components 
in the implementation; for the response time we consider upper bounds for the delay 
between an input and the succeeding output of the implementation; and for the power 
consumption we consider the number of internal communications amortized over the 
number of external input operations. We only study first-order approximations of these 
performance measures. The notations O(f(n)), Ω(f(n)), and Θ(f(n)) are used to
indicate a first-order approximation for an upper bound, lower bound, and tight bound 
of a function of n. 
The names of ports do not influence the properties we want to investigate. Therefore, 
we often refer to a cell C0 as C0(B, D) with specific values for B and D and without
the names of the ports. 
We consider up-down N-counter implementations consisting of a number of components.
We assume that the number of components is K, one of which is a small
up-down counter. We number the components, starting with 0 for the front cell, and
ending with K - 1 for the small counter, which is the last component in the array. We
refer to the local variables and actual parameters of component i by adding a subscript
i. For example, in an implementation consisting of C0 cells, we use d_0 to refer to
the local count of the front cell, that is, the value of the local variable d in the front
cell. For the end cell, we use d_{K-1} to refer to its local count. For the values of the
parameters B and D for component i, we use B_i and D_i.
There is a simple relation between the local counts, d_i, of the components in the
implementation and the global count of that implementation if all B_i’s are equal. If
the implementation is in a stable state (i.e., all components are waiting for input), the
global count is
(Σ i : 0 ≤ i < K : d_i * B^i),
with B for the value of the B_i’s.
Any up-down counter can be implemented by a UDC1 and a number of C0 cells,
all with B = 1 and D = 1. In this case the count of the implementation is just the sum
of the local counts of the cells, provided that the implementation is in a stable state.
The implementation can be called a radix-1 implementation.
With this approach it takes N - 1 C0(1, 1) cells and a UDC1 to implement an
N-counter. In general, if all the cells in the implementation of an N-counter have B = 1,
then the number of cells in the implementation grows linearly with N. We say that the
implementation’s cell count is O(N).
Using a combination of C0(2, 1) cells and C0(1, 1) cells results in a lower cell count
for an N-counter implementation. These types of cells together with a UDC1 cell are
sufficient to make up-down counters of any size with a cell count of O(log N). For
N > 0, we use the decomposition of UDC(2N + 1) into a C0(2, 1) cell and UDC(N).
For N ≥ 1, UDC(2N) is decomposed into a C0(1, 1) cell and UDC(2N - 1). If 2N - 1
is greater than 1, then a C0(2, 1) cell can be used to further decompose UDC(2N - 1).
For N = 1, a single UDC1 cell is used. The final result is a decomposition of an
N-counter in O(log N) cells of type C0(2, 1) and O(log N) cells of type C0(1, 1). The
number of cells in the implementation is O(log N).
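The decomposition rule just described is easy to state as a loop. The Python sketch below returns the list of (B, D) parameters of the cells in front of the final 1-counter; the function names `decompose` and `capacity` are illustrative assumptions.

```python
def decompose(n):
    """(B, D) parameters, front cell first, of the cells preceding the UDC1."""
    assert n > 0
    cells = []
    while n > 1:
        if n % 2 == 1:            # UDC(2M + 1) = C0(2, 1) || UDC(M)
            cells.append((2, 1))
            n = (n - 1) // 2
        else:                     # UDC(2M) = C0(1, 1) || UDC(2M - 1)
            cells.append((1, 1))
            n -= 1
    return cells


def capacity(cells):
    """Recompute N from the cell list using N = B*M + D, starting at the UDC1."""
    n = 1
    for B, D in reversed(cells):
        n = B * n + D
    return n
```

An even N costs one C0(1, 1) cell before the next halving step, so the number of cells is at most twice the number of halvings, that is, at most 2 * floor(log2 N). For N = 6 the result is a C0(1, 1) cell, a C0(2, 1) cell, another C0(1, 1) cell, and the 1-counter.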
If N = 2^K - 1 for some K ≥ 1, then the implementation contains K - 1 cells of type
C0(2, 1), and a UDC1 cell. The count is given by
(Σ i : 0 ≤ i < K : d_i * 2^i),
provided that no carry or borrow propagation is being performed. This is a simple
binary counter. For other N, we obtain a mixed-radix implementation.
It is not possible to obtain a lower growth rate for the cell count than O(log N).
Recall that each cell must have a bounded number of states. The parallel composition
of a set of K cells with each at most q states then has O(q^K) states. This means that
the parallel composition of a set of O(log K) cells has O(K) states. Since an N-counter
has Ω(N) states, and our C0 cells have a bounded number of states, it takes Ω(log N)
cells to implement the N-counter.
Another interesting performance characteristic is the delay that may occur between 
an input to a counter implementation and the succeeding output, also called the re- 
sponse time. We consider the worst-case response time and the amortized response 
time. The worst-case response time is defined as the maximum amount of time that 
may elapse between a request on the input port and the following acknowledgment 
on the output port of the implementation. The amortized response time is obtained 
by first calculating the average response time for each allowed input sequence and 
then taking the maximum over these response times. The average response time of a 
sequence of operations is the sum of the response times between matching requests 
and acknowledgments divided by the number of request-acknowledgment pairs. One 
could also say that the amortized response time is the average response time of the 
worst-case sequence of input operations. 
In order to get an estimate for the response time of a particular N-counter implementation,
we assume that delays in the cells of the implementation are bounded. For
cells of type C0 the structure of the specification is
pref *[ r?; if G then sr!; sa? [] ¬G then skip fi; a! ],
where G is a guard that expresses whether there is carry or borrow propagation. This
command is obtained by abstracting away from the data that is communicated between
the components. We assume that the delay between an input r? and the next output is
bounded from below and from above, and we assume that the delay between an input
sa? and the next output a! is also bounded from below and from above. For reasons of
simplicity we assume one lower bound, δ, and one upper bound, Δ, for these delays.
The assumptions express that an output is produced within a bounded delay from the
occurrence of the enabling input. This is a valid assumption for C0 cells, since the
amount of internal computation between consecutive communications is bounded.
Let us now consider the response time of our implementation of the up-down
N-counter. The response time of a request r is at most Δ if in the first cell G is false. If
in the first cell G is true, then the response time of the request is at most Δ plus the
response time for the subcounter plus another Δ. More precisely, the response time
of an input operation is proportional to the length of the carry or borrow chain caused
by that operation. It is possible for the implementation to be in a state where d_i = D_i
for all i < K - 1, and d_{K-1} = 0. The carry chain of an up received in this state has
length K - 1. So the response time of this up is O(K), which is also the worst-case
response time.
If the counter implementation is in the state described above, and if B_i ≥ 2 for all
i < K - 1, then the assignment d_i := D_i - B_i + 1 is executed for all i < K - 1 as a result
of the up input. Furthermore, if also D_i = B_i - 1, then this assignment sets the values
of all d_i’s to zero. This means that now the receipt of a down results in a borrow
chain of length K - 1, and a return to the original state with d_i = D_i for all i < K - 1.
A behavior can be constructed in which, after some initial behavior, up’s and down’s
alternate and all requests have a response time of O(K). By making the number of
alternations large enough, the response times of the initial behavior can be ignored in
calculating the average response time of the behavior. Thus, an average response time
of O(K) is obtained for a behavior of this kind. Since the maximum length of any
carry chain is K - 1, there are no behaviors with worse average response time. Hence
the amortized response time is O(K) for implementations that have D_i = B_i - 1 and
B_i ≥ 2 for all i < K - 1. This is the case, for instance, for the binary counter.
If D_i > B_i - 1 and B_i ≥ 2 for all i < K, then the situation is different. In this case, if
d_i = D_i for some i < K and cell i receives an up, the new value of d_i, i.e., D_i - B_i + 1,
will be greater than 0 and less than D_i. So whatever the next input is, up or down,
it will not result in a carry or borrow propagation. In general, for any component,
any operation after an operation that leads to a carry or borrow propagation will not
lead to a carry or borrow propagation. Consequently, any sequence of input operations
that leads to j communications on r_i leads to at most j/2 communications on r_{i+1}.
Therefore, for any sequence of j (external) input operations, the number of internal
requests is at most
(Σ i : 1 ≤ i ≤ K - 1 : j/2^i) < j * (Σ i : 1 ≤ i : 1/2^i) = j.
This means that the average length of a carry or borrow chain is at most 1. Since the
response time of a request is proportional to the length of the carry or borrow chain,
the amortized response time of a behavior is bounded from above by a constant.
In the case that B_i = 1 for all i < K, the amortized response time grows linearly with
K. This can be seen by looking at the behavior with N request-acknowledgment pairs,
where all requests are up’s. The carry chains caused by the first D_0 up’s have length 0,
the carry chains caused by the next D_1 up’s have length 1, the carry chains caused by
the next D_2 up’s have length 2, and so on. Since D_i does not depend on N for any i,
the carry chain for the jth up has length O(j). Now it follows that the average length
of the carry chain on this behavior is (Σ j : 0 ≤ j < N : O(j))/N = O(N). There are
O(N) cells in the implementation, so the longest carry chain has length O(N) too. This
means that the average carry chain cannot be worse than O(N). A similar reasoning
holds for the borrow chains, and so we can conclude that the amortized response time
is O(N).
In summary, we have the following results. If an up-down N-counter is implemented
by K cells of type C0 and a UDC1, its worst-case response time is O(K). For radix-1
implementations this means that the worst-case response time is O(N); for implementations
with B > 1 for all cells, the worst-case response time is O(log N). The amortized
response times are different for different values of B and D. For radix-1 implementations
the amortized response time is O(N). If for all cells in the implementation B ≥ 2
and D = B - 1, then the amortized response time is O(log N). If, on the other hand,
B ≥ 2 and D > B - 1 for the cells in the implementation, then the amortized response
time is O(1).
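The three regimes in this summary can be observed by counting chain lengths in a simulation that tracks only the local counts. The Python sketch below is an illustration under assumptions: the function name `chain_lengths` is chosen here, acknowledgment values are ignored, and the input sequence is assumed not to cause overflow or underflow.

```python
def chain_lengths(params, ops):
    """Length of the carry or borrow chain caused by each operation in `ops`.

    `params` lists (B, D) for cells 0 .. K-2; component K-1 is a 1-counter.
    """
    ds = [0] * (len(params) + 1)          # local counts d_0 .. d_{K-1}
    lengths = []
    for op in ops:
        length = 0
        for i, (B, D) in enumerate(params):
            if op == "up" and ds[i] != D:
                ds[i] += 1
                break
            if op == "down" and ds[i] != 0:
                ds[i] -= 1
                break
            # cell i propagates a carry or a borrow to cell i + 1
            ds[i] = D - B + 1 if op == "up" else B - 1
            length += 1
        else:
            # the request reached the final 1-counter
            ds[-1] = 1 if op == "up" else 0
        lengths.append(length)
    return lengths
```

With ten C0(2, 1) cells, 1023 up's followed by alternating up's and down's give a chain of length 10 for every alternating request. With ten C0(2, 2) cells, the total chain length over the same kind of input stays below the number of requests. With nineteen C0(1, 1) cells, the jth up (counting from 0) causes a chain of length j.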
Our last performance measure is the communication count. In a sequential compu- 
tation the number of primitive operations performed to compute a result can be taken 
as a measure for the ‘cost’ of the computation. In our counter implementations, the 
communications can be taken as the primitive operations. We can ignore the internal 
operations in the cells, because the number of communications a cell engages in is 
proportional to the number of internal operations it performs, and we only consider 
first-order approximations. 
For a sequence of input operations, the communication count is the number of communications
amortized over the number of input operations. For an up-down counter
implementation we can compute the communication count in the following way. We
can attribute every internal communication to an external up or down operation. All
communications that belong to the carry or borrow chain caused by an input operation
are attributed to that input operation. Furthermore, the input operation itself is a
communication, and the resulting output operation is a communication too. Therefore, we
can define the cost of an input operation as two times the length of the carry or borrow
chain it causes plus two. This means that the communication count of an N-counter
implementation with C0 cells has the same order of growth as its amortized response time.
5. Introducing parallelism 
In the previous section a program was presented that abstracted away from the 
communicated values in a counter cell. We repeat that program here:
pref *[ r?; if ¬G then skip [] G then sr!; sa? fi; a! ].
It is clear that there is no parallelism: at no point during the execution more than one 
communication action is enabled. Introducing parallelism may result in better response 
times for counter implementations. 
We introduce parallel behavior by deleting unnecessary ordering of actions, without
changing the cell’s behavior with respect to the pair of channels r and a. In the program
above, action a! can be performed in parallel with the communications on sr and sa.
This gives the following result:
pref *[ r?; if ¬G then a! [] G then (sr!; sa?) || a! fi ].
We can relax the ordering even further by allowing communications on r to occur in
parallel with communications on sr and sa. This results in the communication behavior
pref (r?
     ; *[ if ¬G then a!; r? [] G then (sr!; sa?) || (a!; r?) fi ]
     ).
Now input on r is allowed directly after an output on a has been sent.
Unfortunately, these transformations do not always work if we take into account 
which values are communicated. In some cases a value received on sa is needed to 
compute the value to be sent on a. We analyze when the transformations are possible 
for command C0. We notice that there are two cases where communication on sr and
sa occurs.
For the first case, where vr = up, we look at the following program fragment from
C0.
if d ≠ D then d := d + 1 [] d = D then d := D - B + 1 || (sr!up; sa?vsa) fi
; if vsa ≠ full ∨ d ≠ D then va := neither [] vsa = full ∧ d = D then va := full fi
In the case d = D, the value D - B + 1 is assigned to d, and then a new value is assigned
to va. The assignment va := neither is guarded by vsa ≠ full ∨ d ≠ D. Substituting
D - B + 1 for d in this guard, we see that it can be simplified to vsa ≠ full ∨ B ≠ 1. Since
the guard of va := full is the negation of the guard of va := neither, the statement
va := neither is performed if B ≠ 1, irrespective of the value of vsa. In other words,
the value to be sent on channel a does not depend on the value received on channel
sa, and so the ordering between sa? and a! can be relaxed if B ≠ 1. If B = 1, however,
the value to be sent on channel a does depend on the value received on channel sa,
and the ordering between sa? and a! cannot be relaxed.
For the second case, where vr = down, we look at the following program fragment
from C0.
if d ≠ 0 then d := d - 1 [] d = 0 then d := B - 1 || (sr!down; sa?vsa) fi
; if vsa ≠ empty ∨ d ≠ 0 then va := neither
  [] vsa = empty ∧ d = 0 then va := empty
  fi
A similar argument as for the first case applies. After assignment d := B - 1, the guard
vsa ≠ empty ∨ d ≠ 0 can be simplified to vsa ≠ empty ∨ B ≠ 1. Again B ≠ 1 implies
that va := neither will be performed, irrespective of the value of vsa. Again, B = 1
means that the ordering between sa? and a! cannot be relaxed.
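Both case analyses can be checked by brute force. The sketch below evaluates only the acknowledgment guards after a carry or a borrow; the function names and the parameter ranges are assumptions chosen for illustration.

```python
ENF = ("empty", "neither", "full")


def ack_after_carry(B, D, vsa):
    """Acknowledgment chosen for an up received when d = D."""
    d = D - B + 1                      # digit after the carry
    return "full" if vsa == "full" and d == D else "neither"


def ack_after_borrow(B, D, vsa):
    """Acknowledgment chosen for a down received when d = 0."""
    d = B - 1                          # digit after the borrow
    return "empty" if vsa == "empty" and d == 0 else "neither"


def ack_depends_on_vsa(B, D):
    """True if some value received on sa changes the value sent on a."""
    carries = {ack_after_carry(B, D, v) for v in ENF}
    borrows = {ack_after_borrow(B, D, v) for v in ENF}
    return len(carries) > 1 or len(borrows) > 1


# Parameter pairs with B >= 1, D >= 1 and D >= B - 1 (ranges chosen arbitrarily).
pairs = [(B, D) for B in range(1, 6) for D in range(max(1, B - 1), 8)]
dependent = sorted({B for B, D in pairs if ack_depends_on_vsa(B, D)})
```

For the ranges tried here, the acknowledgment depends on the value received on sa only when B = 1, in agreement with the argument above.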
Using the above observations, a command C1 can be derived from C0. For C1 we
assume that B > 1. This ensures that the correct acknowledgment can be computed
without having to wait for input from the subcomponent. Another difference between
C0 and C1 is the elimination of variable va from C1. Instead of assigning a value to
va in the different alternatives and then sending the value of va on a, we now write
a!empty, a!neither, or a!full, whichever is appropriate.
C1(B, D : int, in r : ud, out a : enf, out sr : ud, in sa : enf) =
{B > 1 ∧ D > 0 ∧ D ≥ B - 1}
|[ var d : 0..D, var vr : ud, var vsa : enf ::
   pref (d, vsa := 0, empty
        ; r?vr
        ; *[ if vr = up
             then {d < D ∨ vsa ≠ full}
                  if d ≠ D
                  then d := d + 1
                       ; if vsa ≠ full ∨ d ≠ D then a!neither; r?vr
                         [] vsa = full ∧ d = D then a!full; r?vr
                         fi
                  [] d = D
                  then d := D - B + 1 || (sr!up; sa?vsa) || (a!neither; r?vr)
                  fi
             [] vr = down
             then {d > 0 ∨ vsa ≠ empty}
                  if d ≠ 0
                  then d := d - 1
                       ; if vsa ≠ empty ∨ d ≠ 0 then a!neither; r?vr
                         [] vsa = empty ∧ d = 0 then a!empty; r?vr
                         fi
                  [] d = 0
                  then d := B - 1 || (sr!down; sa?vsa) || (a!neither; r?vr)
                  fi
             fi
           ]
        )
]|.
After some further simplification, the command C1 can be rewritten as
C1(B, D : int, in r : ud, out a : enf, out sr : ud, in sa : enf) =
{B > 1 ∧ D > 0 ∧ D ≥ B - 1}
|[ var d : 0..D, var vr : ud, var vsa : enf ::
   pref (d, vsa := 0, empty
        ; r?vr
        ; *[ if vr = up ∧ d ≠ D ∧ (vsa ≠ full ∨ d + 1 ≠ D)
             then d := d + 1 || (a!neither; r?vr)
             [] vr = up ∧ vsa = full ∧ d + 1 = D
             then d := d + 1 || (a!full; r?vr)
             [] vr = down ∧ d ≠ 0 ∧ (vsa ≠ empty ∨ d - 1 ≠ 0)
             then d := d - 1 || (a!neither; r?vr)
             [] vr = down ∧ vsa = empty ∧ d - 1 = 0
             then d := d - 1 || (a!empty; r?vr)
             [] vr = up ∧ d = D
             then d := D - B + 1 || (sr!up; sa?vsa) || (a!neither; r?vr)
             [] vr = down ∧ d = 0
             then d := B - 1 || (sr!down; sa?vsa) || (a!neither; r?vr)
             fi
           ]
        )
]|.
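The value behavior of C1 can be explored with a sequential sketch. It ignores the parallel composition and all timing, and only checks which acknowledgment is produced for each input. The class and function names are hypothetical, and the reference counter used both as end cell and as oracle is an assumption based on the informal specification of an N-counter.

```python
import random


class EndCounter:
    """Reference up-down n-counter, used as end cell and as oracle."""

    def __init__(self, n):
        self.n, self.count = n, 0

    def request(self, op):
        self.count += 1 if op == "up" else -1
        assert 0 <= self.count <= self.n, "environment broke the protocol"
        if self.count == 0:
            return "empty"
        return "full" if self.count == self.n else "neither"


class C1Cell:
    """Value behavior of C1(B, D) in front of a subcounter `sub`."""

    def __init__(self, B, D, sub):
        assert B > 1 and D > 0 and D >= B - 1
        self.B, self.D, self.sub = B, D, sub
        self.d, self.vsa = 0, "empty"

    def request(self, op):
        B, D = self.B, self.D
        if op == "up" and self.d != D:
            self.d += 1
            return "full" if self.vsa == "full" and self.d == D else "neither"
        if op == "down" and self.d != 0:
            self.d -= 1
            return "empty" if self.vsa == "empty" and self.d == 0 else "neither"
        # Carry or borrow: the acknowledgment is neither, regardless of vsa.
        self.d = D - B + 1 if op == "up" else B - 1
        self.vsa = self.sub.request(op)
        return "neither"


def agrees_with_reference(cell, n, steps, seed=0):
    """Drive `cell` and a reference n-counter with the same random inputs."""
    rng, ref = random.Random(seed), EndCounter(n)
    for _ in range(steps):
        ops = [o for o, ok in (("up", ref.count < n), ("down", ref.count > 0)) if ok]
        op = rng.choice(ops)
        if cell.request(op) != ref.request(op):
            return False
    return True
```

For example, agrees_with_reference(C1Cell(2, 2, EndCounter(2)), 6, 5000) compares a C1(2,2) cell in front of a 2-counter with a 6-counter on a random input sequence that respects the protocol.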
How can an N-counter be decomposed into a collection of C1 cells and a small
up-down counter? For N > 0, the (2N + 2)-counter can be decomposed into a C1(B, D) cell
and an N-counter, where B = 2 and D = 2. The (2N + 1)-counter can be decomposed
into a C1(B, D) cell and an N-counter, where B = 2 and D = 1.
One of two end cells is required in the decomposition, UDC1 or UDC2. In the
previous section the specification of UDC1 can be found. The specification of UDC2
reads
UDC2(in r : ud, out a : enf) =
pref (r?up; a!neither
     ; *[ r?up; a!full; r?down; a!neither | r?down; a!empty; r?up; a!neither ]
     ).
An up-down 2-counter first receives an up on port r, which is acknowledged with
a neither on port a. The count of the counter is now 1. Subsequently, one of two 
sequences of actions may occur; each will lead back to the same state. The first of the 
two sequences starts with the receipt of an up, which is followed by the acknowledg- 
ment full, the receipt of a down, and the acknowledgment neither. The second possible 
sequence consists of the receipt of a down, which is followed by the acknowledgment 
empty, the receipt of an up, and the acknowledgment neither. 
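The decomposition rule stated earlier in this section, with B = 2 and D = 1 or 2 and an end cell UDC1 or UDC2, can be sketched as a short procedure. The names decompose and capacity are hypothetical; the relation N = B * M + D for a cell in front of an M-counter is taken from the two decompositions given above.

```python
def decompose(n):
    """Split an n-counter into C1(2, D) cells (D = 1 or 2) and an end cell.

    Returns (cells, end), where cells lists the (B, D) parameters from the
    outermost cell inward and end is 1 or 2 for UDC1 or UDC2.
    """
    if n < 1:
        raise ValueError("n must be positive")
    cells = []
    while n > 2:
        if n % 2 == 0:            # n = 2M + 2
            cells.append((2, 2))
            n = (n - 2) // 2
        else:                     # n = 2M + 1
            cells.append((2, 1))
            n = (n - 1) // 2
    return cells, n


def capacity(cells, end):
    """Recompute the counter size from a decomposition: N = B * M + D."""
    n = end
    for B, D in reversed(cells):
        n = B * n + D
    return n
```

For instance, decompose(6) yields one C1(2,2) cell in front of UDC2, and decompose(7) yields two C1(2,1) cells in front of UDC1.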
6. Performance analysis of C1 implementations
Let us see whether we can lower the upper bounds on the cell count, response time
and communication count of our counter implementations by using C1 cells instead of
C0 cells.
The cell count of N-counter implementations built from C1 cells is Θ(log N), since
all C1 cells have B > 1.
The communication count of counter implementations built from Cl cells does not 
improve either. Recall that the communication count is determined by the average 
length of a carry or borrow chain of a worst-case sequence of inputs. It does not matter 
whether carries and borrows caused by different inputs can occur in parallel, as long 
as all carries and borrows are taken into account. When we look only at the number
of communications taking place in response to an input sequence, there is no difference
between an implementation consisting of C0 cells and one consisting of C1 cells. It
follows that the communication count of an up-down counter built from C1 cells is
the same as that of a counter of the same size, but built from C0 cells with the same
parameters.
The response time analyses, however, are different this time. As a result of the 
parallelism, requests may be acknowledged before the carry or borrow propagation has 
ended. This means that for counter implementations built from C1 cells, we cannot
use the length of carry and borrow chains to analyze the response time.
Let us consider an N-counter implementation that consists of K cells of type C1 and
an end cell. Recall that, since B > 1, K = O(log N). For the purpose of response time
analysis we abstract away from communicated values and internal computations and 
concentrate on the communication behavior of the cells. The communication behavior 
reads 
pref (r?
     ; *[ if ¬G then a!; r? [] G then (sr!; sa?) || (a!; r?) fi ]
     ).
Guard G in the guarded command may be interpreted as ‘there is a carry or borrow 
propagation.’ As before, we assume that delays in cells can vary between a fixed lower 
bound δ and a fixed upper bound Δ. A delay of a cell with respect to an output is the
time difference between that output and the input enabling that output. For example, 
the input enabling output a! (and output sr!) may be input r? or input sa?, whichever 
occurs last. 
We distinguish two cases in the response time analysis, namely D_i = B_i - 1 for all
i < K, and D_i > B_i - 1 for all i < K.
For the first case the worst-case response time is still O(K), as it was for
implementations without parallelism. But now a very specific delay distribution is required to
obtain a Θ(K) response time, whereas for sequential implementations only the length
of the carry chain mattered. If a response time of Ω(K) is to occur, the implementation
must be in a state where an input on r_0 will cause a carry or borrow chain of
length Ω(K). It is possible to have all succeeding inputs on r_0 cause a borrow or carry
chain of length Ω(K) by alternating up and down requests. Having Ω(K) carries and
borrows propagating fast (with delays δ between inputs to and outputs from cells) and
then choosing delays Δ for acknowledgment propagation from the last to the first cell,
leads to a response time of Ω(K). Although such behaviors are unlikely to occur in
practice, they are theoretically possible in our variable-delay model. If all delays are
constant, that is, δ = Δ, then the worst-case response time is constant as well.
For the second case, where all cells have D_i > B_i - 1, the worst-case response time
is O(1). The proof for this claim is quite involved, and outside the scope of this paper.
A crucial property for the proof is the fact that the average length of carry and borrow
chains is O(1). The second author will address this subject in a more general setting
in a forthcoming publication.
The amortized response time for the second case is also O(1). For the first case the
amortized response time is O(1) as well, since for every input with a response time
of O(K) in a behavior there are Ω(K) inputs with a response time of O(1).
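The difference between the two cases in terms of chain lengths can be observed in a small experiment. The sketch tracks only the digits d of a chain of cells and records, for each input, how many cells pass a carry or borrow on to their subcomponent. The function name run, the choice K = 10, and the input sequence are assumptions for illustration; timing and the end cell are not modeled.

```python
def run(cells, ops):
    """Apply `ops` to a chain of (B, D) cells and return each chain length.

    Only the digits d are tracked. The chain length of an input is the number
    of cells that pass a carry or borrow on to their subcomponent. The caller
    is assumed to keep the count within range, so the end cell is not modeled.
    """
    digits = [0] * len(cells)
    lengths = []
    for op in ops:
        hops = 0
        for i, (B, D) in enumerate(cells):
            if op == "up" and digits[i] != D:
                digits[i] += 1
                break
            if op == "down" and digits[i] != 0:
                digits[i] -= 1
                break
            digits[i] = D - B + 1 if op == "up" else B - 1
            hops += 1
        lengths.append(hops)
    return lengths


K = 10
# Reach the boundary count 2**K - 1, then alternate across it.
ops = ["up"] * (2**K - 1) + ["up", "down"] * 200

binary = run([(2, 1)] * K, ops)[-400:]    # D = B - 1
bsc = run([(2, 2)] * K, ops)[-400:]       # D > B - 1
```

With these inputs every alternating request in the chain with D = B - 1 propagates through all K cells, whereas in the chain with D = 2 the total chain length over any n inputs stays below n + K.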
The reader is reminded that our performance analyses are first-order approximations 
only. As such, they tell us what the optimal design would be for large N. For small 
N, one may have to do a more detailed analysis, since the hidden constants and 
second order terms may become dominant. Also, we have seen that increasing D while 
keeping B fixed, leads to shorter average carry and borrow chains, and thus, in our 
first-order model, to better or equally good response times. However, when considering 
implementations of C1(B, D) cells, the response time of the individual cell may be seen
to increase with D. So from a practical point of view one would want to choose the
values for D as small as possible, under the constraint that they are large enough to
obtain bounded response time.
7. Optimal implementations 
An N-counter implementation is optimal with respect to all three of our performance
criteria if its cell count is Θ(log N), its worst-case response time is O(1), and its
communication count is O(1). No implementation built from C0 cells is optimal, because
its worst-case response time is not O(1).
Earlier we described how any N-counter can be built from C1(B, D) cells, with
B = 2 and D = 1, 2, and a 1- or 2-counter as end cell. This strategy does not always
lead to an implementation with a worst-case response time of O(1) or communication
count of O(1). For instance, if N = 2^K - 1 for some K ≥ 1, then the implementation
consists of C1(2,1) cells and a 1-counter. Both the worst-case response time and the
communication count for this implementation are Θ(K).
There are several ways to obtain implementations with better performance
characteristics. One of them is to use C1(B, D) cells, with B = 2 and D = 2, 3, and a
1-, 2-, or 3-counter for an end cell. Then for every N-counter implementation, all cells
have D > B - 1, and the worst-case response time and communication count are O(1).
Since counting is in base 2, the cell count is Θ(log N).
Another solution is to use C1(2,2) cells and C0(1,1) cells. A (2N + 2)-counter is
decomposed into a C1(2,2) cell and an N-counter; a (2N + 1)-counter is decomposed
into a C0(1,1) cell and a 2N-counter. It is sufficient to have a 1-counter as an end cell.
An implementation built in this way has no neighboring C0 cells (except maybe at the
end where two C0 cells may be adjacent). Since at least half the cells are C1(2,2)
cells, the cell count of an N-counter implementation is O(log N). The crucial property
for bounded amortized response time and bounded communication count still holds:
the average length of a carry or borrow chain is O(1). Bounded communication count
is a direct result of this property. Without proof we also mention that the worst-case
response time is O(1).
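The two decomposition strategies of this section can be sketched as follows. The function names are hypothetical, the tuples ("C1", B, D) and ("C0", B, D) are an ad hoc encoding of cells, and the treatment of N = 2 in the second strategy (a C0(1,1) cell in front of a 1-counter) is taken from Table 1.

```python
def bsc_bsdc(n):
    """Radix 2 with C1(2, 2) and C1(2, 3) cells; end cell is a 1-, 2- or 3-counter."""
    cells = []
    while n > 3:
        D = 2 if n % 2 == 0 else 3      # n = 2M + 2 or n = 2M + 3
        cells.append(("C1", 2, D))
        n = (n - D) // 2
    return cells, n


def mixed_radix_bsc(n):
    """C1(2, 2) and C0(1, 1) cells; end cell is always a 1-counter."""
    cells = []
    while n > 1:
        if n % 2 == 0 and n > 2:        # n = 2M + 2 with M > 0
            cells.append(("C1", 2, 2))
            n = (n - 2) // 2
        else:                           # n = 2M + 1, or n = 2
            cells.append(("C0", 1, 1))
            n -= 1
    return cells, n


def capacity(cells, end):
    """Recompute the counter size from a decomposition: N = B * M + D."""
    n = end
    for _, B, D in reversed(cells):
        n = B * n + D
    return n


def adjacent_c0_only_at_end(cells):
    """True if two neighboring C0 cells occur at most as the last two cells."""
    kinds = [kind for kind, _, _ in cells]
    return all(not (a == b == "C0") for a, b in zip(kinds[:-2], kinds[1:-1]))
```

In both strategies the number of cells grows with the number of bits of N, and every C1 cell produced satisfies D > B - 1.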
8. Concluding remarks 
We presented a small example in the design and analysis of a frequently used hard- 
ware component: an up-down counter. Although the specification of the counter is very 
simple, there are many interesting and non-trivial ways in which this specification
can be implemented. Each implementation was described in a simple CSP-like notation 
as a collection of communicating cells and was derived using some simple program- 
ming techniques. With just a few different cells, up-down counters of any size could 
be implemented. We also analyzed the performance of the different implementations 
with respect to cell count, response time, and communication count. Table 1 gives an 
overview of the performance of these implementations. Several designs with an optimal 
cell count, optimal worst-case response time, and optimal communication count were 
suggested. 
Most published hardware implementations we have found are synchronous imple- 
mentations and are based on some sort of binary representation of the count. For most 
designs this means that the clock frequency depends on the counter size N due to 
the carry and borrow propagation during increments and decrements [9]. Furthermore, 
every synchronous implementation based on a binary representation of the count has 
to clock in each period about log N storage devices (even when no input operation is
performed), making the power consumption at least proportional to log N.
Although our designs can be implemented by means of a synchronous circuit, they 
are particularly suited for an asynchronous circuit implementation. In fact, there are
several ways in which our cells can be implemented in an asynchronous circuit. Each of
the methods described in [1, 16, 10] can be used to translate the commands C0 and C1
into an asynchronous circuit. In [4] a simple implementation for cell C1(2,1) is given.
In case of an asynchronous circuit implementation, the performance measures also may
be interpreted in more concrete terms. The cell count of our designs gives a first-order
approximation of the area of the hardware implementation, since the area needed for
connection wires among the cells is negligible due to the simple nature of the linear
array of cells and no additional area is needed for the clock distribution. Furthermore,
in absence of a clock frequency, the response time gives an estimate for the speed of
the asynchronous implementation. Finally, under certain conditions, the communication
count can be used as a measure of the power consumption for an asynchronous
implementation ([4, 15]). Since all our performance estimates are based on the program texts
and not on a particular hardware implementation, and since these estimates are only
first-order approximations, it would be interesting to see how accurate these estimates
are with respect to different asynchronous implementations. Experimental results for a
slightly different counter reported by Van Berkel in [15] have been encouraging.

Table 1
Summary of performance results for counter decompositions, M > 0. BSC stands for Binary Stored Carry
([13]), which refers to the use of cell C1(2,2), and BSDC stands for Binary Stored Double Carry ([13]),
which refers to the use of cell C1(2,3). Cells C0 have a sequential communication behavior, cells C1 have
a parallel communication behavior.

Case               Decomposition                  Cell       Response time            Communication
                                                  count      Worst-case   Amortized   count

Radix 1; all N                                    Θ(N)       Θ(N)         Θ(N)        Θ(N)
  N > 1            C0(1,1), UDC(N - 1)
  N = 1            UDC1

Binary; N = 2^k - 1                               Θ(log N)   Θ(log N)     Θ(log N)    Θ(log N)
  N = 2^(k+1) - 1  C0(2,1), UDC(2^k - 1)
  N = 1            UDC1

Mixed radix (1,2); all N                          Θ(log N)   Θ(log N)     Θ(log N)    Θ(log N)
  N = 2M + 1       C0(2,1), UDC(M)
  N = 2M           C0(1,1), UDC(2M - 1)
  N = 1            UDC1

BSC; N = 2^(k+1) - 2                              Θ(log N)   Θ(1)         Θ(1)        Θ(1)
  N = 2^(k+2) - 2  C1(2,2), UDC(2^(k+1) - 2)
  N = 2            UDC2

Binary/BSC; all N                                 Θ(log N)   Θ(log N)     Θ(1)        Θ(log N)
  N = 2M + 2       C1(2,2), UDC(M)
  N = 2M + 1       C1(2,1), UDC(M)
  N = 2            UDC2
  N = 1            UDC1

Radix 2 with BSC/BSDC; all N                      Θ(log N)   Θ(1)         Θ(1)        Θ(1)
  N = 2M + 3       C1(2,3), UDC(M)
  N = 2M + 2       C1(2,2), UDC(M)
  N = 3            UDC3
  N = 2            UDC2
  N = 1            UDC1

Mixed radix with BSC; all N                       Θ(log N)   Θ(1)         Θ(1)        Θ(1)
  N = 2M + 2       C1(2,2), UDC(M)
  N = 2M + 1       C0(1,1), UDC(2M)
  N = 2            C0(1,1), UDC1
  N = 1            UDC1
Acknowledgements 
The Eindhoven VLSI Club and the Maveric Club at the University of Waterloo are 
gratefully acknowledged for their comments on previous versions of this paper. 
References 
[1] A. Davis, B. Coates and K. Stevens, Automatic synthesis of fast compact asynchronous control circuits,
in: Proc. IFIP WG10.5 Working Conf. on Asynchronous Design Methodologies (March 1993).
[2] E.W. Dijkstra, A Discipline of Programming (Prentice-Hall, Englewood Cliffs, NJ, 1976). 
[3] D.L. Dill, S.M. Nowick and R.F. Sproull, Specification and automatic verification of self-timed queues, 
Formal Methods in System Design 1 (1) (1992) 29-60.
[4] J.C. Ebergen, J. Segers and I. Benko, Parallel program and asynchronous circuit design, in: G. Birtwistle 
and A. Davis, eds., Asynchronous Digital Circuit Design (Springer, Berlin, 1995) 50-103. 
[5] L.J. Guibas and F.M. Liang, Systolic stacks, queues, and counters, in: P. Penfield Jr., ed., 1982 Conf.
on Advanced Research in VLSI (Artech House, 1981) 155-164. 
[6] C.A.R. Hoare, Communicating Sequential Processes (Prentice-Hall, Englewood Cliffs, NJ, 1985). 
[7] E.V. Jones and G. Bi, Fast up/down counters using identical cascaded modules, IEEE J. Solid State 
Circuits 23 (1) (1988) 283-285. 
[8] J.L.W. Kessels, Calculational derivation of a counter with bounded response time, in: G. J. Milne and
L. Pierre, eds., Correct Hardware Design and Verification Methods, IFIP WG10.2 Advanced Research 
Working Conf. (CHARME ‘93), Lecture Notes in Computer Science, Vol. 683 (Springer, Berlin, 1993) 
203-213. 
[9] X.D. Lu and P.C. Treleaven, A special-purpose VLSI chip: A dynamic pipeline up-down counter,
Microprocessing and Microprogramming 10 (1) (1982) 1-10.
[10] A.J. Martin, Programming in VLSI: From communicating processes to delay-insensitive circuits, in:
C.A.R. Hoare, ed., Developments in Concurrency and Communication (Addison-Wesley, Reading, MA, 
1990). 
[11] R.M.M. Oberman, Counting and Counters (MacMillan, New York, 1981).
[12] B. Parhami, Systolic up/down counters with zero and sign detection, in: M. J. Irwin and R. Stefanelli, 
eds., IEEE Symp. on Computer Arithmetic (IEEE Computer Soc. Press, Silver Spring, MD, 1987)
174-178. 
[13] B. Parhami, Generalized signed-digit number systems: A unifying framework for redundant number 
representations, IEEE Trans. Comput. 39 (1) (1990) 89-98. 
[14] M. Rem, Trace theory and systolic computations, in: J.W. de Bakker, A.J. Nijman and P.C. Treleaven, 
eds., PARLE: Parallel Architectures and Languages Europe, Vol. I, Lecture Notes in Computer 
Science, Vol. 258 (Springer, Berlin, 1987) 14-33. 
[15] C.H.K. van Berkel, VLSI programming of a modulo-N counter with constant response time and constant 
power, in: S. Furber and M. Edwards, eds., IFIP WG 10.5 Working Conf. on Asynchronous Design
Methodologies (Elsevier, Amsterdam, 1993). 
[16] K. van Berkel, Handshake Circuits: an Asynchronous Architecture for VLSI Programming, 
International Series on Parallel Computation, Vol. 5 (Cambridge University Press, Cambridge, 1993).
