Optimal Metastability-Containing Sorting via Parallel Prefix Computation by Bund, Johannes et al.
Optimal Metastability-Containing Sorting
via Parallel Prefix Computation∗
Johannes Bund, Christoph Lenzen, Moti Medina
Abstract—Friedrichs et al. (TC 2018) showed that metastability can be contained when sorting inputs arising from time-to-digital
converters, i.e., measurement values can be correctly sorted without resolving metastability using synchronizers first. However, this
work left open whether this can be done by small circuits. We show that this is indeed possible, by providing a circuit that sorts Gray
code inputs (possibly containing a metastable bit) and has asymptotically optimal depth and size.
Our solution utilizes the parallel prefix computation (PPC) framework (JACM 1980). We improve this construction by bounding its
fan-out by an arbitrary f ≥ 3, without affecting depth and increasing circuit size by a small constant factor only. Thus, we obtain the first
PPC circuits with asymptotically optimal size, constant fan-out, and optimal depth.
To show that applying the PPC framework to the sorting task is feasible, we prove that the latter can, despite potential metastability, be
decomposed such that the core operation is associative. We obtain asymptotically optimal metastability-containing sorting networks.
We complement these results with simulations, independently verifying the correctness as well as small size and delay of our circuits.
1 INTRODUCTION
Metastability is a fundamental obstacle when crossing clock
domains, potentially resulting in soft errors with critical
consequences [14]. As it has been shown that metastability
cannot be avoided deterministically [25], synchronizers [19]
are employed to reduce the error probability to tolerable
levels. This approach trades precious time for reliability:
the more time is allocated for metastability resolution, the
smaller the probability of metastability-induced faults.
Recently, a different approach has been proposed, coined
metastability-containing (MC) circuits [10]. It accepts a limited
amount of metastability in the input to a digital circuit and
ensures limited metastability of its output, so that the result
is still useful. In a series of works [24], [3], [4], we applied
this approach to a fundamental primitive: sorting. The cir-
cuit given in [4] is asymptotically optimal in depth and size.
Our Contribution: In this article, we present the
machinery used to obtain the circuit from [4] in detail. We
prove that CMOS implementations of basic gates realize
Kleene logic (cf. [20, §64]), justifying the computational
model introduced in [10] and used in this article.
The task of sorting an arbitrary number of inputs can
be reduced to sorting two inputs by using sorting net-
works [21]. The 0-1-principle (cf. Section 2) shows that
plugging an MC 2-sort(B) circuit (for B-bit inputs) into a
sorting network (for n values) readily yields an MC circuit
that is capable of sorting n inputs. Hence, we need to design
a 2-sort(B) circuit sorting two inputs in an MC way.
∗This article generalizes and extends work presented at DATE 2018 [4].
• Johannes Bund and Christoph Lenzen are with the Max Planck Institute
for Informatics, Saarland Informatics Campus, 66123 Saarbrücken,
Germany. Email: {jbund,clenzen}@mpi-inf.mpg.de
• Moti Medina is with the School of Electrical & Computer Engineering,
Ben-Gurion University of the Negev, 8410501 Beer Sheva, Israel.
Email: medinamo@bgu.ac.il
As the choice of the encoding matters a lot for MC
circuits, we characterize the set of input strings we want to
sort (“valid strings”). A valid string is either a (standard)
Gray code string or a string obtained from a Gray code
string by replacing the unique bit that would change on
the up-count to the “next” codeword by M for metastability
(the third logic value in Kleene logic). When using non-
redundant codes, the use of Gray codes is mandatory: when
converting an analog value to a digital one, continuously
changing the input can force any circuit (that uses the value
in a non-trivial way) into metastability [25]. Moreover, for
combinational circuits in the abstraction of Kleene logic, all
output bits that change when flipping a given input bit must
become unstable when the input bit is unstable, cf. [10].
For instance, encoding a value unknown to be 11 or 12
in standard binary code would result in a string that, once
metastability has been resolved, may represent any number
in the interval from 8 to 15, cf. Section 3.
Valid strings arise naturally when stopping a Gray code
counter asynchronously [12] or, more generally, whenever
performing analog-to-digital conversion; respective circuits
may risk multiple metastable bits to achieve better average-
case precision, but for the best worst-case precision one can
stick to guaranteeing valid strings as output. Exploiting the
structure of Gray code and the restriction to valid strings, we
show how to reliably sort all inputs despite the uncertainty
about the represented value arising from metastability.
We formally specify the 2-sort(B) circuit and then prove
that the task of comparing two valid strings can be decom-
posed into first performing a four-valued comparison on
each prefix pair of the two valid input strings, and then
inferring the corresponding output bits. This reduces the
design of 2-sort(B) to a parallel prefix computation (PPC)
problem, which for our purposes can be phrased as follows.
Definition 1.1 ($\mathrm{PPC}_\oplus(B)$). For associative $\oplus : D \times D \to D$
and $B \in \mathbb{N}$, a $\mathrm{PPC}_\oplus(B)$ circuit is specified as follows.
Input: $d \in D^B$,
Output: $\pi \in D^B$,
Functionality: $\pi_i = \bigoplus_{j=1}^{i} d_j$ for all $i \in [1, B]$.
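As a purely sequential reference for Definition 1.1 (not the parallel circuit construction discussed later), the specification can be sketched as follows. The function name `ppc_reference` and the use of integer addition as the associative operator are illustrative assumptions.

```python
from itertools import accumulate

def ppc_reference(op, d):
    """Sequential reference for PPC: output i is d[0] op d[1] op ... op d[i]."""
    return list(accumulate(d, op))

# Example with integer addition as the associative operator (an assumption
# for illustration; any associative operator on a domain D works).
prefix_sums = ppc_reference(lambda a, b: a + b, [3, 1, 4, 1, 5])
print(prefix_sums)  # [3, 4, 8, 9, 14]
```

A PPC circuit computes the same outputs, but exploits associativity to evaluate them in parallel.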
arXiv:1911.00267v1 [cs.DC] 1 Nov 2019
Fast PPC circuits that are simultaneously (asymptotically)
optimal in depth and size are known due to a celebrated
result by Ladner and Fischer [23]. Going beyond [4],
we present the full range of solutions that can be derived
using their framework, which allows for a trade-off between
depth and size of the 2-sort circuit. Most prominently,
optimizing for depth reduces the depth of the circuit by a
factor of 2 compared to [4] to the optimal ⌈log B⌉, at the expense
of increasing the size by a factor of up to 2.
However, relying on the construction from [23] as-is
results in a very large fan-out. We present a modification
reducing fan-out to any number f ≥ 3 without affecting
depth, increasing the size by a factor of only 1 + O(1/f)
(plus at most 3B/2 buffers). In particular, our results imply
that the depth of an MC sorting circuit can match the delay
of a non-containing circuit, while maintaining constant fan-
out and a constant-factor size overhead. Due to the fact that
PPC circuits lie at the heart of fast adders [27], we consider
this result of independent interest.
We complement our theoretical findings by simulations
confirming the correctness and small size of the devised
circuits. Post-layout area and delay of the designed circuits
compare favorably with a baseline provided by a straight-
forward non-containing implementation.
Organization of this Article: We discuss related
work in Section 2. Some preliminaries, the computational
model and its justification, as well as the problem specifica-
tion are given in Section 3. Next, in Section 4, we break the
task of designing a 2-sort(B) circuit down into comparing
prefixes and subsequently generating the output bits out of
the computed comparison values and the respective pair
of input bits. The comparison can be further decomposed
into sequential application of an associative operator, which
enables application of the PPC framework to compute all
prefixes efficiently in parallel with (asymptotically) optimal
depth. In order to keep this article self-contained, we com-
pactly review the PPC framework in Section 5. The section
then proceeds to showing how to modify the construction
for bounded fan-out and bounding the size of the resulting
circuits. In Section 6, we implement the base operators by
subcircuits and plug the pieces together to obtain complete
circuits. We then simulate them up to an input width of
B = 16 to independently verify their correctness, and
provide delay and area of the laid out circuits. We compare
to a non-containing version as baseline, demonstrating the
controlled increase in size of the circuit. We conclude the
article in Section 7, where we also briefly discuss follow-
up work that generalizes our results, demonstrating that
higher-level concepts of this work like sorting networks and
parallel prefix computation are applicable to further MC
circuits.
2 RELATED WORK
Sorting Networks: Sorting networks (see, e.g., [21])
sort n inputs from a totally ordered universe by feeding
them into n parallel wires that are connected by 2-sort
elements, i.e., subcircuits sorting two inputs; these can act
in parallel whenever they do not depend on each other’s
output. A correct sorting network sorts all possible inputs,
i.e., the wires are labeled 1 to n such that the ith wire outputs
the ith element of the sorted list of inputs. The size of a
sorting network is its number of 2-sort elements and its
depth is the maximum number of 2-sort elements an input
may pass through until reaching the output.
The 0-1-principle [21] states that a sorting network —
assuming the 2-sort circuits are correct — is correct if and
only if it sorts 0-1 inputs correctly. Thus, we obtain sorting
networks for inputs that may suffer from metastability by
constructing 2-sort circuits (w.r.t. a suitable order on such
inputs) and plugging them into existing sorting networks.
Sorting networks have been extensively studied. Tight
lower bounds of depth Ω(log n) (trivial) and size Ω(n log n)
(see, e.g., [8]) are known and can be simultaneously asymp-
totically matched [1]. More practically, for small values of
n optimal depth and/or size networks are known [6], [7],
[21]. Accordingly, our task boils down to finding optimal
(or close to optimal) metastability-containing 2-sort circuits.
For B-bit inputs, our 2-sort circuits have depth and size
O(log B) and O(B), respectively, which is (trivially) optimal
up to constants; as size and depth of our circuits are close
to non-containing 2-sort circuits (cf. Table 12), we conclude
that our approach yields MC sorting networks that are
optimal up to small constant factors in both depth and size.
Prior Work on MC Circuits: Recent work [10] shows
that for any Boolean function a combinational MC circuit
implementing its metastable closure (see Definition 3.8) exists.
The metastable closure can be seen as a best effort to contain
metastability: when for an input with (some) metastable bits
the stable input bits already determine a given output bit of
the original Boolean function, the closure attains the respec-
tive value on this output bit; otherwise it is metastable.
Unfortunately, the proof from [10], which uses a con-
struction dating back to Huffman [16], yields circuits of
exponential size in the number of input bits B. The same
is true for speculative computing [28]. Unconditional lower
bounds on MC circuits [17] show that this cannot be avoided
in general, even if the implemented function admits a small
non-containing circuit. The same work provides, assuming
that at most k input bits can be metastable, a construction
with multiplicative B^{O(k)} and additive O(k log B) overheads
in size and depth, respectively. For the 2-sort element,
k = 2 (each Gray code string may contain one metastable
bit), but the resulting circuits are still far from optimal.
In [10], an alternative construction relying on non-
combinational logic is given, achieving (up to minor-order
terms) factor 2k + 1 increase in size and additive Θ(log k)
increase in depth of the resulting circuit; for a 2-sort circuit,
k = 2, so these overheads are constant. Rule-of-thumb
calculations suggest that optimized versions of the circuits
presented here and derived by this method would have
comparable performance. A fair and detailed comparison
would require fully-fledged designs of both approaches,
which is beyond the scope of this article. Note, however, that
our design has the advantage of being purely combinational.
Parallel Prefix Computation: Ladner and Fischer
[23] studied the parallel application of an associative operator
to all prefixes of an input string of length ℓ (over an
arbitrary alphabet). They give parallel prefix computation
(PPC) circuits of depth O(log ℓ) and size O(ℓ) (where the circuit
implementing the operator is assumed to have size and
depth 1). However, when requiring optimal depth of ⌈log ℓ⌉,
their corresponding solution suffers from fan-out larger
than ℓ/2. An earlier construction by Kogge and Stone [22]
simultaneously achieves optimal depth and fan-out of 2.
This yields the fastest adder circuits to date (cf. [27]), but
at the expense of a large size of ℓ(⌈log ℓ⌉ − 1) + 1. A number
of additional constructions have been developed for adders,
including special cases ([2], [26]) of the one by Ladner and
Fischer, cf. [31]. However, no other construction achieves
asymptotically optimal depth and size.
3 MODEL AND PROBLEM
In this section, we discuss how to model metastability in
a worst-case fashion and formally specify the input/output
behavior of our circuits. Our model is a simplified version of
the one from [10] for combinational circuits (cf. [9, Chap. 7]).
That is, we represent metastable “bits” by M and extend
truth tables as in Kleene’s 3-valued logic [20, §64].
Basic Notation: We set $[N] := \{0, \ldots, N-1\}$ for $N \in \mathbb{N}$
and $[i, j] = \{i, i+1, \ldots, j\}$ for $i, j \in \mathbb{N}$, $i \le j$. We denote
$\mathbb{B} := \{0, 1\}$ and $\mathbb{B}_{\mathrm{M}} := \{0, 1, \mathrm{M}\}$. For a $B$-bit string $g \in \mathbb{B}_{\mathrm{M}}^B$
and $i \in [1, B]$, denote by $g_i$ its $i$-th bit, i.e., $g = g_1 g_2 \ldots g_B$.
We use the shorthand $g_{i,j} := g_i \ldots g_j$, where $i, j \in [1, B]$
and $i \le j$. Let $\operatorname{par}(g)$ denote the parity of $g \in \mathbb{B}^B$, i.e.,
$\operatorname{par}(g) = \sum_{i=1}^{B} g_i \bmod 2$. For a function $f$ and a set $A$, we
abbreviate $f(A) := \{f(y) \mid y \in A\}$.
3.1 Binary Reflected Gray Code
A standard binary representation of inputs is unsuitable:
uncertainty of the input values may be arbitrarily amplified
by the encoding. For example, representing a value unknown to be 11
or 12, which are encoded as 1011 and 1100, respectively, would result
in the bit string 1MMM, i.e., a string that is metastable in
every position in which the two strings differ. However, 1MMM
may represent any number in the interval from 8 to 15,
amplifying the initial uncertainty of being in the interval
from 11 to 12. An encoding that does not lose precision for
consecutive values is Gray code.
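The amplification described above can be reproduced with a minimal sketch. Strings over 0, 1, M are modeled as Python strings; the helper names `superpose` and `resolutions` are hypothetical and chosen only for this illustration.

```python
def superpose(x, y):
    """Keep agreeing bits, mark every differing position as M."""
    return "".join(a if a == b else "M" for a, b in zip(x, y))

def resolutions(s):
    """All stable strings obtained by replacing each M with 0 or 1."""
    out = [""]
    for c in s:
        out = [p + b for p in out for b in ("01" if c == "M" else c)]
    return out

# Standard binary encodings of 11 and 12 differ in three positions.
unknown = superpose(format(11, "04b"), format(12, "04b"))
values = sorted(int(r, 2) for r in resolutions(unknown))
print(unknown, values[0], values[-1])  # 1MMM 8 15
```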
We use $B$-bit binary reflected Gray code, $\mathrm{rg}_B : [N] \to \mathbb{B}^B$,
which is defined recursively. For simplicity (and without
loss of generality) we set $N := 2^B$. A 1-bit code is
given by $\mathrm{rg}_1(0) = 0$ and $\mathrm{rg}_1(1) = 1$. For $B > 1$, we
start with the first bit fixed to 0 and count with $\mathrm{rg}_{B-1}(\cdot)$
(for the first $2^{B-1}$ codewords), then toggle the first bit to 1,
and finally “count down” $\mathrm{rg}_{B-1}(\cdot)$ while keeping the first bit
fixed again, cf. Table 1. Formally, this yields for $x \in [N]$
$$\mathrm{rg}_B(x) := \begin{cases} 0\,\mathrm{rg}_{B-1}(x) & \text{if } x \in [2^{B-1}] \\ 1\,\mathrm{rg}_{B-1}(2^B - 1 - x) & \text{if } x \in [2^B] \setminus [2^{B-1}]. \end{cases}$$
As each $B$-bit string is a codeword, the code is a bijection
and the encoding function also defines the decoding function.
Denote by $\langle \cdot \rangle : \mathbb{B}^B \to [N]$ the decoding function of a
Gray code string, i.e., for $x \in [N]$, $\langle \mathrm{rg}_B(x) \rangle = x$.
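The recursive definition translates directly into code. The following is a minimal sketch; the names `rg` and `decode` are illustrative, and decoding by table lookup is a simplification that is only sensible for small B.

```python
def rg(B, x):
    """B-bit binary reflected Gray code of x, following the recursion above."""
    if B == 1:
        return str(x)
    if x < 2 ** (B - 1):
        return "0" + rg(B - 1, x)
    return "1" + rg(B - 1, 2 ** B - 1 - x)

def decode(g):
    """Inverse of rg via lookup over all 2^B codewords (small B only)."""
    B = len(g)
    return {rg(B, x): x for x in range(2 ** B)}[g]

print([rg(4, x) for x in range(4)])  # ['0000', '0001', '0011', '0010']
print(decode("0100"))                # 7
```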
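For two binary reflected Gray code strings $g, h \in \mathbb{B}^B$, we
define their maximum and minimum as
$$(\max^{\mathrm{rg}}\{g, h\}, \min^{\mathrm{rg}}\{g, h\}) := \begin{cases} (g, h) & \text{if } \langle g \rangle \ge \langle h \rangle \\ (h, g) & \text{if } \langle g \rangle < \langle h \rangle. \end{cases}$$
For example:
• $\max^{\mathrm{rg}}\{0011, 0100\} = \max^{\mathrm{rg}}\{\mathrm{rg}_4(2), \mathrm{rg}_4(7)\} = 0100$,
• $\min^{\mathrm{rg}}\{0111, 0101\} = \min^{\mathrm{rg}}\{\mathrm{rg}_4(5), \mathrm{rg}_4(6)\} = 0111$.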
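The two examples can be checked with a short sketch. Here decoding uses the well-known prefix-XOR formulation of binary reflected Gray code rather than the recursive definition; the names `decode` and `max_min_rg` are illustrative.

```python
def decode(g):
    """Decode a stable Gray code string: binary bit i is the XOR of g_1..g_i."""
    value, bit = 0, 0
    for c in g:
        bit ^= int(c)
        value = 2 * value + bit
    return value

def max_min_rg(g, h):
    """Return (max, min) of two stable Gray code strings w.r.t. decoded value."""
    return (g, h) if decode(g) >= decode(h) else (h, g)

print(decode("0011"), decode("0100"))  # 2 7
print(decode("0111"), decode("0101"))  # 5 6
print(max_min_rg("0011", "0100"))      # ('0100', '0011')
print(max_min_rg("0111", "0101"))      # ('0101', '0111')
```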
TABLE 1: 4-bit binary reflected Gray code
 #  g_1 g_{2,4}    #  g_1 g_{2,4}    #  g_1 g_{2,4}    #  g_1 g_{2,4}
 0   0   000       4   0   110       8   1   100      12   1   010
 1   0   001       5   0   111       9   1   101      13   1   011
 2   0   011       6   0   101      10   1   111      14   1   001
 3   0   010       7   0   100      11   1   110      15   1   000
TABLE 2: 4-bit valid inputs
g 〈g〉 g 〈g〉 g 〈g〉 g 〈g〉
0000 0 0110 4 1100 8 1010 12
000M − 011M − 110M − 101M −
0001 1 0111 5 1101 9 1011 13
00M1 − 01M1 − 11M1 − 10M1 −
0011 2 0101 6 1111 10 1001 14
001M − 010M − 111M − 100M −
0010 3 0100 7 1110 11 1000 15
0M10 − M100 − 1M10 − − −
3.2 Valid Strings
The inputs to the sorting circuit may have some metastable
bits, which means that the respective signals behave out-
of-spec from the perspective of Boolean logic. Such inputs,
referred to as valid strings, are introduced with the help of
the following operator.
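Definition 3.1 (∗ Operator). For $B \in \mathbb{N}$, define the operator
$\ast : \mathbb{B}_{\mathrm{M}}^B \times \mathbb{B}_{\mathrm{M}}^B \to \mathbb{B}_{\mathrm{M}}^B$ by
$$\forall i \in \{1, \ldots, B\} : (x \ast y)_i := \begin{cases} x_i & \text{if } x_i = y_i \\ \mathrm{M} & \text{else.} \end{cases}$$
Observation 3.2. The operator $\ast$ is associative and commutative.
Hence, for a set $S = \{x^{(1)}, \ldots, x^{(k)}\}$ of $B$-bit strings, we can
use the shorthand $\ast S := \ast_{x \in S}\, x := x^{(1)} \ast x^{(2)} \ast \ldots \ast x^{(k)}$. We
call $\ast S$ the superposition of the strings in $S$.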
Valid strings have at most one metastable bit. If this bit
resolves to either 0 or 1, the resulting string encodes either
x or x+ 1 for some x, cf. Table 2.
Definition 3.3 (Valid Strings). Let $B \in \mathbb{N}$ and $N = 2^B$. Then,
the set of valid strings of length $B$ is
$$S^B_{\mathrm{rg}} := \mathrm{rg}_B([N]) \cup \bigcup_{x \in [N-1]} \{\mathrm{rg}_B(x) \ast \mathrm{rg}_B(x+1)\}.$$
3.3 Resolution and Closure
To extend the specification of $\max^{\mathrm{rg}}$ and $\min^{\mathrm{rg}}$ to valid
strings, we make use of the metastable closure [10]. The
metastable closure is defined over the possible resolutions
of metastable bits.
Definition 3.4 (Resolution [10]). For $x \in \mathbb{B}_{\mathrm{M}}^B$, define the
resolution $\operatorname{res} : \mathbb{B}_{\mathrm{M}}^B \to \mathcal{P}(\mathbb{B}^B)$ as follows:
$$\operatorname{res}(x) := \{y \in \mathbb{B}^B \mid \forall i \in \{1, \ldots, B\} : x_i \ne \mathrm{M} \Rightarrow y_i = x_i\}.$$
Thus, $\operatorname{res}(x)$ is the set of all strings obtained by replacing
all Ms in $x$ by either 0 or 1: M acts as a “wild card.” For any
$x$ and $y$, we have that $\operatorname{res}(xy) = \operatorname{res}(x) \operatorname{res}(y)$.
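A minimal sketch of the resolution operator, modeling strings over 0, 1, M as Python strings (the function name `res` mirrors the notation; the representation is an assumption for illustration):

```python
from itertools import product

def res(x):
    """Set of all stable strings obtained by replacing every M by 0 or 1."""
    choices = ["01" if c == "M" else c for c in x]
    return {"".join(p) for p in product(*choices)}

print(sorted(res("0M10")))  # ['0010', '0110']

# Concatenation property res(xy) = res(x) res(y) on one example.
x, y = "0M", "M1"
print(res(x + y) == {a + b for a in res(x) for b in res(y)})  # True
```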
We note two observations for later use.
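Observation 3.5. For any $x \in \mathbb{B}_{\mathrm{M}}^B$, $\ast \operatorname{res}(x) = x$.
Proof. Let $x \in \mathbb{B}_{\mathrm{M}}^B$ and let $I$ be the set of indices where $x$ is
stable, i.e., $i \in I$ iff $x_i \ne \mathrm{M}$. From Definition 3.4, we get that
$$\forall i \in \{1, \ldots, B\} : i \in I \Leftrightarrow \{x_i\} = \operatorname{res}(x_i).$$
By Definition 3.1 and Observation 3.2,
$$\forall i \in \{1, \ldots, B\} : \{(\ast \operatorname{res}(x))_i\} = \begin{cases} \operatorname{res}(x_i) & \text{if } i \in I \\ \{\mathrm{M}\} & \text{else.} \end{cases}$$
This entails the claim $\ast \operatorname{res}(x) = x$.
For example: $\ast \operatorname{res}(0\mathrm{M}10) = \ast\{0010, 0110\} = 0\mathrm{M}10$.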
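Observation 3.6. For $\emptyset \ne S \subseteq \mathbb{B}^B$, we have $S \subseteq \operatorname{res}(\ast S)$.
Proof. Let $\emptyset \ne S \subseteq \mathbb{B}^B$ and $s \in S$. Define $I$ as the set of
indices where $\ast S$ is stable, i.e.,
$$\forall i \in \{1, \ldots, B\} : i \in I \Leftrightarrow (\ast S)_i \ne \mathrm{M}.$$
From Definition 3.1 and Observation 3.2, we conclude for
$i \in \{1, \ldots, B\}$:
$$(\ast S)_i = \begin{cases} s_i & \text{if } i \in I \\ \mathrm{M} & \text{else.} \end{cases}$$
Since by Definition 3.4 each combination of replacing Ms by
0s and 1s occurs in $\operatorname{res}(\ast S)$, we conclude that
$$\exists x \in \operatorname{res}(\ast S) : \forall i \notin I : x_i = s_i.$$
Since $\forall i \in \{1, \ldots, B\} : i \in I \Leftrightarrow \{x_i\} = \operatorname{res}(x_i)$, $s \in \operatorname{res}(\ast S)$.
This proves the claim $S \subseteq \operatorname{res}(\ast S)$.
We observe that in general the reverse direction does not
hold, i.e., $\operatorname{res}(\ast S) \not\subseteq S$. For example, consider $S = \{01, 10\}$
and thus $\ast S = \mathrm{MM}$ such that $\operatorname{res}(\ast S) = \{00, 01, 10, 11\} = \mathbb{B}^2$.
Hence, $S \subseteq \operatorname{res}(\ast S)$ but not $\operatorname{res}(\ast S) \subseteq S$. In contrast,
for $|\operatorname{res}(\ast S)| \le 2$, we can see that the reverse direction
holds.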
Observation 3.7. For any nonempty subset of strings $S \subseteq \mathbb{B}^B$, if
$|\operatorname{res}(\ast S)| \le 2$, then $\operatorname{res}(\ast S) = S$.
Proof. Since $|\operatorname{res}(\ast S)| \le 2$, $\ast S$ can contain at most one M bit, so
$S$ can contain at most two strings, and these differ in exactly one
position. It is then straightforward to show that every string
in $\operatorname{res}(\ast S)$ is an element of $S$. Together with Observation 3.6
this shows the equality.
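Observations 3.6 and 3.7 can be illustrated on the examples from the text. This is a small sketch with illustrative helper names `star` and `res`, again modeling strings over 0, 1, M as Python strings.

```python
from functools import reduce
from itertools import product

def star(x, y):
    """Superposition of two strings: differing positions become M."""
    return "".join(a if a == b else "M" for a, b in zip(x, y))

def res(x):
    """All stable resolutions of x."""
    return {"".join(p) for p in product(*["01" if c == "M" else c for c in x])}

S = {"01", "10"}
sup = reduce(star, S)
print(sup, sorted(res(sup)))          # MM ['00', '01', '10', '11']
print(S <= res(sup), res(sup) <= S)   # True False

T = {"0010", "0110"}                  # two strings differing in one position
print(res(reduce(star, T)) == T)      # True
```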
The metastable closure of an operator on binary inputs
extends it to inputs that may contain metastable bits. This is
done by considering all resolutions of the inputs, applying
the operator, and taking the superposition of the results.
Definition 3.8 (The M Closure [10]). Given an operator
$f : \mathbb{B}^n \to \mathbb{B}^m$, its metastable closure $f_{\mathrm{M}} : \mathbb{B}_{\mathrm{M}}^n \to \mathbb{B}_{\mathrm{M}}^m$ is defined
by $f_{\mathrm{M}}(x) := \ast\{f(x') \mid x' \in \operatorname{res}(x)\}$. Recalling the basic notation,
we abbreviate this by $f_{\mathrm{M}}(x) = \ast f(\operatorname{res}(x))$.
The closure is the best one can achieve w.r.t. containing
metastability with clocked logic using standard registers [10],
i.e., when $f_{\mathrm{M}}(x)_i = \mathrm{M}$, no such implementation
can guarantee that the $i$th output bit stabilizes in a timely
fashion.
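The closure can be sketched generically as a higher-order function. The example operator, a stable sorter for two single bits, is a hypothetical choice for illustration; `closure`, `res`, and `star` are illustrative names.

```python
from functools import reduce
from itertools import product

def star(x, y):
    return "".join(a if a == b else "M" for a, b in zip(x, y))

def res(x):
    return {"".join(p) for p in product(*["01" if c == "M" else c for c in x])}

def closure(f):
    """Metastable closure: superpose f over all resolutions of the input."""
    return lambda x: reduce(star, [f(r) for r in sorted(res(x))])

# Hypothetical stable operator: sort two bits as (max, min).
sort_bits = lambda ab: max(ab) + min(ab)
sort_bits_M = closure(sort_bits)

print(sort_bits_M("M0"))  # M0
print(sort_bits_M("M1"))  # 1M
print(sort_bits_M("MM"))  # MM
```

For input M1, the larger bit is 1 regardless of how M resolves, so only the second output bit is metastable.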
3.4 Output Specification
We want to construct a circuit computing the maximum and
minimum of two valid strings, enabling us to build sorting
networks for valid strings. First, however, we need to answer
the question of what it means to ask for the maximum
or minimum of valid strings. To this end, suppose a valid
string is $\mathrm{rg}_B(x) \ast \mathrm{rg}_B(x+1)$ for some $x \in [N-1]$, i.e.,
the string contains a metastable bit that makes it uncertain
whether the represented value is $x$ or $x+1$. If we wait for
metastability to resolve, the string will stabilize to either
$\mathrm{rg}_B(x)$ or $\mathrm{rg}_B(x+1)$. Accordingly, it makes sense to consider
$\mathrm{rg}_B(x) \ast \mathrm{rg}_B(x+1)$ “in between” $\mathrm{rg}_B(x)$ and $\mathrm{rg}_B(x+1)$,
resulting in the following total order on valid strings (cf. Table 2).
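Definition 3.9 ($\prec$). We define a total order $\prec$ on valid strings
as follows. For $g, h \in \mathbb{B}^B$, $g \prec h \Leftrightarrow \langle g \rangle < \langle h \rangle$. For each $x \in [N-1]$,
we define $\mathrm{rg}_B(x) \prec \mathrm{rg}_B(x) \ast \mathrm{rg}_B(x+1) \prec \mathrm{rg}_B(x+1)$.
We extend the resulting relation on $S^B_{\mathrm{rg}} \times S^B_{\mathrm{rg}}$ to a total order by
taking the transitive closure. Note that this also defines $\preceq$, via
$g \preceq h \Leftrightarrow (g = h \vee g \prec h)$.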
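We intend to sort with respect to this order. It turns out
that implementing a 2-sort circuit w.r.t. this order amounts
to implementing the metastable closure of $\max^{\mathrm{rg}}$ and $\min^{\mathrm{rg}}$.
Lemma 3.10. Let $g, h \in S^B_{\mathrm{rg}}$. Then
$$g \preceq h \Leftrightarrow (\max^{\mathrm{rg}}_{\mathrm{M}}\{g, h\}, \min^{\mathrm{rg}}_{\mathrm{M}}\{g, h\}) = (h, g).$$
Proof. If $g \prec h$, Definitions 3.3 and 3.9 imply for all
$g' \in \operatorname{res}(g)$ and all $h' \in \operatorname{res}(h)$ that $g' \preceq h'$ (cf. Table 2).
Observation 3.5 shows that $\ast \operatorname{res}(g) = g$ for any $g \in S^B_{\mathrm{rg}}$. From
Definition 3.8, we can thus conclude that $\max^{\mathrm{rg}}_{\mathrm{M}}\{g, h\} = \ast \operatorname{res}(h) = h$
and $\min^{\mathrm{rg}}_{\mathrm{M}}\{g, h\} = \ast \operatorname{res}(g) = g$.
If $h \prec g$, analogous reasoning shows that
$$(\max^{\mathrm{rg}}_{\mathrm{M}}\{g, h\}, \min^{\mathrm{rg}}_{\mathrm{M}}\{g, h\}) = (g, h) \ne (h, g).$$
The remaining case is that $g = h$. In the case where $g$ does
not contain an M bit, we have $\max^{\mathrm{rg}}_{\mathrm{M}}\{g, h\} = \max^{\mathrm{rg}}\{g, h\} = g = h$.
Considering the second case ($g = h = \mathrm{rg}_B(x) \ast \mathrm{rg}_B(x+1)$,
for $x \in [2^B - 1]$), we get that $\max^{\mathrm{rg}}_{\mathrm{M}}\{g, h\} = \ast\{\mathrm{rg}_B(x), \mathrm{rg}_B(x+1)\} = \mathrm{rg}_B(x) \ast \mathrm{rg}_B(x+1) = h$. Likewise,
$\min^{\mathrm{rg}}_{\mathrm{M}}\{g, h\} = h = g$.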
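In other words, $\max^{\mathrm{rg}}_{\mathrm{M}}$ and $\min^{\mathrm{rg}}_{\mathrm{M}}$ are the max and min
operators w.r.t. the total order on valid strings shown in
Table 2, e.g.,
• $\max^{\mathrm{rg}}_{\mathrm{M}}\{1001, 1000\} = \mathrm{rg}_4(15) = 1000$,
• $\max^{\mathrm{rg}}_{\mathrm{M}}\{0\mathrm{M}10, 0010\} = \mathrm{rg}_4(3) \ast \mathrm{rg}_4(4) = 0\mathrm{M}10$, and
• $\max^{\mathrm{rg}}_{\mathrm{M}}\{0\mathrm{M}10, 0110\} = \mathrm{rg}_4(4) = 0110$.
Hence, our task is to implement $\max^{\mathrm{rg}}_{\mathrm{M}}$ and $\min^{\mathrm{rg}}_{\mathrm{M}}$.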
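Definition 3.11 (2-sort(B)). For $B \in \mathbb{N}$, a 2-sort(B) circuit is
specified as follows.
Input: $g, h \in S^B_{\mathrm{rg}}$,
Output: $g', h' \in S^B_{\mathrm{rg}}$,
Functionality: $g' = \max^{\mathrm{rg}}_{\mathrm{M}}\{g, h\}$, $h' = \min^{\mathrm{rg}}_{\mathrm{M}}\{g, h\}$.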
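The specification can be evaluated by brute force over all resolutions, which is useful as a reference when testing a circuit. The following sketch is only such a reference, not the circuit itself; the function name `two_sort_spec` and the helpers are illustrative assumptions.

```python
from functools import reduce
from itertools import product

def star(x, y):
    return "".join(a if a == b else "M" for a, b in zip(x, y))

def res(x):
    return {"".join(p) for p in product(*["01" if c == "M" else c for c in x])}

def decode(g):
    """Decode a stable Gray code string via prefix XOR."""
    value, bit = 0, 0
    for c in g:
        bit ^= int(c)
        value = 2 * value + bit
    return value

def two_sort_spec(g, h):
    """Closure of (max, min) on Gray code: superpose over all resolutions."""
    results = []
    for gs, hs in product(sorted(res(g)), sorted(res(h))):
        mx, mn = (gs, hs) if decode(gs) >= decode(hs) else (hs, gs)
        results.append(mx + mn)
    out = reduce(star, results)
    return out[:len(g)], out[len(g):]

print(two_sort_spec("1001", "1000"))  # ('1000', '1001')
print(two_sort_spec("0M10", "0010"))  # ('0M10', '0010')
print(two_sort_spec("0M10", "0110"))  # ('0110', '0M10')
```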
3.5 Computational Model and CMOS Logic
We seek to use standard components and combinational
logic only. We use the model of [10], which specifies the behavior
of basic gates on metastable inputs via the metastable
closure of their behavior on binary inputs, cf. Table 3. We use
the standard notational convention that $a + b = \mathrm{OR}_{\mathrm{M}}(a, b)$
and $ab = \mathrm{AND}_{\mathrm{M}}(a, b)$.
TABLE 3: Extensions to metastable inputs of AND (left), OR
(center), and an inverter (right) according to Kleene logic.
AND (rows: a, columns: b)
a\b  0  1  M
0    0  0  0
1    0  1  M
M    0  M  M
OR (rows: a, columns: b)
a\b  0  1  M
0    0  1  M
1    1  1  1
M    M  1  M
Inverter
a  ā
0  1
1  0
M  M
Note that in this logic, most familiar identities hold:
AND and OR are associative, commutative, and distributive,
and De Morgan’s laws hold. However, naturally the
law of the excluded middle becomes void. For instance, in
general, OR(x, x̄) ≠ 1, as OR(M, M̄) = OR(M, M) = M.
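These identities can be checked exhaustively. The sketch below uses the standard characterization of Kleene's strong three-valued logic with the order 0 < M < 1, where AND is the minimum, OR is the maximum, and NOT mirrors the order; the integer encoding is an implementation choice made only for this illustration.

```python
from itertools import product

# Encode 0 < M < 1 as integers 0, 1, 2.
VAL = {"0": 0, "M": 1, "1": 2}
SYM = {v: k for k, v in VAL.items()}

def AND(a, b):
    return SYM[min(VAL[a], VAL[b])]

def OR(a, b):
    return SYM[max(VAL[a], VAL[b])]

def NOT(a):
    return SYM[2 - VAL[a]]

for a, b, c in product("01M", repeat=3):
    assert AND(a, OR(b, c)) == OR(AND(a, b), AND(a, c))  # distributivity
    assert AND(AND(a, b), c) == AND(a, AND(b, c))        # associativity
    assert NOT(AND(a, b)) == OR(NOT(a), NOT(b))          # De Morgan

print(OR("M", NOT("M")))  # M, so the law of the excluded middle fails
```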
We now argue that basic CMOS gates behave according
to this logic, justifying the model. For the sake of an intuitive
notation, we apply some slightly unusual conventions. In
the following, let R1 be a wildcard that can refer to any
resistance that is “low”, i.e., close to being negligible, as
e.g. that of a transistor in its stable conducting state (i.e.,
any PMOS transistor subjected to a low gate voltage or
any NMOS transistor subjected to a high gate voltage).
Similarly, denote by $R_0$ any resistance that is “high”, i.e., large
compared to $R_1$, such as the resistance of a transistor in
its stable non-conducting state. Thus, with a stable input
$b \in \mathbb{B}$ (where we identify 0 with low and 1 with high
voltage), an NMOS transistor attains resistance $R_b$, while
a PMOS transistor attains resistance $R_{\bar{b}}$. We can extend this
to unstable inputs M by making the conservative assumption
that $R_{\mathrm{M}}$ is an arbitrary (possibly time-dependent) resistance.
With this notation, we can see that parallel and serial
composition of transistors implements OR and AND in
Kleene logic, respectively.
Lemma 3.12. For $k \in \mathbb{N}$ sufficiently small so that $kR_1 \ll R_0$,
let $a_1, \ldots, a_k \in \mathbb{B}_{\mathrm{M}}$ be input signals fed to $k$ NMOS transistors
interconnected (i) in parallel or (ii) sequentially. Set
$\sigma := \sum_{i=1}^{k} a_i$ and $\pi := \prod_{i=1}^{k} a_i$, i.e., the OR resp. AND over
all inputs. Then the resistance between input and output of the
resulting subcircuit is (roughly) (i) $R_\sigma$ resp. (ii) $R_\pi$.
Proof. Denote by $R$ the resistance between the input and
output of the subcircuit. Suppose first that $\sigma = 0$, i.e., $a_i = 0$
for all $i$. Then, for parallel composition, we get that
$1/R = \sum_{i=1}^{k} 1/R_{a_i} = k/R_0$, yielding that $R \ge R_0/k$. On the other
hand, if $\sigma = 1$, there is an index $i$ such that $a_i = 1$, yielding
for parallel composition that $1/R = \sum_{i=1}^{k} 1/R_{a_i} \ge 1/R_1$
and thus $R \le R_1$. This shows (i).
Now consider sequential composition and suppose first
that $\pi = 1$, i.e., $a_i = 1$ for all $i$. Thus, $R = \sum_{i=1}^{k} R_{a_i} \le kR_1$.
In case $\pi = 0$, there is some index $i$ so that $a_i = 0$, implying
that $R = \sum_{i=1}^{k} R_{a_i} \ge R_0$.
The same arguments apply to PMOS transistors.
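A quick numeric illustration of the bounds in the proof of Lemma 3.12, for stable inputs only. The resistance values and the choice k = 3 are hypothetical and serve only to make the inequalities tangible.

```python
import math

R1, R0 = 1e3, 1e9  # hypothetical "low" and "high" resistances in ohms
k = 3              # number of transistors; k * R1 is far below R0

def parallel(rs):
    """Total resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in rs)

def series(rs):
    """Total resistance of resistors connected in series."""
    return sum(rs)

r = {0: R0, 1: R1}  # an NMOS transistor with stable input b has resistance R_b

# Parallel: all inputs 0 gives R0 / k (still high); a single 1 gives at most R1.
assert math.isclose(parallel([r[0]] * k), R0 / k)
assert parallel([r[1]] + [r[0]] * (k - 1)) <= R1

# Series: all inputs 1 gives k * R1 (still low); a single 0 gives at least R0.
assert series([r[1]] * k) <= k * R1
assert series([r[0]] + [r[1]] * (k - 1)) >= R0
```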
Corollary 3.13. For $k \in \mathbb{N}$ sufficiently small so that $kR_1 \ll R_0$,
let $a_1, \ldots, a_k \in \mathbb{B}_{\mathrm{M}}$ be input signals fed to $k$ PMOS transistors
interconnected (i) in parallel or (ii) sequentially. Set
$\sigma := \sum_{i=1}^{k} \bar{a}_i$ and $\pi := \prod_{i=1}^{k} \bar{a}_i$, i.e., the OR resp. AND over
all inputs. Then the resistance between input and output of the
resulting subcircuit is (roughly) (i) $R_\sigma$ resp. (ii) $R_\pi$.
Fig. 1: Standard transistor-level implementations of inverter
(left), NAND (center), and NOR (right) gates in CMOS
technology. The latter two can be turned into AND and OR,
respectively, by appending an inverter.
We remark that the factor of k reduction in the gap
between R1 and R0 may imply that a gate’s output signal
needs to be regenerated using a buffer. However, this is the
same behavior as for logic that assumes stable signals only,
so standard CMOS design techniques account for this.
From the above observations, we can readily infer that
standard CMOS gate implementations behave according
to Kleene logic in the face of potentially metastable signals,
justifying the model from [10].
Theorem 3.14. The CMOS gates depicted in Figure 1 implement
the truth tables given in Table 3.
Proof. The output of the gates is 1 (high voltage) if the
resistances from VDD and VSS to the output are low (i.e.,
roughly R1) and high (R0), respectively. Similarly, it is 0 if
the roles are reversed. Thus, Lemma 3.12 and Corollary 3.13
show the claim for stable entries of the truth tables. For the
unstable ones, setting RM (which is a wildcard for an arbi-
trary resistance) to R0 or R1, respectively, leads to different
outcomes. Thus, the output voltage may attain almost any
value between VDD and VSS , i.e., the output is M.
Similar reasoning applies to many gates, e.g., NAND
and NOR gates. We stress, however, that the property of
implementing the closure of the function computed by the
gate on stable values is not universal for CMOS logic. For
instance, standard transistor-level multiplexer implementa-
tions do not handle metastability well, cf. [11].
4 DECOMPOSITION OF THE TASK
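In this section, we show that computing $\max^{\mathrm{rg}}_{\mathrm{M}}\{g, h\}$ and
$\min^{\mathrm{rg}}_{\mathrm{M}}\{g, h\}$ for valid strings $g, h \in S^B_{\mathrm{rg}}$ can be broken down
into composing simple operators in $\mathbb{B}_{\mathrm{M}}^2 \times \mathbb{B}_{\mathrm{M}}^2 \to \mathbb{B}_{\mathrm{M}}^2$.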
4.1 Comparing Stable Gray Codes via an FSM
Figure 2 depicts a finite state machine performing a four-valued
comparison of two Gray code strings. In each step of
processing inputs $g, h \in \mathbb{B}^B$, it is fed the pair of $i$th input bits
$g_i h_i$. In the following, we denote by $s^{(i)}(g, h)$ the state of the
machine after $i$ steps, where $s^{(0)}(g, h) := 00$ is the starting
state. For ease of notation, we will omit the arguments $g$
and $h$ of $s^{(i)}$ whenever they are clear from context. Table 4
shows an example of a run of the finite state machine.
Fig. 2: Finite state machine determining which of two Gray
code inputs $g, h \in \mathbb{B}^B$ is larger. In each step, it receives $g_i h_i$
as input. State encoding is given in square brackets.
(States: [00] initial state, prefixes equal with parity 0; [11] prefixes
equal with parity 1; [01] $g \prec h$; [10] $g \succ h$.)
TABLE 4: Run of the FSM on inputs g = 1001 and h = 1000
i                                                  0   1   2   3   4
$g_i h_i$                                              11  00  00  10
$s^{(i)} = s^{(i-1)} \diamond g_i h_i$             00  11  11  11  01
$g'_i = \mathrm{out}(s^{(i-1)}, g_i h_i)_1$            1   0   0   0
$h'_i = \mathrm{out}(s^{(i-1)}, g_i h_i)_2$            1   0   0   1
Because the parity keeps track of whether the remaining
bits are to be compared w.r.t. the standard or “reflected”
order, the state machine performs the comparison correctly
w.r.t. the meaning of the states indicated in Figure 2.
Lemma 4.1. Let $g, h \in \mathbb{B}^B$ and $i \in [B+1]$. Then
• $s^{(i)} = 00$ is equivalent to $g_{1,i} = h_{1,i}$ and $g \prec h$ if and only
if $g_{i+1,B} \prec h_{i+1,B}$,
• $s^{(i)} = 11$ is equivalent to $g_{1,i} = h_{1,i}$ and $g \prec h$ if and only
if $g_{i+1,B} \succ h_{i+1,B}$,
• $s^{(i)} = 01$ is equivalent to $g \prec h$, and
• $s^{(i)} = 10$ is equivalent to $g \succ h$.
Proof. We show the claim by induction on $i$. It holds for
$i = 0$, as $s^{(0)} = 00$, $g_{1,0} = h_{1,0}$ is the empty string, and
$g \prec h$ if and only if $g_{1,B} = g \prec h = h_{1,B}$. For the step from
$i - 1 \in [B]$ to $i$, we make a case distinction based on $s^{(i-1)}$.
$s^{(i-1)} = 00$: By the induction hypothesis, $g_{1,i-1} = h_{1,i-1}$
and $g \prec h$ if and only if $g_{i,B} \prec h_{i,B}$. Thus, if $g_i h_i = 00$,
$s^{(i)} = 00$, $g_{1,i} = h_{1,i}$, and by the recursive definition of
the code, $g_{i,B} \prec h_{i,B} \Leftrightarrow g_{i+1,B} \prec h_{i+1,B}$. Similarly, if
$g_i h_i = 11$, also $g_{1,i} = h_{1,i}$, but the code for the remaining
bits is “reflected,” i.e., $g \prec h \Leftrightarrow g_{i+1,B} \succ h_{i+1,B}$. If
$g_i h_i = 01$, the definition implies that $g \prec h$ regardless
of further bits, and if $g_i h_i = 10$, $g \succ h$ regardless of
further bits.
$s^{(i-1)} = 11$: Analogously to the previous case, noting that
reflecting a second time results in the original order.
$s^{(i-1)} = 01$: By the induction hypothesis, $g \prec h$. As 01 is an
absorbing state, also $s^{(i)} = 01$.
$s^{(i-1)} = 10$: By the induction hypothesis, $g \succ h$. As 10 is an
absorbing state, also $s^{(i)} = 10$.
This lemma gives rise to a sequential implementation of
2-sort(B) based on the given state machine, for input strings
in $\mathbb{B}^B$. Table 5 lists the $i$th output bit as function of $s^{(i-1)}$
and the pair $g_i h_i$. Correctness of this computation follows
immediately from Lemma 4.1.
TABLE 5: Computing $\max^{\mathrm{rg}}\{g, h\}_i$ and $\min^{\mathrm{rg}}\{g, h\}_i$ from
the current state $s^{(i-1)}$ and inputs $g_i$ and $h_i$.
$s^{(i-1)}$   $\max^{\mathrm{rg}}\{g, h\}_i$   $\min^{\mathrm{rg}}\{g, h\}_i$
00            $\max\{g_i, h_i\}$               $\min\{g_i, h_i\}$
10            $g_i$                            $h_i$
11            $\min\{g_i, h_i\}$               $\max\{g_i, h_i\}$
01            $h_i$                            $g_i$
TABLE 6: Operators for next state and output. The first
operand is the current state (row), the second is the next input (column).
(a) The $\diamond$ operator.
⋄    00  01  11  10
00   00  01  11  10
01   01  01  01  01
11   11  10  00  01
10   10  10  10  10
(b) The out operator.
out  00  01  11  10
00   00  10  11  10
01   00  10  11  01
11   00  01  11  01
10   00  01  11  10
We can express the transition function of the state machine
as an (as easily verified) associative operator $\diamond$ taking
the current state and input $g_i h_i$ as argument and returning
the new state. Then $s^{(i)} = s^{(i-1)} \diamond g_i h_i$, where $\diamond$ is given in
Table 6a and $s^{(0)} = 00$. The out operator is derived from
Table 5 by evaluating $\max^{\mathrm{rg}}\{g, h\}_i$ and $\min^{\mathrm{rg}}\{g, h\}_i$ for all
possible values of $g_i h_i \in \mathbb{B}^2$. Noting that $s^{(0)} \diamond x = 00 \diamond x = x$
for all $x \in \mathbb{B}^2$, we arrive at the following corollary.
Corollary 4.2. For all $i \in [1, B]$, we have that
$$\max^{\mathrm{rg}}\{g, h\}_i \, \min^{\mathrm{rg}}\{g, h\}_i = \mathrm{out}\left(\mathop{\diamond}_{j=1}^{i-1} g_j h_j, \; g_i h_i\right).$$
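Associativity of the next-state operator on stable values can be verified exhaustively, and the prefix states then follow from a prefix computation over the input pairs. The sketch below uses a sequential scan in place of a parallel circuit; names such as `diamond` and `ROWS` are illustrative.

```python
from itertools import accumulate, product

STATES = ["00", "01", "11", "10"]  # also the column order of Table 6a
ROWS = {
    "00": ["00", "01", "11", "10"],
    "01": ["01", "01", "01", "01"],
    "11": ["11", "10", "00", "01"],
    "10": ["10", "10", "10", "10"],
}

def diamond(s, x):
    """Next-state operator on stable 2-bit values, read off Table 6a."""
    return ROWS[s][STATES.index(x)]

# Exhaustive associativity check over all 4^3 triples.
assert all(
    diamond(diamond(a, b), c) == diamond(a, diamond(b, c))
    for a, b, c in product(STATES, repeat=3)
)

# Since 00 is a left identity, s^(i) is the prefix "product" of g_j h_j.
g, h = "1001", "1000"
pairs = [a + b for a, b in zip(g, h)]
print(list(accumulate(pairs, diamond)))  # ['11', '11', '11', '01']
```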
Our goal in this section is to extend this approach to
potentially metastable inputs.
4.2 Dealing with Metastable Inputs
Our strategy is to replace all involved operators by their
metastable closure: for $i \in [1, B]$, (i) compute $s^{(i)}_M$, (ii) determine
$\max^{\mathrm{rg}}_M\{g,h\}_i$ and $\min^{\mathrm{rg}}_M\{g,h\}_i$ according to Table 5,
and finally (iii) exploit associativity of the operator computing
the state $s^{(i)}_M$ for usage in the PPC framework ([23], see
Section 5). Thus, we only need to implement $\diamond_M$ and $\mathrm{out}_M$
(both of constant size), plug them into the framework, and
immediately obtain an efficient circuit.
The reader may ask why we compute $s^{(i)}_M$ for all $i \in [0, B-1]$
instead of computing only $s^{(B)}_M$ with a simple tree
of $\diamond_M$ elements, which would yield a smaller circuit. Since
$s^{(B)}_M$ is the result of the comparison of the entire strings, it
could be used to compute all outputs, i.e., we could compute
the output by $\mathrm{out}_M(s^{(B)}_M, g_i h_i)$ instead of $\mathrm{out}_M(s^{(i-1)}_M, g_i h_i)$.
However, in case of metastability, this may lead to incorrect
results. This can be seen in the example run of the FSM given
in Table 7. We thus compute every intermediate state $s^{(i)}_M$.
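The run from Table 7 can be replayed with a small Python sketch. It assumes the usual definition of the metastable closure used in this paper (apply the stable operator to all resolutions of the operands and superpose the results); the helper names `res`, `star` and `closure` are hypothetical, and the two dictionaries transcribe Tables 6a and 6b.

```python
from itertools import product

COLS = ["00", "01", "11", "10"]
DIAMOND = {"00": ["00", "01", "11", "10"], "01": ["01", "01", "01", "01"],
           "11": ["11", "10", "00", "01"], "10": ["10", "10", "10", "10"]}
OUT = {"00": ["00", "10", "11", "10"], "01": ["00", "10", "11", "01"],
       "11": ["00", "01", "11", "01"], "10": ["00", "01", "11", "10"]}


def res(x):
    """All stable resolutions of a string over {0, 1, M}."""
    return {"".join(p) for p in product(*[("01" if c == "M" else c) for c in x])}


def star(strings):
    """Superposition: positions where all strings agree stay stable, others become M."""
    return "".join(cs[0] if len(set(cs)) == 1 else "M" for cs in zip(*strings))


def closure(table):
    """Metastable closure of a two-operand operator given as a table."""
    def op(s, x):
        return star([table[a][COLS.index(b)] for a in res(s) for b in res(x)])
    return op


diamond_m, out_m = closure(DIAMOND), closure(OUT)

g, h = "0M10", "0010"
states = ["00"]                                  # s_M^(0)
for gi, hi in zip(g, h):
    states.append(diamond_m(states[-1], gi + hi))

# Outputs computed from the preceding state s_M^(i-1) ...
from_prefix = [out_m(states[i], g[i] + h[i]) for i in range(len(g))]
# ... versus outputs computed from the final state s_M^(B) only.
from_final = [out_m(states[-1], g[i] + h[i]) for i in range(len(g))]

print(states)       # ['00', '00', 'M0', '1M', '1M']
print(from_prefix)  # ['00', 'M0', '11', '00']
print(from_final)   # ['00', 'MM', '11', '00']
```

The second entry of `from_final` loses precision (MM instead of M0), which is the effect the table illustrates.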
Unfortunately, even with this modification it is not obvi-
ous that our approach yields correct outputs. There are three
hurdles to overcome:
(P1) Show that M is associative.
7TABLE 7: Run of the FSM on inputs g = 0M10 and h = 0010,
showing that computing only the last state is insufficient.
This yields outM(1M,M0) = ∗{00, 01, 10} = MM as second
output, but outM(00,M0) = ∗{00, 10} = M0 is correct.
i 0 1 2 3 4
gihi 00 M0 11 00
s
(i)
M = s
(i−1)
M M gihi 00 00 M0 1M 1M
outM(s
(4)
M , gihi) 00 MM 11 00
outM(s
(i−1)
M , gihi) 00 M0 11 00
TABLE 8: The M operator. The first operand is the current
state, the second are the next input bits.
M 00 0M 01 M1 11 1M 10 M0 MM
00 00 0M 01 M1 11 1M 10 M0 MM
0M 0M 0M 01 M1 M1 MM MM MM MM
01 01 01 01 01 01 01 01 01 01
M1 M1 MM MM MM 0M 0M 01 M1 MM
11 11 1M 10 M0 00 0M 01 M1 MM
1M 1M 1M 10 M0 M0 MM MM MM MM
10 10 10 10 10 10 10 10 10 10
M0 M0 MM MM MM 1M 1M 10 M0 MM
MM MM MM MM MM MM MM MM MM MM
(P2) Show that repeated application of M computes s(i)M .
(P3) Show that applying outM to s
(i−1)
M and gihi results for
all valid strings in maxrgM {g, h}i minrgM {g, h}i.
Regarding the first point, we note that the statement that $\diamond_M$
is associative does not depend on B. In other words, it can
be verified by checking for all possible $x, y, z \in \mathbb{B}_M^2$ whether
$(x \diamond_M y) \diamond_M z = x \diamond_M (y \diamond_M z)$. While it is tractable to manually
verify all $3^6 = 729$ cases (exploiting various symmetries
and other properties of the operator), it is tedious and prone
to errors. Instead, we verified that both evaluation orders
result in the same outcome by a short computer program.
Theorem 4.3. (P1) holds, i.e., $\diamond_M$ is associative.
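A check of this kind takes only a few lines. The following Python sketch is not the authors' program; it is one possible way to carry out the verification, assuming that $\diamond_M$ is obtained from Table 6a by resolving both operands in all possible ways and superposing the results. The names `res`, `star`, `diamond_m`, `DOMAIN` and `violations` are hypothetical.

```python
from itertools import product

COLS = ["00", "01", "11", "10"]
DIAMOND = {"00": ["00", "01", "11", "10"], "01": ["01", "01", "01", "01"],
           "11": ["11", "10", "00", "01"], "10": ["10", "10", "10", "10"]}


def res(x):
    """All stable resolutions of a string over {0, 1, M}."""
    return {"".join(p) for p in product(*[("01" if c == "M" else c) for c in x])}


def star(strings):
    """Bitwise superposition of a collection of equally long stable strings."""
    return "".join(cs[0] if len(set(cs)) == 1 else "M" for cs in zip(*strings))


def diamond_m(x, y):
    """Metastable closure of the stable operator from Table 6a."""
    return star([DIAMOND[a][COLS.index(b)] for a in res(x) for b in res(y)])


# All nine elements of B_M^2 and all 9^3 = 729 triples.
DOMAIN = ["".join(p) for p in product("01M", repeat=2)]
violations = [
    (x, y, z)
    for x, y, z in product(DOMAIN, repeat=3)
    if diamond_m(diamond_m(x, y), z) != diamond_m(x, diamond_m(y, z))
]
print(len(DOMAIN) ** 3, len(violations))
```

An empty `violations` list means that both evaluation orders agree on every triple; individual entries such as `diamond_m("0M", "11")` can also be compared against Table 8.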
Apart from being essential for our construction, this
theorem simplifies notation; in the following, we may write
\[
(\diamond_M)_{i=1}^{j}\, g_i h_i := g_1 h_1 \diamond_M g_2 h_2 \diamond_M \dots \diamond_M g_j h_j ,
\]
where the order of evaluation does not affect the result.
We stress that in general the closure of an associative
operator need not be associative. A counter-example is
given by binary addition modulo 4:
\[
(0M +_M 01) +_M 01 = MM \neq 1M = 0M +_M (01 +_M 01).
\]
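The counter-example can be reproduced mechanically. The sketch below assumes the same closure construction as above (resolve, apply the stable operator, superpose); `res`, `star` and `add_m` are hypothetical helper names, and two-bit strings are read as binary numbers with the most significant bit first.

```python
from itertools import product


def res(x):
    """All stable resolutions of a string over {0, 1, M}."""
    return {"".join(p) for p in product(*[("01" if c == "M" else c) for c in x])}


def star(strings):
    """Bitwise superposition of equally long stable strings."""
    return "".join(cs[0] if len(set(cs)) == 1 else "M" for cs in zip(*strings))


def add_m(x, y):
    """Metastable closure of 2-bit addition modulo 4."""
    sums = [format((int(a, 2) + int(b, 2)) % 4, "02b")
            for a in res(x) for b in res(y)]
    return star(sums)


left = add_m(add_m("0M", "01"), "01")   # (0M + 01) + 01
right = add_m("0M", add_m("01", "01"))  # 0M + (01 + 01)
print(left, right)  # MM 1M
```

The intermediate result `add_m("0M", "01")` is already MM, since the two resolutions 01 and 10 differ in both bits; this is where the left-hand evaluation order loses information.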
4.3 Determining $s^{(i)}_M$
For convenience of the reader, Table 8 gives the truth table
of $\diamond_M : \mathbb{B}_M^2 \times \mathbb{B}_M^2 \to \mathbb{B}_M^2$. We need to show that repeated
application of this operator to the input pairs $g_j h_j$, $j \in [1, i]$,
actually results in $s^{(i)}_M$. This is closely related to the key
observation that if in a valid string there is a metastable
bit at position m, then the remaining $B - m$ following bits
are the maximum codeword of a $(B - m)$-bit code.
Observation 4.4. For $g \in S^B_{\mathrm{rg}}$, if there is an index $1 \le m < B$
such that $g_m = M$, then $g_{m+1,B} = 1\,0^{B-m-1}$.
Proof. List the codewords in order. By the recursive definition
of the code, removing the first $m - 1$ bits of the
code leaves us with $2^{m-1}$ repetitions of the $(B - m + 1)$-bit
code, alternating between listing it in order and in reverse
(“reflected”) order. Also by the recursive definition, the mth
bit toggles only when the $(B - m)$-bit code resulting from
removing it is at its last codeword, $1\,0^{B-m-1}$.
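Observation 4.4 is easy to confirm for small widths by enumeration. The Python sketch below rests on two assumptions that are not restated in this section: that rg_B is the standard binary reflected Gray code, and that the valid strings are the codewords together with the superpositions of consecutive codewords. All function names are hypothetical.

```python
def gray(x, width):
    """Assumption: rg_B is the standard binary reflected Gray code, MSB first."""
    return format(x ^ (x >> 1), "0{}b".format(width))


def superpose(a, b):
    """Bitwise superposition of two stable strings."""
    return "".join(p if p == q else "M" for p, q in zip(a, b))


def valid_strings(width):
    """Assumption: codewords plus superpositions of consecutive codewords."""
    codes = [gray(x, width) for x in range(2 ** width)]
    return codes + [superpose(codes[x], codes[x + 1])
                    for x in range(2 ** width - 1)]


def observation_holds(width):
    for g in valid_strings(width):
        m = g.find("M")                 # 0-based index, -1 if g is stable
        if 0 <= m < width - 1:          # i.e. 1 <= m+1 < B in the paper's indexing
            if g[m + 1:] != "1" + "0" * (width - m - 2):
                return False
    return True


print(all(observation_holds(w) for w in range(1, 9)))
```

For width 3, for example, the string M10 arises between the codewords 010 and 110, and the bits after the metastable position are indeed 10.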
Our reasoning will be based on distinguishing two main
cases: one is that $s^{(i)}_M$ contains at most one metastable bit, the
other that $s^{(i)}_M = MM$. For each we need a technical statement.
Observation 4.5. If $|\operatorname{res}(s^{(i)}_M)| \le 2$ for any $i \in [B + 1]$, then
\[
\operatorname{res}(s^{(i)}_M) = \mathop{\diamond}_{j=1}^{i} \operatorname{res}(g_j h_j).
\]
Proof. With $S := \mathop{\diamond}_{j=1}^{i} \operatorname{res}(g_j h_j)$, we have that
$\operatorname{res}(s^{(i)}_M) = \operatorname{res}(\ast S)$. The claim thus follows from Observation 3.7.
Lemma 4.6. Suppose that for valid strings $g, h \in S^B_{\mathrm{rg}}$, it holds
that $s^{(i)}_M = MM$ for some $i \in [1, B]$. Then $g = h$ and $s^{(j)}_M = MM$
for all $j \in [i, B]$.
Proof. Because $R := \mathop{\diamond}_{k=1}^{i} \operatorname{res}(g_k h_k) \subseteq \mathbb{B}^2$ and $s^{(i)}_M = \ast R$,
it must hold that (i) $\{00, 11\} \subseteq R$, or (ii) $\{01, 10\} \subseteq R$.
By Lemma 4.1, (i) implies that there are stabilizations
$g', g'' \in \operatorname{res}(g_{1,i})$ and $h', h'' \in \operatorname{res}(h_{1,i})$ such that $g' = h'$,
$\operatorname{par}(g') = 0$, $g'' = h''$, and $\operatorname{par}(g'') = 1$, while (ii) implies
such $g', g'', h', h''$ with $g' \prec h'$ and $g'' \succ h''$. Checking
Definition 3.9 (cf. Table 2), we see that both options necessitate
that $g_{1,i} = h_{1,i}$ with some metastable bit. Denoting
by $m \in [1, i - 1]$ the index such that $g_m = h_m = M$,
Observation 4.4 shows that $g_{m+1,B} = h_{m+1,B} = 1\,0^{B-m-1}$.
In particular, $g = h$, showing (again by Lemma 4.1) that
(i) or (ii) (in fact both) also apply to $\mathop{\diamond}_{k=1}^{j} \operatorname{res}(g_k h_k)$ for
any $j \in [i, B]$ (cf. the 00 and 11 columns in Table 6a). We
conclude that $s^{(j)}_M = MM$ for any such j.
Equipped with these tools, we are ready to prove the
second statement.
Theorem 4.7. (P2) holds, i.e., for all $g, h \in S^B_{\mathrm{rg}}$ and $i \in [1, B]$,
\[
s^{(i)}_M = (\diamond_M)_{j=1}^{i}\, g_j h_j .
\]
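Proof. We show the claim by induction on i. Trivially, we
have that $s^{(0)}_M = s^{(0)} = 00$ and thus for $i = 1$ that
\[
s^{(1)}_M = s^{(0)}_M \diamond_M g_1 h_1 = 00 \diamond_M g_1 h_1 = g_1 h_1 = (\diamond_M)_{j=1}^{1}\, g_j h_j .
\]
Hence, suppose that the claim has been established for
$i - 1 \in [1, B - 1]$ and consider index i. If $|\operatorname{res}(s^{(i-1)}_M)| \le 2$,
Observation 4.5 and the induction hypothesis yield that
\begin{align*}
(\diamond_M)_{j=1}^{i}\, g_j h_j
&= \Big( (\diamond_M)_{j=1}^{i-1}\, g_j h_j \Big) \diamond_M g_i h_i \\
&= s^{(i-1)}_M \diamond_M g_i h_i
 = \ast\Big( \operatorname{res}\big(s^{(i-1)}_M\big) \diamond \operatorname{res}(g_i h_i) \Big) \\
&= \ast \mathop{\diamond}_{j=1}^{i} \operatorname{res}(g_j h_j) = s^{(i)}_M .
\end{align*}
It remains to consider the case that $s^{(i-1)}_M = MM$. By
Lemma 4.6, $s^{(i)}_M = MM$, too. Thus,
\[
(\diamond_M)_{j=1}^{i}\, g_j h_j = s^{(i-1)}_M \diamond_M g_i h_i = MM \diamond_M g_i h_i = MM = s^{(i)}_M .
\]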
TABLE 9: The $\mathrm{out}_M$ operator. The first operand is the current
state, the second is the next input bits.
out_M  00   0M   01   M1   11   1M   10   M0   MM
00     00   M0   10   1M   11   1M   10   M0   MM
0M     00   M0   10   1M   11   MM   MM   MM   MM
01     00   M0   10   1M   11   M1   01   0M   MM
M1     00   MM   MM   MM   11   M1   01   0M   MM
11     00   0M   01   M1   11   M1   01   0M   MM
1M     00   0M   01   M1   11   MM   MM   MM   MM
10     00   0M   01   M1   11   1M   10   M0   MM
M0     00   MM   MM   MM   11   1M   10   M0   MM
MM     00   MM   MM   MM   11   MM   MM   MM   MM
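4.4 Obtaining the Outputs from $s^{(i)}_M$
Recall that $\mathrm{out} : \mathbb{B}^2 \times \mathbb{B}^2 \to \mathbb{B}^2$ is the operator given in Table 5
computing $\max^{\mathrm{rg}}\{g,h\}_i \min^{\mathrm{rg}}\{g,h\}_i$ out of $s^{(i-1)}$ and $g_i h_i$.
For convenience of the reader, we provide the truth table of
$\mathrm{out}_M : \mathbb{B}_M^2 \times \mathbb{B}_M^2 \to \mathbb{B}_M^2$ in Table 9. We derive the third property.
Theorem 4.8. (P3) holds, i.e., given valid inputs $g, h \in S^B_{\mathrm{rg}}$ and
$i \in [1, B]$, $\mathrm{out}_M(s^{(i-1)}_M, g_i h_i) = \max^{\mathrm{rg}}_M\{g,h\}_i \min^{\mathrm{rg}}_M\{g,h\}_i$.
Proof. Assume first that $|\operatorname{res}(s^{(i-1)}_M)| \le 2$. Then
\begin{align*}
\mathrm{out}_M\big(s^{(i-1)}_M(g,h), g_i h_i\big)
&= \ast\, \mathrm{out}\big(\operatorname{res}(s^{(i-1)}_M(g,h)), \operatorname{res}(g_i h_i)\big) \\
&\overset{\text{Obs. 4.5}}{=} \ast\, \mathrm{out}\Big( \mathop{\diamond}_{j=1}^{i-1} \operatorname{res}(g_j h_j), \operatorname{res}(g_i h_i) \Big) \\
&\overset{\text{Cor. 4.2}}{=} \ast \big( \max^{\mathrm{rg}}\{\operatorname{res}(g), \operatorname{res}(h)\}_i \min^{\mathrm{rg}}\{\operatorname{res}(g), \operatorname{res}(h)\}_i \big) \\
&= \max^{\mathrm{rg}}_M\{g,h\}_i \min^{\mathrm{rg}}_M\{g,h\}_i .
\end{align*}
Otherwise, $s^{(i-1)}_M = MM$. Then, by Lemma 4.6, $g = h$. In
particular, $g_i = h_i$. Checking Table 9, we see that for all
$b \in \mathbb{B}_M$, it holds that $\mathrm{out}_M(MM, bb) = bb$. Therefore,
\[
\mathrm{out}_M\big(s^{(i-1)}_M(g,h), g_i h_i\big) = g_i h_i = \max^{\mathrm{rg}}_M\{g,h\}_i \min^{\mathrm{rg}}_M\{g,h\}_i
\]
in this case as well.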
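Taken together, (P2) and (P3) say that running the closed FSM sequentially yields the closure of 2-sort on all valid inputs. For small B this can be cross-checked by brute force. The sketch below makes several assumptions that are not restated in this section: rg_B is the standard binary reflected Gray code, valid strings are codewords and superpositions of consecutive codewords, and the closure of max and min is the superposition over all resolutions of both inputs. All helper names are hypothetical.

```python
from itertools import product

COLS = ["00", "01", "11", "10"]
DIAMOND = {"00": ["00", "01", "11", "10"], "01": ["01", "01", "01", "01"],
           "11": ["11", "10", "00", "01"], "10": ["10", "10", "10", "10"]}
OUT = {"00": ["00", "10", "11", "10"], "01": ["00", "10", "11", "01"],
       "11": ["00", "01", "11", "01"], "10": ["00", "01", "11", "10"]}


def res(x):
    return {"".join(p) for p in product(*[("01" if c == "M" else c) for c in x])}


def star(strings):
    return "".join(cs[0] if len(set(cs)) == 1 else "M" for cs in zip(*strings))


def closure(table):
    def op(s, x):
        return star([table[a][COLS.index(b)] for a in res(s) for b in res(x)])
    return op


diamond_m, out_m = closure(DIAMOND), closure(OUT)


def gray(x, width):
    """Assumption: standard binary reflected Gray code, MSB first."""
    return format(x ^ (x >> 1), "0{}b".format(width))


def valid_strings(width):
    codes = [gray(x, width) for x in range(2 ** width)]
    return codes + [star([codes[x], codes[x + 1]]) for x in range(2 ** width - 1)]


def sort_via_fsm(g, h):
    """Outputs out_M(s_M^(i-1), g_i h_i), with states from repeated diamond_M."""
    state, mx, mn = "00", "", ""
    for gi, hi in zip(g, h):
        o = out_m(state, gi + hi)
        mx, mn = mx + o[0], mn + o[1]
        state = diamond_m(state, gi + hi)
    return mx, mn


def sort_via_closure(g, h, width):
    """Reference: superposition of (max, min) over all resolutions of g and h."""
    rank = {gray(x, width): x for x in range(2 ** width)}
    pairs = [(a, b) if rank[a] >= rank[b] else (b, a)
             for a in res(g) for b in res(h)]
    return star([p[0] for p in pairs]), star([p[1] for p in pairs])


def theorem_holds(width):
    strings = valid_strings(width)
    return all(sort_via_fsm(g, h) == sort_via_closure(g, h, width)
               for g in strings for h in strings)


print(theorem_holds(2), theorem_holds(3), theorem_holds(4))
```

As an example, `sort_via_fsm("0M1", "010")` returns `("010", "0M1")`: the larger input is stable, and the metastable bit appears only in the minimum.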
5 THE PPC FRAMEWORK
In order to derive a small circuit from the results of Section 4,
a straightforward approach would be to unroll the FSM. We
could design a circuit implementing the transition function
$\diamond_M$ and apply it B times to the starting state $s^{(0)}$ and each
input $g_i h_i$. However, computing the sequence of states step
by step yields a (non-optimal) linear depth of at least B.
Hence, we make use of the PPC framework by Ladner
and Fischer [23]. They describe a generic method that is
applicable to any finite state machine translating a sequence
of B input symbols to B output symbols, to obtain cir-
cuits of size O(B) and depth O(logB). They observe that
each input symbol defines a restricted transition function.
Compositions of these functions evaluated on the starting
state yield the state of the machine after receiving corre-
sponding inputs. The major advantage of the technique is
that compositions of restricted transition functions can be
computed in parallel due to associativity, yielding a depth of
O(log B). This matches our needs, as we need to determine
$s^{(i)}_M$ for each $i \in [B]$. However, their generic construction
involves large constants. Fortunately, we have established
that $\diamond_M : \mathbb{B}_M^2 \times \mathbb{B}_M^2 \to \mathbb{B}_M^2$ is an associative operator, permitting
us to directly apply the circuit templates for associative
operators they provide for computing $s^{(i)}_M = (\diamond_M)_{j=1}^{i}\, g_j h_j$
for all $i \in [B]$. Accordingly, we discuss these templates
only. During discussion of the basic construction we show a
minor improvement on their results.
Before proceeding, the reader may want to take a look at
the example given in Figure 3, which shows how a 2-sort(9)
circuit derived from our construction processes an input pair.

[Figure 3: circuit diagram of the 2-sort(9) example; the values on the wires are those listed in Table 10.]
Fig. 3: An example for a computation of the 2-sort(9) circuit
arising from our construction for fan-out f = 3. The inputs
are g = 101010110 and h = 101M10000; see Table 10 for
$s^{(i)}_M(g, h)$ and the output. We labeled each $\diamond_M$ by its output.
Buffers and duplicated gates (here the one computing 0M)
reduce fan-out, but do not affect the computation. Grey
boxes indicate recursive steps of the PPC construction; see
also Figure 7 for a larger PPC circuit using the one here in
its “right” top-level recursion. For better readability, wires
not taking part in a recursive step are dashed or dotted.

TABLE 10: Example run of the FSM in Figure 2 on inputs
g = 101010110 and h = 101M10000. We drop $s^{(9)}_M$, as it is
not needed to compute $g'_9 h'_9$.
i            0    1    2    3    4    5    6    7    8    9
g_i h_i           11   00   11   0M   11   00   10   10   00
s_M^(i)      00   11   11   00   0M   M1   M1   01   01
g'_i h'_i         11   00   11   M0   11   00   01   01   00
5.1 The Basic Construction
We revisit the templates for parallel computation of all
prefixes, i.e., the part of the framework relevant to our
construction. To this end, recall Definition 1.1. In our case,
$\oplus = \diamond_M$ and $D = \mathbb{B}_M^2$. [23] provides a family of recursive
constructions of $\mathrm{PPC}_\oplus$ circuits. They are obtained by combining
two different recursive patterns. The first pattern,
which optimizes for size of the resulting circuits, is depicted
in Figure 4a. We distinguish between even and odd numbers
of inputs. If B is even, we discard the rightmost gray wire
and set $\bar{B} := B$; if B is odd, we set $\bar{B} := B - 1$ and include
the rightmost wire. In the following, denote by |C| the size
of a circuit C and by d(C) its depth.
Lemma 5.1. Suppose that C and P are circuits implementing
$\oplus$ and $\mathrm{PPC}_\oplus(\lceil B/2 \rceil)$ for some $B \in \mathbb{N}$, respectively. Then
applying the recursive pattern given at the left of Figure 4 yields
a $\mathrm{PPC}_\oplus(B)$ circuit. It has depth $2d(C) + d(P)$ and size at most
$(B - 1)|C| + |P|$. Moreover, the last output is at depth at most
$d(C) + d(P)$ of the circuit.
Proof. Observe that P receives as inputs $d_{2i-1} \oplus d_{2i}$ for
$i \in [1, \lfloor B/2 \rfloor]$, and in addition $d_B$ in case B is odd. Thus,
it outputs $\pi'_i = \bigoplus_{j=1}^{2i} d_j$ for $i \in [1, \lfloor B/2 \rfloor]$, and also
$\pi'_{\lceil B/2 \rceil} = \bigoplus_{j=1}^{B} d_j$ if B is odd. Hence, the circuit outputs
$\pi_i = \bigoplus_{j=1}^{i} d_j$ if $i \in [1, B]$ is even and
\[
\pi_i = \pi_{i-1} \oplus d_i = \Big( \bigoplus_{j=1}^{2\lfloor i/2 \rfloor} d_j \Big) \oplus d_i = \bigoplus_{j=1}^{i} d_j
\]
if $i \in [1, B]$ is odd, showing correctness. The depth of the
circuit is immediate from the construction, and the size
follows from the fact that there is exactly one instance of
C for each even $i \in [1, B]$ before P and one for each odd
$i \in [1, B] \setminus \{1, B\}$ after P. Output $\pi_B$ has a depth that is
smaller by d(C), as it is an output of P.
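Applying the pattern of Lemma 5.1 at every level of the recursion gives a complete prefix circuit. The following Python sketch simulates this; it is an illustration under the simplifying assumption d(C) = 1, with hypothetical names (`ppc_left`, `run`). Values are carried as (value, depth) pairs so that the number of instances of C and the circuit depth can be read off.

```python
from itertools import accumulate


def ppc_left(d, op):
    """Prefix computation using only the size-optimized pattern, recursively.

    d is a list of (value, depth) pairs; returns (prefixes, number of op instances).
    """
    n = len(d)
    if n == 1:
        return list(d), 0

    def gate(x, y):
        return (op(x[0], y[0]), max(x[1], y[1]) + 1)

    # Instances of C before P: combine d_{2i-1} and d_{2i}.
    inner_in = [gate(d[2 * i], d[2 * i + 1]) for i in range(n // 2)]
    size = n // 2
    if n % 2 == 1:
        inner_in.append(d[-1])          # odd B: d_B is fed to P directly
    inner_out, inner_size = ppc_left(inner_in, op)
    size += inner_size

    out = [d[0]]                        # pi_1 = d_1
    for pos in range(2, n + 1):         # 1-based output position
        if pos % 2 == 0:
            out.append(inner_out[pos // 2 - 1])
        elif pos == n:
            out.append(inner_out[-1])   # odd B: last prefix is an output of P
        else:
            out.append(gate(inner_out[pos // 2 - 1], d[pos - 1]))
            size += 1
    return out, size


def run(n):
    """Test with string concatenation, which is associative but not commutative."""
    letters = [chr(ord("a") + i % 26) for i in range(n)]
    outs, size = ppc_left([(c, 0) for c in letters], lambda a, b: a + b)
    correct = [v for v, _ in outs] == list(accumulate(letters))
    return correct, size, max(depth for _, depth in outs)


print(run(8))  # (True, 11, 4)
```

For B = 8 this uses 11 instances of the operator at depth 4, within the bounds one obtains by unrolling the lemma (size at most 2B − 2, depth at most 2⌈log B⌉).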
The second recursive pattern, shown in Figure 4c, avoids
increasing the depth of the circuit beyond the necessary
d(C) for each level of recursion. Assume for now that B is
a power of 2. We represent the recursion as a tree $T_b$, where
$b := \log B$, given in the center of Figure 4. It has depth
b with all leaves (filled in white) in this depth, and there
are two types of non-leaf nodes: right nodes (filled in black)
have two children, a left and a right node, whereas left nodes
(filled in gray) have a single child, which is a right node. $T_b$
is essentially a Fibonacci tree in disguise.
Definition 5.2. $T_0$ is a single leaf. $T_1$ consists of the (right) root
and two attached leaves. For $b \ge 2$, $T_b$ can be constructed from
$T_{b-1}$ and $T_{b-2}$ by taking a (right) root r, attaching the root of
$T_{b-1}$ as its right child, a new left node $\ell$ as the left child of r, and
then attaching the root of $T_{b-2}$ as (only) child of $\ell$.
The recursive construction is now defined as follows. A
right node applies the pattern given in Figure 4 to the right.
$R_\ell$ is the circuit (recursively) defined by the subtree rooted
at the left child and $R_r$ is the circuit (recursively) defined
by the subtree rooted at the right child. Furthermore, $\bar{B} =
2^{b-d-1}$, where $d \in [b]$ is the depth of the node. A left child
applies the pattern on the left. $R_c$ is (recursively) defined by
the subtree rooted at its child and $\bar{B} = 2^{b-d}$, where $d \in [b]$
is the depth of the node.
The base case for a single input and output is simply a
wire connecting the input to the output, for both patterns.
As $b = \log B$ and each recursive step cuts the number of
inputs and outputs in half, the base case applies if and only
if the node is a leaf. Note that the figure shows the recursive
patterns at the root and its left child, where $\bar{B} = 2^{b-1}$ is
always even (i.e., in this recursive pattern, the gray wire
with index $\bar{B} + 1$ is never present); when applying the
patterns to nodes further down the tree, $\bar{B}$ and B are scaled
down by a factor of 2 for every step towards the leaves.
In the following, denote by $\mathrm{PPC}(C, T_b)$ the circuit that
results from applying the recursive construction described
above to the base circuit C implementing $\oplus$. Moreover, we
refer to the ith input and output of the subcircuit corresponding
to node $v \in T_b$ as $d^v_i$ and $\pi^v_i$, respectively.
Lemma 5.3. If C implements $\oplus$, $\mathrm{PPC}(C, T_b)$ is a $\mathrm{PPC}_\oplus(2^b)$
circuit.
Proof. We show the claim by induction on b. For b = 0, the
circuit correctly wires the input to the output, as we have
only one leaf. For b = 1, the first output equals the first
input and the second output is the result of feeding both
inputs into a copy of C.
For $b \ge 2$, by the induction hypothesis the circuit $R_c$
used in the construction at the left child of the root is a
$\mathrm{PPC}_\oplus(2^{b-2})$ circuit. By Lemma 5.1, the circuit $R_\ell$ in the
construction at the root is thus a $\mathrm{PPC}_\oplus(2^{b-1})$ circuit, showing
that it outputs $\pi^\ell_i = \bigoplus_{j=1}^{i} d_j = \pi_i$ for all $i \in [1, 2^{b-1}]$. From
the induction hypothesis for $b - 1$, we get that the circuit
$R_r$ used in the construction at the root is a $\mathrm{PPC}_\oplus(2^{b-1})$
circuit, showing that it outputs $\pi^r_i = \bigoplus_{j=2^{b-1}+1}^{2^{b-1}+i} d_j$ for all
$i \in [1, 2^{b-1}]$. By construction of the right recursion pattern
we conclude that for $i \in [2^{b-1} + 1, 2^b]$, we get the outputs
$\pi_{2^{b-1}} \oplus \pi^r_{i-2^{b-1}} = \bigoplus_{j=1}^{i} d_j = \pi_i$.
Lemma 5.4. $\mathrm{PPC}(C, T_b)$ has depth $b \cdot d(C)$.
Proof. We prove the claim by induction on b; it trivially
holds for b = 0, as we have only one leaf. For b = 1, $T_b$ is
a right node with two leaves. The two leaves have depth 0;
clearly, applying the right pattern from Figure 4 then results
in depth d(C). For $b \ge 2$, the subcircuit $R_r$ at the root has
depth $(b - 1) \cdot d(C)$ by the induction hypothesis. For the
subcircuit $R_\ell$ at the root, consider its subcircuit $R_c$. By the
induction hypothesis it has depth $(b - 2) \cdot d(C)$. Hence, by
Lemma 5.1, $R_\ell$ has depth $b \cdot d(C)$, but its rightmost output
$\pi^\ell_{2^{b-1}}$ has depth only $(b - 1) \cdot d(C)$. Thus, by construction
the root’s circuit has depth $b \cdot d(C)$.
It remains to bound the size of the circuit. Denote by
$F_i$, $i \in \mathbb{N}$, the ith Fibonacci number, i.e., $F_1 = F_2 = 1$ and
$F_{i+1} = F_i + F_{i-1}$ for all $2 \le i \in \mathbb{N}$.
Lemma 5.5. $\mathrm{PPC}(C, T_b)$ has size $(2^{b+2} - F_{b+5} + 1)|C|$.
Proof. Denote by $s_b$ the number of copies of C in the circuit
$\mathrm{PPC}(C, T_b)$. We show by induction that $s_b = 2^{b+2} - F_{b+5} + 1$
for all $b \in \mathbb{N}_0$. We have that $s_0 = 0 = 2^2 - F_5 + 1$ and that
$s_1 = 1 = 2^3 - F_6 + 1$. For $b \ge 2$, we have that $s_b = s_{b-1} +
s_{b-2} + s_r + s_\ell$, where $s_r$ and $s_\ell$ denote the size contribution of
the recursive steps at the root and its left child, respectively.
Checking the recursive patterns in Figure 4, we see that $s_r =
B - \bar{B} = 2^{b-1}$ and $s_\ell = \bar{B} - 1 = 2^{b-1} - 1$. Thus, $s_b =
s_{b-1} + s_{b-2} + 2^b - 1$, from which the induction hypothesis yields
\begin{align*}
s_b &= 2^{b+1} - F_{b+4} + 1 + 2^b - F_{b+3} + 1 + 2^b - 1 \\
    &= 2^{b+2} - F_{b+5} + 1 .
\end{align*}
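Lemmas 5.3 to 5.5 can be checked against a direct simulation of the recursion along $T_b$. The Python sketch below is an illustration only, assuming d(C) = 1 and using hypothetical names (`make_ppc`, `right`, `left`, `run`); `right` plays the role of right nodes and leaves, `left` the role of left nodes.

```python
from itertools import accumulate


def fib(i):
    """F_1 = F_2 = 1, F_{i+1} = F_i + F_{i-1}; valid for i >= 1."""
    a, b = 1, 1
    for _ in range(i - 1):
        a, b = b, a + b
    return a


def make_ppc(op):
    count = [0]                          # number of instances of C used so far

    def gate(x, y):
        count[0] += 1
        return (op(x[0], y[0]), max(x[1], y[1]) + 1)

    def right(d):
        """Right node (or leaf): pattern of Figure 4c on 2^b inputs."""
        n = len(d)
        if n == 1:
            return list(d)
        lo = left(d[:n // 2])            # R_l
        hi = right(d[n // 2:])           # R_r
        return lo + [gate(lo[-1], y) for y in hi]

    def left(d):
        """Left node (or leaf): pattern of Figure 4a, with a right node as child."""
        n = len(d)
        if n == 1:
            return list(d)
        pairs = [gate(d[2 * i], d[2 * i + 1]) for i in range(n // 2)]
        rec = right(pairs)               # R_c
        out = [d[0]]
        for i in range(1, n):            # 0-based output index
            if i % 2 == 1:
                out.append(rec[i // 2])
            else:
                out.append(gate(rec[i // 2 - 1], d[i]))
        return out

    return right, count


def run(b):
    letters = [chr(ord("a") + i % 26) for i in range(2 ** b)]
    ppc, count = make_ppc(lambda x, y: x + y)
    outs = ppc([(c, 0) for c in letters])
    correct = [v for v, _ in outs] == list(accumulate(letters))
    return correct, count[0], max(depth for _, depth in outs)


for b in range(6):
    print(b, run(b), 2 ** (b + 2) - fib(b + 5) + 1)
```

For b = 3, for example, the simulation uses 12 copies of the operator at depth 3, matching $2^5 - F_8 + 1 = 12$.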
Asymptotically, the subtractive term of $F_{b+5}$ is negligible,
as $F_{b+5} \in (1/\sqrt{5} + o(1))((1 + \sqrt{5})/2)^{b+5} \subseteq O(1.62^b)$;
however, unless B is large, the difference is substantial. We
also get a simple upper bound for arbitrary values of B. To
this end, we “split” in the recursion such that the left branch
is “complete” (i.e., the number of inputs is a power of 2),
while applying the same splitting strategy on the right. This
is where our construction differs from and improves on [23].
They perform a balanced split and obtain an upper bound
of 4B on the circuit size.

[Figure 4: (a) Recursion pattern of left nodes. (b) Recursion tree $T_4$. (c) Recursion pattern of right nodes.]
Fig. 4: The recursion tree $T_4$ (center). Right nodes are depicted black, left nodes gray, and leaves white. The recursive
patterns applied at left and right nodes are shown on the left and right, respectively. At the root and its left child, we have
that $\bar{B} = B/2$; for other nodes, $\bar{B}$ gets halved for each step further down the tree (where the leaves simply wire their single
input to their single output). The left pattern comes in different variants. The gray wire with index $\bar{B} + 1$ is present only
if B is odd; this never occurs in $\mathrm{PPC}(C, T_b)$, but becomes relevant when initially applying the left pattern exclusively for
$k \in \mathbb{N}$ steps (see Theorem 5.7), reducing the size of the resulting circuit at the expense of increasing its depth by k.
Corollary 5.6. For $B \in \mathbb{N}$ and circuit C implementing $\oplus$, set
$b := \lceil \log B \rceil$. Then a $\mathrm{PPC}_\oplus(B)$ circuit of depth $\lceil \log B \rceil d(C)$ and size
smaller than $(5B - 2^b - F_{b+3})|C| \le (4B - F_{b+3})|C|$ exists.
Proof. If B is a power of 2, the claim follows from Lemmas
5.3, 5.4, and 5.5. In particular, for B = 1 and B = 2,
respectively, $\mathrm{PPC}(C, T_0)$ and $\mathrm{PPC}(C, T_1)$ meet the requirements.
For B > 2 that is not a power of 2, set
$b := \lceil \log B \rceil$ and perform the same construction as for
$\mathrm{PPC}(C, T_b)$, but replace $R_r$ at the root by the (recursively
given) $\mathrm{PPC}_\oplus(B - 2^{b-1})$ circuit.
Correctness is immediate from the recursive construction
and Lemma 5.3. Similarly, the depth bound follows from
Lemma 5.4 and the recursive construction. Regarding size,
we show by induction that $s_B$, the number of copies of
C required for a circuit for B inputs, satisfies the claimed
bound. This is already established for the base cases of
B = 1 and B = 2. For B > 3, we apply Lemma 5.1 to
$R_\ell$ in the root’s circuit and Lemma 5.5 to its subcircuit $R_c$,
while applying the induction hypothesis to the subcircuit
$R_r$ in the root’s circuit. We get that
\begin{align*}
s_B &< s_{2^{b-2}} + |\mathrm{PPC}_\oplus(C, T_{b-1})| + (B - 1) \\
    &< 2^b - F_{b+3} + 1 + 4\big(B - 2^{b-1}\big) + B - 1 \\
    &= 5B - 2^b - F_{b+3} .
\end{align*}
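The unbalanced split can be simulated in the same way as the power-of-two case. The Python sketch below is only an illustration of the construction described in the proof, assuming d(C) = 1; the names `make_ppc`, `general`, `left` and `run` are hypothetical. At every right-type step the left part receives $2^{\lceil \log B \rceil - 1}$ inputs and the remainder is handled recursively.

```python
from itertools import accumulate


def fib(i):
    """F_1 = F_2 = 1; valid for i >= 1."""
    a, b = 1, 1
    for _ in range(i - 1):
        a, b = b, a + b
    return a


def make_ppc(op):
    count = [0]                              # instances of C used so far

    def gate(x, y):
        count[0] += 1
        return (op(x[0], y[0]), max(x[1], y[1]) + 1)

    def general(d):
        """Right-type step with a 'complete' left branch of 2^(ceil(log n) - 1) inputs."""
        n = len(d)
        if n == 1:
            return list(d)
        half = 1 << ((n - 1).bit_length() - 1)
        lo = left(d[:half])
        hi = general(d[half:])
        return lo + [gate(lo[-1], y) for y in hi]

    def left(d):
        """Left node on a power-of-two number of inputs (pattern of Figure 4a)."""
        n = len(d)
        if n == 1:
            return list(d)
        pairs = [gate(d[2 * i], d[2 * i + 1]) for i in range(n // 2)]
        rec = general(pairs)
        out = [d[0]]
        for i in range(1, n):
            if i % 2 == 1:
                out.append(rec[i // 2])
            else:
                out.append(gate(rec[i // 2 - 1], d[i]))
        return out

    return general, count


def run(n):
    letters = [chr(ord("a") + i % 26) for i in range(n)]
    ppc, count = make_ppc(lambda x, y: x + y)
    outs = ppc([(c, 0) for c in letters])
    correct = [v for v, _ in outs] == list(accumulate(letters))
    return correct, count[0], max(depth for _, depth in outs)


for n in (3, 5, 7, 9):
    b = (n - 1).bit_length()                 # ceil(log2 n)
    print(n, run(n), 5 * n - 2 ** b - fib(b + 3))
```

The last printed number is the size bound of Corollary 5.6 in units of |C|; the simulated sizes stay below it, and the depth does not exceed ⌈log B⌉.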
We remark that one can give more precise bounds
by making case distinctions regarding the right recursion,
which for the sake of brevity we omit here. Instead, we
computed the exact numbers for B ≤ 70, see Figure 5.
The construction derived from iterative application of
Lemma 5.1 can be combined with $\mathrm{PPC}(C, T_b)$, achieving
the following trade-off; note that if $B = 2^b$ for $b \in \mathbb{N}$, then
$F_{\lceil \log B \rceil - k + 3}$ can be replaced by $F_{b-k+5}$.
Theorem 5.7 (improving on [23]). Suppose C implements $\oplus$.
For all $k \in [0, \lceil \log B \rceil]$ and $B \in \mathbb{N}$, there is a $\mathrm{PPC}_\oplus(B)$ circuit
of depth $(\lceil \log B \rceil + k)d(C)$ and size at most
\[
\Big( \Big( 2 + \frac{1}{2^{k-1}} \Big) B - F_{\lceil \log B \rceil - k + 3} \Big) |C| .
\]
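Proof. For k steps, we apply Lemma 5.1, where in the
final recursive step we use the circuit from Corollary 5.6.
Correctness is immediate from Lemma 5.1 and Corollary 5.6.
Denote by $B_i$, $i \in [k + 1]$, the number of in- and outputs
of the (sub)circuit at depth i of the recursion, i.e., $B_0 = B$
and $B_{i+1} = \lceil B_i/2 \rceil$ for all $i \in [k]$. We have that $B_i \le
B/2^i + \sum_{j=1}^{i} 2^{-j} < B/2^i + 1$, which follows inductively via
\begin{align*}
B_{i+1} = \Big\lceil \frac{B_i}{2} \Big\rceil
&\le \Big\lceil \frac{B_0/2^i + \sum_{j=1}^{i} 2^{-j}}{2} \Big\rceil \\
&\le \frac{B_0}{2^{i+1}} + \Big( \sum_{j=1}^{i} 2^{-j-1} \Big) + \frac{1}{2}
 = \frac{B}{2^{i+1}} + \sum_{j=1}^{i+1} 2^{-j} .
\end{align*}
Observe that $\lceil \log B_{i+1} \rceil = \lceil \log B_i \rceil - 1$ for all $i \in [k]$. By
Lemma 5.1 and Corollary 5.6, the size of the resulting circuit
is thus bounded by
\begin{align*}
& \Big( \frac{4B}{2^k} - F_{\lceil \log B \rceil - k + 3} + \sum_{i=0}^{k-1} (B_i - 1) \Big) |C| \\
&< \Big( \frac{4B}{2^k} - F_{\lceil \log B \rceil - k + 3} + \sum_{i=0}^{k-1} \frac{B}{2^i} \Big) |C| \\
&= \Big( \Big( \frac{2}{2^{k-1}} + 2 - \frac{1}{2^{k-1}} \Big) B - F_{\lceil \log B \rceil - k + 3} \Big) |C| \\
&= \Big( \Big( 2 + \frac{1}{2^{k-1}} \Big) B - F_{\lceil \log B \rceil - k + 3} \Big) |C| .
\end{align*}
Finally, Lemmas 5.1 and 5.4 show that the circuit has depth
$(2k + \lceil \log B_k \rceil)d(C) = (\lceil \log B \rceil + k)d(C)$.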
[Figure 5: plot of the number of copies of C (vertical axis, 50 to 250) against the input width B (horizontal axis, 10 to 70). Curves: upper bound; Ladner & Fischer, fan-out 3; we, fan-out 3; Ladner & Fischer, fan-out unbounded; we, fan-out unbounded.]
Fig. 5: Comparison of the balanced recursion from [23] and
ours. The curves for unbounded fan-out are the exact sizes
obtained, whereas “upper bound” refers to the bound from
Corollary 5.6; the fan-out 3 curves show that the unbalanced
strategy performs better also for the construction from Theorem
5.17 (for f = 3 and k = 0) we derive next.
5.2 Constant Fan-out at Optimal Depth
The optimal depth construction incurs an excessively large
fan-out of Θ(B), as the last output of left recursive calls
needs to drive all the copies of C that combine it with
each of the corresponding right call’s outputs. This entails
that, despite its lower depth, it will not result in circuits
of smaller physical delay than simply recursively applying
the construction from Figure 4a. Naturally, one can insert
buffer trees to ensure a constant fan-out (and thus constantly
bounded ratio between delay and depth), but this increases
the depth to $\Theta(\log^2 B + d(C) \log B)$.
We now modify the recursive construction to ensure a
constant fan-out, at the expense of a limited increase in size
of the circuit. The result is the first construction that has size
O(B), optimal depth, and constant fan-out.
In the following, we denote by f ≥ 3 the maximum fan-
out we are trying to achieve, where we assume that gates or
memory cells providing the input to the circuit do not need
to drive any other components. For simplicity, we consider
C to be a single gate, i.e., a gate driving two C components
has exactly fan-out 2.
We proceed in two steps. First, we insert 2B buffers
into the circuit, ensuring that the fan-out is bounded by 2
everywhere except at the gate providing the last output of
each subcircuit corresponding to a left node. In the second
step, we will resolve this by duplicating these gates suffi-
ciently often, recursively propagating the changes down the
tree. Neither of these changes will affect the output (i.e. the
correctness) of the circuit or its depth, so the main challenges
are to show our claim on the fan-out and bounding the size
of the final circuit.
5.2.1 Step 1: Almost Bounding Fan-out by 2
Before proceeding to the construction in detail, we need
some structural insight on the circuit.
Definition 5.8. For node $v \in T_b$, define its range $R_v$ and left-count
$\alpha_v$ recursively as follows.
• If v is the root, then $R_v = [1, 2^b]$ and $\alpha_v = 0$.
• If v is the left child of p with $R_p = [i, i + j]$, then $R_v =
[i, i + (j - 1)/2]$ and $\alpha_v = \alpha_p$.
• If v is the right child of right node p with $R_p = [i, i + j]$,
then $R_v = [i + (j + 1)/2, i + j]$ and $\alpha_v = \alpha_p$.
• If v is the right child of left node p, then $R_v = R_p$ and
$\alpha_v = \alpha_p + 1$.
Hence, the left-count αv tells us for every node v ∈ Tb
the number of left recursion steps preceding v, whereas
Rv gives us information about the range of inputs used at
node v. We observe that each recursion halves the number
of inputs and that the range is only cut in half if αv does
not increase. Combining these observations with structural
insights on the recursion patterns in Figures 4a and 4c, we
state the following four properties of PPC(C, Tb).
Lemma 5.9. Suppose the subcircuit of $\mathrm{PPC}(C, T_b)$ represented
by node $v \in T_b$ in depth $d \in [b + 1]$ has range $R_v = [i, i + j]$.
Then
(i) it has $2^{b-d}$ inputs,
(ii) $j = 2^{b-d+\alpha_v} - 1$,
(iii) if v is a right node, all its inputs are outputs of its children’s
subcircuits, and
(iv) if v is a left node or leaf, only its even inputs are provided
by its child (if it has one) and for odd $k \in [1, 2^{b-d}]$, we have
that $d^v_k = \bigoplus_{k'=i+(k-1)2^{\alpha_v}}^{i+k2^{\alpha_v}-1} d_{k'}$.
Proof. Property (i) is immediate from the fact that with
each step of recursion, the number of input and output
wires is cut in half. Checking the above definition, we see
that the range stays the same whenever $\alpha_v$ increases, and
otherwise is halved on each recursive step; Property (ii)
follows. Property (iii) can be readily verified from Figure 4c.
The final property is shown by induction on b. It is immediate
for b = 0 and b = 1. For $b \ge 2$, the subcircuit of the
left child $\ell$ of the root has $2^{b-1}$ inputs, the odd ones of which
are inputs to the overall circuit (cf. Figures 4a and 4c). As we
have $\alpha_v = 0$, we get that $d^\ell_k = d_{i+k-1} = \bigoplus_{k'=i+(k-1)2^{\alpha_v}}^{i+k2^{\alpha_v}-1} d_{k'}$
and the node satisfies the claim. For the subcircuit $R_r$
corresponding to the subtree rooted at the right child of the
root, the claim holds by the induction hypothesis applied
to $b - 1$. For the subcircuit $R_\ell$ of the left child, we see
from Figure 4a that the subcircuit $R_c$ corresponding to the
subtree rooted at its child c receives inputs $d^c_k = d_{2k-1} \oplus d_{2k}$,
$k \in [1, 2^{b-1}]$. Combining this with the induction hypothesis
for $b - 2$, the induction step is completed also in this case.
Lemma 5.9 leads to an alternative representation of the
circuit PPC(C, Tb), see Figure 6, in which we separate
gates in the recursive pattern from Figure 4a that occur
before the subcircuit Rc. Adding the buffers we need in
our construction, this results in the modified patterns given
in Figure 6b. The separated gates appear at the bottom of
Figure 6a: for each leaf v of Tb, there is a tree of depth αv
aggregating all of the circuit’s inputs from its range. Each
non-root node in an aggregation tree provides its output
to its parent. In addition, one of the two children of an
inner node in the tree must provide its output as an input
to one of the subcircuits corresponding to a node of Tb,
cf. Property (iv) of Lemma 5.9.
From this representation, we will derive that the following
modifications of $\mathrm{PPC}(C, T_b)$ result in a $\mathrm{PPC}_\oplus(2^b)$
circuit $\mathrm{PPC}(C, T_b)'$, for which a fan-out larger than 2 exclusively
occurs on the last outputs of subcircuits corresponding
to nodes of $T_b$.
1) Add a buffer on each wire connecting a non-root node
of any of the aggregation trees to its corresponding
subcircuit (see Figure 6a).
2) For the subcircuit corresponding to left node $\ell$ with
range $R_\ell = [i, i + j]$, add for each even $k \le j$ (i.e.,
each even k but the maximum of $j + 1$) a buffer before
output $\pi^\ell_k$ (see bottom of Figure 6b).
3) For each right node r with range $[i, i + j]$, add a buffer
before output $\pi^r_{(j+1)/2}$ (see top of Figure 6b).

[Figure 6: (a) Recursion tree $T_4$ with separated aggregation trees and added buffers. (b) Recursive patterns $R_r$ and $R_\ell$.]
Fig. 6: Construction of $\mathrm{PPC}(C, T_4)'$. On the left, we see the recursion tree, with the aggregation trees separated and shown
at the bottom. Inputs are depicted as black triangles. On the right, the application of the recursive patterns at the children
of the root is shown. Parts marked blue will be duplicated in the second step of the construction that achieves constant
fan-out; this will also necessitate duplicating some gates in the aggregation trees.
Lemma 5.10. With the exception of gates providing the last
output of subcircuits corresponding to nodes of $T_b$ (blue in
Figure 6b), the fan-out of $\mathrm{PPC}(C, T_b)'$ is at most 2. Buffers or gates driving
an output of the circuit drive nothing else.
Proof. First, we prove the following invariant: If each input
to a subcircuit of $\mathrm{PPC}(C, T_b)'$ corresponding to a node of
$T_b$ is driven by a gate or buffer driving no other wires, the
same holds true for the outputs of the subcircuit. Suppose
for contradiction that this invariant is violated and consider
a minimal subcircuit doing so. There are three cases.
• The subcircuit corresponds to a leaf. This is a contradiction,
as then the subcircuit simply is a wire connecting
its sole input to its output.
• The subcircuit corresponds to a right node r (cf. top
of Figure 6b) with range $[i, i + j]$. As the invariant
applies to the subcircuit corresponding to its left child,
its outputs $\pi^r_1, \dots, \pi^r_{(j-1)/2}$ satisfy the invariant. Its
output $\pi^r_{(j+1)/2}$ satisfies the invariant due to the inserted
buffer. As the remaining outputs are driven by
gates that drive nothing else, this case also leads to a
contradiction.
• The subcircuit corresponds to a left node $\ell$ (cf. bottom
of Figure 6b) with range $[i, i + j]$. As $d^\ell_1$ is simply
wired to $\pi^\ell_1$ (and nothing else), it satisfies the invariant.
The last output $\pi^\ell_{j+1}$ satisfies the invariant, because
the recursively used subcircuit does. The remaining
outputs are driven by gates or buffers driving nothing
else, again resulting in a contradiction.
As all cases result in a contradiction, the invariant holds.
Next, observe that, by construction, the aggregation trees
have fan-out 2 after buffer insertion. Each buffer or gate
from this part of the circuit drives exactly one wire connecting
it to the remainder of the circuit. Thus, the above
invariant shows that all subcircuits corresponding to nodes
of $T_b$ satisfy that each of their outputs is driven by a gate
or a buffer driving nothing else. Checking Figure 6b, we can
thus conclude that indeed no gate or buffer drives more than
two others, unless it provides the last output to one of the
recursively used subcircuits in the construction at a right
node; gates or buffers driving an output of $\mathrm{PPC}(C, T_b)'$
drive only this output.
It remains to count the inserted buffers. We do so by com-
puting a closed form expression from the linear recurrence
that describes the number of nodes of a given type (left,
right, leaf) in a given depth as function of the previous one.
The following helper statement will be useful for this, but
also later on.
Lemma 5.11. Denote by $L_b \subseteq T_b$ the set of leaves of $T_b$. Then
$|L_b| = F_{b+2}$ and $\sum_{v \in L_b} 2^{\alpha_v} = 2^b$.
Proof. We have that $|L_0| = 1 = F_2$, $|L_1| = 2 = F_3$, and, by
Definition 5.2, for $b \ge 2$ that $|L_b| = |L_{b-1}| + |L_{b-2}|$. This
recurrence has solution $|L_b| = F_{b+2}$.
Next, consider the recurrence given by $L'_0 = 1$, $L'_1 = 2$,
and $L'_b = L'_{b-1} + 2L'_{b-2}$ for $b \ge 2$; the factor of 2 assigns
twice the weight to the subtree rooted at the child of the
root’s left child, thereby ensuring that each leaf is accounted
with weight $2^{\alpha_v}$. This recurrence has solution $2^b$.
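Both identities can be confirmed by enumerating the leaves of $T_b$ together with their left-counts. The Python sketch below follows Definitions 5.2 and 5.8 directly; the function names are hypothetical.

```python
def fib(i):
    """F_1 = F_2 = 1; valid for i >= 1."""
    a, b = 1, 1
    for _ in range(i - 1):
        a, b = b, a + b
    return a


def leaf_left_counts(b, alpha=0):
    """Left-counts alpha_v of all leaves of T_b, whose root has left-count alpha."""
    if b == 0:
        return [alpha]                   # T_0 is a single leaf
    if b == 1:
        return [alpha, alpha]            # root with two attached leaves
    # The root's left child is a left node whose only child is the root of
    # T_{b-2}; stepping from a left node to its child increases the left-count.
    return leaf_left_counts(b - 2, alpha + 1) + leaf_left_counts(b - 1, alpha)


for b in range(8):
    counts = leaf_left_counts(b)
    print(b, len(counts), fib(b + 2), sum(2 ** a for a in counts), 2 ** b)
```

Each printed row shows the number of leaves next to $F_{b+2}$ and the weighted sum next to $2^b$.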
Lemma 5.12. Denote by s the size of a buffer. Then
|PPC(C, Tb)′| = |PPC(C, Tb)|+
(
2b + 2b−1 − Fb+3
)
s .
Proof. To count the number $c(b)$ of buffers in subcircuits corresponding
to nodes of $T_b$, we observe that $c(0) = 0$, $c(1) = 1$, and for $b \ge 2$
it holds that $c(b) = 2^{b-2} + c(b-1) + c(b-2)$:
1 buffer at the root (see top of Figure 6b), $2^{b-2} - 1$ buffers at
its left child (see bottom of Figure 6b), $c(b-1)$ buffers for the
subtree rooted at its right child, and $c(b-2)$ buffers for the
subtree rooted at the child of its left child. This recurrence
relation has the solution $c(b) = 2^b - F_{b+1}$.
To count the number of buffers attached to the aggregation
trees, recall that from depth $d \neq 0$ of each tree, exactly
half of the nodes' output is required by a buffer connected to
some node in $T_b$ (this follows from Lemma 5.9 and the fact
that ranges of nodes in the same depth of $T_b$ form a partition
of $[1, 2^b]$). Thus, this number equals
$\sum_{v \in L_b} \left(2^{\alpha_v - 1} - 1\right)$. By
Lemma 5.11, the total number of buffers thus equals
$$2^b - F_{b+1} + 2^{b-1} - F_{b+2} = 2^b + 2^{b-1} - F_{b+3} .$$
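As a sanity check on the buffer count, the sketch below iterates the recurrence for $c(b)$ and compares it with the stated closed form, then adds the aggregation-tree term. It assumes $F_1 = F_2 = 1$; `buffer_count` and `total_buffers` are illustrative names only.

```python
def fib(n):
    """Fibonacci numbers with F_1 = F_2 = 1 (assumed convention)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def buffer_count(b):
    """c(b) with c(0) = 0, c(1) = 1, c(b) = 2^(b-2) + c(b-1) + c(b-2)."""
    c = [0, 1]
    for i in range(2, b + 1):
        c.append(2 ** (i - 2) + c[i - 1] + c[i - 2])
    return c[b]


def total_buffers(b):
    """Buffers at nodes of T_b plus buffers attached to aggregation trees (b >= 1)."""
    return (2 ** b - fib(b + 1)) + (2 ** (b - 1) - fib(b + 2))


for b in range(1, 20):
    assert buffer_count(b) == 2 ** b - fib(b + 1)
    assert total_buffers(b) == 2 ** b + 2 ** (b - 1) - fib(b + 3)
```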
Similar arguments serve later as well. The main reason
why we will define the function a(v) in the next section
without rounding is to ensure that we again obtain linear
recurrences, which can be solved using standard techniques
from linear algebra. As a downside, this results in slightly
overestimating the size of circuits, as we may ask for more
copies of gates from children than are actually needed.
5.2.2 Step 2: Bounding Fan-out by f
In the second step, we need to resolve the issue of high fan-out
of the last output of each recursively used subcircuit in
$\mathrm{PPC}(C, T_b)'$. Our approach is straightforward. Starting at
the root of $T_b$ and progressing downwards, we label each
node v with a value a(v) that specifies a sufficient number
of additional copies of the last output of the subcircuit
represented by v to avoid fan-out larger than f . At right
nodes, this is achieved by duplicating the gate computing
this output sufficiently often, marked blue in Figure 6b
(top). For left nodes, we simply require the same number
of duplicates to be provided by the subcircuit represented
by their child (i.e., we duplicate the blue wire in the bottom
recursive pattern shown in Figure 6b). Finally, for leaves,
we will require a sufficient number of duplicates of the root
of their aggregation tree; this, in turn, may require to make
duplicates of their descendants in the aggregation tree.
We define a(v) and then utilize it to describe our fan-out
f circuit. Afterwards, we will analyze the increase in size of
the circuit compared to PPC(C, Tb)′.
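Definition 5.13 ($a(v)$). Fix $b \in \mathbb{N}_0$. For $v \in T_b$ in depth
$d \in [b+1]$, define
$$a(v) := \begin{cases}
0 & \text{if } v \text{ is the root} \\
\dfrac{a(p) + 2^{b-d}}{f} & \text{if } v \text{ is the left child of } p \\
\dfrac{a(p)}{f} & \text{if } v \text{ is the right child of right node } p \\
a(p) & \text{if } v \text{ is the (only) child of left node } p.
\end{cases}$$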
Lemma 5.14. Suppose that for each leaf $v \in T_b$, there are $\lfloor a(v) \rfloor$
additional copies of the root of the aggregation tree, and for each
right node $v \in T_b$, we add $\lfloor a(v) \rfloor$ gates that compute (copies of)
the last output of their corresponding subcircuit of $\mathrm{PPC}(C, T_b)'$.
Then we can wire the circuit such that all gates that are not in
aggregation trees have fan-out at most $f$, and each output of the
circuit is driven by a gate or buffer driving only this output.
Proof. We prove the claim by induction on the depth of
nodes, starting at the leaves. The base case holds by assumption,
as each leaf is provided with sufficiently many copies
of its (single) input. For the induction step from depth $d+1$
to $d \in [b]$, fix some $v \in T_b$ in depth $d$. By Lemma 5.10,
we only need to consider the last output of the subcircuit
corresponding to the node; there are in total $1 + \lfloor a(v) \rfloor$ gates
providing it. We distinguish four cases.
• $v$ is a left node. Thus, its child $c$ is a right node with
$1 + \lfloor a(c) \rfloor = 1 + \lfloor a(v) \rfloor$ gates providing its last output.
As the parent $p$ of $v$ is a right node, these need to drive
$1 + 2^{b-d} + \lfloor a(p) \rfloor$ gates (cf. Figure 6b). We have that
$$f(1 + \lfloor a(v) \rfloor) = f\left(1 + \left\lfloor \frac{a(p) + 2^{b-d}}{f} \right\rfloor\right) \ge f + a(p) + 2^{b-d} - (f-1) \ge 1 + 2^{b-d} + \lfloor a(p) \rfloor .$$
• $v$ is a right node with left parent $p$. This was already
covered in the previous case from the viewpoint of $p$.
• $v$ is a right node with right parent $p$. As $p$ is also a right
node, we need to drive $1 + \lfloor a(p) \rfloor$ gates (cf. Figure 6b).
We have that
$$f(1 + \lfloor a(v) \rfloor) = f\left(1 + \left\lfloor \frac{a(p)}{f} \right\rfloor\right) \ge f + a(p) - (f-1) \ge 1 + \lfloor a(p) \rfloor .$$
• $v$ is the root. Thus, it provides the outputs to the circuit,
and the claim is immediate from Lemma 5.10.
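The counting argument in the first and third case reduces to one inequality: $1 + \lfloor a(v) \rfloor$ gates of fan-out $f$ must be able to drive $1 + t + \lfloor a(p) \rfloor$ gates, where $t = 2^{b-d}$ for a left child and $t = 0$ for a right child of a right node. The sketch below checks this outer inequality with exact rational arithmetic over a small grid; the helper name `copies_suffice` and the sampled values are assumptions made for illustration.

```python
from fractions import Fraction
from math import floor


def copies_suffice(a_p, extra, f):
    """Check f * (1 + floor((a_p + extra) / f)) >= 1 + extra + floor(a_p).

    a_p   : label a(p) of the parent (a non-negative rational)
    extra : 2^(b-d) for a left child, 0 for a right child of a right node
    f     : fan-out bound (integer, f >= 3)
    """
    a_v = (a_p + extra) / f
    return f * (1 + floor(a_v)) >= 1 + extra + floor(a_p)


# Exhaustive check on a grid of rational labels, power-of-two extras, and fan-outs.
grid_ok = all(
    copies_suffice(Fraction(n, 7), extra, f)
    for n in range(0, 300)
    for extra in (0, 1, 2, 4, 8, 16)
    for f in range(3, 9)
)
assert grid_ok
```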
It remains to modify the aggregation trees so that suffi-
ciently many copies of the roots’ output values are available.
Lemma 5.15. Consider an aggregation tree corresponding to leaf
$v \in T_b$ and fix $f \ge 3$. We can modify it such that the fan-out
of all its non-root nodes becomes at most $f$, there are $\lfloor a(v) \rfloor$
additional gates computing the same output as the root, and at
most $(f a(v))/(f-2) + 2^{\alpha_v - 1}/(f-1)$ gates are added.
Proof. We recursively assign a value $a_d$ to each depth
$d \in [\alpha_v + 1]$ of the aggregation tree that bounds the number
of additional gates in this depth from above. Accordingly,
$a_0 := a(v)$. To determine a suitable recurrence, recall that
for each node in the tree, one child needs to also drive a
buffer, while the other does not (Lemma 5.9). Fix one child
$c$ in depth $d$ and its parent $p$, and denote by $a(c)$ and $a(p)$
the number of additional gates required. If $c$ also needs to
drive a buffer, it is sufficient that $a(c) \ge \lfloor (a(p) + 1)/f \rfloor$, as
$f(1 + \lfloor (a(p) + 1)/f \rfloor) \ge 2 + a(p)$; similarly, $a(c) \ge \lfloor a(p)/f \rfloor$
is sufficient if the child does not need to drive an additional
buffer. As aggregation trees are complete balanced binary
trees, summing over all nodes in depth $d$ we thus get that
$a_{d+1} = (2a_d + 2^d)/f$ for all $d \in [\alpha_v]$ is sufficient. This
recurrence has solution
$$a_d = \left(\frac{2}{f}\right)^d a(v) + \frac{2^{d-1}}{f-1}\left(1 - \frac{1}{f^d}\right) .$$
The total number of additional gates up to depth $\alpha_v - 1$ is
thus bounded by
$$\sum_{d=0}^{\alpha_v - 1} a_d = \sum_{d=0}^{\alpha_v - 1} \left[\left(\frac{2}{f}\right)^d a(v) + \frac{2^{d-1}}{f-1}\left(1 - \frac{1}{f^d}\right)\right] < \frac{f a(v)}{f-2} + \frac{2^{\alpha_v - 1}}{f-1} ,$$
matching the claimed bound.
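The closed form for $a_d$ and the resulting sum bound can be confirmed with exact arithmetic. This is a minimal sketch; `a_iter` and `a_closed` are illustrative names, and the tested ranges of $f$, $a(v)$, and $\alpha_v$ are arbitrary small samples.

```python
from fractions import Fraction


def a_iter(d, a_v, f):
    """Iterate a_{d+1} = (2 a_d + 2^d) / f starting from a_0 = a(v)."""
    a = Fraction(a_v)
    for i in range(d):
        a = (2 * a + 2 ** i) / f
    return a


def a_closed(d, a_v, f):
    """Closed form (2/f)^d a(v) + 2^(d-1)/(f-1) * (1 - f^(-d))."""
    return (Fraction(2, f) ** d * Fraction(a_v)
            + Fraction(2) ** (d - 1) / (f - 1) * (1 - Fraction(1, f ** d)))


for f in range(3, 8):
    for a_v in (Fraction(0), Fraction(1, 3), Fraction(5, 2), Fraction(17)):
        # The closed form solves the recurrence.
        assert all(a_iter(d, a_v, f) == a_closed(d, a_v, f) for d in range(12))
        # The partial sums stay strictly below the bound of Lemma 5.15.
        for alpha in range(12):
            total = sum(a_closed(d, a_v, f) for d in range(alpha))
            bound = f * a_v / (f - 2) + Fraction(2) ** (alpha - 1) / (f - 1)
            assert total < bound
```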
Fig. 7: $\mathrm{PPC}^{(3)}(C, T_4)$ with inputs $\pi_1, \ldots, \pi_{16}$ and outputs $d_1, \ldots, d_{16}$. Right recursion steps $R_r$ are marked with dark gray, left recursion steps with light gray. The step at
the root (above) and aggregation trees (below) are not marked explicitly. Duplicated gates are depicted in a layered fashion.
Dashed lines indicate that a wire is not participating in a recursive step.
It remains to show that no gates need to be provided
at the leaves of the aggregation tree; beside showing the
claimed bound on the increase in size of the circuit, this
is also necessary, because we cannot make copies of inputs
without increasing the depth of the circuit. As $f \ge 3$ and
nodes in aggregation trees have fan-out 2 in $\mathrm{PPC}(C, T_b)'$,
we can at least double the number of copies of each node
with each level of the aggregation tree. As the tree at leaf $v$
has depth $\alpha_v$, it is hence sufficient to show that $a(v) \le 2^{\alpha_v}$.
To bound $a(v)$ from above, we again exploit the recursive
structure of the construction. Denote by $A(b, x)$ an upper
bound on $a(v)/2^{\alpha_v}$ for all leaves $v \in T_b$ when defining
the values $a(v)$ as usual, except that we set $a(r) = x$ for
some $0 \le x \le 2^{b-1}$ at the root $r \in T_b$. For $b \ge 2$, we get that
$$A(b, x) \le \max\left\{A\left(b-1, \frac{x}{f}\right), A\left(b-2, \frac{2^{b-1} + x}{2f}\right)\right\} < \max\left\{A\left(b-1, 2^{b-2}\right), A\left(b-2, 2^{b-2}\right)\right\} .$$
Moreover, we have that $A(0, x) = x \le 1/2$ and $A(1, x) =
(x+1)/f \le 2/3$ for feasible values of $x$, where we use that
$f \ge 3$. Hence, $A(b, x) \le 2/3$ for all $b$ and feasible values of
$x$. In particular, $A(b, 0) \le 2/3$, implying that indeed $a(v) <
2^{\alpha_v}$ for all leaves $v \in T_b$.
Finally, we need to count the total number of gates we
add when implementing these modifications to the circuit.
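Lemma 5.16. For $f \ge 3$, define $\mathrm{PPC}^{(f)}(C, T_b)$ by modifying
$\mathrm{PPC}(C, T_b)'$ according to Lemmas 5.14 and 5.15. Then, with
$\lambda_1 := (1 + \sqrt{5})/4$, $|\mathrm{PPC}^{(f)}(C, T_b)|$ is bounded by
$$|\mathrm{PPC}(C, T_b)'| + 2^b \left(\frac{1}{2f-2} + \frac{2}{f-2} + O\!\left(\frac{\lambda_1^b}{f^2}\right)\right) |C| .$$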
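Proof. Denote by $R_b \subset T_b$ the set of right nodes in $T_b$. By
Lemma 5.14, we add at most $\sum_{v \in R_b} a(v)$ gates to the part
of the circuit corresponding to $T_b$. Recall that $L_b \subseteq T_b$ is the
set of leaves of $T_b$. By Lemmas 5.11 and 5.15, we add at most
$$\sum_{v \in L_b} \left(\frac{f a(v)}{f-2} + \frac{2^{\alpha_v - 1}}{f-1}\right) = \frac{2^{b-1}}{f-1} + \sum_{v \in L_b} \frac{f a(v)}{f-2}$$
gates to the aggregation trees.
To bound these numbers, denote for $d \in [b]$ by $x_d$ and
$y_d$ the sum over right and left nodes' $a(v)$ values in depth
$d$, respectively. Moreover, let $l_d$ and $r_d$ denote $2^{b-d}$ times
the number of left and right nodes in depth $d$, respectively.
Thus, we seek to bound $f(x_b + y_b)/(f-2) + \sum_{d=0}^{b-1} x_d$. We
have that $x_0 = y_0 = 0$, $l_1 = r_1 = 2^{b-1}$, and
$$\begin{pmatrix} x_d \\ y_d \\ l_{d+1} \\ r_{d+1} \end{pmatrix} = \begin{pmatrix} f^{-1} & 1 & 0 & 0 \\ f^{-1} & 0 & f^{-1} & 0 \\ 0 & 0 & 0 & 1/2 \\ 0 & 0 & 1/2 & 1/2 \end{pmatrix}^{d} \begin{pmatrix} x_0 \\ y_0 \\ l_1 \\ r_1 \end{pmatrix}$$
for all $d \in [b-1]$. The recurrence matrix has the eigenvalues
$$\lambda_1 = \frac{1}{4}\left(1 + \sqrt{5}\right), \quad \lambda_2 = \frac{1}{4}\left(1 - \sqrt{5}\right), \quad \lambda_3 = \frac{1}{2f}\left(1 + \sqrt{1 + 4f}\right), \quad \lambda_4 = \frac{1}{2f}\left(1 - \sqrt{1 + 4f}\right) .$$
The recurrence has solution $r_d = 2^{b-d} F_{d+1}$, $l_d = 2^{b-d} F_d$,
$$x_d = \frac{2^{b+1}}{f^2 - 10f + 20} \cdot \left(\frac{f^2 - 3f - 2}{\sqrt{1 + 4f}}\left(-\lambda_3^d + \lambda_4^d\right) - (f-2)\left(\lambda_1^d + \lambda_2^d - \lambda_3^d - \lambda_4^d\right) + \frac{3f - 10}{\sqrt{5}}\left(\lambda_1^d - \lambda_2^d\right)\right) \in O\!\left(\frac{2^b \lambda_1^d}{f}\right) ,$$
and $y_d = x_{d+1} - x_d/f$, where the asymptotic bound on $x_d$
uses that for $f \ge 3$ and all $i \in [1, 4]$, $\lambda_1 \ge |\lambda_i|$. Therefore,
$$\frac{f(x_b + y_b)}{f-2} + \sum_{d=0}^{b-1} x_d \in O\!\left(\frac{x_b + x_{b+1}}{f}\right) + \sum_{d=0}^{\infty} x_d = O\!\left(\frac{\lambda_1^d x_{d+1}}{f^2}\right) + \sum_{d=0}^{\infty} x_d .$$
Observe that for $f \ge 3$, $0 \neq |\lambda_i| < 1$ for all $i$; thus,
$\sum_{d=0}^{\infty} \lambda_i^d = 1/(1 - \lambda_i)$, yielding with some calculation that
$\sum_{d=0}^{\infty} x_d = 2^{b+1}/(f-2)$. Summation of the individual terms
and multiplication by $|C|$ proves the claim of the lemma.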
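The value $\sum_{d \ge 0} x_d = 2^{b+1}/(f-2)$ can be cross-checked by iterating the recurrence numerically instead of evaluating the closed form. The sketch below applies the matrix row by row in floating point; the function name `sum_of_x`, the step count, and the tolerance are assumptions chosen for illustration, and the recurrence is run formally beyond depth $b$, as in the infinite sum used in the proof.

```python
import math


def sum_of_x(b, f, steps=400):
    """Approximate sum_{d >= 0} x_d by iterating the recurrence of Lemma 5.16.

    State: (x_d, y_d, l_{d+1}, r_{d+1}), starting from x_0 = y_0 = 0 and
    l_1 = r_1 = 2^(b-1). All eigenvalues have absolute value below 1 for
    f >= 3, so the truncated sum converges geometrically.
    """
    x, y = 0.0, 0.0
    l = r = 2.0 ** (b - 1)
    total = 0.0
    for _ in range(steps):
        total += x
        # One application of the recurrence matrix (all right-hand sides use old values).
        x, y, l, r = x / f + y, (x + l) / f, r / 2, (l + r) / 2
    return total


for b in (1, 4, 10):
    for f in (3, 4, 5, 10):
        assert math.isclose(sum_of_x(b, f), 2 ** (b + 1) / (f - 2), rel_tol=1e-9)
```

Together with the $2^{b-1}/(f-1)$ term from the aggregation trees, this gives the leading coefficient $2^b \left(\frac{1}{2f-2} + \frac{2}{f-2}\right)$ stated in the lemma.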
As an example for the overall resulting construction,
we show PPC(3)(C, T4) in Figure 7. We summarize our
findings in the following theorem.
Theorem 5.17. Suppose that $C$ implements $\oplus$, buffers have size
$s$ and depth at most $d(C)$, and set $\lambda_1 := (1 + \sqrt{5})/4$. Then for
all $k \in [b+1]$, $b \in \mathbb{N}_0$, and $f \ge 3$, there is a $\mathrm{PPC}_{\oplus}(2^b)$ circuit
of fan-out $f$, depth $(b+k)\,d(C)$, and size at most
$$\left(2^{b+1} + 2^{b-k}\left(2 + \frac{5f - 6}{2f^2 - 6f + 4} + O\!\left(\frac{\lambda_1^b}{f^2}\right)\right)\right) |C| + \left(2^b + 2^{b-k-1}\right) s .$$
Proof. We argue as for Theorem 5.7, but replace PPC(C, Tb)
by PPC(f)(C, Tb), and need to make sure that we modify
the first k steps of the recursion, where we apply the
construction from Figure 4a, such that the fanout is at most
f . In fact, we will ensure a fanout of 2 for this part of the
construction. To this end, we simply add a buffer before each
output that is not directly driven by a copy of C , as already
indicated in the figure. This guarantees the invariant that
all inputs to and outputs of subcircuits are driven by an
element not driving anything else; for the PPC(f)(C, Tb)
subcircuit, this invariant holds by Lemma 5.10.
This adds in total $2^b - 2^{b-k}$ buffers to the circuit: one
for each output wire minus one for each output wire of
$\mathrm{PPC}^{(f)}(C, T_b)$. Thus, by Theorem 5.7, the size of the circuit
is bounded by $\Delta + \left(2^b - 2^{b-k}\right) s + \left(2^{b+1} + 2^{b-k+1}\right) |C|$,
where $\Delta := |\mathrm{PPC}^{(f)}(C, T_b)| - |\mathrm{PPC}(C, T_b)|$. By Lemmas
5.12 and 5.16, $\Delta$ is bounded by
$$2^{b-k}\left(\frac{1}{2f-2} + \frac{2}{f-2} + O\!\left(\frac{\lambda_1^{b-k}}{f^2}\right)\right) |C| + \frac{3 \cdot 2^{b-k} s}{2} .$$
Summation yields the claimed bound on the size of the circuit.
The depth bound and that we indeed get a $\mathrm{PPC}_{\oplus}(2^b)$
circuit follow as in Theorem 5.7, as the modifications to the
construction affected neither its depth nor its output.
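The summation in this proof combines two simple identities: the $f$-dependent coefficient $\frac{1}{2f-2} + \frac{2}{f-2}$ from Lemma 5.16 equals $\frac{5f-6}{2f^2-6f+4}$, and the buffer terms add up to $2^b + 2^{b-k-1}$. The sketch below checks both with exact fractions, ignoring the $O(\cdot)$ term; all function names are illustrative.

```python
from fractions import Fraction


def lemma_coefficient(f):
    """1/(2f-2) + 2/(f-2): overhead per 2^(b-k) copies of C, without the O-term."""
    return Fraction(1, 2 * f - 2) + Fraction(2, f - 2)


def theorem_coefficient(f):
    """(5f-6)/(2f^2-6f+4): the same quantity as written in Theorem 5.17."""
    return Fraction(5 * f - 6, 2 * f * f - 6 * f + 4)


def buffer_total(b, k):
    """(2^b - 2^(b-k)) output buffers plus the 3 * 2^(b-k) / 2 bound from Lemma 5.12."""
    return Fraction(2 ** b - 2 ** (b - k)) + Fraction(3 * 2 ** (b - k), 2)


for f in range(3, 50):
    assert lemma_coefficient(f) == theorem_coefficient(f)

for b in range(0, 12):
    for k in range(0, b + 1):
        assert buffer_total(b, k) == 2 ** b + Fraction(2 ** (b - k), 2)
```

For example, $f = 3$ gives a coefficient of $9/4$, and the coefficient decreases as $f$ grows.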
We refrain from analyzing the size of the construction for
values of B that are not powers of 2. However, in Figure 8
we plot the exact bounds (without buffers) for k = 0 and
selected values of f against B.
6 SIMULATION
Separate from and in addition to the proofs from the pre-
vious sections, we verify the correctness of our circuits by
VHDL simulation. To this end, we first need to specify
implementations of the subcircuits computing $\diamond_{\mathrm{M}}$ and $\mathrm{out}_{\mathrm{M}}$.
Fig. 8: Dependence of the size of the modified construction
on $f$ (number of copies of $C$ plotted against input width $B$, with curves for fan-out 3, 4, 5, 10, and unbounded fan-out, together with the upper bound). For comparison, the upper bound from Corollary 5.6
on the circuit with unbounded fan-out is shown as well.
6.1 Gate-Level Implementation of Operators
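From Tables 6a and 6b, for $s, b \in \mathbb{B}^2$ we can extract the
Boolean formulas
$$\begin{aligned}
(s \diamond b)_1 &= s_1\bar{s}_2 + s_1\bar{b}_1 + \bar{s}_2 b_1 \\
(s \diamond b)_2 &= \bar{s}_1 s_2 + \bar{s}_1 b_2 + s_2\bar{b}_2 \\
\mathrm{out}(s, b)_1 &= \bar{s}_1 b_2 + \bar{s}_2 b_1 + b_1 b_2 \\
\mathrm{out}(s, b)_2 &= s_1 b_2 + s_2 b_1 + b_1 b_2 .
\end{aligned}$$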
In general, realizing a Boolean formula $f$ by replacing
negation, multiplication, and addition by inverters, AND,
and OR gates, respectively, does not result in a circuit
implementing $f_{\mathrm{M}}$.¹ However, we can easily verify that the
above formulas are disjunctions of all prime implicants of
their respective functions. As shown in [10] (see also [16]),²
in this special case the resulting circuits do implement the
closure, provided the gates behave as in Table 3, which
the implementations given in Figure 1 do by Theorem 3.14.
Using distributive laws (recall that these also hold in Kleene
logic), the above formulas can be rewritten as
$$\begin{aligned}
(s \diamond b)_1 &= s_1(\bar{s}_2 + \bar{b}_1) + \bar{s}_2 b_1 \\
(s \diamond b)_2 &= s_2(\bar{s}_1 + \bar{b}_2) + \bar{s}_1 b_2 \\
\mathrm{out}(s, b)_1 &= b_1(b_2 + \bar{s}_2) + b_2\bar{s}_1 \\
\mathrm{out}(s, b)_2 &= b_2(b_1 + s_1) + b_1 s_2 .
\end{aligned}$$
We see that, in fact, a single circuit with suitably wired (and
possibly negated) inputs can implement all four operations.
As for $\mathrm{sel}_1 = \overline{\mathrm{sel}_2}$ the circuit implements a multiplexer with
select bit $\mathrm{sel}_1$, we refer to it as extended multiplexer, or XMUX
for short. Its functionality is specified by
$$\mathrm{XMUX}(\mathrm{sel}_1, \mathrm{sel}_2, x, y) := y(x + \mathrm{sel}_2) + x\,\mathrm{sel}_1 .$$
Figure 9 shows the resulting circuit, and Table 11 lists how
to map inputs to compute $\diamond_{\mathrm{M}}$ and $\mathrm{out}_{\mathrm{M}}$.
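To illustrate the XMUX wiring, the following sketch evaluates the formulas in Kleene logic. It models the three values as 0.0, 0.5 (metastable, M), and 1.0, with AND as minimum, OR as maximum, and NOT as $1 - a$; this encoding and all function names are assumptions made for illustration. It checks that the four wirings of Table 11 agree with the prime-implicant formulas on all ternary inputs, and reproduces the example from footnote 1.

```python
from itertools import product

M = 0.5  # metastable value; 0.0 and 1.0 are the stable values
VALUES = (0.0, M, 1.0)


def k_not(a):
    return 1.0 - a


def k_and(*xs):
    return min(xs)


def k_or(*xs):
    return max(xs)


def xmux(sel1, sel2, x, y):
    """XMUX(sel1, sel2, x, y) = y(x + sel2) + x sel1, evaluated in Kleene logic."""
    return k_or(k_and(y, k_or(x, sel2)), k_and(x, sel1))


def three_terms(p, q, r, t):
    """Generic prime-implicant form p*q + p*r + q*t used by all four formulas."""
    return k_or(k_and(p, q), k_and(p, r), k_and(q, t))


def diamond1_two_terms(s1, s2, b1):
    """s1*not(b1) + not(s2)*b1: Boolean-equivalent, but not the closure."""
    return k_or(k_and(s1, k_not(b1)), k_and(k_not(s2), b1))


def wiring_matches():
    for s1, s2, b1, b2 in product(VALUES, repeat=4):
        ns1, ns2, nb1, nb2 = k_not(s1), k_not(s2), k_not(b1), k_not(b2)
        checks = (
            xmux(b1, nb1, ns2, s1) == three_terms(s1, ns2, nb1, b1),  # (s <> b)_1
            xmux(b2, nb2, ns1, s2) == three_terms(s2, ns1, nb2, b2),  # (s <> b)_2
            xmux(ns1, ns2, b2, b1) == three_terms(b1, b2, ns2, ns1),  # out(s, b)_1
            xmux(s2, s1, b1, b2) == three_terms(b2, b1, s1, s2),      # out(s, b)_2
        )
        if not all(checks):
            return False
    return True


assert wiring_matches()
# Footnote 1: with s1 = not(s2) = 1 and b1 = M, the two-term formula yields M,
# whereas the prime-implicant form (and hence the XMUX) yields a stable 1.
assert diamond1_two_terms(1.0, 0.0, M) == M
assert xmux(M, k_not(M), 1.0, 1.0) == 1.0
```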
We note that this circuit is not a particularly efficient
XMUX implementation; a transistor-level implementation
would be much smaller. However, our goal here is to verify
correctness and give some initial indication of the size of the
resulting circuits; a fully optimized ASIC circuit is beyond
the scope of this article. In [4], the size of the implementation
is slightly reduced by moving negations. Due to space
limitations, we refrain from detailing this modification here,
but note that Figure 12 and Table 12 take it into account.

1. For instance, $(s \diamond b)_1 = s_1\bar{b}_1 + \bar{s}_2 b_1$ as Boolean formula, but the
two expressions differ when evaluated on $s_1 = \bar{s}_2 = 1$ and $b_1 = \mathrm{M}$. The
circuits resulting from the different formulas are implementations of a
multiplexer (with select bit $b_1$) and its closure, respectively.
2. Alternatively, one can manually verify that these formulas evaluate
to the truth tables given in Tables 8 and 9.

Fig. 9: XMUX circuit (inputs $x$, $y$, $\mathrm{sel}_1$, $\mathrm{sel}_2$), used to implement $\diamond_{\mathrm{M}}$ and $\mathrm{out}_{\mathrm{M}}$.

TABLE 11: Wiring an XMUX to compute the various operators.
$\mathrm{sel}_1$ | $\mathrm{sel}_2$ | $x$ | $y$ | $\mathrm{XMUX}(\mathrm{sel}_1, \mathrm{sel}_2, x, y)$
$b_1$ | $\bar{b}_1$ | $\bar{s}_2$ | $s_1$ | $(s \diamond_{\mathrm{M}} b)_1$
$b_2$ | $\bar{b}_2$ | $\bar{s}_1$ | $s_2$ | $(s \diamond_{\mathrm{M}} b)_2$
$\bar{s}_1$ | $\bar{s}_2$ | $b_2$ | $b_1$ | $\mathrm{out}_{\mathrm{M}}(s, b)_1$
$s_2$ | $s_1$ | $b_1$ | $b_2$ | $\mathrm{out}_{\mathrm{M}}(s, b)_2$
6.2 Putting it All Together
We now have all the pieces in place to assemble a containing
2-sort($B$) circuit. By Theorem 4.3, $\diamond_{\mathrm{M}}$ is associative. Thus,
from a given implementation of $\diamond_{\mathrm{M}}$ (e.g., two copies of the
circuit from Figure 9 with appropriate wiring and negation,
cf. Table 11) we can construct $\mathrm{PPC}_{\diamond_{\mathrm{M}}}(B-1)$ circuits of small
depth and size, as shown in Section 5. We can combine such
a circuit with an $\mathrm{out}_{\mathrm{M}}$ implementation (again, two XMUXes
with appropriate wiring and negation will do) as shown in
Figure 10 to obtain our 2-sort($B$) circuit.
6.3 Simulation Setup
We implemented the design given in Figure 10 at register-transfer
level using the $\mathrm{PPC}_{\diamond_{\mathrm{M}}}(B-1)$ circuit given by
Theorem 5.7 for $k = 0$.³ Quartus by Altera is used for
design entry, which in our case mainly consists of checking
correct implementation. After design entry we use ModelSim
by Altera for behavioral simulation. Note that we must
not simulate the preprocessed Quartus output, because processing
may compromise metastability-containing behavior.
Instead, we simulate pure VHDL. Metastable signals
are simulated using the VHDL signal value X, because its behavior
matches the worst-case behavior assumed for $\mathrm{M}$.
The correctness of this construction follows from Theorems
4.7 and 4.8, where we can plug in any $\mathrm{PPC}_{\diamond_{\mathrm{M}}}(B-1)$
circuit, cf. Section 5. For the circuits derived by relying
on the XMUX circuit from Figure 9, we independently confirmed
this via simulation.
6.4 Results
For the implementation of $\mathrm{PPC}_{\diamond_{\mathrm{M}}}(B-1)$ we used the
circuits from Theorem 5.7, i.e., we did not make use of the
extension to constant fan-out. We exhaustively checked the
design from Figure 10 for $B$ up to 12 (and all feasible $k$).
Simulation shows that the design works correctly for several
levels of recursion, e.g., when regarding $B = 1$ and $B = 2$
as simple base cases, $B = 12$ implies 3 levels of recursion
for both patterns. We refrained from simulating the constant
fan-out construction, because it simply replicates intermediate
results without changing functionality.

3. For $k > 0$, fan-out becomes an issue, requiring the more involved
constructions provided by Theorem 5.17. However, the resulting
numbers would be inaccurate, and a detailed comparison based on
optimized ASIC implementations is beyond the scope of this work.

Fig. 10: Constructing 2-sort($B$) from $\mathrm{PPC}_{\diamond_{\mathrm{M}}}(B-1)$ and
$\mathrm{out}_{\mathrm{M}}$.

Fig. 11: Excerpt from a simulation for 4-bit inputs, where
X = $\mathrm{M}$. The rows show (from top to bottom) the inputs $g$
and $h$, both outputs of the simple non-containing circuit,
and both outputs of our design. As inputs $g$ and $h$ we
randomly generated valid strings. Columns 1 and 3 show
that the simpler design fails to implement a 2-sort(4) circuit.
Signal | Column 1 | Column 2 | Column 3 | Column 4
/comparison/g | 010X | 0X10 | 01X1 | 1011
/comparison/h | 0101 | 0011 | 010X | 00X1
/comparison/max_nonmc | 010X | 0X10 | 01XX | 1011
/comparison/min_nonmc | 010X | 0011 | 01XX | 00X1
/comparison/max_mc | 010X | 0X10 | 010X | 1011
/comparison/min_mc | 0101 | 0011 | 01X1 | 00X1
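The expected outputs of a containing 2-sort circuit can be computed by brute force for small widths, which is useful when reading traces such as the one in Fig. 11. The sketch below is not the circuit construction of this article; it is a reference model that assumes binary reflected Gray code (most significant bit first) and computes the bitwise superposition of max and min over all resolutions of the metastable bits. All names are illustrative.

```python
from itertools import product


def gray_to_int(bits):
    """Decode a binary reflected Gray code string, most significant bit first."""
    value, acc = 0, 0
    for ch in bits:
        acc ^= int(ch)
        value = (value << 1) | acc
    return value


def resolutions(word):
    """All stable strings obtained by replacing each X with 0 or 1."""
    options = [("0", "1") if ch == "X" else (ch,) for ch in word]
    return ["".join(p) for p in product(*options)]


def superpose(words):
    """Keep a bit where all words agree; output X where they differ."""
    return "".join(col[0] if len(set(col)) == 1 else "X" for col in zip(*words))


def two_sort_closure(g, h):
    """Reference (max, min) of two Gray code words that may contain X."""
    pairs = [(a, b) for a in resolutions(g) for b in resolutions(h)]
    maxima = [a if gray_to_int(a) >= gray_to_int(b) else b for a, b in pairs]
    minima = [b if gray_to_int(a) >= gray_to_int(b) else a for a, b in pairs]
    return superpose(maxima), superpose(minima)


# The four input columns of Fig. 11 and the outputs of the containing design.
assert two_sort_closure("010X", "0101") == ("010X", "0101")
assert two_sort_closure("0X10", "0011") == ("0X10", "0011")
assert two_sort_closure("01X1", "010X") == ("010X", "01X1")
assert two_sort_closure("1011", "00X1") == ("1011", "00X1")
```

The model enumerates all resolutions and therefore scales exponentially in the number of X bits; it serves only as a specification against which circuit outputs can be compared.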
6.5 Comparison to Baseline
After behavioral simulation, we continue with a com-
parison of our design and a standard sorting approach
Bin-comp(B). As mentioned earlier, the 2-sort(B) imple-
mentation given in Figure 10 is slightly optimized by pulling
out a negation from the operators in every recursive step [4].
After design entry as described above, we use Encounter
RTL Compiler for synthesis and Encounter for place and
route. Both tools are part of the Cadence tool set and in
both steps we use NanGate 45 nm Open Cell Library as a
standard cell library.
Since metastability-containing circuits may include ad-
ditional gates that are not required in traditional Boolean
logic, Boolean optimization may compromise metastability-
containing properties [3]. Accordingly, we were forced to
disable optimization during synthesis of the circuits.
Fig. 12: Comparison of our solution PPC Sort to a standard
non-containing one (plotted against input width $B$). For the latter, the unexpected delay
reduction at $B = 16$ is the result of automatic optimization
with more powerful gates, which our solution does not use.
Binary Benchmark Bin-comp: In short, Bin-comp
consists of a simple VHDL statement comparing two bi-
nary encoded inputs and outputting the maximum and the
minimum, accordingly. It follows the same design process
as 2-sort, but then undergoes optimization using a more
powerful set of basic gates. For example, the standard cell
library provides prebuilt multiplexers. These multiplexers
are used by Bin-comp, but not by 2-sort, as they are not
metastability-containing. We stress that these more power-
ful gates provide optimized implementations of multiple
Boolean functions, yet each of them is still counted as
a single gate. Thus, comparing our design to the binary
design in terms of gate count, area, and delay disfavors
our solution. Moreover, we noticed that the optimization
routine switches to employing more powerful gates when
going from B = 8 to B = 16 (cf. Figure 12), resulting in a
decrease of the delay of the Bin-comp implementation.
Nonetheless, our design performs comparably to the
non-containing binary design in terms of delay, cf. Figure 12
and Table 12. This is quite notable, as further optimization
is possible by optimizing our design on the transistor level,
with significant expected gains. The same applies to gate
count and area, where a notable gap remains. Recall, how-
ever, that the Bin-comp design hides complexity by using
more advanced gates and does not contain metastability.
We emphasize that we refrained from optimizing the
design by making use of all available gates or devising
transistor-level implementations, as such an approach is tied
to the utilized library or requires design of standard cells.
7 CONCLUSIONS
In this work, we demonstrated that efficient metastability-
containing sorting circuits are possible. Our results indicate
that optimized implementations can achieve the same delay
as non-containing solutions, without a dramatic increase in
circuit size. This is of high interest to an intended appli-
cation motivating us to design MC sorting circuits: fault-
tolerant high-frequency clock synchronization. Sorting is a
key step in envisioned implementations (cf. [10], [15]) of
the Lynch-Welch algorithm [30] with improved precision of
synchronization. The efficient MC sorting networks presented
in this article make it possible to eliminate synchronizer
delay completely. This allows clock corrections to be applied
at a higher rate, which significantly reduces
the negative impact of phase drift of local clock sources on
the precision of the algorithm (cf. [18]).
This goal will necessitate devising optimized ASIC
implementations of our circuits. The novel PPC circuits we
devised in Section 5 are an important contribution towards
this end. Note that it is crucial to take into account both
depth and fan-out for devising low-delay circuits. Hence,
follow-up work needs to compare the existing and our novel
design based on suitable metrics that take both into account
to reliably predict the achieved trade-offs between delay,
area, and energy consumption of circuits. Note that this is
of relevance beyond the specific application of MC sorting:
PPC circuits lie at the heart of adder designs, implying that
even a minor improvement can have significant impact on
the overall performance of computing devices!
MC Control Loops: More generally speaking, MC
circuits like those presented here are of interest in mixed-
signal control loops whose performance depends on very
short response times. When analog control is not desirable,
traditional solutions incur synchronizer delay before being
able to react to any input change. Using MC logic saves the
time for synchronization, while metastability of the output
corresponds to the initial uncertainty of the measurement;
thus, the same quality of the computational result can be
achieved in shorter time. Note that our circuits are purely
combinational, so they can be used in both clocked and
asynchronous control logic.
Obvious examples of such control loops are clock syn-
chronization circuits, but MC has been shown to be useful
for adaptive voltage control [13] and fast routing with an
acceptable low probability of data corruption [29] as well.
This type of application suggests exploring whether efficient
circuits exist for a wider range of arithmetic operations,
such as addition or (possibly approximate) multiplication.
Redundant Encoding and Addition: On the theoreti-
cal side, our results are to be contrasted with the exponential
gap between the size of non-containing and MC circuits
shown in [17]. This work raised the question for which
classes of functions small MC circuits exist. Given that
Ladner and Fischer proved that the PPC task can be solved
efficiently for any constant-sized state machine [23], it was
natural to ask whether this result can be extended to MC
computations. In follow-up work, we show that indeed this
holds true for any constant-sized FSM [5]. However, when
applying this result to addition, unlike for sorting (where
the underlying operations are max and min) uncertainty
of inputs adds up. This means that Gray code can support
meaningful computations only if the total uncertainty of all
addends is at most 1.
Accordingly, in [5] we also consider redundant encodings,
showing that using $k$ (roughly) redundant bits, an
uncertainty of $\lfloor (k+1)/2 \rfloor$ can be tolerated without loss of
precision. Combined with the above result on transducers,
this yields a meaningful notion of MC addition that allows
for efficient circuits. As, essentially, the redundant bits are
used as a unary code, it should be straightforward to apply
the techniques from this article to obtain efficient sorting
circuits with the encoding from [5]. We remark that the
encoding from [5] turns out to be identical to that of the
output of suitable time-to-digital converters [12], so relaxing
their output constraints to achieve better average-case
performance would provide valid input for sorting circuits
that accept inputs encoded in this manner.
We believe that these results suggest applicability of our
techniques to a wide range of mixed-signal control loops
and call for future work further exploring to what extent
basic arithmetic can be realized by efficient MC circuits.

TABLE 12: Simulation results for metastability-containing sorting networks with $n \in \{4, 7, 10\}$ for $B$-bit inputs. 10-sort$_\#$
optimizes gate count [7], 10-sort$_d$ optimizes depth [6]; for $n \in \{4, 7\}$, the sorting networks are optimal w.r.t. both measures.
Simulation results are: (i) number of gates, (ii) postlayout area [µm²] and (iii) prelayout delay [ps].
B | Circuit | 4-sort gates | 4-sort area | 4-sort delay | 7-sort gates | 7-sort area | 7-sort delay | 10-sort$_\#$ gates | 10-sort$_\#$ area | 10-sort$_\#$ delay | 10-sort$_d$ gates | 10-sort$_d$ area | 10-sort$_d$ delay
2 | our work | 65 | 87.402 | 357 | 208 | 279.741 | 714 | 377 | 506.912 | 912 | 403 | 541.968 | 833
2 | Bin-comp | 40 | 77.91 | 478 | 128 | 249.326 | 953 | 232 | 451.815 | 1284 | 248 | 483 | 1145
4 | our work | 275 | 368.641 | 640 | 880 | 1179.528 | 1014 | 1595 | 2137.905 | 1235 | 1705 | 2285.514 | 1133
4 | Bin-comp | 95 | 172.935 | 906 | 304 | 553.28 | 1810 | 551 | 1002.848 | 2429 | 589 | 1072.099 | 2143
8 | our work | 845 | 1136.184 | 1396 | 2704 | 3636.08 | 1921 | 4901 | 6590.283 | 2179 | 5239 | 7044.541 | 2059
8 | Bin-comp | 205 | 368.641 | 1475 | 656 | 1179.528 | 2948 | 1189 | 2137.905 | 3945 | 1271 | 2285.514 | 3470
16 | our work | 2035 | 2739.961 | 2069 | 6512 | 8767.374 | 3396 | 11803 | 15891.12 | 4030 | 12617 | 16987.194 | 3844
16 | Bin-comp | 405 | 530.67 | 1298 | 1296 | 2425.99 | 2600 | 2349 | 4397.085 | 3474 | 2511 | 4700.304 | 3050
Acknowledgments: We thank Attila Kinali and the
anonymous reviewers for valuable input. This project has
received funding from the European Research Council
(ERC) under the European Union’s Horizon 2020 research
and innovation programme (grant agreement 716562).
REFERENCES
[1] M. Ajtai, J. Komlós, and E. Szemerédi. An O(n log n) Sorting
Network. In STOC, 1983.
[2] R. P. Brent and H. T. Kung. A Regular Layout for Parallel Adders.
Transactions on Computers, C-31(3):260–264, 1982.
[3] Johannes Bund, Christoph Lenzen, and Moti Medina. Near-
Optimal Metastability-Containing Sorting Networks. In DATE,
2017.
[4] Johannes Bund, Christoph Lenzen, and Moti Medina. Optimal
Metastability-Containing Sorting Networks. In DATE, 2018.
[5] Johannes Bund, Christoph Lenzen, and Moti Medina. Small
Hazard-free Transducers. CoRR, abs/1811.12369, 2018.
[6] Daniel Bundala and Jakub Závodný. Optimal sorting networks.
In LATA, pages 236–247. Springer, 2014.
[7] Michael Codish, Luís Cruz-Filipe, Michael Frank, and Peter
Schneider-Kamp. 25 comparators is optimal when sorting 9 inputs
(and 29 for 10). In ICTAI, 2014.
[8] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduc-
tion to Algorithms, 3rd Edition. MIT Press, 2009.
[9] Stephan Friedrichs. Metastability-Containing Circuits, Parallel Distance
Problems, and Terrain Guarding. PhD thesis, Saarland University,
Saarbrücken, Germany, 2017.
[10] Stephan Friedrichs, Matthias Függer, and Christoph Lenzen.
Metastability-Containing Circuits. Transactions on Computers, 67,
2018.
[11] Stephan Friedrichs and Attila Kinali. Efficient Metastability-
Containing Multiplexers. In ISVLSI, pages 332–337, 2017.
[12] Matthias Függer, Attila Kinali, Christoph Lenzen, and Thomas
Polzer. Metastability-aware Memory-efficient Time-to-Digital
Converters. In ASYNC, 2017.
[13] Matthias Függer, Attila Kinali, Christoph Lenzen, and Ben Wiederhake.
Fast All-Digital Clock Frequency Adaptation Circuit for
Voltage Droop Tolerance. In ASYNC, pages 68–77, 2018.
[14] R. Ginosar. Metastability and Synchronizers: A Tutorial. Design
Test of Computers, 28(5):23–35, 2011.
[15] Florian Huemer, Attila Kinali, and Christoph Lenzen. Fault-
tolerant Clock Synchronization with High Precision. In ISVLSI,
2016.
[16] D. A. Huffman. The Design and Use of Hazard-Free Switching
Networks. JACM, 4(1):47–62, 1957.
[17] C. Ikenmeyer, B. Komarath, C. Lenzen, V. Lysikov, A. Mokhov,
and K. Sreenivasaiah. On the complexity of hazard-free circuits.
In STOC, pages 878–889, 2018.
[18] P. Khanchandani and C. Lenzen. Self-Stabilizing Byzantine Clock
Synchronization with Optimal Precision. Theory of Computing
Systems, 2018.
[19] David J. Kinniment. Synchronization and Arbitration in Digital
Systems. Wiley Publishing, 2008.
[20] Stephen Cole Kleene. Introduction to Metamathematics. North
Holland, 1952.
[21] Donald E. Knuth. The Art of Computer Programming Vol. 3:
Sorting and Searching, 1998.
[22] P. M. Kogge and H. S. Stone. A Parallel Algorithm for the Efficient
Solution of a General Class of Recurrence Equations. Transactions
on Computers, C-22(8):786–793, 1973.
[23] Richard E Ladner and Michael J Fischer. Parallel Prefix Computa-
tion. JACM, 27(4):831–838, 1980.
[24] Christoph Lenzen and Moti Medina. Efficient metastability-
containing gray code 2-sort. In ASYNC, pages 49–56, 2016.
[25] Leonard Marino. General Theory of Metastable Operation. Trans-
actions on Computers, C-30(2):107–115, 1981.
[26] J. Sklansky. Conditional-Sum Addition Logic. IRE Transactions on
Electronic Computers, EC-9(2):226–231, 1960.
[27] Earl E. Swartzlander and Carl E. Lemonds. Computer Arithmetic,
volume 1. World Scientific, 2015.
[28] G. Tarawneh and A. Yakovlev. An RTL Method for Hiding Clock
Domain Crossing Latency. In ICECS, pages 540–543, 2012.
[29] Ghaith Tarawneh, Matthias Függer, and Christoph Lenzen. Metastability
Tolerant Computing. In ASYNC, pages 25–32, 2017.
[30] Jennifer Lundelius Welch and Nancy A. Lynch. A New Fault-
Tolerant Algorithm for Clock Synchronization. Information and
Computation, 77(1), 1988.
[31] Reto Zimmermann. Binary Adder Architectures for Cell-Based VLSI
and their Synthesis. PhD thesis, ETH Zurich, 1997.
