showed that metastability can be contained when sorting inputs arising from time-to-digital converters, i.e., measurement values can be correctly sorted without resolving metastability using synchronizers first. However, this work left open whether this can be done by small circuits. We show that this is indeed possible, by providing a circuit that sorts Gray code inputs (possibly containing a metastable bit) and has asymptotically optimal depth and size. Our solution utilizes the parallel prefix computation (PPC) framework (JACM 1980). We improve this construction by bounding its fan-out by an arbitrary $f \ge 3$, without affecting depth and increasing circuit size by a small constant factor only. Thus, we obtain the first PPC circuits with asymptotically optimal size, constant fan-out, and optimal depth. To show that applying the PPC framework to the sorting task is feasible, we prove that the latter can, despite potential metastability, be decomposed such that the core operation is associative. We obtain asymptotically optimal metastability-containing sorting networks. We complement these results with simulations, independently verifying the correctness as well as small size and delay of our circuits. Proofs are omitted in this version; the article with full proofs is provided online at http://arxiv.org/abs/1911.00267. J. Bund and C. Lenzen are with the
INTRODUCTION
Metastability is a fundamental obstacle when crossing clock domains, potentially resulting in soft errors with critical consequences [14]. As it has been shown that metastability cannot be avoided deterministically [25], synchronizers [19] are employed to reduce the error probability to tolerable levels. This approach trades precious time for reliability: the more time is allocated for metastability resolution, the smaller the probability of metastability-induced faults.
Recently, a different approach has been proposed, coined metastability-containing (MC) circuits [10]. It accepts a limited amount of metastability in the input to a digital circuit and ensures limited metastability of its output, so that the result is still useful. In a series of works [3], [4], [24], we applied this approach to a fundamental primitive: sorting. The circuit given in [4] is asymptotically optimal in depth and size.
Our Contribution. In this article, we present the machinery used to obtain the circuit from [4] in detail. We prove that CMOS implementations of basic gates realize Kleene logic (cf. [20, section 64]), justifying the computational model introduced in [10] and used in this article.
The task of sorting an arbitrary number of inputs can be reduced to sorting two inputs by using sorting networks [21]. The 0-1-principle (cf. Section 2) shows that plugging an MC 2-sort($B$) circuit (for $B$-bit inputs) into a sorting network (for $n$ values) readily yields an MC circuit that is capable of sorting $n$ inputs. Hence, we need to design a 2-sort($B$) circuit sorting two inputs in an MC way.
As the choice of the encoding matters a lot for MC circuits, we characterize the set of input strings we want to sort ("valid strings"). A valid string is either a (standard) Gray code string or a string obtained from a Gray code string by replacing the unique bit that would change on the up-count to the "next" codeword by M for metastability (the third logic value in Kleene logic). When using nonredundant codes, the use of Gray codes is mandatory: when converting an analog value to a digital one, continuously changing the input can force any circuit (that uses the value in a non-trivial way) into metastability [25]. Moreover, for combinational circuits in the abstraction of Kleene logic, all output bits that change when flipping a given input bit must become unstable when the input bit is unstable, cf. [10]. For instance, encoding a value unknown to be 11 or 12 in standard binary code would result in a string that, once metastability has been resolved, may represent any number in the interval from 8 to 15, cf. Section 3.
Valid strings arise naturally when stopping a Gray code counter asynchronously [12] or, more generally, whenever performing analog-to-digital conversion; respective circuits may risk multiple metastable bits to achieve better average-case precision, but for the best worst-case precision one can stick to guaranteeing valid strings as output. Exploiting the structure of Gray code and the restriction to valid strings, we show how to reliably sort all inputs despite the uncertainty about the represented value arising from metastability.
We formally specify the 2-sort($B$) circuit and then prove that the task of comparing two valid strings can be decomposed into first performing a four-valued comparison on each prefix pair of the two valid input strings, and then inferring the corresponding output bits. This reduces the design of 2-sort($B$) to a parallel prefix computation (PPC) problem, which for our purposes can be phrased as follows. Fast PPC circuits that are simultaneously (asymptotically) optimal in depth and size are known due to a celebrated result by Ladner and Fischer [23]. Going beyond [4], we present the full range of solutions that can be derived using their framework, which allows for a trade-off between depth and size of the 2-sort circuit. Most prominently, optimizing for depth reduces the depth of the circuit by a factor of 2 compared to [4] to optimal $\lceil \log B \rceil$, at the expense of increasing the size by a factor of up to 2.
However, relying on the construction from [23] as-is results in a very large fan-out. We present a modification reducing fan-out to any number $f \ge 3$ without affecting depth, increasing the size by a factor of only $1 + O(1/f)$ (plus at most $3B/2$ buffers). In particular, our results imply that the depth of an MC sorting circuit can match the delay of a non-containing circuit, while maintaining constant fan-out and a constant-factor size overhead. Due to the fact that PPC circuits lie at the heart of fast adders [27], we consider this result of independent interest.
We complement our theoretical findings by simulations confirming the correctness and small size of the devised circuits. Post-layout area and delay of the designed circuits compare favorably with a baseline provided by a straightforward non-containing implementation.
Organization of this Article. We discuss related work in Section 2. Some preliminaries, the computational model and its justification, as well as the problem specification are given in Section 3. Next, in Section 4, we break the task of designing a 2-sort($B$) circuit down into comparing prefixes and subsequently generating the output bits out of the computed comparison values and the respective pair of input bits. The comparison can be further decomposed into sequential application of an associative operator, which enables application of the PPC framework to compute all prefixes efficiently in parallel with (asymptotically) optimal depth. In order to keep this article self-contained, we compactly review the PPC framework in Section 5. The section then proceeds to showing how to modify the construction for bounded fan-out and bounding the size of the resulting circuits. In Section 6, we implement the base operators by subcircuits and plug the pieces together to obtain complete circuits. We then simulate them up to an input width of $B = 16$ to independently verify their correctness, and provide delay and area of the laid out circuits. We compare to a non-containing version as baseline, demonstrating the controlled increase in size of the circuit. We conclude the article in Section 7, where we also briefly discuss follow-up work that generalizes our results, demonstrating that higher-level concepts of this work like sorting networks and parallel prefix computation are applicable to further MC circuits.
RELATED WORK
Sorting Networks. Sorting networks (see, e.g., [21] ) sort n inputs from a totally ordered universe by feeding them into n parallel wires that are connected by 2-sort elements, i.e., subcircuits sorting two inputs; these can act in parallel whenever they do not depend on each other's output. A correct sorting network sorts all possible inputs, i.e., the wires are labeled 1 to n such that the ith wire outputs the ith element of the sorted list of inputs. The size of a sorting network is its number of 2-sort elements and its depth is the maximum number of 2-sort elements an input may pass through until reaching the output.
The 0-1-principle [21] states that a sorting network, assuming the 2-sort circuits are correct, is correct if and only if it sorts 0-1 inputs correctly. Thus, we obtain sorting networks for inputs that may suffer from metastability by constructing 2-sort circuits (w.r.t. a suitable order on such inputs) and plugging them into existing sorting networks.
Sorting networks have been extensively studied. Tight lower bounds of depth $\Omega(\log n)$ (trivial) and size $\Omega(n \log n)$ (see, e.g., [8]) are known and can be simultaneously asymptotically matched [1]. More practically, for small values of $n$ optimal depth and/or size networks are known [6], [7], [21]. Accordingly, our task boils down to finding optimal (or close to optimal) metastability-containing 2-sort circuits. For $B$-bit inputs, our 2-sort circuits have depth and size $O(\log B)$ and $O(B)$, respectively, which is (trivially) optimal up to constants; as size and depth of our circuits are close to non-containing 2-sort circuits (cf. Table 12), we conclude that our approach yields MC sorting networks that are optimal up to small constant factors in both depth and size.
Prior Work on MC Circuits. Recent work [10] shows that for any Boolean function a combinational MC circuit implementing its metastable closure (see Definition 3.8) exists. The metastable closure can be seen as a best effort to contain metastability: when for an input with (some) metastable bits the stable input bits already determine a given output bit of the original Boolean function, the closure attains the respective value on this output bit; otherwise it is metastable.
Unfortunately, the proof from [10], which uses a construction dating back to Huffman [16], yields circuits of exponential size in the number of input bits $B$. The same is true for speculative computing [28]. Unconditional lower bounds on MC circuits [17] show that this cannot be avoided in general, even if the implemented function admits a small non-containing circuit. The same work provides, assuming that at most $k$ input bits can be metastable, a construction with multiplicative $B^{O(k)}$ and additive $O(k \log B)$ overheads in size and depth, respectively. For the 2-sort element, $k = 2$ (each Gray code string may contain one metastable bit), but the resulting circuits are still far from optimal.
In [10], an alternative construction relying on non-combinational logic is given, achieving (up to minor-order terms) factor $2k + 1$ increase in size and additive $\Theta(\log k)$ increase in depth of the resulting circuit; for a 2-sort circuit, $k = 2$, so these overheads are constant. Rule-of-thumb calculations suggest that optimized versions of the circuits presented here and derived by this method would have comparable performance. A fair and detailed comparison would require fully-fledged designs of both approaches, which is beyond the scope of this article. Note, however, that our design has the advantage of being purely combinational.
Parallel Prefix Computation. Ladner and Fischer [23] studied the parallel application of an associative operator to all prefixes of an input string of length $\ell$ (over an arbitrary alphabet). They give parallel prefix computation circuits of depth $O(\log \ell)$ and size $O(\ell)$ (where the circuit implementing the operator is assumed to have size and depth 1). However, when requiring optimal depth of $\lceil \log \ell \rceil$, their corresponding solution suffers from fan-out larger than $\ell/2$. An earlier construction by Kogge and Stone [22] simultaneously achieves optimal depth and fan-out of 2. This yields the fastest adder circuits to date (cf. [27]), but at the expense of a large size of $\ell(\lceil \log \ell \rceil - 1) + 1$. A number of additional constructions have been developed for adders, including special cases ([2], [26]) of the one by Ladner and Fischer, cf. [31]. However, no other construction achieves asymptotically optimal depth and size.
MODEL AND PROBLEM
In this section, we discuss how to model metastability in a worst-case fashion and formally specify the input/output behavior of our circuits. Our model is a simplified version of the one from [10] for combinational circuits (cf. [9, Chap. 7]). That is, we represent metastable "bits" by M and extend truth tables as in Kleene's 3-valued logic [20, Section 64].
Basic Notation. We set $[N] := \{0, \ldots, N-1\}$ for $N \in \mathbb{N}$ and $[i,j] = \{i, i+1, \ldots, j\}$ for $i, j \in \mathbb{N}$, $i \le j$. We denote $\mathbb{B} := \{0,1\}$ and $\mathbb{B}_M := \{0,1,M\}$. For a $B$-bit string $g \in \mathbb{B}_M^B$ and $i \in [1,B]$, denote by $g_i$ its $i$th bit, i.e., $g = g_1 g_2 \ldots g_B$. We use the shorthand $g_{i,j} := g_i \ldots g_j$, where $i, j \in [1,B]$ and $i \le j$. Let $\operatorname{par}(g)$ denote the parity of $g \in \mathbb{B}^B$, i.e., $\operatorname{par}(g) = \sum_{i=1}^{B} g_i \bmod 2$.
For a function $f$ and a set $A$ we abbreviate $f(A) := \{f(y) \mid y \in A\}$.
Binary Reflected Gray Code
A standard binary representation of inputs is unsuitable: uncertainty of the input values may be arbitrarily amplified by the encoding. E.g., representing a value unknown to be 11 or 12, which are encoded as 1011 resp. 1100, would result in the bit string 1MMM, i.e., a string that is metastable in every position that differs for both strings. However, 1MMM may represent any number in the interval from 8 to 15, amplifying the initial uncertainty of being in the interval from 11 to 12. An encoding that does not lose precision for consecutive values is Gray code.
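The amplification effect described above can be checked mechanically. The following is a minimal Python sketch; the helper names `superpose` and `res` are illustrative assumptions (they anticipate the superposition and resolution notions introduced later in this section), and strings over the characters 0, 1, M stand in for signals.

```python
from itertools import product

def superpose(strings):
    """Keep each position where all strings agree; mark the rest as M."""
    return "".join(c[0] if len(set(c)) == 1 else "M" for c in zip(*strings))

def res(x):
    """All stable strings obtained by replacing every M by 0 or 1."""
    choices = [("0", "1") if c == "M" else (c,) for c in x]
    return {"".join(p) for p in product(*choices)}

# Standard binary: 11 = 1011 and 12 = 1100 differ in three positions.
binary = superpose([format(11, "04b"), format(12, "04b")])
binary_values = sorted(int(s, 2) for s in res(binary))

# Gray code: x XOR (x >> 1) is the usual closed form of the binary
# reflected Gray code, so 11 and 12 differ in a single position.
gray = superpose([format(x ^ (x >> 1), "04b") for x in (11, 12)])
gray_strings = sorted(res(gray))
```

Here `binary` has three metastable positions and resolves to every value from 8 to 15, whereas `gray` has a single metastable position and resolves only to the two original codewords.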
We use $B$-bit binary reflected Gray code, $\mathrm{rg}_B : [N] \to \mathbb{B}^B$, which is defined recursively. For simplicity (and without loss of generality) we set $N := 2^B$. A 1-bit code is given by $\mathrm{rg}_1(0) = 0$ and $\mathrm{rg}_1(1) = 1$. For $B > 1$, we start with the first bit fixed to 0 and count with $\mathrm{rg}_{B-1}(\cdot)$ (for the first $2^{B-1}$ codewords), then toggle the first bit to 1, and finally "count down" $\mathrm{rg}_{B-1}(\cdot)$ while keeping the first bit fixed, cf. Table 1. Formally, this yields for $x \in [N]$
$$\mathrm{rg}_B(x) := \begin{cases} 0\,\mathrm{rg}_{B-1}(x) & \text{if } x \in [2^{B-1}], \\ 1\,\mathrm{rg}_{B-1}(2^B - 1 - x) & \text{if } x \in [2^B] \setminus [2^{B-1}]. \end{cases}$$
As each $B$-bit string is a codeword, the code is a bijection and the encoding function also defines the decoding function. Denote by $\langle \cdot \rangle : \mathbb{B}^B \to [N]$ the decoding function of a Gray code string, i.e., for $x \in [N]$, $\langle \mathrm{rg}_B(x) \rangle = x$.
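The recursive definition and the decoding function can be sketched in a few lines of Python. The function names `rg` and `decode` are illustrative assumptions, and decoding by table lookup is chosen for clarity rather than efficiency.

```python
def rg(x, B):
    """B-bit binary reflected Gray code of x, following the recursion."""
    if B == 1:
        return str(x)
    half = 1 << (B - 1)
    if x < half:
        return "0" + rg(x, B - 1)             # count up, first bit 0
    return "1" + rg(2 * half - 1 - x, B - 1)  # count down, first bit 1

def decode(g):
    """Inverse of rg, by lookup over all 2^B codewords (small B only)."""
    B = len(g)
    return {rg(x, B): x for x in range(1 << B)}[g]

codewords = [rg(x, 3) for x in range(8)]
```

Consecutive entries of `codewords` differ in exactly one position, which is the property that keeps the uncertainty between two neighboring values confined to a single bit.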
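For two binary reflected Gray code strings $g, h \in \mathbb{B}^B$, we define their maximum and minimum as
$$\left(\max^{\mathrm{rg}}\{g,h\}, \min^{\mathrm{rg}}\{g,h\}\right) := \begin{cases} (g,h) & \text{if } \langle g \rangle \ge \langle h \rangle, \\ (h,g) & \text{if } \langle g \rangle < \langle h \rangle. \end{cases}$$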
Valid Strings
The inputs to the sorting circuit may have some metastable bits, which means that the respective signals behave out-ofspec from the perspective of Boolean logic. Such inputs, referred to as valid strings, are introduced with the help of the following operator. 
The operator $*$ is associative and commutative. Hence, for a set $S = \{x^{(1)}, \ldots, x^{(k)}\}$ of $B$-bit strings, we can use the shorthand $*S := x^{(1)} * \cdots * x^{(k)}$.
We call $*S$ the superposition of the strings in $S$.
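Valid strings have at most one metastable bit. If this bit resolves to either 0 or 1, the resulting string encodes either $x$ or $x+1$ for some $x$, cf. Table 2. Then, the set of valid strings of length $B$ is
$$S^B_{\mathrm{rg}} := \mathrm{rg}_B([N]) \cup \bigcup_{x \in [N-1]} \{\mathrm{rg}_B(x) * \mathrm{rg}_B(x+1)\}.$$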
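To make the set of valid strings concrete, the following Python sketch enumerates it for small $B$. It assumes that the superposition keeps every position on which its operands agree and yields M elsewhere; the names `rg`, `superpose`, and `valid_strings` are illustrative, and `rg` uses the closed form `x ^ (x >> 1)` of the binary reflected Gray code instead of the recursion.

```python
def rg(x, B):
    """Binary reflected Gray code of x as a B-character string."""
    return format(x ^ (x >> 1), f"0{B}b")

def superpose(a, b):
    """Positions where a and b agree are kept; all others become M."""
    return "".join(p if p == q else "M" for p, q in zip(a, b))

def valid_strings(B):
    """Codewords interleaved with superpositions of neighboring codewords."""
    N = 1 << B
    strings = []
    for x in range(N):
        strings.append(rg(x, B))
        if x + 1 < N:
            strings.append(superpose(rg(x, B), rg(x + 1, B)))
    return strings

two_bit = valid_strings(2)
```

For $B = 2$ this yields the seven strings 00, 0M, 01, M1, 11, 1M, 10; in general there are $2N - 1$ valid strings, each with at most one M.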
Resolution and Closure
To extend the specification of $\max^{\mathrm{rg}}$ and $\min^{\mathrm{rg}}$ to valid strings, we make use of the metastable closure [10]. The metastable closure is defined over the possible resolutions of metastable bits.
Thus, $\operatorname{res}(x)$ is the set of all strings obtained by replacing all Ms in $x$ by either 0 or 1: M acts as a "wild card." For any $x$ and $y$, we have that $\operatorname{res}(xy) = \operatorname{res}(x)\operatorname{res}(y)$.
We note two observations for later use. We observe that in general the reverse direction does not hold, i.e., $\operatorname{res}(*S) \not\subseteq S$. For example, consider $S = \{01, 10\}$ and thus $*S = MM$ such that $\operatorname{res}(*S) = \{00, 01, 10, 11\} = \mathbb{B}^2$. Hence, $S \subseteq \operatorname{res}(*S)$ but not $\operatorname{res}(*S) \subseteq S$. In contrast, for $|\operatorname{res}(*S)| \le 2$, we can see that the reverse direction holds.
The metastable closure of an operator on binary inputs extends it to inputs that may contain metastable bits. This is done by considering all resolutions of the inputs, applying the operator, and taking the superposition of the results. The closure is the best one can achieve w.r.t. containing metastability with clocked logic using standard registers [10], i.e., when $f_M(x)_i = M$, no such implementation can guarantee that the $i$th output bit stabilizes in a timely fashion.
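The three-step recipe for the closure (resolve, apply, superpose) translates directly into code. The following Python sketch is an illustration under the assumption that signals are strings over 0, 1, M; the names `res`, `superpose`, and `closure` are not taken from the article.

```python
from itertools import product

def res(x):
    """All stable strings obtained by replacing every M by 0 or 1."""
    choices = [("0", "1") if c == "M" else (c,) for c in x]
    return {"".join(p) for p in product(*choices)}

def superpose(strings):
    """Keep each position where all strings agree; mark the rest as M."""
    return "".join(c[0] if len(set(c)) == 1 else "M" for c in zip(*strings))

def closure(f):
    """Metastable closure of f: resolve inputs, apply f, superpose results."""
    def f_M(*args):
        stable_inputs = product(*(sorted(res(a)) for a in args))
        return superpose([f(*stable) for stable in stable_inputs])
    return f_M

S = {"01", "10"}
star_S = superpose(S)                                # MM
and_M = closure(lambda a, b: str(int(a) & int(b)))   # closure of 1-bit AND
```

With these definitions, `res(star_S)` contains all four 2-bit strings, reproducing the example above, and `and_M("0", "M")` is stable because the stable input 0 already determines the result.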
Output Specification
We want to construct a circuit computing the maximum and minimum of two valid strings, enabling us to build sorting networks for valid strings. First, however, we need to answer the question of what it means to ask for the maximum or minimum of valid strings. To this end, suppose a valid string is $\mathrm{rg}_B(x) * \mathrm{rg}_B(x+1)$ for some $x \in [N-1]$, i.e., the string contains a metastable bit that makes it uncertain whether the represented value is $x$ or $x+1$. If we wait for metastability to resolve, the string will stabilize to either $\mathrm{rg}_B(x)$ or $\mathrm{rg}_B(x+1)$. Accordingly, it makes sense to consider $\mathrm{rg}_B(x) * \mathrm{rg}_B(x+1)$ "in between" $\mathrm{rg}_B(x)$ and $\mathrm{rg}_B(x+1)$, resulting in the following total order on valid strings (cf. Table 2).
Definition 3.9 ($\prec$). We define a total order $\prec$ on valid strings as follows. For $g, h \in \mathbb{B}^B$, $g \prec h \Leftrightarrow \langle g \rangle < \langle h \rangle$. For each $x \in [N-1]$, we define $\mathrm{rg}_B(x) \prec \mathrm{rg}_B(x) * \mathrm{rg}_B(x+1) \prec \mathrm{rg}_B(x+1)$. We extend the resulting relation on $S^B_{\mathrm{rg}} \times S^B_{\mathrm{rg}}$ to a total order by taking the transitive closure. Note that this also defines $\preceq$, via $g \preceq h \Leftrightarrow (g = h \lor g \prec h)$.
We intend to sort with respect to this order. It turns out that implementing a 2-sort circuit w.r.t. this order amounts to implementing the metastable closure of $\max^{\mathrm{rg}}$ and $\min^{\mathrm{rg}}$. In other words, $\max^{\mathrm{rg}}_M$ and $\min^{\mathrm{rg}}_M$ are the max and min operators w.r.t. the total order on valid strings shown in Table 2, e.g.,
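Hence, our task is to implement $\max^{\mathrm{rg}}_M$ and $\min^{\mathrm{rg}}_M$.
Definition 3.11 (2-sort($B$)). For $B \in \mathbb{N}$, a 2-sort($B$) circuit is specified as follows.
Input: $g, h \in S^B_{\mathrm{rg}}$.
Output: $g', h' \in S^B_{\mathrm{rg}}$.
Functionality: $g' = \max^{\mathrm{rg}}_M\{g,h\}$, $h' = \min^{\mathrm{rg}}_M\{g,h\}$.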
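The claim that the closure of max and min coincides with max and min in the total order on valid strings can be checked by brute force for small $B$. The sketch below is a reference computation, not a circuit; all function names are illustrative assumptions, and the closure is evaluated by enumerating resolutions.

```python
from itertools import product

def rg(x, B):
    """Binary reflected Gray code of x (closed form x XOR (x >> 1))."""
    return format(x ^ (x >> 1), f"0{B}b")

def decode(g):
    """Value encoded by the stable Gray code string g (table lookup)."""
    return next(x for x in range(1 << len(g)) if rg(x, len(g)) == g)

def res(x):
    choices = [("0", "1") if c == "M" else (c,) for c in x]
    return {"".join(p) for p in product(*choices)}

def superpose(strings):
    return "".join(c[0] if len(set(c)) == 1 else "M" for c in zip(*strings))

def max_min_M(g, h):
    """Closure of (max^rg, min^rg): sort every resolution, then superpose."""
    maxima, minima = [], []
    for a, b in product(res(g), res(h)):
        hi, lo = (a, b) if decode(a) >= decode(b) else (b, a)
        maxima.append(hi)
        minima.append(lo)
    return superpose(maxima), superpose(minima)

def valid_strings(B):
    """Valid strings listed in increasing order w.r.t. the total order."""
    out = []
    for x in range(1 << B):
        out.append(rg(x, B))
        if x + 1 < (1 << B):
            out.append(superpose([rg(x, B), rg(x + 1, B)]))
    return out

example = max_min_M("0M10", "0110")   # inputs "3 or 4" and 4
```

In the example, the first input is the superposition of the codewords for 3 and 4 and the second is the codeword for 4; the closure returns the stable codeword as maximum and the uncertain string as minimum, as the total order prescribes.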
Computational Model and CMOS Logic
We seek to use standard components and combinational logic only. We use the model of [10], which specifies the behavior of basic gates on metastable inputs via the metastable closure of their behavior on binary inputs, cf. Table 3. We use the standard notational convention that $a + b = \mathrm{OR}_M(a,b)$ and $ab = \mathrm{AND}_M(a,b)$. Note that in this logic, most familiar identities hold: AND and OR are associative, commutative, and distributive, and De Morgan's laws hold. However, naturally the law of the excluded middle becomes void. For instance, in general, $\mathrm{OR}(x, \bar{x}) \neq 1$, as $\mathrm{OR}(M, M) = M$.
We now argue that basic CMOS gates behave according to this logic, justifying the model. For the sake of an intuitive notation, we apply some slightly unusual conventions. In the following, let $R_1$ be a wildcard that can refer to any resistance that is "low", i.e., close to being negligible, e.g., that of a transistor in its stable conducting state (i.e., any PMOS transistor subjected to a low gate voltage or any NMOS transistor subjected to a high gate voltage). Similarly, denote by $R_0$ any resistance that is "high", i.e., large compared to $R_1$, such as the resistance of a transistor in its stable non-conducting state. Thus, with a stable input $b \in \mathbb{B}$ (where we identify 0 with low and 1 with high voltage), an NMOS transistor attains resistance $R_b$, while a PMOS transistor attains resistance $R_{\bar{b}}$. We can extend this to unstable inputs M by making the conservative assumption that $R_M$ is an arbitrary (possibly time-dependent) resistance. With this notation, we can see that parallel and serial composition of transistors implements OR and AND in Kleene logic, respectively.
Lemma 3.12. For $k \in \mathbb{N}$ sufficiently small so that $kR_1 \ll R_0$, let $a_1, \ldots, a_k \in \mathbb{B}_M$ be input signals fed to $k$ NMOS transistors interconnected (i) in parallel or (ii) sequentially. Set $s := \sum_{i=1}^{k} a_i$ and $p := \prod_{i=1}^{k} a_i$, i.e., the OR resp. AND over all inputs. Then the resistance between input and output of the resulting subcircuit is (roughly) (i) $R_s$ resp. (ii) $R_p$.
The same arguments apply to PMOS transistors.
We remark that the factor of $k$ reduction in the gap between $R_1$ and $R_0$ may imply that a gate's output signal needs to be regenerated using a buffer. However, this is the same behavior as for logic that assumes stable signals only, so standard CMOS design techniques account for this.
From the above observations, we can readily infer that standard CMOS gate implementations behave according to Kleene logic in the face of potentially metastable signals, justifying the model from [10].
Theorem 3.14. The CMOS gates depicted in Fig. 1 implement the truth tables given in Table 3 .
Similar reasoning applies to many gates, e.g., NAND and NOR gates. We stress, however, that the property of implementing the closure of the function computed by the gate on stable values is not universal for CMOS logic. For instance, standard transistor-level multiplexer implementations do not handle metastability well, cf. [11] .
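The gate behavior assumed in the computational model can be spelled out and the identities mentioned above checked exhaustively. The following Python sketch assumes the three basic gates AND, OR, and inverter with Kleene semantics; the function names are illustrative, and the code models logic values only, not electrical behavior.

```python
def and_M(a, b):
    """Kleene AND: a stable 0 on either input forces the output to 0."""
    if a == "0" or b == "0":
        return "0"
    return "1" if a == b == "1" else "M"

def or_M(a, b):
    """Kleene OR: a stable 1 on either input forces the output to 1."""
    if a == "1" or b == "1":
        return "1"
    return "0" if a == b == "0" else "M"

def not_M(a):
    """Kleene inverter: M stays M."""
    return {"0": "1", "1": "0", "M": "M"}[a]

V = "01M"
de_morgan = all(not_M(and_M(a, b)) == or_M(not_M(a), not_M(b))
                for a in V for b in V)
associative = all(and_M(and_M(a, b), c) == and_M(a, and_M(b, c)) and
                  or_M(or_M(a, b), c) == or_M(a, or_M(b, c))
                  for a in V for b in V for c in V)
distributive = all(and_M(a, or_M(b, c)) == or_M(and_M(a, b), and_M(a, c))
                   for a in V for b in V for c in V)
excluded_middle = [or_M(x, not_M(x)) for x in V]
```

The last list shows where the law of the excluded middle fails: evaluating the expression gate by gate on a metastable input yields M, although the underlying Boolean function is constantly 1.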
DECOMPOSITION OF THE TASK
In this section, we show that computing $\max^{\mathrm{rg}}_M\{g,h\}$ and $\min^{\mathrm{rg}}_M\{g,h\}$ for valid strings $g, h \in S^B_{\mathrm{rg}}$ can be broken down into composing simple operators.
Comparing Stable Gray Codes via an FSM
In each step of processing inputs $g, h \in \mathbb{B}^B$, the finite state machine is fed the pair of $i$th input bits $g_i h_i$. In the following, we denote by $s^{(i)}(g,h)$ the state of the machine after $i$ steps, where $s^{(0)}(g,h) := 00$ is the starting state. For ease of notation, we will omit the arguments $g$ and $h$ of $s^{(i)}$ whenever they are clear from context. Table 4 shows an example of a run of the finite state machine. Because the parity keeps track of whether the remaining bits are to be compared w.r.t. the standard or "reflected" order, the state machine performs the comparison correctly w.r.t. the meaning of the states indicated in Fig. 2. This lemma gives rise to a sequential implementation of 2-sort($B$) based on the given state machine, for input strings in $\mathbb{B}^B$. Table 5 lists the $i$th output bit as function of $s^{(i-1)}$ and the pair $g_i h_i$. Correctness of this computation follows immediately from Lemma 4.1.
We can express the transition function of the state machine as an operator $\diamond$, which is easily verified to be associative.
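The sequential comparison of stable Gray code strings can be sketched as follows. The two lookup tables are assumptions reconstructed from the state meanings (00: prefixes equal with even parity, 11: equal with odd parity, 01: $g < h$, 10: $g > h$); rows are indexed by the current state and columns by the input pair $g_i h_i$. The names `DIAMOND`, `OUT`, and `two_sort_stable` are illustrative.

```python
COLS = ("00", "01", "11", "10")

def table(rows):
    """Build {state: {input pair: entry}} from rows given in COLS order."""
    return {s: dict(zip(COLS, r.split())) for s, r in zip(COLS, rows)}

# State transitions; states 01 and 10 are absorbing.
DIAMOND = table(["00 01 11 10", "01 01 01 01", "11 10 00 01", "10 10 10 10"])
# Output pair (max bit, min bit) for each state and input pair.
OUT = table(["00 10 11 10", "00 10 11 01", "00 01 11 01", "00 01 11 10"])

def rg(x, B):
    """Binary reflected Gray code of x (closed form x XOR (x >> 1))."""
    return format(x ^ (x >> 1), f"0{B}b")

def two_sort_stable(g, h):
    """Run the FSM over the bit pairs; return (max, min) as Gray strings."""
    state, mx, mn = "00", "", ""
    for pair in (a + b for a, b in zip(g, h)):
        o = OUT[state][pair]
        mx, mn = mx + o[0], mn + o[1]
        state = DIAMOND[state][pair]
    return mx, mn

example = two_sort_stable(rg(11, 4), rg(12, 4))
```

For the inputs encoding 11 and 12, the machine switches to the odd-parity state after the first bit pair and decides the comparison at the second one, returning the codeword of 12 as maximum.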
Our goal in this section is to extend this approach to potentially metastable inputs.
Dealing with Metastable Inputs
Our strategy is to replace all involved operators by their metastable closure: for $i \in [1,B]$ (i) compute $s^{(i)}_M$, (ii) determine $\max^{\mathrm{rg}}_M\{g,h\}_i$ and $\min^{\mathrm{rg}}_M\{g,h\}_i$ according to
While it is tractable to manually verify all $3^6 = 729$ cases (exploiting various symmetries and other properties of the operator), it is tedious and prone to errors. Instead, we verified that both evaluation orders result in the same outcome by a short computer program. Apart from being essential for our construction, this theorem simplifies notation; in the following, we may write
$$(\diamond_M)_{j=1}^{i}\, g_j h_j := g_1 h_1 \diamond_M g_2 h_2 \diamond_M \cdots \diamond_M g_i h_i,$$
where the order of evaluation does not affect the result. We stress that in general the closure of an associative operator need not be associative. A counter-example is given by binary addition modulo 4. For convenience of the reader, Table 8 gives the truth table of $\diamond_M$.
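A check of this kind takes only a few lines. The sketch below is not the authors' program; it is an independent illustration that enumerates all $9^3 = 729$ triples over $\{0,1,M\}^2$, with the transition table of the stable operator reconstructed from the state meanings (an assumption) and the closure computed by brute force over resolutions. The triple used for addition modulo 4 is one instance found by this sketch and may differ from the one given in the full article.

```python
from itertools import product

COLS = ("00", "01", "11", "10")
ROWS = ["00 01 11 10", "01 01 01 01", "11 10 00 01", "10 10 10 10"]
DIAMOND = {s: dict(zip(COLS, r.split())) for s, r in zip(COLS, ROWS)}

def res(x):
    choices = [("0", "1") if c == "M" else (c,) for c in x]
    return {"".join(p) for p in product(*choices)}

def superpose(strings):
    return "".join(c[0] if len(set(c)) == 1 else "M" for c in zip(*strings))

def closure2(op):
    """Metastable closure of a binary operator on 2-bit strings."""
    return lambda x, y: superpose([op(a, b) for a in res(x) for b in res(y)])

diamond_M = closure2(lambda s, x: DIAMOND[s][x])
add4_M = closure2(lambda a, b: format((int(a, 2) + int(b, 2)) % 4, "02b"))

SYMBOLS = ["".join(p) for p in product("01M", repeat=2)]

def violations(op):
    """All triples on which the two evaluation orders disagree."""
    return [(x, y, z) for x, y, z in product(SYMBOLS, repeat=3)
            if op(op(x, y), z) != op(x, op(y, z))]

diamond_violations = violations(diamond_M)
add4_violations = violations(add4_M)
```

For addition modulo 4, the triple (0M, 01, 01) evaluates to MM when bracketed to the left and to 1M when bracketed to the right, so the closure is not associative there, while no violating triple exists for the closure of the transition operator.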
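We need to show that repeated application of this operator to the input pairs $g_j h_j$, $j \in [1,i]$, actually results in $s^{(i)}_M$. This is closely related to the key observation that if in a valid string there is a metastable bit at position $m$, then the remaining $B - m$ following bits are the maximum codeword of a $(B - m)$-bit code.

Truth table of the operator $\diamond$. The first operand is the current state, the second is the next input.

| $\diamond$ | 00 | 01 | 11 | 10 |
| 00 | 00 | 01 | 11 | 10 |
| 01 | 01 | 01 | 01 | 01 |
| 11 | 11 | 10 | 00 | 01 |
| 10 | 10 | 10 | 10 | 10 |

Truth table of the operator out (cf. Table 5). The first operand is the current state, the second is the next input.

| out | 00 | 01 | 11 | 10 |
| 00 | 00 | 10 | 11 | 10 |
| 01 | 00 | 10 | 11 | 01 |
| 11 | 00 | 01 | 11 | 01 |
| 10 | 00 | 01 | 11 | 10 |

Table 8. Truth table of the operator $\diamond_M$. The first operand is the current state, the second are the next input bits.

| $\diamond_M$ | 00 | 0M | 01 | M1 | 11 | 1M | 10 | M0 | MM |
| 00 | 00 | 0M | 01 | M1 | 11 | 1M | 10 | M0 | MM |
| 0M | 0M | 0M | 01 | M1 | M1 | MM | MM | MM | MM |
| 01 | 01 | 01 | 01 | 01 | 01 | 01 | 01 | 01 | 01 |
| M1 | M1 | MM | MM | MM | 0M | 0M | 01 | M1 | MM |
| 11 | 11 | 1M | 10 | M0 | 00 | 0M | 01 | M1 | MM |
| 1M | 1M | 1M | 10 | M0 | M0 | MM | MM | MM | MM |
| 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| M0 | M0 | MM | MM | MM | 1M | 1M | 10 | M0 | MM |
| MM | MM | MM | MM | MM | MM | MM | MM | MM | MM |

Equipped with these tools, we are ready to prove the second statement. Recall that $\mathrm{out} : \mathbb{B}^2 \times \mathbb{B}^2 \to \mathbb{B}^2$ is the operator given in Table 5 computing $\max^{\mathrm{rg}}\{g,h\}_i \min^{\mathrm{rg}}\{g,h\}_i$ out of $s^{(i-1)}$ and $g_i h_i$. For convenience of the reader, we provide the truth table of $\mathrm{out}_M$ in Table 9. We derive the third property.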
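The strategy of this section, replacing both the transition operator and the output operator by their closures, can be simulated directly. The sketch below is a sequential reference model under assumptions: the two stable tables are reconstructed from the state meanings, closures are evaluated by enumerating resolutions, and the function names are illustrative. It does not reflect the parallel circuit structure developed in the next section.

```python
from itertools import product

COLS = ("00", "01", "11", "10")

def table(rows):
    return {s: dict(zip(COLS, r.split())) for s, r in zip(COLS, rows)}

DIAMOND = table(["00 01 11 10", "01 01 01 01", "11 10 00 01", "10 10 10 10"])
OUT = table(["00 10 11 10", "00 10 11 01", "00 01 11 01", "00 01 11 10"])

def res(x):
    choices = [("0", "1") if c == "M" else (c,) for c in x]
    return {"".join(p) for p in product(*choices)}

def superpose(strings):
    return "".join(c[0] if len(set(c)) == 1 else "M" for c in zip(*strings))

def closure2(tbl):
    """Closure of a table-defined operator (state, input pair) -> 2 bits."""
    return lambda s, x: superpose([tbl[a][b] for a in res(s) for b in res(x)])

diamond_M, out_M = closure2(DIAMOND), closure2(OUT)

def two_sort_M(g, h):
    """Sequentially apply out_M and diamond_M to the bit pairs of g and h."""
    state, mx, mn = "00", "", ""
    for gi, hi in zip(g, h):
        o = out_M(state, gi + hi)
        mx, mn = mx + o[0], mn + o[1]
        state = diamond_M(state, gi + hi)
    return mx, mn

fig3 = two_sort_M("101010110", "101M10000")
```

On the input pair used in the example of Fig. 3, the model returns $h$ as the maximum and $g$ as the minimum, with the metastable bit of $h$ passed through to the maximum output unchanged.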
THE PPC FRAMEWORK
In order to derive a small circuit from the results of Section 4, a straightforward approach would be to unroll the FSM. We could design a circuit implementing the transition function $\diamond_M$ and apply it $B$ times to the starting state $s^{(0)}$ and each input $g_i h_i$. However, computing the sequence of states step by step yields a (non-optimal) linear depth of at least $B$. Hence, we make use of the PPC framework by Ladner and Fischer [23]. They describe a generic method that is applicable to any finite state machine translating a sequence of $B$ input symbols to $B$ output symbols, to obtain circuits of size $O(B)$ and depth $O(\log B)$. They observe that each input symbol defines a restricted transition function. Compositions of these functions evaluated on the starting state yield the state of the machine after receiving corresponding inputs. The major advantage of the technique is that compositions of restricted transition functions can be computed in parallel due to associativity, yielding a depth of $O(\log B)$. This matches our needs, as we need to determine $s^{(i)}_M$ for each $i \in [B]$. However, their generic construction involves large constants. Fortunately, we have established that $\diamond_M$ is an associative operator, permitting us to directly apply the circuit templates for associative operators they provide for computing $s^{(i)}_M = (\diamond_M)_{j=1}^{i}\, g_j h_j$ for all $i \in [B]$. Accordingly, we discuss these templates only. During discussion of the basic construction we show a minor improvement on their results.
Before proceeding, the reader may want to take a look at the example given in Fig. 3, which shows how a 2-sort(9) derived from our construction processes an input pair.
The Basic Construction
We revisit the templates for parallel computation of all prefixes, i.e., the part of the framework relevant to our construction. To this end, recall Definition 1.1. In our case,
[23] provides a family of recursive constructions of $\mathrm{PPC}_\oplus$ circuits. They are obtained by combining two different recursive patterns. The first pattern, which optimizes for size of the resulting circuits, is depicted in Fig. 4a. We distinguish between even and odd number of inputs. If $B$ is even, we discard the rightmost gray wire and set $B' := B$; if $B$ is odd, we set $B' := B - 1$ and include the rightmost wire. In the following, denote by $|C|$ the size of a circuit $C$ and by $d(C)$ its depth.
Lemma 5.1. Suppose that $C$ and $P$ are circuits implementing $\oplus$ and $\mathrm{PPC}_\oplus(\lceil B/2 \rceil)$ for some $B \in \mathbb{N}$, respectively. Then applying the recursive pattern given at the left of Fig. 4 yields a $\mathrm{PPC}_\oplus(B)$ circuit. It has depth $2d(C) + d(P)$ and size at most $(B - 1)|C| + |P|$. Moreover, the last output is at depth at most $d(C) + d(P)$ of the circuit.
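The data flow of the size-optimized pattern can be mimicked in software. The following Python sketch is an interpretation of the pattern described in Lemma 5.1 (combine adjacent inputs, recurse on $\lceil B/2 \rceil$ values, then fill in the remaining prefixes); it counts applications of the operator instead of building a circuit, and the function name `ppc` and its return convention are assumptions made for illustration.

```python
def ppc(d, op):
    """Return (all prefixes of d under op, number of op applications)."""
    n = len(d)
    if n == 1:
        return [d[0]], 0
    # First layer: combine adjacent input pairs.
    paired = [op(d[i], d[i + 1]) for i in range(0, n - 1, 2)]
    count = len(paired)
    if n % 2 == 1:
        paired.append(d[-1])          # odd case: pass the last input along
    inner, inner_count = ppc(paired, op)
    count += inner_count
    out = [None] * n
    out[0] = d[0]
    for j, value in enumerate(inner):
        # inner[j] is the prefix ending at index 2j+1 (or at the last index).
        out[min(2 * j + 1, n - 1)] = value
    # Last layer: each missing prefix needs one more application of op.
    for i in range(2, n, 2):
        if out[i] is None:
            out[i] = op(out[i - 1], d[i])
            count += 1
    return out, count

prefixes, ops = ppc(list("abcdefgh"), lambda a, b: a + b)
```

String concatenation serves as a test operator here because it is associative but not commutative, so any ordering mistake would show up in the result. For eight inputs the sketch uses 11 operator applications, in line with the linear size of the pattern.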
The second recursive pattern, shown in Fig. 4c, avoids increasing the depth of the circuit beyond the necessary $d(C)$ for each level of recursion. Assume for now that $B$ is a power of 2. We represent the recursion as a tree $T_b$, where $b := \log B$, given in the center of Fig. 4. It has depth $b$ with all leaves (filled in white) in this depth, and there are two types of non-leaf nodes: right nodes (filled in black) have two children, a left and a right node, whereas left nodes (filled in gray) have a single child, which is a right node. $T_b$ is essentially a Fibonacci tree in disguise. The recursive construction is now defined as follows. A right node applies the pattern given in Fig. 4 to the right. $R_\ell$ is the circuit (recursively) defined by the subtree rooted at the left child and $R_r$ is the circuit (recursively) defined by the subtree rooted at the right child. Furthermore, $B = 2^{b-d-1}$, where $d \in [b]$ is the depth of the node. A left child applies the pattern on the left. $R_c$ is (recursively) defined by the subtree rooted at its child and $B = 2^{b-d}$, where $d \in [b]$ is the depth of the node.
Fig. 3. An example for a computation of the 2-sort(9) circuit arising from our construction for fan-out $f = 3$. The inputs are $g = 101010110$ and $h = 101M10000$; see Table 10 for $s^{(i)}_M(g,h)$ and the output. We labeled each $\diamond_M$ by its output. Buffers and duplicated gates (here the one computing 0M) reduce fan-out, but do not affect the computation. Grey boxes indicate recursive steps of the PPC construction; see also Fig. 7 for a larger PPC circuit using the one here in its "right" top-level recursion. For better readability, wires not taking part in a recursive step are dashed or dotted.
The base case for a single input and output is simply a wire connecting the input to the output, for both patterns. As $b = \log B$ and each recursive step cuts the number of inputs and outputs in half, the base case applies if and only if the node is a leaf. Note that the figure shows the recursive patterns at the root and its left child, where $B = 2^{b-1}$ is always even (i.e., in this recursive pattern, the gray wire with index $B + 1$ is never present); when applying the patterns to nodes further down the tree, $B$ and $B'$ are scaled down by a factor of 2 for every step towards the leaves.
In the following, denote by $\mathrm{PPC}(C, T_b)$ the circuit that results from applying the recursive construction described above to the base circuit $C$ implementing $\oplus$. Moreover, we refer to the $i$th input and output of the subcircuit corresponding to node $v \in T_b$ as $d^v_i$ and $p^v_i$, respectively.
It remains to bound the size of the circuit. Denote by $F_i$, $i \in \mathbb{N}$, the $i$th Fibonacci number, i.e.,
Asymptotically, the subtractive term of $F_{b+5}$ is negligible, as $F_{b+5} \in (1/\sqrt{5} + o(1))((1 + \sqrt{5})/2)^{b+5} \subseteq O(1.62^b)$; however, unless $B$ is large, the difference is substantial. We also get a simple upper bound for arbitrary values of $B$. To this end, we "split" in the recursion such that the left branch is "complete" (i.e., the number of inputs is a power of 2), while applying the same splitting strategy on the right. This is where our construction differs from and improves on [23]. They perform a balanced split and obtain an upper bound of $4B$ on the circuit size.
Corollary 5.5. For $B \in \mathbb{N}$ and circuit $C$ implementing $\oplus$, set $b := \lceil \log B \rceil$. Then a $\mathrm{PPC}_\oplus(B)$ of depth $\lceil \log B \rceil d(C)$ and size smaller than $(5B - 2^b - F_{b+3})|C| \le (4B - F_{b+3})|C|$ exists.
We remark that one can give more precise bounds by making case distinctions regarding the right recursion, which for the sake of brevity we omit here. Instead, we computed the exact numbers for $B \le 70$, see Fig. 5.
The construction derived from iterative application of Lemma 5.1 can be combined with $\mathrm{PPC}(C, T_b)$, achieving the following trade-off; note that if $B = 2^b$ for $b \in \mathbb{N}$, then $F_{\lceil \log B \rceil - k + 3}$ can be replaced by $F_{b-k+5}$.
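Theorem 5.6 (improving on [23]). Suppose $C$ implements $\oplus$. For all $k \in [0, \lceil \log B \rceil]$ and $B \in \mathbb{N}$, there is a $\mathrm{PPC}_\oplus(B)$ circuit of depth $(\lceil \log B \rceil + k)\,d(C)$ and size at most
$$\left(\left(2 + \frac{1}{2^{k-1}}\right) B - F_{\lceil \log B \rceil - k + 3}\right)|C|.$$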
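As a quick sanity check of the trade-off, the two extreme choices of $k$ can be written out. This is plain arithmetic on the stated bound under the assumption that it has the form $((2 + 2^{-(k-1)})B - F_{\lceil \log B \rceil - k + 3})|C|$; no claim beyond the theorem is intended.

```latex
\begin{align*}
k = 0:\quad
  &\text{depth } \lceil \log B \rceil\, d(C),
  &\text{size} &\le \bigl(4B - F_{\lceil \log B \rceil + 3}\bigr)\,|C|, \\
k = \lceil \log B \rceil:\quad
  &\text{depth } 2\lceil \log B \rceil\, d(C),
  &\text{size} &\le \Bigl(2B + \tfrac{2B}{2^{\lceil \log B \rceil}} - F_3\Bigr)|C|
   \le \bigl(2B + 2 - F_3\bigr)\,|C|.
\end{align*}
```

The case $k = 0$ recovers the bound of Corollary 5.5, and the last inequality uses $2^{\lceil \log B \rceil} \ge B$. Each additional unit of depth thus halves the part of the size overhead that exceeds $2B|C|$.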
Constant Fan-Out at Optimal Depth
The optimal depth construction incurs an excessively large fan-out of QðBÞ, as the last output of left recursive calls needs to drive all the copies of C that combine it with each of the corresponding right call's outputs. This entails that, despite its lower depth, it will not result in circuits of smaller physical delay than simply recursively applying the construction from Fig. 4a . Naturally, one can insert buffer trees to ensure a constant fan-out (and thus constantly bounded ratio between delay and depth), but this increases the depth to Qðlog 2 B þ dðCÞlog BÞ. We now modify the recursive construction to ensure a constant fan-out, at the expense of a limited increase in size of the circuit. The result is the first construction that has size OðBÞ, optimal depth, and constant fan-out.
In the following, we denote by $f \ge 3$ the maximum fan-out we are trying to achieve, where we assume that gates or memory cells providing the input to the circuit do not need to drive any other components. For simplicity, we consider $C$ to be a single gate, i.e., a gate driving two $C$ components has exactly fan-out 2.
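For comparison with the buffer-tree alternative mentioned above, the following sketch counts how many buffer levels are needed so that a single source can reach a given number of loads when every node drives at most $f$ others. It is a simplified model under our own assumptions (uniform buffers, no placement or wire effects), and `buffer_tree_depth` is a hypothetical helper, not part of the construction in this article.

```python
def buffer_tree_depth(loads: int, f: int) -> int:
    """Number of buffer levels between one source and `loads` sinks.

    With d buffer levels, a source of fan-out f reaches up to f**(d + 1)
    sinks, so we return the smallest such d.
    """
    if loads < 1 or f < 2:
        raise ValueError("need loads >= 1 and f >= 2")
    depth = 0
    reach = f  # sinks the source can drive directly
    while reach < loads:
        reach *= f
        depth += 1
    return depth


# Added buffer levels for a wire that must drive 2^i gates, with f = 3.
print([buffer_tree_depth(2**i, 3) for i in range(1, 11)])
```

The added depth grows logarithmically in the number of loads, which is why inserting such trees at every level of the recursion costs more than a constant factor in depth.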
We proceed in two steps. First, we insert $2B$ buffers into the circuit, ensuring that the fan-out is bounded by 2 everywhere except at the gate providing the last output of each subcircuit corresponding to a left node. In the second step, we will resolve this by duplicating these gates sufficiently often, recursively propagating the changes down the tree. Neither of these changes will affect the output (i.e., the correctness) of the circuit or its depth, so the main challenges are to show our claim on the fan-out and to bound the size of the final circuit.
Step 1: Almost Bounding Fan-Out by 2
Before proceeding to the construction in detail, we need some structural insight on the circuit. If $v$ is the root, then $R_v = [1, 2^b]$ and $a_v = 0$.
If $v$ is the left child of $p$ with $R_p = [i, i+j]$, then
If v is the right child of right node p with
If v is the right child of left node p, then
Hence, the left-count $a_v$ tells us for every node $v \in T_b$ the number of left recursion steps preceding $v$, whereas $R_v$ gives us information about the range of inputs used at node $v$. We observe that each recursion halves the number of inputs and that the range is only cut in half if $a_v$ does not increase. Combining these observations with structural insights on the recursion patterns in Figs. 4a and 4c, we state the following four properties of $\mathrm{PPC}(C, T_b)$.
if $v$ is a right node, all its inputs are outputs of its children's subcircuits, and (iv) if $v$ is a left node or leaf, only its even inputs are provided by its child (if it has one) and for odd $k \in [1, 2^{b-d}]$, we have that $d^v_k = \bigoplus_{k'=i+(k-1)2^{a_v}}^{i+k2^{a_v}-1} d_{k'}$.

Lemma 5.8 leads to an alternative representation of the circuit $\mathrm{PPC}(C, T_b)$, see Fig. 6, in which we separate gates in the recursive pattern from Fig. 4a that occur before the subcircuit $R_c$. Adding the buffers we need in our construction, this results in the modified patterns given in Fig. 6b. The separated gates appear at the bottom of Fig. 6a: for each leaf $v$ of $T_b$, there is a tree of depth $a_v$ aggregating all of the circuit's inputs from its range. Each non-root node in an aggregation tree provides its output to its parent. In addition, one of the two children of an inner node in the tree must provide its output as an input to one of the subcircuits corresponding to a node of $T_b$, cf. Property (iv) of Lemma 5.8.
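Property (iv) states that an odd-indexed input of a left node or leaf aggregates a block of $2^{a_v}$ consecutive circuit inputs. The following Python sketch evaluates that expression directly. The concrete values of $i$, $a_v$, and $k$ are chosen for illustration only, string concatenation stands in for the associative operator, and `aggregated_input` is a hypothetical helper name.

```python
from functools import reduce
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def aggregated_input(d: Sequence[T], op: Callable[[T, T], T], i: int, a_v: int, k: int) -> T:
    """Combine d[k'] for k' from i + (k-1)*2^a_v to i + k*2^a_v - 1 (inclusive).

    `d` is indexed from 1 to match the text, so d[0] is an unused placeholder.
    """
    lo = i + (k - 1) * 2**a_v
    hi = i + k * 2**a_v - 1
    return reduce(op, (d[kp] for kp in range(lo, hi + 1)))


def concat(x: str, y: str) -> str:
    return x + y


d = [None] + list("abcdefgh")  # circuit inputs d_1, ..., d_8
print(aggregated_input(d, concat, i=1, a_v=1, k=1))  # block d_1..d_2
print(aggregated_input(d, concat, i=1, a_v=1, k=3))  # block d_5..d_6
print(aggregated_input(d, concat, i=1, a_v=2, k=1))  # block d_1..d_4
```

Each such block corresponds to one node of an aggregation tree: a block of size $2^{a_v}$ is the combination of two blocks of size $2^{a_v - 1}$.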
From this representation, we will derive that the following modifications of $\mathrm{PPC}(C, T_b)$ result in a $\mathrm{PPC}_\oplus(2^b)$ circuit $\mathrm{PPC}(C, T_b)'$, for which a fan-out larger than 2 exclusively occurs on the last outputs of subcircuits corresponding to nodes of $T_b$.
1) Add a buffer on each wire connecting a non-root node of any of the aggregation trees to its corresponding subcircuit (see Fig. 6a).
2) For the subcircuit corresponding to left node $\ell$ with range $R_\ell = [i, i+j]$, add for each even $k \le j$ (i.e., each even $k$ but the maximum of $j+1$) a buffer before output $p^\ell_k$ (see bottom of Fig. 6b).
3) For each right node $r$ with range $[i, i+j]$, add a buffer before output $p^r_{(j+1)/2}$ (see top of Fig. 6b).

Lemma 5.9. With the exception of gates providing the last output of subcircuits corresponding to nodes of $T_b$ (blue in Fig. 6b), fan-out of $\mathrm{PPC}(C, T_b)'$ is 2. Buffers or gates driving an output of the circuit drive nothing else.
It remains to count the inserted buffers. We do so by computing a closed-form expression from the linear recurrence that describes the number of nodes of a given type (left, right, leaf) in a given depth as a function of the previous one. The following helper statement will be useful for this, but also later on.
Similar arguments serve later as well. The main reason why we will define the function $a(v)$ in the next section without rounding is to ensure that we again obtain linear recurrences, which can be solved using standard techniques from linear algebra. As a downside, this results in slightly overestimating the size of circuits, as we may ask for more copies of gates from children than are actually needed.
Step 2: Bounding Fan-Out by f
Fig. 5. Comparison of the balanced recursion from [23] and ours. The curves for unbounded fan-out are the exact sizes obtained, whereas "upper bound" refers to the bound from Corollary 5.5; the fan-out 3 curves show that the unbalanced strategy performs better also for the construction from Theorem 5.16 (for $f = 3$ and $k = 0$) we derive next.

In the second step, we need to resolve the issue of high fan-out of the last output of each recursively used subcircuit in $\mathrm{PPC}(C, T_b)'$. Our approach is straightforward. Starting at the root of $T_b$ and progressing downwards, we label each node $v$ with a value $a(v)$ that specifies a sufficient number of additional copies of the last output of the subcircuit represented by $v$ to avoid fan-out larger than $f$. At right nodes, this is achieved by duplicating the gate computing this output sufficiently often, marked blue in Fig. 6b (top). For left nodes, we simply require the same number of duplicates to be provided by the subcircuit represented by their child (i.e., we duplicate the blue wire in the bottom recursive pattern shown in Fig. 6b). Finally, for leaves, we will require a sufficient number of duplicates of the root of their aggregation tree; this, in turn, may require duplicating their descendants in the aggregation tree.
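The counting idea behind gate duplication can be sketched in a few lines of Python. This is not the definition of $a(v)$, which avoids rounding and accounts for the recursion; it only shows, under our own simplified model, how many copies of a gate suffice for a given number of loads and one way to distribute the loads. The names `wire_loads` and the load labels are hypothetical.

```python
from typing import List


def wire_loads(loads: List[str], f: int) -> List[List[str]]:
    """Split `loads` among ceil(len(loads) / f) copies of a gate.

    Loads are assigned round-robin, so every copy drives at most f loads.
    """
    if f < 1:
        raise ValueError("fan-out bound must be positive")
    copies = -(-len(loads) // f)  # ceiling division
    return [loads[c::copies] for c in range(copies)]


groups = wire_loads([f"C{n}" for n in range(7)], f=3)
for copy_index, group in enumerate(groups):
    print(copy_index, group)
# One original gate plus len(groups) - 1 additional copies.
assert all(len(group) <= 3 for group in groups)
```

Every additional copy is itself a load on the gates feeding it, which is why the changes have to be propagated recursively down the tree.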
We define $a(v)$ and then utilize it to describe our fan-out $f$ circuit. Afterwards, we will analyze the increase in size of the circuit compared to $\mathrm{PPC}(C, T_b)'$.

$\lfloor a(v) \rfloor$ additional copies of the root of the aggregation tree, and for each right node $v \in T_b$, we add $\lfloor a(v) \rfloor$ gates that compute (copies of) the last output of their corresponding subcircuit of $\mathrm{PPC}(C, T_b)'$. Then we can wire the circuit such that all gates that are not in aggregation trees have fan-out at most $f$, and each output of the circuit is driven by a gate or buffer driving only this output.
It remains to modify the aggregation trees so that sufficiently many copies of the roots' output values are available.
Lemma 5.14. Consider an aggregation tree corresponding to leaf $v \in T_b$ and fix $f \ge 3$. We can modify it such that the fan-out of all its non-root nodes becomes at most $f$, there are $\lfloor a(v) \rfloor$ additional gates computing the same output as the root, and at most $(f a(v))/(f-2) + (2^{a_v - 1})/(f-1)$ gates are added.
Finally, we need to count the total number of gates we add when implementing these modifications to the circuit. 
As an example of the overall resulting construction, we show $\mathrm{PPC}^{(3)}(C, T_4)$ in Fig. 7. We summarize our findings in the following theorem.
We refrain from analyzing the size of the construction for values of $B$ that are not powers of 2. However, in Fig. 8 we plot the exact bounds (without buffers) for $k = 0$ and selected values of $f$ against $B$.
SIMULATION
In addition to the formal statements from the previous sections, we verify the correctness of our circuits by VHDL simulation. To this end, we first need to specify implementations of the subcircuits computing $\diamond_M$ and $\mathrm{out}_M$.
Gate-Level Implementation of Operators
From Tables 6a and 6b, for $s, b \in \mathbb{B}^2$ we can extract the Boolean formulas $(s \diamond b)_1 = s_1 \dots$

Fig. 6. On the left, we see the recursion tree, with the aggregation trees separated and shown at the bottom. Inputs are depicted as black triangles. On the right, the application of the recursive patterns at the children of the root is shown. Parts marked blue will be duplicated in the second step of the construction that achieves constant fan-out; this will also require duplicating some gates in the aggregation trees.
In general, realizing a Boolean formula $f$ by replacing negation, multiplication, and addition by inverters, AND, and OR gates, respectively, does not result in a circuit implementing $f_M$.¹ However, we can easily verify that the above formulas are disjunctions of all prime implicants of their respective functions. As shown in [10] (see also [16]),² in this special case the resulting circuits do implement the closure, provided the gates behave as in Table 3, which the implementations given in Fig. 1 do by Theorem 3.14. Using distributive laws (recall that these also hold in Kleene logic), the above formulas can be rewritten as
We see that, in fact, a single circuit with suitably wired (and possibly negated) inputs can implement all four operations. Since for $\mathrm{sel}_1 = \mathrm{sel}_2$ the circuit implements a multiplexer with select bit $\mathrm{sel}_1$, we refer to it as an extended multiplexer, or XMUX for short. Its functionality is specified by $\mathrm{XMUX}(\mathrm{sel}_1, \mathrm{sel}_2, x, y) := y(x + \mathrm{sel}_2) + x\,\mathrm{sel}_1$. Fig. 9 shows the resulting circuit, and Table 11 lists how to map inputs to compute $\diamond_M$ and $\mathrm{out}_M$.
We note that this circuit is not a particularly efficient XMUX implementation; a transistor-level implementation would be much smaller. However, our goal here is to verify correctness and give some initial indication of the size of the resulting circuits; a fully optimized ASIC circuit is beyond the scope of this article. In [4], the size of the implementation is slightly reduced by moving negations. Due to space limitations, we refrain from detailing this modification here, but note that Fig. 12 and Table 12 take it into account.
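The difference between a gate-level translation of an arbitrary formula and one built from all prime implicants can be checked by brute force. The Python sketch below does this for a plain two-input multiplexer rather than for the operators of this article, whose formulas are not reproduced here. It assumes the usual Kleene semantics for AND, OR, and inverters on {0, 1, M} (our stand-in for Table 3), and it computes the metastable closure by resolving every M to both 0 and 1. All function names are ours.

```python
from itertools import product

M = "M"  # metastable value; 0 and 1 are the stable values


def k_not(a):
    return M if a == M else 1 - a


def k_and(a, b):
    if a == 0 or b == 0:
        return 0
    return 1 if (a == 1 and b == 1) else M


def k_or(a, b):
    if a == 1 or b == 1:
        return 1
    return 0 if (a == 0 and b == 0) else M


def mux(a, b, s):
    """Gate-level multiplexer from two prime implicants: a*not(s) + b*s."""
    return k_or(k_and(a, k_not(s)), k_and(b, s))


def cmux(a, b, s):
    """Same function built from all three prime implicants (adds a*b)."""
    return k_or(mux(a, b, s), k_and(a, b))


def closure(f, *xs):
    """Metastable closure of f: stable output if all resolutions agree, else M."""
    choices = [(0, 1) if x == M else (x,) for x in xs]
    outputs = {f(*resolved) for resolved in product(*choices)}
    return outputs.pop() if len(outputs) == 1 else M


# The three-implicant circuit matches the closure on all 27 inputs ...
assert all(cmux(*xs) == closure(mux, *xs) for xs in product((0, 1, M), repeat=3))
# ... while the two-implicant circuit fails when both data inputs are 1.
print(mux(1, 1, M), cmux(1, 1, M))
```

With both data inputs equal to 1 and a metastable select bit, the two-implicant circuit outputs M although the result does not depend on the select bit; the additional implicant masks this.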
Putting it All Together
We now have all the pieces in place to assemble a containing 2-sort$(B)$ circuit. By Theorem 4.3, $\diamond_M$ is associative. Thus, from a given implementation of $\diamond_M$ (e.g., two copies of the circuit from Fig. 9 with appropriate wiring and negation, cf. Table 11) we can construct $\mathrm{PPC}_{\diamond_M}(B-1)$ circuits of small depth and size, as shown in Section 5. We can combine such a circuit with an $\mathrm{out}_M$ implementation (again, two XMUXes with appropriate wiring and negation will do) as shown in Fig. 10 to obtain our 2-sort$(B)$ circuit.
Simulation Setup
We implemented the design given in Fig. 10 at register-transfer level using the $\mathrm{PPC}_{\diamond_M}(B-1)$ circuit given by Theorem 5.6 for $k = 0$.³ Quartus by Altera is used for design entry, which in our case mainly consists of checking correct implementation. After design entry we use ModelSim by Altera for behavioral simulation. Note that we must not simulate the preprocessed Quartus output, because processing may compromise metastability-containing behavior. Instead, we simulate pure VHDL. Metastable signals are simulated using VHDL signal X, because its behavior matches the worst-case behavior assumed for M.
The correctness of this construction follows from Theorems 4.7 and 4.8, where we can plug in any $\mathrm{PPC}_{\diamond_M}(B-1)$ circuit, cf. Section 5. For the circuits derived by relying on the XMUX circuit from Fig. 9, we independently confirmed this via simulation.

1. For instance, $(s \diamond b)_1 = s_1 b_1 + s_2 b_1$ as Boolean formula, but the two expressions differ when evaluated on $s_1 = s_2 = 1$ and $b_1 = M$. The circuits resulting from the different formulas are implementations of a multiplexer (with select bit $b_1$) and its closure, respectively.
2. Alternatively, one can manually verify that these formulas evaluate to the truth tables given in Tables 8 and 9.
3. For $k > 0$, fan-out becomes an issue, requiring the more involved constructions provided by Theorem 5.16. However, the resulting numbers would be inaccurate, and a detailed comparison based on optimized ASIC implementations is beyond the scope of this work.
Results
For the implementation of $\mathrm{PPC}_{\diamond_M}(B-1)$ we used the circuits from Theorem 5.6, i.e., we did not make use of the extension to constant fan-out. Fig. 11 shows how a non-containing implementation can fail. We exhaustively checked the design from Fig. 10 for $B$ up to 12 (and all feasible $k$). Simulation shows that the design works correctly for several levels of recursion, e.g., when regarding $B = 1$ and $B = 2$ as simple base cases, $B = 12$ implies 3 levels of recursion for both patterns. We refrained from simulating the constant fan-out construction, because it simply replicates intermediate results without changing functionality.
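An exhaustive check of this kind needs the set of admissible inputs and a reference output for each input pair. The following Python sketch shows one way to obtain both by brute force. It is not the testbench used for the simulations reported here, and it rests on assumptions we state explicitly: inputs are binary reflected Gray code words, possibly superposed with the next codeword (so at most one bit is M), and the expected output is the metastable closure of the pair (max, min). All names are ours.

```python
from itertools import product
from typing import Iterable, List, Tuple

M = "M"


def gray(x: int, B: int) -> str:
    """B-bit binary reflected Gray code of x, most significant bit first."""
    return format(x ^ (x >> 1), f"0{B}b")


def ungray(g: str) -> int:
    """Decode a stable Gray code string back to its integer value."""
    x = 0
    for bit in g:
        x = (x << 1) | ((x & 1) ^ int(bit))
    return x


def superpose(strings: Iterable[str]) -> str:
    """Keep each bit on which all strings agree; mark the others as M."""
    return "".join(col[0] if len(set(col)) == 1 else M for col in zip(*strings))


def resolutions(g: str) -> List[str]:
    """All stable strings obtained by replacing every M with 0 or 1."""
    return ["".join(r) for r in product(*[("0", "1") if c == M else (c,) for c in g])]


def valid_strings(B: int) -> List[str]:
    """Codewords plus superpositions of consecutive codewords."""
    words = [gray(x, B) for x in range(2**B)]
    return words + [superpose([words[x], words[x + 1]]) for x in range(2**B - 1)]


def two_sort_reference(g: str, h: str) -> Tuple[str, str]:
    """Closure of (max, min) over all resolutions of the two inputs."""
    B = len(g)
    pairs = [(ungray(a), ungray(b)) for a in resolutions(g) for b in resolutions(h)]
    return (superpose([gray(max(p), B) for p in pairs]),
            superpose([gray(min(p), B) for p in pairs]))


inputs = valid_strings(3)
for g, h in product(inputs, repeat=2):
    mx, mn = two_sort_reference(g, h)
    assert mx in inputs and mn in inputs  # outputs are again valid strings
print(len(inputs), two_sort_reference("0M", "01"))
```

A circuit under test would be compared against `two_sort_reference` on every pair of valid strings; with $2^{B+1} - 1$ valid strings per input, this remains feasible for the values of $B$ considered here.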
Comparison to Baseline
After behavioral simulation, we continue with a comparison of our design and a standard sorting approach Bin-comp$(B)$. As mentioned earlier, the 2-sort$(B)$ implementation given in Fig. 10 is slightly optimized by pulling out a negation from the operators in every recursive step [4]. After design entry as described above, we use Encounter RTL Compiler for synthesis and Encounter for place and route. Both tools are part of the Cadence tool set and in both steps we use NanGate 45 nm Open Cell Library as a standard cell library.
Since metastability-containing circuits may include additional gates that are not required in traditional Boolean logic, Boolean optimization may compromise metastability-containing properties [3]. Accordingly, we were forced to disable optimization during synthesis of the circuits.
Binary Benchmark Bin-comp. In short, Bin-comp consists of a simple VHDL statement comparing two binary encoded inputs and outputting the maximum and the minimum, accordingly. It follows the same design process as 2-sort, but then undergoes optimization using a more powerful set of basic gates. For example, the standard cell library provides prebuilt multiplexers. These multiplexers are used by Bin-comp, but not by 2-sort, as they are not metastability-containing. We stress that these more powerful gates provide optimized implementations of multiple Boolean functions, yet each of them is still counted as a single gate. Thus, comparing our design to the binary design in terms of gate count, area, and delay disfavors our solution. Moreover, we noticed that the optimization routine switches to employing more powerful gates when going from $B = 8$ to $B = 16$ (cf. Fig. 12), resulting in a decrease of the delay of the Bin-comp implementation. Nonetheless, our design performs comparably to the non-containing binary design in terms of delay, cf. Fig. 12 and Table 12. This is quite notable, as further optimization is possible by optimizing our design on the transistor level, with significant expected gains. The same applies to gate count and area, where a notable gap remains. Recall, however, that the Bin-comp design hides complexity by using more advanced gates and does not contain metastability.
We emphasize that we refrained from optimizing the design by making use of all available gates or devising transistor-level implementations, as such an approach is tied to the utilized library or requires design of standard cells.
CONCLUSIONS
In this work, we demonstrated that efficient metastability-containing sorting circuits are possible. Our results indicate that optimized implementations can achieve the same delay as non-containing solutions, without a dramatic increase in circuit size. This is of high interest to an intended application motivating us to design MC sorting circuits: fault-tolerant high-frequency clock synchronization. Sorting is a key step in envisioned implementations (cf. [10], [15]) of the Lynch-Welch algorithm [30] with improved precision of synchronization. The efficient MC sorting networks presented in this article make it possible to eliminate synchronizer delay completely; this allows increasing the rate at which clock corrections are applied, significantly reducing the negative impact of phase drift of local clock sources on the precision of the algorithm (cf. [18]). This goal will require devising optimized ASIC implementations of our circuits. The novel PPC circuits we devised in Section 5 are an important contribution towards this end. Note that it is crucial to take into account both depth and fan-out for devising low-delay circuits. Hence, follow-up work needs to compare the existing and our novel design based on suitable metrics that take both into account to reliably predict the achieved trade-offs between delay, area, and energy consumption of circuits. Note that this is of relevance beyond the specific application of MC sorting: PPC circuits lie at the heart of adder designs, implying that even a minor improvement can have significant impact on the overall performance of computing devices!

MC Control Loops. More generally speaking, MC circuits like those presented here are of interest in mixed-signal control loops whose performance depends on very short response times. When analog control is not desirable, traditional solutions incur synchronizer delay before being able to react to any input change. Using MC logic saves the time for synchronization, while metastability of the output corresponds to the initial uncertainty of the measurement; thus, the same quality of the computational result can be achieved in shorter time. Note that our circuits are purely combinational, so they can be used in both clocked and asynchronous control logic.
Obvious examples of such control loops are clock synchronization circuits, but MC has been shown to be useful for adaptive voltage control [13] and fast routing with an acceptably low probability of data corruption [29] as well. This type of application suggests exploring whether efficient circuits exist for a wider range of arithmetic operations, such as addition or (possibly approximate) multiplication.
Redundant Encoding and Addition. On the theoretical side, our results are to be contrasted with the exponential gap between the size of non-containing and MC circuits shown in [17]. This work raised the question for which classes of functions small MC circuits exist. Given that Ladner and Fischer proved that the PPC task can be solved efficiently for any constant-sized state machine [23], it was natural to ask whether this result can be extended to MC computations. In follow-up work, we show that indeed this holds true for any constant-sized FSM [5]. However, when applying this result to addition, unlike for sorting (where the underlying operations are max and min) uncertainty of inputs adds up. This means that Gray code can support meaningful computations only if the total uncertainty of all addends is at most 1. Accordingly, in [5] we also consider redundant encodings, showing that using $k$ (roughly) redundant bits, an uncertainty of $\lfloor (k+1)/2 \rfloor$ can be tolerated without loss of precision. Combined with the above result on transducers, this yields a meaningful notion of MC addition that allows for efficient circuits. As, essentially, the redundant bits are used as a unary code, it should be straightforward to apply the techniques from this article to obtain efficient sorting circuits with the encoding from [5]. We remark that the encoding from [5] turns out to be identical to that of the output of suitable time-to-digital converters [12], so relaxing their output constraints to achieve better average-case performance would provide valid input for sorting circuits that accept inputs encoded in this manner.
We believe that these results suggest applicability of our techniques to a wide range of mixed-signal control loops and call for future work further exploring to what extent basic arithmetic can be realized by efficient MC circuits.
