When setup/hold times of bistable elements are violated, they may become metastable, i.e., enter a transient state that is neither digital 0 nor 1 [1]. In general, metastability cannot be avoided, a problem that manifests whenever taking discrete measurements of analog values. Metastability of the output then reflects uncertainty as to whether a measurement should be rounded up or down to the next possible measurement outcome.
I. INTRODUCTION
Metastability is one of the basic obstacles when crossing clock domains, potentially resulting in soft errors with critical consequences [2]. As it has been shown that there is no deterministic way of avoiding metastability [1], synchronizers [3] are employed to reduce the error probability to tolerable levels. Besides energy and chip area, this approach costs time: the more time is allocated for metastability resolution, the smaller the probability of a (possibly devastating) metastability-induced fault.
Recently, a different approach has been proposed, coined metastability-containing (MC) circuits [4]. The idea is to accept (a limited amount of) metastability in the input to a digital circuit and guarantee limited metastability of its output, such that the result is still useful. The authors of [5], [6] apply this approach to a fundamental primitive: sorting. However, the state-of-the-art circuits [5] are larger than non-containing solutions by a factor of Θ(log B), where B is the bit width of the inputs. Accordingly, the authors pose the following question:
"What is the optimum cost of the 2-sort primitive?" We argue that answering this question is critical, as the performance penalty imposed by current MC sorting primitives is not outweighed by the avoidance of synchronizers.
Our Contribution: We answer the above question by providing a B-bit MC 2-sort circuit of depth O(log B) and O(B) gates. Trivially, any such building block with gates of constant fan-in must have this asymptotic depth and gate count, and it improves by a factor of Θ(log B) on the gate complexity of [5]. Furthermore, we provide optimized building blocks that significantly improve the leading constants of these complexity bounds. See Figure 1, which compares our solution to [5], for our improvements over prior work; specifically, for 16-bit inputs, area and delay decrease by up to 71.58% and 48.46%, respectively.
Plugging our circuit into (optimal depth or size) sorting networks [7], [8], [9], we obtain efficient combinational metastability-containing sorting circuits, cf. Table VIII. In general, plugging our 2-sort circuit into an n-channel sorting network of depth O(log n) with O(n log n) 2-sort elements [10], we obtain an asymptotically optimal MC sorting network of depth O(log B log n) and O(Bn log n) gates.
Further Related Work: Ladner and Fischer [11] studied the problem of computing all the prefixes of applications of an associative operator on an input string of length n. They designed and analyzed a recursive construction which computes all these prefixes in parallel. The resulting parallel prefix computation (PPC) circuit has depth O(log n) and gate count O(n) (assuming that the implementation of the associative operator has constant size and constant depth). We make use of their construction as part of ours.
II. MODEL AND PROBLEM
In this section, we discuss how to model metastability in a worst-case fashion and formally specify the input/output behavior of our circuits.
We use the following basic notation. For $N \in \mathbb{N}$, we set $[N] := \{0, \ldots, N-1\}$. For a binary B-bit string g, denote by $g_i$ its i-th bit, i.e., $g = g_1 g_2 \ldots g_B$. We use the shorthand $g_{i,j} := g_i \ldots g_j$. Let par(g) denote the parity of g, i.e., $\mathrm{par}(g) = \sum_{i=1}^{B} g_i \bmod 2$.
Reflected Binary Gray Code: Due to possible metastability of inputs, we use Gray code. Denote by $\langle \cdot \rangle$ the decoding function of a Gray code string, i.e., for $x \in [N]$ with $N = 2^B$, $\langle \mathrm{rg}_B(x) \rangle = x$. As each B-bit string is a codeword, the code is a bijection and the decoding function also defines the encoding function $\mathrm{rg}_B : [N] \to \{0,1\}^B$. We define B-bit binary reflected Gray code recursively, where a 1-bit code is given by $\mathrm{rg}_1(0) = 0$ and $\mathrm{rg}_1(1) = 1$. For B > 1, we start with the first bit fixed to 0 and count with $\mathrm{rg}_{B-1}(\cdot)$ (for the values $0, \ldots, 2^{B-1}-1$), then toggle the first bit to 1, and finally "count down" $\mathrm{rg}_{B-1}(\cdot)$ while keeping the first bit fixed, cf. Table I. Formally, this yields
$$\mathrm{rg}_B(x) := \begin{cases} 0\,\mathrm{rg}_{B-1}(x) & \text{if } x \in [2^{B-1}], \\ 1\,\mathrm{rg}_{B-1}(2^B - 1 - x) & \text{if } x \in [2^B] \setminus [2^{B-1}]. \end{cases}$$
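The recursive definition can be exercised with a short Python sketch. The function names `rg`, `decode`, and `par` are chosen here for illustration only, and decoding via running parities is a standard property of reflected Gray code rather than something stated in the text.

```python
def rg(x: int, B: int) -> str:
    """B-bit binary reflected Gray code of x, following the recursion."""
    if not 0 <= x < 2 ** B:
        raise ValueError("x must lie in [2^B]")
    if B == 1:
        return str(x)
    if x < 2 ** (B - 1):
        # First half: leading 0, count up with the (B-1)-bit code.
        return "0" + rg(x, B - 1)
    # Second half: leading 1, count down with the (B-1)-bit code.
    return "1" + rg(2 ** B - 1 - x, B - 1)


def decode(g: str) -> int:
    """Inverse of rg: binary bit j is the parity of Gray bits 1..j."""
    value, bit = 0, 0
    for c in g:
        bit ^= int(c)
        value = (value << 1) | bit
    return value


def par(g: str) -> int:
    """Parity of a bit string."""
    return g.count("1") % 2


if __name__ == "__main__":
    for x in range(8):
        print(x, rg(x, 3), decode(rg(x, 3)), par(rg(x, 3)))
```

Consecutive codewords differ in exactly one bit, which is why a measurement taken between x and x + 1 can be represented with a single uncertain bit.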
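We define the maximum and minimum of two binary reflected Gray code strings, $\max^{\mathrm{rg}}$ and $\min^{\mathrm{rg}}$ respectively, in the usual way, i.e., by comparing the encoded values. For two binary reflected Gray code strings $g, h \in \{0,1\}^B$, $\max^{\mathrm{rg}}$ and $\min^{\mathrm{rg}}$ are defined as
$$\left(\max^{\mathrm{rg}}\{g,h\}, \min^{\mathrm{rg}}\{g,h\}\right) := \begin{cases} (g, h) & \text{if } \langle g \rangle \ge \langle h \rangle, \\ (h, g) & \text{if } \langle g \rangle < \langle h \rangle. \end{cases}$$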
Valid Strings: In [6] , the authors represent metastable "bits" by M. The inputs to the sorting circuit may have some metastable bits, which means that the respective signals behave out-of-spec from the perspective of Boolean logic. Such inputs, referred to as valid strings, are introduced with the help of the following operator.
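Definition 2.1 (The * operator [6]): For $B \in \mathbb{N}$, define the operator $* : \{0,1,\mathrm{M}\}^B \times \{0,1,\mathrm{M}\}^B \to \{0,1,\mathrm{M}\}^B$ bitwise: for all $i \in \{1, \ldots, B\}$, $(x * y)_i := x_i$ if $x_i = y_i$, and $(x * y)_i := \mathrm{M}$ otherwise.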
Observation 2.2: The operator * is associative and commutative. Hence, for a set $S = \{x^{(1)}, \ldots, x^{(k)}\}$ of B-bit strings, we can use the shorthand $*S := *_{x \in S}\, x := x^{(1)} * x^{(2)} * \ldots * x^{(k)}$. We call $*S$ the superposition of the strings in S.
Valid strings have at most one metastable bit. If this bit resolves to either 0 or 1, the resulting string encodes either x or x + 1 for some x, cf. 
As pointed out in [5], inputs that are valid strings may, e.g., arise from using suitable time-to-digital converters for measuring time differences [12].
Resolution and Closure: To extend the specification of $\max^{\mathrm{rg}}$ and $\min^{\mathrm{rg}}$ to valid strings, we make use of the metastable closure [4], which in turn makes use of the resolution.
Thus, res(x) is the set of all strings obtained by replacing all Ms in x by either 0 or 1: M acts as a "wild card." The metastable closure of an operator on binary inputs extends it to inputs that may contain metastable bits. This is done by considering all resolutions of the inputs, applying the operator, and taking the superposition of the results.
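The resolution and the metastable closure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: strings over {0, 1, M} are modeled as Python `str`, the superposition is assumed to keep a bit where all operands agree and to yield M elsewhere, and the names `superpose`, `res`, `closure`, and `xor2` are hypothetical.

```python
from itertools import product
from typing import Callable, Iterable, List


def superpose(strings: Iterable[str]) -> str:
    """Superposition: keep positions where all strings agree, else 'M'."""
    columns = zip(*list(strings))
    return "".join(c[0] if len(set(c)) == 1 else "M" for c in columns)


def res(x: str) -> List[str]:
    """All resolutions of x: every 'M' is replaced by 0 or 1."""
    options = [("0", "1") if c == "M" else (c,) for c in x]
    return ["".join(p) for p in product(*options)]


def closure(f: Callable[[str], str]) -> Callable[[str], str]:
    """Metastable closure: apply f to all resolutions, superpose results."""
    def f_M(x: str) -> str:
        return superpose(f(r) for r in res(x))
    return f_M


def xor2(x: str) -> str:
    """Example operator on 2-bit inputs: XOR of the two bits."""
    return str(int(x[0]) ^ int(x[1]))


if __name__ == "__main__":
    xor_M = closure(xor2)
    print(res("0M1"))         # the two resolutions of 0M1
    print(superpose(res("0M1")))
    print(xor_M("0M"), xor_M("11"))
```

Superposing the resolutions of a string returns the string itself, which is the sense in which M acts as a wild card.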
Output Specification: We want to construct a circuit that outputs the maximum and minimum of two valid strings, which will enable us to build sorting networks for valid strings. First, however, we need to answer the question of what it means to ask for the maximum or minimum of valid strings. To this end, suppose a valid string is $\mathrm{rg}_B(x) * \mathrm{rg}_B(x+1)$ for some $x \in [N-1]$, i.e., the string contains a metastable bit that makes it uncertain whether the represented value is x or x + 1. This means that the measurement the string represents was taken of a value somewhere between x and x + 1. Moreover, if we wait for metastability to resolve, the string will stabilize to either $\mathrm{rg}_B(x)$ or $\mathrm{rg}_B(x+1)$. Accordingly, it makes sense to consider $\mathrm{rg}_B(x) * \mathrm{rg}_B(x+1)$ "in between" $\mathrm{rg}_B(x)$ and $\mathrm{rg}_B(x+1)$, resulting in the total order on valid strings given by Table II. The above intuition can be formalized by extending $\max^{\mathrm{rg}}$ and $\min^{\mathrm{rg}}$ to valid strings using the metastable closure.
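Definition 2.6 ([5], [6]): For $B \in \mathbb{N}$, a 2-sort(B) circuit is specified as follows.
Input: valid strings $g, h \in S^B_{\mathrm{rg}}$,
Output: $g', h' \in S^B_{\mathrm{rg}}$,
Functionality: $g' = \max^{\mathrm{rg}}_{\mathrm{M}}\{g, h\}$, $h' = \min^{\mathrm{rg}}_{\mathrm{M}}\{g, h\}$.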
As shown in [5], this definition indeed coincides with the one given in [6], and for valid strings g and h, $\max^{\mathrm{rg}}_{\mathrm{M}}\{g, h\}$ and $\min^{\mathrm{rg}}_{\mathrm{M}}\{g, h\}$ are valid strings, too. More specifically, $\max^{\mathrm{rg}}_{\mathrm{M}}$ and $\min^{\mathrm{rg}}_{\mathrm{M}}$ are the max and min operators w.r.t. the total order on valid strings shown in Table II.
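The claim that the closure of max and min agrees with the total order on valid strings can be checked by brute force for small B. The sketch below assumes the closure is formed by resolving both inputs independently and superposing the results; all function names are hypothetical, and `rg` uses the well-known closed form `x XOR (x >> 1)` for reflected Gray code instead of the recursion.

```python
from itertools import product


def rg(x, B):
    """B-bit reflected Gray code of x (closed form x XOR (x >> 1))."""
    return format(x ^ (x >> 1), "0{}b".format(B))


def decode(g):
    """Value encoded by a stable Gray code string."""
    value, bit = 0, 0
    for c in g:
        bit ^= int(c)
        value = (value << 1) | bit
    return value


def res(x):
    """All resolutions of a string over {0, 1, M}."""
    options = [("0", "1") if c == "M" else (c,) for c in x]
    return ["".join(p) for p in product(*options)]


def superpose(strings):
    """Keep positions where all strings agree, else 'M'."""
    return "".join(c[0] if len(set(c)) == 1 else "M"
                   for c in zip(*list(strings)))


def valid_strings(B):
    """Valid strings in increasing order: rg(0), rg(0)*rg(1), rg(1), ..."""
    out = []
    for x in range(2 ** B):
        out.append(rg(x, B))
        if x + 1 < 2 ** B:
            out.append(superpose([rg(x, B), rg(x + 1, B)]))
    return out


def sort2_M(g, h):
    """Closure of (max, min): sort every pair of resolutions, superpose."""
    pairs = [(a, b) if decode(a) >= decode(b) else (b, a)
             for a, b in product(res(g), res(h))]
    return superpose(p[0] for p in pairs), superpose(p[1] for p in pairs)


if __name__ == "__main__":
    V = valid_strings(2)
    print(V)
    print(sort2_M("0M", "M1"))
```

For B = 2, the pair (0M, M1) sorts to (M1, 0M): the string lying between 1 and 2 is the maximum, the one between 0 and 1 the minimum.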
We seek to use standard components and combinational logic only. We use the model of [4], which specifies the behavior of basic gates on metastable inputs via the metastable closure of their behavior on binary inputs. For standard implementations of AND and OR gates, this assumption is valid: if M represents an arbitrary, possibly time-dependent voltage between logical 0 and 1, an AND gate will still output logical 0 if the respective other input is logical 0. Similarly, an OR gate with one input being logical 1 suppresses metastability at the other input, cf. Table III. As pointed out in [5], any additional reduction of metastability in the output necessitates the use of non-combinational masking components (e.g., masking registers), analog components, and/or synchronizers, all of which are outside of our computational model. Moreover, other than the usage of analog components, these alternatives require spending additional time, which we avoid in this paper.
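The gate model described above is small enough to enumerate. The following Python sketch computes the closure of a Boolean gate on inputs from {0, 1, M}; the helper `gate_closure` and the gate lambdas are illustrative names, not part of any library.

```python
from itertools import product


def gate_closure(f, *inputs):
    """Closure of Boolean gate f on inputs from {'0', '1', 'M'}."""
    choices = [("0", "1") if v == "M" else (v,) for v in inputs]
    # Evaluate the gate on every resolution of the inputs.
    outputs = {f(*map(int, combo)) for combo in product(*choices)}
    # A unique result is stable; otherwise the output may be metastable.
    return str(outputs.pop()) if len(outputs) == 1 else "M"


AND = lambda a, b: a & b
OR = lambda a, b: a | b
NOT = lambda a: 1 - a

if __name__ == "__main__":
    symbols = ("0", "1", "M")
    for a, b in product(symbols, repeat=2):
        print(a, b, "AND:", gate_closure(AND, a, b),
              "OR:", gate_closure(OR, a, b))
```

A stable 0 at an AND gate and a stable 1 at an OR gate mask a metastable second input; in all other cases M propagates.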
III. PRELIMINARIES ON STABLE INPUTS

Comparing Stable Gray Code Strings via an FSM:
The following basic structural lemma leads to a straightforward way of comparing binary reflected Gray code strings.
Lemma 3.1: Let $g, h \in \{0,1\}^B$ with $\langle g \rangle > \langle h \rangle$. Denote by $i \in \{1, \ldots, B\}$ the first index such that $g_i \neq h_i$. Then $g_i = 1$ (i.e., $h_i = 0$) if $\mathrm{par}(g_{1,i-1}) = 0$ and $g_i = 0$ (i.e., $h_i = 1$) if $\mathrm{par}(g_{1,i-1}) = 1$.
Tables of the operators ⋄ and out. The first operand (row) is the current state, the second (column) the next input bits.
  ⋄     00  01  11  10
  00    00  01  11  10
  01    01  01  01  01
  11    11  10  00  01
  10    10  10  10  10
  out   00  01  11  10
  00    00  10  11  10
  01    00  10  11  01
  11    00  01  11  01
  10    00  01  11  10
Lemma 3.1 gives rise to a sequential representation of 2-sort(B) as a finite state machine (FSM), for input strings in $\{0,1\}^B$. Consider the state machine given in Figure 2. Its four states keep track of whether $g_{1,i} = h_{1,i}$ with parity 0 (state encoding: 00) or 1 (state encoding: 11), respectively, g < h (state encoding: 01), or g > h (state encoding: 10). Denoting by $s^{(i)}$ its state after i steps (where $s^{(0)} = 00$ is the initial state), Lemma 3.1 shows that the output given in Table IV is correct: up to the first differing bits $g_i \neq h_i$, the (identical) input bits are reproduced both for $\max^{\mathrm{rg}}$ and $\min^{\mathrm{rg}}$, and in the i-th step the state machine transitions to the correct absorbing state.
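The sequential view can be tried out directly. The following Python sketch transcribes the two operator tables of this section into dictionaries and runs the state machine bit by bit on stable inputs; the names `diamond`, `out`, `two_sort`, and `decode` are hypothetical, and `decode` is only used to cross-check the result.

```python
COLUMNS = ["00", "01", "11", "10"]  # column order of both tables

# Transition table: row = current state, column = next input bits g_i h_i.
DIAMOND = {
    "00": ["00", "01", "11", "10"],
    "01": ["01", "01", "01", "01"],
    "11": ["11", "10", "00", "01"],
    "10": ["10", "10", "10", "10"],
}

# Output table: entry = (i-th bit of max, i-th bit of min).
OUT = {
    "00": ["00", "10", "11", "10"],
    "01": ["00", "10", "11", "01"],
    "11": ["00", "01", "11", "01"],
    "10": ["00", "01", "11", "10"],
}


def diamond(state, bits):
    return DIAMOND[state][COLUMNS.index(bits)]


def out(state, bits):
    return OUT[state][COLUMNS.index(bits)]


def two_sort(g, h):
    """Return (max, min) of two stable Gray code strings of equal length."""
    state, mx, mn = "00", [], []
    for gi, hi in zip(g, h):
        o = out(state, gi + hi)        # output uses the state before step i
        mx.append(o[0])
        mn.append(o[1])
        state = diamond(state, gi + hi)
    return "".join(mx), "".join(mn)


def decode(g):
    """Value encoded by a Gray code string (for cross-checking only)."""
    value, bit = 0, 0
    for c in g:
        bit ^= int(c)
        value = (value << 1) | bit
    return value


if __name__ == "__main__":
    g, h = "110", "111"                # Gray codes of 4 and 5
    print(two_sort(g, h), decode(g), decode(h))
```

Note that the i-th output pair depends on the state after i − 1 steps, so outputs and state updates can be computed from the same inputs in each step.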
The ⋄ Operator and Optimal Sorting of Stable Inputs: We can express the transition function of the state machine as an operator ⋄ taking the current state and the input $g_i h_i$ as arguments and returning the new state. Then $s^{(i)} = s^{(i-1)} \diamond g_i h_i$, where ⋄ is given in Table V.
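As ⋄ is associative and $s^{(0)} = 00$ acts as its identity, we thus have that $s^{(i)} = \mathop{\diamond}\limits_{j=1}^{i} g_j h_j := g_1 h_1 \diamond g_2 h_2 \diamond \ldots \diamond g_i h_i$, regardless of the order in which the ⋄ operations are applied.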
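Both properties used here, associativity of the transition operator and 00 acting as its identity, can be confirmed by exhaustive enumeration. The Python sketch below does so on a transcription of the transition table and then computes all states with `itertools.accumulate`; the function names are chosen for illustration.

```python
from itertools import accumulate, product

STATES = ["00", "01", "11", "10"]

# Transition table: row = current state, column = next input bits.
DIAMOND = {
    "00": ["00", "01", "11", "10"],
    "01": ["01", "01", "01", "01"],
    "11": ["11", "10", "00", "01"],
    "10": ["10", "10", "10", "10"],
}


def diamond(a, b):
    return DIAMOND[a][STATES.index(b)]


def is_associative():
    """Check (a . b) . c == a . (b . c) for all 64 triples."""
    return all(diamond(diamond(a, b), c) == diamond(a, diamond(b, c))
               for a, b, c in product(STATES, repeat=3))


def prefix_states(g, h):
    """States s^(1), ..., s^(B) for stable strings g and h."""
    inputs = [gi + hi for gi, hi in zip(g, h)]
    # Since 00 is the identity, the first prefix is simply g_1 h_1.
    return list(accumulate(inputs, diamond))


if __name__ == "__main__":
    print(is_associative())
    print(prefix_states("110", "111"))
```

Because the operator is associative, the sequential `accumulate` can be replaced by any parenthesization, which is what a parallel prefix circuit exploits.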
An immediate consequence is that we can apply the results by [11] on parallel prefix computation to derive an O(B)-gate circuit of depth O(log B) computing all $s^{(i)}$, $i \in [B]$, in parallel. Our goal in the following sections is to extend this well-known approach to potentially metastable inputs.
IV. DEALING WITH METASTABLE INPUTS
Our strategy is the same as outlined in Section III for stable inputs: (i) compute $s^{(i)}$ for $i \in [B]$, (ii) determine $\max^{\mathrm{rg}}\{g, h\}_i$ and $\min^{\mathrm{rg}}\{g, h\}_i$ according to Table IV for $i \in \{1, \ldots, B\}$, and (iii) exploit associativity of the operator computing the $s^{(i)}$ to determine all of them concurrently with O(log B) depth and O(B) gates (using [11]). To make this work for inputs that are valid strings, we simply replace all involved operators by their respective metastable closure. Thus, we only need to implement $\diamond_{\mathrm{M}}$ and the closure of the operator given in Table IV (both of constant size) and immediately obtain an efficient circuit using the PPC framework [11].
Unfortunately, it is not obvious that this approach yields correct outputs. There are three hurdles to take: (i) Show that first computing $s^{(i)}_{\mathrm{M}}$ and then the output from this and the input yields correct output for all valid strings. (ii) Show that $\diamond_{\mathrm{M}}$ behaves like an associative operator on the given inputs (so we can use the PPC framework). (iii) Show that repeated application of $\diamond_{\mathrm{M}}$ actually computes $s^{(i)}_{\mathrm{M}}$.
Killing two birds with one stone, we first show the second and third point in a single inductive argument. We then proceed to prove the first point.
A. Determining $s^{(i)}_{\mathrm{M}}$
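Note that for any x and y, we have that res(xy) = res(x) × res(y). Hence, for valid strings $g, h \in S^B_{\mathrm{rg}}$ and $i \in \{1, \ldots, B\}$, we have that
$$s^{(i)}_{\mathrm{M}} = \mathop{*}\Bigl\{ \mathop{\diamond}\limits_{j=1}^{i} g'_j h'_j \Bigm| g' \in \mathrm{res}(g),\ h' \in \mathrm{res}(h) \Bigr\}.$$
The following theorem shows that the desired decomposition is feasible.
Theorem 4.1: For valid strings $g, h \in S^B_{\mathrm{rg}}$ and $i \in \{1, \ldots, B\}$,
$$g_1 h_1 \diamond_{\mathrm{M}} g_2 h_2 \diamond_{\mathrm{M}} \ldots \diamond_{\mathrm{M}} g_i h_i = s^{(i)}_{\mathrm{M}},$$
regardless of the order in which the $\diamond_{\mathrm{M}}$ operators are applied.
We remark that we did not prove that $\diamond_{\mathrm{M}}$ is an associative operator, just that it behaves associatively when applied to input sequences given by valid strings. Moreover, in general the closure of an associative operator need not be associative. Since $\diamond_{\mathrm{M}}$ behaves associatively when applied to input sequences given by valid strings, we can apply the results by [11] on parallel prefix computation to any implementation of $\diamond_{\mathrm{M}}$.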
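For small bit widths, the order-independence claimed here can be checked exhaustively. The Python sketch below is a brute-force test for a given B, not a proof, and rests on assumptions: the closure is taken over independent resolutions of both operands, the reference value is the superposition of the stable states over all resolutions of g and h, and only the left-to-right and right-to-left evaluation orders are compared. All names are hypothetical.

```python
from functools import reduce
from itertools import product

STATES = ["00", "01", "11", "10"]
DIAMOND = {"00": ["00", "01", "11", "10"], "01": ["01"] * 4,
           "11": ["11", "10", "00", "01"], "10": ["10"] * 4}


def diamond(a, b):
    return DIAMOND[a][STATES.index(b)]


def res(x):
    options = [("0", "1") if c == "M" else (c,) for c in x]
    return ["".join(p) for p in product(*options)]


def superpose(strings):
    return "".join(c[0] if len(set(c)) == 1 else "M"
                   for c in zip(*list(strings)))


def diamond_M(x, y):
    """Metastable closure of the transition operator."""
    return superpose(diamond(a, b) for a, b in product(res(x), res(y)))


def valid_strings(B):
    """Stable Gray codewords plus superpositions of adjacent codewords."""
    codes = [format(x ^ (x >> 1), "0{}b".format(B)) for x in range(2 ** B)]
    mixed = [superpose(pair) for pair in zip(codes, codes[1:])]
    return codes + mixed


def reference(g, h, i):
    """Superposition of the stable state s^(i) over all resolutions."""
    return superpose(
        reduce(diamond, [a + b for a, b in zip(gr[:i], hr[:i])])
        for gr, hr in product(res(g), res(h)))


def check(B):
    for g, h in product(valid_strings(B), repeat=2):
        for i in range(1, B + 1):
            items = [a + b for a, b in zip(g[:i], h[:i])]
            left = reduce(diamond_M, items)
            right = reduce(lambda acc, x: diamond_M(x, acc), reversed(items))
            if not left == right == reference(g, h, i):
                return False
    return True


if __name__ == "__main__":
    print(check(3))
```

Restricting the inputs to valid strings matters: the check says nothing about arbitrary strings over {0, 1, M}, for which the closure is not claimed to behave associatively.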
B. Obtaining the Outputs from $s^{(i)}_{\mathrm{M}}$
Denote by $\mathrm{out} : \{0,1\}^2 \times \{0,1\}^2 \to \{0,1\}^2$ the operator given in Table IV.
V. THE COMPLETE CIRCUIT
Section IV breaks the task down to using the PPC framework to compute $s^{(i)}_{\mathrm{M}}$ using $\diamond_{\mathrm{M}}$, and then $\mathrm{out}_{\mathrm{M}}$ to determine the outputs. Thus, we need to provide implementations of $\diamond_{\mathrm{M}}$ and $\mathrm{out}_{\mathrm{M}}$, and apply the template from [11].
A. Implementations of Operators
We provide optimized implementations based on fan-in 2 AND and OR gates and inverters here, cf. Section II. Depending on target architecture and available libraries, more efficient solutions may be available. 
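Implementing $\diamond_{\mathrm{M}}$: We operate with the inverted first bits of the output of $\diamond_{\mathrm{M}}$. To this end, define $Nx := \bar{x}_1 x_2$ for $x \in \{0, 1, \mathrm{M}\}^2$ and set $x \mathbin{\hat{\diamond}_{\mathrm{M}}} y := N(Nx \diamond_{\mathrm{M}} Ny)$. We compute $\bar{g}$ and work with inputs $\bar{g}$ and h using operator $\hat{\diamond}_{\mathrm{M}}$. Theorem 4.1 and elementary calculations show that, for valid strings g and h,
$$\bar{g}_1 h_1 \mathbin{\hat{\diamond}_{\mathrm{M}}} \ldots \mathbin{\hat{\diamond}_{\mathrm{M}}} \bar{g}_i h_i = N\bigl(g_1 h_1 \diamond_{\mathrm{M}} \ldots \diamond_{\mathrm{M}} g_i h_i\bigr),$$
i.e., the order of evaluation of $\hat{\diamond}_{\mathrm{M}}$ is insubstantial, just as for $\diamond_{\mathrm{M}}$. Moreover, as intended we get for all $i \in \{1, \ldots, B\}$ that this expression equals $N s^{(i)}_{\mathrm{M}}$.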
We concisely express operator ⋄ (Table IV) by the following logic formulas, where we already negate the first output bit.
This gives rise to depth-3 circuits containing in total 4 AND gates, 4 OR gates, and 2 inverters. From the gate behavior specified in Table III, one can readily verify that the circuit also implements $\hat{\diamond}_{\mathrm{M}}$ correctly. Since these circuits are identical to the ones used to compute $\mathrm{out}_{\mathrm{M}}$, we give the implementation of such a selecting circuit once in Figure 3 and describe how to use it in Table VI. We remark that with identical select bits ($\mathrm{sel}_1 = \mathrm{sel}_2$), this circuit implements a CMUX (a $\mathrm{MUX}_{\mathrm{M}}$ in our terminology) as defined in [4].
Implementing $\mathrm{out}_{\mathrm{M}}$: The multiplication table of $\mathrm{out}_{\mathrm{M}}$, which is equivalent to Table IV, is given in Table V. We can concisely express the output function given in Table V by the following logic formulas.
As mentioned before, instead of computing $s_1$, we determine and use as input $\bar{s}_1$. Thus, the above formulas give rise to depth-3 circuits that contain in total 4 AND gates, 4 OR gates, and 2 inverters (see Figure 3 and Table VI); in fact, the circuit is identical to the one used for $\hat{\diamond}_{\mathrm{M}}$ with different inputs. From the gate behavior specified in Table III, one can readily verify that the circuit indeed also implements $\mathrm{out}_{\mathrm{M}}$.
B. Implementation of $s^{(i)}_{\mathrm{M}}$
We make use of the Parallel Prefix Computation (PPC) framework [11] to efficiently compute $s^{(i)}_{\mathrm{M}}$. This framework requires an associative operator OP. In our case, $\mathrm{OP} = \hat{\diamond}_{\mathrm{M}}$, which by Theorem 4.1 is associative on all relevant inputs. Given an implementation of OP, the circuit is recursively constructed as shown in Figure 4, where the base case n = 1 is trivial. For n that is a power of 2, the depth and gate counts are given as [13]
$$\mathrm{delay}(\mathrm{PPC}_{\mathrm{OP}}(n)) = (2 \log_2 n - 1) \cdot \mathrm{delay}(\mathrm{OP}), \qquad \mathrm{cost}(\mathrm{PPC}_{\mathrm{OP}}(n)) = (2n - \log_2 n - 2) \cdot \mathrm{cost}(\mathrm{OP}). \tag{3}$$
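The cost expression in (3) can be reproduced with a small Python sketch of a recursive prefix construction. The pairing scheme below (combine neighbors, recurse on the pairs, then fill in the even positions) is an assumed reading of the recursive construction, since the figure is not reproduced here; the names `ppc` and `count_ops` are hypothetical. Only the operator count is checked, not the delay.

```python
from math import log2


def ppc(values, op):
    """All prefixes values[0] op ... op values[i], computed recursively."""
    n = len(values)
    if n == 1:
        return list(values)
    # Step 1: combine neighboring inputs (an odd last input stays unpaired).
    paired = [op(values[2 * k], values[2 * k + 1]) for k in range(n // 2)]
    # Step 2: recurse; inner[k] is the prefix ending at index 2k + 1.
    inner = ppc(paired, op)
    # Step 3: odd positions come from the recursion, even ones need one op.
    result = [values[0]]
    for i in range(1, n):
        if i % 2 == 1:
            result.append(inner[i // 2])
        else:
            result.append(op(inner[i // 2 - 1], values[i]))
    return result


def count_ops(n):
    """Number of operator instances used by ppc on n inputs."""
    calls = 0

    def concat(a, b):
        nonlocal calls
        calls += 1
        return a + b  # tuple concatenation: associative, not commutative

    prefixes = ppc([(i,) for i in range(n)], concat)
    assert prefixes == [tuple(range(i + 1)) for i in range(n)]
    return calls


if __name__ == "__main__":
    for n in (2, 4, 8, 16, 32):
        print(n, count_ops(n), 2 * n - int(log2(n)) - 2)
```

Tuple concatenation is used as the operator because it is associative but not commutative, so any ordering mistake in the construction would show up in the prefixes.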
C. Putting it All Together
Theorem 5.1: The circuit depicted in Figure 5 implements 2-sort(B) according to Definition 2.6. Its delay is O(log B) and its gate count is O(B).
VI. SIMULATION RESULTS
Design Flow: Our design flow makes use of the following tools: (i) design entry: Quartus, (ii) behavioral simulation: ModelSim, (iii) synthesis: Encounter RTL Compiler (part of Cadence tool set) with NanGate 45 nm Open Cell Library, (iv) place & route: Encounter (part of Cadence tool set) with NanGate 45 nm Open Cell Library.
Design Flow adaptations for MC: During synthesis, the VHDL description of a circuit is automatically mapped to standard cells provided by a standard cell library. The standard cell library used for the experiments provides, besides simple AND, OR, and inverter gates, also more powerful AOI (And-Or-Invert) gates, which combine multiple Boolean connectives and optimize them on the transistor level. Since we did not analyze the behavior of more complex AOI gates in the face of metastability, we restrict our implementation to use only AND, OR, and inverter gates. To ensure this, we performed the mapping to standard cells by hand. The following standard cells have been used to map the logic gates to hardware: (i) INV_X1: inverter gate, (ii) AND2_X1: AND gate, (iii) OR2_X1: OR gate. In the documentation of the NanGate 45 nm Open Cell Library it can be seen that these cells in fact compute the metastable closure of the respective Boolean connective. After mapping the design by hand, we can disable the optimization in the synthesis step and go on with place and route. This prevents the RTL Compiler from performing Boolean optimization on the design, which may destroy the MC properties of our circuits.
Fig. 4. Recursive construction of $\mathrm{PPC}_{\mathrm{OP}}(n)$ for odd n, computing $\pi_i = \delta_0\ \mathrm{OP} \ldots \mathrm{OP}\ \delta_i$. For even n the rightmost input ($\delta_{n-1}$) and output ($\pi_{n-1}$) are not present. Dashed lines are not connected to $\mathrm{PPC}_{\mathrm{OP}}$( n/2 ).
Fig. 5. Inputs $(g_1, h_1), \ldots, (g_{B-2}, h_{B-2}), (g_{B-1}, h_{B-1})$. [...] Figure 4, and we use the implementations of $\diamond_{\mathrm{M}}$ and $\mathrm{out}_{\mathrm{M}}$ specified by Figure 3 and Table VI. For input $N s^{(0)} = (1, 0)$, $\mathrm{out}_{\mathrm{M}}$ reduces to an AND and an OR gate.
The binary benchmark: Bin-comp: Following [5], we also compare our sorting networks to a standard (non-containing!) sorting design. Bin-comp uses a simple VHDL statement to compare both inputs. Each output is connected to a standard multiplexer, where the signal greater is used as the select bit for both multiplexers. The binary design follows a standard design flow, which uses the tools listed above. In short, Bin-comp follows the same design process as 2-sort, but then undergoes optimization using a more powerful set of basic gates.
We emphasize that the more powerful AOI gates combine multiple Boolean functions and optimize them on gate level, yet each of them is still counted as one gate. Thus, comparing our design to the binary design in terms of gate count, area, and delay disfavors our solution. Moreover, the optimization routine switches to employing more powerful gates when going from B = 8 to B = 16 (see Table VIII), resulting in a decrease of the delay of the binary implementation.
Nonetheless, our design performs comparably to the non-containing binary design in terms of delay, cf. Table VII. This is quite notable, as further optimization on the transistor level or using more powerful AOI gates is possible, with significant expected gains. The same applies to gate count and area, where a notable gap remains. Recall, however, that the binary design hides complexity by using more advanced gates and does not contain metastability.
We remark that we refrained from optimizing the design by making use of all available gates or devising transistor-level implementations for two reasons. First, such an approach is tied to the utilized library or requires the design of standard cells. Second, it would have been unsuitable for a comparison with [5], which does not employ such optimizations either.
Comparison to State of the Art: Our circuits show large improvements over [5] in all performance measures. Delays, gate counts, and area are all smaller by factors between roughly 1.5 and 3.5. In particular, for B = 16 delay is roughly cut in half, while gate count and area decrease by factors of 3 or more.
VII. DISCUSSION
In this paper, we provide asymptotically optimal MC sorting primitives. We achieve this by applying results on parallel prefix computation [11], which requires establishing that the involved operators behave associatively on the relevant inputs.
Our circuits are purely combinational and are glitch-free (as they are MC). Compared to standard sorting networks, we roughly match delay, but fall behind on gate count and area. However, we used gate-level implementations of out M and M restricted to AND and OR gates and inverters. Transistor-level implementations, which are a straightforward optimization, would decrease size and delay of the derived circuits further. We expect that this will result in circuits that perform on par with standard sorting networks. In light of these properties, we believe our circuits to be of wide applicability.
