Abstract: High-speed architectures for finding the first two maximum/minimum values are of paramount importance in several applications, including iterative (e.g., turbo and low-density parity-check) decoders. In this brief, building on a previous work based on radix-2 solutions, we propose higher and mixed radix implementations that reduce the architecture latency. Post-place-and-route results on a 180-nm CMOS standard cell technology show that the proposed architectures achieve lower latency than radix-2 solutions with a moderate area increase.
I. INTRODUCTION
Recently, several simplified algorithms for channel decoding have been proposed for low-density parity-check (LDPC) [1] and turbo codes [2]. The $\lambda$-min algorithm [3], min-sum, and its improved versions [4] are widely used in LDPC decoders [5], [6]. Similarly, in [7] a novel multi-input max* approximation for turbo and turbo trellis-coded-modulation (TCM) coding is proposed. All these works share the need for finding the first two maximum/minimum (max/min) values in a set of $M$ elements. As an example, in the min-sum algorithm [4] the magnitude of the $i$th output of a check node having degree $d_c$ is given by the first min of the $d_c$ inputs' magnitudes, unless this equals the $i$th input's magnitude, in which case the second min is employed. In [7], the Jacobian logarithm of $n$ inputs is computed as the first max (max1) plus a correction term that depends on max1 - max2, where max2 is the second max. A similar problem can also be found in K-best multiple-input-multiple-output (MIMO) detectors [8]-[10], non-binary LDPC decoders [11], and turbo product codes [12], [13], where the computation of the first $W$ max/min values is required. For the sake of brevity, the extension to the search of the first $W$ max/min values is not investigated in this brief.
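As a rough illustration of why the first two minima are sufficient for the min-sum check-node update described above, the following Python sketch derives every output magnitude from min1, min2, and the position of min1. The function name `check_node_magnitudes` is hypothetical, and sign handling as well as the scaling or offset corrections of improved min-sum variants are deliberately left out.

```python
def check_node_magnitudes(mags):
    """Min-sum output magnitudes of one check node (magnitudes only).

    Output i is the minimum over all inputs except input i, which is
    min1 everywhere except at the position of min1, where it is min2.
    """
    if len(mags) < 2:
        raise ValueError("a check node needs at least two inputs")
    min1 = min2 = float("inf")
    idx1 = -1
    for i, m in enumerate(mags):
        if m < min1:
            # New smallest value: the old min1 becomes the second min.
            min1, min2, idx1 = m, min1, i
        elif m < min2:
            min2 = m
    return [min2 if i == idx1 else min1 for i in range(len(mags))]
```

Selecting by the index of min1, rather than by comparing values, gives the same result when several inputs share the minimum magnitude, since min2 then equals min1.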
All the architectures proposed in [14]-[16] for finding the first two max/min values are based on radix-2 tree structures or a proper blend of radix-2 and radix-3 blocks. Although these architectures are remarkably small in terms of area, they require a significant number of stages (especially for large $M$), which negatively affects the delay. Moreover, some applications, such as iterative decoders (e.g., turbo codes [2], [7]), require high throughput and include feedback loops in the processing: in these cases pipelining does not provide any throughput improvement and different solutions are necessary.
In this brief, building on [16], we analyze the implementation of a generic radix-$K$ solution and we further extend the design space by considering mixed radix architectures (MRAs). A similar work is presented in [17]; however, it focuses only on binary and ternary trees. In contrast, this work offers a systematic treatment including MRAs and, to the best of our knowledge, it is the first work addressing MRAs for finding the first two max/min values.
B. Fixed Radix Architecture (FRA)
A delay $O(\log_K(M))$ is obtained by using a tree structure made of radix-$K$ CSs with $K < M$ (see Fig. 1). At level $l = 1$ there are $M/K$ concurrent CSs, each of which finds the first two max values out of its $K$ inputs with a delay $O(1)$. Thus, each of these CSs contains $K \cdot (K - 1)/2$ comparators working concurrently. The total number of comparators at $l = 1$ is

$$\frac{M}{K} \cdot \frac{K (K - 1)}{2} = \frac{M (K - 1)}{2}. \qquad (1)$$
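As a quick numerical check of the level-1 comparator count described above ($M/K$ concurrent CSs, each with $K(K-1)/2$ comparators), a minimal Python sketch follows. The function name `level1_comparators` is hypothetical, and the sketch assumes that $K$ divides $M$, as in the fixed radix case.

```python
def level1_comparators(M, K):
    """Comparators at tree level l = 1 for M inputs and radix-K CSs."""
    if K < 2 or M % K != 0:
        raise ValueError("this sketch assumes K >= 2 and K dividing M")
    per_cs = K * (K - 1) // 2      # one comparator per pair of CS inputs
    return (M // K) * per_cs       # M/K stages work concurrently
```

For instance, `level1_comparators(32, 2)` counts 16 radix-2 stages with one comparator each, whereas `level1_comparators(32, 4)` counts 8 radix-4 stages with six comparators each.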
Then, we define $X_K[i]$ as the set of $K$ elements processed by the $i$th CS ($0 \le i \le M/K - 1$) at $l = 1$ ( Inspired by [16]
From (1) and (2) the total number of comparators is
C. Mixed Radix Architecture
The solution detailed in the previous paragraphs can be applied as is only when $\log_K(M)$ is an integer, i.e., when $M$ is a power of $K$. However, several cases of practical interest where $M$ is not a power of $K$ can give better area/latency trade-offs.
To that purpose we propose to use MRAs, in which different levels of the tree use different radices. In the following we will refer to this solution as inter-level MRA (IR-MRA). Further flexibility can be achieved by using a different radix for each CS at level $l$, namely, $K_l[i]$ is the radix of the $i$th CS at level $l$. This solution will be referred to as intra-level MRA (IA-MRA).
1) IR-MRA:
If N is the number of levels we have
with $\Phi_n(j) = \prod_{i=j}^{n-1} K_i$ and $K_l$ the radix at level $l$.
We can find several arrays $\mathbf{K}_N = \{K_1, \ldots, K_N\}$ that satisfy (5), where the elements of $\mathbf{K}_N$ are taken from the set of the divisors of $M$. To that purpose we introduce $D_N$ as the set of all the arrays $\mathbf{K}_N$ that satisfy (5) and $|D_N|$ as the cardinality of $D_N$. If we impose that $M$ is a power of two and take the base-2 logarithm of (5), we obtain $\log_2(M) = \sum_{l=1}^{N} \log_2(K_l)$. As a consequence, the problem of finding $|D_N|$ simplifies to counting the ordered sets of $N$ positive integers $\log_2(K_l)$ whose sum is $\log_2(M)$. Thus, we obtain $|D_N| = \binom{\log_2(M) - 1}{N - 1}$. As an example, for $M = 32$ and $N = 3$ there are six possible IR-MRAs ($|D_3| = \binom{4}{2} = 6$). To find the solution that requires the minimum $C$ we consider (4) and impose $\partial C / \partial K_l = 0$ for each $l$
with (5) as a constraint, so that $\mathbf{K}_N$ is chosen to satisfy (5).
Indeed, we can rewrite (6) as
Finally, by substituting (7) and (8) in (5) we obtain M = K 2 1 3(K 2 0 1)
Unfortunately, both (9) and (10) can be written as polynomial Diophantine equations with integer coefficients, which do not always admit integer solutions [18]. In IR-MRAs, however, $|D_N|$ is usually on the order of a few tens; thus, $D_N$ can be explored exhaustively, as shown in Section IV.
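The exhaustive exploration of $D_N$ mentioned above can be sketched in Python as follows. This is only an illustrative enumeration, assuming that every radix satisfies $K_l \ge 2$ and that the product of the radices equals $M$; the function name `radix_arrays` is hypothetical, and the sketch lists the candidate arrays without evaluating their comparator count $C$.

```python
from math import comb


def radix_arrays(M, N):
    """All ordered arrays (K_1, ..., K_N) with K_l >= 2 and product M."""
    if N == 1:
        return [(M,)] if M >= 2 else []
    arrays = []
    for k in range(2, M + 1):
        if M % k == 0:                       # K_1 must be a divisor of M
            for rest in radix_arrays(M // k, N - 1):
                arrays.append((k,) + rest)
    return arrays


if __name__ == "__main__":
    candidates = radix_arrays(32, 3)
    print(len(candidates), candidates)
    # For M a power of two the count should match the binomial expression.
    print(comb(5 - 1, 3 - 1))
```

The same routine also works when $M$ is not a power of two, in which case the binomial expression no longer applies and the candidates simply have to be counted.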
2) IA-MRA:
The formalization proposed in Section II-C1 to compute $C$ can be extended to the case of IA-MRA as
where $O_l$ is the number of CSs at level $l$ and the level-dependent coefficient equals $1/2$ for $l = 1$ and $3/2$ for $l > 1$.
This implies that (5) should be rewritten as
where $\bar{K}_l$ is the average radix at level $l$. As in the IR-MRA case, the minimization of (11) is a Diophantine problem; as a consequence, it is not always possible to find optimal solutions explicitly. Even though $|D_N|$ tends to be large, we explored $D_N$ exhaustively, as for the IR-MRA case. Experimental results show that in several cases IR-MRAs and IA-MRAs achieve the same $C$ for a given $N$. For the sake of brevity, we will concentrate on IR-MRAs. However, in Section IV we will show IA-MRA results when they have lower complexity than the corresponding IR-MRA.
III. ARCHITECTURAL DESCRIPTION
Each CS is made of three main parts: an array of comparators, some one-hot index generators (OHIGs), and some mux-like structures (MLSs).
A. Level l = 1 Comparing Stages
Note that in this section we consider input values belonging to one CS, namely $x_j$ with $0 \le j \le K - 1$. Thus, the indices do not correspond to the ones used in Section II, where $X_M$ and $X_K[i]$ were defined.
Let us define $x_p$ and $x_q$ as two inputs out of the possible $K$ ones, $s_{p,q}$ as the sign of $x_p - x_q$ and $s_{q,p} = \overline{s_{p,q}}$, where $\overline{(\cdot)}$ is the one's-complement operator [see Fig. 2(a)].1 If input $x_n$ is max1 then $s_{q,n} = 1$ for every $q$ such that $0 \le q \le K - 1$ and $q \ne n$. Now we build an array $N$ containing $K$ elements, where the $p$th element is

$$N_p = \bigwedge_{q=0,\, q \ne p}^{K-1} s_{q,p} \qquad (13)$$

and $\wedge$ stands for the logic-and operation. As can be observed, if $n$ is the index of max1 of the $i$th CS, namely $n = \arg\max(X_K[i])$, $N$ is the one-hot (OH) binary representation of $n$. Finally, $N$ is used as the selection signal of an MLS [see Fig. 2(b)]. According to (13), we concurrently compute all the bits of the OHIG by means of $K$ and-gates, each of which receives $K - 1$ input signals $s_{q,p}$ [see the dashed box in Fig. 2(b)]. The MLS can be described as follows. Given $x_u \in X_K[i]$, $0 \le u \le K - 1$, and being $x_{u,v}$ the $v$th bit of $x_u$, we obtain the $v$th bit of $y'_K[i]$ as

$$y'_{K,v}[i] = \bigvee_{u=0}^{K-1} \left( N_u \wedge x_{u,v} \right) \qquad (14)$$

where $\vee$ is the logic-or operation. According to (14), a 1-bit MLS is made of a $K$-input or-gate and $K$ 2-input and-gates [see Fig. 2(c)]. As a consequence, we concurrently obtain each bit of $y'_K[i]$, namely max1.

1 The case $x_p = x_q$ is not relevant, as choosing either $x_p$ or $x_q$ is the same.

CS architecture (radix-).

Let us define $\tilde{t}_{z,w} = s_{z,w} \wedge \overline{N_w}$, where $\tilde{t}_{z,w} = 1$ when $x_w \ge x_z$ and $x_w \ne y'_K[i]$. If input $x_n$ is max1, $s_{n,w} = 0$ and $\tilde{t}_{n,w} = 0$. As a consequence, to identify max2 we cannot simply compute $\tilde{M}_w = \bigwedge_{z=0,\, z \ne w}^{K-1} \tilde{t}_{z,w}$ as for max1. We can avoid this problem by introducing $t_{z,w} = \tilde{t}_{z,w} \vee N_z$: if input $x_n$ is max1, $N_n = 1$ and $t_{n,w} = 1$ [see Fig. 2(d)].
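The selection logic described in this section can be mimicked with a small bit-level behavioral model in Python. This is a sketch under stated assumptions, not the authors' VHDL: the inputs are taken as unsigned integers on `width` bits, the sign bit of a difference is modeled as a less-than test, and the max2 selection signal is assumed to be the logic-and over $z \ne w$ of the terms (sign and complemented max1 flag) or-ed with the max1 flag of input $z$. The names `comparing_stage` and `mls` are hypothetical.

```python
def comparing_stage(x, width=6):
    """Behavioral model of one radix-K level-1 CS; returns (max1, max2)."""
    K = len(x)
    if K < 2:
        raise ValueError("a comparing stage needs at least two inputs")

    # s[p][q] = 1 when x_p - x_q is negative; s[q][p] is its complement.
    s = [[0] * K for _ in range(K)]
    for p in range(K):
        for q in range(p + 1, K):
            s[p][q] = int(x[p] < x[q])
            s[q][p] = 1 - s[p][q]

    # One-hot index of max1: and of the K - 1 signals s[q][p], q != p.
    n_sel = [int(all(s[q][p] for q in range(K) if q != p)) for p in range(K)]

    # Assumed max2 logic: t[z][w] = (s[z][w] and not N_w) or N_z,
    # then one-hot index of max2 as the and over z != w.
    t = [[(s[z][w] & (1 - n_sel[w])) | n_sel[z] for w in range(K)]
         for z in range(K)]
    m_sel = [int(all(t[z][w] for z in range(K) if z != w)) for w in range(K)]

    def mls(sel):
        """Mux-like structure: per-bit or of (select and input bit)."""
        y = 0
        for v in range(width):
            bit = 0
            for u in range(K):
                bit |= sel[u] & ((x[u] >> v) & 1)
            y |= bit << v
        return y

    return mls(n_sel), mls(m_sel)
```

Because each pair of inputs is compared once and the opposite sign is obtained by complement, ties are resolved consistently and both selection vectors stay one-hot in this model, so equal inputs simply yield max1 equal to max2.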
According to (15) and (16) [i]. To complete the architecture (see Fig. 3 ) we need to infer if y 
IV. EXPERIMENTAL RESULTS
The analytical approach proposed in Section II can be used to identify the set of solutions with minimum C. This strategy is useful when the design space tends to be large, as in the IA-MRA case. In Table II area and latency for a 180-nm CMOS standard cell technology of one comparator (data represented on six bits) and one radix-2 CS at l = 1 and l > 1 are shown; the complexity of the comparator(s) is about half the complexity of the CS. Moreover, data in Table II allow for a reasonable estimation of the area and latency of radix-2 architectures summarized in Table I C . Similar formulas can be obtained to estimate A and L for higher radix and MRAs.
To highlight the area/latency trade-off we define 3 Results shown in Table I To obtain more accurate results [19], the architectures were developed in VHDL, synthesized with Synopsys Design Compiler for shortest delay, and placed and routed (P&R) with Cadence Encounter using a 180-nm CMOS standard cell technology at 0 °C and with a supply voltage of 1.95 V.
Even if the expression of $M_w$ (15) can be optimized by hand (each element $t_{z,w}$ contains the common term $N_w$), we prefer to leave the task of logic minimization to the logic synthesizer, so as to explore a larger space of complexity/performance trade-offs. The experimental results shown in the following for [14]-[16] have been reproduced with a six-bit data width for a fair comparison with the proposed solutions. The results are shown in Table III. The proposed architecture can also be employed to reduce $L$ when $M$ is not a power of two. As an example, if $M = 9$ a radix-2 solution imposes an unbalanced tree structure with $N = 4$. The implementation of such a structure leads to $A = 21025\ \mu\mathrm{m}^2$ and $L = 2$ ns. On the other hand, with a FRA-3, corresponding to $N = 2$, we obtain $A = 14426\ \mu\mathrm{m}^2$ and $L = 1.6$ ns. It is worth noting that there is no IA-MRA for $M = 9$, $N = 2$ that performs better than $K_1 = K_2 = 3$. Similarly, for $M = 24$ a radix-2 solution has an unbalanced tree structure with $N = 5$. On the contrary, with $N = 3$ we have nine possible MRAs. As can be observed, the 4/2/3 MRA improves the latency by 25% with respect to the FRA-2 and requires an area overhead of less than 1.1%.
For $M = 24$, $N = 3$ IA-MRAs exist; however, they require $C = 54$, as does the best solution reported in Table III.
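The numbers of levels quoted above for $M = 9$ and $M = 24$ can be reproduced with a short Python sketch. It assumes that an unbalanced fixed-radix tree needs the smallest $N$ such that $K^N \ge M$, which is consistent with the values of $N$ given in the text; the function name `tree_levels` is hypothetical.

```python
def tree_levels(M, K):
    """Smallest number of levels N such that K**N >= M (integer arithmetic)."""
    if M < 2 or K < 2:
        raise ValueError("this sketch assumes M >= 2 and K >= 2")
    levels, capacity = 0, 1
    while capacity < M:
        capacity *= K      # each extra level multiplies the fan-in by K
        levels += 1
    return levels


if __name__ == "__main__":
    for M, K in [(9, 2), (9, 3), (24, 2)]:
        print(f"M = {M}, radix-{K}: N = {tree_levels(M, K)}")
```

Integer multiplication is used instead of a floating-point logarithm to avoid rounding issues when $M$ is an exact power of $K$.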
In Table IV we compare our MRAs with other approaches proposed for finding the first two min values in LDPC decoders [14], [15]: the considered cases are $M = 8$ for [14], and $M = 6$ and $M = 7$ for [15]. For the cases $M = 6$ and $M = 7$ we also consider the unbalanced radix-2 tree proposed in [16] for a generic $M$, whereas for the case $M = 7$ we use an IA-MRA: one radix-4 and one radix-3 CS at $l = 1$ ($K_1[i] = 4, 3$) and $K_2 = 2$.
For MRAs in Table III we observe that when M = 32 the best solution with N = 2 (8/4) leads to 3 32 > 1 whereas the best solution for N = 3 (4/2/4) achieves 
V. CONCLUSION
In this brief, high-speed architectures for finding the first two max/min values are presented. The proposed solution extends previous works based on radix-2 and radix-3 solutions to both higher and mixed radix solutions. As shown by the experimental results, MRAs achieve lower latency than radix-2 architectures with a limited area increase. Moreover, MRAs show better figures than other solutions proposed for LDPC decoders.
I. INTRODUCTION
Lately there has been intense interest in 3-D ICs in the semiconductor industry. 3-D integration is being hailed as a "Beyond Moore" driver that promises a further increase in integration density. A 3-D IC is made up of an IC stack with very short vertical interconnections between adjacent dies by means of through-silicon vias (TSVs). Going 3-D offers many potential advantages, including smaller footprint, reduced interconnect delay, higher system performance, and lower power consumption. Moreover, heterogeneous technologies can be comfortably integrated in a die stack. One can choose the most suitable process to manufacture each die to optimize cost and performance. The simplest example is stacking memory and CPU.
