High speed architectures for finding the first two maximum/minimum values by Amaru, L.G. et al.
Politecnico di Torino
Porto Institutional Repository
[Article] High speed architectures for finding the first two maximum/minimum
values
Original Citation:
L.G. Amaru; M. Martina; G. Masera (2012). High speed architectures for finding the first two
maximum/minimum values. In: IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION
(VLSI) SYSTEMS, vol. 20 n. 12, pp. 2342-2346. - ISSN 1063-8210
Availability:
This version is available at : http://porto.polito.it/2497944/ since: June 2012
Publisher:
IEEE
Published version:
DOI:10.1109/TVLSI.2011.2174166
Terms of use:
This article is made available under terms and conditions applicable to Open Access Policy Article
("Public - All rights reserved"), as described at http://porto.polito.it/terms_and_conditions.html
High speed architectures for finding the first
two maximum/minimum values
Luca G. Amarù, Maurizio Martina, Member, IEEE, Guido Masera, Senior Member, IEEE
Abstract
High speed architectures for finding the first two maximum/minimum values are of paramount importance in
several applications, including iterative (e.g. turbo and LDPC) decoders. In this brief, stemming from a previous
work, based on radix-2 solutions, we propose higher and mixed radix implementations that improve the architecture
latency. Post place and route results on a 180 nm CMOS standard cell technology show that the proposed architectures
achieve lower latency than radix-2 solutions with a moderate area increase.
Index Terms
Turbo decoder, LDPC decoder, minimum values generator, tree structure approach.
I. INTRODUCTION
Recently, several simplified algorithms for channel decoding have been proposed for Low-Density-Parity-Check
(LDPC) [1] and turbo codes [2]. The λ-min algorithm [3] and the min-sum algorithm with its improved versions [4] are widely used in
LDPC decoders [5], [6]. Similarly, in [7] a novel multi-input max approximation for turbo and turbo Trellis-Coded-Modulation (TCM)
coding is proposed. All these works share the need for finding the first two maximum/minimum
(max/min) values in a set of M elements. As an example, in the min-sum algorithm [4] the magnitude of the
i-th output of a check node having degree $d_c$ is given by the first min of the $d_c$ inputs' magnitudes, unless this
equals the i-th input's magnitude, in which case the second min is employed. In [7], the Jacobian logarithm of
n inputs is computed as the first max (max1) plus a correction term that depends on max1 − max2, where max2 is
the second max. A similar problem can be found also in K-best Multiple-Input-Multiple-Output (MIMO) detectors
[8]–[10], non-binary LDPC decoders [11] and Turbo Product Codes [12], [13] where the computation of the first
W max/min values is required. For the sake of brevity the extension to the search of the first W max/min values
is not investigated in this brief.
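The min-sum rule recalled above can be illustrated with a minimal Python sketch. It only shows why the first two minima are sufficient for a check node; the function names `two_min` and `check_node_magnitudes` are hypothetical and do not come from [4].

```python
def two_min(values):
    """Return (min1, min2, index of min1) for a list with at least two entries."""
    idx1 = min(range(len(values)), key=values.__getitem__)
    min1 = values[idx1]
    # Second minimum: smallest value among all positions except idx1.
    min2 = min(v for i, v in enumerate(values) if i != idx1)
    return min1, min2, idx1


def check_node_magnitudes(mags):
    """Output i gets min1, except the output facing the min1 input, which gets min2."""
    min1, min2, idx1 = two_min(mags)
    return [min2 if i == idx1 else min1 for i in range(len(mags))]
```

Only min1, min2 and the position of min1 are needed, whatever the check node degree.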
All the architectures proposed in [14]–[16] for finding the first two max/min values are based on radix-2 tree
structures or a proper blend of radix-2 and radix-3 blocks. Even if these architectures are remarkably small in
terms of area, they require a significant number of stages (especially for large M), which negatively affects the delay.
However, some applications, such as iterative decoders (e.g. turbo codes [2], [7]), require high throughput and include
feedback loops in the processing: in these cases pipelining does not provide any throughput improvement and
different solutions are necessary.
In this brief, stemming from [16], we analyze the implementation of a generic radix-K solution and we further
extend the design space by considering Mixed Radix Architectures (MRAs). A similar work is presented in [17];
however, it focuses only on binary and ternary trees. In contrast, this work offers a systematic treatment
including MRAs and, to the best of our knowledge, it is the first work addressing MRAs for finding the first two
max/min values.
II. FINDING THE FIRST TWO MAX/MIN VALUES
A. Problem formulation
Given a set $X^M = \{x_0, \ldots, x_{M-1}\}$ made of M elements, we want to find the first two max/min values, namely
$y_0^M = \max(X^M)$ and $y_1^M = \max(X^M \setminus \{y_0^M\})$ (and similarly with min in place of max). For the sake of simplicity,
in the following we discuss only the max case, as the equivalent min solution can be straightforwardly derived.
According to the tree-based solution proposed in [16], the procedure to find the first two max values out of M
assigned ones can be decomposed into $\log_2(M)$ levels, where level l contains $M/2^l$ Comparing Stages (CSs), with
$1 \le l \le \log_2(M)$. CSs at the first level (l = 1) receive two input values and sort them, so each CS is made of one
comparator. Each CS at higher levels (l > 1) receives two couples of sorted values from the previous level and
outputs one sorted couple. If we approximate the delay of a CS with the delay of a comparator, we can infer that
the delay of such a structure is $O(\log_2(M))$. This assumption is used throughout this section. However, in section
IV we give area and delay results obtained via real implementation of the proposed architectures. On the other
hand, we can implement a radix-M solution that finds the first two max values with a delay O(1). This
reduced delay is paid for in terms of complexity, as every input value must be compared with every other one. Thus,
the total number of comparators (avoiding duplicate comparisons) is $(M-1) + (M-2) + \cdots + 1 = M (M-1)/2$.
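The radix-2 decomposition described above can be sketched as a behavioural Python model. This is an illustration under stated assumptions, not the hardware of [16]: M is assumed to be a power of two with M ≥ 2, and the names `merge_couples`, `first_two_max_radix2` and `all_pairs_comparators` are hypothetical.

```python
def merge_couples(a, b):
    """Merge two sorted couples (max1, max2) into one sorted couple."""
    (a0, a1), (b0, b1) = a, b
    if a0 >= b0:
        return a0, max(a1, b0)
    return b0, max(b1, a0)


def first_two_max_radix2(x):
    """Return ((max1, max2), number of levels) for len(x) a power of two, >= 2."""
    # Level 1: each CS sorts one pair of inputs.
    couples = [(max(p, q), min(p, q)) for p, q in zip(x[0::2], x[1::2])]
    levels = 1
    # Levels l > 1: each CS merges two sorted couples from the previous level.
    while len(couples) > 1:
        couples = [merge_couples(c, d)
                   for c, d in zip(couples[0::2], couples[1::2])]
        levels += 1
    return couples[0], levels


def all_pairs_comparators(M):
    """Comparators of the radix-M solution: one per unordered pair of inputs."""
    return M * (M - 1) // 2
```

For M = 8 the tree needs three levels, whereas the radix-M solution resolves everything in one level at the cost of 28 comparators.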
B. Fixed Radix Architecture (FRA)
A delay $O(\log_K(M))$ is obtained by using a tree structure made of radix-K CSs with K < M (Fig. 1). At
l = 1 there are M/K concurrent CSs, each of which finds the first two max values out of its K inputs with a
delay O(1). Thus, each of these CSs contains $K (K-1)/2$ comparators working concurrently. The total number
of comparators at l = 1 is
$$C_{l=1} = \frac{M (K-1)}{2}. \qquad (1)$$
Then, we define $X^{K_1}[i]$ as the set of K elements processed by the i-th CS ($0 \le i \le M/K - 1$) at l = 1
($\bigcup_{i=0}^{M/K-1} X^{K_1}[i] = X^M$), with $y_0^{K_1}[i] = \max(X^{K_1}[i])$ and $y_1^{K_1}[i] = \max(X^{K_1}[i] \setminus \{y_0^{K_1}[i]\})$. For $1 < l \le \log_K(M)$,
the i-th CS receives K couples $y_0^{K_{l-1}}[i \cdot K + j]$, $y_1^{K_{l-1}}[i \cdot K + j]$ with $0 \le i \le M/K^l - 1$ and $0 \le j \le K - 1$.
Inspired by [16] we can infer that $y_0^{K_l}[i]$, the first max output by the i-th CS at level l, is obtained by comparing the
K max1 values from the previous level, $y_0^{K_{l-1}}[i \cdot K + j]$. Thus, $K (K-1)/2$ comparators are required to compute
$y_0^{K_l}[i]$. Similarly, to compute $y_1^{K_l}[i]$ we need to compare $y_0^{K_{l-1}}[i \cdot K + p]$ and $y_1^{K_{l-1}}[i \cdot K + q]$ with $0 \le p, q \le K - 1$
and $p \ne q$. As a consequence, to complete the computation of $y_1^{K_l}[i]$ a further $K (K-1)$ comparators are required,
as detailed in section III-B. The total number of comparators for $1 < l \le \log_K(M)$ is
$$C_{l>1} = \sum_{l=2}^{\log_K(M)} \frac{M}{K^l} \cdot \frac{3K (K-1)}{2} = \frac{3 (M - K)}{2}. \qquad (2)$$
From (1) and (2) the total number of comparators is
$$C = C_{l=1} + C_{l>1} = \frac{M K + 2M - 3K}{2}. \qquad (3)$$
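As a quick sanity check of (1)-(3), the following Python sketch counts the comparators of a radix-K tree level by level and compares the result with the closed form. The function names are hypothetical, and M is assumed to be an exact power of K.

```python
def fra_comparators(M, K):
    """Count comparators level by level for a radix-K tree (M a power of K)."""
    if K < 2 or M < K or M % K:
        raise ValueError("M must be a power of K with K >= 2")
    stages = M // K
    total = stages * (K * (K - 1) // 2)          # level 1: K(K-1)/2 per CS
    while stages > 1:
        if stages % K:
            raise ValueError("M must be a power of K")
        stages //= K
        total += stages * (3 * K * (K - 1) // 2)  # levels l > 1: 3K(K-1)/2 per CS
    return total


def fra_closed_form(M, K):
    """Closed form (3): C = (M*K + 2*M - 3*K) / 2."""
    return (M * K + 2 * M - 3 * K) // 2
```

For example, M = 64 with K = 4 gives 186 comparators with both functions, and K = M reduces to the all-pairs count M(M − 1)/2.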
Figure 1. Radix-K tree structure
[Figure: inputs $x_0, \ldots, x_{M-1}$ feed M/K comparing stages at l = 1; $M/K^2$ comparing stages follow at l = 2, down to a single comparing stage at $l = \log_K(M)$ that outputs the first maximum $y_0^M$ and the second maximum $y_1^M$.]
C. Mixed Radix Architecture
The solution detailed in the previous paragraphs can be applied as is only when $\log_K(M) \in \mathbb{N}$. However, several
cases of practical interest where M is not a power of K can give better area/latency trade-offs. To that purpose we
propose to use MRAs, in which different levels of the tree use different radices. In the following we refer to this
solution as InteR-level-MRA (IR-MRA). Further flexibility can be achieved by using a different radix for each CS
at level l, namely, $K_l[i]$ is the radix of the i-th CS at level l. This solution is referred to as IntrA-level-MRA
(IA-MRA).
1) IR-MRA: If N is the number of levels we have
$$C = \frac{M (K_1 - 1)}{2} + \frac{3M}{2} \sum_{n=2}^{N} \frac{K_n - 1}{\pi_n(1)} \qquad (4)$$
with $\pi_n(\epsilon) = \prod_{i=\epsilon}^{n-1} K_i$ and $K_l$ the radix at level l. Imposing
$$M = \prod_{l=1}^{N} K_l, \quad M, K_l \in \mathbb{N}^+ \qquad (5)$$
we can find several arrays $\mathbf{K}_N = \{K_1, \ldots, K_N\}$ that satisfy (5), where the elements of $\mathbf{K}_N$ are taken from the
set of the divisors of M. To that purpose we introduce $D_N$ as the set of all the arrays $\mathbf{K}_N$ that satisfy (5) and
$\delta_N$, the cardinality of $D_N$. If we impose that M is a power of two and we take the logarithm of (5) we obtain
$\mu = \log_2(M) = \sum_{l=1}^{N} \log_2(K_l)$. As a consequence, the problem of finding $\delta_N$ simplifies to counting the arrays of N
positive integers ($\log_2(K_l)$) whose sum is $\mu$. Thus, we obtain $\delta_N = \binom{\mu - 1}{N - 1}$. As an example, for M = 32 and N = 3
there are six possible IR-MRAs ($\delta_3 = 6$). To find the solution that requires the minimum C we consider (4) and
impose $\partial C / \partial K_l = 0$ for each l:
$$K_1 = \sqrt{3 \sum_{n=2}^{N} \frac{K_n - 1}{\pi_n(2)}}, \qquad K_l \big|_{1<l<N} = \sqrt{\sum_{n=l+1}^{N} \frac{K_n - 1}{\pi_n(l+1)}} \qquad (6)$$
with (5) as a constraint, so that $K_N$ is chosen to satisfy (5).
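Indeed, we can rewrite (6) as
$$K_1 = \sqrt{3 (K_2 - 1)} \quad \text{if } N = 2 \qquad (7)$$
$$\left.\begin{aligned} K_1 &= \sqrt{3 (2K_2 - 1)} \\ K_l &= \sqrt{2K_{l+1} - 1}, \quad 1 < l < N - 1 \\ K_{N-1} &= \sqrt{K_N - 1} \end{aligned}\right\} \quad \text{if } N > 2. \qquad (8)$$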
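The step from (6) to (7) and (8) can be made explicit with a short derivation. This is a sketch that treats the radices as independent real variables, as the unconstrained stationarity condition does; it writes $\pi_n(\epsilon)$ for the product $K_\epsilon \cdots K_{n-1}$ (with $\pi_n(n) = 1$), a symbol choice assumed here.

```latex
\begin{align}
\frac{\partial C}{\partial K_1}
  &= \frac{M}{2} - \frac{3M}{2 K_1^2} \sum_{n=2}^{N} \frac{K_n - 1}{\pi_n(2)} = 0
  \;\Longrightarrow\;
  K_1^2 = 3 \sum_{n=2}^{N} \frac{K_n - 1}{\pi_n(2)}, \\
K_{N-1}^2 &= \frac{K_N - 1}{\pi_N(N)} = K_N - 1, \\
K_l^2 &= (K_{l+1} - 1) + \frac{1}{K_{l+1}} \sum_{n=l+2}^{N} \frac{K_n - 1}{\pi_n(l+2)}
       = (K_{l+1} - 1) + \frac{K_{l+1}^2}{K_{l+1}}
       = 2 K_{l+1} - 1, \quad 1 < l < N - 1, \\
K_1^2 &= 3 \left[ (K_2 - 1) + \frac{K_2^2}{K_2} \right] = 3 (2 K_2 - 1), \quad N > 2.
\end{align}
```

The third line uses the stationarity condition for $K_{l+1}$ itself to replace the remaining sum by $K_{l+1}^2$, which is why the recursion only involves adjacent levels.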
Finally, by substituting (7) and (8) in (5) we obtain
$$M = K_2 \cdot \sqrt{3 (K_2 - 1)} \qquad (9)$$
$$M = \sqrt{3 (2K_2 - 1)} \cdot \prod_{l=2}^{N-2} \sqrt{2K_{l+1} - 1} \cdot \sqrt{K_N - 1} \cdot K_N. \qquad (10)$$
Unfortunately, both (9) and (10) can be written as polynomial Diophantine equations with integer coefficients
that do not always admit solutions in $\mathbb{N}$ [18]. Usually in IR-MRAs $\delta_N$ is of the order of a few tens; thus, $D_N$ can
be explored exhaustively as shown in section IV.
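The exhaustive exploration mentioned above can be sketched in a few lines of Python. This is an illustrative model rather than the authors' tool: `ir_mra_comparators` evaluates (4) level by level and `factorizations` enumerates the ordered arrays of radices (each at least 2) that satisfy (5); both names are hypothetical.

```python
from math import comb


def ir_mra_comparators(radices):
    """Comparator count (4) for an IR-MRA with radices (K_1, ..., K_N)."""
    M = 1
    for K in radices:
        M *= K
    stages = M // radices[0]
    total = stages * (radices[0] * (radices[0] - 1) // 2)   # level 1
    for K in radices[1:]:
        stages //= K                                         # CSs at this level
        total += stages * (3 * K * (K - 1) // 2)             # levels l > 1
    return total


def factorizations(M, N):
    """Yield every ordered N-tuple of integers >= 2 whose product is M."""
    if N == 1:
        if M >= 2:
            yield (M,)
        return
    for K in range(2, M // 2 + 1):
        if M % K == 0:
            for rest in factorizations(M // K, N - 1):
                yield (K,) + rest


# Example: M = 32, N = 3 gives six arrays, matching comb(5 - 1, 3 - 1).
best = min(factorizations(32, 3), key=ir_mra_comparators)
```

With these definitions `best` is `(4, 2, 4)`, which requires 78 comparators, in agreement with the figure quoted in section IV.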
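2) IA-MRA: The formalization proposed in section II-C1 to compute C can be extended to the case of IA-MRA
as
$$C = \sum_{l=1}^{N} C_l, \qquad C_l = \sum_{i=0}^{O_l - 1} \alpha_l \, K_l[i] \cdot (K_l[i] - 1) \qquad (11)$$
where $O_l$ is the number of CSs at level l and $\alpha_{l=1} = 1/2$, $\alpha_{l>1} = 3/2$. This implies that (5) should be rewritten as
$$M = \prod_{l=1}^{N} \bar{K}_l, \quad M \in \mathbb{N}^+, \qquad \bar{K}_l = \frac{1}{O_l} \sum_{i=0}^{O_l - 1} K_l[i] \qquad (12)$$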
Figure 2. l = 1 CS architecture (radix-K): array of comparators (a), one-hot index generator for N (b), 1-bit mux-like structure (c), generation
of $t_{z,w}$ (d)
where $\bar{K}_l$ is the average radix at level l. As for the IR-MRA case, the minimization of (11) is a Diophantine problem; as a
consequence, it is not always possible to find optimal solutions explicitly. Even if $\delta_N$ tends to be large, we explored
$D_N$ exhaustively, as for the IR-MRA case. Experimental results show that in several cases IR-MRAs and IA-MRAs
achieve the same C for a given N. For the sake of brevity, we concentrate on IR-MRAs. However, in
section IV we show IA-MRA results when they have lower complexity than the corresponding IR-MRA.
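A minimal Python sketch of (11) is given below. It assumes that an IA-MRA is described by one list of radices per level, and that a CS with radix K at level l consumes K outputs of level l − 1; the names `ia_mra_comparators` and `ia_mra_is_consistent` are hypothetical.

```python
def ia_mra_comparators(levels):
    """Comparator count (11); levels[0] lists the radices of the CSs at l = 1."""
    total = 0
    for l, radices in enumerate(levels):
        weight = 1 if l == 0 else 3          # alpha_l = 1/2 at l = 1, 3/2 otherwise
        total += sum(weight * K * (K - 1) // 2 for K in radices)
    return total


def ia_mra_is_consistent(levels, M):
    """Check that every level consumes exactly the outputs of the previous one."""
    inputs = M
    for radices in levels:
        if sum(radices) != inputs:
            return False
        inputs = len(radices)                # one sorted couple per CS
    return inputs == 1


# M = 32, N = 2: four radix-5 and two radix-6 CSs at l = 1, then one radix-6 CS.
example_32 = [[5] * 4 + [6] * 2, [6]]
# M = 64, N = 4: the IA-MRA discussed in section IV.
example_64 = [[2] * 2 + [3] * 20, [2] * 11, [2] + [3] * 3, [4]]
```

The two examples reproduce the comparator counts quoted in section IV, namely C = 115 and C = 143.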
III. ARCHITECTURAL DESCRIPTION
Each CS is made of three main parts: an array of comparators, some One-Hot Index Generators (OHIGs), and
some Mux-Like Structures (MLSs).
A. Level l = 1 comparing stages
Note that in this section we consider input values belonging to one CS, namely $x_j$ with $0 \le j \le K - 1$. Thus,
indices do not correspond to the ones given in section II when $X^M$ and $X^{K_1}[i]$ were defined.
Let us define $x_p$ and $x_q$ as two inputs out of the possible K ones, $s_{p,q}$ as the sign of $x_p - x_q$ and $s_{q,p} = \overline{s_{p,q}}$,
where $\overline{(\cdot)}$ is the one's complement operator (see Fig. 2 (a))¹. If input $x_n$ is max1 then $s_{q,n} = 1$ for every q such that
$0 \le q \le K - 1$ and $q \ne n$. Now we build an array N containing K elements, where the p-th element is
$$N_p = \bigwedge_{q=0, q \ne p}^{K-1} s_{q,p} \qquad (13)$$
and $\bigwedge$ stands for the logic-and operation. As can be observed, if n is the index of max1 of the i-th CS, namely
$n = \arg(\max(X^{K_1}[i]))$, N is the One-Hot (OH) binary representation of n. Finally, N is used as the selection
signal of an MLS (see Fig. 2 (b)). According to (13) we concurrently compute all the bits of the OHIG by resorting
to K and-gates, each of which receives K − 1 $s_{q,p}$ input signals (see the dashed box in Fig. 2 (b)). The MLS can
be described as follows. Given $x_u \in X^{K_1}[i]$, $0 \le u \le K - 1$, and $x_{u,v}$ being the v-th bit of $x_u$, we obtain $y_{0,v}^{K_1}[i]$,
the v-th bit of $y_0^{K_1}[i]$, as
$$y_{0,v}^{K_1}[i] = \bigvee_{u=0}^{K-1} x_{u,v} \wedge N_u \qquad (14)$$
where $\bigvee$ is the logic-or operation. According to (14) a 1-bit MLS is made of a K-input or-gate and K 2-input
and-gates (see Fig. 2 (c)). As a consequence, we concurrently obtain each bit of $y_0^{K_1}[i]$ by replicating the 1-bit
MLS.
¹ If $x_p = x_q$, $s_{p,q}$ is not relevant as choosing either $x_p$ or $x_q$ is the same.
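The max1 path defined by (13) and (14) can be mimicked with a behavioural Python sketch. It is not the authors' VHDL: the function names are hypothetical, inputs are assumed to be non-negative integers that fit in `width` bits, and the default of six bits is borrowed from the data width used in section IV.

```python
def comparator_array(x):
    """s[p][q] = 1 when x_p - x_q is negative; one comparator per pair p < q."""
    K = len(x)
    s = [[0] * K for _ in range(K)]
    for p in range(K):
        for q in range(p + 1, K):
            s[p][q] = int(x[p] < x[q])   # sign of x_p - x_q
            s[q][p] = 1 - s[p][q]        # one's complement of s[p][q]
    return s


def one_hot_index(s):
    """OHIG, eq. (13): N_p is the AND of s[q][p] over all q != p."""
    K = len(s)
    return [int(all(s[q][p] for q in range(K) if q != p)) for p in range(K)]


def mux_like(x, sel, width=6):
    """MLS, eq. (14): bit v of the output is the OR over u of (x_u bit v AND sel_u)."""
    y = 0
    for v in range(width):
        bit = 0
        for u in range(len(x)):
            bit |= ((x[u] >> v) & 1) & sel[u]
        y |= bit << v
    return y
```

Because $s_{q,p}$ is generated as the complement of $s_{p,q}$, exactly one element of N is set even when two inputs are equal, so the selection stays one-hot.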
A similar approach is used to obtain $y_1^{K_1}[i]$, namely combining the $s_{z,w}$ signals with a blinding circuit that acts as a
mask for the position of max1². Let us define $\hat{t}_{z,w} = s_{z,w} \wedge \overline{N_w}$, where $\hat{t}_{z,w} = 1$ when $x_w \ge x_z$ and $x_w \ne y_0^{K_1}[i]$.
If input $x_n$ is max1, $s_{n,w} = 0$ and $\hat{t}_{n,w} = 0$. As a consequence, to identify max2 we cannot simply compute
$\hat{M}_w = \bigwedge_{z=0, z \ne w}^{K-1} \hat{t}_{z,w}$ as for max1. We can avoid this problem by introducing $t_{z,w} = \hat{t}_{z,w} \vee N_z$: if input $x_n$ is
max1, $N_n = 1$ and $t_{n,w} = 1$ (see Fig. 2 (d)). We can now build a second array M that contains K elements, where
the w-th element is
$$M_w = \bigwedge_{z=0, z \ne w}^{K-1} t_{z,w}. \qquad (15)$$
As for max1, if m is the index of max2 in the i-th CS, $m = \arg(\max(X^{K_1}[i] \setminus \{y_0^{K_1}[i]\}))$, M is the OH
binary representation of m and the v-th bit of $y_1^{K_1}[i]$ is obtained as
$$y_{1,v}^{K_1}[i] = \bigvee_{u=0}^{K-1} x_{u,v} \wedge M_u. \qquad (16)$$
According to (15) and (16), $y_1^{K_1}[i]$ is obtained with an OHIG and an MLS (Fig. 2 (b) and (c)) by substituting $s_{q,p}$
and N with $t_{z,w}$ and M respectively.
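The complete level-1 CS, including the blinding of max1 in (15), can be sketched as one self-contained Python function. It is a behavioural model under the same assumptions as the equations: `level1_cs` is a hypothetical name, the one-hot selection is written as a sum over the selected element instead of the bitwise MLS of (14) and (16), and the array M of (15) is called `Mv` to avoid a clash with the number of inputs.

```python
def level1_cs(x):
    """Behavioural model of a radix-K CS at l = 1: returns (max1, max2)."""
    K = len(x)
    # Array of comparators: s[p][q] is the sign of x_p - x_q, s[q][p] its complement.
    s = [[0] * K for _ in range(K)]
    for p in range(K):
        for q in range(p + 1, K):
            s[p][q] = int(x[p] < x[q])
            s[q][p] = 1 - s[p][q]
    # Eq. (13): one-hot index of max1.
    N = [int(all(s[q][p] for q in range(K) if q != p)) for p in range(K)]
    # t[z][w] = (s[z][w] AND NOT N[w]) OR N[z]: max1 can no longer win,
    # and comparisons against max1 are forced to 1.
    t = [[(s[z][w] & (1 - N[w])) | N[z] for w in range(K)] for z in range(K)]
    # Eq. (15): one-hot index of max2.
    Mv = [int(all(t[z][w] for z in range(K) if z != w)) for w in range(K)]
    # One-hot selection (behavioural equivalent of the MLS).
    y0 = sum(v for v, bit in zip(x, N) if bit)
    y1 = sum(v for v, bit in zip(x, Mv) if bit)
    return y0, y1
```

For instance, `level1_cs([5, 17, 9, 3])` selects 17 as max1 and 9 as max2 using only the K(K − 1)/2 pairwise comparisons of the comparator array.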
B. Level l > 1 comparing stages
A radix-K CS at l > 1 computes the first two max values among its K input couples $y_0^{K_{l-1}}[i \cdot K + j]$, $y_1^{K_{l-1}}[i \cdot K + j]$,
$0 \le i \le M/K^l - 1$, $0 \le j \le K - 1$. The max1 output by the i-th CS ($y_0^{K_l}[i]$) is obtained by processing
the K input values $y_0^{K_{l-1}}[i \cdot K + j]$ with the same architecture used for l = 1.
Let us define $Y_0^{K_{l-1}}[i]$ and $Y_1^{K_{l-1}}[i]$ as the two sets containing the $y_0^{K_{l-1}}[i \cdot K + j]$ and $y_1^{K_{l-1}}[i \cdot K + j]$ values
respectively (each set contains K values). As far as max2 is concerned, two cases are possible: i) $y_1^{K_l}[i] \in Y_0^{K_{l-1}}[i]$,
namely $y_1^{K_l}[i]$ is one of the max1 values received as inputs; ii) $y_1^{K_l}[i] \in Y_1^{K_{l-1}}[i]$, namely $y_1^{K_l}[i]$ is one of the max2 values. Case
i) can be solved with the same architecture used for l = 1. On the other hand, case ii) requires comparing
$y_0^{K_{l-1}}[i \cdot K + p]$ and $y_1^{K_{l-1}}[i \cdot K + q]$ with $0 \le p, q \le K - 1$ and $p \ne q$. Let us define $y_0^{K_{l-1}}[a]$ and $y_1^{K_{l-1}}[b]$ with
$a \ne b$ as one element of $Y_0^{K_{l-1}}[i]$ and $Y_1^{K_{l-1}}[i]$ respectively. Then, the sign of $y_0^{K_{l-1}}[a] - y_1^{K_{l-1}}[b]$ is $s^{0,1}_{a,b}$. If f is
the index of the element in $Y_1^{K_{l-1}}[i]$ that is max2 of the i-th CS, F is its OH binary representation, where the b-th
element is $F_b = \bigwedge_{a=0, a \ne b}^{K-1} s^{0,1}_{a,b}$. A further MLS selects $y_1^{K_l}[i]$ as the f-th element of $Y_1^{K_{l-1}}[i]$. To complete the
architecture (see Fig. 3) we need to infer whether $y_1^{K_l}[i]$ belongs to $Y_0^{K_{l-1}}[i]$ or to $Y_1^{K_{l-1}}[i]$. This selection is accomplished
by observing that if $y_1^{K_l}[i] \in Y_1^{K_{l-1}}[i]$ then $r = \bigvee_{b=0}^{K-1} F_b = 1$.
² The formulation proposed in the following should ease understanding the underlying idea even if from a formal point of view it has some
redundancy.
Figure 3. l > 1 CS architecture (radix-K)
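The selection logic of a CS at l > 1 can be summarized with the following behavioural Python sketch. It assumes each input couple is already sorted (max1 first), models the one-hot selections with sums rather than bitwise MLSs, and uses hypothetical names (`sign_matrix`, `select`, `upper_level_cs`).

```python
def sign_matrix(v):
    """s[p][q] = 1 when v_p - v_q is negative; s[q][p] is its complement."""
    K = len(v)
    s = [[0] * K for _ in range(K)]
    for p in range(K):
        for q in range(p + 1, K):
            s[p][q] = int(v[p] < v[q])
            s[q][p] = 1 - s[p][q]
    return s


def select(values, one_hot):
    """Behavioural MLS: return the element flagged by a one-hot (or all-zero) array."""
    return sum(v for v, bit in zip(values, one_hot) if bit)


def upper_level_cs(couples):
    """Return (max1, max2) among K sorted couples (y0_j, y1_j) with y0_j >= y1_j."""
    y0 = [c[0] for c in couples]
    y1 = [c[1] for c in couples]
    K = len(couples)
    # max1 and case i): same architecture as l = 1, applied to the y0 values.
    s = sign_matrix(y0)
    N = [int(all(s[q][p] for q in range(K) if q != p)) for p in range(K)]
    t = [[(s[z][w] & (1 - N[w])) | N[z] for w in range(K)] for z in range(K)]
    Mv = [int(all(t[z][w] for z in range(K) if z != w)) for w in range(K)]
    # Case ii): F_b = 1 when y1_b exceeds every y0_a with a != b (K(K-1) comparisons).
    F = [int(all(y0[a] < y1[b] for a in range(K) if a != b)) for b in range(K)]
    r = int(any(F))
    max1 = select(y0, N)
    max2 = select(y1, F) if r else select(y0, Mv)
    return max1, max2
```

Only the max2 that travels with max1 can beat all the other max1 values, so F is either all-zero or one-hot at the position of max1, and r alone decides which MLS output is forwarded.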
Table I
FRA POST SYNTHESIS RESULTS: A [µm²] AND L [ns]
            M = 8       M = 16      M = 32       M = 64
K_l = 2     11026/1.7   23105/2.2   51613/2.6    97906/3.2
K_l = 4     -           23904/1.7   -            115310/2.4
K_l = 8     16494/1.2   -           -            172474/2.3
K_l = 16    -           70250/1.5   -            -
K_l = 32    -           -           263270/1.8   -
K_l = 64    -           -           -            819798/2.3
IV. EXPERIMENTAL RESULTS
The analytical approach proposed in section II can be used to identify the set of solutions with minimum C. This
strategy is useful when the design space tends to be large, as in the IA-MRA case. Table II shows the area and latency,
for a 180 nm CMOS standard cell technology, of one comparator (data represented on six bits) and of one radix-2 CS at
l = 1 and l > 1; the complexity of the comparator(s) is about half the complexity of the CS. Moreover, the
data in Table II allow for a reasonable estimation of the area and latency of the radix-2 architectures summarized in
Table I as $A_M^{(2)} = A^{(2)}_{C_{l=1}} \cdot \frac{M}{2} + A^{(2)}_{C_{l>1}} \cdot \left(\frac{M}{2} - 1\right)$ and $L_M^{(2)} = L^{(2)}_{C_{l=1}} + (N - 1) \cdot L^{(2)}_{C_{l>1}}$. Similar formulas can be
obtained to estimate A and L for higher radix and MRAs.
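The two estimation formulas can be evaluated directly with the Table II figures, as in the Python sketch below. The constant and function names are hypothetical, M is assumed to be a power of two, and latencies are kept in picoseconds to stay in integer arithmetic.

```python
# Table II figures for radix-2 CSs: area in µm², latency in ps.
A_CS_L1, L_CS_L1 = 880, 500      # CS at l = 1
A_CS_LN, L_CS_LN = 2503, 600     # CS at l > 1


def radix2_estimate(M):
    """Estimate (area [µm²], latency [ps]) of a radix-2 tree with M inputs."""
    N = M.bit_length() - 1                       # number of levels, log2(M)
    area = A_CS_L1 * (M // 2) + A_CS_LN * (M // 2 - 1)
    latency_ps = L_CS_L1 + (N - 1) * L_CS_LN
    return area, latency_ps
```

For M = 8 the estimate is 11029 µm² and 1700 ps, close to the 11026 µm² and 1.7 ns reported in Table I; for larger M the estimate drifts from the synthesized figures, as expected from a simple additive model.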
Table II
COMPARATOR AND RADIX-2 CS AT l = 1 AND l > 1 POST SYNTHESIS RESULTS: A [µm²] AND L [ps]
comparator        CS @ l = 1                                   CS @ l > 1
$A_C / L_C$       $A^{(2)}_{C_{l=1}} / L^{(2)}_{C_{l=1}}$      $A^{(2)}_{C_{l>1}} / L^{(2)}_{C_{l>1}}$
467/400           880/500                                      2503/600
Table III
POST P&R FRA AND MRA RESULTS: A [µm²] AND L [ns]

M = 32
  FR (K_l)               (2)         73355/2.6
  MR N = 2 (K1/K2)       (16/2)      157643/2.0
                         (2/16)      281772/2.0
                         (4/8)       96211/2.0
                         (8/4)       94130/2.0
  MR N = 3 (K1/K2/K3)    (4/4/2)     75780/2.2
                         (4/2/4)     72651/2.2
                         (2/4/4)     99709/2.2
                         (8/2/2)     929011/2.2
                         (2/8/2)     155873/2.2
                         (2/2/8)     121188/2.2

M = 64
  FR (K_l)                  (2)          137261/3.2
                            (4)          160219/2.4
  MR N = 4 (K1/K2/K3/K4)    (4/4/2/2)    155423/2.6
                            (4/2/4/2)    150217/2.6
                            (4/2/2/4)    151835/2.6
                            (2/4/2/4)    190267/2.6
                            (2/4/4/2)    206287/2.6
                            (2/2/4/4)    178323/2.6
                            (8/2/2/2)    184369/2.6
                            (2/8/2/2)    304103/1.6
                            (2/2/8/2)    221290/2.6
                            (2/2/2/8)    194228/2.6

M = 24
  FR (K_l)               (2)         50868/2.5
  MR N = 3 (K1/K2/K3)    (6/2/2)     56451/2.0
                         (2/6/2)     94163/2.0
                         (2/2/6)     74957/2.0
                         (2/3/4)     64922/2.0
                         (2/4/3)     73040/2.0
                         (3/2/4)     56480/2.0
                         (3/4/2)     58376/2.0
                         (4/2/3)     51560/2.0
                         (4/3/2)     55276/2.0
To highlight the area/latency trade-off we define $\eta_M = (A_M \cdot L_M)/(A_M^{(2)} \cdot L_M^{(2)})$, where $A_M$ and $L_M$ are the area
and the latency of a generic radix architecture with M inputs. An architecture that halves the latency at the expense
of doubling the area with respect to the radix-2 solution has $\eta_M = 1$. Thus, architectures of particular interest are
the ones with $\eta_M < 1$.
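The figure of merit is a plain ratio of area-latency products, as the short Python sketch below shows; the symbol name and the function name `eta` are assumptions made for illustration, and the sample numbers are the radix-4 and radix-2 entries of Table I.

```python
def eta(area, latency, area_radix2, latency_radix2):
    """Area-latency product of an architecture normalized to the radix-2 one."""
    return (area * latency) / (area_radix2 * latency_radix2)


# Radix-4 FRAs from Table I (areas in µm², latencies in ns).
eta_16 = eta(23904, 1.7, 23105, 2.2)
eta_64 = eta(115310, 2.4, 97906, 3.2)
```

Rounded to two decimals these evaluate to 0.8 and 0.88, and an architecture with twice the area and half the latency of the radix-2 reference gives exactly 1.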
Results shown in Table I highlight that FRAs do not allow the A/L trade-off to be tuned: for M = 32 there are
only two solutions, radix-2 and radix-32. In the radix-2 solution the area is 80% lower and the latency 30% larger than in the
radix-32 case. Moreover, the only FRAs with $\eta_M < 1$ are the radix-4 ones: $\eta_{16}^{(4)} = 0.8$ and $\eta_{64}^{(4)} = 0.88$.
To obtain more accurate results [19], the architectures, developed in VHDL, have been synthesized with Synopsys Design
Compiler for shortest delay, then Placed and Routed (P&R) with Cadence Encounter using a 180 nm CMOS standard
cell technology at 0 °C and with a supply voltage of 1.95 V.
Even if the expression of $M_w$ (15) can be optimized by hand (each element $t_{z,w}$ contains the common term
$\overline{N_w}$), we prefer to leave the task of logic minimization to the logic synthesizer to explore a larger space of
complexity/performance trade-offs. Experimental results shown in the following for [14]–[16] have been reproduced
for a fair comparison with the proposed solutions for six-bit data width. We show in Table III the experimental
results obtained for (M = 32, N = 2), (M = 32, N = 3), (M = 64, N = 4) and (M = 24, N = 3) respectively,
where the best result is highlighted in bold. It is interesting to observe that IA-MRAs exist for the cases shown in Table
III. For the case M = 32, N = 2 the best solution requires C = 130 and 94130 µm². However, with four
radix-5 and two radix-6 CSs at l = 1 and $K_2 = 6$ we obtain C = 115. Even if this solution minimizes C, it requires
5-input and 6-input and/or-gates, leading to an area of 104903 µm², about 10% more than the best IR-MRA.
IA-MRAs for M = 32, N = 3 require in the best case C = 78 and 72651 µm², the same result obtained with
$K_1 = 4$, $K_2 = 2$, $K_3 = 4$. When M = 64, N = 4 the best solution requires C = 159 and 150217 µm². However,
with two radix-2 and twenty radix-3 CSs at l = 1, $K_2 = 2$, one radix-2 and three radix-3 CSs at l = 3 and $K_4 = 4$ we
obtain C = 143 and A = 145801 µm², about 3% less than the best IR-MRA.
Table IV
POST P&R EXPERIMENTAL RESULTS: A [µm²] AND L [ns] COMPARISONS

M    Architecture                      A/L          $\eta_M$
6    [15]                              13451/1.5    > 1
6    [16]                              10158/1.6    -
6    Proposed, K_l = 6                 13197/1.1    0.89
6    Proposed, K1 = 2, K2 = 3          11329/1.4    0.98
6    Proposed, K1 = 3, K2 = 2          8827/1.4     0.76
7    [15]                              13718/1.7    > 1
7    [16]                              13390/1.7    -
7    Proposed, K_l = 7                 15184/1.1    0.73
7    Proposed, K1 = 4; 3, K2 = 2       13472/1.3    0.77
8    [14]                              17617/2.1    > 1
8    [16]                              13640/1.7    -
8    Proposed, K_l = 8                 23343/1.2    > 1
8    Proposed, K1 = 2, K2 = 4          21938/1.4    > 1
8    Proposed, K1 = 4, K2 = 2          13799/1.4    0.83
The proposed architecture can also be employed to reduce L when M is not a power of two. As an example, if
M = 9 a radix-2 solution imposes an unbalanced tree structure with N = 4. The implementation of such a structure
leads to A = 21025 µm² and to L = 2 ns. On the other hand, with a FRA-3, corresponding to N = 2, we obtain
A = 14426 µm² and L = 1.6 ns. It is worth noting that there is no IA-MRA for M = 9, N = 2 that performs
better than $K_1 = K_2 = 3$. Similarly, for M = 24 a radix-2 solution has an unbalanced tree structure with N = 5.
On the contrary, with N = 3 we have nine possible MRAs. As can be observed, the 4/2/3 MRA improves the
latency by 25% with respect to the FRA-2 and requires an area overhead of less than 1.1%. For M = 24, N = 3
IA-MRAs exist; however, they require C = 54, the same as the best solution reported in Table III.
In Table IV we compare our MRAs with other approaches proposed for finding the first two min values in LDPC
decoders [14], [15]³: the considered cases are M = 8 for [14], and M = 6, M = 7 for [15]. For the cases M = 6 and
M = 7 we also consider the unbalanced radix-2 tree proposed in [16] for a generic M, whereas, for the case
M = 7 we use an IA-MRA: one radix-4 and one radix-3 CS at l = 1 ($K_1 = 4; 3$) and $K_2 = 2$.
For the MRAs in Table III we observe that when M = 32 the best solution with N = 2 (8/4) leads to $\eta_{32} > 1$,
whereas the best solution for N = 3 (4/2/4) achieves $\eta_{32} = 0.84$. When M = 64, N = 4 the best solution (4/2/4/2)
gives $\eta_{64} = 0.89$, which is slightly worse than the FRA $K_l = 4$ ($\eta_{64}^{(4)} = 0.88$). Finally, for M = 9, $K_l = 3$ we have
$\eta_9^{(3)} = 0.55$; for M = 24 the best solution (4/2/3) achieves $\eta_{24} = 0.81$.
³ Even if in [14] and [15] min values are found we will refer to max values.
V. CONCLUSIONS
In this brief high speed architectures for finding the first two max/min values are presented. The proposed solution
extends previous works based on radix-2 and radix-3 solutions to both higher and mixed radix solutions. As shown by
experimental results MRAs achieve lower latency than radix-2 architectures with a limited area increase. Moreover,
MRAs show better figures than other solutions proposed for LDPC decoders.
REFERENCES
[1] M. M. Mansour and N. R. Shanbhag, “High-throughput LDPC decoders,” IEEE Trans. on VLSI, vol. 11, no. 6, pp. 976–996, Dec 2003.
[2] O. Muller, A. Baghdadi, and M. Jezequel, “Exploring parallel processing levels for convolutional turbo decoding,” in IEEE International
Conf. on Information and Communications Tech.: from Theory to Applications, 2006, pp. 2353–2358.
[3] F. Guilloud, E. Boutillon, and J. L. Danger, “λ-min decoding algorithm of regular and irregular LDPC codes,” in International Symposium
on Turbo Codes and Related Topics, 2003, pp. 451–454.
[4] J. Chen, A. Dholakia, E. Eleftheriou, M. Fossorier, and X. Y. Hu, “Reduced-complexity decoding of LDPC codes,” IEEE Trans. on
Communications, vol. 53, no. 8, pp. 1288–1299, Aug 2005.
[5] D. Oh and K. K. Parhi, “Min-sum decoder architectures with reduced word length for LDPC codes,” IEEE Trans. on Circuits and Systems
I, vol. 57, no. 1, pp. 105–115, Jan 2010.
[6] A. Cevrero, Y. Leblebici, P. Ienne, and A. Burg, “A 5.35 mm2 10GBASE-T ethernet LDPC decoder chip in 90 nm CMOS,” in IEEE
Asian Solid-State Circuits Conf., 2010, pp. 1–4.
[7] S. Papaharalabos, M. Sybis, P. Tyczka, and P. T. Mathiopoulos, “Modified Log-MAP algorithm for simplified decoding of turbo and turbo
TCM codes,” in IEEE Vehicular Tech. Conf. (Spring), 2009, pp. 1–5.
[8] Y. Sun and J. R. Cavallaro, “Low-complexity and high-performance soft MIMO detection based on distributed M-algorithm through
trellis-diagram,” in IEEE International Conf. on Acoustics, Speech and Signal Processing, 2010, pp. 3398–3401.
[9] D. Patel, V. Smolyakov, M. Shabany, and P. G. Gulak, “VLSI implementation of a WiMAX/LTE compliant low-complexity high-throughput
soft-output K-Best MIMO detector,” in IEEE International Symposium on Circuits and Systems, 2010, pp. 593–596.
[10] P. Tsai, W. Chen, X. Lin, and M. Huang, “A 4×4 64-QAM reduced-complexity K-Best MIMO detector up to 1.5Gbps,” in IEEE International
Symposium on Circuits and Systems, 2010, pp. 3953–3956.
[11] D. Declercq and M. Fossorier, “Decoding algorithms for nonbinary LDPC codes over GF(q),” IEEE Trans. on Communications, vol. 55,
no. 4, pp. 633–643, Apr 2007.
[12] R. M. Pyndiah, “Near-optimum decoding of product codes: block turbo codes,” IEEE Trans. on Communications, vol. 46, no. 8, pp.
1003–1010, Aug 1998.
[13] C. Leroux, C. Jego, P. Adder, M. Jezequel, and D. Gupta, “A highly parallel turbo product code decoder without interleaving resource,”
in IEEE Workshop on Signal Processing Systems, 2008, pp. 1–6.
[14] K. Gunnam, G. Choi, and M. Yeary, “A parallel VLSI architecture for layered decoding for array LDPC codes,” in IEEE International
Conf. on VLSI Design, 2007, pp. 738–743.
[15] X. Y. Shih, C. Z. Zhan, C. H. Lin, and A. Y. Wu, “An 8.29 mm² 52 mW multi-mode LDPC decoder design for mobile WiMAX system
in 0.13 µm CMOS process,” IEEE Journal of Solid-State Circuits, vol. 43, no. 3, pp. 672–683, Mar 2008.
[16] C. L. Wey, M. D. Shieh, and S. Y. Lin, “Algorithms of finding the first two minimum values and their hardware implementation,” IEEE
Trans. on Circuits and Systems I, vol. 55, no. 11, pp. 3430–3437, Dec 2008.
[17] K. Gunnam, G. Choi, W. Wang, and M. Yeary, “Parallel VLSI architecture for layered decoding,” Texas A&M University, Tech. Rep.,
May 2007, available online at http://dropzone.tamu.edu/TechReports.
[18] Y. Matijasevic, “Enumerable sets are diophantine,” Doklady Akademiky Nauk SSSR, vol. 11, pp. 279–282, 1970, English translation: Soviet
Math Doklady 11, 354-357.
[19] A. Pulimeno, M. Graziano, and G. Piccinini, “UDSM trends comparison: From technology roadmap to UltraSparc Niagara2,” IEEE Trans.
on VLSI, 10.1109/TVLSI.2011.2148183, to appear.
ACKNOWLEDGMENT OF FINANCIAL SUPPORT
This work is partially supported by the NEWCOM++ NoE.
AFFILIATION OF AUTHORS
The authors are with Dipartimento di Elettronica - Politecnico di Torino - Italy.
