Abstract: Very-large-scale integration (VLSI) architectures for finding the first W (W > 2) maximum (or minimum) values are required in the implementation of several applications such as nonbinary low-density parity-check decoders, K-best multiple-input-multiple-output (MIMO) detectors, and turbo product codes. In this brief, a parallel radix-sort-based VLSI architecture for finding the first W maximum (or minimum) values is proposed. The described architecture, called Bit-Wise-And (BWA) architecture, relies on analyzing input data from the most significant bit to the least significant one, with very simple logic circuits. One key feature of the BWA architecture is its high level of scalability, which enables the adoption of this solution in a large spectrum of applications, corresponding to large ranges for both W and the size of the input data set. Experimental results, achieved by implementing the proposed architecture on a high-speed 90-nm CMOS standard-cell technology, show that the BWA architecture requires significantly less area than other solutions available in the literature, i.e., about 50% or less in all the considered cases, worst case included. Moreover, the BWA architecture exhibits the lowest area-delay product in almost all the considered cases.
Manuscript received April 9, 2014.
I. INTRODUCTION
Sorting is a well-established problem in computer science [1] and is a key operation in several applications. The hardware implementation of sorting networks has also been addressed in the past [1]. On the other hand, very-large-scale integration (VLSI) architectures for partial sorting, which can also be derived from selection networks (SNs) [2], are part of different algorithms in the communication field. Partial sorting is employed, for example, in [3]-[6] for the decoding of turbo and binary low-density parity-check (LDPC) codes, in [7] for the maximum-likelihood decoding of arithmetic codes, and in [8]-[10] for K-best MIMO detectors, nonbinary LDPC decoders, and turbo product codes, respectively. Circuits for finding the first two minimum values are used in binary LDPC decoder architectures [11]-[13] to implement min-sum approximations [6], [14], and recently, they have also been proposed for the case of nonbinary LDPC decoders [15]. However, very few works, e.g., [16] and [17], investigate the general problem of implementing parallel architectures for finding the first two maximum/minimum values with a systematic approach. Finding the first W > 2 maximum/minimum values is required in 1) K-best MIMO detectors [18], [19]; 2) nonbinary LDPC decoders [20], [21]; and 3) turbo product codes [22]. Unfortunately, to the best of our knowledge, no papers in the open literature present a general analysis for the case of W > 2. Stemming from the work described in [2] for sorting networks, in [1], a comparator-based SN is proposed. However, as argued in [1], other approaches, such as the one referred to as radix sorting, can be used as well. Radix sorting algorithms rely on the bitwise analysis of the data to be sorted and can be extended to selection and partial sorting problems. This brief proposes a parallel VLSI architecture relying on the radix sorting approach for finding the first W > 2 maximum/minimum values in a set of M values.
Namely, the proposed solution, referred to as Bit-Wise-And (BWA) architecture, works by analyzing the M candidates from the most significant bit (MSB) to the least significant bit (LSB).
The rest of this brief is structured as follows. The problem formulation and the proposed architecture are described in Section II. In particular, in Section II-A, an initial version of the BWA architecture is presented. Then, the complete version is developed in Section II-B. Section III deals with performance results and comparisons. Finally, in Section IV, the conclusion is drawn.
II. PROBLEM FORMULATION AND PROPOSED ARCHITECTURE
According to the authors in [17], we can state the problem of finding the first W maximum/minimum values as follows. Given a set $X^{(M)} = \{x_0, \ldots, x_{M-1}\}$ made of M elements, we want to find the first W maximum/minimum values, namely, $y_0^{(M)} = \max(X^{(M)})$ and $y_q^{(M)} = \max(X^{(M)} \setminus \{y_0^{(M)}, \ldots, y_{q-1}^{(M)}\})$ for $q = 1, \ldots, W-1$ (similarly substituting max with min). For the sake of simplicity, in the following, we will discuss only the max case.
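As a behavioral reference for the problem statement above, the following Python sketch extracts the first W maxima by repeatedly taking the maximum of the remaining candidates. The function name `first_w_max` is an illustrative choice, not a name used in this brief, and the sketch says nothing about the hardware structure; it only fixes the expected input/output behavior.

```python
def first_w_max(values, w):
    """Return the first w maxima of `values`, largest first (reference model)."""
    remaining = list(values)              # work on a copy of the input set
    result = []
    for _ in range(min(w, len(remaining))):
        y = max(remaining)                # maximum of what is left
        result.append(y)
        remaining.remove(y)               # drop one occurrence of that maximum
    return result


print(first_w_max([5, 9, 3, 12], 3))  # [12, 9, 5]
```

Such a model is mainly useful as a golden reference when checking a bit-level description against random inputs.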
A. Initial BWA Architecture Description
Radix sorting relies on the analysis of the $X^{(M)}$ values bit by bit from the MSB to the LSB. In the following, for the sake of simplicity, we will assume that the values in $X^{(M)}$ are nonnegative. It is worth noting that 2's complement values can be straightforwardly used as well. Indeed, any set of N-bit 2's complement values can be converted to nonnegative values, preserving the order relation, by flipping the MSB. Thus, let $x_{i,j}$ denote the jth bit of $x_i$ and let H be the matrix whose ith column $h_i$ is made of the bits $h_{i,j}$, with
$h_{i,j} = h_{i,j+1} \wedge x_{i,j}$ (1)
for $j = N-3, \ldots, 0$ with $\wedge$ representing the logic-and operation. If the MSB of all the $x_i$ values is "1" and all the $x_i$'s are monotonic sequences of bits, i.e., only a transition from "1" to "0" is allowed as in the four $x_i$ values of Example 1, then analyzing the content of $h_i$ for $i = 0, \ldots, M-1$ from the LSB to the MSB allows the first W maximum values to be found.
Example 1: [matrix H for four monotonic $x_i$ values]
If $X^{(M)}$ contains only distinct elements, the following can occur: 1) moving from the MSB row to the LSB row, all rows of H are different up to a certain $\bar{j}$, and then, for $j < \bar{j}$, all the rows are zero ($\bar{j} = 0$ in Example 1); and 2) moving from the LSB row to the MSB row, after the first nonzero row, each row shows one additional "1" along a new column. As a consequence, moving from the LSB row to the MSB row of H, the columns in which the "1"s of the first W nonzero rows appear are the positions of the first W maximum values. Since, in general, $x_i$ is not a monotonic sequence and repeated elements can exist in $X^{(M)}$, modifications are required to employ the BWA technique effectively.
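The reading rule described above can be checked with a small Python model. This is only a sketch under stated assumptions: the inputs are distinct N-bit monotonic sequences whose MSB is "1", the recursion is seeded with the MSB itself (an assumption of this sketch), and the name `initial_bwa_positions` is hypothetical.

```python
def initial_bwa_positions(values, n_bits, w):
    """Positions of the first w maxima for distinct, monotonic n_bits-wide inputs."""
    m = len(values)
    # h[j][i] is the AND of bits N-1 down to j of x_i, i.e., row j of H.
    h = [[0] * m for _ in range(n_bits)]
    for i, x in enumerate(values):
        acc = 1                                  # assumed seed: top row equals the MSB
        for j in range(n_bits - 1, -1, -1):
            acc &= (x >> j) & 1
            h[j][i] = acc
    positions = []
    for j in range(n_bits):                      # scan the rows from the LSB to the MSB
        for i in range(m):
            if h[j][i] and i not in positions:   # column of a newly appearing "1"
                positions.append(i)
        if len(positions) >= w:
            break
    return positions[:w]


x = [0b1100, 0b1111, 0b1000, 0b1110]
print(initial_bwa_positions(x, 4, 2))  # [1, 3]
```

In this run, the LSB row has a single "1" in column 1 (value 1111), and the next row adds a "1" in column 3 (value 1110), so the two largest values sit at positions 1 and 3.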
B. Completed BWA Architecture
As highlighted in Section II-A, the initial BWA principle can be employed on data that are monotonic sequences of bits whose MSB is "1." If the data in $X^{(M)}$ do not meet this requirement, the architecture does not work correctly. As an example, the case of $x_{i,\bar{j}}$ = '0' for a certain $\bar{j}$ and for every $i = 0, \ldots, M-1$ causes $h_{i,j}$ = '0' for every $j \le \bar{j}$. In this case, the architecture cannot distinguish among different $x_i$'s. A similar problem arises if two or more $x_i$ values are nonmonotonic sequences of bits. Thus, we add some gates to handle these cases, referred to as zero-row conditions. To this purpose, we modify (1) as $h_{i,j} = \hat{h}_{i,j+1} \wedge x_{i,j}$, where
$\hat{h}_{i,j} = h_{i,j}$ if $z_j$ = '1', and $\hat{h}_{i,j} = \hat{h}_{i,j+1}$ otherwise (2)
$z_j = \bigvee_{i=0}^{M-1} h_{i,j}$ (3)
$\vee$ is the logic-or operation, and $z_j$ = '0' detects a zero-row condition. An example is given hereinafter.
Example 2: [modified matrix $\hat{H}$]
Example 2 shows a simple case, where the modified H matrix ($\hat{H}$), which is an N × M matrix, is given. Indeed, as explained in the next paragraphs, the maximum values are selected by checking the $\hat{h}_{i,0}$ values. Handling zero rows leads to the slice architecture depicted in light gray in Fig. 1, where each slice corresponds to one row of $\hat{H}$. The bottom part of Fig. 1 shows the circuit to implement (2) and (3), where $\hat{h}_{i,j}$ acts as $h_{i,j}$, but if a zero-row condition occurs, then $\hat{h}_{i,j} = \hat{h}_{i,j+1}$. As can be observed in the modified H matrix, the proposed structure ensures $\hat{h}_{i,0}$ = '1' for at least one value of $i = 0, \ldots, M-1$.
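A behavioral Python sketch of the zero-row handling may help to follow how the bottom row of the modified H matrix ends up flagging the maximum. It assumes nonnegative N-bit inputs and seeds the recursion with an all-ones row above the MSB (an assumption of this sketch); the function name `max_flags` and the variable names are hypothetical.

```python
def max_flags(values, n_bits):
    """Return the bottom row of the modified H matrix: 1 marks a maximum."""
    m = len(values)
    h_hat = [1] * m                          # assumed seed row above the MSB
    for j in range(n_bits - 1, -1, -1):      # one slice per bit, MSB first
        h = [h_hat[i] & ((values[i] >> j) & 1) for i in range(m)]
        if any(h):                           # M-input OR: the row is not all zeros
            h_hat = h                        # normal case: the row acts as h
        # zero-row condition: keep the previous row unchanged
    return h_hat


print(max_flags([5, 9, 3, 9], 4))  # [0, 1, 0, 1]
```

Because a zero row never overwrites the previous one, at least one flag is always set at the end, and repeated maxima are all flagged.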
Thus, the selection of the maximum values in the proposed architecture is performed by checking the $\hat{h}_{i,0}$ values. Let I be the set of indices i for which $\hat{h}_{i,0}$ = '1'. To simplify the selection, we use a circuit referred to as output generation circuit that, based on the $\hat{h}_{i,0}$ values, is able to find the maximum among M elements and to produce a new set of M elements $X' = \{x'_0, \ldots, x'_{M-1}\}$ where the maximum value is replaced by zero. Thus, the complete architecture, shown in Fig. 2, is made of W stages, where each stage contains one instance of the circuit to produce the modified H (light gray part) and one instance of the output generation circuit (dark gray part). As a consequence, the qth stage finds $y_q^{(M)}$, which corresponds to the maximum value of the qth input set. This operation is accomplished by means of the output generation circuit shown in dark gray in Fig. 3, where $\chi_{q,i}$ is the ith input of the qth stage and the terms $y_{q,j}^{(M)}$ and $\chi_{q,i,j}$ represent the jth bit of $y_q^{(M)}$ and $\chi_{q,i}$, respectively (see the selection block in the left part of Fig. 3). As an example, for q = 0 and q = 1, we have $\chi_{0,i} = x_i$ and $\chi_{1,i} = x'_i$, respectively. Finally, the qth maximum is obtained as
$y_{q,j}^{(M)} = \bigvee_{i=0}^{M-1} (\hat{h}_{i,0} \wedge \chi_{q,i,j})$
for $j = 0, \ldots, N-1$, with $\hat{h}_{i,0}$ computed on the inputs of the qth stage, corresponding to the N combiners, each made of an M-input logic-or in the bottom part of Fig. 3. Pipelining the proposed architecture improves the throughput but leads to an area overhead. As an example, adding one pipeline register between each of the W stages in Fig. 2 (i.e., W − 1 pipeline registers) implies adding $W - 1 - q$ registers to each $y_q^{(M)}$ to increase the throughput by about W times.
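The W-stage structure can be modeled behaviorally in a few lines of Python. The sketch below rests on assumptions that are not taken from the figures: every input flagged as maximum is cleared before the next stage (so a repeated value is reported only once), inputs are nonnegative N-bit integers, and the names `max_flags` and `bwa_first_w` are hypothetical.

```python
def max_flags(values, n_bits):
    """Bottom row of the modified H matrix: 1 marks a maximum of `values`."""
    h_hat = [1] * len(values)                # assumed seed row above the MSB
    for j in range(n_bits - 1, -1, -1):
        h = [f & ((x >> j) & 1) for f, x in zip(h_hat, values)]
        if any(h):                           # skip the update on a zero row
            h_hat = h
    return h_hat


def bwa_first_w(values, n_bits, w):
    """Behavioral model of w cascaded stages; returns the maxima, largest first."""
    chi = list(values)                       # inputs of stage 0
    outputs = []
    for _ in range(w):                       # one loop iteration per stage
        flags = max_flags(chi, n_bits)
        y = 0
        for f, x in zip(flags, chi):
            if f:
                y |= x                       # OR-combine the selected inputs
        outputs.append(y)
        # Next-stage inputs: flagged values are replaced by zero (assumption:
        # all flagged inputs are cleared, including repeated maxima).
        chi = [0 if f else x for f, x in zip(flags, chi)]
    return outputs


print(bwa_first_w([5, 9, 3, 12], 4, 3))  # [12, 9, 5]
```

The integer OR in the inner loop stands for the N bitwise combiners: since all flagged inputs carry the same value, OR-ing them returns that value unchanged.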
III. EXPERIMENTAL RESULTS AND COMPARISONS
Experimental results obtained in the context of K-best MIMO detectors, nonbinary LDPC decoder architectures, and turbo product code architectures are shown and compared with solutions presented in the literature. Since several works do not give complete synthesis results, we reimplement the solutions presented in [18], [20], [22], and [23] as stand-alone units. Moreover, we include the partial-bitonic architecture proposed in [24]. Finally, the SN derived from [2] and proposed in [1] is summarized in the following paragraphs and included in the comparison as well.
A. SN Review
The SN described in [1] is a special case of sorting network that moves the W largest values out of M = 2W inputs to the first W output lines (2W/W SN). It relies on two W-element sorters and a 2W-element pruned merger, depicted as two solid-line boxes and one dashed-line box, respectively, in Fig. 4(a). In this brief, the sorters are implemented as even-odd Batcher sorting networks [1] and the pruned merger is made of W comparators to select the W largest values. As argued by Knuth [1], this network can be extended to the case of M = n · W by using n − 1 instances of the 2W/W SN. Unfortunately, the W maximum values obtained with this solution are not sorted. Thus, for a fair comparison with the proposed BWA architecture, the SN is connected to a further W-element sorter. The general block scheme of the SN is shown in Fig. 4(b).
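To make the 2W/W structure more concrete, the following Python sketch models it behaviorally. The two W-element sorters are replaced by Python's `sorted` rather than Batcher networks, and the pruned merger is modeled as W comparators pairing the ith largest value of one half with the ith smallest value of the other half; this pairing is a standard way to select the W largest of two sorted lists, but it is an assumption about the merger wiring. The function names are hypothetical.

```python
def sn_2w_w(inputs, w):
    """2W/W selection: return the w largest of 2w inputs, in no particular order."""
    assert len(inputs) == 2 * w
    a = sorted(inputs[:w], reverse=True)     # first W-element sorter
    b = sorted(inputs[w:], reverse=True)     # second W-element sorter
    # Pruned merger: W comparators, each keeping the larger of a crossed pair.
    return [max(a[i], b[w - 1 - i]) for i in range(w)]


def sn_first_w(values, w):
    """Extension to M = n * w inputs with n - 1 instances, plus a final sorter."""
    assert len(values) % w == 0 and len(values) >= 2 * w
    best = list(values[:w])
    for k in range(w, len(values), w):       # n - 1 instances of the 2W/W SN
        best = sn_2w_w(best + list(values[k:k + w]), w)
    return sorted(best, reverse=True)        # further W-element sorter


print(sn_2w_w([7, 1, 9, 4, 8, 3, 2, 6], 4))                    # [9, 7, 6, 8]
print(sn_first_w([7, 1, 9, 4, 8, 3, 2, 6, 10, 0, 5, 11], 4))   # [11, 10, 9, 8]
```

The first call shows why the extra sorter is needed: the selected values are the four largest, but they leave the merger unsorted.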
B. K-Best MIMO Detectors
In the K-best MIMO detectors detailed in [18], [23], and [25]-[27], we observe that, for 16-QAM and 64-QAM modulations (Q = 16 and Q = 64), at least 5-best and 10-best nodes (W = 5 and W = 10) are required, respectively. Moreover, according to [18], [23], [26], and [27], a typical value for the data width is 16 bits, i.e., N = 16. Therefore, for real-value detectors, we have $M = W \cdot \sqrt{Q}$, namely, M = 20 and M = 80 for 16-QAM and 64-QAM, respectively. Both the architectures proposed in [18] and [23] deal with a 4 × 4 64-QAM K-best MIMO detector. In particular, Shabany and Gulak [18] rely on the bubble-sort approach, whereas in [23], the tree-sort approach is applied.
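The input-set sizes quoted above follow directly from M = W · √Q. The short Python check below reproduces them; the helper name `candidates` is illustrative only.

```python
from math import isqrt


def candidates(w, q):
    """Number of candidates M = W * sqrt(Q) for a real-value K-best detector."""
    return w * isqrt(q)                      # Q is a perfect square (16 or 64)


for w, q in [(5, 16), (10, 64)]:
    print(f"{q}-QAM, W = {w}: M = {candidates(w, q)}")
# 16-QAM, W = 5: M = 20
# 64-QAM, W = 10: M = 80
```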
C. Nonbinary LDPC Decoder Architectures
The nonbinary LDPC decoder architectures proposed in [20] and [21] deal with codes in GF(32) and GF(64), i.e., M = 32 and M = 64, respectively. Moreover, they operate on a reduced number of messages, at least 16, i.e., W = 16, and we fix the data width to 5 bits (N = 5) as in [21] .
The bubble-check sorter proposed in [20] relies on a simplified extended min-sum (EMS) algorithm for check node processing that reduces the original EMS complexity from the order of $W^2$ to the order of $W\sqrt{W}$. In [28], it is implemented in several sequential rounds. On the other hand, in this brief, following the original description in [20], we implemented it as a parallel architecture relying on a matrix structure. It is worth pointing out that, since the data in each row of the matrix described in [20] are supposed to be in order, our reimplementation of the bubble-check architecture has been equipped with a presorting circuit. In [21], a reduced-complexity sorter for the check node unit of a nonbinary LDPC decoder is proposed. However, such an architecture relies on $d_c$ rounds, where the M inputs are sliced and analyzed sequentially round by round. Since, in this brief, we deal with parallel sorting only, the architecture proposed in [21] is not considered in the comparison.
D. Turbo Product Code Architectures
In the Chase-Pyndiah algorithm [10] , a selection of the least reliable bits is necessary. As an example, in [22] , a parallel implementation of turbo product codes that require parallel partial sorting is addressed. Thus, in this section, results for M = 32, 64 and W = 3, 4 are presented. The data width is 5 bits, i.e., N = 5, as in [22] . Since the sorter architecture in [22] is optimized for the case of M = 32 and W = 3, it cannot be straightforwardly extended to other cases.
E. Comparisons
The BWA architecture and the reference architectures in [1], [18], [20], and [22]-[24] are all described in VHSIC hardware description language (VHDL), simulated with ModelSim, synthesized using Synopsys Design Compiler, and then placed and routed (P&R) using Cadence SoC Encounter on a 90-nm CMOS standard-cell technology, where a two-input NAND gate occupies 2.8 μm² [29]. Owing to its scalability, the BWA architecture can be easily adapted to the whole range of M, W, and N values of the three considered applications. In Table I, area (A) and critical path delay (C) for each architecture are compared. As can be observed, the proposed BWA architecture features the lowest complexity among the solutions compared in Table I.
The area of the BWA architecture is indeed less than half the area of the other solutions, and the BWA architecture has about half the complexity of the work in [22] . Moreover, the critical path delay of the BWA architecture is almost comparable with that of other implementations. Finally, if we compute the area-delay product (P = A · C), the proposed BWA architecture is comparable with the work in [22] and is better than most of the other compared solutions. Further experiments adding pipelining have shown proportional throughput increase, and area overhead for each pipeline stage of BWA architectures is always less than 35%.
IV. CONCLUSION
In this brief, a parallel radix-sort-based VLSI architecture for finding the first W (W > 2) maximum/minimum values is presented. The proposed solution, which relies on bitwise analysis of the input data, is based on a W -stage architecture, where each stage is made of simple logic circuits referred to as modified H circuit and output generation circuit, respectively. The obtained results show that BWA architectures feature lower area than other solutions proposed in the literature and competitive area-delay product.
