Abstract: The K-best detector is considered a promising technique for MIMO-OFDM detection because of its good performance and low complexity. In this paper, a new K-best VLSI architecture is presented. In the proposed architecture, the metric computation units (MCUs) expand each surviving path to only part of its branches, based on a novel expansion scheme that predetermines the ascending order of the branches by their local distances. A distributed sorter then selects the new K surviving paths from the expanded branches in a pipelined manner. Compared with the conventional K-best scheme, the proposed architecture reduces the fundamental operations by approximately 50% and 75% for the 16-QAM and 64-QAM cases, respectively, and consequently lowers the demand on hardware resources significantly. Simulation results show that the proposed architecture achieves a performance very similar to that of conventional K-best detectors. Hence, it is an efficient solution for the VLSI implementation of the K-best detector in high-throughput MIMO-OFDM systems.
Introduction
Nowadays, the need for high-throughput wireless broadband communication systems is growing. MIMO-OFDM combines multiple-input multiple-output (MIMO) and orthogonal frequency division multiplexing (OFDM) technologies. As a promising approach to high-data-rate wireless communications with high spectral efficiency, MIMO-OFDM has been adopted in many standards, such as IEEE 802.11n [1]. A challenge for MIMO-OFDM implementation, however, is to design a low-complexity detector suitable for efficient very large scale integration (VLSI) realization. Maximum likelihood detection (MLD) is optimal in theory, but it is not feasible in practical systems because its computational complexity increases exponentially with the transmission rate [2]. K-best algorithms achieve near-ML performance at a much lower complexity than MLD, so they are especially attractive for VLSI implementation and have been extensively studied recently [3−6].
In this paper, we propose a new K-best VLSI architecture for high-throughput MIMO-OFDM systems, which use a large number of antennas with a high-order quadrature amplitude modulation (QAM) scheme. Using novel expansion and sorting methods, the proposed architecture can efficiently sort out the surviving paths with a lower demand on hardware resources.
System model
We consider a MIMO-OFDM system with $N_{tx}$ transmit and $N_{rx}$ receive antennas. The information bits are mapped into M-QAM symbols, where M is the constellation order. A vertical Bell Laboratories layered space-time (V-BLAST) scheme is applied to transmit independent data streams simultaneously. On each subcarrier, the complex baseband equivalent model can be expressed as

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{v}, \quad (1)$$

where $\mathbf{y}$ is the $N_{rx}$-dimensional received symbol vector, $\mathbf{s}$ is the $N_{tx}$-dimensional transmit data symbol vector, $\mathbf{H}$ is the $N_{rx} \times N_{tx}$ channel matrix, and $\mathbf{v}$ is the $N_{rx}$-dimensional complex AWGN vector with zero mean and a variance of $\sigma^2 = N_0/2$ ($N_0/2$ is the background noise power spectral density) [4]. It is assumed that the channel matrix $\mathbf{H}$ is known by the receiver and remains constant throughout the data symbol duration.
K-best detection scheme

Conventional K-best detection algorithm
The optimal MLD algorithm can be expressed as [3]

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2. \quad (2)$$

MLD exhaustively searches through all the candidate symbol vectors to find the vector $\hat{\mathbf{s}}$ that minimizes the Euclidean distance between the received symbol vector $\mathbf{y}$ and the symbol vector $\mathbf{H}\mathbf{s}$. K-best is a breadth-first approach with a performance close to that of MLD; it considers only the K paths with the shortest partial Euclidean distances (PEDs) when proceeding to the next layer. These paths are called surviving paths. The value of K is a trade-off between detection performance and computational complexity.
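To make the cost of the exhaustive search concrete, the following is a minimal Python sketch of brute-force ML detection. The function name `ml_detect`, the QPSK alphabet, and the 2 × 2 channel values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from itertools import product

def ml_detect(H, y, constellation):
    """Exhaustive ML search: return the vector s minimizing ||y - H s||^2."""
    n_tx = len(H[0])
    best, best_dist = None, float("inf")
    # All M^N_tx candidate vectors are visited, hence the exponential cost.
    for s in product(constellation, repeat=n_tx):
        dist = sum(abs(y_r - sum(h * x for h, x in zip(row, s))) ** 2
                   for row, y_r in zip(H, y))
        if dist < best_dist:
            best, best_dist = s, dist
    return best, best_dist

# Hypothetical noise-free 2x2 example with QPSK symbols.
QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
H = [[0.9 + 0.1j, 0.3 - 0.2j],
     [-0.2 + 0.4j, 1.1 + 0.0j]]
s_true = (1 - 1j, -1 + 1j)
y = [sum(h * x for h, x in zip(row, s_true)) for row in H]
s_hat, dist = ml_detect(H, y, QPSK)
```

Even this small case visits 4^2 = 16 candidate vectors; with 64-QAM and four transmit antennas the count would be 64^4, which is why a reduced search such as K-best is attractive.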
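It is assumed that $N_{rx} \ge N_{tx}$, which is a necessary condition for the QR decomposition to yield an upper triangular matrix, and hence for the implementation of K-best detection. With the QR decomposition $\mathbf{H} = \mathbf{Q}\mathbf{R}$, where $\mathbf{Q}$ has orthonormal columns and $\mathbf{R}$ is upper triangular, the MLD problem can be rewritten as [4]

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} \|\tilde{\mathbf{y}} - \mathbf{R}\mathbf{s}\|^2, \quad \tilde{\mathbf{y}} = \mathbf{Q}^H\mathbf{y}. \quad (3)$$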
The above problem can be considered as a tree search problem with $N_{tx}$ layers, starting from the $N_{tx}$-th layer because of the upper triangular structure of the matrix $\mathbf{R}$. The symbol in each layer is detected based on the previously detected symbols. In this process, the K-best algorithm searches through all the branches of each surviving path. Therefore, a total of KM candidates must be visited and sorted in every layer. This incurs a heavy computational load for systems that use a large number of antennas together with a high-order constellation.
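The conventional tree search described above can be sketched as follows. This is a behavioral illustration on a real-valued triangular system, not the authors' implementation; the function name `k_best_detect`, the matrix `R`, and the symbol values are assumptions made for the example.

```python
def k_best_detect(R, y_t, alphabet, K):
    """Breadth-first K-best search on the triangular system y_t = R s + noise."""
    n = len(R)
    # Each survivor is (accumulated PED, detected symbols [s_{i+1}, ..., s_{n-1}]).
    survivors = [(0.0, [])]
    for i in range(n - 1, -1, -1):            # start from the last layer
        candidates = []
        for ped, tail in survivors:
            # Remove the contribution of the already-detected symbols.
            u = y_t[i] - sum(R[i][j] * s_j
                             for j, s_j in zip(range(i + 1, n), tail))
            for s in alphabet:                # full expansion of every path
                e = u - R[i][i] * s
                candidates.append((ped + e * e, [s] + tail))
        candidates.sort(key=lambda c: c[0])   # sort all K*M candidates
        survivors = candidates[:K]            # keep the K smallest PEDs
    return survivors[0][1], survivors[0][0]

# Hypothetical noise-free example: 4 real layers, real alphabet of 16-QAM.
R = [[2.0, 0.5, -1.0, 0.25],
     [0.0, 1.5, 0.5, -0.5],
     [0.0, 0.0, 1.0, 0.75],
     [0.0, 0.0, 0.0, 1.25]]
s_true = [1, -3, 3, -1]
y_t = [sum(r * s for r, s in zip(row, s_true)) for row in R]
s_hat, ped = k_best_detect(R, y_t, [-3, -1, 1, 3], K=4)
```

The inner loop over the whole alphabet and the sort over all expanded candidates are the two costs that the proposed scheme aims to reduce.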
Proposed K-best detection scheme
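In the M-QAM case, the modulated symbols are complex-valued. Real-valued decomposition (RVD) can be used to facilitate K-best detection [5]. By this means, Equation (1) can be expanded as

$$\begin{bmatrix} \Re(\mathbf{y}) \\ \Im(\mathbf{y}) \end{bmatrix} = \begin{bmatrix} \Re(\mathbf{H}) & -\Im(\mathbf{H}) \\ \Im(\mathbf{H}) & \Re(\mathbf{H}) \end{bmatrix} \begin{bmatrix} \Re(\mathbf{s}) \\ \Im(\mathbf{s}) \end{bmatrix} + \begin{bmatrix} \Re(\mathbf{v}) \\ \Im(\mathbf{v}) \end{bmatrix}, \quad (4)$$

where $\Re(\cdot)$ and $\Im(\cdot)$ denote the real and imaginary parts of the variables, respectively. Using RVD, the $N_{tx}$-dimensional complex signal is decomposed into a $2N_{tx}$-dimensional real-valued signal. Thus, in the i-th layer of the tree, the accumulated PEDs of the surviving paths can be evaluated as [4]

$$T_i(\mathbf{s}^{(i)}) = T_{i+1}(\mathbf{s}^{(i+1)}) + |e_i(\mathbf{s}^{(i)})|^2, \quad (5)$$

where $\mathbf{s}^{(i)} = [s_i, s_{i+1}, \ldots, s_{2N_{tx}}]^T$, $\tilde{y}_i$ and $R_{ij}$ are the entries of the rotated received vector and of the upper triangular matrix obtained from the QR decomposition of the real-valued model, and

$$e_i(\mathbf{s}^{(i)}) = \tilde{y}_i - \sum_{j=i}^{2N_{tx}} R_{ij} s_j \quad (6)$$

is called the local distance, which denotes the distance increment between two successive nodes in the tree. According to Eq. (6), for any candidate $s$ of $s_i$, the local distance is

$$e_i(s) = \tilde{y}_i - \sum_{j=i+1}^{2N_{tx}} R_{ij} s_j - R_{ii} s. \quad (7)$$

Let $e = e_i(s)$ and $u = \tilde{y}_i - \sum_{j=i+1}^{2N_{tx}} R_{ij} s_j$. For each surviving path, Equation (7) can then be written as

$$e = u - R_{ii} s, \quad (8)$$

where $u$ and $R_{ii}$ are constants. So, Equation (8) is a linear equation in $s$ with a slope of $-R_{ii}$. Setting $e = 0$, the root of Eq. (8) is

$$\hat{s} = \frac{u}{R_{ii}}. \quad (9)$$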
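The real-valued decomposition used at the start of this subsection can be illustrated with a short Python sketch. The helper name `real_valued_decomposition` and the numerical values are hypothetical; the block only checks that the stacked real model reproduces the complex product.

```python
def real_valued_decomposition(H, y):
    """Map complex (H, y) to the equivalent real-valued (H_r, y_r)."""
    top = [[h.real for h in row] + [-h.imag for h in row] for row in H]
    bottom = [[h.imag for h in row] + [h.real for h in row] for row in H]
    y_r = [v.real for v in y] + [v.imag for v in y]
    return top + bottom, y_r

# Hypothetical noise-free 2x2 example.
H = [[0.9 + 0.1j, 0.3 - 0.2j],
     [-0.2 + 0.4j, 1.1 + 0.0j]]
s = [1 - 1j, -1 + 1j]
y = [sum(h * x for h, x in zip(row, s)) for row in H]

H_r, y_r = real_valued_decomposition(H, y)
s_r = [x.real for x in s] + [x.imag for x in s]       # stacked real symbols
y_check = [sum(h * x for h, x in zip(row, s_r)) for row in H_r]
```

The 2 × 2 complex system becomes a 4 × 4 real system, so the tree has $2N_{tx}$ layers and each layer chooses among only $\sqrt{M}$ real-valued candidates.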
So, we can predetermine the ascending order of the candidates of $s_i$ by comparing $|u|$ with the products of $|R_{ii}|$ and a few fixed constants. In the 16-QAM case, we compare $|u|$ with $|2R_{ii}|$ and $|R_{ii}|$ to determine the ascending order of the candidates of $s_i$. As illustrated in Fig. 1 […] $R_{ii}|, |R_{ii}|\}$, respectively, to predetermine the ascending order of $s_i$, which is listed in Table 1.
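For the 16-QAM case, the threshold comparison can be sketched as below. Table 1 is not reproduced in this text, so the three orderings in the code are derived here from the linear relation $e = u - R_{ii}s$ and should be read as an assumption rather than as the paper's table; the function name is hypothetical and $R_{ii} \neq 0$ is assumed.

```python
def predetermined_order_16qam(u, r_ii):
    """Return the real 16-QAM candidates {-3, -1, 1, 3} in ascending order of
    |u - r_ii * s| using one sign check and two threshold comparisons."""
    # Sign of the root u / r_ii (no division is needed).
    sign = 1 if (u >= 0) == (r_ii > 0) else -1
    a, r = abs(u), abs(r_ii)
    if a < r:              # |root| in [0, 1)
        order = [1, -1, 3, -3]
    elif a < 2 * r:        # |root| in [1, 2)
        order = [1, 3, -1, -3]
    else:                  # |root| >= 2
        order = [3, 1, -1, -3]
    return [sign * s for s in order]

def brute_force_order(u, r_ii):
    """Reference ordering obtained by computing every local distance."""
    return sorted([-3, -1, 1, 3], key=lambda s: abs(u - r_ii * s))

# Hypothetical (u, R_ii) pairs, chosen away from the decision boundaries.
cases = [(0.7, 1.0), (-2.9, 2.0), (5.0, -1.5), (-0.3, -0.8)]
results = [predetermined_order_16qam(u, r) for u, r in cases]
```

The point of the sketch is that the ordering is obtained without evaluating any local distance, which is what allows a path to be expanded only to its first few branches.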
Based on the ascending order of the candidates of $s_i$, we can expand each surviving path to only part of its branches and update the accumulated PEDs. To carry out a limited tree search, we expand the surviving paths on demand. For example, in the case of 16-QAM (K = 4), we divide the K surviving paths into K/2 = 2 groups (G1, G2), where the 1st and 2nd surviving paths are in G1 and the 3rd and 4th are in G2. We then expand each path in G1 into three branches and each path in G2 into two branches. The accumulated PEDs of the expanded branches are updated according to Eq. (5). Based on these expansion results, we can sort out the new K surviving paths of the i-th layer in a distributed manner. Consider again the example of 16-QAM (K = 4). First, we select three candidates from G1 (6 to 3) and two candidates from G2 (4 to 2) by comparing the branches within each group. Then we select the new K = 4 surviving paths from these five candidates (5 to 4).
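The grouped selection for 16-QAM (K = 4) can be traced with a small numerical example. The PED values and the function name below are hypothetical; each candidate is stored as a pair (accumulated PED, index of the parent path).

```python
def select_survivors_16qam(branches):
    """branches[p] is the ascending list of accumulated PEDs expanded from
    surviving path p (paths 0-1 form G1, paths 2-3 form G2)."""
    g1 = sorted((ped, p) for p in (0, 1) for ped in branches[p][:3])[:3]  # 6 to 3
    g2 = sorted((ped, p) for p in (2, 3) for ped in branches[p][:2])[:2]  # 4 to 2
    return sorted(g1 + g2)[:4]                                            # 5 to 4

# Hypothetical accumulated PEDs of the expanded branches.
branches = [[0.2, 0.9, 2.5],   # path 1 (G1), three branches
            [0.4, 1.1, 3.0],   # path 2 (G1), three branches
            [0.8, 1.6],        # path 3 (G2), two branches
            [1.0, 2.2]]        # path 4 (G2), two branches
new_survivors = select_survivors_16qam(branches)
```

Here only 10 branches are expanded and the final selection compares 5 candidates, instead of expanding and sorting all 16 branches of the four paths.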
With the above expansion and sorting methods, the proposed K-best scheme greatly reduces the computational complexity of the tree search. For a 2 × 2 16-QAM case, the proposed scheme expands only 28 branches to find the final surviving path, roughly half of the 52 branches expanded by the conventional K-best algorithm. During the sorting process, the proposed scheme performs at most 29 comparison operations, whereas the conventional K-best algorithm performs 129.
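The branch counts quoted above can be reproduced with simple arithmetic. The breakdown is not given in the text, so the sketch below rests on two assumptions: the first layer expands all four real-valued candidates in both schemes, and in the last layer the proposed scheme expands only the best branch of each surviving path. The function names are hypothetical.

```python
def branches_conventional(n_layers, m_real, K):
    """Root expands to m_real nodes; each later layer expands all K
    survivors to m_real branches."""
    return m_real + (n_layers - 1) * K * m_real

def branches_proposed(n_layers, m_real, per_path):
    """per_path lists the branches expanded per surviving path in a middle
    layer. Assumed: full first layer, one branch per path in the last layer."""
    K = len(per_path)
    return m_real + (n_layers - 2) * sum(per_path) + K

# 2x2 16-QAM after real-valued decomposition: 4 layers, 4 real symbols, K = 4.
conv = branches_conventional(4, 4, 4)          # 4 + 3 * 16
prop = branches_proposed(4, 4, [3, 3, 2, 2])   # 4 + 2 * 10 + 4
```

Under these assumptions the counts come out to 52 and 28, matching the figures stated in the text.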
Proposed K-best VLSI architecture
Based on our proposed K-best scheme, we present a new K-best VLSI architecture for 4 × 4 MIMO-OFDM systems with 64-QAM modulation. As shown in Fig. 2, the architecture consists of K = 8 metric computation units (MCUs) and a K-best sorter. Each MCU expands a surviving path to part of its branches and updates the accumulated PEDs. Based on the MCUs' outputs, the distributed sorter selects the new K-best surviving paths. In each stage, the surviving paths and their PEDs are stored in two groups of register lists. The key modules of the proposed architecture are described as follows.
Metric computation unit
In our proposed architecture, the K = 8 surviving paths are processed simultaneously by eight parallel MCUs, which are divided into K/2 = 4 groups (G1-G4). Each group includes two MCUs. To reduce the complexity, each path in G1-G4 is expanded to 6, 4, 3, and 2 branches, respectively. Figure 3 shows the structure of an MCU in G1, which expands a surviving path to six branches. In Fig. 3, a group of multipliers M1 and an adder are used to compute $u$. Six fixed multipliers M2 and six comparators are used to predetermine the ascending order of the candidates of $s_i$. Another group of multipliers M3, followed by six subtractors and six squared-absolute-value calculators, is used to compute the local distances of the expanded branches. Finally, six adders are used to update the accumulated PEDs. The MCUs export the expanded branches and their corresponding accumulated PEDs to the sorter.
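A behavioral model of one MCU may help to follow the data path. This is a functional sketch only: the ordering step sorts the candidates by their distance to the root $u/R_{ii}$, which stands in for the fixed multipliers M2 and comparators of the hardware, and all names and values are hypothetical.

```python
def mcu_expand(ped, tail, r_row, y_i, i, alphabet, n_branches):
    """Expand one surviving path into its n_branches best children.
    tail holds the already-detected symbols [s_{i+1}, ..., s_{n-1}]."""
    n = len(r_row)
    # u = y_i - sum_{j>i} R_ij * s_j  (role of multipliers M1 and the adder)
    u = y_i - sum(r_row[j] * s_j for j, s_j in zip(range(i + 1, n), tail))
    root = u / r_row[i]
    # Stand-in for M2 and the comparators: candidates nearer to the root
    # have the smaller local distance.
    ordered = sorted(alphabet, key=lambda s: abs(s - root))[:n_branches]
    # Local distance and PED update (role of M3, subtractors, squarers, adders).
    return [(ped + (u - r_row[i] * s) ** 2, [s] + tail) for s in ordered]

# Hypothetical G1 example: real alphabet of 64-QAM, six branches per path.
PAM8 = [-7, -5, -3, -1, 1, 3, 5, 7]
out = mcu_expand(ped=0.25, tail=[3], r_row=[1.0, 0.5], y_i=2.0, i=0,
                 alphabet=PAM8, n_branches=6)
symbols = [path[0] for _, path in out]
peds = [p for p, _ in out]
```

Because the branches leave the MCU already in ascending PED order, the sorter that follows only needs to merge ordered lists rather than sort from scratch.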
Sorter
The proposed distributed sorter, illustrated in Fig. 4, is divided into three stages. Each stage includes a group of register lists and some comparators. The sorter is pipelined. Stage I selects 6, 4, 3, and 2 candidates from G1 to G4, respectively (12 to 6, 8 to 4, 6 to 3, 4 to 2). Based on the outputs of stage I, stage II selects the eight (10 to 8) and three (5 to 3) shortest paths. Finally, stage III selects the new K = 8 surviving paths from the eleven candidates produced by stage II (11 to 8). In each comparison operation, the candidate with the smaller accumulated PED is chosen as the output and is replaced by its nearest sibling, while the larger one stays for the next comparison.
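The three stages can be modeled in software as repeated merges of ascending lists. The sketch below uses Python's `heapq.merge` as a stand-in for the comparator and register-list hardware; the pairing of G1 with G2 and of G3 with G4 in stage II is inferred from the counts 10 = 6 + 4 and 5 = 3 + 2, and the PED values are hypothetical.

```python
import heapq
from itertools import islice

def merge_keep(sorted_lists, n_keep):
    """Merge ascending lists and keep the n_keep smallest entries. Each step
    outputs the smaller head, which is then replaced by its next sibling."""
    return list(islice(heapq.merge(*sorted_lists), n_keep))

def three_stage_sort(mcu_out):
    """mcu_out: 8 ascending lists of (ped, path_id) with 6,6,4,4,3,3,2,2 items."""
    groups = [mcu_out[0:2], mcu_out[2:4], mcu_out[4:6], mcu_out[6:8]]
    # Stage I: 12 to 6, 8 to 4, 6 to 3, 4 to 2
    s1 = [merge_keep(g, k) for g, k in zip(groups, [6, 4, 3, 2])]
    # Stage II: 10 to 8 and 5 to 3
    s2a = merge_keep(s1[0:2], 8)
    s2b = merge_keep(s1[2:4], 3)
    # Stage III: 11 to 8
    return merge_keep([s2a, s2b], 8)

# Hypothetical MCU outputs: path p, branch b has PED 0.5 * p + 1.0 * b.
sizes = [6, 6, 4, 4, 3, 3, 2, 2]
mcu_out = [[(0.5 * p + 1.0 * b, p) for b in range(n)]
           for p, n in enumerate(sizes)]
survivors = three_stage_sort(mcu_out)
survivor_peds = [ped for ped, _ in survivors]
```

No stage handles more than 12 candidates at a time, which is what keeps each pipeline stage small compared with a single sort over all expanded branches.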
Complexity analysis and simulation
In the K-best scheme, additions, multiplications, and comparisons dominate the computational complexity. Table 2 compares the fundamental operations of the conventional K-best scheme and the proposed architecture. According to Table 2, the proposed architecture reduces the fundamental operations by approximately 50% and 75% for 16-QAM and 64-QAM, respectively, and consequently saves hardware resources significantly. The proposed architecture is therefore especially suitable for high-throughput MIMO-OFDM systems. The performance of the proposed K-best VLSI architecture is evaluated by computer simulation over a Rayleigh fading channel model. The simulation results for both the 16-QAM (K = 4) and 64-QAM (K = 8) schemes are presented in Fig. 5, which shows that the bit error rate (BER) performance of the proposed architecture is very close to that of the conventional K-best scheme.
Conclusions
In this paper, a novel K-best VLSI architecture for high-throughput MIMO-OFDM detection is presented. The proposed MCU predetermines the ascending order of each surviving path's branches by their local distances and expands the surviving paths to only part of their branches. The distributed sorter then selects the new surviving paths in a pipelined manner. Complexity analysis and simulation results show that the proposed architecture significantly reduces the computational complexity while maintaining a performance similar to that of the conventional K-best scheme, yielding an efficient solution for the VLSI implementation of high-throughput MIMO-OFDM systems.
