This paper presents a VLSI implementation of reduced-complexity and reconfigurable MIMO (Multiple-Input Multiple-Output)
Introduction
A Multiple-Input Multiple-Output (MIMO) system provides increased data throughput and link range by using multiple transmit and receive antennas, without additional bandwidth or transmit power [1]. MIMO plays a key role in recent wireless standards such as HSDPA, IEEE 802.11n wireless LAN, IEEE 802.16e WiMAX, and 3rd Generation Partnership Project (3GPP) Long Term Evolution (LTE) [2]. The main challenge in exploiting the potential of MIMO technology [3] is that higher data rates and capacity requirements demand high computing power at the receiver end.
MIMO systems can improve transmission quality through diversity techniques and spatial multiplexing, but separating the multiplexed data streams is the main implementation challenge in terms of computational complexity and power consumption. An efficient VLSI implementation of the detector is therefore the key to low-power, high-performance and low-cost equipment. The Maximum Likelihood (ML) detection algorithm, which uses an exhaustive search, gives the optimal solution. To reduce the exponential complexity of ML detection, sphere decoders [4] have been proposed to achieve near-ML performance with reasonable complexity. QR decomposition is used to turn ML detection into a tree search with pruning. Tree search algorithms can achieve near-ML performance at significantly lower complexity than the optimal ML method [5].
Algorithms such as depth-first search, breadth-first search, the fixed-complexity sphere decoder and best-first search are available for pruning the tree. In [6], Burg implemented the hard-output depth-first search algorithm, and Guo implemented the soft-output K-best decoder [7]. In [8], Wong proposed a pipelined VLSI architecture for the K-best algorithm. The parallel merge algorithm was proposed in [9] to increase the throughput of conventional K-best architectures. In the literature, some researchers divide the tree search into several parts [10, 11] and then apply the K-best algorithm, but the complexity of the search does not decrease further.
Although the K-best decoder offers constant throughput, computing the accumulated branch metrics in each layer of the tree and selecting the K metrics with the lowest values is a challenge. Appropriate sorting strategies are therefore required to accomplish the selection tasks in the K-best decoder. In [12], architectures for parallel sorting algorithms and their use with the K-best sphere decoder were analyzed. These parallel sorting architectures were based on sorting strategies such as bubble sort [13] and Batcher sort. In [14], a parallel insertion sorting algorithm was presented.
In this paper, we present the FPGA implementation of a sphere detector using a modified K-best breadth-first search algorithm. The algorithm supports 4-QAM with two transmit and two receive antennas, since the baseline in 3GPP-LTE is a 2x2 antenna configuration. The tree search methodology is modified to decrease the latency and hardware complexity. A parallel and distributed sorting strategy is employed in the intermediate layers to exploit the parallelism of the FPGA and achieve high data rates. In the last layer, Batcher's merge-based bitonic sorting is used to sort the accumulated metrics and choose the transmitted symbol set.
Sections 2 and 3 of this paper briefly discuss the MIMO system model, Maximum Likelihood detection and the K-best algorithm for sphere detection. The real-valued decomposition technique and the tree formation are presented in Section 4. Sections 5 and 6 present the modified K-best algorithm and its VLSI architecture. Section 7 discusses the FPGA implementation and results. Section 8 concludes the paper.
MIMO System Model
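Consider a MIMO system model with $N_T$ transmit and $N_R$ receive antennas. The equivalent complex-valued discrete-time baseband model of the MIMO channel between the transmitter and receiver antennas is described by an $N_R \times N_T$ dimensional matrix H. The $N_R$-dimensional received signal vector is given by

$$ y = Hs + n \qquad (1) $$

where s is the $N_T$-dimensional transmitted symbol vector, whose entries are drawn from the complex constellation $\Omega$, and n is the $N_R$-dimensional noise vector. The ML detector selects the candidate vector that is closest to the received vector:

$$ \hat{s}_{ML} = \arg\min_{s \in \Omega^{N_T}} \|y - Hs\|^2 \qquad (2) $$

The ML solution can be obtained through an exhaustive search, which offers excellent performance in a MIMO system. However, an exhaustive search is highly impractical for a large MIMO system, because the number of candidates grows as $|\Omega|^{N_T}$.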
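As an illustration of the exhaustive search described above, the following minimal Python sketch enumerates every candidate vector and keeps the one with the smallest squared Euclidean distance to the received vector. The function name `ml_detect`, the unnormalised 4-QAM alphabet and the example channel values are assumptions made for this sketch and are not taken from the paper's design.

```python
from itertools import product

# Unnormalised 4-QAM alphabet (illustrative assumption)
QAM4 = [complex(a, b) for a in (-1, 1) for b in (-1, 1)]

def ml_detect(H, y, alphabet=QAM4):
    """Exhaustive ML search: return (s, distance) minimising ||y - H s||^2."""
    n_t = len(H[0])
    best_s, best_d = None, float("inf")
    # Visits len(alphabet) ** n_t candidates, hence the exponential cost
    for s in product(alphabet, repeat=n_t):
        d = 0.0
        for row, y_i in zip(H, y):
            residual = y_i - sum(h * x for h, x in zip(row, s))
            d += abs(residual) ** 2
        if d < best_d:
            best_s, best_d = s, d
    return best_s, best_d

# Noise-free usage example with a made-up 2x2 channel
H = [[1 + 0.5j, 0.3 - 0.2j],
     [0.1 + 0.4j, 0.9 - 0.1j]]
s_true = (1 - 1j, -1 + 1j)
y = [sum(h * x for h, x in zip(row, s_true)) for row in H]
s_hat, dist = ml_detect(H, y)
```

For a 2x2 system with 4-QAM the loop visits only 16 candidates, but the count grows exponentially with the number of transmit antennas, which is what motivates the tree search methods discussed next.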
Sphere Decoding Algorithm
Sphere Decoding (SD) algorithms reduce the computational complexity by converting the exhaustive search into a tree search problem. An efficient pruning criterion is used to decrease the number of visited nodes. SD takes into account only the lattice points that lie inside a sphere of a given radius r. The following inequality is referred to as the sphere constraint:

$$ \|y - Hs\|^2 \le C \qquad (3) $$

where $C = r^2$ is the squared radius of the sphere and y is the center of the sphere. The channel matrix H can be decomposed by QR decomposition as $H = QR$, and then equation (3) can be rewritten as

$$ \|\tilde{y} - Rs\|^2 \le C, \qquad \tilde{y} = Q^H y $$

where R is an upper triangular matrix with positive diagonal elements and Q is a matrix with orthonormal columns. (When $N_R > N_T$, a constant term that does not depend on s is absorbed into C.)
The basic idea behind tree search algorithms is to transform the ML detection problem into a tree search problem. In a tree search algorithm, the distance between the received vector y and a candidate vector Hs is decomposed into partial Euclidean distances (PEDs) $d_i$, each of which depends only on the symbols already fixed along the path from the root to the current node. Because each layer adds a non-negative term, the PED never decreases when proceeding from a parent node to one of its children. The algorithm finds the leaf node associated with the smallest accumulated PED, which corresponds to the ML solution.
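The layer-by-layer accumulation of PEDs can be sketched as follows, assuming a real-valued upper triangular matrix R and a rotated received vector (called `y_t` here). The function name `partial_distances` and the numerical values are hypothetical and serve only to show why the PED cannot decrease along a path.

```python
def partial_distances(R, y_t, s):
    """Accumulated PEDs from the last layer (i = n-1) up to the first (i = 0)."""
    n = len(s)
    peds, acc = [], 0.0
    for i in range(n - 1, -1, -1):
        # R is upper triangular, so layer i involves only s[i], ..., s[n-1]
        e = y_t[i] - sum(R[i][j] * s[j] for j in range(i, n))
        acc += e * e  # each layer adds a non-negative increment
        peds.append(acc)
    return peds

# Made-up example: y_t is generated from s = [1, -1] without noise
R = [[2.0, 0.5],
     [0.0, 1.5]]
y_t = [1.5, -1.5]
peds_true = partial_distances(R, y_t, [1, -1])    # [0.0, 0.0]
peds_wrong = partial_distances(R, y_t, [-1, -1])  # [0.0, 16.0]
```

The second call shows a path that matches the received vector in the last layer but diverges in the first, so its accumulated PED grows from 0 to 16.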
K-Best Algorithm
Based on the search strategy, sphere detector algorithms can be divided into depth-first and breadth-first detection algorithms. The K-best algorithm [7] uses the breadth-first search technique, which does not require a sphere constraint. Tree pruning is performed by constraining the number of admissible nodes on each level of the tree to a parameter K. The breadth-first algorithm computes the PEDs in the forward direction only, and the best K candidate symbol sets, ranked by PED, are retained at each level of the tree. The candidates are selected by giving precedence to the children with the smallest associated PEDs.
The main advantage of the K-best algorithm over the depth-first algorithm is its fixed complexity, which is determined by the parameter K. The choice of K entails a tradeoff between complexity and BER performance. If K is very large, the complexity and memory requirements are high. If K is small, there is a chance of accidentally excluding the ML solution from the list of candidate symbol vectors. The value of K should therefore be kept as small as possible without compromising the near-ML performance, since limiting K reduces the complexity of the breadth-first search detection algorithm.
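A minimal software sketch of a conventional K-best breadth-first search is given below, assuming a real-valued upper triangular system and a binary alphabet of +1 and -1 per layer. The function name `k_best`, the use of a full sort to pick the survivors and the example matrix are assumptions for illustration; this is not the hardware architecture proposed in this paper.

```python
def k_best(R, y_t, K, alphabet=(-1, 1)):
    """Breadth-first K-best search on the real-valued system y_t = R s."""
    n = len(y_t)
    # Each survivor is (accumulated PED, symbols for layers i..n-1)
    survivors = [(0.0, [])]
    for i in range(n - 1, -1, -1):
        children = []
        for ped, tail in survivors:
            for a in alphabet:
                cand = [a] + tail  # cand[k] is the symbol of layer i + k
                e = y_t[i] - sum(R[i][i + k] * cand[k] for k in range(n - i))
                children.append((ped + e * e, cand))
        # Keep only the K children with the smallest accumulated PEDs
        children.sort(key=lambda c: c[0])
        survivors = children[:K]
    return survivors[0]

# Noise-free usage example with a made-up 4x4 upper triangular R
R = [[2.0, 0.3, -0.4, 0.1],
     [0.0, 1.5, 0.2, -0.3],
     [0.0, 0.0, 1.8, 0.5],
     [0.0, 0.0, 0.0, 1.2]]
s_true = [1, -1, -1, 1]
y_t = [sum(r * s for r, s in zip(row, s_true)) for row in R]
ped, s_hat = k_best(R, y_t, K=2)
```

The number of PED evaluations per layer is at most K times the alphabet size, independent of the channel and noise, which is the fixed-complexity property discussed above.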
Real-valued Decomposition
The received signal vector in equation (1) is in the complex domain. The sphere decoding algorithm discussed in the previous section is applied after the real and imaginary components of y, H and s are decoupled to form a real system of equations with twice the dimension of the complex system [15]. The $N_T$-dimensional complex-valued system model can therefore be decomposed into an equivalent $2N_T$-dimensional real-valued system model according to the following equation:

$$ \begin{bmatrix} \Re\{y\} \\ \Im\{y\} \end{bmatrix} = \begin{bmatrix} \Re\{H\} & -\Im\{H\} \\ \Im\{H\} & \Re\{H\} \end{bmatrix} \begin{bmatrix} \Re\{s\} \\ \Im\{s\} \end{bmatrix} + \begin{bmatrix} \Re\{n\} \\ \Im\{n\} \end{bmatrix} $$
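The decoupling of real and imaginary parts can be sketched in a few lines of Python. The function name `real_valued_decomposition` and the example values are assumptions for illustration; the block structure is the standard real-valued form of a complex linear system.

```python
def real_valued_decomposition(H, y):
    """Map complex (H, y) to a real pair (H_r, y_r) with twice the dimension."""
    # Upper block row: [Re(H), -Im(H)]; lower block row: [Im(H), Re(H)]
    top = [[h.real for h in row] + [-h.imag for h in row] for row in H]
    bottom = [[h.imag for h in row] + [h.real for h in row] for row in H]
    y_r = [v.real for v in y] + [v.imag for v in y]
    return top + bottom, y_r

# Made-up 2x2 channel and 4-QAM symbol vector, noise-free
H = [[1 + 0.5j, 0.3 - 0.2j],
     [0.1 + 0.4j, 0.9 - 0.1j]]
s = [1 - 1j, -1 + 1j]
y = [sum(h * x for h, x in zip(row, s)) for row in H]

H_r, y_r = real_valued_decomposition(H, y)
s_r = [v.real for v in s] + [v.imag for v in s]  # entries are +1 or -1
y_check = [sum(a * b for a, b in zip(row, s_r)) for row in H_r]
```

For the 2x2 system, `H_r` is a 4x4 real matrix and each entry of `s_r` is +1 or -1, so every tree node has only two children.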
The real-valued decomposition (RVD) technique transforms the original search tree into a tree with twice the depth. From Table 1 it can be inferred that, when operating directly on complex constellation points, $K|\Omega|$ PEDs must be calculated in each step, where $\Omega$ denotes the complex constellation, whereas applying the RVD technique reduces the number to only $K\sqrt{|\Omega|}$ PEDs. The silicon complexity and power consumption of each processing element are therefore much lower with RVD. Figure 1 shows the decision tree structure for a 4-QAM, 2x2 MIMO system. After real-valued decomposition the tree contains four layers. The symbol estimation in equation (2) can be done by sequentially following the decision tree structure. At each node in the tree a decision has to be made as to whether the transmitted symbol is +1 or -1, such that the uppermost path through the tree corresponds to the transmitted sequence (+1, +1, +1, +1).
Modified K-best Algorithm
The breadth-first search algorithm can be modified to decrease the latency, by calculating two PEDs in parallel and discarding the larger one. In our design, the processing element computes PEDs of all the children of a single parent node in one cycle.
We employ a parallel and distributed sorting strategy in our algorithm [16, 17]. The steps involved are:
(1). The parent nodes are distributed into two different sets.
(2). The nodes in each set are compared in parallel using comparators, and the node with the smallest PED is assigned as the best node.
(3). The path is extended based on the best children nodes.
In the last layer, the PEDs obtained are sorted using the bitonic sorting technique, and the best node with the minimum PED and its symbol set are given as output. This represents the hard-decision output of the decoder. The parallel implementation allows the algorithm to deliver a fixed throughput.
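One possible software reading of steps (1) and (2) is sketched below: the candidate nodes are split into two sets and the node with the smallest PED is selected from each set. The function name `distributed_select`, the even split and the node representation as `(ped, symbols)` tuples are assumptions made for this sketch; the actual partitioning used in the hardware design is not specified in the text.

```python
def distributed_select(nodes):
    """Split (ped, symbols) nodes into two sets and keep the best node of each."""
    half = len(nodes) // 2
    set_a, set_b = nodes[:half], nodes[half:]
    # In hardware, the two minima would be found by comparators working in parallel;
    # here they are simply computed one after the other.
    best_a = min(set_a, key=lambda node: node[0])
    best_b = min(set_b, key=lambda node: node[0])
    return best_a, best_b

# Made-up candidate nodes: (accumulated PED, partial symbol list)
nodes = [(3.0, [1, 1]), (1.0, [1, -1]), (2.5, [-1, 1]), (0.5, [-1, -1])]
best_a, best_b = distributed_select(nodes)
```

The two selected nodes would then be the parents whose paths are extended in the next layer, as in step (3).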
VLSI Architecture
The block diagram of the modified K-best sphere decoder is shown in Figure 2. The blocks are divided into separate units, so the processing can be pipelined.
In the preprocessing unit, the inputs y and H are buffered, QR decomposition is performed and the received vector y is multiplied with $Q^H$. [...] The PEDs are compared in parallel, and the four smallest PEDs and their symbol sets are given as output to the bitonic sorter. The output PEDs are sorted in the bitonic sorting unit, and the smallest PED and its symbol set are taken as the optimal estimate of the transmitted symbol.
Bitonic Sorter
Bitonic sorting is an efficient merge-based algorithm for parallel sorting introduced by Batcher [18]. For many years, it was considered to be the most practical parallel sorting algorithm. Figure 3 shows the bitonic sorter unit, where 4 comparators are used in total and 2 comparators are serially connected in the critical path. Figure 4 shows the blocks present in each compare-and-sort unit. Table 2 shows the complexity of different sorters used in the literature [19] for sorting PEDs in the sphere decoder. It can be observed that, as n increases, the complexity of the bitonic sorter grows only slowly compared with the bubble sorter and the insertion sorter, whose complexity increases proportionately. We therefore apply bitonic sorting in order to reduce the overall delay and silicon complexity.
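For reference, a generic bitonic sorting network for a power-of-two number of inputs can be sketched in Python as follows. This is the textbook full sorting network, not the specific sorter unit of Figure 3; the function names `compare_swap` and `bitonic_sort` are assumptions for this sketch. Within each inner stage the comparators act on disjoint pairs, which is what makes the network suitable for parallel hardware.

```python
def compare_swap(a, i, j, ascending):
    """One comparator: put a[i], a[j] in ascending (or descending) order."""
    if (a[i] > a[j]) == ascending:
        a[i], a[j] = a[j], a[i]

def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    a = list(values)
    n = len(a)
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:      # distance between the two inputs of a comparator
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # Blocks alternate direction so that merges see bitonic input
                    compare_swap(a, i, partner, (i & k) == 0)
            j //= 2
        k *= 2
    return a

sorted_peds = bitonic_sort([27, 3, 19, 8])
```

The comparator pattern depends only on the input length and not on the data, so the delay of the sorter is fixed.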
Implementation and Results
The modified K-best algorithm with the sorter unit has been implemented using the Xilinx PlanAhead design tool [20]. In PlanAhead, the implementation and timing results can be viewed to analyze critical logic and to improve the design performance through floorplanning and constraint modification. The tool is used to implement and verify the proposed modified K-best algorithm and its VLSI architecture on a Xilinx Spartan-6 FPGA with a word length of 8 bits and $N_T = 2$.
We implemented a 2x2 MIMO system with 4-QAM modulation in a Rayleigh fading channel environment. At the receiver end, the channel information is assumed to be known perfectly. The estimated complex channel matrix H is converted into a real-valued matrix through RVD in order to reduce the complexity and silicon area of the design. The resource utilization results for the architecture are presented in Table 3. PE1 and the CS unit take 19 clock cycles for PED calculation and sorting for layer 4 and layer 3. In layer 2, PE2 and the CS unit take 25 clock cycles. In layer 1, the PE3 unit takes 27 clock cycles and the bitonic sorter unit takes 2 clock cycles for merging and sorting, as indicated in Table 4. The maximum clock frequency of the PED unit is 183.065 MHz and that of the bitonic sorter unit is 243.072 MHz. The power consumption of the design mapped on an FPGA device can be estimated using the XPower Analyzer tool. The total on-chip power of the design is 213 mW for the PED unit and 108 mW for the bitonic sorter unit. The maximum throughput of the K-best detector, following [22], is calculated as

$$ \text{Throughput} = \frac{f_c \cdot \log_2 M \cdot N_T}{C} $$

where $f_c$ is the maximum clock frequency, M is the constellation size, $N_T$ is the number of transmit antennas and C is the number of clock cycles needed for calculating the PEDs in the last layer. For our K-best detector design, C = 27. With the PED unit's clock frequency of 183.065 MHz, M = 4 and $N_T = 2$, the maximum throughput achieved by the design is approximately 27 Mbps. In Table 5, our proposed work is compared with previous K-best sphere decoder implementations. The comparison shows that our design improves the throughput of the detector with a slight increase in resource utilization.
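The reported figure can be checked with a short calculation, assuming the throughput expression $f_c \cdot \log_2 M \cdot N_T / C$, which is consistent with the parameter definitions above and with the reported result. The function name `max_throughput_mbps` is hypothetical.

```python
from math import log2

def max_throughput_mbps(f_c_mhz, M, n_t, cycles):
    """Throughput in Mbps for a clock in MHz: f_c * log2(M) * N_T / C."""
    return f_c_mhz * log2(M) * n_t / cycles

# Values reported for the PED unit: 183.065 MHz, 4-QAM, 2 antennas, 27 cycles
rate = max_throughput_mbps(183.065, 4, 2, 27)
```

With these inputs the expression evaluates to roughly 27.1 Mbps, matching the reported throughput of about 27 Mbps.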
Conclusion
In this paper, we presented a reconfigurable VLSI architecture for the proposed modified K-best algorithm targeting the 3GPP-LTE standard. A 2x2, 4-QAM sphere decoder, which forms the baseline of the LTE standard, was implemented. The algorithm uses a bitonic sorter in the last layer, in addition to parallel and distributed sorting in the previous layers. The hardware complexity is reduced because fewer comparators are required in the critical path. In practice, the sphere detector will be combined with channel decoders to provide robustness against fading channels and noisy wireless environments. Our future work will be to implement the algorithm for higher-order constellations and different antenna configurations, with a further increase in throughput for emerging wireless communication standards.
