I. INTRODUCTION
Multiple-input-multiple-output (MIMO) wireless communication has shown great promise for future communication systems, as it achieves very high spectral efficiency [1]. However, practical realizations of MIMO wireless communication systems have been limited by their difficulty of implementation. The major bottleneck is the computational complexity of the maximum likelihood (ML) detection problem, especially for arrays with a large number of transmit and receive elements. This reality motivated researchers to consider suboptimal approaches for MIMO decoding, such as zero forcing (ZF), minimum mean-square error (MMSE), and the vertical Bell Laboratories layered space-time architecture (VBLAST) [4], all of which vary in performance and complexity.
In the path metric, $s_i$ denotes the tree node chosen at level $i$, and $l_{ii}$ and $l_{ij}$ are functions of the specific channel realization experienced by a transmitted vector. Most often, real-valued decomposition of the channel is used, such that each complex constellation point can be represented as two real constellation points, and the corresponding metric is distributed over two tree levels [2], [3].
The complete extension and selection process consists of several operations: the metric computation for newly extended paths (path extension), the comparison with previously extended paths (path comparison), and the removal of a path exceeding any predefined bound (path purge). The speed and power bottlenecks of the K-best algorithm arise mainly from the parallel execution of all of these operations at each level.
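The real-valued decomposition mentioned above can be illustrated with a minimal sketch. It uses the standard identity that stacks real and imaginary parts, so that an N-antenna complex system becomes a 2N-level real tree. The function names `real_decomposition` and `matvec` are illustrative choices and do not come from the paper.

```python
def real_decomposition(H, s):
    """Map a complex system (H, s) to its equivalent real-valued form.

    H is a list of rows of complex channel entries; s is a list of complex
    symbols. The result (H_r, s_r) satisfies
    H_r @ s_r == [Re(H s); Im(H s)].
    """
    top = [[h.real for h in row] + [-h.imag for h in row] for row in H]
    bottom = [[h.imag for h in row] + [h.real for h in row] for row in H]
    s_r = [x.real for x in s] + [x.imag for x in s]
    return top + bottom, s_r


def matvec(A, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


# Hypothetical 2 x 2 example: each complex symbol becomes two real symbols.
H = [[1 + 2j, 3 - 1j], [0 + 1j, 2 + 0j]]
s = [1 - 1j, -3 + 1j]
H_r, s_r = real_decomposition(H, s)
y_r = matvec(H_r, s_r)  # real parts of H s, followed by imaginary parts
```

Under this decomposition, a 4 × 4 complex system would be searched as a tree with eight real-valued levels.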
A high-throughput MIMO detector for 16 quadrature-amplitude modulation (QAM) has been reported in [6], and a detector plus decoder for a 64 phase-shift keying (PSK) system has been reported in [7]. Both employ the K-best breadth-first algorithm to achieve a near-constant throughput. However, the throughput in [6] degrades heavily with increasing K due to the increase in the number of parallel operations that must be executed simultaneously. The scheme reported in [7] attempts to reduce the number of simultaneous parallel operations by introducing feedback from the selection unit to the path extension unit. It uses the Schnorr-Euchner (SE) strategy, reported in [9], to achieve this; however, it suffers from high power consumption and large area. Alternatively, the VLSI implementation of a sphere decoder for a 4 × 4, 16-QAM system, reported in [8], is relatively power and area efficient, but suffers from nonuniform throughput. In this paper, we present the implementation of a compact, low-power K-best 4 × 4, 64-QAM system, which provides the benefits associated with a K-best approach, such as constant throughput and ease of pipelining, while maintaining low power and area. The main contributions of the paper are as follows.
1) A sort-free architecture is proposed that significantly reduces the computational complexity involved in finding and sorting the K nodes at each layer of the tree. The paper discusses the tradeoffs involved with this approach in terms of both the power delay product (PDP) and the bit error rate (BER) performance.
2) Traditionally, partial metrics are computed at each node of the tree and recomputed in full for each new received vector. In the proposed structure, a quantized look-up table (LUT) is constructed once per channel realization, and reused to calculate the partial results for each new received vector as a set of shifts and additions rather than multiplications, which results in lower area and power consumption. The paper discusses the tradeoffs associated with this approach in terms of power consumption, BER performance, and area.
3) A compact hardware architecture based on resource sharing is proposed and implemented, targeting both a field-programmable gate-array (FPGA) platform and an application-specific integrated circuit (ASIC) in a 65-nm technology. Functional verification results run on an FPGA platform are presented and compared to the simulation results to confirm performance. ASIC power and area results are compared to the state-of-the-art implementations.
The remainder of the paper is organized as follows. Section II presents a discussion and full analysis of a modification to the SE strategy, reported in [9], which results in a sort-free approach to the K-best algorithm. This will be referred to as the winner path extension (WPE) method. Section III presents the VLSI architecture of a detector based on the WPE method, while Section IV presents the FPGA functional verification and the VLSI implementation results and statistics. The paper is concluded in Section V.
II. WPE: A SORT-FREE APPROACH
The WPE technique is illustrated in Fig. 1. Instead of extending all the children of a node in parallel, only the minimum metric child of each node is extended. The minimum among these is selected as the winner and the first of the K-best extended paths; the parent which produced the winner is allowed to extend to its next best child, and the process is repeated until all K paths have been extended. This requires only $2K - 1$ paths to be extended for the selection of K paths, and eliminates the need for a sorter. This approach was first reported by the authors in [10] and [11], and also independently in [12] and [13]. In this paper, we study the complexity of the WPE approach versus traditional extension and sorting. We propose a novel reduced-cost (in terms of power and area) WPE, which is used as the core of a K-best detector, and quantify the improvement in cost and performance. Finally, it is important to note that the method presented in [12] and [13] requires exact sorting among the set of first children, which is not a requirement in our proposed approach.
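The extension loop described above can be sketched in software as follows. This is a behavioral illustration only, not the hardware architecture: it assumes that each surviving parent can enumerate its children in order of increasing metric (as an SE-style enumeration would), and the names `wpe_select`, `parent_metrics`, and `child_increments` are hypothetical.

```python
def wpe_select(parent_metrics, child_increments, k):
    """Select the k best extended paths without a full sort.

    parent_metrics[p]   : accumulated metric of surviving path p
    child_increments[p] : metric increments of p's children, in ascending
                          order (assumed to come from an SE-style enumeration)
    Returns (winners, extensions), where winners is a list of
    (metric, parent, child_rank) and extensions counts metric computations.
    """
    extensions = 0
    next_rank = [0] * len(parent_metrics)
    candidates = {}  # parent index -> metric of its currently extended child

    # Step 1: every parent extends only its minimum-metric child.
    for p, base in enumerate(parent_metrics):
        candidates[p] = base + child_increments[p][0]
        extensions += 1

    winners = []
    while len(winners) < k and candidates:
        # Step 2: the minimum among the current candidates is the winner.
        p = min(candidates, key=candidates.get)
        winners.append((candidates[p], p, next_rank[p]))
        next_rank[p] += 1
        # Step 3: the winner's parent extends its next-best child
        # (skipped once the last winner has been chosen).
        if len(winners) < k and next_rank[p] < len(child_increments[p]):
            candidates[p] = parent_metrics[p] + child_increments[p][next_rank[p]]
            extensions += 1
        else:
            del candidates[p]
    return winners, extensions


# Hypothetical example with K = 4 parents and three children each.
parents = [5, 10, 15, 20]
increments = [[1, 4, 9], [2, 3, 10], [0, 6, 7], [1, 5, 8]]
winners, extensions = wpe_select(parents, increments, k=4)
```

With K parents and K selections, the loop performs K initial extensions plus K − 1 follow-up extensions, which matches the $2K - 1$ count stated above.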
A. WPE: Complexity
Traditionally, parallel path extension is employed to achieve a high throughput at the cost of both area and power [7]. The proposed WPE approach solves the problem of sorting, as discussed previously; however, as shown in Fig. 1, it is a highly serial algorithm. To establish a fair comparison, it is important to study the algorithms in terms of their PDP. To facilitate generating the PDP, we define a complexity factor (C), which indicates the relative complexity of a metric operation (i.e., computing $|l_{ii}|^2 (s_i - c_i)^2$) when referred to an adder or a comparison operation. Note that addition, subtraction, comparison, and purging are assumed to have a normalized complexity of 1. Clearly, C depends on the bit width used and the architecture of both the metric operation and the adder. A reasonable value to assume is C = 8 or 16. The metric operation can be carried out by a shift-and-add multiplier. The latency is equal to that of eight cascaded adders. Assuming that adder bypass logic is used when a multiplicand bit is zero, only half of the adders are active at a time (i.e., assuming that all multiplicand bits have equal probability of being zero and one). This results in eight times more power consumption than an adder and eight adder delays. Similar analysis can be carried out for different adder architectures or multiplicand lengths. Hence, as discussed, C can take expected values between 8 and 16. Finally, the cost of computing the minimum among a set of K values is logarithmic in K. With this understanding, the PDPs for both the parallel and WPE approaches can be computed for a given K and Q, where the constellation size is $2^Q$. These are computed for a 64-QAM constellation (Q = 6) for the two algorithms, and are presented in Fig. 2.
As shown in the figure, the proposed extension technique is better in terms of PDP. The power delay advantage of the WPE technique for C = 8 is almost 50%; however, it reduces to around 30% for C = 16. Thus, as the cost of one path extension increases, the advantage tends to reduce, which, in turn, implies that maximizing the PDP gains of the WPE approach is contingent on minimizing the cost of a path extension.
B. Quantized Path Metric Computation: Power and Latency Reduction
The path metric computations required in each path extension step are expensive, both in terms of power and latency. To reduce the overhead associated with computing the path metric $|l_{ii}|^2 (s_i - c_i)^2$, it is important to note that a part of the computation depends on the channel ($|l_{ii}|^2$), while the other part depends on the received vector. In the proposed architecture, we use this observation, in addition to the structure of the QAM constellation, to construct LUTs that are updated only once per channel realization. These LUTs are then accessed on a per-vector basis, where quantization is used to ensure that the ensuing operations are pure shifts and adds rather than high-precision multiplications, which minimizes power consumption. This approach leads to significant power savings and improvement in speed, especially when the path metrics are represented using a large number of bits. As the critical path consists of only one multiplexer and two adders, the latency is much reduced compared to that of cascaded multipliers. From a complexity point of view, the time cost of one path extension using quantized metrics is roughly equal to two times that of a comparison (or add) operation (as opposed to 8 or 16), i.e., C, as defined in Section II, is 2. This results in an improved PDP, as shown in Fig. 4. It is important to note that the quantized approach can be applied to both the conventional (parallel extension) and the proposed (WPE) approaches, as shown in the figure.
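The idea of replacing a multiplication by the channel-dependent gain with shifts and adds can be illustrated with a small sketch. The quantization rule used here (keeping only the two most significant set bits of an integer gain) is an assumption made for illustration and is not necessarily the rule used in the proposed architecture; `top_two_bits`, `build_lut`, and `shift_add_mul` are hypothetical names.

```python
def top_two_bits(g):
    """Return the positions of the two most significant set bits of g > 0.

    The second position is None when g is an exact power of two.
    """
    hi = g.bit_length() - 1
    rest = g - (1 << hi)
    lo = rest.bit_length() - 1 if rest else None
    return hi, lo


def build_lut(gains):
    """Built once per channel realization: one shift pair per tree level."""
    return [top_two_bits(g) for g in gains]


def shift_add_mul(x, shifts):
    """Approximate g * x with two shifts and at most one addition."""
    hi, lo = shifts
    y = x << hi
    if lo is not None:
        y += x << lo
    return y


# Hypothetical fixed-point gains (integers) for three tree levels.
lut = build_lut([13, 8, 1])          # 13 = 0b1101 is quantized to 0b1100 = 12
approx = shift_add_mul(5, lut[0])    # 12 * 5, computed as (5 << 3) + (5 << 2)
```

In this sketch the table is rebuilt only when the channel changes, and every per-vector product then costs shifts and one addition, at the price of a truncation error in the gain.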
C. Quantized Metric: Detection Performance
A MIMO detector system for a 4 × 4, 64-QAM constellation was simulated using quantized path metrics in order to check the performance degradation due to inexact metrics. A flat fading channel is assumed, with the MIMO channel SNR as defined in [1]. From simulations, the performance degradation for different levels of quantization was observed to be negligible (around 0.7 dB at 25 dB) for K = 8. However, for K = 64, it increases to around 2 dB, which is unacceptable. To avoid the loss in performance, an explicit path metric computer (EPM) block was introduced at the last stage, which computes the exact path metrics of the K-best leaf nodes using multipliers. As this metric is computed only at the last stage, the power requirements are minimal. However, the introduction of this stage improved the performance of the system considerably, as detailed in Fig. 5, which shows the symbol error rates (SERs) versus SNR for different values
III. VLSI ARCHITECTURE
The detector cell architecture of the system is shown in Fig. 6. The two basic tasks of computing the center and computing the path metric are carried out by the center calculator (CC) and the path metric computer (PM) block, respectively. Each detector cell has its local memory blocks, M1 and M2. At the beginning of the cycle, M1 contains the K best paths extended until the $(i - 1)$th level. M1 also contains the K centers corresponding to the K paths, computed for the $i$th level. The extension cycle starts with the PM block extending all the centers to their nearest symbols and computing the corresponding path metrics. Henceforth, at every clock cycle, a new path is extended to the $i$th level and written to M2. At the end of the extension cycle, M2 contains the K best paths extended until the $i$th level. Before the next cycle starts, M1 and M2 swap their roles, thus eliminating the need for any data transfer from one stage to the next.
A. Dynamic Load CC
Typically, multiple detector cells are employed on a chip to achieve the required throughput. Each detector cell processes one received symbol; however, each detector cell requires different multiplicative resources based on the tree level it is processing.
This can be easily understood by looking at the expression for the center, given by $\hat{s}_i - \sum_{j=1}^{i-1} \frac{l_{ij}}{l_{ii}} (s_j - \hat{s}_j)$. The summation within the modulus operation varies in length with the tree level. Due to the constant throughput requirements, it is necessary to allocate more resources to the CC for deeper levels of the tree (i.e., for larger values of $i$), compared to starting levels (smaller values of $i$). A group of multipliers and simple Bidirectional MUltipleXer (BiMUX) blocks are used to create a configurable CC, which caters to the different detector cells as required. Fig. 6 shows how the three different detector cells process three different levels of the tree, and how the BiMUX blocks simply program the interconnects. The proposed scheme is highly scalable; for processing higher depths, one can simply add two or more of the outputs (CC1out, CC2out, and CC3out) to compute centers for large tree levels.
Finally, the WPE technique also requires the selection of the minimum metric path from a set of K paths after every extension. This is achieved by the MinFinder (MF) block, which is implemented using a logarithmic arrangement of K comparators. The MF is pipelined with registers after every comparator.
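The two operations discussed in this subsection can be modeled behaviorally as follows. This sketch assumes 0-based level indices and a lower-triangular matrix `L` holding the $l_{ij}$ entries. The function names `center` and `min_finder` are illustrative, and the code does not model the BiMUX interconnect or the pipeline registers.

```python
def center(i, L, s_hat, s):
    """Center for tree level i (0-based), given decisions s[0..i-1].

    Implements s_hat[i] - sum_{j<i} (L[i][j] / L[i][i]) * (s[j] - s_hat[j]).
    The loop runs i times, so deeper levels need more multiplications.
    """
    acc = 0.0
    for j in range(i):
        acc += (L[i][j] / L[i][i]) * (s[j] - s_hat[j])
    return s_hat[i] - acc


def min_finder(metrics):
    """Pairwise comparator tree over a non-empty list of metrics.

    Returns (min_value, index, stages); ties go to the lower index.
    """
    level = [(m, idx) for idx, m in enumerate(metrics)]
    stages = 0
    while len(level) > 1:
        nxt = [min(level[n], level[n + 1]) for n in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an odd element passes through unchanged
        level = nxt
        stages += 1
    return level[0][0], level[0][1], stages


# Hypothetical values for a three-level example.
L = [[2.0, 0.0, 0.0], [1.0, 4.0, 0.0], [2.0, 1.0, 4.0]]
s_hat = [1.5, -0.5, 2.0]
s = [1.0, -1.0]
c2 = center(2, L, s_hat, s)                        # two multiply-accumulates
best = min_finder([7, 3, 9, 1, 4, 8, 2, 6])        # K = 8 needs 3 stages
```

The growing loop length in `center` is the software counterpart of the varying multiplier demand per tree level, and the stage count returned by `min_finder` grows logarithmically with K.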
IV. VLSI IMPLEMENTATION

A. FPGA Implementation and Verification
An FPGA implementation of the system utilizing six parallel detectors was carried out using a Xilinx XC2VP30 device running at 62.5 MHz. The experimental results are shown against the simulation results in Fig. 7. As expected, a small performance loss due to fixed-point effects is observed. All channel entries and path metrics were represented using 7 bits for the integer part and 7 bits for the fractional part. The FPGA implementation metrics are presented in Table I. The FPGA implementation requires 0.0225 MB of RAM.
B. ASIC Implementation
A chip that implements the aforementioned functionality was synthesized using Taiwan Semiconductor Manufacturing Company (TSMC) standard CMOS cell libraries having eight metal layers (65-nm technology). Synopsys Design Compiler area and power estimates are reported in Table II. The numbers correspond to the typical case. The frequency of operation was set to 158 MHz at a supply voltage of 1 V. The area is reported in kilo gate equivalents (kGEs) to normalize the difference in technology, where a single two-input NAND gate with a drive strength of one was used for comparison. To better compare the different systems, a figure of merit (FOM) is defined as follows: $\mathrm{FOM} = K \cdot \mathrm{ThPut} / \mathrm{Area}$, where K is the number of nodes used in the K-best approach, ThPut is the throughput in megabits per second, and Area is the area expressed in kGE. This FOM can be thought of as a normalization of the throughput in terms of the area invested per node (K point) investigated. Clearly, as the number of K points increases, the area increases and the performance (BER) improves, typically at the cost of a reduction in throughput. From the table, the proposed work has the highest FOM for K = 64. For K = 10 and K = 5, the work in [5] and [6] exhibits a better FOM. This is attributed to the fact that the synthesized ASIC was designed for a 64-QAM system with K = 64. Clearly, a smaller area, and thus a higher FOM, can be achieved by targeting the ASIC for a smaller value of K. Furthermore, note that the presented work achieves 100 Mb/s of throughput at a much lower power than that reported for other architectures.
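The FOM defined above is a simple ratio and can be computed as in the sketch below. The area value in the example is a placeholder chosen for illustration and is not taken from Table II; the function name `figure_of_merit` is likewise hypothetical.

```python
def figure_of_merit(k, throughput_mbps, area_kge):
    """FOM = K * ThPut / Area, with ThPut in Mb/s and Area in kGE."""
    return k * throughput_mbps / area_kge


# K = 64 and 100 Mb/s as discussed in the text; 200 kGE is a made-up area.
fom_example = figure_of_merit(64, 100.0, 200.0)
```

Because K appears in the numerator, a design that sustains the same throughput per unit area while investigating more nodes scores a higher FOM.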
V. CONCLUSION
A novel, high-throughput, VLSI architecture for the K-best MIMO detector system has been presented and experimentally verified. The use of a sort-free K-best engine in conjunction with a quantized path metric unit yields a highly scalable and power efficient architecture as compared to state-of-the-art approaches.
