Abstract-The maximum likelihood (ML) detector is the optimal detector for multiple-input multiple-output (MIMO) communication systems. The sphere decoding algorithm can achieve near-optimal ML performance with reduced complexity. In this paper, a new VLSI architecture for implementing the sphere decoding algorithm is proposed. The proposed architecture is fully parallel and is designed around stack operations. It is implemented in 0.18 µm technology for a 4x4 QPSK MIMO system and achieves a decoding throughput of 60 Mbps.
I. INTRODUCTION
Next generation wireless local area networks (WLANs), such as the IEEE 802.11n standard, rely on multiple antennas at both the transmitter and the receiver to increase link throughput without increasing the occupied frequency spectrum. These improvements come at the cost of a significant increase in signal processing and hardware complexity compared with existing single-antenna systems.
The maximum likelihood (ML) detector is the optimum receiver for multiple-input multiple-output (MIMO) channels. Lattice decoding algorithms, such as sphere decoders, can achieve near-optimal ML performance with reduced complexity, but their efficient implementation in hardware is a challenging task.
In this paper, we present a low-complexity, high-throughput VLSI architecture for the implementation of the sphere decoding algorithm. The proposed architecture has the following characteristics:
• It has a highly parallel structure and is designed based on the concept of the stack algorithm.
• All computations are complex-valued, which makes the depth-first tree search more efficient than a real-valued search.
• The proposed VLSI architecture is highly expandable and configurable, which makes it a suitable choice for future WLAN technologies such as the 802.11n standard.
The architecture was designed and synthesized for a 4-transmitter, 4-receiver antenna (4 x 4) QPSK-based MIMO system using 0.18 µm CMOS technology. Simulation results show that a throughput of 60 Mbps can be achieved using the proposed architecture. The total resource usage for implementing the architecture is about 18K equivalent NAND gates.
A. MIMO Channel Model
In a MIMO system using $N_t$ transmitter and $N_r$ receiver antennas ($N_t \geq N_r$), each of the $N_r$ receivers receives components from each of the $N_t$ transmitters, including both line-of-sight and reflected components. For narrowband transmission, e.g. a single channel in an OFDM system, the receiver has to resolve a linear combination of all these multipath transmissions. The system model for a MIMO system in a flat-fading channel is given by the following equation:

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n} \qquad (1)$$

where $\mathbf{y}$ is the $N_r \times 1$ received vector, $\mathbf{H}$ is the $N_r \times N_t$ channel matrix, $\mathbf{s}$ is the $N_t \times 1$ transmit vector and $\mathbf{n}$ is the additive noise vector. The transmit vector $\mathbf{s}$ corresponds to a binary vector $\mathbf{b}$ containing $N_t Q$ bits, where $Q$ is the number of bits per symbol in the complex constellation $\Lambda$, which contains $P_c = 2^Q$ constellation points. The task of the MIMO detector is to recover $\mathbf{s}$ from $\mathbf{y}$ by solving this equation. The corresponding uncoded transmission rate is $R = N_t Q$ bits per channel use (bpcu). For the purpose of numerical simulations, the entries of $\mathbf{H}$ are modeled as i.i.d. Rayleigh fading. Wideband communication systems can be reduced to a set of narrowband MIMO systems by using OFDM modulation, so the above narrowband model can also be used to design receivers for wideband systems.
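The channel model above can be exercised with a few lines of simulation code. The following is a minimal sketch using only the Python standard library; the QPSK bit mapping, the SNR definition and the helper names `crandn` and `simulate_channel` are assumptions made for illustration and are not taken from the paper.

```python
import math
import random

# Unit-energy QPSK constellation (assumed mapping, P_c = 4, Q = 2).
QPSK = [complex(a, b) / math.sqrt(2) for a in (1, -1) for b in (1, -1)]

def crandn(rng, var=1.0):
    """Draw one circularly symmetric complex Gaussian sample with variance `var`."""
    sigma = math.sqrt(var / 2.0)
    return complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))

def simulate_channel(n_t, n_r, snr_db, rng):
    """Return (H, s, y) for one use of the flat-fading channel y = H s + n."""
    # i.i.d. complex Gaussian entries give Rayleigh-distributed magnitudes.
    H = [[crandn(rng) for _ in range(n_t)] for _ in range(n_r)]
    s = [rng.choice(QPSK) for _ in range(n_t)]
    # Assumed SNR definition: total transmit energy (n_t) over noise variance
    # per receive antenna.
    noise_var = n_t / (10.0 ** (snr_db / 10.0))
    y = [sum(H[i][j] * s[j] for j in range(n_t)) + crandn(rng, noise_var)
         for i in range(n_r)]
    return H, s, y

rng = random.Random(0)
H, s, y = simulate_channel(4, 4, 20.0, rng)
bits_per_channel_use = 4 * 2   # R = N_t * Q for a 4x4 QPSK system
```

Each call produces one channel realization, one transmit vector and the corresponding received vector, which is the input a detector would work on.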
B. Maximum Likelihood Detection
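The task of a hard-output maximum likelihood (ML) detector is to solve the following equation

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s} \in \Lambda^{N_t}} \lVert \mathbf{y} - \mathbf{H}\mathbf{s} \rVert^2$$

where $\Lambda^{N_t}$ is the set of all possible transmitted symbol vectors (constellation set). The channel matrix is assumed to be perfectly known to the receiver and we assume $N_t = N_r$ in this paper. By triangularizing the channel matrix $\mathbf{H}$ using the QR decomposition we can further simplify (1):

$$\mathbf{H} = \mathbf{Q}\mathbf{R} \qquad (4)$$

$$\hat{\mathbf{y}} = \mathbf{Q}^H \mathbf{y} \qquad (5)$$

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s} \in \Lambda^{N_t}} \lVert \hat{\mathbf{y}} - \mathbf{R}\mathbf{s} \rVert^2 \qquad (6)$$

where $\mathbf{Q}$ is unitary and $\mathbf{R}$ is upper triangular. Starting at level $i = N_t$, (6) can be solved recursively as follows:

$$T_i(\mathbf{s}^{(i)}) = T_{i+1}(\mathbf{s}^{(i+1)}) + \lvert e_i(\mathbf{s}^{(i)}) \rvert^2$$

$$e_i(\mathbf{s}^{(i)}) = \hat{y}_i - \sum_{j=i}^{N_t} R_{ij} s_j$$

Here $\mathbf{s}^{(i)} = [s_i, \ldots, s_{N_t}]^T$, $T_i(\mathbf{s}^{(i)})$ is known as the Partial Euclidean Distance (PED), and we assume $T_{N_t+1}(\mathbf{s}^{(N_t+1)}) = 0$.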
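To make the triangularization and the level-by-level accumulation of the PED concrete, the sketch below computes a QR decomposition, rotates the received vector, and checks that the accumulated PED at the last level equals the full Euclidean distance. It assumes a square, full-rank channel matrix and uses modified Gram-Schmidt purely for readability; the function names are hypothetical, and a hardware preprocessing block would not necessarily use this method.

```python
import math

def qr_gram_schmidt(H):
    """QR-decompose a square, full-rank complex matrix H (list of rows).
    Returns (Q, R) with H = Q R, Q unitary and R upper triangular."""
    n = len(H)
    cols = [[H[r][c] for r in range(n)] for c in range(n)]
    q_cols = []
    R = [[0j] * n for _ in range(n)]
    for j in range(n):
        v = list(cols[j])
        for i, q in enumerate(q_cols):
            # Projection of the (already reduced) column onto q_i.
            R[i][j] = sum(qc.conjugate() * vc for qc, vc in zip(q, v))
            v = [vc - R[i][j] * qc for vc, qc in zip(v, q)]
        norm = math.sqrt(sum(abs(vc) ** 2 for vc in v))
        R[j][j] = complex(norm)
        q_cols.append([vc / norm for vc in v])
    Q = [[q_cols[c][r] for c in range(n)] for r in range(n)]
    return Q, R

def rotate(Q, y):
    """Return y_hat = Q^H y."""
    n = len(y)
    return [sum(Q[r][c].conjugate() * y[r] for r in range(n)) for c in range(n)]

def ped_recursion(R, y_hat, s):
    """Accumulate the PED from the last row (level N_t) up to the first row."""
    n = len(s)
    T = 0.0                                   # PED at the root is zero
    for i in range(n - 1, -1, -1):
        e_i = y_hat[i] - sum(R[i][j] * s[j] for j in range(i, n))
        T += abs(e_i) ** 2                    # non-negative increment
    return T

# Small 2x2 example with an arbitrary channel and a small perturbation.
H = [[1 + 1j, 0.5 - 0.2j], [0.3j, 2 - 1j]]
s = [1 + 1j, -1 + 1j]
offset = [0.1 - 0.05j, -0.2j]
y = [sum(H[r][c] * s[c] for c in range(2)) + offset[r] for r in range(2)]

Q, R = qr_gram_schmidt(H)
y_hat = rotate(Q, y)
direct = sum(abs(y[r] - sum(H[r][c] * s[c] for c in range(2))) ** 2 for r in range(2))
recursive = ped_recursion(R, y_hat, s)
```

Because $\mathbf{Q}$ is unitary for a square channel matrix, `direct` and `recursive` agree up to rounding, which is why the search can operate on $\mathbf{R}$ and $\hat{\mathbf{y}}$ alone.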
Maximum likelihood detectors are optimal and achieve better BER performance than suboptimal detectors such as linear and successive interference cancellation (SIC) detectors, but they are also the most complex detectors. An exhaustive-search detector solves (6) by trying all possible combinations of transmitted symbols. The complexity of the exhaustive detector grows exponentially as the number of antennas or the number of bits per constellation symbol increases. As an example, the number of possible transmitted symbol vectors for a 4x4 QPSK system is $2^8 = 256$, and for a 4x4 16-QAM system this number grows to $2^{16} = 65536$.
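The exponential growth is easy to see in a brute-force reference detector. The sketch below is an illustrative exhaustive search, not the detector used for comparison in the paper; the unnormalized QPSK points, the example channel matrix and the name `exhaustive_ml` are assumptions.

```python
import itertools

QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]   # unnormalized QPSK (assumed mapping)

def exhaustive_ml(H, y, constellation):
    """Try every symbol vector; return (best_vector, best_metric, candidates_tested)."""
    n_r, n_t = len(H), len(H[0])
    best, best_metric, tested = None, float("inf"), 0
    for cand in itertools.product(constellation, repeat=n_t):
        metric = sum(abs(y[r] - sum(H[r][c] * cand[c] for c in range(n_t))) ** 2
                     for r in range(n_r))
        tested += 1
        if metric < best_metric:
            best, best_metric = cand, metric
    return best, best_metric, tested

# Deterministic, invertible 4x4 example channel (chosen only for illustration).
H = [[1 if r == c else 0.1j * (r - c) for c in range(4)] for r in range(4)]
s_true = (1 + 1j, -1 - 1j, 1 - 1j, -1 + 1j)
y = [sum(H[r][c] * s_true[c] for c in range(4)) for r in range(4)]   # noise-free

s_hat, d_hat, tested = exhaustive_ml(H, y, QPSK)
```

For 4x4 QPSK the loop evaluates all 256 candidate vectors for every received vector; replacing `QPSK` with a 16-point constellation would raise this to 65536.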
C. Sphere Decoding Algorithm
The sphere decoding (SD) algorithm has its origins in the work published by Pohst as a method for computing lattice vectors of minimal length [2]. Its application as a decoder for multiple-antenna channels appears in [3], and a complex version of it appears in [4]. This algorithm can achieve near-optimal ML performance with reduced complexity. The fundamental idea in SD is to reduce the number of candidate symbol vectors that need to be considered in the search for the ML solution in (6) to those that lie inside a hypersphere of radius $r$ around the received point:

$$\lVert \hat{\mathbf{y}} - \mathbf{R}\mathbf{s} \rVert^2 < r^2 \qquad (9)$$

where $\mathbf{R}$ and $\hat{\mathbf{y}}$ are the upper-triangular channel factor and the rotated received vector obtained from the QR decomposition.
By introducing (9), which is called the sphere constraint (SC), we have changed the original problem into the new problem of solving (9), that is, finding candidate symbol vectors that satisfy the sphere constraint. This approach can significantly reduce the average search complexity, while the BER performance remains close to that of the exhaustive-search detector (ML performance). The throughput of this type of detector is variable and depends on the initial choice of the sphere radius $r$ and on the channel SNR level.
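The reason the sphere constraint can be checked level by level, rather than only on complete symbol vectors, can be written out in one line. The following is a short derivation sketch. It assumes that the PED $T_i$ at level $i$ is accumulated as a sum of non-negative per-level increments $|e_i|^2$, with $T_1$ equal to the full distance $\lVert \hat{\mathbf{y}} - \mathbf{R}\mathbf{s} \rVert^2$; the symbol names follow that assumption.

```latex
\begin{align*}
T_1(\mathbf{s}) &= \sum_{i=1}^{N_t} \lvert e_i \rvert^2
  \;\geq\; \sum_{i=k}^{N_t} \lvert e_i \rvert^2 = T_k(\mathbf{s}^{(k)})
  \quad \text{for every } k, \\
T_k(\mathbf{s}^{(k)}) \geq r^2
  &\;\Longrightarrow\; T_1(\mathbf{s}) \geq r^2
  \quad \text{for every leaf below that node.}
\end{align*}
```

Under this assumption, a node whose PED already violates the constraint can be discarded together with its entire subtree, which is where the reduction in average search complexity comes from.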
D. Radius Reduction
Radius reduction is the process of shrinking the sphere radius during the depth-first SD algorithm. The basic idea is that once a valid solution has been identified, we can update the sphere radius to the PED associated with this solution and continue the tree search with the updated radius. Without radius reduction, the order in which the children of a node are explored in a depth-first search has no influence on the overall complexity of the search, and the only parameter that determines the complexity is the initial sphere radius.
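Radius reduction can be illustrated with a small recursive depth-first decoder. The sketch below is a software illustration under stated assumptions, not the proposed hardware: it takes an upper-triangular `R` and a rotated vector `y_hat` as given, visits children in order of increasing PED, and starts from an infinite radius so that a first solution is always found. All names are hypothetical.

```python
import math

QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]   # unnormalized QPSK (assumed mapping)

def sphere_decode(R, y_hat, constellation, radius_sq=math.inf):
    """Depth-first sphere decoder with radius reduction.
    Returns (best_vector, best_metric, visited_nodes)."""
    n = len(y_hat)
    best = {"s": None, "d": radius_sq, "visited": 0}
    s = [0j] * n

    def search(i, ped):
        # Compute the PED of every child at row i, then visit them best-first.
        children = []
        for sym in constellation:
            s[i] = sym
            e = y_hat[i] - sum(R[i][j] * s[j] for j in range(i, n))
            children.append((ped + abs(e) ** 2, sym))
        children.sort(key=lambda c: c[0])
        for child_ped, sym in children:
            if child_ped >= best["d"]:        # sphere constraint fails: prune
                break                         # remaining children are no better
            best["visited"] += 1
            s[i] = sym
            if i == 0:                        # leaf: valid solution found,
                best["s"], best["d"] = list(s), child_ped   # shrink the radius
            else:
                search(i - 1, child_ped)

    search(n - 1, 0.0)
    return best["s"], best["d"], best["visited"]

# Example: upper-triangular R and a slightly perturbed observation.
R = [[2, 0.5 + 0.5j, -0.3j, 0.2],
     [0, 1.5, 0.4, -0.1j],
     [0, 0, 1.8, 0.3 + 0.1j],
     [0, 0, 0, 1.2]]
s_true = [1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j]
d = [0.05, -0.05j, 0.05j, -0.05]
y_hat = [sum(R[i][j] * s_true[j] for j in range(4)) + d[i] for i in range(4)]

s_hat, d_hat, visited = sphere_decode(R, y_hat, QPSK)
```

In this low-noise example the first leaf reached is already close to the received point, so the reduced radius prunes most of the 340 nodes of the full tree.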
E. Depth-First Tree Search
We can use tree search methods to solve (6). Depth-first tree search (DFTS) is an algorithm for traversing or searching a tree. It starts at the root of the tree ($i = N_t + 1$) and explores as far as possible along each branch before backtracking. Formally, DFTS is an uninformed search that progresses by expanding the first child node of the search tree that appears, going deeper and deeper until a goal node is found or until it hits a node that has no children. Then the search backtracks, returning to the most recent node it has not finished exploring. Fig. 1 shows depth-first tree search in a 2x2 QPSK system.
Figure 1 Depth-first tree search in 2x2 QPSK system
II. HARDWARE IMPLEMENTATION
The architecture of the proposed MIMO detector is shown in Fig. 2. The proposed VLSI architecture is designed based on the concept of the stack algorithm from software engineering, where stacks have long been used to traverse and search tree structures. In forward traverse mode (push), the PED values at each level of the tree are compared with the sphere radius. The search continues with the symbols that pass the test until it reaches the last level of the tree. After each SC test, the minimum of the PEDs that passed the test is found and sent directly to the pipeline, while the remaining PEDs and their associated symbols are stored in the stack memory. This saves one clock cycle and improves the overall throughput of the detector. The write operation to the stack (push) is simultaneous: the stack must be able to accept up to $P_c - 1$ entries at the same time.
If no PED passes the SC test, or if the algorithm reaches the first level of the tree ($i = 1$), a pop operation takes place and a node from a previous level of the tree is used to continue the forward traversal. The stack architecture is easily expandable, which is particularly useful when the number of antennas changes. Each submodule in the stack memory is associated with one level of the tree; if the number of antennas changes, we can simply turn off the extra submodules in the stack memory.
The stack architecture is parallel in nature; the key to a parallel implementation is the design of a multi-port stack memory that supports simultaneous write operations on all of its ports.
In Fig. 2, the PED_CU block computes the new PED values after each iteration. The stack read-write controller controls write and read access to the stack memory. It sorts the PED values before pushing them onto the stack, which guarantees that after a pop operation the search always continues with the node that has the smallest PED value.
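The push and pop behaviour described above can be mimicked in software to check the control flow. The following is a behavioural sketch only, under several assumptions: a single list stands in for the multi-port stack memory, the re-check of a popped PED against the reduced radius is an addition of this sketch, and every name (`stack_sphere_decode`, `pushes`, `pops`) is hypothetical. It does not model clock cycles or the actual hardware blocks.

```python
QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]   # unnormalized QPSK (assumed mapping)

def stack_sphere_decode(R, y_hat, constellation, radius_sq=float("inf")):
    """Stack-driven depth-first search. Returns (best_vector, best_metric, pushes, pops)."""
    n = len(y_hat)
    stack = []                        # entries: (undecided_levels, ped, decided_symbols)
    best_s, best_d = None, radius_sq
    pushes = pops = 0
    current = (n, 0.0, ())            # root: nothing decided yet, PED = 0
    while current is not None:
        level, ped, partial = current
        i = level - 1                 # 0-based row expanded in this step
        survivors = []
        for sym in constellation:     # all P_c children are evaluated together
            cand = (sym,) + partial   # cand[k] belongs to row i + k
            e = y_hat[i] - sum(R[i][i + k] * cand[k] for k in range(len(cand)))
            child_ped = ped + abs(e) ** 2
            if child_ped < best_d:    # sphere constraint test
                survivors.append((child_ped, cand))
        survivors.sort(key=lambda c: c[0])
        current = None
        if survivors and i == 0:
            best_d, best_s = survivors[0]          # leaf level: radius reduction
        elif survivors:
            # The smallest PED continues directly; the other (up to P_c - 1)
            # survivors are pushed largest-first so the smallest ends up on top.
            for child_ped, cand in reversed(survivors[1:]):
                stack.append((level - 1, child_ped, cand))
                pushes += 1
            current = (level - 1, survivors[0][0], survivors[0][1])
        while current is None and stack:           # pop when nothing survives
            lvl, p, cand = stack.pop()             # or a leaf has been reached
            pops += 1
            if p < best_d:            # assumption: skip entries outside the new radius
                current = (lvl, p, cand)
    return (list(best_s) if best_s is not None else None), best_d, pushes, pops

R = [[2, 0.5 + 0.5j, -0.3j, 0.2],
     [0, 1.5, 0.4, -0.1j],
     [0, 0, 1.8, 0.3 + 0.1j],
     [0, 0, 0, 1.2]]
s_true = [1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j]
d = [0.05, -0.05j, 0.05j, -0.05]
y_hat = [sum(R[i][j] * s_true[j] for j in range(4)) + d[i] for i in range(4)]

s_hat, d_hat, pushes, pops = stack_sphere_decode(R, y_hat, QPSK)
```

Sorting before the push is what lets a plain last-in-first-out pop return the best stored candidate of the most recently expanded level, without any search through the stack contents.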
The proposed architecture for the PED computation block (PED_CU) is shown in Fig. 3. The ECU block computes the increments in the PEDs, $e_i(\mathbf{s}^{(i)})$, for each of the constellation symbols. The architecture of the ECU block is easily expandable to support different modulation schemes. The role of the preprocessing block is to perform the QR decomposition of the channel matrix $\mathbf{H}$ and to compute $\mathbf{R}$ and $\hat{\mathbf{y}}$ based on (4) and (5).
III. RESULTS
The proposed VLSI architecture has been implemented in Verilog HDL and synthesized with 0.18 µm CMOS technology. Timing has been optimized for a 100 MHz clock. Fig. 4 shows the variation of average throughput with SNR. As can be seen, the average throughput increases as the SNR increases. Fig. 5 depicts the BER performance of this architecture. Table I shows key performance figures of the proposed architecture and compares them with some previously reported architectures. To allow comparison of core areas implemented in different CMOS technologies, all reported core areas are converted to gate equivalents (GE, defined as the total area divided by the area of a two-input drive-1 NAND gate). The proposed architecture improves on the core area and average throughput of the exhaustive-search detector. This is mainly due to the lower computational complexity of the sphere decoding algorithm and the highly optimized structure of the proposed architecture.
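The reported figures can be related to each other with a simple back-of-the-envelope calculation. The estimate below assumes that the detector processes one symbol vector at a time, that the 60 Mbps average throughput is obtained at the 100 MHz synthesis clock, and that throughput counts uncoded bits. The symbol $\bar{n}_{cyc}$ (average clock cycles per decoded vector) is introduced here for illustration and is not a figure reported in the paper.

```latex
\begin{align*}
\Phi &= \frac{N_t \, Q \, f_{clk}}{\bar{n}_{cyc}}
\quad\Longrightarrow\quad
\bar{n}_{cyc} = \frac{N_t \, Q \, f_{clk}}{\Phi}
= \frac{4 \cdot 2 \cdot 100 \times 10^{6}}{60 \times 10^{6}} \approx 13.3
\end{align*}
```

Under these assumptions, each 4x4 QPSK vector would take roughly 13 clock cycles on average, and because the search length depends on how quickly the radius shrinks, this average varies with SNR.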
IV. CONCLUSION
In this paper, a new VLSI architecture for the sphere decoding algorithm has been proposed. Owing to its parallel structure and the use of stack memory for data storage, it is easily reconfigurable for different modulation schemes and antenna configurations. Simulation results show that the proposed MIMO decoder achieves an average throughput of 60 Mbps at 20 dB SNR for QPSK modulation. Synthesis results using 0.18 µm CMOS standard cell technology show that the overall complexity of the proposed architecture is 18K GE.
