Maximum-likelihood (ML) decoding is a computationally intensive task for multiple-input multiple-output (MIMO) wireless channel detection. This paper presents a new graph-based algorithm that achieves near-ML performance for soft MIMO detection. Instead of using the traditional tree-search structure, we represent the search space of the MIMO signals with a directed graph and apply a greedy algorithm to compute the a posteriori probability (APP) for each transmitted bit. The proposed detector has two advantages: 1) it keeps a fixed throughput and has a regular and parallel datapath structure, which makes it amenable to high-speed VLSI implementation, and 2) it attempts to maximize the a posteriori probability by making the locally optimum choice at each stage with the hope of finding the global minimum Euclidean distance for every transmitted bit $x_k \in \{-1, +1\}$. Compared to the soft K-best detector, the proposed solution significantly reduces the complexity because sorting is not required, while still maintaining good bit error rate (BER) performance. The proposed greedy detection algorithm has been designed and synthesized for a 4 × 4 16-QAM MIMO system in a TSMC 65 nm CMOS technology. The detector achieves a maximum throughput of 600 Mbps with a 0.79 mm² core area.
INTRODUCTION
Multiple-input multiple-output (MIMO) communication systems have received tremendous attention because of their high spectral efficiency and near-capacity performance. New wireless standards, such as IEEE 802.11n, IEEE 802.16e WiMax, and 3GPP LTE, include MIMO techniques in combination with advanced outer channel codes such as low-density parity-check (LDPC) codes [3] and Turbo codes [1]. The main challenge of soft MIMO detection is to efficiently and accurately generate the log-likelihood ratios (LLRs) for the outer channel decoder. An exhaustive search with the ML criterion consumes enormous computing power and requires tremendous silicon resources on the chip, which makes it impractical for multiple-antenna systems with higher-order modulation schemes.
To reduce the exponential algorithmic complexity, several sub-optimal detection algorithms and their VLSI architectures have been proposed recently [4, 2, 10, 5, 7, 9]. Garrett et al. [4] implemented a depth-first soft sphere decoding (SD) algorithm with 256 search operations at each level of the tree. Burg et al. [2] presented a simplified hard sphere decoding ASIC architecture. On the other hand, Wong et al. [10] first introduced the breadth-first K-best hard detection algorithm. Later on, Guo et al. [5] extended it to soft K-best (K=5) detection by keeping a list of the best candidates at each search tree level.
The depth-first SD algorithm has non-deterministic complexity and variable throughput which makes it sensitive to the channel conditions. The depth-first SD with a small candidate list size suffers significant performance degradation due to the inaccurate and especially the infinite log likelihood ratios (LLRs). On the other hand, the K-best algorithm has advantages of fixed complexity and fixed throughput, which makes it more friendly for hardware implementation. However, when K is large, the complexity of the K-best algorithm dramatically increases because a large number of paths have to be extended and sorted.
In this paper, a greedy shortest-path search algorithm and its VLSI architecture are proposed for high-throughput soft MIMO detection. We transform the traditional MIMO detection problem into a shortest-path finding problem. By making a locally optimum choice at each stage, this algorithm tries to find the global minimum Euclidean distance for every transmitted bit. It therefore avoids the LLR clipping issues that both depth-first SD and K-best detectors have. Moreover, this approach is well suited to high-speed VLSI implementation because of its regular and parallel datapath structure.
SYSTEM MODEL
We consider a coded MIMO system with M transmit antennas and N receive antennas. The MIMO transmission can be modeled as: 
where $L_A$ and $L_E$ denote the a priori L-value and extrinsic L-value, respectively. Assuming there is no prior knowledge of the transmitted signal and using the max-log approximation [6], (2) can be simplified to
where $X_{k,+1} = \{x \mid x_k = +1\}$ and $X_{k,-1} = \{x \mid x_k = -1\}$. Using the QR decomposition $H = QR$, where $Q$ and $R$ are an N × M unitary matrix and an M × M upper triangular matrix, respectively, $\Lambda(s, y)$ can be computed as
where $\hat{y} = Q^H y$, and C is a constant (C = 0 if M = N), which does not affect (3).
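As a worked check of the QR step, and assuming that $\Lambda(s, y)$ denotes the squared Euclidean distance $\|y - Hs\|^2$ (an assumption, since only the symbol is named here), the decomposition and the value of C follow from the orthonormal columns of Q:

```latex
\begin{aligned}
\|\mathbf{y}-\mathbf{H}\mathbf{s}\|^2
  &= \|\mathbf{Q}^H(\mathbf{y}-\mathbf{Q}\mathbf{R}\mathbf{s})\|^2
   + \|(\mathbf{I}-\mathbf{Q}\mathbf{Q}^H)\,\mathbf{y}\|^2 \\
  &= \|\hat{\mathbf{y}}-\mathbf{R}\mathbf{s}\|^2 + C,
  \qquad C = \|\mathbf{y}\|^2-\|\hat{\mathbf{y}}\|^2, \\
\|\hat{\mathbf{y}}-\mathbf{R}\mathbf{s}\|^2
  &= \sum_{i=0}^{M-1}\Big|\hat{y}_i-\sum_{j=i}^{M-1}R_{ij}\,s_j\Big|^2 .
\end{aligned}
```

The first line splits the residual into its component inside the column space of Q and the component orthogonal to it; the orthogonal part does not depend on s, which is why C drops out of any difference of two minima. When M = N, Q is square and unitary, so C = 0. The last line uses only the upper triangular structure of R: row i involves the symbols of antennas i through M − 1, which is what allows the distance to be accumulated one antenna at a time.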
GREEDY DETECTION ALGORITHM
Without loss of generality, we use a 4 × 4 QPSK system to explain our proposed algorithm in this section.
Graph construction
The goal of the soft MIMO detector is to generate the LLR value for each transmitted bit $x_k$ based on (3), which requires the calculation of the minimum Euclidean distance
This process can be viewed using a flow graph which is shown in Figure 1 
Problem definition
Given a received, possibly noisy MIMO symbol, we associate the 1-D Euclidean distance with the weight on each edge in the graph, so that the problem of ML detection reduces to the problem of finding the minimum-weight path from the root to the toor in the graph.
Definition 1: Hard MIMO detection problem: Find the shortest path from the root to the toor. The vertices encountered along this path are then the detected MIMO signals.
Definition 2: Soft MIMO detection problem: For every vertex $v(t, i)$, find the shortest path from the root to the toor that must contain this vertex. The Q conditional shortest paths found at every stage $t$ make up a candidate list $L_t$. The L-value of bit $x_i^{<t>}$ is then calculated from the weights $\Lambda$ of the paths in $L_t$ as defined in (7).
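A minimal Python sketch of how edge weights along a root-to-toor path add up to the Euclidean distance. The QPSK alphabet, the function names, and the per-stage increment (taken here to be the row-wise term of $\|\hat{y} - Rs\|^2$) are illustrative assumptions; stage t is assumed to handle antenna M − 1 − t.

```python
# Illustrative sketch only: names, alphabet, and weight definition are assumptions.
QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]


def edge_weight(R, y_hat, path):
    """Weight of the last edge on `path`.

    `path` holds constellation indices for stages 0..t, i.e. for
    antennas M-1, M-2, ..., M-1-t (stage t <-> antenna M-1-t).
    """
    M = len(R)
    a = M - len(path)              # antenna (row of R) handled at this stage
    acc = y_hat[a]
    for stage, idx in enumerate(path):
        j = M - 1 - stage          # antenna decided at that earlier stage
        acc -= R[a][j] * QPSK[idx]
    return abs(acc) ** 2


def path_weight(R, y_hat, path):
    """Cumulative weight of a (partial or complete) path from the root."""
    return sum(edge_weight(R, y_hat, path[:t + 1]) for t in range(len(path)))


if __name__ == "__main__":
    # Small 2 x 2 example with an upper triangular R.
    R = [[2.0, 0.5 + 0.5j], [0.0, 1.5]]
    y_hat = [1 - 1j, 0.5 + 2j]
    print(path_weight(R, y_hat, (3, 0)))
```

For a complete path the cumulative weight equals $\|\hat{y} - Rs\|^2$, so the shortest root-to-toor path in this sketch is the ML solution of the hard detection problem.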
Greedy algorithm
The optimum solution to the hard detection problem requires full search over the entire graph whose complexity grows exponentially with the size of the graph. Solving the soft detection problem is even harder since it needs to repeatedly perform full search on the condition of every vertex being included in this shortest path. In this section, we will introduce a greedy shortest path algorithm to approximately solve the soft detection problem. Like the K-best algorithm, it takes decisions on the basis of information at hand without worrying about the effect these decisions may have in the future.
Step 1. Edge reduction
In Figure 1, each vertex $i$ at each stage $t$ (except for the first and last stages) is connected to Q vertices at the preceding stage and Q vertices at the following stage. Figure 2 shows the data flow at each vertex, which has Q incoming subpaths $h_0, \ldots, h_{Q-1}$ and Q outgoing subpaths $h'_0, \ldots, h'_{Q-1}$. To reduce the arithmetic complexity, a greedy algorithm is summarized as follows. Let the partial distance be $d_k$, which is the cumulative weight of the subpath $h_k$ from the root to this vertex $i$. Among the Q incoming subpaths, we select the best subpath $h_m$ with the minimum weight
$$m = \arg\min_{0 \le k \le Q-1} d_k \quad (8)$$
and discard the other Q − 1 subpaths. After choosing the best subpath $h_m$, the outgoing subpath to vertex $v(t + 1, k)$ is updated as
$$h'_k = \{h_m, v(t + 1, k)\}. \quad (9)$$
The outgoing path weight to vertex $v(t + 1, k)$ is updated as
$$d'_k = d_m + w(h'_k), \quad (10)$$
where the weight function $w(\cdot)$ is calculated based on the vertices along the subpath $h'_k$ according to (6). Moreover, among the Q outgoing subpaths we also find the shortest subpath $h'_n$, where
$$n = \arg\min_{0 \le k \le Q-1} d'_k. \quad (11)$$
This information is stored in memory for later use. Figure 3 shows an example of the resulting graph after applying the edge reduction operation. Note that only the surviving path for each vertex is shown in Figure 3. We can see that each vertex in stage 3 is along a path to the end point (toor). These paths are presumably the shortest and can be used to form the candidate list $L_3$. However, not every vertex in stages 0, 1, and 2 is along a path to the end point. To solve this issue, we need to perform a path extension operation.
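The edge reduction step can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' implementation: the QPSK alphabet, the function names, and the edge weight (taken to be the row-wise term of $\|\hat{y} - Rs\|^2$) are assumptions. Each vertex keeps only its minimum-weight incoming subpath, computes its Q outgoing weights from that survivor, and records its shortest outgoing subpath for later use.

```python
# Illustrative sketch only: names, alphabet, and weight definition are assumptions.
QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]


def edge_weight(R, y_hat, path):
    """Weight of the last edge on `path` (stage t <-> antenna M-1-t)."""
    M = len(R)
    a = M - len(path)
    acc = y_hat[a] - sum(R[a][M - 1 - t] * QPSK[i] for t, i in enumerate(path))
    return abs(acc) ** 2


def edge_reduction(R, y_hat, Q=4):
    """Return (survivors, best_out).

    survivors[t][i] = (weight, path) of the surviving subpath into v(t, i).
    best_out[t][i]  = (weight, path) of the shortest outgoing subpath of v(t, i),
                      i.e. the information saved to memory for path extension.
    """
    M = len(R)
    # Stage 0: every vertex has a single incoming edge from the root.
    survivors = [[(edge_weight(R, y_hat, (i,)), (i,)) for i in range(Q)]]
    best_out = []
    for t in range(M - 1):
        # outgoing[i][k]: survivor of v(t, i) extended to v(t + 1, k)
        outgoing = []
        for d_m, h_m in survivors[t]:
            row = []
            for k in range(Q):
                h = h_m + (k,)
                row.append((d_m + edge_weight(R, y_hat, h), h))
            outgoing.append(row)
        # Shortest outgoing subpath per vertex (stored for later).
        best_out.append([min(row) for row in outgoing])
        # Each next-stage vertex keeps only its best incoming subpath.
        survivors.append([min(outgoing[i][k] for i in range(Q)) for k in range(Q)])
    return survivors, best_out
```

In this sketch the surviving subpaths at the last stage play the role of the candidate list for that stage, while vertices at earlier stages may have no survivor passing through them to the end, which is the situation the path extension step addresses.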
Step 2. Path extension
A. Path extension for stage 2
In Figure 3, if we look at the vertices in stage 2, we see that not every vertex is along a path to the toor. For example, vertices 0 and 3 are disconnected from the toor. Therefore, we need to extend those incomplete paths. The path extension algorithm is summarized as follows: extend each subpath by checking all its Q outgoing edges, and select the edge that leads to the shortest cumulative subpath. Recall that in the edge reduction step, we saved the shortest outgoing subpath $h'_n$ for each vertex in memory. If we retrieve this information for each vertex in stage 2, we obtain the graph shown in Figure 5. Here the dotted lines represent the outgoing edges retrieved from memory. Now each vertex in stage 2 is along a path to the toor (presumably the shortest), so the candidate list $L_2$ can be created in this way.
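The path extension rule stated above can be sketched in Python as a greedy completion of one subpath. The QPSK alphabet, the function names, and the edge weight are illustrative assumptions. The sketch recomputes every level; when the subpath being extended is the survivor from edge reduction, the first level gives the same choice as the stored shortest outgoing subpath, so a memory read can replace it.

```python
# Illustrative sketch only: names, alphabet, and weight definition are assumptions.
QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]


def edge_weight(R, y_hat, path):
    """Weight of the last edge on `path` (stage t <-> antenna M-1-t)."""
    M = len(R)
    a = M - len(path)
    acc = y_hat[a] - sum(R[a][M - 1 - t] * QPSK[i] for t, i in enumerate(path))
    return abs(acc) ** 2


def extend_path(R, y_hat, d, path, Q=4):
    """Greedily extend a subpath (weight d, vertices `path`) to the last stage.

    At each level all Q outgoing edges are checked and the one giving the
    shortest cumulative subpath is kept.
    """
    M = len(R)
    while len(path) < M:
        candidates = []
        for k in range(Q):
            h = path + (k,)
            candidates.append((d + edge_weight(R, y_hat, h), h))
        d, path = min(candidates)
    return d, path
```

Because each level keeps a single edge, the extended path is not guaranteed to be the true conditional shortest path; it is a greedy approximation of it.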
B. Path extension for stage 1
Similarly, in Figure 3, not every vertex in stage 1 is along a path to the toor. By retrieving the shortest outgoing subpath from memory for each vertex in stage 1, we can extend these subpaths by one level as shown in Figure 6(a), where the four extended subpaths are labeled A, B, C, and D. It is worth mentioning that the first step of the path extension operation needs only a memory read and no re-computation. The second-level path extension, however, requires re-computing and comparing the subpath weights as described in the path extension algorithm. The result after applying the second-level path extension is shown in Figure 6(b). Now the subpaths {A, B, C, D} have been fully extended and can be used to form the list $L_1$.
Step 3. LLR calculation
After all the candidate lists $L_t$ for 0 ≤ t ≤ M − 1 have been created, the LLR calculation defined in (7) is straightforward.
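A hedged Python sketch of a max-log LLR computed from a candidate list. The sign convention (positive favors bit +1), the 1/(2σ²) scaling, and all names are assumptions based on the common max-log form, not a transcription of (7).

```python
# Illustrative sketch only: sign convention, scaling, and names are assumptions.
def max_log_llr(candidates, k, sigma2):
    """Max-log LLR of bit k from a candidate list.

    candidates: iterable of (path_weight, bits), where bits is a tuple of
                +1/-1 values carried by that candidate path.
    Raises ValueError if one bit hypothesis has no candidate, which is the
    situation that forces LLR clipping in list-based detectors.
    """
    d_plus = min(d for d, bits in candidates if bits[k] == +1)
    d_minus = min(d for d, bits in candidates if bits[k] == -1)
    return (d_minus - d_plus) / (2.0 * sigma2)


if __name__ == "__main__":
    cands = [(1.0, (+1, -1)), (3.0, (-1, -1)), (2.5, (+1, +1))]
    print(max_log_llr(cands, 0, sigma2=0.5))
    print(max_log_llr(cands, 1, sigma2=0.5))
```

Since the list for stage t holds one path through every vertex of that stage, both hypotheses of each bit mapped to that stage are represented, so the two minima in this sketch always exist for those bits.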
Algorithm complexity analysis
In the proposed greedy algorithm, calculating the partial Euclidean distance is the major contributor to the total arithmetic complexity. We consider only this part in the complexity analysis and ignore the other minor contributors such as minimization and memory operations. More generally, consider a system with M transmit antennas and size-Q QAM modulation. In the edge reduction operation, the number of subpath weights that need to be calculated per stage is $Q^2$, so the total complexity is $O(MQ^2)$. On the other hand, the complexity of the path extension operation is O(
. In a practical MIMO system, M is usually not very large. In the case of M = 4, the total arithmetic complexity of this greedy algorithm is approximately $O(7Q^2)$.
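One accounting that reproduces the $7Q^2$ figure for M = 4 is sketched below. It is an inference from the description of the two steps, not a formula taken from the text, and it assumes that every recomputed extension level costs $Q^2$ weight evaluations (Q subpaths times Q outgoing edges) while levels served by a memory read cost none.

```latex
\begin{aligned}
\text{edge reduction:}\quad & M \cdot Q^2 = 4Q^2, \\
\text{path extension:}\quad & \underbrace{0}_{\text{stage } 2}
   + \underbrace{1 \cdot Q^2}_{\text{stage } 1}
   + \underbrace{2 \cdot Q^2}_{\text{stage } 0} = 3Q^2, \\
\text{total:}\quad & 4Q^2 + 3Q^2 = 7Q^2 .
\end{aligned}
```

Under the same assumption, stage t needs M − 2 − t recomputed levels, so the extension cost grows with the square of M while remaining quadratic in Q.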
SIMULATION RESULTS
Like the K-best algorithm, the proposed greedy algorithm has a deterministic complexity and a fixed throughput. To evaluate the decoding BER performance, we have compared the proposed greedy algorithm with the traditional K-best algorithm. We consider 4 × 4 16-QAM and 64-QAM MIMO systems (the channel matrices are assumed to have independent Rayleigh fading distribution). In the simulation, the soft-output of the detector is fed to a length 2304, rate 1/2 WiMax LDPC decoder [8] , which performs up to 15 iterations. Figure 8 compares the BER performance for the proposed MIMO detector. For the 4 × 4 16-QAM system, our detector outperforms the K-best detector for K=16 and 32, and achieves similar performance compared with K=64. For the 4 × 4 64-QAM system, our detector outperforms the K-best detector with K=32, 48 and 64. 
VLSI ARCHITECTURE DESIGN
In this section, we describe the hardware architecture design for a 4 × 4 16-QAM MIMO system. The greedy algorithm introduced in Section 3 is generic and can be easily extended to higher-order modulation systems; for example, in the case of 16-QAM, there are 16 vertices instead of 4 at each stage. Figure 9 shows the top-level hardware architecture for implementing the proposed greedy detection algorithm. It contains four major units: Edge Reduction Unit (ERU), Memory Module, Path Extension Unit (PEU), and LLR Calculation Unit (LCU).
Edge reduction unit (ERU)
We define the subpath metric (SM) for vertex $v(t, i)$ as the cumulative path weight (or partial Euclidean distance) up to this vertex. In the edge reduction step, each vertex compares the Q incoming SMs, selects the edge with the minimum SM, and prunes the other Q − 1 edges. The Q outgoing SMs are then computed by adding the corresponding edge weight to the surviving incoming subpath weight and are sent to the downstream vertices. Figure 10 illustrates the ERU architecture, where VPU stands for Vertex Processing Unit and CS stands for Compare and Select. The architecture is partially parallel, with Q vertices processed simultaneously, and recursive, with the same logic reused for different stages. Note that there are two CS units: CS-A and CS-B. The CS-A unit selects the minimum incoming SM and passes the survivor to the VPU in the next iteration. The CS-B unit selects the shortest outgoing SM for each vertex and saves it to memory for the path extension operation. Figure 11 and Figure 12 show the VPU and CS architectures, respectively. The VPU is used to calculate the outgoing SMs (partial Euclidean distances). At
where $S_i$ is the complex constellation point for antenna $i$; the antennas are numbered 3, 2, 1, and 0, which correspond to stages 0, 1, 2, and 3, respectively (… is initialized to 0). The SADD in Figure 11 stands for shift and add, which is used to implement R * S.
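A small Python sketch of how a shift-and-add block can replace a multiplier for R * S. It assumes the 16-QAM real and imaginary components take the unnormalized integer levels {−3, −1, +1, +3} and that R is held as a fixed-point integer; both the level set and the function names are illustrative assumptions, since the internal structure of the SADD unit is not described here.

```python
# Illustrative sketch only: integer levels and function names are assumptions.
def sadd_mul(r, s):
    """Multiply fixed-point integer r by s in {-3, -1, 1, 3} without a multiplier."""
    if abs(s) not in (1, 3):
        raise ValueError("s must be one of -3, -1, 1, 3")
    mag = r if abs(s) == 1 else (r << 1) + r   # 3r = 2r + r: one shift, one add
    return mag if s > 0 else -mag


def sadd_cmul(r_re, r_im, s_re, s_im):
    """Complex product (r_re + j*r_im) * (s_re + j*s_im) built from sadd_mul."""
    re = sadd_mul(r_re, s_re) - sadd_mul(r_im, s_im)
    im = sadd_mul(r_re, s_im) + sadd_mul(r_im, s_re)
    return re, im


if __name__ == "__main__":
    print(sadd_cmul(5, -2, 3, -1))
```

Because the constellation levels are small odd integers, each real product needs at most one shift, one addition, and a conditional negation, which is considerably cheaper in silicon than a general multiplier.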
Path extension unit (PEU)
The function of the PEU is to extend the subpaths obtained in the edge reduction step in a greedy and recursive fashion. Both memory read and path re-computation are required for stages 0 and 1. Only memory read is required for stage 2. Figure 13 shows the PEU architecture, which contains 16 VPUs and 16 CSs such that 16 subpaths can be
LLR calculation unit (LCU)
After obtaining the list $L_t$ for 0 ≤ t ≤ M − 1, the LCU implements (7) in a straightforward way. The detailed implementation is omitted.
Hardware scheduling
Figure 14 shows the timing diagram for a 4 × 4 16-QAM MIMO system. Each stage of the edge reduction takes three cycles to finish, so it takes 3 × 4 = 12 cycles to perform the edge reduction operation. The path extension operation for antenna 3 can start 6 cycles later and will take 6 cycles to finish. The path extension operation for antenna 2 will take another 3 cycles. So the total latency is 15 cycles. Taking into account the two extra cycles of latency for LLR generation, the total decoding latency for one MIMO symbol is 17 cycles. In terms of throughput, because two consecutive MIMO symbols can be overlapped as shown in Figure 14, the decoding throughput for a 4 × 4 16-QAM MIMO system is
$$\frac{M \times M_c \times f_{clk}}{\text{Cycle count}} = \frac{4 \times 4 \times f_{clk}}{12} = \frac{4}{3} f_{clk}. \quad (13)$$
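The throughput expression can be checked numerically with a short Python sketch. The function and parameter names are illustrative; the inputs are the values stated in the text (M = 4 antennas, 4 bits per 16-QAM symbol, 12 cycles between overlapped MIMO symbols, 17 cycles of latency) together with the 450 MHz clock reported for the implementation.

```python
# Illustrative sketch only: function and parameter names are assumptions.
def throughput_bps(num_tx, bits_per_symbol, f_clk_hz, cycles_per_symbol):
    """Detected bits per second when one MIMO symbol completes every
    `cycles_per_symbol` clock cycles."""
    return num_tx * bits_per_symbol * f_clk_hz / cycles_per_symbol


def latency_s(latency_cycles, f_clk_hz):
    """Decoding latency of one MIMO symbol in seconds."""
    return latency_cycles / f_clk_hz


if __name__ == "__main__":
    f_clk = 450e6
    print(throughput_bps(4, 4, f_clk, 12) / 1e6, "Mbps")
    print(latency_s(17, f_clk) * 1e9, "ns")
```

With these inputs the throughput evaluates to 600 Mbps, in agreement with the reported maximum throughput.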
VLSI IMPLEMENTATION RESULT
A 4 × 4 16-QAM soft MIMO detector has been synthesized (using Synopsys Design Compiler) and placed and routed (using Cadence SoC Encounter) for a TSMC 65 nm CMOS technology. Figure 15 shows the VLSI layout view of the proposed MIMO detector. The fixed-point bit precision for $R$ and $\hat{y}$ is 10 bits. The LLR outputs are represented in 7 bits. Based on the fixed-point simulation results, the finite word-length implementation leads to negligible performance degradation compared with the floating-point representation. The maximum achievable clock frequency is 450 MHz based on the post-layout simulation. The corresponding maximum throughput is 600 Mbps.
Figure 15: VLSI layout photo (blocks labeled ERU, PEU, LCU, MEM)
Table 1 compares the detection throughput and hardware complexity of the proposed detector with two state-of-the-art detectors from the literature: the depth-first soft sphere detector with 256 search operations from [4], and the soft K-best detector from [5]. In [5], a real QR decomposition is used with a small K=5. Based on the simulation results in Figure 8, our solution has better BER performance than [5] and can achieve a higher throughput because we avoid the sorting operation, which is very expensive in hardware. On the other hand, at the cost of more hardware resources, the depth-first detector in [4] has better BER performance than our solution. However, [4] has limited throughput because of the large number of sequential search operations, and its throughput varies with the SNR level. Our architecture provides a good solution in between the depth-first detector and the K-best detector.

CONCLUSION
We propose a new soft-output MIMO detector architecture based on a greedy graph algorithm. This detector can achieve a very high throughput of 600 Mbps at a hardware cost of only 550 K gates. Compared with other solutions, the proposed detector has a significant improvement in terms of detection throughput, latency, and area while still maintaining good bit error rate (BER) performance.
