A list sphere detector (LSD) is an enhancement of a sphere detector (SD) that can be used to approximate the optimal MAP detector. In this paper, we introduce a novel architecture for the increasing radius (IR)-LSD algorithm, which is based on Dijkstra's algorithm. The presented architecture exploits parallelism and is scalable to different multiple-input multiple-output (MIMO) systems. The novel architecture is implemented on a Virtex-IV field programmable gate array (FPGA) chip using the Catapult C Synthesis tool from Mentor Graphics, which is based on the high-level ANSI C++ language. The used word lengths, the latency of the design, and the required resources are presented and analyzed for a 4 × 4 MIMO system with 16-quadrature amplitude modulation (QAM). The detector implementation achieves a maximum throughput of 12.1 Mbps at high signal-to-noise ratio (SNR).
INTRODUCTION
Multiple-input multiple-output (MIMO) channels offer improved capacity and significant potential for improved reliability compared to single antenna channels. A sphere detector (SD) calculates the hard-output maximum likelihood (ML) solution with reduced complexity compared to full-complexity ML detectors [1]. A list sphere detector (LSD) [2] is a variant of the sphere detector that can be used to approximate, with lower complexity, a MAP detector, which is the optimal detector for forward error coded (FEC) systems [2, 3, 4].
The sphere algorithms are often divided into breadth-first search algorithms, such as the K-best algorithm [4], and sequential search algorithms, such as the Schnorr-Euchner enumeration (SEE) based algorithms [3]. The architecture design is important for an efficient implementation, and architecture solutions with different levels of parallelism have been introduced for the most common K-best and SEE sphere algorithms, e.g., in [3, 4, 5]. In this paper, we consider a sequential search algorithm, namely the increasing radius (IR)-LSD algorithm, which is a modification of Dijkstra's algorithm [6] to an LSD algorithm and is optimal in the sense of the number of visited nodes in the sphere search tree structure [6, 7]. We identify and introduce the key functional units of the IR-LSD algorithm, and design a novel, highly parallel and scalable architecture for the algorithm.

* This work was done in the MITSE project, which was supported by Elektrobit, Nokia, Nokia-Siemens Networks, Texas Instruments and the Finnish Funding Agency for Technology and Innovation, Tekes. The authors would like to thank Mentor Graphics for the possibility to evaluate the Catapult C Synthesis tool.
The designed architecture is implemented on a Virtex-IV field programmable gate array (FPGA) chip for a 4 × 4 MIMO system with 16-quadrature amplitude modulation (QAM). The implementation is written in high-level ANSI C++ and synthesized entirely through Mentor Graphics' Catapult C Synthesis tool to produce the resulting RTL. We present the complexity and latency of the implementation and describe the major challenges.
The paper is organized as follows. MIMO signal detection and the IR-LSD algorithm are presented in Section II. The designed architecture is presented in Section III. The implementation of the algorithm is presented in Section IV. The conclusions are drawn in Section V.
MIMO SIGNAL DETECTION
A narrowband system with $N_T$ transmit and $N_R$ receive antennas is considered, with the assumption $N_R \geq N_T$ and a QAM constellation. The received signal can be expressed in the real domain as [1]

$$y = Hx + \eta, \qquad (1)$$

where the received signal vector $y \in \mathbb{R}^{2N_R \times 1}$, the transmit symbol vector $x \in \Omega_R^{2N_T} \subset \mathbb{R}^{2N_T \times 1}$, and the Gaussian noise vector $\eta \in \mathbb{R}^{2N_R \times 1}$ are defined in the frequency domain. The channel matrix $H \in \mathbb{R}^{2N_R \times 2N_T}$ contains real Gaussian coefficients with unit variance. The complex QAM constellation $\Omega$ is transformed into a real symbol alphabet $\Omega_R \subset \mathbb{Z}$ with $Q_R$ bits per symbol, e.g., $\Omega_R = \{-3, -1, 1, 3\}$ in the case of 16-QAM. We assume a practical case of a system with FEC and with a separate soft-output detector and decoder at the receiver, where the detector generates soft output information $L_{D1}(b_k)$ of each transmitted bit $b_k$ [2].
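The mapping from a complex-valued model to the real-valued model used above can be sketched as follows. This is a minimal illustration, assuming the common stacking of real parts over imaginary parts; the source only cites [1] for the real model, so the ordering and the function names `to_real_model` and `matvec` are assumptions made here.

```python
def to_real_model(H, y):
    """Map a complex NR x NT channel (nested lists) and a complex
    received vector to a real 2NR x 2NT channel and a real 2NR vector.
    Assumed stacking: [Re; Im] for vectors, [[Re, -Im], [Im, Re]] for H."""
    top = [[h.real for h in row] + [-h.imag for h in row] for row in H]
    bottom = [[h.imag for h in row] + [h.real for h in row] for row in H]
    y_real = [v.real for v in y] + [v.imag for v in y]
    return top + bottom, y_real


def matvec(A, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


# Illustrative 2 x 2 example with 16-QAM symbols (values are made up).
H = [[1 + 2j, 0.5 - 1j], [-0.3 + 0.1j, 2 - 0.5j]]
x = [3 - 1j, -1 + 3j]
y = [sum(h * s for h, s in zip(row, x)) for row in H]  # noiseless y = Hx

H_real, y_real = to_real_model(H, y)
x_real = [s.real for s in x] + [s.imag for s in x]  # entries lie in {-3, -1, 1, 3}
y_check = matvec(H_real, x_real)  # should reproduce y_real
```

Each complex 16-QAM symbol becomes two real symbols from the alphabet {−3, −1, 1, 3}, which is why the real search tree has $2N_T$ layers.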
The SDs find the ML solution of $x$ with reduced complexity compared to exhaustive search algorithms. The sphere search limits the search to the points inside a $2N_R$-dimensional hyper-sphere $S(y, \sqrt{C_0})$ centered at $y$. After QR decomposition (QRD) of the channel matrix $H$, the condition can be written as [1]

$$\|\tilde{y} - Rx\|^2 \leq C_0, \qquad (2)$$

where $C_0$ is the squared radius of the sphere, $R \in \mathbb{R}^{2N_R \times 2N_T}$ is an upper triangular matrix with positive diagonal elements, $Q \in \mathbb{R}^{2N_R \times 2N_R}$ is an orthogonal matrix, and $\tilde{y} = Q^T y$. Due to the upper triangular form of $R$, the values of $x$ can be solved from (2) level by level using the back-substitution algorithm. Let $x_i^{2N_T} = (x_i, x_{i+1}, \ldots, x_{2N_T})^T$ denote the last $2N_T - i + 1$ components of the vector $x$. The squared partial Euclidean distance (PED) of $x_i^{2N_T}$ can be calculated as [3]

$$d(x_i^{2N_T}) = d(x_{i+1}^{2N_T}) + \left| b_{i+1}(x_{i+1}^{2N_T}) - R_{i,i} x_i \right|^2, \qquad (3)$$

where $b_{i+1}(x_{i+1}^{2N_T}) = \tilde{y}_i - \sum_{j=i+1}^{2N_T} R_{i,j} x_j$, $R_{i,j}$ is the $(i, j)$th term of $R$, and $i = 2N_T, \ldots, 1$. Depending on the search strategy and the channel realization, the SD searches a variable number of nodes in the tree structure, and aims to find the point $x = x_1^{2N_T}$, also called a leaf node, for which the ED $d(x_1^{2N_T})$ is minimum.
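The layer-by-layer PED accumulation can be illustrated with a short sketch. It assumes the usual split of the per-layer term into a part that depends only on the already fixed symbols and a part that depends on the new symbol; the function name `ped_update`, the 0-based indexing, and the numeric values are illustrative assumptions, not taken from the source.

```python
def ped_update(d_parent, y_tilde, R, x, i):
    """Return the PED after adding layer i (0-based row index).

    d_parent is the PED of the symbols in rows i+1 .. n-1, and x must
    already hold the symbols for rows i .. n-1.
    """
    n = len(x)
    # Part that is independent of the new symbol x[i].
    b = y_tilde[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))
    # Add the squared residual of row i.
    return d_parent + (b - R[i][i] * x[i]) ** 2


# Illustrative 3-layer example with an upper triangular R.
R = [[2.0, 1.0, -1.0],
     [0.0, 1.5, 0.5],
     [0.0, 0.0, 1.0]]
y_tilde = [0.5, -2.0, 2.5]
x = [1, -3, 3]

d = 0.0
for i in reversed(range(len(x))):  # bottom row of R first, as in back-substitution
    d = ped_update(d, y_tilde, R, x, i)
```

After the last layer, `d` equals the full squared Euclidean distance between `y_tilde` and `R x`, so a leaf node's PED is its ED.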
The list sphere detector (LSD) [2] can be used for obtaining a list of candidates of the transmitted symbol vectors and the corresponding EDs, $L \in \mathbb{Z}^{N_{cand} \times 2N_T}$, as an output, where $N_{cand}$ is the size of the candidate list so that $1 \leq N_{cand} \leq 2^{Q_R 2N_T}$. The LSD output candidate list can then be used to approximate the log-likelihood ratio (LLR) of the transmitted data as a soft output.

The increasing radius (IR)-LSD algorithm is listed as Algorithm 1. The algorithm operates in a sequential fashion starting from the root layer, and extends the partial candidate $s = x_{i+1}^{2N_T}$ with the next best admissible node $x_i$. The father candidate $s_f = x_{i+2}^{2N_T}$ is also extended if an admissible node $x_{i+1}$ exists. The variables $n_1$ and $n_2$ indicate the order number of the next best node, i.e., how many nodes have been checked, for the child and father candidates, respectively. The algorithm uses two memory sets: the final candidate memory $L$, which holds $N_{cand}$ candidates, and the partial candidate memory $S$, whose size depends on the number of algorithm iterations, i.e., while loop repetitions. The partial candidate information $N(s, d(s), n_2, i)$ stored to $S$ includes the partial candidate, the PED, the number of extended father nodes, and the current layer, respectively. The information stored to the final list $L$ includes only the candidate and the ED, and $C_0$ is updated according to the largest ED in $L$. After each iteration, the algorithm continues with the partial candidate $N$ with the minimum PED from $S$ as long as $d(s) < C_0$.
Algorithm 1. IR-LSD algorithm.
 1: Initialize sets $S$ and $L$, and set ...
    ...
 4: Determine the $n_1$th best node $x_i$ for $s_c = (x_i, x_{i+1}^{2N_T})^T$ and calculate $d(s_c)$
 5: Determine the $n_2$th best node $x_{i+1}$ for father candidate ...
 6: if $s_c$ is a leaf node then
    ...
    end if
    ...
18: Continue with $N$ with min PED from $S$ and set $n_1 = $ ...
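The search strategy can be illustrated with a compact software sketch. It is a simplified Dijkstra-style best-first list search in the spirit of Algorithm 1, not a line-by-line transcription: leaf nodes pass through the same min-heap as partial candidates instead of being written directly to $L$, so the radius $C_0$ stays implicit and the search stops once $N_{cand}$ leaves have been extracted. The names `nth_best` and `list_search`, the 0-based ranks, and the numeric values are assumptions for illustration.

```python
import heapq


def nth_best(y_t, R, parent, d_parent, rank, alphabet):
    """Return (ped, symbols) for the rank-th best child of `parent`
    (rank 0 is the best), or None if all children are used up.
    `parent` is a tuple holding the symbols of the lower rows of R."""
    if rank >= len(alphabet):
        return None
    n = len(R)
    i = n - len(parent) - 1  # row of the new symbol
    b = y_t[i] - sum(R[i][i + 1 + k] * s for k, s in enumerate(parent))
    ranked = sorted(alphabet, key=lambda s: (b - R[i][i] * s) ** 2)
    sym = ranked[rank]
    return d_parent + (b - R[i][i] * sym) ** 2, (sym,) + parent


def list_search(y_t, R, alphabet, n_cand):
    """Return the n_cand leaves with the smallest ED as (ed, symbols)."""
    n = len(R)
    heap = []  # plays the role of the partial candidate memory S
    out = []   # plays the role of the final candidate list L
    ped, s = nth_best(y_t, R, (), 0.0, 0, alphabet)
    heapq.heappush(heap, (ped, s, 0, 0.0))
    while heap and len(out) < n_cand:
        ped, s, rank, d_parent = heapq.heappop(heap)
        # Extend the father candidate with its next best node.
        sib = nth_best(y_t, R, s[1:], d_parent, rank + 1, alphabet)
        if sib is not None:
            heapq.heappush(heap, (sib[0], sib[1], rank + 1, d_parent))
        if len(s) == n:
            out.append((ped, s))  # leaf: PED equals ED
        else:
            # Extend the candidate itself with its best child.
            child = nth_best(y_t, R, s, ped, 0, alphabet)
            heapq.heappush(heap, (child[0], child[1], 0, ped))
    return out


ALPHABET = (-3, -1, 1, 3)
R = [[2.0, 0.5, -1.0],
     [0.0, 1.5, 0.7],
     [0.0, 0.0, 1.2]]
y_t = [1.3, -2.1, 3.0]
candidates = list_search(y_t, R, ALPHABET, n_cand=4)
```

Because a child never has a smaller PED than its parent and a next-best sibling never has a smaller PED than the current one, nodes leave the heap in nondecreasing PED order, so the first extracted leaves are the best ones. Each iteration creates at most two new nodes, matching the one-or-two nodes per iteration described for the IR-LSD.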
ARCHITECTURE
A list sphere detector consists of the QRD, the LSD algorithm, and the log-likelihood ratio (LLR) calculation units. The QRD unit decomposes the channel matrix H into R and Q, which are given as an input with y to the LSD algorithm. The LSD algorithm unit executes the sphere tree search and determines the output candidate list L.
The approximation of $L_D(b_k)$ is calculated in the LLR calculation unit using the candidate list $L$. In this paper, we focus on the architecture of the IR-LSD algorithm.
The architecture for the IR-LSD algorithm is shown in Figure 1, and it consists of two SEE and PED units, a partial candidate memory unit, a final candidate memory unit, and a logic unit. The SEE and PED units define and extend the selected partial candidate and its father node with the next best admissible nodes and calculate the PEDs of the updated candidates. The partial candidate memory unit is used to store the already extended partial candidates, while the leaf candidates are stored to the final candidate memory unit. The logic unit defines the candidate(s) to be extended and stored in the next algorithm iteration.
The IR-LSD algorithm architecture, which is presented in Figure 1, is designed to have as much parallel processing as possible to decrease the overall latency of one algorithm iteration. In one algorithm iteration, the algorithm studies one or two new nodes of the sphere search tree, depending on whether the father node is extendable. The two SEE and PED units are designed to execute lines 4 and 5 of the algorithm description in parallel. The control logic unit executes the logic in lines 6-18, and defines the candidates to be stored and the candidate to be extended in the next iteration. This means that the candidates extended in iteration round 1 are stored to the memory in iteration round 2. The storing of the partial candidates in lines 12 and 16 to the partial memory unit and the storing of the final candidate in line 8 to the final memory unit are then done in parallel with the SEE and PED units. Thus, the total latency of one algorithm iteration is equal to the latency of the slowest parallel unit plus the latency of the control unit. The total latency of one signal vector detection process, i.e., one algorithm run, depends on the required number of algorithm iterations, which is in turn related to the number of checked nodes in the search tree. The required number of iterations depends on the system configuration and the channel environment.
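The latency relation stated above can be written as a small model. The cycle counts below are purely hypothetical placeholders chosen for illustration; they are not the measured latencies of the implementation, and the function names are assumptions.

```python
def iteration_latency(see_ped_cc, partial_mem_cc, final_mem_cc, control_cc):
    """Clock cycles of one iteration: the slowest of the units that run
    in parallel, plus the control logic that runs after them."""
    return max(see_ped_cc, partial_mem_cc, final_mem_cc) + control_cc


def detection_latency(iterations, **unit_cc):
    """Clock cycles to detect one signal vector."""
    return iterations * iteration_latency(**unit_cc)


# Hypothetical unit latencies in clock cycles (illustration only).
example_units = dict(see_ped_cc=10, partial_mem_cc=7, final_mem_cc=5, control_cc=3)
per_iteration = iteration_latency(**example_units)   # max(10, 7, 5) + 3
per_vector = detection_latency(9, **example_units)   # 9 iterations
```

In this model, speeding up a memory unit that is not the slowest parallel unit does not shorten the iteration, which is why the slowest unit is the natural optimization target.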
SEE and PED Unit
There are two similar SEE and PED units in the IR-LSD architecture, operating in parallel as shown in Figure 1. Both units are not needed in every iteration, but in practice both units are occupied more than 90% of the time. Each SEE and PED unit is divided into two subunits as shown in Figure 1. The first subunit calculates $b_{i+1}(x_{i+1}^{2N_T})$, which is the part of the PED calculation that is independent of the new symbol $x_i$, as in (3). The subunit can be implemented with different levels of parallelism to speed up the multiplication (MUL) operations. The number of required multiplications in the calculation of $b_{i+1}(x_{i+1}^{2N_T})$ is $i_{max} - i$, where $i$ is the current layer in the search tree and $i_{max} = 2N_T$. It should be noted that the average layer in the search process is less than half of the tree height, $N_T$, because a larger share of the search process is done in the upper part of the tree.
The second subunit executes the SEE, i.e., determines the $n_1$th best node $x_i$, and calculates the PED of the extended candidate accordingly. The SEE is done in a slightly modified fashion from that presented in (14) in [1]. Instead of performing the costly and high-latency division operation, we evaluate (3) with the $|\Omega_R|$ different symbols $x_i$, which can be implemented with 1 to $|\Omega_R|$ parallel MUL and subtraction (SUB) operations. Then the $|\Omega_R|$ values are sorted and the PED is calculated with the $n_1$th best node. Alternatively, the architecture could be designed to first determine the symbol with minimum PED and then determine the $n_1$th best node using logic in the same way as presented in [1].
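The division-free enumeration described above can be sketched in software as follows: the PED increment is evaluated for every symbol of the real alphabet and the results are sorted. The function names and the example values of `b` and `r_ii` are illustrative assumptions; in hardware, the per-symbol evaluations would run in parallel rather than in a loop.

```python
OMEGA_R = (-3, -1, 1, 3)  # real alphabet for 16-QAM


def see_order(b, r_ii, alphabet=OMEGA_R):
    """Return [(increment, symbol), ...] sorted by the PED increment
    (b - r_ii * x)^2, without dividing b by r_ii."""
    costs = [((b - r_ii * x) ** 2, x) for x in alphabet]
    return sorted(costs)


def nth_best_node(b, r_ii, n1):
    """Return (symbol, increment) of the n1-th best node, n1 counted from 1."""
    increment, symbol = see_order(b, r_ii)[n1 - 1]
    return symbol, increment


# Example: b / r_ii = 0.65, so the closest symbol is 1, then -1, 3, -3.
order = [symbol for _, symbol in see_order(b=1.3, r_ii=2.0)]
second_best, _ = nth_best_node(b=1.3, r_ii=2.0, n1=2)
```

The resulting order is the same zig-zag sequence around the unconstrained solution that a division-based Schnorr-Euchner enumeration would produce, but it is obtained with multiplications, subtractions, and a small sort only.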
Memory Units
The memory units are designed as binary heap data structures [8], which keep the stored elements in order according to a selected criterion. The partial candidate memory set $S$ is implemented as a min-heap, where the stored elements $N(s, d(s), n_2, i)$ are ordered so that the candidate with the minimum PED is always at the top of the heap. The final memory set $L_F$ is implemented as a max-heap, where the stored elements $N(s, d(s))$ are ordered so that the candidate with the maximum PED is at the top. A binary min-heap tree structure is illustrated in Figure 2, where the value of the memory slot is shown inside the circle and the memory address underneath the circle.
The operations used with the heap memory are read-min/max, extract-min/max, and insert. The running time of the read-min/max operation is $O(1)$ [8], i.e., it requires just a memory read of the base address. The running time of the insert and extract-min/max operations is $O(\log_2 k)$ in the worst case [8], where $k$ is the size of the memory and $\log_2 k$ is the height of the tree. In the insert operation, we store the information to the next available memory slot, which is illustrated as inserting the value X to address 8 in Figure 2. The value is then swapped to the correct level with the up-heap operation. The operation requires at least one read and one write operation of the memory, and it is repeated a maximum of $\log_2 k$ times until the new element is in its correct place. The extract-min/max operation extracts the base address element and replaces it with either a new element or the last element of the memory. Then the down-heap operation is executed to move the element to the right position. The down-heap operation, which requires at least two memory reads and one memory write, is also repeated a maximum of $\log_2 k$ times until the newly added element is at the correct position.

Table 1. Determined word lengths for the real IR-LSD algorithm.
(10,4) (9,3) (8,1) (12,5) (10,5)
In the IR-LSD architecture, the sizes of $L_F$ and $S$ are equal to the required list size $N_{cand}$ and to the maximum number of algorithm iterations, respectively. In the worst case, one extract-min and one insert operation are required on the partial memory unit in each iteration. The partial memory size can be decreased by introducing a separate sphere constraint for the stored candidates. With a proper choice of the sphere constraint, the required memory size is decreased without any performance loss. The sphere constraint can be determined to be, e.g., relative to the previous largest candidate in the final list or relative to the partial candidate search level.
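The heap operations described in this section can be sketched with an array-based min-heap. The sketch assumes 1-based addresses with the base address at 1, so that the children of address k sit at 2k and 2k+1; the class name `MinHeap` and the stored tuples are illustrative assumptions, and a max-heap for the final list would only flip the comparisons.

```python
class MinHeap:
    """Array-based binary min-heap with 1-based addresses."""

    def __init__(self):
        self.mem = [None]  # address 0 is unused; the base address is 1

    def read_min(self):
        """O(1): a single read of the base address."""
        return self.mem[1]

    def insert(self, item):
        """Write to the next free slot, then up-heap."""
        self.mem.append(item)
        k = len(self.mem) - 1
        while k > 1 and self.mem[k] < self.mem[k // 2]:
            self.mem[k], self.mem[k // 2] = self.mem[k // 2], self.mem[k]
            k //= 2

    def extract_min(self):
        """Remove the base element, move the last element up, then down-heap."""
        top = self.mem[1]
        last = self.mem.pop()
        if len(self.mem) > 1:
            self.mem[1] = last
            self._down_heap(1)
        return top

    def _down_heap(self, k):
        n = len(self.mem) - 1
        while 2 * k <= n:
            c = 2 * k  # left child
            if c + 1 <= n and self.mem[c + 1] < self.mem[c]:
                c += 1  # right child is smaller
            if self.mem[k] <= self.mem[c]:
                break
            self.mem[k], self.mem[c] = self.mem[c], self.mem[k]
            k = c


# Store (PED, label) pairs; tuples compare on the PED first.
heap = MinHeap()
for ped in [5.0, 2.0, 8.0, 1.0, 9.0, 3.0]:
    heap.insert((ped, "candidate"))
smallest = heap.read_min()
extracted = [heap.extract_min()[0] for _ in range(6)]
```

Each step of the down-heap loop reads two children and writes at most one swap, which mirrors the memory access counts given in the text; the loop runs at most once per tree level.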
Scalability
The architecture can be used as such in systems with different numbers of transmit antennas $N_T$ and constellation sizes $|\Omega|$. The number of transmit antennas $N_T$ and the constellation $\Omega$ affect the size of the search tree and, thus, the number of algorithm iterations required to detect the transmitted signal vector $x$. The maximum number of required iterations is equal to the number of required elements in the partial candidate memory. The operations required by the SEE in the SEE and PED unit also depend on the constellation size $|\Omega|$. The proper final list size also varies with the system configuration.
IMPLEMENTATION
The IR-LSD algorithm architecture was implemented with the real signal model for a 4 × 4 MIMO system with 16-QAM. The performance of a turbo coded system was studied with a real IR-LSD, sorted QRD (SQRD) [9] preprocessing, and log-MAP LLR calculation in an uncorrelated (UNC) channel, as shown in the left subplot in Figure 3. The LSD candidate list size was selected as $N_{cand} = 15$ and the used fixed-point word lengths are listed in Table 1, where W and I refer to the total number of bits and the number of bits used for the integer part representation, respectively. We also studied the maximum and average number of iterations, $D_{max}$ and $D_{avg}$, required by the LSD algorithm for a 10% target frame error rate (FER) at different SNRs, as shown in the right subplot in Figure 3. It can be seen that the required maximum number of iterations $D_{max}$ decreases with increasing SNR, and the IR-LSD algorithm requires as few as $D_{min} = 9$ iterations, i.e., 18 studied nodes, in a high SNR environment to reach the target FER. The size of the partial candidate memory is selected as $D_{max} = 80$ to support the lower SNR operation.
Complexity and Latency
The RTL output of the Catapult C Synthesis tool was synthesized with Mentor Graphics Precision RTL, and the FPGA place and route operation was done with Xilinx ISE software for a Xilinx Virtex-IV chip with $f_s = 150$ MHz clock frequency. The device utilization of the Xilinx Virtex-IV chip and the latencies of the units in terms of clock cycles (cc), $\Delta_{tot}$, are shown in Table 2. The latencies of the memory units are calculated according to the average number of heap operations, and the units are implemented in dual port RAM memory, which enables two parallel read/write operations. The total latency of the detector iteration consists of the latency of the slowest parallel unit, which is the SEE and PED unit, and of the latency of the control logic unit. The total guaranteed throughput of the implementation can be calculated as

$$\frac{2 N_T Q_R f_s}{D_{avg} \Delta_{tot}},$$

i.e., the number of bits in one transmitted symbol vector divided by the time required to detect it.
The maximum guaranteed throughput is then 12.1 Mbps at γ = 21 dB for 10% target FER with Davg = 9. However, the guaranteed implementation throughput is 1.6 Mbps at γ = 13 dB, which can be considered as the worst case scenario with Davg = 70. Thus, the throughput is mainly dependent on the number of iterations Davg.
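The reported figures can be cross-checked under an assumed throughput model: bits per symbol vector times clock frequency, divided by the clock cycles spent on one vector. The per-iteration latency of 22 clock cycles used below is an assumption chosen because it reproduces both reported throughputs; the actual value belongs to Table 2, which is not reproduced here. The function name is likewise hypothetical.

```python
def throughput_bps(n_t, q_r, f_s, d_avg, delta_tot):
    """Assumed model: (bits per vector * clock) / (iterations * cycles per iteration)."""
    bits_per_vector = 2 * n_t * q_r  # 2*N_T real symbols, Q_R bits each
    return bits_per_vector * f_s / (d_avg * delta_tot)


F_S = 150e6      # clock frequency from the text
DELTA_TOT = 22   # ASSUMED cycles per iteration, not taken from Table 2

high_snr = throughput_bps(n_t=4, q_r=2, f_s=F_S, d_avg=9, delta_tot=DELTA_TOT)
low_snr = throughput_bps(n_t=4, q_r=2, f_s=F_S, d_avg=70, delta_tot=DELTA_TOT)
high_snr_mbps = round(high_snr / 1e6, 1)
low_snr_mbps = round(low_snr / 1e6, 1)
```

For the 4 × 4 system with 16-QAM, one symbol vector carries 16 bits, so the throughput scales inversely with the number of iterations, consistent with the roughly eightfold gap between the 9-iteration and 70-iteration cases.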
Discussion and Comparison to Other Work
The main limiting factor for higher throughput is the latency of one algorithm iteration. The total number of iterations can only be lowered with lattice reduction techniques or by sacrificing the performance of the detector. The latency of one algorithm iteration is currently limited by the SEE and PED unit, and it could be lowered by, e.g., an ASIC implementation. As far as the authors know, there have not been any architecture designs or implementations of the IR-LSD algorithm in the literature. The parallel nature of the introduced architecture makes the algorithm implementation competitive against the current state-of-the-art depth-first algorithm or soft-output detectors [10, 11]. Some parallelism can be added by dividing the search into separate real and imaginary branches as in the hard-output work in [10]. However, the increased throughput comes at the cost of approximately double the complexity and decreased performance. The complex signal model leads to higher maximum throughput in [11], but results in more complex units and a higher average number of visited nodes. The main interest in practice is the performance and the complexity of the implementation in the worst case scenario, because the implementation has to be able to work in those conditions. As the other works typically present the maximum throughput results, a direct comparison with other work is difficult, but the authors believe that with a possible ASIC implementation the IR-LSD algorithm is a very competitive alternative.
CONCLUSIONS
We designed and introduced a novel and scalable architecture for the IR-LSD algorithm with parallel processing units. The architecture was designed so that the main operations of the algorithm can be run in parallel, and it is scalable to different configurations with minor changes. An implementation of the architecture was presented for a 4 × 4 MIMO system with 16-QAM on a Virtex-IV FPGA chip. The complexity and the latency of the implementation were presented and analyzed. The throughput of the current implementation could be enhanced, e.g., with an ASIC implementation.
