Abstract-Multiple-input multiple-output (MIMO) technology enables higher transmission capacity without additional frequency spectrum and is becoming a part of many wireless system standards. Sphere detection has been introduced in MIMO systems to achieve maximum likelihood (ML) or near-ML estimation with reduced complexity.
I. INTRODUCTION
Multiple-input multiple-output (MIMO) communications [1] based on multiple transmit and receive antennas will be applied in several standards to increase the spectral efficiency and data rates. Linear minimum mean square error (LMMSE) and zero forcing (ZF) detectors can be applied in MIMO detection, but their performance is not optimal. A maximum likelihood (ML) detector approximation based on the sphere detector (SD) [2] for MIMO communications has been introduced in [3]. So-called list sphere detectors (LSDs) [4] can be used in channel coded systems to approximate maximum a posteriori (MAP) detection. In this paper, we focus on an SD variant called the K-best LSD [5].
Application-specific integrated circuits (ASICs) can be used to get high computational power, but the design work is laborious and the solutions tend to be limited in terms of flexibility. Digital signal processors (DSPs) are flexible, but often do not provide enough computation power to meet strict performance requirements. Application-specific instruction set processors (ASIPs) can reduce design and production costs and still deliver sufficient performance.
In this paper, we design an ASIP for K-best LSD using the transport triggered architecture (TTA) [6] computation paradigm. The goal is low energy consumption, so memory is used in preference to registers wherever possible.
The paper begins by defining the MIMO communication problem and the appropriate receiver algorithms in Section II. Section III describes the ASIP LSD implementation along with several variations that could be used for improving the detection throughput. The latency and hardware complexity of the implementation and its variations are estimated and compared in Section IV, and conclusions are drawn in Section V.
II. MIMO RECEIVER ALGORITHMS
A MIMO system with M receive and N transmit antennas can be modelled using the equation

$$\mathbf{x} = \mathbf{H}\mathbf{s} + \mathbf{n}, \qquad (1)$$

where $\mathbf{x} \in \mathbb{C}^{M \times 1}$ is the vector of received symbols, $\mathbf{H} \in \mathbb{C}^{M \times N}$ is the channel matrix, $\mathbf{s} \in \mathcal{C}^{N \times 1}$ is the vector of transmitted symbols and $\mathbf{n} \in \mathbb{C}^{M \times 1}$ is the Gaussian noise vector with zero mean and covariance matrix $\sigma^2 \mathbf{I}_M$. The ML estimator is optimal in the sense of minimizing the error probability [1]. The ML solution can be computed as [1]

$$\hat{\mathbf{s}}_{ML} = \arg\min_{\mathbf{s} \in \mathcal{C}^N} \|\mathbf{x} - \mathbf{H}\mathbf{s}\|^2, \qquad (2)$$

where $\|\cdot\|$ denotes the Frobenius norm of a vector and $\mathcal{C}$ is the set of complex constellation points. Unfortunately, the computational complexity of ML estimation may be very high [1].
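To make the cost of exhaustive ML estimation concrete, the following is a minimal sketch of a brute-force search over all symbol vectors. The function name `ml_detect`, the QPSK constellation and the 2x2 channel values are illustrative assumptions, not taken from the paper. The loop visits $|\mathcal{C}|^N$ candidates, which is what makes the direct approach impractical for larger constellations and antenna counts.

```python
from itertools import product

def ml_detect(x, H, constellation):
    """Exhaustive ML search: return the s minimising ||x - H s||^2 and that distance."""
    n_tx = len(H[0])
    best_s, best_dist = None, float("inf")
    for s in product(constellation, repeat=n_tx):
        # Squared Euclidean distance between x and the noiseless hypothesis H s
        dist = sum(
            abs(x_m - sum(h * s_n for h, s_n in zip(row, s))) ** 2
            for x_m, row in zip(x, H)
        )
        if dist < best_dist:
            best_s, best_dist = s, dist
    return best_s, best_dist

# Toy 2x2 example with QPSK; all numeric values are chosen only for illustration.
QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
H = [[1 + 0j, 0.5j], [0.2 + 0j, 1 - 0.3j]]
s_true = (1 - 1j, -1 + 1j)
x = [sum(h * s for h, s in zip(row, s_true)) for row in H]  # noiseless reception
s_hat, d = ml_detect(x, H, QPSK)
```

In the noiseless toy case the search recovers the transmitted vector with zero distance; with noise added to `x`, the same function returns the nearest hypothesis.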
SDs enable finding the ML or near-ML solution with reduced complexity. An LSD [4] is an SD variant which, instead of giving just one most likely symbol vector, outputs a list of the most likely symbol vectors and their Euclidean distances. This modification makes the LSD suitable for soft-decision detection, as shown in [4].
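The search can be limited inside a sphere with radius d using the sphere constraint $\|\mathbf{x} - \mathbf{H}\mathbf{s}\|^2 \le d^2$. With the QR decomposition $\mathbf{H} = \mathbf{Q}\mathbf{R}$, where $\mathbf{R}$ is upper triangular, and $\mathbf{y} = \mathbf{Q}^H \mathbf{x}$, the distance can be accumulated one row at a time as a partial Euclidean distance (PED). Now the symbol vector components can be considered separately. The K-best algorithm processes one vector component first, chooses the K best partial symbols and stores them. Next, those K best partial symbols are expanded to the next symbol level, and again the K best partial symbols are kept, until the whole symbol vector has been processed.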
In our implementation, the sphere radius was set to infinity, $d = \infty$, which guarantees a constant number of visited nodes in all cases. The K-best algorithm used in the implementation, modified from [7], is presented in Figure 1.
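The level-by-level procedure described above can be sketched as follows. This is a generic K-best search written for illustration, not the algorithm of Figure 1. It assumes an upper-triangular matrix `R` and a vector `y` such that the distance is $\|\mathbf{y} - \mathbf{R}\mathbf{s}\|^2$, and it uses a plain sort where the implementation uses a heap. The name `k_best` and all numeric values are hypothetical.

```python
def k_best(y, R, constellation, K):
    """Breadth-first K-best search over an upper-triangular system y ~ R s.

    Returns up to K (distance, symbol_vector) pairs sorted by distance.
    """
    n = len(y)
    # Each survivor is (partial Euclidean distance, symbols already fixed)
    survivors = [(0.0, ())]
    for i in range(n - 1, -1, -1):              # last vector component first
        candidates = []
        for ped, tail in survivors:
            for c in constellation:
                s = (c,) + tail                 # symbols for levels i..n-1
                # Row i of R applied to the symbols fixed so far
                est = sum(R[i][i + k] * s[k] for k in range(len(s)))
                candidates.append((ped + abs(y[i] - est) ** 2, s))
        # Infinite sphere radius: no pruning, just keep the K smallest PEDs
        candidates.sort(key=lambda t: t[0])
        survivors = candidates[:K]
    return survivors

# Toy 2x2 example (illustrative values only)
QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
R = [[2 + 0j, 0.5 - 0.5j], [0, 1.5 + 0j]]
s_true = (1 + 1j, -1 - 1j)
y = [sum(R[i][j] * s_true[j] for j in range(2)) for i in range(2)]
result = k_best(y, R, QPSK, K=2)
```

Because the radius is infinite, every level expands exactly K survivors (after the first), so the number of visited nodes does not depend on the data.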
B. Sorting of Symbol Vectors
With long lists, the sorting and storing of symbol vectors becomes a bottleneck. The list maintenance could be made fast by using registers, but the register-based approaches tend to have a high energy consumption with a large list size.
Using memory instead of registers may be slower but is possibly more energy-efficient. A heap is a good choice for long lists, as the complexity of insertion is only of order $O(\log_2 n)$ for binary tree-shaped heaps, where n denotes the list size. Because of this, heap sort was chosen for our implementation.
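As a software illustration of why a heap suits this task, the sketch below keeps the `capacity` smallest distances in a bounded max-heap, so that the current worst entry is always at the root and each insertion costs $O(\log_2 n)$. It uses Python's standard `heapq` module with negated keys and is only an analogy for the memory-based heap of the implementation. The function name `insert_bounded` and the sample values are assumptions.

```python
import heapq

def insert_bounded(heap, ped, label, capacity):
    """Keep the `capacity` smallest PEDs in a max-heap stored with negated keys.

    Returns True if the candidate was stored, False if it was rejected.
    `label` should be a comparable value (an int here) so that ties on the
    distance do not trigger a comparison of complex symbols.
    """
    item = (-ped, label)                    # negate: heapq is a min-heap
    if len(heap) < capacity:
        heapq.heappush(heap, item)          # initial filling
        return True
    if ped >= -heap[0][0]:                  # not better than the current worst
        return False
    heapq.heapreplace(heap, item)           # drop the worst, insert the new one
    return True

heap = []
accepted = [insert_bounded(heap, ped, idx, capacity=3)
            for idx, ped in enumerate([5.0, 1.0, 4.0, 3.0, 2.0, 6.0])]
kept = sorted(-key for key, _ in heap)
```

Here the last candidate is rejected after a single comparison with the root, which is the situation the conditional-jump variations later exploit.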
The heap is used with a list unit (LU) that performs address calculation and value determination. The LU itself is used together with a software algorithm. The LU for heap-based sorting is based on the unit described in [8], where the heap sort is also described. The LU takes five inputs: the address of the current parent node, the data this address contains, the data that the child nodes contain and the symbol level that is being processed. The unit decides whether the nodes should be swapped and outputs the data that should be written to the current parent node and to the child node that the parent node was possibly swapped with. In addition, it gives the addresses of the new parent node and the new child nodes.

A special function unit (SFU) was also designed for the PED calculation. The PEDs are calculated completely by this SFU, and an assembly routine is needed only for feeding the input values to the unit and reading the output value (PED and the corresponding symbol vector) from it.
As the heap insertion has a constant duration of 11 clock cycles, the PED unit latency was constrained to be less than or equal to that. The PED unit performs five operations: mmul, ped3, ped2, ped1 and ped0. The operation mmul is used for computing $\mathbf{y} = \mathbf{Q}^H \mathbf{x}$. Vector y is computed one element at a time, so matrix Q can be fed to the PED unit row by row instead of inputting the whole matrix (16 elements) at the same time. The operations ped3, ped2, ped1 and ped0 are used for PED calculation according to the algorithm presented in Figure 1.
The unit has ten input ports. For the mmul operation, the first eight inputs are used for inputting the values of matrix Q and vector x. For the PED operations, the first four inputs are used for feeding the elements of the R matrix, and the next four inputs are used for inputting the vector y. In addition, the symbol vector from the previous level with its corresponding PED and the current level symbol are input to the ninth and tenth input ports, respectively. A combination of the current level PED and the corresponding symbol is given as the output.
D. Variations of the Implemented Version
To study possibilities for performance enhancements, three variations are proposed. Their effects on latency and hardware complexity are estimated in Section IV.
1) Software-pipelined Heap Insertion: Another heap utilization strategy that reduces the clock cycles to $\lceil \log_2(n + 1) \rceil + 1$ per insertion was presented in [8]. The insertion latency approaches the theoretical limit of heap insertion complexity, $O(\log_2 n)$, when $n \to \infty$. With n = 63, the insertion latency could be reduced from 11 to 7 clock cycles.
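The cycle count quoted above can be checked with a few lines of code. This is a small sketch of the formula $\lceil \log_2(n + 1) \rceil + 1$ only. The function name is hypothetical, and integer bit length is used instead of a floating-point logarithm to avoid rounding issues.

```python
def pipelined_insertion_cycles(n):
    """Clock cycles per insertion, ceil(log2(n + 1)) + 1, for a heap of n entries."""
    if n < 1:
        raise ValueError("heap size must be at least 1")
    # For n >= 1, ceil(log2(n + 1)) equals the bit length of n
    return n.bit_length() + 1

cycles_63 = pipelined_insertion_cycles(63)   # list size used in the paper
cycles_15 = pipelined_insertion_cycles(15)   # a smaller, illustrative list size
```

For n = 63 the result is 7 cycles, matching the figure in the text.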
2) Conditional Jump, Version A: As explained in Section III-B, the insertion routine of the implemented version always lasts for 11 cycles. The routine could be modified for higher average throughput by enabling a conditional jump out of the insertion routine. By adding a simple comparator to the processor, the insertion routine could detect on the first clock cycle of insertion if the new candidate fits in the heap. If the candidate is larger than the heap maximum, a jump instruction could be executed on the first clock cycle. We assume a jump latency of four clock cycles, so there would still be four clock cycles executed in the routine even if the candidate did not fit in the heap.
Because of the parallelized PED computation and sorting, now the PED would have to be computed in three clock cycles for it to be ready before the possible jump. In the implemented version, the insertion latency and the latency of the whole LSD are constant, whereas enabling the conditional jump would make the latencies variable.
3) Conditional Jump, Version B: The conditional jump out of the insertion routine could also be implemented in another way. An additional output port could be included in the list unit, as in [8]. If the new symbol does not fit in the heap, or the nodes are not swapped at some point during the routine, the unit could detect this and generate an output value. Based on this value, a conditional jump could be made by using guarded execution, and the jump could be executed on the second clock cycle of the insertion routine. In this version, too, the PED computation would have to be faster than in the implemented version: the PED latency should be at most five clock cycles.
IV. LATENCY AND HARDWARE COMPLEXITY ESTIMATION
In this section, the latencies and data path complexities of different possible designs are compared. The effects of reduced list size and parallelization are investigated as means for achieving a higher throughput.
The area estimates consider the data path complexity first. The additional area requirements that come from, for example, the control logic and interconnection network, are first neglected but their effect is discussed later. Exact latency is provided from simulation results for the implemented version. The other variations are characterized by their total heap insertion latencies which give fairly good estimates of the overall latencies.
A. Implemented Version
In the implemented version, the insertion routine always takes 11 clock cycles and the number of insertions is constant. The insertion is used 16 × 16 + 16 × 63 + 16 × 63 = 2272 times, so the total insertion latency can be calculated as 2272 × 11 = 24992 clock cycles.
Some additional clock cycles are needed, for example, for controlling the program flow, and the simulation results show that the complete execution of the algorithm takes 26400 clock cycles. As (26400/24992 − 1) × 100 % ≈ 5.6 %, the overhead that comes from operations other than running the insertion routine is small. This justifies using the insertion latencies of the different variations for comparing them with each other.
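The arithmetic in this subsection can be reproduced as a short calculation. The variable names are illustrative; the numbers are those given in the text.

```python
# Insertions per processed symbol level, as listed in the text
insertions_per_level = [16 * 16, 16 * 63, 16 * 63]
total_insertions = sum(insertions_per_level)

cycles_per_insertion = 11
insertion_cycles = total_insertions * cycles_per_insertion

simulated_cycles = 26400        # complete algorithm, from simulation
overhead_percent = (simulated_cycles / insertion_cycles - 1) * 100
```

The overhead comes out at roughly 5.6 %, so the insertion routine accounts for nearly all of the execution time.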
Different parts of the ASIP were modelled in VHDL and synthesized with Synopsys Design Compiler. Table I shows the gate counts of the different units that were used in the implementation, synthesized with 0.13 µm technology at 100 MHz clock frequency. The list unit for software-pipelined insertion is also included in the table. The register file includes three 32-bit registers.
As the implementation includes two ADDSUB units, two LSUs, a CMP unit, an RF, a PED unit and an LU, the data path hardware complexity can be computed as around 17300 gates.

B. Software-pipelined Heap Insertion

The number of heap insertions remains the same if the software-pipelined heap utilization is used. However, the time per insertion drops to seven clock cycles, and the software-pipelined version would have a constant insertion latency of 2272 × 7 = 15904 clock cycles.
An area estimate can be calculated as for the implemented version, taking into account that now six LSUs are needed. The list unit for software-pipelined execution (SWLU) is also slightly more complex than in the implemented version, and more performance is required from the PED unit as well. The capability for simultaneous subtractions is needed inside the PED unit, which is taken into account by adding the term 200 that approximates this complexity increase. The gate count can be estimated as $G_{SW} \approx 2 \times 1100 + 6 \times 600 + 1100 + 1600 + (8900 + 200) + 2800 = 20400$.
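The two data path estimates can be written out as a small calculation. Note the assumptions: the text gives only the totals and the terms of the software-pipelined sum, so the assignment of each number to a unit follows the order in which the units are listed, and the LU value of 2300 gates is inferred here so that the implemented total equals the quoted figure of around 17300. Neither is quoted from Table I.

```python
# Per-unit gate counts. The unit-to-number mapping and the LU value are
# inferred from the sums in the text, not read from Table I.
GATES = {
    "ADDSUB": 1100,
    "LSU": 600,
    "CMP": 1100,
    "RF": 1600,
    "PED": 8900,
    "LU": 2300,     # assumption: chosen so the implemented total is 17300
    "SWLU": 2800,
}

def datapath_gates(unit_counts, extra=0):
    """Sum gate counts for a dict of {unit name: number of instances}."""
    return sum(GATES[unit] * count for unit, count in unit_counts.items()) + extra

implemented = datapath_gates(
    {"ADDSUB": 2, "LSU": 2, "CMP": 1, "RF": 1, "PED": 1, "LU": 1})
software_pipelined = datapath_gates(
    {"ADDSUB": 2, "LSU": 6, "CMP": 1, "RF": 1, "PED": 1, "SWLU": 1},
    extra=200)      # term for simultaneous subtractions inside the PED unit
```

Under these assumptions the software-pipelined data path is about 18 % larger than the implemented one, mostly because of the four additional LSUs.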
C. Conditional Jump, Version A

With version A of the conditional jump out of the insertion routine, the insertion latency would be either four or 11 clock cycles. If all of the PEDs of inserted symbols are assumed to have equal probability distributions, simple simulations can be made to estimate how many inserted symbols will fit in the heap (for simplicity, assumed to lead to a full 11-cycle insertion in both jump versions) and how many will not (leading to a four-cycle insertion). On level 2, about 40 % of the symbol candidates will fit in the heap after initial filling. At levels 0 and 1, about 18 % of the symbols will fit in the heap after initial filling. With 63 full-length insertions for the initial filling on each level, the average latency in clock cycles can then be estimated as

$$63 \times 11 + 193 \times (0.4 \times 11 + 0.6 \times 4) + 2 \times [63 \times 11 + 945 \times (0.18 \times 11 + 0.82 \times 4)] = 13332.8.$$
Compared to the implemented version, one comparator unit has to be added. In addition, the fact that the PED calculation has to be performed in three clock cycles requires a highly parallel PED unit. The unit has to be able to perform four complex multiplications and subtractions during one clock cycle. Assuming that the size of the parallel PED unit is quadrupled from the basic PED unit used in the implementation, the data path hardware complexity of the detector would be around 45100 gates.
D. Conditional Jump, Version B
In version B, the insertion routine would last for either five or 11 clock cycles, depending on the situation as for version A. In a similar way as for version A, the latency for version B can be computed as 14998.4 clock cycles.
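The latency estimate for the two jump versions can be sketched as a weighted sum. The sketch assumes that the first 63 insertions on each level (the initial filling of the heap) always take the full 11 cycles, and that the remaining candidates take 11 cycles with the stated fit probability and the short jump-out latency otherwise. Under this assumption the calculation reproduces the 14998.4 clock cycles quoted for version B. The function name is hypothetical.

```python
def average_insertion_cycles(short_cycles, full_cycles=11, heap_size=63):
    """Expected total insertion latency with a conditional jump out of the routine."""
    # (number of candidates, fraction that fits in the heap after initial filling)
    levels = [(16 * 16, 0.40), (16 * 63, 0.18), (16 * 63, 0.18)]
    total = 0.0
    for candidates, fit_fraction in levels:
        after_fill = candidates - heap_size
        total += heap_size * full_cycles                      # initial filling
        total += after_fill * (fit_fraction * full_cycles
                               + (1 - fit_fraction) * short_cycles)
    return total

latency_a = average_insertion_cycles(short_cycles=4)   # version A
latency_b = average_insertion_cycles(short_cycles=5)   # version B
```

Both versions roughly halve the 24992-cycle latency of the implemented version, with version A ahead of version B by about 1700 cycles.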
In version B, an additional CMP unit is not needed and the PED unit latency can now be as large as five clock cycles, leading to a smaller PED unit than for version A. The gate count of the PED unit is approximated to double from the basic PED unit, as two parallel multiplications and subtractions are needed. The hardware complexity of the list unit is assumed to be equal to that of the list unit used in the implemented version. As for version A, the gate count can be estimated, giving around 26200 gates.
E. Comparison of Alternative TTA Processors
Combining different methods, six schemes can be considered. In Figure 3, the different alternatives are compared in terms of data path hardware complexity and total insertion latency. It can be seen that utilizing jump A without software-pipelined heap insertion is not a reasonable option, as a smaller latency can be achieved with simpler hardware using software-pipelined insertion and jump B. The high latency of the implemented version is obvious, and significant improvements can be achieved by utilizing the proposed alternatives without a prohibitive complexity increase.

Rough estimates can be made about the hardware complexity of the proposed parallel architecture. We assume that one symbol vector can be processed with the hardware for software-pipelined heap utilization (see Section IV-B), but now including a PED unit whose gate count is doubled from the PED unit used in the implemented version, so that the PEDs could be computed in four clock cycles. Multiplying the required hardware by five, we may approximate the data path complexity of the parallelized architecture as around 145500 gates. Assuming that the additional area for control logic and interconnection network would remain the same as for the implemented version and adding 10 % implementation overhead, the final gate count can be estimated as (145500 + 9300) gates × 1.1 ≈ 170 kgates.
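The estimate for the parallel architecture can be retraced as follows. The per-unit terms are inferred from the software-pipelined gate-count sum given earlier, with the PED term doubled and the separate 200-gate term assumed to be absorbed by the doubling. This assumption reproduces the quoted 145500 gates, but the breakdown itself is not stated in the text.

```python
PED_GATES = 8900                 # basic PED unit, as used in the earlier sums

# One software-pipelined data path with a doubled PED unit (inferred terms:
# 2 ADDSUB, 6 LSU, CMP, RF, doubled PED, SWLU)
per_vector = 2 * 1100 + 6 * 600 + 1100 + 1600 + 2 * PED_GATES + 2800

parallel_datapath = 5 * per_vector            # five parallel data paths
control_and_interconnect = 9300               # as for the implemented version
total_gates = (parallel_datapath + control_and_interconnect) * 1.1  # 10 % overhead
```

The result is about 170 kgates, in line with the figure quoted in the text.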
V. CONCLUSIONS
This paper began by giving an overview of MIMO detection algorithms with most weight on K-best sphere detection. Next, a programmable ASIP design for K-best LSD was presented. The design space was explored by presenting and evaluating several modified designs that could be used for improving the detection throughput.
Software-pipelined heap insertion and a conditional jump out of the insertion routine were shown to offer higher detection throughput without a prohibitive increase in hardware complexity. A list size of 63 seems to be impractical, and a reduced list size was proposed to enable real-life implementation. To our knowledge, the presented K-best implementation is the first published ASIP design for LSD. In addition, the memory-based heap sort method used in the implementation opens new perspectives for SD design.
The presented design cannot compete with fast register-based ASIC implementations in terms of throughput, but the flexibility of ASIPs leaves the approach as an interesting topic for further development.
Future research should consider possibilities of reaching a higher detection throughput with register-based sorting methods. Also techniques for enabling small list sizes should be studied, including, for example, optimal ordering of the detected symbols.
