Abstract-The recently proposed Golden code is an optimal space-time block code for 2 2 multiple-input-multiple-output (MIMO) systems. The aim of this work is the design of a VLSI decoder for a MIMO system coded with the Golden code. The architecture is based on a rearrangement of the sphere decoding algorithm that achieves maximum-likelihood (ML) decoding performance. Compared to other approaches, the proposed solution exhibits an inherent flexibility in terms of QAM modulation size and this makes our architecture particularly suitable for adaptive modulation schemes. Relying on the flexibility of this approach two different architectures are proposed: a parametric one able to achieve high decoding throughputs ( 165 Mb/s) while keeping low overall decoder complexity (45 KGates), a flexible implementation able to dynamically adapt to the modulation scheme (4-,16-,64-QAM) retaining the low complexity and high throughput features.
I. INTRODUCTION
The hardware implementation of high data rate and high reliability wireless communication systems has raised engineering and research challenges for many years. Higher transmission reliability demands for higher levels of processing complexity in the mobile terminal, while faster data rates require increased throughput: both evolutionary trends are strong driving forces for the search of novel efficient architectures implementing the most critical baseband processing functions. In particular new standards proposed to regulate Wireless Local Area Networks (WLAN) and Metropolitan Area Networks (MAN) are significant examples of very challenging applications from the implementation point of view.
There are two main objectives on which research is actually focused. The first goal is to make wireless communication data rate comparable to that of wired communications, the second one is to improve reliability, by combating multipath, noise and interference effects. The recourse to multiple-input-multiple-output (MIMO) systems seems to be one of the most promising solutions to reach both these results. In particular new space-time codes over MIMO channels are now able to reach the best tradeoff between data rate and diversity gain, although they require more sophisticated detection schemes at the receiver [1] - [3] .
The main contribution of this work is in the hardware design of a decoder for this kind of codes, in particular for the decoding of a 2 2 2 MIMO signal coded with the Golden code [3] . Golden code is a recently proposed full-rate and full-diversity space-time block code, chosen for its good energy efficiency. The maximum-likelihood (ML) decoding algorithm for the Golden code is based on the sphere decoder [4] , which has already been widely addressed in the literature also from a hardware implementation point of view [5] - [7] .
Several architectures have been proposed for the efficient implementation of the sphere decoding architecture. In [6 [7] , where the K-Best algorithm allows for performance-complexity trade-offs. These choices lead to fully optimized architectures, achieving high throughput; however, they are not ML (the loss is about 1.4 dB in the case of [6, ASIC-II]), and most of all the cited implementatios have been proposed for specific modulation and transmission schemes and do not have reconfigurability features, although in [6] the possibility to adapt the proposed solution to different modulations is also mentioned. In this work, we overcome these limitations, proposing two novel architectures designed with VHDL as a reusable intellectual property (IP) macrocell: the first one is parametrized with respect to the fixed-point representation of data and to the addressed modulation scheme. The second architecture is flexible, meaning that it can be dynamically configured to cover multiple modulation schemes. We note that both these hardware implementations can be equivalently used in a 4 2 4 uncoded MIMO system [8] .
Section II is dedicated to reviewing the sphere decoding algorithm. The detailed descriptions of the two hardware implementations are then carried out in III and IV. In the last two sections, results and conclusions are presented.
II. SPHERE DECODING ALGORITHM
The system model when Golden code has been used [9] can be expressed as 
where Q n is the cardinality of the search space and n = 8.
Sphere decoding algorithms denote a family of algorithms, which aim at lowering the complexity of the minimization (2) by analyzing only a subset of the solution space [10] . They look at the set of possible solutions as points of a lattice and try to find the closest point to the received vector. In particular, a hypersphere is constructed around the received vector and only points inside it are taken into account, since the others are actually too far. This constraint can be written as 
where C 0 is the square radius of the hypersphere [4] , [11] . One of the most interesting consequences of this interpretation is that the exploration of the lattice can be thought as a tree traversal. This tree has n levels and every node at each level has Q sons. At every level the radius constraint (3) must be verified and satisfied, otherwise the branch is pruned.
2) Tree Exploration: Several algorithms have been studied in order to make the tree traversal efficient. An efficient algorithm has been proposed by Schnorr and Euchner (SE) [12] . The SE algorithm has intrinsically variable throughput and this makes it not very suitable for hardware implementation. The key to make this algorithm efficient or, at least, with predictable throughput, is to make an effective pruning.
III. FIRST HARDWARE IMPLEMENTATION: PARAMETRIZABLE SOLUTION
In the developed architecture, the datapath width, the size of the search tree and the modulation scheme are tunable parameters that can be statically configured to make the detector adaptable to different systems. The system is described with reference to the special case of the Golden code. The key elements of the developed architecture are described in the following paragraphs.
A. Flexible Hardware Solution
The key processing task in the tree exploration algorithm is given by (5), where we recall that (l+1) l =ỹ l 0 n j=l+1 R lj sj, is the lth entry of an n elements vector (l+1) , and l + 1 is the tree level we are referring to. At level l, the generic ith entry of this vector can be decomposed in a recursive manner through the following expression:
i 0 R il s l ; if l = n; . . . ; 1
where i is in the range 1; . . . ;n while the level l decreases from n + 1 to 1. The whole (l) can therefore be updated by means of (l) = (l+1) 0 R R R l s l ; l= n; . . . ; 1 (8) where R R R l is the lth column of R R R and the initial value is given by (n+1) =ỹ y y. In order to minimize the final metric d 2 (s s s) at each level of the tree the minimum (l+1) l 0 R ll s l value between all sons must be selected. More precisely, at each tree node, placed at level l, the following three main operations have to be accomplished: 3) for each i = 1; . . . ;n; (l) i is evaluated for the selected s l value, according to (7).
Thus, the straightforward minimization of partial metrics T l (s s s l value generated by U psi unit; the obtained T (l) is propagated to the lower tree level.
The described approach, and particularly the use of a division to obtain the optimal s l , allows avoiding multiple metric computations; thus it offers low complexity and, at the same time, flexibility in terms of supported modulation schemes. It is worth noting that, although the described technique introduces the division for the algorithm, those that correspond to the equivalent PAM constellation points 61; 63; . . .. As a consequence, a general purpose hardware divisor is not necessary and the required operation can be executed by means of a simplified component able only to find the closest integer solution of this division and to determine if the approximation is by defect or by excess: the first log 2 Q steps of a successive subtraction divider [13] can be employed to this purpose, where Q 2 is the number of signals in the QAM constellation. This divider has a very simple architecture that uses only shifts and subtractions and employs a dichotomous process to find the requested value after log 2 Q steps. In the block diagram of Fig. 2 , the multiplexer selects the dividend at the first step and the subtraction result in the following ones; the n-bit variable shifter is used to shift the divider by a number of positions that changes from the initial value of log 2 Q 0 1 down to 0. The subtractor returns the result one bit per iteration, starting from the most significant one.
B. Pipelined Tree Exploration
In order to ensure that a new node is expanded at each clock cycle, a new, alternative metric must be available also after a pruning operation has taken place. As a consequence, when the metrics of a given father node are evaluated, two "candidate" nodes are concurrently computed: the first one is a direct son of the current node and it is processed by the U psi unit, while the alternative node, placed at a higher level in the tree, is concurrently computed by the U psi step sub-circuit. Both of them generate novel (l) values for the next step in the tree traversal.
U psi and U psi step units share a very similar architecture, however the latter does not need to perform the division, as the second best choice for s l (and thus for the alternative node) can be easily derived as follows. When U psi unit computes the division, the result is approximated either by defect or by excess to the nearest PAM constellation point: the best choice for s l is given by (see Fig. 3 )
where 1 is the correction term provided as output by the U psi unit (see Fig. 1 ). The sign of 1 is used by U psi step unit to take the second (and following) nearest point in the PAM constellation, according to the following rule:
where A is the distance between two consecutive points and the initial value, s l , is the closest point given in (9) . Fig. 3 shows the sequence of alternative nodes selected at a given tree level, after the occurrence of pruning. Depending on the values assumed by the father node metric, the algorithm descends along the tree, reaching the son node, or it moves to the alternative node on the same level. It is worth noting that the computations of the (l) values for both son and alternative nodes are performed concurrently with the elaboration of the T l metric for the father node. This approach provides a significant speedup to the inherently serial SE Sphere Decoding Algorithm and has a limited impact on complexity.
C. Global Architecture
The block scheme of the SE tree-traversal circuit is finally depicted in Fig. 4 . The following four fundamental processing blocks can be identified in this architecture:
• U psi unit, which selects the most probable son of the current node and computes updated (l) through expression (8) (see also • C.U., control unit devoted to the proper selection of the tree search direction. The C.U. constitutes the core of the tree traversal algorithm and it must also carry-out two further tasks: to verify the pruning condition and, on the basis of this verification, to properly dispatch data between the other units. Symbols given in Fig. 4 are related to the case of a node expanded in the depth-first mode, with no pruning: inputs of the Metric compute unit are fed with outputs provided by U psi block. When a pruning occurs, multiplexers are switched and metrics related to the alternative node are selected. Finally, Psi Memory stores
vectors. The capability of managing more than one modulation scheme in order to adaptively select the most efficient one according to user needs and channel conditions, is one of the most important requirements of modern wireless communications systems. The Golden code, thanks to the non-vanishing determinant property, is very well suited for such application since it achieves the best performance independently of the QAM size [3] . In order to take full advantage of this Golden code feature, an enhanced implementation has been realized to allow run-time choice of the modulation scheme.
This implementation relies on the same architecture described in the previous section, with an additional parameter that allows the run-time selection of the constellation and that impacts mainly on the control logic.
At each level of the tree, the C.U., besides the pruning condition verification, also carries out a second verification task, related to the mapping constraint: it verifies if a certain value of s l still belongs to the specified constellation and uses this information to drive the processing.
This mapping constraint must also be taken into account in the division (l+1) l =R ll . As the number of acceptable values for this operation depends on the adopted modulation, the constellation parameter is used to dynamically drive the iterations of the dichotomic division algorithm.
Although the architecture deals with the implementation of the Golden Code where n = 8, it is also scalable in terms of n. Increasing the number of transmitting and receiving antennas: a larger value of the n parameter can be set in the VHDL code to synthesize detectors for larger STBcodes. Of course a larger n implies a more expensive architecture: particularly the value of n mainly affects the number of values to be evaluated in parallel in figures and, the depth of the tree and the size of memory.
The complexity of U psi and U psi step units grows almost linearly with n; the memory size increases as n 2 , because n values of (l) have to be stored for n tree levels. Finally, the throughput is expected to decrease with n, since the number of visited nodes grows, but this effect is strongly dependent also on the supported code.
V. SYNTHESIS RESULTS
The first proposed architecture, tailored to process the 16-QAM case, has been synthesized on both 0.13-and 0.25-m CMOS standard cell technologies, using the Synopsys Version Z-2007.03-SP1; synthesis on 0.13-m technology has been performed for the second flexible architecture. A commercial low-power library has been chosen.
A wide range of simulations have been carried out in order to determine the effects of different fixed-point representations on the performance for both 16 and 64-QAM modulation schemes, they show that a total of 12 bits lead to performance very close to the floating-point case for 16-QAM modulation, while 14 bits are necessary in the detection of 64-QAM signals. In order to enable the direct comparison with existing hardware realizations [6, I and II ASIC], [7] , a 16 bit datapath has been chosen and for the first implementation a 16 QAM has been adopted. The overall decoder has also been simulated with the uncoded 4 2 4 MIMO system and throughput figures reported in Table I refer to this configuration. The comparison of the described architectures to existing implementations tend to be quite difficult to carry out, because different approaches have been adopted: particularly, our solution implements the ML detection algorithm by means of a serial architecture, while the first ASIC in [6] maps the same algorithm onto a parallel structure and the second ASIC in [6] makes use of a serial scheme to realize a close to ML algorithm. These differences must be carefully evaluated while reading results in Table I .
Comparing the parameterizable architecture to parallel implementations in Table I , the solution described in [7] and the first ASIC presented in [6] , it can be observed that a single metric computation is performed at each cycle, instead of multiple parallel metric computations. This characteristic justifies both the reduced complexity and the inherent flexibility of the proposed architecture. At the same time, thanks to the adopted pipelined architecture, a remarkable average decoding throughput is achieved without any highly specialized structure.
Implementation cost is slightly higher than for the second ASIC proposed in [6] , where a serial approach is also adopted, in conjunction with a close to ML algorithmic approach.
On the other hand, the flexible implementation in the last column of Table I prove the limited complexity and performance overhead associated to the capability of dynamically adapting to different modulations (4-, 16-, and 64-QAM).
Finally, the finite precision analysis of the decoding algorithm proved that a negligible performance degradation is introduced in the case of 64-QAM modulation when a total of 14 bits are used: post-synthesis results show a complexity reduction of 8 Kgates (14% of core area) and a throughput increase to 150 Mb/s.
VI. CONCLUSION
A novel approach has been presented for the hardware implementation of a sphere decoder detector: the proposed solution uses a single metric computation per cycle and pipelining, breaking the sequential nature of SD algorithm.
The main element of novelty of the described approach is in its inherent flexibility that makes it suitable for adaptive modulation schemes. Two different hardware architectures have been designed: the first implementation is a parametrizable one, while the second is able to adapt on the fly to different modulation schemes.
Final synthesis results of the proposed architectures show a significant complexity reduction (approximately 50% for 16 QAM modulation) with respect to parallel structures. This is mainly due to the single metric computation per cycle. A remarkable average decoding throughput can be achieved with both implementations, thanks to the pipelining technique, even if the hardware was not tailored on a single modulation scheme as all previously proposed solutions.
