Abstract. Stochastic detection for multi-antenna (MIMO) systems promises communications performance close to max-log detection for certain SNR regimes, especially when the system iterates between detector and channel decoder following the Turbo Principle. In this work, we propose a parallel VLSI architecture for soft-input soft-output Markov chain Monte Carlo based stochastic MIMO detection. It features runtime adaptability to varying channel conditions, effectively allowing us to adjust the invested effort. Besides the details of our area-throughput efficient design, like the low-level algorithm and micro-architecture design, we also provide an extensive data set from our experiments regarding the detector's communications performance and relate it to our VLSI implementation results. The provided data analysis highlights the architecture's run-time adaptability and demonstrates how we can trade off throughput for improved communications performance.
Introduction
With the wide-spread adoption of MIMO (multi-antenna) technology in current and future wireless communication systems, such as those based on the IEEE 802.11n standard [1] , academia and industry are searching for MIMO detectors with reasonable implementation complexity and algorithmic performance. Especially for systems using bit-interleaved coded modulation with iterative decoding (BICM-ID) [2] , a major challenge for VLSI implementation is the required soft-input soft-output (SISO) detector, since optimal detection has an exponential complexity.
Iterative MIMO decoding can yield impressive algorithmic performance gains in terms of significantly reduced signal-to-noise ratio (SNR) requirements to achieve a certain fixed error rate [3] . This SNR gain can be used in several ways: we can extend the transmission range, serve more users (i.e. tolerate more interference), lower the transmission power to save energy (and at the same time reduce interference to other users), or transmit at a higher throughput in the same bandwidth.
Possible detectors can be roughly put into two categories: linear detectors, e.g. MMSE-filter based [4] [5] [6] , and non-linear detectors, e.g. [3] [7] [8] [9] . Basically, linear detectors try to suppress noise using linear filtering, then decode the estimate. In contrast, non-linear detectors perform a search, e.g. a randomly guided one, in the space of possibly transmitted data vectors. Stochastic detection based on Markov chain Monte Carlo (MCMC) methods [10] belongs to this class. It enables small configurable detectors that can cover a large design space. Furthermore, when iterating between detector and channel decoder, MCMC detection shows a communications performance close to max-log detection for certain SNR regimes [10] .
To date, little research effort has been directed towards this field. Only a handful of publications on MCMC detector architectures exist at the moment [11] [12] [13] [14] . None of them relates communications performance to VLSI implementation results.
Related Work. An MCMC-based SISO MIMO detector ASIC design supporting independent parallel Gibbs Samplers is presented in [12] . Amongst other things, [12] introduces an initialization scheme for the completely recursive, and thus simplified, computation of the detector states, and shows how to reuse the circuitry to draw independent first samples. However, a multiplier in the timing critical path yields a limited throughput and a relatively large area consumption.
In [11] , the authors propose an MCMC-based SISO MIMO detector architecture mapped on an FPGA. It features one multiplier-free Gibbs Sampler pipelined at the symbol vector level. The architecture uses a simple recursive metric computation, but requires one dot-product per cycle. The first sample of every chain needs to be generated externally.
The hybrid soft-output only MCMC detector architecture [14] combined with a hard-output fixed-complexity sphere detector (FSD) features parallel multiplier-free Gibbs Samplers that start with the best candidates found by the FSD. However, the design requires the QR-decomposition of the channel matrix, and the results are only given in terms of operation counts.
Contribution. We present a complete redesign of the MCMC-based MIMO detector architecture presented in [12] , with multiplier-free Gibbs Samplers and further architectural improvements that result in a significant area reduction and timing improvement. Post-layout area and clock period reduce by about 50 % and 40 % respectively. In extension to our previous publication [13] , we additionally provide our detector's communications performance results and present an analysis showing how to trade off throughput for improved communications performance at run-time.
Outline. First, we introduce the general concept of MCMC-based MIMO detection (Sect. 3), describe the implemented algorithm (Sect. 4), then we propose the redesigned architecture (Sect. 5). Subsequently, we explicitly highlight the differences to the reference design [12] in Sect. 6. Our implementation results are presented in Sect. 7. The analysis of the communications performance results is explained in Sect. 8.
System Model
We consider a spatial-multiplexing $N_t \times N_r$ MIMO system with BICM-ID, as depicted in Fig. 1. A message $b \in \{0,1\}^{N_b}$ is encoded with rate $r = N_b/N_c$ and interleaved, yielding the code word $c \in \{0,1\}^{N_c}$. Let $\mathcal{X} \subset \mathbb{C}$ be a modulation alphabet with $K = \log_2 |\mathcal{X}|$ bits per symbol. The code word is partitioned into multiple subvectors $c_n \in \{0,1\}^{K N_t}$. They are subsequently mapped to symbol vectors $x_n \in \mathcal{X}^{N_t}$ that are transmitted independently. Assuming a frequency-flat fading channel characterized by $H_n \in \mathbb{C}^{N_r \times N_t}$, the received symbol vector at time $n$ is $y_n = H_n x_n + w_n$, where $w_n \in \mathbb{C}^{N_r}$ is a white Gaussian noise process with $E[w_n w_n^H] = N_0 I_{N_r}$. In the remainder, the time index $n$ is dropped for convenience. Using iterative MIMO decoding following the Turbo Principle [15] , detector and channel decoder exchange extrinsic information $\lambda^e = \lambda^p - \lambda^a$ in terms of log-likelihood ratios (LLRs), where $\lambda^p$ are the detector's posterior LLRs and $\lambda^a$ are the prior LLRs fed back from the decoder.
MCMC-Based MIMO Detection
The Markov chain Monte Carlo based MIMO detector class that we consider performs a randomly guided search in the space c ∈ {0, 1} KNt . It starts with a random candidate, then walks around randomly. On its way, it evaluates and saves metric values of the current candidates, which are later used to approximate the posterior LLRs. The random process (Monte Carlo) from which it draws new candidates evolves recursively (Markov chain). By design the search converges towards candidates of high probability [10] .
We select independent first samples c (q,0) ∈ {0, 1} KNt , one per chain q = 1 . . . N q , either randomly from the prior distribution c (q,0) ∼ p(c) = f (λ a ) or given by an external hard-output detector c (q,0) = c ext (usually for at most one chain). Every sample s = 1 . . . N s is drawn in KN t steps. The algorithm sequentially replaces every bit with 0 and 1, computes the metric for those two candidates, then selects one of them as the next partial sample.
Let ϕ : {0, 1} NtK → X Nt be a rule that maps bit labels onto symbol vectors x ∈ X Nt . We define the metric
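for the candidate $c \in \{0,1\}^{K N_t}$, which is related to the posterior probability $P(c \mid y, H, \lambda^a)$. Furthermore, let
$$c_{b\beta} = [c_1, \ldots, c_{b-1}, \beta, c_{b+1}, \ldots, c_{K N_t}]^T \quad (2)$$
be the vector $c$ with the $b$-th bit replaced by $\beta$. The detector approximates the posterior LLRs as
$$\lambda^p_b \approx \max_{q,s} \mu\big(c^{(q,s)}_{b0}\big) - \max_{q,s} \mu\big(c^{(q,s)}_{b1}\big) \quad (3)$$
where we search for the two maxima for every bit over all chains and samples.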
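The search described in this section can be illustrated with a deliberately naive sketch that recomputes the metric from scratch for every candidate. Several parts are assumptions made only for illustration and are not taken from the paper: the QPSK mapping `qpsk_map`, the concrete metric form (squared distance scaled by 1/N0 plus a prior term, with LLRs defined as log P(0)/P(1)), the plain logistic bit draw without temperature parameter, and all function names.

```python
import numpy as np

def qpsk_map(bits):
    """Hypothetical mapping (K = 2): bit 0 -> +1, bit 1 -> -1 on each axis."""
    b = np.asarray(bits).reshape(-1, 2)
    return (1 - 2 * b[:, 0]) + 1j * (1 - 2 * b[:, 1])

def metric(c, y, H, N0, lam_a):
    """Assumed metric: log-likelihood plus log-prior, up to a constant."""
    x = qpsk_map(c)
    return -np.linalg.norm(y - H @ x) ** 2 / N0 - np.dot(c, lam_a)

def mcmc_detect(y, H, N0, lam_a, n_chains=4, n_samples=8, seed=0):
    rng = np.random.default_rng(seed)
    n_bits = 2 * H.shape[1]
    mu_max = np.full((n_bits, 2), -np.inf)      # best metric seen with bit b = 0 / 1
    p1 = 1.0 / (1.0 + np.exp(lam_a))            # prior P(c_b = 1) for LLR = log P0/P1
    for _ in range(n_chains):
        c = (rng.random(n_bits) < p1).astype(int)   # first sample drawn from the prior
        for _ in range(n_samples):
            for b in range(n_bits):             # one sample is drawn in K*Nt bit steps
                mu = np.empty(2)
                for beta in (0, 1):
                    c[b] = beta
                    mu[beta] = metric(c, y, H, N0, lam_a)
                mu_max[b] = np.maximum(mu_max[b], mu)
                # Gibbs step: P(c_b = 1) = logistic(mu1 - mu0), clipped to avoid overflow
                prob1 = 1.0 / (1.0 + np.exp(np.clip(mu[0] - mu[1], -50.0, 50.0)))
                c[b] = int(rng.random() < prob1)
    return mu_max[:, 0] - mu_max[:, 1]          # max-log posterior LLRs

H = np.eye(2, dtype=complex)
tx_bits = np.array([0, 1, 1, 0])
llr = mcmc_detect(H @ qpsk_map(tx_bits), H, N0=0.1, lam_a=np.zeros(4))
```

In this noiseless toy case with an identity channel, the sign of each LLR recovers the transmitted bit. The recursive formulation in the following section avoids the full metric recomputation that makes this sketch expensive.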
Low-Level Algorithm
The presented algorithm implements the max-log variant of the Rao-Blackwellized MCMC detection algorithm with uniform sampling described in [10] . Its basic idea is to recursively compute the metric in Eq. (1) by tracking the changes while drawing bits [12] . First, we introduce the basic concepts required for understanding the algorithm, then describe the algorithm in detail. For the theoretical background, the reader is referred to [10, 12] .
Basic Concepts
Matched Filter. The algorithm in [12] replaces $H$ with the Gram matrix $R = H^H H$ and the received symbol vector $y$ with the matched filter output $y_{mf} = H^H y$ in the metric. This does not influence the posterior LLR calculation, but it allows us to exploit the symmetry $R = R^H$.
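The reason the substitution leaves the LLRs unchanged is that $\|y - Hx\|^2 = y^H y - 2\,\mathrm{Re}\{x^H y_{mf}\} + x^H R x$, and the term $y^H y$ is the same for every candidate, so it cancels in any metric difference. The following sketch checks this numerically; the random test data and the helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
Nr, Nt = 4, 4
H = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
y = rng.standard_normal(Nr) + 1j * rng.standard_normal(Nr)

R = H.conj().T @ H        # Gram matrix, Hermitian by construction
y_mf = H.conj().T @ y     # matched filter output

def dist_direct(x):
    """Squared Euclidean distance computed from H and y."""
    return np.linalg.norm(y - H @ x) ** 2

def dist_gram(x):
    """Same distance without the candidate-independent term ||y||^2."""
    return np.real(x.conj() @ R @ x) - 2 * np.real(x.conj() @ y_mf)

x_a = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j])
x_b = np.array([1 + 1j, -1 - 1j, -1 + 1j, 1 - 1j])
diff_direct = dist_direct(x_a) - dist_direct(x_b)
diff_gram = dist_gram(x_a) - dist_gram(x_b)   # equal to diff_direct up to rounding
```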
Gibbs Sampler (GS
and thus contains bits from the previous sample c (q,s−1) and the current sample c (q,s) . 
This concept enables the initialization of parallel independent Gibbs Samplers [12] .
Symbol Deltas. When the GS state changes, at most one bit is different. We introduce the notation
where ϕ n is the mapping rule for the n-th antenna, and the b-th bit belongs to the n-th antenna.
Recursive Dot-Product. The algorithm tracks the current value of
where $\tilde{R}$ is the matrix $R$ with the diagonal set to zero. Starting from $S^{(-1)} = y_{mf} - \tilde{R} x^{(-1)}$, it updates $S$ recursively when $c^{(q,s)}_b$ changes.
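A small numerical sketch of the recursive dot-product follows. It assumes that the tracked quantity has the same form as its starting value, i.e. $S = y_{mf} - \tilde{R}x$, and that the symbol delta is defined as new symbol minus old symbol; the sign convention, the 4-QAM symbols, and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
Nt = 4
H = rng.standard_normal((Nt, Nt)) + 1j * rng.standard_normal((Nt, Nt))
y = rng.standard_normal(Nt) + 1j * rng.standard_normal(Nt)
R = H.conj().T @ H
R_tilde = R - np.diag(np.diag(R))    # R with the diagonal set to zero
y_mf = H.conj().T @ y

x = np.full(Nt, 1 + 1j)              # common starting point x^(-1)
S = y_mf - R_tilde @ x               # initial value S^(-1)

n = 2                                # antenna whose symbol changes
x_new = x.copy()
x_new[n] = -1 + 1j                   # a single bit flip on antenna n
delta = x_new[n] - x[n]              # symbol delta (assumed sign convention)

S_rec = S - R_tilde[:, n] * delta    # recursive update: one column of R times delta
S_full = y_mf - R_tilde @ x_new      # reference: full recomputation
```

Because the diagonal of $\tilde{R}$ is zero, entry `n` of `S` is untouched by the update, so at most $N_t - 1$ entries change per bit flip.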
Recursive Metric Computation. We introduce an arbitrary offset such that μ(c (−1) ) = 0, which cancels out in Eq. (3). Let the distance update be
where the b-th bit belongs to the n-th antenna, then the metric update is
which we either subtract from or add to the current metric μ(c (q,s) ), depending on the bit flip direction, if the b-th bit changes.
Log-Domain Bit Probability. The term
expresses the probability of the next bit being 1 in the log-domain, where the temperature parameter η mitigates lock-in effects in the high-SNR regime [10] . For the conversion to the linear domain, we apply a piece-wise linear approximation to $\mathrm{logistic}(\gamma) = 1/(1 + e^{-\gamma})$ as in [11, 12] . To this end, the GS simply limits γ to the range [−4, 4) and compares −γ to a uniformly distributed pseudo-random number u ∼ U(−4, 4) in the same range.
Overall Algorithm Design
Figure 2 depicts the algorithm partitioned into four different parts: the Front-end Processing (FEP), which transforms the channel observations, the parallel Gibbs Samplers (GS) realizing the Markov chains, the Metric Update (M) tracking the current metric state, and the LLR Computation, which searches for the two maximum metric values per bit.
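The comparison-based bit draw from the log-domain bit probability (Sect. 4.1) can be sketched as follows. Limiting γ to ±4 and comparing against a uniform number on [−4, 4) is equivalent to drawing a 1 with probability (γ + 4)/8. The direction of the comparison (bit is 1 when u exceeds −γ) and the function names are assumptions for illustration.

```python
import numpy as np

def logistic(gamma):
    """Exact logistic function, for reference."""
    return 1.0 / (1.0 + np.exp(-gamma))

def pwl_logistic(gamma):
    """Piece-wise linear approximation implied by limiting gamma to [-4, 4]."""
    return (np.clip(gamma, -4.0, 4.0) + 4.0) / 8.0

def draw_bit(gamma, u):
    """Return 1 if the uniform number u in [-4, 4) exceeds the limited -gamma."""
    g = np.clip(gamma, -4.0, 4.0)
    return int(-g < u)

rng = np.random.default_rng(3)
u = rng.uniform(-4.0, 4.0, size=100_000)
gamma = 1.0
empirical = np.mean([draw_bit(gamma, ui) for ui in u])   # close to pwl_logistic(1.0)
```

No exponential, division, or multiplier is needed in the sampler itself: a limiter and a comparator suffice, at the price of a coarser probability than `logistic` would give.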

Front-end Processing
as described in Sect. 4.1 (Recursive Dot-Product) but scaled by Γ .
Gibbs Sampler
Algorithm 1 describes how the GS sequentially draws bits of the candidate sequence c (q,s) . GS and Metric Update share the term δ 
Algorithm 1. Gibbs Sampler
3 for s = 0 to Ns do 
Metric Update
Algorithm 2 recursively computes the current candidate's metric μ(c (q) b ), using the state μ (q) , and produces the two metrics for the current bit μ(c b0/1 ). As stated earlier, we arbitrarily set the metric for the common starting point to zero (line 1). Lines 4 to 9 show the underlying metric update. Of the two possible states, one is identical to the current state, and thus has the same metric value (line 4). The other one is updated according to the direction of the bit flip (lines 6 and 8). In line 9, we select one of the two as the new current metric. It remains unaltered if the bit does not change.
Algorithm 2. Metric Update
input: c^(q,s)_b, c^(q,s−1)_b, δ^(q,s)_b, λ^a, chain index q
output: μ(c^(q,s)_b0), μ(c^(q,s)_b1)
1   μ^(q) ← 0
2   for s = 0 to Ns do
3     for b = 1 to NtK do
4       μ(c^(q,s)_b0) = μ(c^(q,s)_b1) = μ^(q)
5       if c^(q,s−1)_b = 0 then
6         μ(c^(q,s)_b1) = μ(c^(q,s−1)_b) − (η δ^(q,s)_b + λ^a_b)
7       else
8         μ(c^(q,s)_b0) = μ(c^(q,s−1)_b) + (η δ^(q,s)_b + λ^a_b)
9       μ(c^(q,s)_b) = μ(c^(q,s)_b0) if c^(q,s)_b = 0, μ(c^(q,s)_b1) if c^(q,s)_b = 1
10      μ^(q) ← μ(c^(q,s)_b)
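As a software illustration of one bit step of Algorithm 2 (lines 4 to 10), the following sketch may help. It assumes that the term μ(c^(q,s−1)_b) in lines 6 and 8 is the stored chain state μ^(q); the function name and argument order are hypothetical.

```python
def metric_update(mu, c_old, c_new, delta, lam_a, eta):
    """One bit step of the recursive metric update.

    mu     : current metric state of the chain
    c_old  : value of bit b before the draw
    c_new  : value of bit b after the draw
    delta  : distance update for bit b
    lam_a  : prior LLR of bit b
    eta    : temperature parameter
    """
    mu_b0 = mu_b1 = mu                  # line 4: one candidate equals the current state
    step = eta * delta + lam_a
    if c_old == 0:
        mu_b1 = mu - step               # lines 5-6: candidate with the bit flipped 0 -> 1
    else:
        mu_b0 = mu + step               # lines 7-8: candidate with the bit flipped 1 -> 0
    mu_next = mu_b0 if c_new == 0 else mu_b1   # lines 9-10: select the new state
    return mu_b0, mu_b1, mu_next

# Flip a bit from 0 to 1 and back again: the metric returns to its starting value.
_, _, mu1 = metric_update(0.0, 0, 1, delta=2.0, lam_a=0.5, eta=0.5)
_, _, mu2 = metric_update(mu1, 1, 0, delta=2.0, lam_a=0.5, eta=0.5)
```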
LLR Computation
Algorithm 3 searches for the maximum metrics among all chains, then compares these local maxima with the current global maxima. It excludes the s = 0 step, which is the transition from c (−1) to c (q,0) , from the search (line 3). The computation of the extrinsic LLRs in line 7 is included, as it can be easily implemented in hardware. 
Algorithm 3. LLR Computation
input: μ(c^(q,s)_b0), μ(c^(q,s)_b1), λ^a
output: λ^e
1   μmax_b0 ← −∞ ∀ b = 1 ... NtK
2   μmax_b1 ← −∞ ∀ b = 1 ... NtK
3   for s = 1 to Ns do
4     for b = 1 to NtK do
5       μmax_b0 ← max(μmax_b0, max_q(μ(c^(q,s)_b0)))
6       μmax_b1 ← max(μmax_b1, max_q(μ(c^(q,s)_b1)))
7   λ^e_b = μmax_b0 − μmax_b1 − λ^a_b ∀ b = 1 ... KNt
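A batch-style sketch of the LLR computation is given below. It assumes that all candidate metrics are available in arrays indexed by sample, chain, and bit, whereas the hardware processes them as a stream; the array layout and the function name are illustrative assumptions.

```python
import numpy as np

def extrinsic_llrs(mu_b0, mu_b1, lam_a):
    """mu_b0, mu_b1: arrays of shape (Ns + 1, Nq, n_bits); row 0 is the s = 0 step."""
    n_bits = mu_b0.shape[2]
    mu_max_b0 = np.full(n_bits, -np.inf)
    mu_max_b1 = np.full(n_bits, -np.inf)
    for s in range(1, mu_b0.shape[0]):      # the s = 0 step is excluded from the search
        # local maximum over all chains, then comparison with the global maximum
        mu_max_b0 = np.maximum(mu_max_b0, mu_b0[s].max(axis=0))
        mu_max_b1 = np.maximum(mu_max_b1, mu_b1[s].max(axis=0))
    return mu_max_b0 - mu_max_b1 - lam_a    # extrinsic LLRs

mu0 = np.array([[[100.0, 100.0]], [[1.0, -2.0]], [[0.0, 3.0]]])
mu1 = np.array([[[100.0, 100.0]], [[-1.0, 0.0]], [[-5.0, 1.0]]])
llr_e = extrinsic_llrs(mu0, mu1, np.array([0.5, -0.5]))
```

The large values in row 0 of the example are ignored, which mirrors the exclusion of the transition from the common starting point to the first sample.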
VLSI Architecture
Overview
The macro pipeline of FEP-Circuit and MCMC core, shown in Fig. 3 , constitutes the proposed MCMC detector. Both components require multiple clock cycles per input vector, but double buffering between FEP and core ensures that the computations can overlap. The MCMC core in turn contains four stages connected via registers. The stages exchange information in every clock cycle and thus effectively operate as a pipeline. The FSM and the multiplexers (e.g. 
FEP-Circuit
The architecture, depicted in Fig. 4 , contains in total five multipliers. Using four of these, the dot-product for the terms $H^H y$ and $R = H^H H$ requires $N_r$ cycles per complex entry. We need only the lower triangular part of $R$ due to $R^H = R$. The architecture computes either one complex off-diagonal entry, or two real diagonal entries in parallel. The fifth multiplier alternately multiplies real and imaginary parts with $\Gamma = 2^{\alpha}/(N_0 \eta)$. In parallel, we multiply the entries of $R$ with $x^{(-1)}_t = 1 + j$ (cf. Sect. 4.1 (Common Starting Point)) using only adders and multiplexers, and accumulate the results to obtain $S$.
GS/M-Circuit
Figure 5 depicts the GS-Circuit. The |Δ|²-multiplier, depicted in detail in Fig. 6(a) , exploits the limited range of |Δ| 2 ∈ {−3, −2, . . . , 3} × {8, 16} which , then finishes in the write-enable control for the S registers. The M-Circuit, shown in Fig. 7 , implements Algorithm 2 using a write-enabled register for the current metric, which is updated when we flip the current bit. The multiplication with η is implemented as a constant shift.
Update-S-Circuit. The Update-S-Circuit shown in Fig. 8 has $(N_t - 1)$ complex-valued Δ-multipliers, i.e. $2(N_t - 1)$ times Fig. 6(b) . Using the multiplexers 
and
, we can update all N t elements of S, however only N t − 1 change per clock cycle. The entries of R e.g. r 1n , r 2n are selected in the Mux stage. Similar to the GS-Circuit, the adder-subtractor control considers Δ < 0, if |Δ| is imaginary, and additionally the old bit c 
L-Circuit
The L-Circuit shown in Fig. 9 contains two register files (RFs) for the current maximum metrics with KN t entries each. We use tokens propagating alongside the data to indicate whether a value is valid. The Compare Select (CS) elements select the maximum of the valid inputs. The registers also store tokens per entry, which are reset to zero when the processing of a symbol vector starts. After the scalar subtractor, we saturate the extrinsic LLRs to limit their dynamic range. The saturation has a positive influence on the communications performance. 
Differences to Reference Architecture
The proposed architecture is a complete redesign of [12] . This section explicitly highlights the architectural modifications. Both the original and the new timing critical paths are located in the GS-Circuit.
Multiplier-Free Gibbs Sampler: Similar to [11] , we move the multiplication with 1/(ηN 0 ) out of the GS into the FEP, by scaling R and S with Γ . This removes the multiplier from the detector's critical path, but increases the required word lengths.
Dynamic Scaling: The normalization of Γ ∈ [0.5, 1) allows us to use smaller word lengths, mitigating the previously mentioned increase. Consequently, we need an arithmetic shifter in the GS-Circuit at the previous location of the multiplier, which reverts the normalization.
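The normalization can be illustrated in floating point. The sketch assumes $\Gamma = 2^{\alpha}/(N_0 \eta)$ with the integer α chosen such that Γ falls into [0.5, 1); the example values for N0 and η and the function name are arbitrary assumptions. Python's `math.frexp` returns exactly such a mantissa and exponent pair.

```python
import math

def normalize_gamma(N0, eta):
    """Split 1/(N0*eta) into a mantissa in [0.5, 1) and a power-of-two exponent."""
    mantissa, exponent = math.frexp(1.0 / (N0 * eta))   # value = mantissa * 2**exponent
    alpha = -exponent          # so that Gamma = 2**alpha / (N0 * eta)
    return mantissa, alpha

Gamma, alpha = normalize_gamma(N0=0.1, eta=2.0)   # 1/(N0*eta) = 5 = 0.625 * 2**3

# A value scaled by Gamma is brought back to the 1/(N0*eta) scale by a shift of -alpha bits.
value = 3.0
restored = math.ldexp(value * Gamma, -alpha)      # equals value / (N0 * eta)
```

In hardware, the shift by −α corresponds to the arithmetic shifter mentioned above, while the multiplication by the normalized Γ remains in the FEP.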
Pipelined Input Multiplexers: Our MCMC detector selects the column of R and the entry of λ a in the new Mux stage in front of the GS stage. While this removes those multiplexers from the detector's critical path, it adds one latency cycle.
Reduced Update-S-Circuit: We remove two Δ-multipliers (one per real and imaginary part) from the Update-S-Circuit, since in every cycle one of the entries of S does not change. This requires multiplexers for the resource sharing, which are however not in the critical path and are smaller than the removed Δ-multipliers.
Shared Maximum Metric Register File: The RFs are moved from the M-Circuit [12] to the L-Circuit. This reduces the required RFs from N p to one.
We also add a pipeline register after the L-Circuit to improve timing, which requires another extra latency cycle. Also, our M-Circuit in Fig. 7 has one adder-subtractor instead of two adders, similar to [11] .
Adder-Subtractor Units: These new units, placed right after the Δ-multipliers in the GS- and the Update-S-Circuit, replace the original adders and the conditional negation units. The control selects addition or subtraction depending on the sign of Δ, if Δ is imaginary, the old bit c
Simplified Delta Multiplier: Our Δ-multipliers, used for γ and S, compute the absolute value |Δ|. This removes one multiplexer stage from the critical path.
Postponed Conjugation: We store only the lower half of R. Due to the Hermitian property of R, we have Im{r tn } = −Im{r nt }. The control of the subsequent adder-subtractor units considers the required negation, instead of an explicit conjugation [12] .
Results
With the word lengths given in Sect. 7.1 and the throughput equations in Sect. 7.2, we first compare our model to the reference architecture [12] based on gate-level synthesis results, then we present post-layout results for different design-time variants of our architecture. Section 8 presents the algorithmic evaluations.
Simulation Setup
An 802.11n-like 4 × 4 MIMO system is considered assuming a spatially uncorrelated Rayleigh channel, perfect channel knowledge and a max-log BCJR decoder. For all results, we assumed a rate-5/6 tail-biting binary convolutional code with generator polynomials 0133 and 0171 and puncturing, a random interleaver and 64-QAM modulation ($K = 6$). The frame length of 2160 information bits equals the interleaver's length, which is one OFDM symbol for this setup. For every data point, we simulated at least $10^5$ frames. The average signal-to-noise ratio (SNR) per receive antenna is defined as $\mathrm{SNR} = E[\|Hx\|^2]/(N_r N_0)$. The required word lengths for an SNR loss of ≤ 0.1 dB compared to the floating-point model at a frame error rate (FER) of 10 % are: [integer.fractional] y [7.8] [8.4] . All are signed, per entry, and identical for real and imaginary parts. The first chain (q = 0) is always initialized with the result of a hard-output zero-forcing MIMO detector. We assume $N_q = 8$ chains with $N_s = 8$ samples per chain (i.e. $N_{gs} = 64$ in [12] ) for the next three sections, but vary those parameters in Sect. 8.
Architecture
Our parameterized architecture implementation currently supports up to 4 × 4 MIMO and 64-QAM. MIMO mode and QAM scheme can be configured at runtime within the supported range, which in turn can be configured at design-time. Each GS/M-pair can process up to 16 chains sequentially, with up to 16 samples per chain. The FEP-Circuit requires
cycles for its computation. This is slightly faster than the FEP-Circuit in [12] . The MCMC core runs for
cycles. Compared to [12] , we need two extra latency cycles (cf. Sect. 6 (Pipelined Input Multiplexers)). The code bit throughput of the architecture is $\theta_c = \frac{K N_t}{n_{gs}} f_{clk}$, assuming $n_{gs} \geq n_{fep}$ and sufficient input data.
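The throughput relation can be evaluated with a short calculation. The cycle count below uses the expression for $n_{gs}$ that is stated later in Sect. 8, $n_{gs} = (N_q/N_p)(N_s + 1) K N_t + 5$, and assumes that $N_p$ divides $N_q$; the function names are hypothetical.

```python
def n_gs_cycles(Nq, Ns, K, Nt, Np, latency=5):
    """MCMC core cycles per symbol vector (assumes Np divides Nq)."""
    return (Nq // Np) * (Ns + 1) * K * Nt + latency

def code_bit_throughput(f_clk, Nq, Ns, K, Nt, Np):
    """Code bits per second: K*Nt bits are detected every n_gs cycles."""
    return K * Nt * f_clk / n_gs_cycles(Nq, Ns, K, Nt, Np)

# 64-QAM (K = 6), Nt = 4, Np = 8, Nq = Ns = 8 at the post-layout clock of 479 MHz
cycles = n_gs_cycles(Nq=8, Ns=8, K=6, Nt=4, Np=8)
theta = code_bit_throughput(479e6, Nq=8, Ns=8, K=6, Nt=4, Np=8)
```

With these parameters the sketch yields 221 cycles and roughly 52 Mbit/s, which agrees with the layout figure reported for the largest instance.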
Synthesis Results
We synthesized the design with Synopsys Design Compiler I-2013.12-SP2 in topographical mode using a 1.0 V standard-performance standard cell library for the UMC 90 nm SP-RVT LowK CMOS process. One gate-equivalent (GE) is the area of one 2-input drive-1 NAND gate. Figure 10 compares the four instances N p = {1, 2, 4, 8} to [12] . While the most efficient design in [12] has an AT exec -product of 181.7 kGEµs, our proposed design achieves 50.0 kGEµs, which is 3.6 times more efficient. Table 1 lists the synthesis results for our fastest design instance and the reference design [12] . The FEP is larger (5 kGE), while the GS is smaller (−6.2 kGE per GS), since we moved the multiplier from the GS to the FEP. The additional area of the new arithmetic shifter is partially compensated for by the other improvements. The Update-S-Circuit becomes smaller (−2.5 kGE) since we save one complex Δ-multiplier and use |Δ| now. The saving effect is larger than the additional area from the multiplexers required for the resource sharing. The M-Circuit exhibits only about 7.4 % of the original area, since we moved the RFs to the L-Circuit, which consequently became larger (10 kGE). The remainder of the area (−12.9 kGE) is occupied amongst others by the R column multiplexers. The area is reduced because the multiplexers are no longer in the timing critical path.
In total, the redesigned architecture occupies only about 48 % of the original area for N p = 8. The saving depends on the number of GS/M-Circuits. The critical path was shortened by about 40 %, i.e. the maximum clock frequency increased from 312 MHz to 526 MHz.
Layout Results
A layout was obtained with Cadence SoC Encounter 9.1 for each configuration's fastest design instance in order to further study the proposed architecture's implementation complexity and to enable more precise comparison with future related work. All following area figures are taken from the layout results, depicted in Fig. 11 . The consumed area slightly increased, while the achievable clock frequency decreased. It is interesting that the throughput mainly depends on the number of parallel GS/M-Circuits and the chain parameters, i.e.
as can be seen in Fig. 11.
Fig. 11. Area vs. throughput based on the MCMC detector's layout results. For each design-time configuration, the ASIC with the fastest clock is shown. As an example, the 16-QAM 2 × 2 design supports one or two antennas and 4- or 16-QAM at run-time.
The largest instance, for 64-QAM, N t = 4 and N p = 8, requires 149.5 kGE or 0.47 mm 2 and achieves a maximum clock frequency of 479 MHz, yielding a code bit throughput of 52 Mbit/s. The fastest instance in terms of throughput supports 4-QAM, N t = 2 and has N p = 8 GS/M-Circuits. It occupies in total an area of 70.7 kGE or 0.22 mm 2 and runs at 664 MHz, which results in a throughput of 66 Mbit/s.
To determine the smallest instance, which marks the lower corner of the covered design space in Fig. 11 , we synthesized the detector with N t = 2, 4-QAM and one GS/M-Circuit for a target of 100 MHz. This ASIC consumes 19.2 kGE or 0.06 mm 2 , runs at 165 MHz and yields a 2.27 Mbit/s throughput. The FEP-Circuit and MCMC core require 10.9 kGE and 8.3 kGE respectively. Further word length optimizations could yield additional area reductions.
Table 2 compares our work to a selection of reported MIMO detector implementations. We make three observations. First, in terms of hardware efficiency expressed in Mbit/s/kGE, the MCMC detector resides in about the same order of magnitude as the single-tree-search sphere decoder (STS-SD) [7] , though our architecture is more than two times more efficient than our reference architecture [12] . The MCMC detector exhibits a deterministic run-time, which eases the receiver system design, while the SD can in principle always achieve near-capacity performance at the cost of a strongly varying run-time. Secondly, the MCMC detectors (and the STS-SD) are about one order of magnitude less efficient than the linear [4, 5] and iterative-linear [6] detectors, and most notably the fixed-complexity sphere decoder (FCSD) [8] , which achieves close-to-optimal communications performance at a deterministic run-time. From this perspective, the FCSD [8] is the best choice. If a particularly small implementation is needed, the MCMC detector might have an advantage, depending on how well the FCSD scales. Lastly, there are three cases for the preprocessing circuitry. Some implementations include it [4] [5] [6] , it is optional for the MCMC detectors [12] , and it is definitely required for the other reported work [7] [8] [9] 16] . This of course makes the area-throughput efficiency comparison difficult.
Algorithmic Considerations
In this section, we put the code bit throughput θ c , as an implementation property of our architecture, in relation to our design's communications performance in terms of SNR required to achieve a 10 % frame error rate. With this data, we can determine for example appropriate run-time parameters, or an appropriate run-time strategy to adapt them. Depending on the optimization criterion, the parameter choices might be different. Possible criteria are for example spectral efficiency or energy efficiency (as future work, we plan to perform energy estimations). The first part of this section gives a general overview, while the second part explains in more detail the iterative receiver figures.
In the remainder, we use the post-layout implementation results of the 64-QAM, N t = 4, N p = 8 instance that runs at 479 MHz. The simulation setup that we select resembles the highest-throughput mode of the 802.11n standard, which requires a high SNR. However, our experiments show that MCMC-based detection performs best in a mid-range SNR regime, in combination with lower-order modulation schemes. Thus, this setup can be considered a kind of worst-case scenario for the MCMC detector.
We assume the same simulation setup as in Sect. 7.1. Additionally, we perform up to two detector-decoder iterations, i.e. per frame, we execute the MCMC detector and BCJR decoder twice. This gives us four run-time parameters: the number of chains $N_{q1}$ and samples $N_{s1}$ in the first iteration, and $N_{q2}$, $N_{s2}$ respectively for the second iteration. The short-hand notation GS 1 8x6 denotes $N_{q1} = 8$ and $N_{s1} = 6$; similarly we use GS 2 $N_{q2}$x$N_{s2}$. We simulated the parameter set $N_{q1/2} \in \{8, 16\}$ and $N_{s1/2} \in \{1, 2, \ldots, 16\}$. Thus all $N_p = 8$ GS/M-Circuits are always active. The total number of samples per iteration, defined as $N_{gs1/2} = N_{q1/2} \cdot N_{s1/2}$, is our measure for the invested effort. Figure 12 shows four curves: two for the first iteration, and two for the second. The last part of this section explains how we determine the two second-iteration curves. They are Pareto-optimal in terms of SNR versus throughput.
Fig. 12. Code bit throughput over SNR required to achieve a 10 % frame error rate.
In Fig. 12, we can clearly identify a run-time tradeoff between SNR and throughput. As could be expected, more effort (i.e. more samples, more chains) results in a better algorithmic performance (lower SNR). An SNR gain can be used in several ways: we can extend the transmission range, serve more users (more interference), or lower the transmission power to save energy (and reduce interference to other users).
In the non-iterative case (first iteration), we observe a vanishing gain beyond five samples, both for eight and 16 chains. At around 33.5 dB, it is better to use 16 instead of eight chains. Interestingly, this switches from GS 1 8x8 at 33.52 dB to GS 1 16x4 at 32.53 dB. The total number of samples for both configurations is 64, but we gain about 1 dB SNR while approximately maintaining the throughput. It is not completely identical due to the pipeline delays of the architecture.
Instead of using GS 1 8x6 after GS 1 8x5, a good decision would be to switch to the second iteration, therefore never using 16 chains in the non-iterative case. This yields a large SNR gain of about 2.7 dB at a similar throughput. At this transition point, we switch from GS 1 8x5 to GS 1 8x2-GS 2 8x2. The throughput drops slightly from 77.15 Mbit/s to 74.65 Mbit/s. With N gs1 = 40 compared to N gs1 + N gs2 = 32, the MCMC detector's effort remains very similar.
MCMC-based detection benefits greatly from iterative MIMO decoding. Switching from one to two iterations yields SNR gains as large as 6 dB. While in the first iteration we achieve only about 31.7 dB, all SNR operating points of the second iteration are lower than 31 dB. A possible explanation is the guidance from the channel decoder in terms of prior LLRs. It helps the MCMC-based detection in two ways: we select the initial samples $c^{(q,0)}$ from the prior distribution given by $\lambda^a$, and the transition probability γ depends on $\lambda^a$. This seems to let the chains converge faster (in fewer samples) to interesting regions.
We now take a closer look at the second iteration. There are four parameters, N q1/2 and N s1/2 . For a given SNR, we determine the parameter combination that yields the highest throughput. These Pareto-optimal points are shown in Fig. 13 . For the two second-iteration curves in Fig. 13 , we fix the number of chains in the first iteration to N q1 = 8 and N q1 = 16 respectively, then optimize over the remaining three parameters.
For our calculations, we assume that the channel decoder and the buffering between decoder and detector cause no additional delay. This is a somewhat idealized scenario, since it may imply a large area consumption, e.g. of the buffers, but it provides an upper bound for the achievable throughput. Thus, the throughput is given as
$$\theta_c = \frac{K N_t}{n_{gs,1} + n_{gs,2}} f_{clk}$$
with $n_{gs,1/2} = \frac{N_{q1/2}}{N_p} (N_{s1/2} + 1) K N_t + 5$ and fixing $N_p = 8$ here. We observe that more effort is required in the first iteration. For example, around 27 dB, the two configurations GS 1 8x7 and GS 2 8x4 are in use, i.e. $N_{gs1} = 56$ total samples for the first iteration, and $N_{gs2} = 32$ for the second.
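The idealized two-iteration throughput can be checked against the operating points quoted earlier in this section. The sketch assumes that the detector cycles of both iterations simply add up, with no decoder or buffering delay, and that $N_p$ divides $N_q$; the function names and the list-of-pairs interface are illustrative assumptions.

```python
def n_gs(Nq, Ns, K=6, Nt=4, Np=8):
    """Detector cycles for one iteration (assumes Np divides Nq)."""
    return (Nq // Np) * (Ns + 1) * K * Nt + 5

def throughput_mbps(configs, f_clk=479e6, K=6, Nt=4):
    """configs: list of (Nq, Ns) pairs, one per detector-decoder iteration."""
    cycles = sum(n_gs(Nq, Ns, K, Nt) for Nq, Ns in configs)
    return K * Nt * f_clk / cycles / 1e6

single = throughput_mbps([(8, 5)])            # GS1 8x5, non-iterative
double = throughput_mbps([(8, 2), (8, 2)])    # GS1 8x2 followed by GS2 8x2
```

Rounded to two decimals, `single` and `double` give 77.15 Mbit/s and 74.65 Mbit/s, matching the transition point discussed above.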
At about 26.8 dB, we switch from eight to 16 chains in the first iteration. It appears that multiple short chains are favorable for the first iteration. Only at around 24 dB, the detector should switch from eight to 16 chains in the second iteration. It is also the point where the effort significantly rises (near 23.6 dB), especially for the second iteration. This could be an indication for switching to three iterations.
From a pure SNR-throughput perspective, we can say that two iterations are better than a single one. As previously stated, we observe large SNR gains from iterating, and the best non-iterative operating point is off by about 0.7 dB compared to the worst second-iteration point. However, this of course ignores the hardware cost caused by the required buffering and the increased throughput requirement on the detector and decoder architectures. A realistic comparison depends on the overall objective, e.g. lowest energy, small area, or best spectral efficiency, and on additional constraints, like minimum supported bandwidth. While this is out of scope here, we think that our data outlines the run-time adaptability of the MCMC-based MIMO detection architecture. It also shows that it performs particularly well in iterative receivers and could therefore be a reasonable candidate to consider in the design of such a system.
Conclusions and Outlook
We have presented synthesis and layout results of the proposed MCMC detector architecture. The area reduction of up to 52 % and the shorter clock period by up to 40 % indicate that the proposed architectural modifications to the reference design are effective. Our extensive data set for the communications performance further highlights the available tradeoff between signal-to-noise ratio and architecture throughput. With its run-time adaptability covering a large design space, our detector is able to cope with a wide range of channel conditions at the appropriate effort. Although it is a stochastic detector, its completely deterministic run-time eases scheduling at the system level, i.e. inside a complex iterative receiver.
Still, the architecture suffers from a relatively low but deterministic throughput, which stems from the MCMC detection method itself. The main advantage appears to be its simple scalability through N p and configurability through N t and K. This allows the architecture to cover a large design space. Practically, only the availability of sufficient data might limit the architectural parallelism.
As future work, we plan to correlate algorithmic performance with energy consumption, which might reveal another tradeoff capability of the proposed design.
