A Study on Multi-User MIMO Wireless Communication Systems

Author

Other title

Year of degree conferral

Degree number

URL

http://hdl.handle.net/10228/00006318

Tran Thi Thao Nguyen

# Contents

1 Introduction
   1.1 Background
   1.2 Research Objectives
   1.3 Thesis Hierarchy

2 Multi-User MIMO Wireless System Overview
   2.1 Overview
   2.2 Multi-User Protocol
   2.3 Multi-User Transmission System
       2.3.1 Channel Emulator
       2.3.2 IDMA System
   2.4 Summary

3 Multi-User MIMO Channel Emulator with Automatic Sounding Feedback
   3.1 Introduction
   3.2 MU-MIMO Channel Model
       3.2.1 General MU-MIMO Channel Model
       3.2.2 Statistical Model
       3.2.3 Feedback Delay
   3.3 Hardware Platform Implementation
       3.3.1 Design of Functional Blocks
       3.3.2 Gaussian Random Number Generator
       3.3.3 Doppler Filter
       3.3.4 Spatial Correlation Block
       3.3.5 Rician Fading Block
       3.3.6 FPGA Implementation
   3.4 Measurement Results
       3.4.1 Statistical Verification
       3.4.2 Feedback Delay Verification
       3.4.3 Platform Verification
   3.5 Synthesis Results of Proposed Channel Emulator
   3.6 Summary

4 Higher Order QAM Modulation for Uplink MU-MIMO IDMA Architecture
   4.1 Introduction
   4.2 System Overview
   4.3 Iterative Chip-By-Chip Receiver
       4.3.1 Elementary Signal Estimator
       4.3.2 Extrinsic LLR Calculation
       4.3.3 Interleaver
       4.3.4 Antenna Diversity
       4.3.5 Soft Mapper
   4.4 Simulation Results of QAM IDMA System
   4.5 Complexity Comparison between SCM and QAM Modulation
   4.6 Summary

5 Interleaved Domain Interference Canceller for Low Latency IDMA System
   5.1 Introduction
   5.2 Latency Analysis
   5.3 Proposed Interleaved Domain Architecture
   5.4 Implementation of Proposed Architecture
       5.4.1 Conventional Architecture
       5.4.2 Proposed Architecture
   5.5 FPGA Implementation Results of Interleaved Domain IDMA Receiver
       5.5.1 Simulation Results of Interleaved Domain IDMA Receiver

# List of Tables

3.1 Channel Emulator Specification
3.2 Simulation Parameters
3.3 Platform Verification Parameters
3.4 Synthesis Result of Feedforward Channel vs. Feedforward and Feedback Channel
4.1 Simulation Parameters of Higher Order QAM IDMA System
4.2 Complexity Comparison between SCM and QAM Modulation
5.1 Summary of Latency
5.2 Input/Output Port Parameters
5.3 Simulation Parameters
5.4 Comparison of Architectures
5.5 Synthesis Comparisons
5.6 Synthesis Results (Xilinx Virtex 6 240TFF784)

# List of Figures

1.1 Multi-user transmission for a dense network
1.2 Standard development
1.3 Thesis hierarchy

2.1 MU transmission
2.2 UL-MU MAC Protocol in IEEE802.11ax
2.3 MU communication systems
2.4 Channel sounding procedure
2.5 IDMA transceiver with $N$ users

3.1 MIMO fading coefficient generator structure
3.2 MU-MIMO channel emulator
3.3 CSI feedback protocol
3.4 Feedback mechanism in conventional channel emulator platform [20]
3.5 Feedback mechanism in proposed channel emulator platform
3.6 Flexible feedback delay adjustment
3.7 MIMO fading coefficient generator structure
3.8 Single path processing
3.9 AWGN generator
3.10 Doppler filter block
3.11 IEEE 802.11ac evaluation platform
3.12 Channel spectrum for 4x4 model D TGac
3.13 Channel capacity for 4x4 model D TGac
3.14 Snapshot of the feedback channel output
3.15 BER performance of IEEE 802.11ac system
3.16 Overview of the MU beamforming process
3.17 Platform implementation of MU beamforming process
3.18 EVM and constellation of the proposed system

4.1 Transceiver IDMA system with $N$ users in one antenna ($k=1$)
4.2 16-QAM constellation in IDMA system
4.3 Mapping table of higher order QAM modulation
4.4 IDMA system with antenna diversity
4.5 Multiuser detection algorithm
4.6 Performance of SCM-QPSK and 16-QAM modulation with one antenna
4.7 Performance of higher order QAM modulation with two antennas
4.8 Performance in mixed modulation for IDMA system

5.1 Conventional architecture of IDMA receiver
5.2 Proposed architecture of IDMA receiver
5.3 Flow chart of the conventional architecture
5.4 Flow chart of the proposed architecture
5.5 Architecture of the proposed interleaved domain architecture using dual-port RAM
5.6 Timing chart of the proposed architecture
5.7 BER performance of the proposed system vs SNR
5.8 Latency of the IDMA system vs iteration
5.9 Latency evaluations of the conventional architecture and the proposed architecture

A.1 MU-MIMO channel emulator for 4x4 antenna and 35 taps
A.2 MU-MIMO channel emulator with sounding feedback
A.3 MU-MIMO channel emulator evaluation by using oscilloscope
A.4 Spatial correlation block of MU-MIMO channel emulator
A.5 Rician block of MU-MIMO channel emulator

# Abbreviations

<table>
<thead>
<tr>
<th>Abbreviation</th>
<th>Full Form</th>
</tr>
</thead>
<tbody>
<tr>
<td>5G</td>
<td>5th Generation</td>
</tr>
<tr>
<td>ADC</td>
<td>Analog-to-Digital Converter</td>
</tr>
<tr>
<td>AP</td>
<td>Access Point</td>
</tr>
<tr>
<td>APP</td>
<td>A Posteriori Probability</td>
</tr>
<tr>
<td>AWGN</td>
<td>Additive White Gaussian Noise</td>
</tr>
<tr>
<td>BER</td>
<td>Bit Error Rate</td>
</tr>
<tr>
<td>BICM</td>
<td>Bit-Interleaved Coded Modulation</td>
</tr>
<tr>
<td>BPSK</td>
<td>Binary Phase Shift Keying</td>
</tr>
<tr>
<td>CDMA</td>
<td>Code Division Multiple Access</td>
</tr>
<tr>
<td>CSI</td>
<td>Channel State Information</td>
</tr>
<tr>
<td>CSMA/CA</td>
<td>Carrier Sense Multiple Access with Collision Avoidance</td>
</tr>
<tr>
<td>DAC</td>
<td>Digital-to-Analog Converter</td>
</tr>
<tr>
<td>DL</td>
<td>Downlink</td>
</tr>
<tr>
<td>ESE</td>
<td>Elementary Signal Estimator</td>
</tr>
<tr>
<td>FDMA</td>
<td>Frequency Division Multiple Access</td>
</tr>
<tr>
<td>FEC</td>
<td>Forward Error Correction</td>
</tr>
<tr>
<td>FFT</td>
<td>Fast Fourier Transform</td>
</tr>
<tr>
<td>FPGA</td>
<td>Field Programmable Gate Array</td>
</tr>
<tr>
<td>ICI</td>
<td>Inter Carrier Interference</td>
</tr>
<tr>
<td>IDMA</td>
<td>Interleave Division Multiple Access</td>
</tr>
<tr>
<td>ISI</td>
<td>Inter-Symbol Interference</td>
</tr>
<tr>
<td>LLR</td>
<td>Log-Likelihood Ratio</td>
</tr>
<tr>
<td>LOS</td>
<td>Line Of Sight</td>
</tr>
<tr>
<td>LPF</td>
<td>Low Pass Filter</td>
</tr>
<tr>
<td>LTE</td>
<td>Long Term Evolution</td>
</tr>
<tr>
<td>LUT</td>
<td>Look Up Table</td>
</tr>
<tr>
<td>MAC</td>
<td>Media Access Control</td>
</tr>
<tr>
<td>MRC</td>
<td>Maximal Ratio Combining</td>
</tr>
<tr>
<td>MU</td>
<td>Multi-User</td>
</tr>
<tr>
<td>MU-BF</td>
<td>Multi-User Beamforming</td>
</tr>
<tr>
<td>MUD</td>
<td>Multi-User Detection</td>
</tr>
<tr>
<td>MU-MIMO</td>
<td>Multi-User Multi-Input Multi-Output</td>
</tr>
<tr>
<td>NDP</td>
<td>Null Data Packet</td>
</tr>
<tr>
<td>NDPA</td>
<td>Null Data Packet Announcement</td>
</tr>
<tr>
<td>NLOS</td>
<td>Non-Line Of Sight</td>
</tr>
<tr>
<td>NOMA</td>
<td>Non-Orthogonal Multiple Access</td>
</tr>
<tr>
<td>OFDMA</td>
<td>Orthogonal Frequency Division Multiple Access</td>
</tr>
<tr>
<td>OMA</td>
<td>Orthogonal Multiple Access</td>
</tr>
<tr>
<td>PDP</td>
<td>Power Delay Profile</td>
</tr>
<tr>
<td>PHY</td>
<td>Physical</td>
</tr>
<tr>
<td>PSD</td>
<td>Power Spectral Density</td>
</tr>
<tr>
<td>PSDU</td>
<td>Physical Layer Service Data Unit</td>
</tr>
<tr>
<td>QAM</td>
<td>Quadrature Amplitude Modulation</td>
</tr>
<tr>
<td>QPSK</td>
<td>Quadrature Phase Shift Keying</td>
</tr>
<tr>
<td>RAM</td>
<td>Random-Access Memory</td>
</tr>
<tr>
<td>RX</td>
<td>Receiver</td>
</tr>
<tr>
<td>SCM</td>
<td>Superposition Coded Modulation</td>
</tr>
<tr>
<td>SIFS</td>
<td>Short Interframe Space</td>
</tr>
<tr>
<td>SMC</td>
<td>Simulink Model Compiler</td>
</tr>
<tr>
<td>SOC</td>
<td>System On Chip</td>
</tr>
<tr>
<td>STA</td>
<td>Station</td>
</tr>
<tr>
<td>SU</td>
<td>Single-User</td>
</tr>
<tr>
<td>TDMA</td>
<td>Time Division Multiple Access</td>
</tr>
<tr>
<td>TF-R</td>
<td>Trigger Frame for Random Access</td>
</tr>
<tr>
<td>TGac</td>
<td>Task Group ac</td>
</tr>
<tr>
<td>TX</td>
<td>Transmitter</td>
</tr>
<tr>
<td>UL</td>
<td>Uplink</td>
</tr>
<tr>
<td>URNG</td>
<td>Uniform Random Number Generator</td>
</tr>
<tr>
<td>VHT</td>
<td>Very High Throughput</td>
</tr>
</tbody>
</table>

# Symbols

\( N \) Number of users
\( H \) Channel coefficient matrix
\( L \) Number of multipath components
\( t \) Number of time slots
\( M \) Number of transmitter antennas
\( R \) Number of receiver antennas
\( R \) Channel correlation
\( S(f) \) Doppler power spectrum
\( f_d \) Doppler frequency
\( T_d \) Feedback delay duration
\( S_{amp} \) Sampling rate
\( f_{serial} \) Serial processing frequency
\( Chan_{Forward} \) Number of feedforward channel coefficients
\( Num_{PDPtaps} \) Number of PDP taps
\( Chan_{Coeff} \) Number of feedforward and feedback channel coefficients
\( f_{MAX_{uniform}} \) Maximum frequency with uniform random generators
\( U \) Number of uniform random generators added
\( a_0 \) Denominator coefficients
\( b_0 \) Numerator coefficients
\( f_s \) Normalizing frequency
\( H_{iid} \) Independent and identically distributed (i.i.d.) matrix
\( C \) Cholesky decomposition matrix
\( P \) Overall power of channel
\( K \) Rician K-factor
\( H_{LOS} \) LOS matrix
\( H_{Rayleigh} \) Rayleigh matrix
\( x_n \) Transmitted signal of the \( n \)-th user
\( d_n \) Data length of \( n \)-th user
\( c_n \) Chip sequence of \( n \)-th user
\( x_{n,k} \) Symbol sequence of \( n \)-th user and \( k \)-th antenna
\( J \) Frame length
\( K \) Number of transmitter antennas for each user
\( r_k \) Received signal
\( x_{n,k}^{\text{Real}} \) Real part of symbol sequence
\( x_{n,k}^{\text{Img}} \) Imaginary part of symbol sequence
\( a_k \) Complex zero mean AWGN with variance \( \sigma^2 \)
\( y_k \) Received signal after OFDM demodulation
\( \zeta_{n,k} \) Sum of interference from other users and AWGN noise
\( H_{n,k}^* \) Conjugate of \( H_{n,k}(j) \)
\( \overline{y}_{n,k} \) Received signal with the conjugate
\( \overline{\zeta}_{n,k} \) Sum of interference from other users and AWGN noise with the conjugate
\( \lambda(x_{n,k}) \) Output of ESE processing
\( E(\zeta_{n,k}) \) Mean of the interference
\( E(y_k) \) Mean of the received signal
\( E(x_{n,k}) \) Mean of the transmitted signal
\( \text{Var}(\zeta_{n,k}) \) Variance of the interference
\( \text{Var}(\zeta_{n,k}) \) Variance of the interference without the conjugate
\( \text{Var}(y_k) \) Variance of the received signal
\( \text{Var}(x_{n,k}) \) Variance of the transmitted signal
\( \hat{g}_{n,k} \) Estimated symbol
\( \hat{b}_{n,k}^{\text{Real}} \) Estimated bit in real part
\( \hat{b}_{n,k}^{\text{Img}} \) Estimated bit in imaginary part
\( \hat{c}_{n,k} \) Estimated chip sequence
\( v \) Half of the number of bits per symbol
\( \alpha \) A point in the constellation diagram
\( \pi_{n}^{-1} \) Deinterleaving for the \( n \)-th user
\( \pi_n \) Interleaving for the \( n \)-th user
\( \tilde{a}_{n,k} \) Despread output
\( \tilde{c}_{n,k} \) Spread output
\( e_{n,k}(j) \) Extrinsic LLRs
\( N_c \) Number of sub-carriers
\( Ctrl \) Sum of the soft mapper delay and the ESE delay
\( SP \) Spreading length
\( I \) Number of interference cancellation iterations
\( ID \) Index number of RAM
\( w_{ena} \) Write enable of RAM
\( N_b \) Number of data bits
\( W_d \) Bit length in fixed-point operation
\( F \) Clock frequency

# Summary

In recent years, Multi-User Multi-Input Multi-Output (MU-MIMO) transmission has become a very important technique for improving the efficiency of wireless communication systems. MU-MIMO transmission allows multiple users to communicate simultaneously, which enhances system performance. For this reason, MU-MIMO has been incorporated into the current generation of wireless standards.

Current MU-MIMO transmission schemes employ orthogonality in one way or another. For example, Space-Division Multiple Access (SDMA), introduced in 802.11ac, avoids interference by applying a spatial precoding matrix before transmission. Orthogonal Frequency Division Multiple Access (OFDMA), on the other hand, avoids interference by scheduling users in separate frequency resource units. The next generation of MU-MIMO transmission is expected to work in a completely non-orthogonal way, which further increases the system throughput because the control packets needed for user orthogonalization are no longer required.

Non-orthogonal multiple access (NOMA) has been proposed for Long Term Evolution (LTE) and is envisioned to be an essential component of the 5th Generation (5G) mobile network. Interleave Division Multiple Access (IDMA) is one of the NOMA techniques that can support multiple access for a large number of users in the same bandwidth. IDMA has several other advantages over multiple access schemes such as OFDMA and Code Division Multiple Access (CDMA), including higher spectral efficiency and insensitivity to clipping distortion. However, conventional IDMA has some problems that must be considered, namely latency and hardware complexity. In addition, the theoretical improvements of IDMA are still unverified in practice, so experimental tests are needed to verify that all parts of the system work properly.

This thesis presents contributions that make IDMA systems applicable to future MU-MIMO communication systems.

- First, we present an MU-MIMO channel emulator that is indispensable not only for testing the ideas proposed in this thesis regarding MU-MIMO transmission but also for the experimental validation of current wireless communication systems.

- Second, we propose a novel interleaved domain IDMA architecture applicable to current wireless communication standards. The proposed architecture reduces the latency of interference cancellation by half, which doubles the throughput.

- In addition, to further improve the throughput of the proposed IDMA system while keeping the receiver complexity low, we propose the use of higher order quadrature amplitude modulation (QAM), which increases the throughput by simply changing the Log-Likelihood Ratio (LLR) calculation, without increasing the number of parallel IDMA cancellation processing chains.
Chapter 1

Introduction

1.1 Background

In high-density wireless local area network (WLAN) environments, in which many users are present in a specific area, the collision probability of data transmission is high. As a result, the effective system throughput is severely degraded by collisions among the stations that access the wireless channel simultaneously. In Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA), transmission by hidden nodes causes severe interference, i.e., collisions, with an ongoing transmission [3]. Wireless multiple access techniques that support a large number of users are being considered to address these problems. There have been significant advances in multi-user (MU) techniques for wireless communication over the last ten years. Fig. 1.1 shows the volume of public WLAN users from 2011 to 2016. As shown in the figure, the ever-increasing number of users can only be supported by an efficient MU transmission system.

MU transmission techniques can be distinguished by the resource they use to separate users: frequency, time, code, or power. These MU techniques are now being introduced in several new-generation wireless standards (e.g., the fifth generation (5G) [1] and 802.11ax [2]), as shown in Fig. 1.2. Next-generation systems require high data rates, low latency and low complexity. Furthermore, there is growing concern about user fairness. From a system point of view, customers who pay the same charges for the same service expect the same quality of service (QoS). Future standards therefore also need to focus more on fairness to satisfy customers.

To satisfy these requirements, enhanced technologies are needed. Among the potential candidates, non-orthogonal multiple access (NOMA) is a key technology for enhancing the performance of next-generation wireless communications. Orthogonal frequency division multiple access (OFDMA) is a well-known high-capacity orthogonal multiple access (OMA) technique, whereas NOMA offers a set of desirable benefits, including greater spectral efficiency and the ability to support a large number of users. There are different types of NOMA techniques, including power-domain and code-domain NOMA. In power-domain NOMA, multiple users are superimposed with different power gains, which causes a problem of user unfairness. Interleave Division Multiple Access (IDMA) is one of the code-domain NOMA techniques. IDMA is a special form of Code Division Multiple Access (CDMA) in which the receiver differentiates each station (STA) by its unique interleaving pattern instead of a unique spreading code. Compared to OFDMA and power-domain NOMA, IDMA allows multiple users to transmit at the same time and frequency without the strict requirement of different frequencies or powers. Because of these advantages, this thesis studies how to improve current IDMA transceiver systems and how to make them suitable for practical implementation.

To apply enhanced systems to future standards, a wireless channel emulator is important for testing them. The wireless channel dictates the transmitter architecture, the transmission rate, and the receiver architecture. In MU wireless communication, the transmitted signals are attenuated by fading due to multipath propagation and by shadowing due to large obstacles in the signal path, which poses a fundamental challenge for reliable communication. This thesis focuses on the field programmable gate array (FPGA) implementation of an MU communication system, so an MU channel emulator is indispensable. The thesis proposes an MU multi-input multi-output (MU-MIMO) channel emulator with automatic sounding feedback. The feedback channel coefficients are separated from the feedforward channel coefficients by a programmable time duration. This programmability allows a thorough evaluation of the Doppler effect in MU transmission.

Previous studies of IDMA systems [4]-[7] suggested the use of BPSK and QPSK modulation. The purpose of this thesis is to improve the spectral efficiency of IDMA systems by proposing a low-complexity higher order quadrature amplitude modulation (QAM) scheme for IDMA.

The main problem that needs to be addressed in designing an IDMA system is the latency caused by the interleaving process. In the interleavers proposed in the published literature, both the interleaving and deinterleaving operations permute sequences serially, which takes many hardware clock periods and leads to high processing latency and low processing throughput. This has been the bottleneck of the system throughput, especially when the number of iterations is large. Since the interference cancellation improves performance by updating the extrinsic log-likelihood ratios (LLRs) from the previous LLR values, the iterations cannot be processed in parallel to speed up the cancellation, and reducing the latency of each iteration therefore has a significant effect. The latency is particularly important because it has to meet a strict requirement. For example, recent 802.11 systems define a short interframe space (SIFS) of 16 μs, within which a wireless interface must process a received frame and respond with a response frame. In a practical IDMA system, however, each iteration of the interference cancellation consists of an interleaving and a deinterleaving process, which makes the latency much higher than the defined SIFS. This problem hinders the development of IDMA systems in practice. The thesis proposes a novel architecture for IDMA systems. The architecture calculates the updated extrinsic LLRs to detect multiple users in the interleaved domain, without a deinterleaving step in each iteration of the interference canceller. As a result, the proposed architecture increases the throughput by almost a factor of two and reduces the latency by almost half without increasing the complexity, which makes IDMA more feasible for practical implementation.
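The latency argument above can be made concrete with a small back-of-the-envelope calculation. The sketch below is only an illustrative model and not the latency analysis of Chapter 5. It assumes that a serial permutation pass handles one chip per clock cycle, that a conventional iteration needs two such passes (deinterleave and interleave) while an interleaved domain iteration needs one, and it ignores all other delays. The function name and the parameter values are hypothetical; `J`, `I` and `F` follow the symbol list (frame length, number of iterations, clock frequency).

```python
SIFS_US = 16.0  # short interframe space quoted in the text, in microseconds


def cancellation_latency_us(frame_length, iterations, clock_hz, passes_per_iteration):
    """Latency of the serial permutation passes only, at one chip per clock cycle."""
    cycles = frame_length * passes_per_iteration * iterations
    return cycles / clock_hz * 1e6


# Hypothetical parameters, chosen only to make the arithmetic concrete.
J = 512      # chips per frame
I = 2        # interference cancellation iterations
F = 100e6    # clock frequency in Hz

conventional = cancellation_latency_us(J, I, F, passes_per_iteration=2)
interleaved_domain = cancellation_latency_us(J, I, F, passes_per_iteration=1)

print(f"conventional:       {conventional:.2f} us")
print(f"interleaved domain: {interleaved_domain:.2f} us")
print(f"SIFS budget:        {SIFS_US:.2f} us")
```

Under these assumptions the permutation latency scales linearly with both the frame length and the number of iterations, and removing one pass per iteration halves it exactly. A real receiver also spends cycles in the other processing stages, so the reduction there is "almost half" rather than exactly half.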

With these contributions, the implementation of an MU communication system such as IDMA becomes possible for future wireless systems.

1.2 Research Objectives

The target of this thesis is to make IDMA systems applicable to future wireless standards. To this end, the following objectives have to be satisfied:

- An implementation of an MU-MIMO channel emulator for testing not only the IDMA system but also current MU wireless systems.

- A low-complexity and high-throughput IDMA system.

- A low-latency IDMA system that can meet the requirements of future wireless standards.

The MU-MIMO channel emulator is designed to send channel feedback automatically to the access point, generated from the channel coefficients after a programmable time duration. This function is used for MU beamforming features such as those of IEEE 802.11ac. A low-complexity design of the MIMO channel emulator, with a single-path implementation for all MIMO channel taps, is also considered. The single-path design allows all elements of the MIMO channel matrix to share one Gaussian noise generator, Doppler filter, spatial correlation block and Rician fading emulator, which minimizes the hardware complexity. In addition, the single-path implementation allows the feedback channel output to be added with only a few additional non-sequential elements, whereas a parallel implementation would double the hardware.
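The processing chain named above (Gaussian noise, spatial correlation, Rician fading) can be sketched in software for a single channel tap. This is a minimal illustration under stated assumptions, not the hardware design of Chapter 3. It assumes the commonly used form vec(H) = sqrt(P) (sqrt(K/(K+1)) vec(H_LOS) + sqrt(1/(K+1)) C vec(H_iid)), where C is the Cholesky factor of a real correlation matrix. The Doppler filter is omitted, and the correlation matrix, LOS vector and function names are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular factor of a real symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def complex_gaussian(rng):
    """Zero-mean, unit-variance circularly symmetric complex Gaussian sample."""
    return complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) / math.sqrt(2.0)


def rician_tap(corr, h_los, k_factor, power, rng):
    """One vectorized channel tap with spatial correlation and a LOS component."""
    n = len(corr)
    c = cholesky(corr)
    h_iid = [complex_gaussian(rng) for _ in range(n)]
    # Spatial correlation: multiply the i.i.d. vector by the Cholesky factor.
    h_rayleigh = [sum(c[i][k] * h_iid[k] for k in range(i + 1)) for i in range(n)]
    los_weight = math.sqrt(k_factor / (k_factor + 1.0))
    nlos_weight = math.sqrt(1.0 / (k_factor + 1.0))
    return [
        math.sqrt(power) * (los_weight * h_los[i] + nlos_weight * h_rayleigh[i])
        for i in range(n)
    ]


# Hypothetical 2x2 link, stacked into a vector of 4 coefficients.
rng = random.Random(1)
corr = [[0.5 ** abs(i - j) for j in range(4)] for i in range(4)]
h_los = [1 + 0j] * 4
tap = rician_tap(corr, h_los, k_factor=3.0, power=1.0, rng=rng)
print([round(abs(h), 3) for h in tap])
```

Processing the coefficients one after another through the same functions mirrors the single-path idea: every matrix element reuses the same generator and the same correlation and fading stages instead of owning a dedicated copy.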

Previous works proposed systems based on Superposition Coded Modulation (SCM), in which multiple layers of BPSK or QPSK modulated symbols are transmitted simultaneously to achieve spectrally efficient transmission for IDMA. However, this method has very high complexity because of the large number of streams that must be separated in the multi-user detection at the receiver. Instead of SCM, this thesis employs QAM up to 256-QAM for spectrally efficient transmission. The thesis presents a receiver architecture that uses a soft demapper, which significantly decreases the receiver detection complexity. Although the maximum number of users that the proposed system can accommodate is slightly lower than in the conventional system, the proposed system is much better suited to modern multi-mode transceivers. In addition, it requires only about 25% of the complexity of SCM-QPSK.

One of the problems in the hardware implementation of IDMA is its high latency due to iterative processing. The thesis proposes a novel IDMA receiver architecture with low latency and low complexity. The results show that, compared to the conventional architecture, the proposed architecture reduces the latency by about half and approximately doubles the throughput.

1.3 Thesis Hierarchy

Fig. 1.3 shows the hierarchy of this thesis. The thesis has six chapters. This first chapter is the introduction. The remaining chapters are as follows:

**Chapter 2. Multi-User Wireless System Overview**

This chapter gives a general introduction to MU wireless communication systems. It briefly introduces the current multiple access techniques and then points out the advantages of IDMA systems, such as high spectral efficiency and user fairness. An overview of the IDMA system and of the MIMO channel emulator used for testing is also given in this chapter.

**Chapter 3. Multi-User Channel Emulator System with Automatic Sounding Feedback**

This chapter focuses on the channel emulator for MU wireless systems and the automatic sounding feedback channel. First, the thesis describes the MU-MIMO wireless channel emulator and the feedback delay. Then, it presents the hardware implementation of the proposed channel emulator and the measurement results.

**Chapter 4. Higher Order QAM Modulation for Uplink MU IDMA Architecture**

This chapter presents the proposed higher order modulation IDMA system, which includes iterative multi-user detection with a simplified soft bit computation. A complexity comparison and simulation results for the QAM-IDMA system and the superposition coded modulation IDMA system are shown to clarify the effectiveness of the proposed QAM-IDMA system.

**Chapter 5. Interleaved Domain Interference Canceller for Low Latency IDMA System**

This chapter describes the proposed interleaved domain architecture, which reduces the latency to almost half, effectively doubling the throughput, with almost the same hardware utilization. The details of the implementation of the proposed architecture and its results are also shown in this chapter.

**Chapter 6. Conclusion and Future Work**

This chapter summarizes the whole work and the results achieved. It also discusses possible research directions for future work to improve MU wireless communication systems.
Chapter 2

Multi-User MIMO Wireless System Overview

2.1 Overview

Multi-user transmission is a radio transmission scheme that allows several stations to transmit at the same time. There are several specific multiple access techniques designed to share the channel among several users, such as Time Division Multiple Access (TDMA), Frequency Division Multiple Access (FDMA), CDMA and OFDMA. We separate these multiple access techniques into orthogonal multiple access (OMA), such as FDMA and OFDMA, and non-orthogonal multiple access (NOMA), such as CDMA and IDMA. In OMA, wireless users compete with each other for the frequency resource to transmit their information. If the concurrent access of several users cannot be controlled, collisions can occur. Since collisions are undesirable for connection-oriented communication such as mobile phones, personal/mobile users need to be allocated dedicated channels on request. A main issue with OMA techniques such as OFDMA is that their spectral efficiency is low when some bandwidth resources are allocated to users with poor channel state information (CSI). With NOMA, on the other hand, each user has access to all the subcarrier channels, so the bandwidth resources allocated to users with poor CSI can still be accessed by users with strong CSI, which significantly improves the spectral efficiency.

MU transmission is divided into uplink (UL) (many-to-one) and downlink (DL) (one-to-many) transmission, as shown in Fig. 2.1. Our main emphasis is on UL communication, in which multiple users simultaneously communicate with a single receiver such as an access point (AP). In UL transmission, the IDMA technique allows all users to spread their signals across the entire bandwidth, as in a CDMA system. However, rather than using unique spreading codes to decode each user while treating the interference from other users as noise, the receiver differentiates each STA by its unique interleaving pattern. This leads to a low-complexity receiver whose complexity grows linearly with the number of parallel stations (STAs) supported [10].
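The role of the user-specific interleaver can be illustrated with a toy uplink. The sketch below is a simplified illustration, not the transceiver of this thesis. It assumes BPSK, a repetition spreader with an alternating +1/-1 mask shared by all users, seeded random interleavers, and a noiseless channel with unit gain. All function names and parameters are hypothetical, and the iterative chip-by-chip interference cancellation is not shown.

```python
import random


def make_interleaver(length, seed):
    """User-specific random permutation of chip indices."""
    perm = list(range(length))
    random.Random(seed).shuffle(perm)
    return perm


def spread(bits, sp):
    """Map bits to +1/-1 and repeat each one sp times with an alternating mask."""
    chips = []
    for b in bits:
        symbol = 1 - 2 * b  # bit 0 -> +1, bit 1 -> -1
        chips.extend(symbol * (1 if i % 2 == 0 else -1) for i in range(sp))
    return chips


def interleave(chips, perm):
    return [chips[p] for p in perm]


def deinterleave(chips, perm):
    out = [0] * len(chips)
    for i, p in enumerate(perm):
        out[p] = chips[i]
    return out


def despread(chips, sp):
    """Remove the mask and sum the chips that belong to each bit."""
    return [
        sum(chips[start + i] * (1 if i % 2 == 0 else -1) for i in range(sp))
        for start in range(0, len(chips), sp)
    ]


num_users, num_bits, sp = 3, 8, 16
rng = random.Random(0)
user_bits = [[rng.randint(0, 1) for _ in range(num_bits)] for _ in range(num_users)]
perms = [make_interleaver(num_bits * sp, seed=100 + n) for n in range(num_users)]

# Every user applies the same spreader; only the interleaver differs.
tx = [interleave(spread(user_bits[n], sp), perms[n]) for n in range(num_users)]
received = [sum(column) for column in zip(*tx)]  # noiseless superposition at the AP

# Single-pass soft estimate for user 0: the other users look like scrambled chips.
soft_user0 = despread(deinterleave(received, perms[0]), sp)
print(soft_user0)
```

Because the spreader is common to all users, the interleaver is the only signature that separates them. After deinterleaving with one user's pattern, that user's chips line up again while the chips of the other users remain scrambled, which is what the iterative receiver exploits.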

In testing an MU system, experimental tests using actual wireless transmission are very important to ensure that all parts of the system work properly. However, because of various factors such as government restrictions and logistical problems, experimental tests over the wireless medium often cannot be performed. In this case, a wireless channel emulator is indispensable. While the research works in the literature [8],[9] all support single-user (SU) transmission, an MU channel emulator is needed for MU transmission.

2.2 Multi-User Protocol

MU techniques have been applied and proposed for current and future wireless communication systems. After the 802.11ac standard was ratified a few years ago, the downlink MU-MIMO system has become a very promising option for improving WLAN spectral efficiency [11]. Uplink MU is supported in 802.11ax [12]. Fig. 2.2 shows a simple example of UL-MU access in 802.11ax. In this protocol, the transmission timing of each station (STA) is centrally controlled by the AP. To convey the control information needed for UL-MU transmission to the users, the AP transmits a control frame called the Trigger Frame for Random Access (TF-R). Each user performs OFDMA random access according to the control information provided by the AP. Users who obtain a transmission opportunity send a frame to the AP. The AP responds according to the condition of the received UL-MU frames. This sequence is repeated every trigger interval. To run the UL-MU Media Access Control (MAC) protocol, UL-MU physical (PHY) transmission must first be supported. IEEE 802.11ax adopts an uplink OFDMA random access scheme. However, OFDMA techniques suffer from spectral inefficiency and high complexity in user scheduling. Therefore, NOMA techniques are a promising technology for future wireless systems such as 5G [1]. IDMA is one of the NOMA techniques and thus shares the advantages of NOMA in spectral efficiency and user fairness.

2.3 Multi-User Transmission System

The MU communication system includes the transmitter and the receiver, which are connected by the channel as shown in Fig. 2.3. The transmitted signal is affected by channel fading and by thermal noise caused by electronic devices.

2.3.1 Channel Emulator

The performance of a wireless system depends on the channel over which the signal travels from the transmitter to the receiver. Unlike stable and predictable wired channels, radio channels are random and not easy to analyze. Signals transmitted via radio channels are obstructed by buildings, mountains and trees, and are then reflected and scattered. These phenomena are referred to as fading. As a result, many different versions of the transmitted signal are collected at the receiver. Fading affects the quality of radio communication systems. Hence, a channel emulator is very important to ensure that all parts of the system work properly.

MU-MIMO is a set of multiple-input and multiple-output technologies for wireless communications, in which a set of users or wireless terminals, each with one or more antennas, communicate with each other. In contrast, single-user MIMO involves a single multi-antenna transmitter communicating with a single multi-antenna receiver. In the same way that OFDMA adds multiple access capabilities to OFDM, MU-MIMO adds multiple access capabilities to MIMO. The MU-MIMO channel models comprise the Doppler spectrum, the spatial correlation, the Rayleigh fading, the Rician fading, the multipath fading, the path loss and the shadowing. If the line-of-sight (LOS) signal is much stronger than the others, Rician fading occurs. If there are multiple scatterers and no LOS signal, Rayleigh fading occurs. MU-MIMO techniques can be adapted to both indoor and outdoor environments, with channel models such as those of 5G, WiMAX or 802.11ac systems. In 802.11ac, channel models A, B, C, D and E cover indoor environments, while model F covers both indoor and outdoor environments. In an indoor environment, the channel is not as easily affected by rough path loss exponents. Although delay spreads are often much smaller than in outdoor environments, indoor systems often have to achieve very high data rates. In the MU-MIMO channel emulator, although the channel parameters differ between standards, the coefficient generator is the same.

MU transmission in 802.11ac systems enables the access point (AP) to send signals simultaneously to all stations (STAs) without interference. This is possible by calculating an MU beamforming (MU-BF) matrix from a priori knowledge of each STA's channel state information (CSI). To evaluate the MU-BF performance, the transmitter media access control (MAC) must perform a channel sounding procedure, as shown in Fig. 2.4, for all the receiving STAs. After receiving the feedback from each of the STAs, the transmitter computes an MU-BF matrix to be used for the MU-MIMO transmission. The performance of the system changes with the duration between the time when the STAs compute their channel feedback and the time when the AP performs the MU transmission, because the channel evolves in the meantime [14]. The channel feedback therefore plays an important role in MU transmission.

2.3.2 IDMA System

The focus of this thesis is on uplink MU transmission for the IDMA system, since it can increase the performance of future wireless systems. The IDMA system differs from the CDMA system in that users are separated by interleaving patterns instead of spreading codes. In the IDMA system, the spreading code is used as a repetition code. Therefore, the bandwidth expansion is fully exploited for forward error correction coding, which typically results in a very low rate code compared with the CDMA system. For the same spreading length, the IDMA system can support more users than the CDMA system, because in IDMA the spreading length can be smaller than the number of users. Another advantage of the IDMA system is its insensitivity to clipping distortion compared with the CDMA system. However, the main advantage of the IDMA system is the low complexity of the receiver. The IDMA system has low cost and superior performance in multi-user detection because it detects the desired signals from interference and noise. The matched filter of the CDMA system has low complexity but poor performance. The MMSE filter of CDMA has moderate performance but high complexity. While the computation cost of the MMSE filter is on the order of $N^2$ for the CDMA system, the computation cost of the interference cancellation is on the order of $N$ for the IDMA system, where $N$ is the number of users. In the IDMA system, the interleaver patterns used by the participating stations (STAs) are pre-generated and stored in both the access point (AP) and the STAs. The specific interleaver used by a client depends on the index assigned to it by the AP during association.
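As a rough illustration of the transmitter side described above, the following Python sketch spreads the bits of each user with a common repetition code and separates the users only by their interleaver patterns. The function names, the seeding of each pattern by the STA index and the spreading length of 4 are assumptions made for this example and are not taken from the thesis implementation.

```python
import random

def make_interleaver(sta_index, length):
    """Pre-generate a pseudo-random permutation for one STA.
    Seeding by the STA index is a hypothetical choice for this sketch."""
    rng = random.Random(sta_index)
    pattern = list(range(length))
    rng.shuffle(pattern)
    return pattern

def spread(bits, spreading_length):
    """Repetition spreading: every bit is repeated spreading_length times."""
    return [b for b in bits for _ in range(spreading_length)]

def interleave(chips, pattern):
    """Output position i carries the chip found at input position pattern[i]."""
    return [chips[p] for p in pattern]

def deinterleave(chips, pattern):
    """Inverse of interleave() for the same pattern."""
    out = [0] * len(chips)
    for dst, src in enumerate(pattern):
        out[src] = chips[dst]
    return out

def despread(chips, spreading_length):
    """Hard-decision majority vote over each group of repeated chips."""
    groups = [chips[i:i + spreading_length]
              for i in range(0, len(chips), spreading_length)]
    return [1 if 2 * sum(g) > len(g) else 0 for g in groups]

# Two STAs send the same bits with the same repetition code;
# only the interleaver pattern differs between them.
bits = [1, 0, 1, 1]
S = 4
for sta in (1, 2):
    pattern = make_interleaver(sta, len(bits) * S)
    tx_chips = interleave(spread(bits, S), pattern)
    rx_bits = despread(deinterleave(tx_chips, pattern), S)
    print(sta, rx_bits == bits)
```

The sketch covers only the noiseless single-user round trip; the iterative multi-user detection at the receiver is not modeled here.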

The IDMA receiver includes the interference canceller that performs the multi-user detection. In the IDMA and turbo coding literature, the a posteriori probability (APP) decoder is placed inside the iteration loop because it improves the performance of IDMA systems in iterative decoding. However, since this causes a very high latency in implementation, we adopt a simpler iteration loop in which only the repetition decoder is placed inside the loop [13]. The interference canceller consists of the elementary signal estimator (ESE), the deinterleaver, the despreader, the extrinsic LLR calculation and the soft mapper, as in Fig. 2.5. The extrinsic LLR calculation includes the spreader and the interleaver. The ESE is used as a soft demapper that calculates the LLR of each bit in one symbol. The LLR output of the ESE is deinterleaved with the unique interleaver index of each user. The reordered LLR values are then despread. In the first iteration, the extrinsic information is very inaccurate. The receiver needs more than 4 iterations, even with little actual noise, to obtain an acceptable bit error rate (BER) [15]. If the current iteration is not the last one, the despread LLRs are spread again for the extrinsic LLR calculation, which is based on the difference between the values before and after despreading. That is, each extrinsic value contains the contributions of the other chips of the spreading sequence, excluding the chip itself. The extrinsic LLRs are then interleaved to produce the values for the soft mapping, which updates the mean and variance variables for the ESE processing. In the final iteration, the spreader and the interleaver are not needed. The LLR values from the despreader are decoded by the channel decoder to produce the estimate of the transmitted bits.

2.4 Summary

This chapter has given an overview of the multi-user wireless system. The multi-user protocol has also been presented. The MU communication system includes the transmitter and the receiver, and a channel emulator is needed for testing the system. The thesis focuses on the MU channel emulator and on uplink MU transmission for the IDMA system.
Chapter 3

Multi-User MIMO Channel Emulator with Automatic Sounding Feedback

3.1 Introduction

In this chapter, we focus on the field programmable gate array (FPGA) implementation of MU channel emulators for MU systems. While the research works in the literature [8], [9] all support wireless local area network (WLAN) environments, they are designed for single-user (SU) transmission. After the 802.11ac standard was ratified a few years ago, downlink (DL) multi-user (MU) transmission with multiple-input multiple-output (MIMO) antennas has become a very promising option for improving WLAN system efficiency [11]. Uplink (UL) MU-MIMO is supported in 802.11ax [12]. UL and DL MU schemes can be considered as dual modes. Hence, in this chapter, we only consider the DL MU case, because the DL requires channel state information feedback for beamforming processing, which is not necessary in the UL.

In evaluating the MU transmission performance of a hardware WLAN platform, one hurdle is that timely channel sounding operations must be performed, which requires a working MAC layer. Although channel emulators are commercially available [16], their features do not support the generation of feedback channel coefficients for MU-MIMO systems. A complete MAC and PHY module that can process MAC information elements must therefore be available for MU-BF. However, MAC development in itself takes a lot of time and resources, so that it is done in parallel with the PHY development.

In this chapter, we present the design of an MU-MIMO channel emulator. This MU-MIMO channel emulator can be used for testing any MU system, such as IDMA, OFDMA and MU-MIMO, by changing the parameters in the design. The generated channel coefficients that are convolved with the input transmitted signals are called the feedforward channels. The proposed channel emulator is capable of sending channel feedback automatically from these generated channel coefficients. The feedback channel coefficients are separated from the feedforward channel coefficients by a programmable time duration. In the case of the uplink IDMA system, this channel feedback can be used for the power control of each user. Moreover, in 802.11ac, the feedback channel can be used for downlink MU-MIMO, which needs channel state information to process the MU-BF. The programmable time duration of the feedback channel allows a thorough evaluation of the Doppler effect in MU-BF transmission. Aside from this, the feedback capability of the channel emulator provides the following advantages:

1. Evaluation of MU-BF algorithms without channel estimation error. This is important for non-linear MU-BF algorithms whose performance gain is highly sensitive to the effect of channel estimation.

2. PHY level evaluation of MU-MIMO transmission with very minimal MAC features.

3. Evaluation of the MU-MIMO systems with virtual STAs. Virtual STAs are STAs that are part of the MU-MIMO system, but whose bit error rate (BER) performance is not calculated. This enables the evaluation of any MU-MIMO system configurations even with a limited platform that has room for only one AP and one or few STAs.

The chapter is organized as follows. In Section 3.2, we describe MU-MIMO WLAN channel emulator models and the feedback delay. Hardware platform implementation is shown in Section 3.3. Section 3.4 shows the measurement results. Section 3.5 presents the synthesis results, and Section 3.6 is our summary.
3.2 MU-MIMO Channel Model

The MU-MIMO channel coefficient generator structure is shown in Fig. 3.1. At every time instant, the channel model generates a set of matrix coefficients $H_{1}^{l}, \ldots, H_{N}^{l}$ for STAs 1 to $N$ and paths $l = 1$ to $L$. The aggregate MU-MIMO channel is then defined as $H^{l}(t) = [(H_{1}^{l}(t))^{H}, (H_{2}^{l}(t))^{H}, \ldots, (H_{N}^{l}(t))^{H}]^{H}$ for the $l$-th multipath component at time $t$. While not shown in the model, each of the matrices can have multiple path components following a certain power delay profile (PDP).

3.2.1 General MU-MIMO Channel Model

The MU-MIMO channel models comprise the Doppler spectrum, the spatial correlation, the Rayleigh fading, the Rician fading, the multipath fading, the path loss and the shadowing, as in Fig. 3.2, where $M$ is the number of transmitter antennas and $R$ is the number of receiver antennas. The designed channel emulator can be used for the general MU-MIMO channel model, but here we use the actual values defined in the 802.11ac channel model as an example. Moreover, because the 802.11ac transceiver was completed without the channel [17], the channel emulator can be used to test our 802.11ac transceiver platform.
3.2.2 Statistical Model

The statistics for path delay, Doppler and spatial correlation are based on the values defined in the 802.11ac channel model. These values are the results of many experimental measurements made by the companies that attended the IEEE 802.11ac standardization. The Task Group ac (TGac) channel model [18] produces randomly generated channel matrix coefficients with defined spatial, temporal and spectral statistics. The spatial correlation of the channel matrices, which follows the Kronecker model as assumed since 802.11n, directly affects the channel capacity [19]. This means that the spatial correlation can be expressed as

\[
R^l = \mathrm{E}\left[ \text{vec}(H^l)\, \text{vec}(H^l)^H \right] = R_{TX} \otimes R_{RX}
\]

(3.1)

Equation (3.1) signifies that the channel correlation \( R \) can be estimated independently at the transmitter and the receiver. \( \text{vec}(\cdot) \) denotes the vectorization of a matrix, a linear transformation that converts the matrix into a column vector. Since the spatial correlation is calculated as the Kronecker product of the correlations of the transmitter and the receiver antennas, the vectorization is used to express matrix multiplication as a linear transformation on matrices. \( \mathbf{R}_{TX} \) and \( \mathbf{R}_{RX} \) are the spatial correlation matrices of the transmitter antennas and the receiver antennas, respectively.

The temporal correlation of the channel is directly due to the Doppler spread where the channel coefficients undergo fading with respect to time. For outdoor environments, the auto-correlation of the channel coefficient can be affected by the relative motion of the user terminal and the base station.

For indoor wireless channels, the typical fading effect scenario involves human-based motion as opposed to the relative motion between the transmitter and the receiver [18]. These fading effects can be described by the following Doppler power spectrum:

\[
S(f) = \frac{1}{1 + A \left( \frac{f}{f_d} \right)^2}
\]  

(3.2)

where \( A \) is a constant chosen so that \( S(f) = 0.1 \) (a 10 dB drop) at frequency \( f = f_d \) (thus \( A = 9 \)) and \( f_d \) is the Doppler spread. Based on new experimental data collected during the 802.11ac standardization, the channel coherence time was set to 800 ms, equivalent to a Doppler spread of \( f_d = 0.414 \) Hz [18].

In terms of frequency selectivity, the power delay profile (PDP) of the channel model directly affects the frequency-domain statistics of the frequency-selective channel. The 802.11ac channel model did not change the PDP definitions of 802.11n, but instead defined a mechanism to extend the previously defined PDPs to higher bandwidths. The 802.11n PDPs were defined only with a minimum tap spacing of 10 ns for bandwidths up to 40 MHz.

### 3.2.3 Feedback Delay

The 802.11n standard defines a mechanism for channel feedback from the STA to the AP. This was expanded in 802.11ac to support feedback from multiple users, as shown in Fig. 3.3. First, the AP sends a null data packet announcement (NDPA) frame to start the CSI feedback process. The null data packet (NDP) is a packet that contains only the training symbols and is used solely for sounding the channel. After the NDP is received, each STA sends the very high throughput (VHT) Compressed Beamforming frame containing the channel feedback information.

As seen in the above protocol, a complete MAC and PHY module that can process MAC information elements must be available in order to experiment with MU-BF transmissions. We propose an implementation of a feedback channel emulator that automatically generates MIMO channel feedback with programmable delay timing. This function helps to evaluate MU-BF without channel estimation and with very minimal MAC features. In other words, one benefit of using our channel emulator instead of the wireless channel is that channel feedback can be provided to the AP without initiating the protocol in the MAC. In addition, the channel evolution due to the time delay associated with the protocol can be parameterized to simulate various update periods in real WLAN operation.

In the conventional model [20], the design of the channel emulator that generates the channel coefficients is shown in Fig. 3.4. At the beginning, the AP MAC sends the NDP to start the CSI process. The CSI is estimated at the PHY of each STA. The MAC of each STA then constructs the beamforming report frame and feeds it back to the AP. At the AP, the PHY parses each channel feedback and the MAC computes the MU-BF weights used to produce the MU-BF signal. The MU-BF weights computed by the MAC are stored in the MU-BF RAM inside the AP. Note that this is done transparently to the PHY, meaning that the PHY will use any MU-BF weight stored in the MU-BF RAM regardless of the "freshness" of its contents.

In the design of our proposed MAC and PHY operation for evaluation, the channel feedback is written directly by the proposed channel emulator. This results in a much simpler flow, as shown in Fig. 3.5. Because the feedback channel coefficients are generated by the proposed channel emulator, the non-AP STAs do not need any MAC functions and hence their MAC layer can be omitted. Moreover, only very minimal MAC features are used at the AP: the CSI RAM that stores the channel feedback from the STAs and the MU-BF weight calculation. In addition, the physical layer service data unit (PSDU) RAM that contains the packets to be transmitted is also needed. The rest of the MAC features, such as carrier sense multiple access with collision avoidance (CSMA/CA), control or management frames and operator, are not needed. If instead the transmitter and the receiver share information through a direct connection, there are two technical problems. First, the transmitter and the receiver must agree on an NDP-like signaling scheme and some related control information to support the direct connection. Hence, one needs to create a crude channel sounding protocol, which itself must be verified. This procedure is inefficient and prone to error. In contrast, the proposed emulator is transparent to the transmitter and the receiver except for the writing of the feedback channel coefficients to the transmitter RAM. Second, when the delay duration is large, our proposed emulator has the advantage of reducing the memory registers needed to hold the feedforward channel until the delay time has elapsed.

The delay controller in our proposed design is shown in Fig. 3.6. This controller selects the feedback delay duration $T_d$ used for generating the feedback channel. In a realistic channel environment, because of delays in gathering CSI, e.g. due to CSMA/CA and random back-off, the CSI feedback delay for each STA is a random number. To emulate the feedback channel in this case, the delay controller sets the duration to a random number, following the same design as the simulator of the IEEE 802.11ac system. Our channel emulator supports both random and constant delays. When evaluating a new MU-BF scheme, a constant delay is very helpful. Published papers have verified MU-MIMO BER performance using a constant delay [21], [22]. For such cases, the proposed channel emulator provides a programmable constant delay, e.g. 20 ms or 40 ms. In our proposed system, the delay controller sets the delay duration to any pre-defined value given by user input.
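To give a feel for how much the channel can evolve over such delays, the following Python sketch evaluates the normalized autocorrelation implied by the bell-shaped Doppler spectrum of Eq. (3.2). It assumes the ideal spectrum over all frequencies, for which the inverse Fourier transform gives $\rho(\tau) = \exp(-2\pi f_d |\tau| / \sqrt{A})$. The function name and default arguments are illustrative, and the hardware emulator may deviate from this idealized value.

```python
import math

def feedback_correlation(delay_s, doppler_hz=0.414, a_const=9.0):
    """Normalized autocorrelation exp(-2*pi*f_d*|tau|/sqrt(A)) of the
    ideal (untruncated) bell-shaped spectrum S(f) = 1 / (1 + A*(f/f_d)^2)."""
    return math.exp(-2.0 * math.pi * doppler_hz * abs(delay_s) / math.sqrt(a_const))

# Correlation between feedforward and feedback coefficients for a few
# constant feedback delays T_d (values in milliseconds).
for delay_ms in (0, 20, 40):
    rho = feedback_correlation(delay_ms / 1000.0)
    print(f"T_d = {delay_ms:2d} ms -> correlation {rho:.4f}")
```

Under these assumptions the correlation stays close to 1 for delays of a few tens of milliseconds, which is consistent with the slow 0.414 Hz Doppler spread of the indoor model.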

### 3.3 Hardware Platform Implementation

In the hardware implementation, the parameters of the 802.11ac channel emulator are used as an example. The structure of the MIMO channel coefficient generator block of the 802.11ac channel emulator is shown in Fig. 3.7. The main components include the additive white Gaussian noise (AWGN) generator, the Doppler fading emulated by a low-pass filter (LPF), the spatial correlator, the PDP blocks and the line-of-sight (LOS) effects. The channel emulator specification is shown in Table 3.1.

Table 3.1: Channel Emulator Specification

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Output Sampling Rate</td>
<td>124 Hz</td>
</tr>
<tr>
<td>Doppler Frequency</td>
<td>0.414 Hz</td>
</tr>
<tr>
<td>Channel coherence time</td>
<td>800 ms</td>
</tr>
<tr>
<td>PDP tap spacing</td>
<td>5 ns</td>
</tr>
<tr>
<td>Number of taps</td>
<td>35</td>
</tr>
<tr>
<td>Supported Channel Models</td>
<td>TGac A-F</td>
</tr>
<tr>
<td>Supported MIMO Configuration</td>
<td>4 × 4</td>
</tr>
<tr>
<td>Supported Number of Users/Streams</td>
<td>2</td>
</tr>
<tr>
<td>Transmit signal bandwidth</td>
<td>80 MHz</td>
</tr>
</tbody>
</table>

### 3.3.1 Design of Functional Blocks

The functional blocks of the 802.11ac channel model are shown in Fig. 3.7, and the functional blocks of the proposed channel emulator are based on this model. According to Table 3.1, the case with the maximum number of channel coefficients to be generated is Channel Model D (35 PDP taps) with the 4 × 4 MIMO TGac configuration and 5 ns PDP tap spacing. This configuration needs a total of \( \text{Chan\_Forward} = \text{Num\_PDPtaps} \times M \times R \times 2 = 35 \times 4 \times 4 \times 2 = 1120 \) independent Gaussian numbers, where \( \text{Num\_PDPtaps} \) is the number of PDP taps. The factor of 2 accounts for the channel coefficients being complex numbers. If these functional blocks were processed in parallel, the Gaussian numbers would need 1120 low-pass filter, spatial correlation and Rician blocks to generate the channel coefficients. When a feedback channel is supported, the total doubles to \( \text{Chan\_Coef} = \text{Chan\_Forward} \times 2 = 1120 \times 2 = 2240 \) independent Gaussian numbers, as presented in Fig. 3.8.

As the number of coefficients is very large and the hardware resources are limited, a parallel implementation cannot be fitted. To address this issue, we propose a design methodology that computes all channel coefficients using a single-path implementation. Since the clock frequency of the FPGA board is high (80 MHz), we propose to use a higher sampling frequency to reduce the complexity. For example, the sampling frequency of the Doppler filter \( Samp\_Rate \) is 124 Hz and, with a maximum of 35 PDP taps for model D, the maximum frequency needed to generate all 2240 channel coefficients is \( f_{\text{serial}} = Samp\_Rate \times Chan\_Coef = 124\,\text{Hz} \times 2240 \approx 277.7\,\text{kHz} \). Therefore, by increasing the sampling frequency, all channel coefficients are generated serially by a design that includes one Gaussian generator, one LPF, one spatial correlation block and one Rician fading block. This reduces the computational complexity by up to 99% compared with the parallel processing of the conventional design. The single-path processing is shown in Fig. 3.8.
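The coefficient counts and the serial generation rate quoted above can be reproduced with a few lines of Python. The constant and function names below are chosen for this sketch only; the numerical values are those of Table 3.1 and channel model D.

```python
NUM_PDP_TAPS = 35     # channel model D with 5 ns tap spacing
M = 4                 # number of transmitter antennas
R = 4                 # number of receiver antennas
SAMP_RATE_HZ = 124    # Doppler filter output sampling rate

def coefficient_counts(num_taps, m, r, with_feedback=True):
    """Return (feedforward, total) counts of real Gaussian streams.
    The factor 2 covers the real and imaginary parts of each coefficient."""
    forward = num_taps * m * r * 2
    total = forward * 2 if with_feedback else forward
    return forward, total

chan_forward, chan_coef = coefficient_counts(NUM_PDP_TAPS, M, R)
f_serial_hz = SAMP_RATE_HZ * chan_coef   # rate of the single serial datapath

print(chan_forward, chan_coef, f_serial_hz)  # 1120 2240 277760
```

The resulting serial rate of about 278 kHz is far below the 80 MHz board clock, which is what makes the time-multiplexed single-path design feasible.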

This architecture makes use of a model-based design methodology using the Synphony Model Compiler (SMC) from Synopsys, Inc. Model-based design utilizes mathematical and visual methods for rapid simulation and prototyping. This is especially suitable for channel design, where channel models are described either visually or mathematically.

### 3.3.2 Gaussian Random Number Generator

To generate these numbers, we use the uniform random number generator (URNG) block in SMC and apply the central limit theorem by adding time samples of the URNG block. To ensure no correlation between random coefficients, we add many uniform random generators which have different random seeds. Therefore, the maximum frequency becomes

\[
 f_{\text{MAXuniform}} = f_{\text{serial}} \times U = 277.7\text{kHz} \times 4 = 1.1\text{MHz}
\]

where \(U\) is the number of uniform random generators added, so that one Gaussian sample is produced every 4 uniform samples in this case. We chose \(U = 4\) as a good trade-off between complexity and a low sampling frequency. The AWGN generator block is shown in Fig. 3.9. The top branch produces all the necessary taps for the main (feedforward) channel output, while the bottom branch produces all the necessary taps for the feedback channel output. Note that at the end of the block, a commutator sequentially switches the data from two parallel input ports to a single output port, so the data rate of the output port doubles, as in Fig. 3.9. This is called a single-path implementation. Therefore, the output of the AWGN
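The central limit theorem approach of this section can be sketched in Python as follows. The sketch sums one draw from each of $U = 4$ independently seeded uniform generators and normalizes the sum to zero mean and unit variance. The seeds, the sample count and the function name are assumptions for illustration, and the URNG block of the actual design is not modeled.

```python
import random
import statistics

def clt_gaussian(rngs):
    """Approximate a zero-mean, unit-variance Gaussian sample by summing
    one uniform(0, 1) draw from each generator in rngs."""
    u = len(rngs)
    total = sum(rng.random() for rng in rngs)      # mean u/2, variance u/12
    return (total - u / 2.0) / (u / 12.0) ** 0.5   # normalize the sum

U = 4
# Different seeds keep the uniform streams uncorrelated (hypothetical seeds).
generators = [random.Random(seed) for seed in range(1, U + 1)]
samples = [clt_gaussian(generators) for _ in range(20000)]

print(round(statistics.mean(samples), 3), round(statistics.stdev(samples), 3))
```

With only four uniforms the output is bounded to roughly ±3.46 standard deviations, so the tails are truncated compared with a true Gaussian; this is part of the complexity trade-off mentioned above.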

### 3.3.3 Doppler Filter

As mentioned in the previous section, the time-variant channel is modeled by a bell-shaped power spectrum. The TGn channel model provides the digital filter in Eq. (3.3), an infinite impulse response (IIR) filter, which is used in our emulator.

\[
S(f) = U \, \frac{b_0 + b_1 z^{-1} + b_2 z^{-2} + \ldots + b_7 z^{-7}}{a_0 + a_1 z^{-1} + a_2 z^{-2} + \ldots + a_7 z^{-7}}
\]

(3.3)

where \( U = 2.79 \), the denominator coefficients \( a_0, a_1, a_2, a_3, a_4, a_5, a_6, a_7 \) are \( 1.00, -5.94, 14.8, -19.9, 15.2, -6.44, 1.28, 0.06 \), respectively, and the numerator coefficients \( b_0, b_1, b_2, b_3, b_4, b_5, b_6, b_7 \) are \( 1.00, -4.63, 9.40, -10.9, 7.91, -3.59, 0.92, -0.09 \), respectively [19]. Because we use these IIR filter parameters according to the 802.11ac channel model, we chose a normalization factor of 300, consistent with [19], to obtain the effective sampling rate of the Doppler filter. This rate equals the Doppler spread \( f_d = 0.414 \) Hz multiplied by the normalization factor: \( f_s = f_d \times 300 = 0.414 \times 300 \approx 124 \) Hz.

While parallel processing needs a total of 2240 IIR filters for all 2240 channel coefficients, as in Fig. 3.7, the single-path implementation needs only one IIR filter for all coefficients, which keeps the complexity low. Normally, an IIR filter cannot be shared among multiple input streams, because switching between streams destroys the previous state held in the filter registers. To share the filter without affecting the statistics of the generated channel taps, we use banks of random-access memory (RAM) to save the filter states before switching from one channel tap to another. We use 7 RAM blocks of size 2240 × 32 bit to store the filter states of all channel taps. The design is shown in Fig. 3.10.
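The state-saving idea can be sketched in Python as follows: a single filter datapath is time-multiplexed over several streams, and the delay registers of each stream are restored from and saved to a per-stream memory around every sample. The class and function names, the direct-form-II-transposed structure and the small second-order test coefficients are assumptions for illustration; they are not the TGn filter coefficients or the actual FPGA architecture.

```python
def iir_step(b, a, state, x):
    """Process one sample with a direct-form-II-transposed IIR filter.
    a[0] is assumed to be 1. Returns (output, new_state)."""
    n = len(state)
    y = b[0] * x + state[0]
    new_state = [
        b[i + 1] * x - a[i + 1] * y + (state[i + 1] if i + 1 < n else 0.0)
        for i in range(n)
    ]
    return y, new_state

class SharedIIR:
    """One filter datapath shared by many streams.
    state_ram[k] plays the role of the RAM words holding stream k's registers."""

    def __init__(self, b, a, num_streams):
        self.b, self.a = b, a
        # A filter of order len(a) - 1 needs that many state words per stream.
        self.state_ram = [[0.0] * (len(a) - 1) for _ in range(num_streams)]

    def process(self, stream, x):
        state = self.state_ram[stream]                    # restore registers
        y, new_state = iir_step(self.b, self.a, state, x)
        self.state_ram[stream] = new_state                # save registers
        return y

# Hypothetical stable second-order filter, interleaved over two streams.
b = [0.2, 0.3, 0.1]
a = [1.0, -0.5, 0.25]
inputs = [[1.0, 0.0, 0.5, -1.0], [0.3, 0.7, -0.2, 0.9]]
shared = SharedIIR(b, a, num_streams=2)
outputs = [[], []]
for n in range(4):
    for k in range(2):
        outputs[k].append(shared.process(k, inputs[k][n]))
```

For the seventh-order Doppler filter, `len(a) - 1` equals 7, which matches the 7 state memories mentioned above, with one address per channel coefficient.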

### 3.3.4 Spatial Correlation Block

While the temporal elements of the matrices have already been correlated by the Doppler filter, the spatial domain is still uncorrelated. Let the output of Doppler filter be arranged into a column vector $H_{iid}^l$ such that

$$H_{iid}^l = \begin{bmatrix} h_{11}^l & h_{21}^l & \cdots & h_{M1}^l & h_{12}^l & h_{22}^l & \cdots & h_{MR}^l \end{bmatrix}^T$$

(3.4)

where $T$ is the transpose of a matrix. Equation (3.1) can be rewritten as

$$H_{V}^l = CH_{iid}^l$$

(3.5)

where $C$ can be obtained from the Cholesky decomposition

$$R^l = CC^H$$

(3.6)
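Equations (3.1), (3.5) and (3.6) can be illustrated with a small Python sketch that builds a Kronecker correlation matrix, factors it with a Cholesky decomposition and applies the factor to an uncorrelated vector. The 2 × 2 correlation matrices and the helper names are hypothetical values chosen for this example, not the TGac parameters.

```python
def kron(a, b):
    """Kronecker product of two square matrices given as nested lists."""
    n, m = len(a), len(b)
    return [[a[i][j] * b[k][l] for j in range(n) for l in range(m)]
            for i in range(n) for k in range(m)]

def cholesky(r):
    """Lower-triangular C with C * C^H = r for a Hermitian positive-definite r."""
    n = len(r)
    c = [[0j] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(c[i][k] * c[j][k].conjugate() for k in range(j))
            if i == j:
                c[i][j] = complex((r[i][i] - s).real ** 0.5)
            else:
                c[i][j] = (r[i][j] - s) / c[j][j]
    return c

def times_conj_transpose(c):
    """Return C * C^H, used here only to check the factorization."""
    n = len(c)
    return [[sum(c[i][k] * c[j][k].conjugate() for k in range(n))
             for j in range(n)] for i in range(n)]

# Hypothetical Hermitian, positive-definite correlation matrices.
R_TX = [[1.0, 0.5], [0.5, 1.0]]
R_RX = [[1.0, 0.3j], [-0.3j, 1.0]]

R = kron(R_TX, R_RX)          # Eq. (3.1): R = R_TX (x) R_RX
C = cholesky(R)               # Eq. (3.6): R = C C^H
h_iid = [1 + 0j, 0j, 0j, 0j]  # uncorrelated input vector
h_v = [sum(C[i][k] * h_iid[k] for k in range(4)) for i in range(4)]  # Eq. (3.5)
```

Because $C$ is lower triangular, each output element depends only on the input elements up to its own index, which is one reason the matrix-vector product maps well onto a serial multiply-accumulate datapath.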

The spatial correlation block needs a total of $M^4 = 256$ complex multipliers (with $M = R = 4$). As with the Doppler filter block, we use a single complex multiplier block, oversampled by a factor of 256. Given that the output sampling frequency of the Doppler filter is 124 Hz and that there is a maximum of 35 PDP taps, the spatial correlation block needs to run at about $124 \times 35 \times 256 \approx 1.1$ MHz. We also use a simplified complex multiplier that needs only three real multipliers instead of the usual four, to reduce hardware resource utilization.
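One common way to obtain a complex product with three real multiplications instead of four is the rearrangement often attributed to Gauss, sketched below in Python. The thesis does not state which three-multiplier structure is used in the hardware, so this is an assumed illustration, and the function name is hypothetical.

```python
def complex_mul_3(a, b, c, d):
    """Compute (a + jb) * (c + jd) with three real multiplications.
    Returns (real, imag). Uses five additions/subtractions in exchange."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2

# (3 + 2j) * (1 + 4j) = -5 + 14j
print(complex_mul_3(3, 2, 1, 4))  # (-5, 14)
```

The saving of one multiplier is paid for with extra adders, which are usually much cheaper than multipliers on an FPGA.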

### 3.3.5 Rician Fading Block

Figure 3.10: Doppler filter block

In general, the wireless MIMO channel consists of a line-of-sight (LOS) component and non-line-of-sight (NLOS) components. In this section, both LOS and NLOS fading are considered. The power of the first tap, i.e. the LOS component, which is much larger than the NLOS components, is added to generate Rician fading as in Eq. (3.7).

$$
H = \sqrt{P} \left( \sqrt{\frac{K}{K+1}} H_{LOS} + \sqrt{\frac{1}{K+1}} H_{Rayleigh} \right)
$$

(3.7)

where $P$ is the overall power of the channel, $K$ is the Rician K-factor, defined as the ratio of the LOS to the NLOS component power, $H_{LOS}$ is the LOS matrix and $H_{Rayleigh}$ is the Rayleigh matrix. Rician fading with $K = 0$ reduces to Rayleigh fading. When the LOS component exists, $K > 0$.
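Equation (3.7) can be applied element by element, as in the following Python sketch. The function name and the example matrices are hypothetical, and the K-factor is taken in linear scale (not in dB) as an assumption of this example.

```python
import math

def rician_combine(h_los, h_rayleigh, k_factor, power=1.0):
    """Apply Eq. (3.7) element-wise to two equally sized matrices
    given as nested lists. k_factor is the linear-scale Rician K-factor."""
    w_los = math.sqrt(k_factor / (k_factor + 1.0))   # weight of the LOS part
    w_nlos = math.sqrt(1.0 / (k_factor + 1.0))       # weight of the NLOS part
    scale = math.sqrt(power)
    return [[scale * (w_los * l + w_nlos * r) for l, r in zip(row_l, row_r)]
            for row_l, row_r in zip(h_los, h_rayleigh)]

# Hypothetical 2 x 2 example: fixed LOS matrix and one Rayleigh realization.
H_LOS = [[1 + 0j, 1j], [1j, 1 + 0j]]
H_RAY = [[0.3 - 0.4j, -0.1 + 0.2j], [0.5 + 0.1j, -0.2 - 0.6j]]

H_k0 = rician_combine(H_LOS, H_RAY, k_factor=0.0)   # pure Rayleigh case
H_k3 = rician_combine(H_LOS, H_RAY, k_factor=3.0)   # LOS carries 75% of the power
```

The two weights satisfy $K/(K+1) + 1/(K+1) = 1$, so the total power stays at $P$ regardless of how it is split between the LOS and NLOS parts.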

As in the spatial correlation block, the Rician fading block needs to run at about 1.1 MHz.

### 3.3.6 FPGA Implementation

Fig. 3.11 shows the channel emulator as a part of a complete MU-MIMO evaluation platform. The transmitter and receiver are a complete MAC and PHY 802.11ac verification platform previously implemented in [17].

The channel emulator board contains 5 Stratix II EP2S180F1508 FPGAs and one Virtex 4 FPGA. Four of the FPGAs are equipped with 4 analog-to-digital converters (ADCs) and 4 digital-to-analog converters (DACs) for connecting with the baseband. This is connected to the passband converter of the channel emulator, called the interconnection device. There are two interconnection devices in the channel emulator block: one connects to the passband of the transmitter and the other connects to the passband of the receiver. This architecture is used to verify the transmitter, the receiver and the channel emulator at the passband. On the channel emulator board, the 4 FPGAs, called FPGA A, B, C and D, receive the transmitted signals from their ADCs and the channel coefficients generated by the remaining FPGA, called FPGA E. These FPGAs then convolve the transmitted signals with the channel coefficients to produce the received signals and send them to the receiver through their DACs. Note that the feedback channel is connected to the transmitter through a ribbon cable.

3.4 Measurement Results

In this section, we verify the results measured on the 4 × 4 channel emulator platform. First, we verify the statistical properties of the generated main channel samples by computing the Doppler spectrum of each tap and the stochastic capacity of the resulting MIMO channel. Next, we investigate the MU-MIMO features of the proposed system by testing the feedback channel output and by capturing, with an oscilloscope, the constellation of the transmitted (TX) signal after MU-BF processing. In this experiment, the transmitter and the proposed channel emulator are synthesized inside the channel emulator FPGA board. A Tektronix 3032 oscilloscope replaces the receiver to capture the constellation diagram. After the MU-BF signal is combined with the channel coefficients, the received signal is sent to the oscilloscope through the 12-bit DACs on the FPGA board. We configure the oscilloscope to display the constellation diagram using its XY display feature.

3.4.1 Statistical Verification

The simulator uses an 802.11ac system whose parameters are set as in Table 3.2. To test the Doppler spread, we set the channel emulator configuration to TGac channel model D with 4 × 4 antennas. We then output the first channel coefficient $h_{11}$ of the first channel tap to the signal tap and plot its power spectral density (PSD) to compare with the Doppler spread of the TGac system simulation. In the same way, we obtain the Doppler spectrum of the second tap. Fig. 3.12a compares the PSD of the reference output from simulation with the hardware result for the first tap, while Fig. 3.12b shows the results for the second tap. As can be seen, both outputs have Doppler spectra with similar distributions. In the model for indoor wireless LANs, all taps have a classical Doppler spectrum, except for the first tap of channel D, which has a 10 dB spike [23]. The results show the 10 dB spike of the first tap and the bell shape of the second tap in Fig. 3.12a and Fig. 3.12b, respectively.

Much of the capacity increase of IEEE 802.11 systems depends on the rank of the channel matrix. In the 802.11ac channel model, the spatial correlation of the channel matrices follows the Kronecker model, which affects the channel capacity. The PHY capacity of the measured MIMO channels is calculated as in (3.8) [18].

\[
C = \log_2 \det\left( \mathbf{I}_R + \frac{SNR}{M}\,\mathbf{H}\mathbf{H}^H \right)
\tag{3.8}
\]

where SNR is the average received signal to noise ratio, \( R \) and \( M \) are the numbers of receiver and transmitter antennas respectively, \( \mathbf{I}_R \) is the \( R \times R \) identity matrix, \( \mathbf{H} \) is the channel coefficient matrix and \( (\cdot)^H \) denotes the Hermitian transpose. Assuming a 30 dB average SNR, we use (3.8) to verify the capacity of the generated MIMO channel. We set the channel emulator configuration to channel model D and the distance between transmitter and receiver to 15 m, which satisfies the NLOS condition of the TGac channel. Fig. 3.13 shows that the capacities of the first tap of model D and model E in the NLOS condition obtained from the hardware emulator match well with those of the reference channel output from the standard TGac simulator.
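As an illustration of (3.8), the following minimal sketch evaluates the capacity of a single channel realization. It is not the emulator's verification code: the helper names (`det`, `mimo_capacity`, `rayleigh_channel`) and the use of i.i.d. Rayleigh coefficients without the Kronecker correlation are assumptions made only for this example.

```python
import math
import random

def det(A):
    """Determinant of a square complex matrix (Gaussian elimination, partial pivoting)."""
    A = [row[:] for row in A]
    n = len(A)
    d = 1 + 0j
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) == 0:
            return 0j
        if pivot != col:
            A[col], A[pivot] = A[pivot], A[col]
            d = -d
        d *= A[col][col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
    return d

def mimo_capacity(H, snr_db):
    """C = log2 det(I_R + (SNR / M) H H^H) in bit/s/Hz, H given as R rows of M entries."""
    R, M = len(H), len(H[0])
    snr = 10 ** (snr_db / 10)
    gram = [[sum(H[r][m] * H[c][m].conjugate() for m in range(M))
             for c in range(R)] for r in range(R)]
    A = [[(1 if r == c else 0) + snr / M * gram[r][c] for c in range(R)]
         for r in range(R)]
    return math.log2(det(A).real)  # the determinant of a Hermitian PD matrix is real

def rayleigh_channel(R, M, rng):
    """Hypothetical i.i.d. Rayleigh channel with unit average power per entry."""
    s = math.sqrt(0.5)
    return [[complex(rng.gauss(0, s), rng.gauss(0, s)) for _ in range(M)]
            for _ in range(R)]

rng = random.Random(0)
caps = [mimo_capacity(rayleigh_channel(4, 4, rng), 30.0) for _ in range(200)]
print("mean 4x4 capacity at 30 dB: %.2f bit/s/Hz" % (sum(caps) / len(caps)))
```

Collecting such values over many channel samples gives the empirical capacity distribution that can be compared between the hardware output and the reference simulator.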

### 3.4.2 Feedback Delay Verification

In this subsection, we verify the feedback delay output of the channel emulator. Fig. 3.14 shows two waveforms separated by a 100 ms delay, which confirms the correctness of the emulator output.

Next, we demonstrate the advantage of having a programmable feedback delay. The
CSI feedback delay in the TGac system simulator changes randomly from 0 ms to 40 ms, as in an actual channel environment, while the feedback delay of the proposed channel emulator is fixed at 20 ms. Fig. 3.15 shows the BER performance with random feedback delay from the channel simulator (pink curves) and with the proposed channel emulator (blue curve). The results differ by at least 3 dB when the delay duration changes, whereas the proposed channel emulator generates a constant feedback delay and therefore gives stable performance. This function is useful in experimental tests of new MU-BF algorithms that need a constant delay duration.

Table 3.2: Simulation Parameters

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Simulator</td>
<td>802.11ac system</td>
</tr>
<tr>
<td>Number of transmitter antennas</td>
<td>4</td>
</tr>
<tr>
<td>Number of receiver antennas</td>
<td>4</td>
</tr>
<tr>
<td>Data length</td>
<td>100 bytes</td>
</tr>
<tr>
<td>Transmit signal bandwidth</td>
<td>80MHz</td>
</tr>
<tr>
<td>Modulation and coding scheme</td>
<td>2</td>
</tr>
<tr>
<td>Precoding</td>
<td>Block Diagonal</td>
</tr>
<tr>
<td>Number of iterations</td>
<td>300</td>
</tr>
<tr>
<td>Channel decoding</td>
<td>Hard Viterbi</td>
</tr>
<tr>
<td>Channel model</td>
<td>TGac model D</td>
</tr>
<tr>
<td>CSI feedback delay</td>
<td>Randomly from 0 ms to 40 ms</td>
</tr>
</tbody>
</table>

3.4.3 Platform Verification

The platform verification parameters are shown in Table 3.3. In this subsection, we verify the platform in Fig. 3.11. However, to avoid problems related to the synchronization of multiple FPGAs, we synthesize the transmitter and receiver inside the channel emulator FPGA board, which contains five FPGAs on one board. Instead of receiving the transmitted (TX) signal from an external FPGA, we generate the TX signal in FPGA A of the channel emulator board. In this verification, we assume a two-user MIMO system, quadrature phase-shift keying (QPSK) modulation, a 0.5 MHz signal bandwidth and TGac channel D. Fig. 3.16 shows the MU-BF process and Fig. 3.17 shows its platform implementation. In the MIMO channel emulator board, FPGA E implements the MIMO channel emulator, which includes the feedforward channel and the feedback channel. The transmitted signals of the two users, $x_1$ and $x_2$, are produced inside FPGA A, which also calculates the MU-BF signal by convolving the transmitted signals with the feedback channel. This signal is then sent to FPGA B and FPGA C. These FPGAs convolve the MU-BF signal from FPGA A with the feedforward channel from FPGA E to output $x_1$ and $x_2$, which are captured with the Tektronix 3032 oscilloscope. The EVM results of $x_1$ are shown in Fig. 3.18. The EVM of the hardware implementation differs by about 1% from the EVM of the Matlab simulation because of the fixed-point nature of the hardware implementation. As examples, we observe the constellation of $x_1$ on the oscilloscope at feedback delays of $T_d = 7.2$ ms, 28.7 ms and 100 ms. Fig. 3.18 shows that the EVM increases continuously as the feedback delay increases, which agrees with the degree of constellation scattering observed on the oscilloscope.
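For reference, the EVM figure discussed above can be computed as in the short sketch below. It assumes the common RMS definition (error power normalized by reference power); the function name `evm_percent` and the unit-power QPSK reference are illustrative choices, not taken from the measurement setup.

```python
import math

def evm_percent(measured, reference):
    """RMS EVM in percent: sqrt(sum |m - r|^2 / sum |r|^2) * 100."""
    err = sum(abs(m - r) ** 2 for m, r in zip(measured, reference))
    ref = sum(abs(r) ** 2 for r in reference)
    return 100.0 * math.sqrt(err / ref)

# Unit-power QPSK reference points (assumed normalization).
qpsk = [complex(a, b) / math.sqrt(2) for a in (-1, 1) for b in (-1, 1)]

# A constant offset of 0.1 on every point is used here as a toy impairment.
received = [p + 0.1 for p in qpsk]
print("EVM = %.1f %%" % evm_percent(received, qpsk))
```

A larger feedback delay makes the precoder more outdated, which appears in such a calculation as a larger error vector for every received point.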
3.5 Synthesis Results of Proposed Channel Emulator

The synthesis results for the target FPGA, Stratix II EP2S180F1508, are shown in Table 3.4. The efficiency of the single path implementation in reducing the complexity is apparent in this table. The table includes the synthesis results of the single path implementation of the feedforward channel, the single path implementation of both the feedforward and feedback channels, and the parallel processing of the feedforward channel.
In a parallel implementation, adding a feedback channel output would double the hardware complexity. A single path implementation, however, results in only a few additional non-sequential elements, even though the sequential elements such as registers double as usual. In the single path implementation, the logic utilization for both the feedforward and feedback channels is only 20%, while one feedforward channel alone takes 15%. Comparing the single path implementation with parallel processing shows the significant efficiency of the former. The estimated logic utilization of parallel processing is 16,800%, which cannot fit into the target device. The single path implementation, however, requires only 15%, a reduction by a factor of 1120. Because of the single path processing and the large
available memory resources in the FPGA, the platform can further lower the tap spacing needed for higher bandwidth at the expense of a higher operating frequency. We emulate the channel for 802.11ac, in which the Doppler frequency is fixed at 0.414 Hz. Because the Doppler frequency in the IEEE 802.11 standards is small, the proposed model uses the single path implementation, which has the advantage of reducing the hardware resources. If the Doppler frequency is high in another system, a design with more than one processing path can be used.

3.6 Summary

In this chapter, we have proposed a 4 × 4 MU-MIMO channel emulator with automatic CSI feedback, which is necessary for the evaluation of the MU-BF system. Our emulator is based on FPGA technology and rapid prototyping software tools. Synthesis results have also shown the efficiency of single path processing. After describing the theoretical model, we have outlined the emulator design and its basic operation. We have also discussed in detail the actual hardware emulator results, which are compared to the theoretical ones. The design is implemented in the target Stratix II EP2S180F1508 FPGAs, and the analog results have been verified on an oscilloscope.

Table 3.4: Synthesis Result of Feedforward Channel vs. Feedforward and Feedback Channel

<table>
<thead>
<tr>
<th>Type</th>
<th>Feedforward channel</th>
<th>Feedforward and Feedback channel</th>
<th>Parallel processing for full model</th>
</tr>
</thead>
<tbody>
<tr>
<td>Logic Utilization (W)</td>
<td>15%</td>
<td>20%</td>
<td>16,800%</td>
</tr>
<tr>
<td>-Combinational ALUTs</td>
<td>18,929 / 143,520 (13%)</td>
<td>21,663 / 143,520 (15%)</td>
<td>21,200,480 / 143,520 (14,772%)</td>
</tr>
<tr>
<td>-Dedicated logic registers</td>
<td>2,552 / 143,520 (2%)</td>
<td>9,328 / 143,520 (6%)</td>
<td>2,858,240 / 143,520 (1,992%)</td>
</tr>
<tr>
<td>Total I/O pins</td>
<td>123 / 1,171 (11%)</td>
<td>123 / 1,171 (11%)</td>
<td>134,400 / 1,171 (11,477%)</td>
</tr>
<tr>
<td>DSP blocks</td>
<td>768 / 768 (100%)</td>
<td>768 / 768 (100%)</td>
<td>768 / 768 (100%)</td>
</tr>
<tr>
<td>Total block memory bits</td>
<td>276,468 / 9,383,040 (3%)</td>
<td>549,928 / 9,383,040 (6%)</td>
<td>309,644,160 / 9,383,040 (3,300%)</td>
</tr>
<tr>
<td>Total PLLs</td>
<td>4 / 12 (33%)</td>
<td>4 / 12 (33%)</td>
<td>4 / 12 (33%)</td>
</tr>
</tbody>
</table>
Chapter 4

Higher Order QAM Modulation for Uplink MU-MIMO IDMA Architecture

4.1 Introduction

Interleave division multiple access (IDMA) is one of the multiple access schemes that are currently being considered for next generation wireless systems. Although IDMA scheme has been studied as a special form of code division multiple access (CDMA) with advantages in supporting a large number of users, it has not been widely used as a technique for uplink multiple access because of the difficulties in the multi-user detection (MUD).

IDMA distinguishes users by their different interleaver patterns. A distinguishing feature of IDMA is the necessity for MUD, which uses turbo-type iterative joint detection and decoding. In previous results on IDMA systems [4]-[7], the authors suggested the use of BPSK and QPSK modulation. For transmission with higher spectral efficiency, some papers recommended the similar superposition coded modulation (SCM), which uses multiple layers of BPSK or QPSK streams and treats them as virtual users [24],[25]. Because the effective number of users to be separated increases, the complexity of this method grows linearly with the number of SCM layers.

In this chapter, we propose a method that transmits a single layer of high order QAM modulated symbols together with a low complexity detection at the receiver. We employ the log-likelihood ratio (LLR) calculation in the soft mapper and soft de-mapper to quickly separate the bits of one user. This is especially useful for the very high order QAM modulations employed in modern wireless standards.

The soft decision demapper for a QAM modulation is in itself also computationally complex. Hence, we estimate the LLR by the simplified soft-output demapper method by using multiple comparators instead of a highly complex summation of multiple logarithms. This scheme has been previously used in bit-interleaved coded modulation (BICM) based systems such as 802.11 wireless LAN [5].

In this chapter, we explain the operation of higher order QAM modulation for IDMA system and the throughput with antenna diversity for 16-QAM, 64-QAM and 256-QAM modulation. The performance of the proposed system is shown in terms of BER and hardware complexity compared to SCM-IDMA. Due to the use of a regular QAM mapper in our proposed system, the transmitter architecture is identical to the transmitter of the 802.11 system apart from the actual interleaver pattern. Hence, our IDMA system is much easier to integrate in IEEE 802.11 system compared to the conventional SCM-IDMA system.

The chapter is organized as follows. Section 4.2 describes the proposed IDMA system. Section 4.3 introduces the iterative MUD with a simplified soft bit computation. Section 4.4 presents the simulation results of the system. Section 4.5 compares the complexity of the SCM-QPSK-IDMA and QAM-IDMA systems, and Section 4.6 concludes the chapter.

### 4.2 System Overview

The transmitter and receiver structures of the proposed IDMA system with $n$ users transmitting at the same time are shown in Fig. 4.1.

Let $d_n$ be the data sequence of user $n$. The data is encoded by a convolutional code and spread with a repetition code, which generates the chip sequence $c_n$. Then $c_n$ is permuted by the user-specific interleaver of user $n$. After symbol mapping, the symbol sequence $x_{n,k} = [x_{n,k}(1), \ldots, x_{n,k}(j), \ldots, x_{n,k}(J)]$ is produced, where $J$ is the frame length and $k$ is the antenna index. Next, the IFFT performs the OFDM modulation onto multiple subcarriers. Finally, a cyclic prefix is inserted into the OFDM symbol to prevent inter-symbol interference (ISI). This OFDM signal is transmitted over the channel.
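The spreading, interleaving and mapping stages described above can be sketched as follows. This is only an illustration under stated assumptions: the convolutional encoder and the IFFT are omitted, the alternating spreading mask, the seeded random interleaver and the 802.11-style Gray mapping table `GRAY_PAM4` are hypothetical choices, and all function names are invented for this example.

```python
import random

# Assumed Gray mapping of two bits to one 4-PAM level (one axis of 16-QAM).
GRAY_PAM4 = {(0, 0): -3, (0, 1): -1, (1, 1): 1, (1, 0): 3}

def spread(coded_bits, sp_len):
    """Repetition spreading with a balanced alternating 0/1 mask."""
    mask = [i % 2 for i in range(sp_len)]
    return [b ^ m for b in coded_bits for m in mask]

def make_interleaver(length, seed):
    """User-specific random permutation; the seed stands in for the user index."""
    perm = list(range(length))
    random.Random(seed).shuffle(perm)
    return perm

def interleave(chips, perm):
    return [chips[p] for p in perm]

def deinterleave(chips, perm):
    out = [0] * len(perm)
    for i, p in enumerate(perm):
        out[p] = chips[i]
    return out

def map_16qam(chips):
    """Map groups of four chips to one 16-QAM symbol (two chips per axis)."""
    return [complex(GRAY_PAM4[(chips[i], chips[i + 1])],
                    GRAY_PAM4[(chips[i + 2], chips[i + 3])])
            for i in range(0, len(chips), 4)]

rng = random.Random(1)
coded = [rng.randint(0, 1) for _ in range(256)]   # stands in for encoder output
chips = spread(coded, 16)                          # 4096 chips
perm = make_interleaver(len(chips), seed=7)
tx_chips = interleave(chips, perm)
symbols = map_16qam(tx_chips)                      # 1024 symbols
print(len(chips), len(symbols))
```

Because only the permutation differs between users, the same mapper and OFDM modulator can serve every user.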

At the channel, the transmitted data of each user is affected by multi-path fading with different Rayleigh coefficients. The signals of all users are then superimposed to form the received signal \( r_k(j) \).

The transmitted QAM symbol \( x_{n,k}(j) \) is written as

\[
x_{n,k}(j) = x_{n,k}^{\text{Re}}(j) + i\,x_{n,k}^{\text{Im}}(j) \tag{4.1}
\]

where the superscripts "Re" and "Im" indicate the real and imaginary parts, respectively. In this chapter, we use 16-QAM, 64-QAM and 256-QAM modulation as examples of general higher order modulations.

The MUD algorithm includes two main parts: the Elementary Signal Estimator (ESE) and the part that updates the mean and variance variables. Exact user separation relies on the accurate estimation of the variables which are sent as feedback to the ESE.

4.3 Iterative Chip-By-Chip Receiver

4.3.1 Elementary Signal Estimator

The IDMA system using higher order QAM modulation proposed in this chapter assumes
a multi-path fading channel. Because of the OFDM modulation, we assume that ISI and
inter-carrier interference (ICI) are completely eliminated.

The received signal after OFDM demodulation can be expressed as (4.2).

\[ y_k(j) = \sum_{n=1}^{N} H_{n,k}(j)x_{n,k}(j) + A_k(j) \tag{4.2} \]

where \( H_{n,k}(j) = \sum_{l=0}^{L-1} h_{n,k}(l)e^{-i2\pi jl/N_c} \) is the channel coefficient of subcarrier \( j \) for an \( L \)-path channel,
and \( A_k(j) \), the FFT of \( a_k(j) \), is complex zero-mean AWGN with variance \( \sigma^2 \). We focus on \( x_{n,k}(j) \) and rewrite (4.2) as

\[ y_k(j) = H_{n,k}(j)x_{n,k}(j) + \zeta_{n,k}(j) \tag{4.3} \]

where

\[ \zeta_{n,k}(j) = \sum_{m \neq n} H_{m,k}(j)x_{m,k}(j) + A_k(j) \tag{4.4} \]

Denoting the complex conjugate of \( H_{n,k}(j) \) by \( H_{n,k}^*(j) \) and multiplying (4.3) by it, we have (4.5).

\[ \tilde{y}_{n,k}(j) = H_{n,k}^*(j)y_k(j) = |H_{n,k}(j)|^2 x_{n,k}(j) + \tilde{\zeta}_{n,k}(j) \tag{4.5} \]

where

\[ \tilde{\zeta}_{n,k}(j) = H_{n,k}^*(j)\zeta_{n,k}(j) \tag{4.6} \]

Based on the central limit theorem, \( \tilde{\zeta}_{n,k}(j) \) can be approximated as a Gaussian variable.

This approximation is used by the ESE to generate the LLR for \( x_{n,k}(j) \).

\[ \lambda(x_{n,k}(j)) = \frac{2|H_{n,k}(j)|^2\left(\tilde{y}_{n,k}(j) - \mathrm{E}(\tilde{\zeta}_{n,k}(j))\right)}{\text{Var}(\tilde{\zeta}_{n,k}(j))} \tag{4.7} \]

\[
\mathrm{E}(\tilde{\zeta}_{n,k}(j)) = H_{n,k}^*(j)\left( \mathrm{E}(y_k(j)) - H_{n,k}(j)\mathrm{E}(x_{n,k}(j)) \right) \tag{4.8}
\]

\[
\text{Var}(\tilde{\zeta}_{n,k}(j)) = R_{n,k}^T(j)\text{Var}(\zeta_{n,k}(j))R_{n,k}(j) \tag{4.9}
\]

where

\[
\text{Var}(\zeta_{n,k}(j)) = \text{Var}(y_k(j)) - R_{n,k}(j)\text{Var}(x_{n,k}(j))R_{n,k}^T(j) \tag{4.10}
\]

\[
R_{n,k}(j) = \begin{bmatrix} H_{n,k}^{\text{Re}}(j) & -H_{n,k}^{\text{Im}}(j) \\ H_{n,k}^{\text{Im}}(j) & H_{n,k}^{\text{Re}}(j) \end{bmatrix} \tag{4.11}
\]

is the real \( 2 \times 2 \) matrix representation of \( H_{n,k}(j) \), with \(\mathrm{E}(x_{n,k}(j)) = 0\) and \(\text{Var}(x_{n,k}(j)) = I\) in the first iteration. These values are also used to update the interference mean and variance in the next iteration, which is discussed in detail in the description of the soft mapper.
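The interference bookkeeping of the ESE can be illustrated with the sketch below for one subcarrier and one antenna. It is a simplified scalar version: each symbol's variance is treated as a single real number instead of the \( 2 \times 2 \) covariance matrix used in the text, and the function name `ese_soft_estimates` and its arguments are hypothetical.

```python
def ese_soft_estimates(y, H, mean_x, var_x, noise_var):
    """Per-user interference-cancelled estimate and residual variance.

    y        : received complex sample
    H        : per-user complex channel coefficients
    mean_x   : per-user prior symbol means (0 in the first iteration)
    var_x    : per-user prior symbol variances (scalar simplification)
    noise_var: AWGN variance
    Returns a list of (g_hat, var_zeta) pairs, one per user.
    """
    mean_y = sum(h * m for h, m in zip(H, mean_x))
    var_y = sum(abs(h) ** 2 * v for h, v in zip(H, var_x)) + noise_var
    out = []
    for h, m, v in zip(H, mean_x, var_x):
        mean_zeta = mean_y - h * m            # mean of interference plus noise
        var_zeta = var_y - abs(h) ** 2 * v    # variance of interference plus noise
        # Matched filter, remove the interference mean, normalize by |H|^2.
        g_hat = h.conjugate() * (y - mean_zeta) / abs(h) ** 2
        out.append((g_hat, var_zeta))
    return out

H = [1 + 1j, 0.5 - 0.2j]
x = [3 - 1j, -1 + 3j]
y = sum(h * s for h, s in zip(H, x))          # noiseless two-user superposition

# With perfect knowledge of user 2, user 1 is recovered exactly.
est = ese_soft_estimates(y, H, mean_x=[0j, x[1]], var_x=[1.0, 0.0], noise_var=0.1)
print(est[0])
```

In the first iteration all means are zero, so the estimate of each user still contains the full interference of the others; the feedback from the soft mapper reduces it in later iterations.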

We define the signal \(\hat{g}_{n,k}(j)\) as (4.12).

\[
\hat{g}_{n,k}(j) = \frac{\tilde{y}_{n,k}(j) - \mathrm{E}(\tilde{\zeta}_{n,k}(j))}{|H_{n,k}(j)|^2} \tag{4.12}
\]

where \(\mathrm{E}(\tilde{\zeta}_{n,k}(j))\) is the mean of \(\tilde{\zeta}_{n,k}(j)\).

For demapping, we evaluate the a posteriori probability of bit \(b_{n,k}(j)\) given the signal \(\hat{g}_{n,k}(j)\), denoted \(P(b_{n,k}(j)|\hat{g}_{n,k}(j))\). Using Bayes' rule, we have

\[
P(b_{n,k}(j)|\hat{g}_{n,k}(j)) = \frac{P(\hat{g}_{n,k}(j)|b_{n,k}(j))\,P(b_{n,k}(j))}{P(\hat{g}_{n,k}(j))} \tag{4.13}
\]

As Fig. 4.2 shows, all constellation points are equally likely, so the prior is the same for every point and

\[
P(b_{n,k}(j)|\hat{g}_{n,k}(j)) \propto P(\hat{g}_{n,k}(j)|b_{n,k}(j)) \tag{4.14}
\]

In higher order QAM modulation, we need to soft de-map the received data by the LLR based on (4.15).

\[
\text{LLR}(b_{I,v,n,k}(j)) = \log \frac{P(b_{I,v,n,k}(j) = 1|\hat{g}_{n,k}(j))}{P(b_{I,v,n,k}(j) = 0|\hat{g}_{n,k}(j))} \tag{4.15}
\]

\[
\text{LLR}(b_{I,v,n,k}(j)) = \log \frac{\sum_{\alpha \in S_{I,v,n,k}^{(1)}} P(\hat{g}_{n,k}(j)|x_{n,k}(j) = \alpha)}{\sum_{\alpha \in S_{I,v,n,k}^{(0)}} P(\hat{g}_{n,k}(j)|x_{n,k}(j) = \alpha)} \tag{4.16}
\]

where \( \alpha \) is a point in the QAM constellation; \( S_{I,v,n,k}^{(0)} \) and \( S_{I,v,n,k}^{(1)} \) denote the sets of constellation points whose \( v \)-th in-phase bit is 0 and 1, respectively, where \( v \) runs up to half of the number of bits per symbol. \( S_{Q,v,n,k}^{(0)} \) and \( S_{Q,v,n,k}^{(1)} \) have the same meaning as \( S_{I,v,n,k}^{(0)} \) and \( S_{I,v,n,k}^{(1)} \), respectively, but for the imaginary component of the symbol. Computing the exact LLR for each bit of a higher order QAM signal involves the ratio of two sums of probabilities over the constellation. This calculation has to be carried out as in (4.16) for each bit of the received signal \( \hat{g}_{n,k}(j) \) (e.g. computing 8 probabilities in 16-QAM modulation).

A sub-optimal simplified LLR can be obtained by the log-sum approximation \( \log \sum_j z_j \approx \max_j \log z_j \). Thus, we have (4.17).

\[
LLR(b_{I,v,n,k}(j)) \approx \log \frac{\max_{\alpha_t \in S_{I,v,n,k}^{(1)}} P(\hat{g}_{I,n,k}(j) \,|\, x_{n,k}(j) = \alpha_t)}{\max_{\alpha_t \in S_{I,v,n,k}^{(0)}} P(\hat{g}_{I,n,k}(j) \,|\, x_{n,k}(j) = \alpha_t)} \tag{4.17}
\]

\[
LLR(b_{I,v,n,k}(j)) \approx \frac{1}{4} \left\{ \min_{\alpha_t \in S_{I,v,n,k}^{(0)}} \left( \hat{g}_{I,n,k}(j) - \alpha_t \right)^2 - \min_{\alpha_t \in S_{I,v,n,k}^{(1)}} \left( \hat{g}_{I,n,k}(j) - \alpha_t \right)^2 \right\} \tag{4.18}
\]

\[
\hat{D}_{I,v,n,k} = D_{I,v,n,k} \tag{4.19}
\]
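To make the effect of the log-sum approximation concrete, the sketch below compares the exact LLR of (4.16) with the max-log form of (4.17) for the sign bit of one 16-QAM axis. It assumes Gaussian likelihoods with per-axis variance `sigma2` and the bit labelling in which positive levels carry bit 1; both are assumptions of this example rather than statements about the thesis implementation.

```python
import math

S1 = (1.0, 3.0)     # levels whose first bit is 1 (assumed labelling)
S0 = (-3.0, -1.0)   # levels whose first bit is 0

def exact_llr(g, sigma2):
    """Ratio of summed Gaussian likelihoods, as in (4.16)."""
    num = sum(math.exp(-(g - a) ** 2 / (2 * sigma2)) for a in S1)
    den = sum(math.exp(-(g - a) ** 2 / (2 * sigma2)) for a in S0)
    return math.log(num / den)

def maxlog_llr(g, sigma2):
    """Max-log approximation: only the nearest point of each set is kept."""
    d0 = min((g - a) ** 2 for a in S0)
    d1 = min((g - a) ** 2 for a in S1)
    return (d0 - d1) / (2 * sigma2)

for sigma2 in (0.1, 2.0):
    print(sigma2, exact_llr(0.5, sigma2), maxlog_llr(0.5, sigma2))
```

At low noise the two values are practically identical, while at high noise the max-log value deviates in magnitude but keeps the same sign, which is why the approximation is usually acceptable inside an iterative receiver.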

Obtaining \( D_{I,v,n,k} \) and \( D_{Q,v,n,k} \) in (4.19) requires multiple computations of the logarithmic function and is therefore highly complex. Thus, in this chapter, we employ a further approximation, illustrated in Fig. 4.2. The mapping tables for 16-QAM, 64-QAM and 256-QAM are shown in Fig. 4.3. The approximate values of \( D_{I,v,n,k} \) and \( D_{Q,v,n,k} \) for 16-QAM modulation are shown below.

\[ D_{I,1,n,k} = \begin{cases} 
2(\hat{g}_{I,n,k}(j) + 1) & \hat{g}_{I,n,k}(j) < -2 \\
\hat{g}_{I,n,k}(j) & -2 \leq \hat{g}_{I,n,k}(j) \leq 2 \\
2(\hat{g}_{I,n,k}(j) - 1) & \hat{g}_{I,n,k}(j) > 2 
\end{cases} \tag{4.20} \]

\[ D_{I,2,n,k} = -|\hat{g}_{I,n,k}(j)| + 2, \quad \text{for all } \hat{g}_{I,n,k}(j) \tag{4.21} \]
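The piecewise expressions (4.20) and (4.21) can be checked numerically against the minimum-distance form of (4.18). The sketch below does this on a grid of input values; the per-axis bit labelling (positive levels carry first bit 1, inner levels carry second bit 1) is inferred from the signs of the two expressions and is an assumption of this example.

```python
def d1_16qam(g):
    """First-bit metric of (4.20)."""
    if g < -2:
        return 2 * (g + 1)
    if g > 2:
        return 2 * (g - 1)
    return g

def d2_16qam(g):
    """Second-bit metric of (4.21)."""
    return -abs(g) + 2

def maxlog_metric(g, s0, s1):
    """(1/4) * (min squared distance to S0 - min squared distance to S1), as in (4.18)."""
    return (min((g - a) ** 2 for a in s0) - min((g - a) ** 2 for a in s1)) / 4

BIT1 = ((-3, -1), (1, 3))     # (S0, S1) for the first bit
BIT2 = ((-3, 3), (-1, 1))     # (S0, S1) for the second bit

grid = [k / 8 for k in range(-40, 41)]
max_err = max(
    max(abs(d1_16qam(g) - maxlog_metric(g, *BIT1)),
        abs(d2_16qam(g) - maxlog_metric(g, *BIT2)))
    for g in grid
)
print("largest deviation on the grid:", max_err)
```

Only comparisons, additions and shifts by small constants are needed, which is what makes the comparator-based demapper inexpensive in hardware.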

For 64-QAM modulation, we use the same method as for 16-QAM, but we calculate the metrics of six bits instead of the four bits in 16-QAM. We have (4.22), (4.23) and (4.24).

\[ D_{I,1,n,k} = \begin{cases} 
4(\hat{g}_{I,n,k}(j) + 3) & \hat{g}_{I,n,k}(j) < -6 \\
3(\hat{g}_{I,n,k}(j) + 2) & -6 \leq \hat{g}_{I,n,k}(j) < -4 \\
2(\hat{g}_{I,n,k}(j) + 1) & -4 \leq \hat{g}_{I,n,k}(j) < -2 \\
\hat{g}_{I,n,k}(j) & -2 \leq \hat{g}_{I,n,k}(j) \leq 2 \\
2(\hat{g}_{I,n,k}(j) - 1) & 2 < \hat{g}_{I,n,k}(j) \leq 4 \\
3(\hat{g}_{I,n,k}(j) - 2) & 4 < \hat{g}_{I,n,k}(j) \leq 6 \\
4(\hat{g}_{I,n,k}(j) - 3) & \hat{g}_{I,n,k}(j) > 6 
\end{cases} \tag{4.22} \]

\[ D_{I,2,n,k} = \begin{cases} 
2(-|\hat{g}_{I,n,k}(j)| + 3) & |\hat{g}_{I,n,k}(j)| \leq 2 \\
4 - |\hat{g}_{I,n,k}(j)| & 2 < |\hat{g}_{I,n,k}(j)| \leq 6 \\
2(-|\hat{g}_{I,n,k}(j)| + 5) & |\hat{g}_{I,n,k}(j)| > 6 
\end{cases} \tag{4.23} \]

\[ D_{I,3,n,k} = \begin{cases} 
|\hat{g}_{I,n,k}(j)| - 2 & |\hat{g}_{I,n,k}(j)| \leq 4 \\
-|\hat{g}_{I,n,k}(j)| + 6 & |\hat{g}_{I,n,k}(j)| > 4 
\end{cases} \tag{4.24} \]

\( D_{Q,1,n,k}, D_{Q,2,n,k} \) and \( D_{Q,3,n,k} \) are calculated similarly to \( D_{I,1,n,k}, D_{I,2,n,k} \) and \( D_{I,3,n,k} \), but \( D_{Q,v,n,k} \) is based on the imaginary component of the received signal.
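The same numerical check can be repeated for the 64-QAM metrics. The sketch below assumes a reflected Gray labelling per axis (levels ±1, ±3, ±5, ±7) and uses the form of the second-bit metric that is continuous at the segment boundaries, that is, with a negative sign on the magnitude in its first segment; the labelling and the helper names are assumptions of this example.

```python
def d1_64qam(g):
    a, s = abs(g), (1 if g >= 0 else -1)
    if a <= 2:
        return g
    if a <= 4:
        return 2 * (g - s * 1)
    if a <= 6:
        return 3 * (g - s * 2)
    return 4 * (g - s * 3)

def d2_64qam(g):
    a = abs(g)
    if a <= 2:
        return 2 * (-a + 3)
    if a <= 6:
        return 4 - a
    return 2 * (-a + 5)

def d3_64qam(g):
    a = abs(g)
    return a - 2 if a <= 4 else -a + 6

def maxlog_metric(g, s0, s1):
    """(1/4) * (min squared distance to S0 - min squared distance to S1)."""
    return (min((g - x) ** 2 for x in s0) - min((g - x) ** 2 for x in s1)) / 4

# (S0, S1) per bit under the assumed reflected Gray labelling.
BITS = [
    ((-7, -5, -3, -1), (1, 3, 5, 7)),
    ((-7, -5, 5, 7), (-3, -1, 1, 3)),
    ((-7, -1, 1, 7), (-5, -3, 3, 5)),
]
METRICS = [d1_64qam, d2_64qam, d3_64qam]

grid = [k / 8 for k in range(-72, 73)]
max_err = max(abs(f(g) - maxlog_metric(g, s0, s1))
              for g in grid for f, (s0, s1) in zip(METRICS, BITS))
print("largest deviation on the grid:", max_err)
```

Each additional bit level only adds a few comparison thresholds, so the cost grows with the number of bits per symbol rather than with the constellation size.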

For 256-QAM modulation, we proceed as for 16-QAM and 64-QAM, but we calculate the metrics of eight bits. We have (4.25), (4.26), (4.27) and (4.28), where \( \operatorname{sgn}(\cdot) \) denotes the sign function.

\[
D_{I,1,n,k} = \begin{cases} 
8(\hat{g}_{I,n,k}(j) - 7\operatorname{sgn}(\hat{g}_{I,n,k}(j))) & |\hat{g}_{I,n,k}(j)| \geq 14 \\
7(\hat{g}_{I,n,k}(j) - 6\operatorname{sgn}(\hat{g}_{I,n,k}(j))) & 12 \leq |\hat{g}_{I,n,k}(j)| < 14 \\
6(\hat{g}_{I,n,k}(j) - 5\operatorname{sgn}(\hat{g}_{I,n,k}(j))) & 10 \leq |\hat{g}_{I,n,k}(j)| < 12 \\
5(\hat{g}_{I,n,k}(j) - 4\operatorname{sgn}(\hat{g}_{I,n,k}(j))) & 8 \leq |\hat{g}_{I,n,k}(j)| < 10 \\
4(\hat{g}_{I,n,k}(j) - 3\operatorname{sgn}(\hat{g}_{I,n,k}(j))) & 6 \leq |\hat{g}_{I,n,k}(j)| < 8 \\
3(\hat{g}_{I,n,k}(j) - 2\operatorname{sgn}(\hat{g}_{I,n,k}(j))) & 4 \leq |\hat{g}_{I,n,k}(j)| < 6 \\
2(\hat{g}_{I,n,k}(j) - \operatorname{sgn}(\hat{g}_{I,n,k}(j))) & 2 \leq |\hat{g}_{I,n,k}(j)| < 4 \\
\hat{g}_{I,n,k}(j) & 0 \leq |\hat{g}_{I,n,k}(j)| < 2 \\
\end{cases} 
\tag{4.25}
\]

\[
D_{I,2,n,k} = \begin{cases} 
4(-|\hat{g}_{I,n,k}(j)| + 11) & |\hat{g}_{I,n,k}(j)| \geq 14 \\
3(-|\hat{g}_{I,n,k}(j)| + 10) & 12 \leq |\hat{g}_{I,n,k}(j)| < 14 \\
2(-|\hat{g}_{I,n,k}(j)| + 9) & 10 \leq |\hat{g}_{I,n,k}(j)| < 12 \\
-|\hat{g}_{I,n,k}(j)| + 8 & 6 \leq |\hat{g}_{I,n,k}(j)| < 10 \\
2(-|\hat{g}_{I,n,k}(j)| + 7) & 4 \leq |\hat{g}_{I,n,k}(j)| < 6 \\
3(-|\hat{g}_{I,n,k}(j)| + 6) & 2 \leq |\hat{g}_{I,n,k}(j)| < 4 \\
4(-|\hat{g}_{I,n,k}(j)| + 5) & 0 \leq |\hat{g}_{I,n,k}(j)| < 2 \\
\end{cases} 
\tag{4.26}
\]

\[
D_{l,3,n,k} = \begin{cases} 
2(|\hat{g}_{l,n,k}(j)| + 13) & |\hat{g}_{l,n,k}(j)| \geq 14 \\
|\hat{g}_{l,n,k}(j)| + 12 & 10 \leq |\hat{g}_{l,n,k}(j)| < 14 \\
2(|\hat{g}_{l,n,k}(j)| + 11) & 8 \leq |\hat{g}_{l,n,k}(j)| < 10 \\
2(|\hat{g}_{l,n,k}(j)| + 10) & 6 \leq |\hat{g}_{l,n,k}(j)| < 8 \\
|\hat{g}_{l,n,k}(j)| + 9 & 4 \leq |\hat{g}_{l,n,k}(j)| < 6 \\
2(|\hat{g}_{l,n,k}(j)| + 8) & 2 \leq |\hat{g}_{l,n,k}(j)| < 4 \\
2(|\hat{g}_{l,n,k}(j)| + 7) & 0 \leq |\hat{g}_{l,n,k}(j)| < 2 \\
\end{cases} 
\]  

(4.27)

\[
D_{l,4,n,k} = \begin{cases} 
|\hat{g}_{l,n,k}(j)| + 14 & |\hat{g}_{l,n,k}(j)| \geq 12 \\
|\hat{g}_{l,n,k}(j)| + 13 & 10 \leq |\hat{g}_{l,n,k}(j)| < 12 \\
|\hat{g}_{l,n,k}(j)| + 12 & 8 \leq |\hat{g}_{l,n,k}(j)| < 10 \\
|\hat{g}_{l,n,k}(j)| + 11 & 6 \leq |\hat{g}_{l,n,k}(j)| < 8 \\
|\hat{g}_{l,n,k}(j)| + 10 & 4 \leq |\hat{g}_{l,n,k}(j)| < 6 \\
|\hat{g}_{l,n,k}(j)| + 9 & 2 \leq |\hat{g}_{l,n,k}(j)| < 4 \\
2(|\hat{g}_{l,n,k}(j)| + 8) & 0 \leq |\hat{g}_{l,n,k}(j)| < 2 \\
\end{cases} 
\]  

(4.28)

\( D_{Q,1,n,k}, D_{Q,2,n,k}, D_{Q,3,n,k} \) and \( D_{Q,4,n,k} \) are calculated similarly to \( D_{I,1,n,k}, D_{I,2,n,k}, D_{I,3,n,k} \) and \( D_{I,4,n,k} \), but \( D_{Q,v,n,k} \) is based on the imaginary component of the received signal.

From equation (4.7) and equation (4.12), we have the ESE equation as in (4.29)

\[
\hat{b}_{n,k}(j) = \frac{2|H_{n,k}(j)|^4 D_{I,v,n,k}}{\text{Var}(\tilde{\zeta}_{n,k}(j))} \tag{4.29}
\]

The ESE output for the quadrature component is generated in a similar way using \( D_{Q,v,n,k} \).

### 4.3.2 Extrinsic LLR Calculation

After calculating LLR, the corresponding ESE outputs, \(\hat{b}_{n,k}(j)\), are de-interleaved with the same interleaver index of transmitter to form \(\hat{c}_{n,k}(j)\).

From equation (4.29), the extrinsic LLR can be calculated. After an initial estimate of the transmitted symbols for all STAs, the decoding of each STA’s transmitted sequence is done. For STA \(n\), the receiver performing deinterleaving is expressed as

\[
\hat{c}_{n,k}(j) = \hat{b}_{n,k}(\pi_n^{-1}(j))
\]

(4.30)

where \(\hat{b}_{n,k}(j)\) is the LLRs following the ESE processing and \(\pi_n^{-1}(j)\) is the deinterleaver address of the \(j\)-th address.

Given the deinterleaved ESE output \(\hat{c}_{n,k}(j)\), the despreading output is

\[
\hat{a}_{n,k}(i) = \sum_{sp=0}^{SP-1} \hat{c}_{n,k}(i \times SP + sp)
\]

(4.31)

where \( i = \lfloor j/SP \rfloor \), \( i = 0, 1, \ldots, J/SP - 1 \), is the index of the despread data, \( SP \) is the spreading length and \( \lfloor \cdot \rfloor \) denotes the floor operation.

The spreading can be done as

\[
\tilde{c}_{n,k}(j) = \sum_{sp=0}^{SP-1} \hat{c}_{n,k}([\frac{j}{SP}] \times SP + sp)
\]

(4.32)

The extrinsic LLR can be calculated by the difference of \(\hat{c}_{n,k}(j)\) and \(\tilde{c}_{n,k}(j)\) and followed
by the interleaver as

\[ \epsilon_{n,k}(j) = \tilde{c}_{n,k}(\pi_n(j)) - \hat{c}_{n,k}(\pi_n(j)) \]  

(4.33)
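The despreading, re-spreading and extrinsic steps of (4.31) to (4.33) reduce to a few list operations, as the sketch below illustrates. It assumes an all-ones repetition code (a ±1 spreading mask would require multiplying each chip LLR by the mask before summing) and represents the interleaver \( \pi_n \) as a hypothetical index list `perm`.

```python
def despread(c_hat, sp):
    """(4.31): sum the SP chip LLRs that belong to the same coded bit."""
    return [sum(c_hat[i * sp:(i + 1) * sp]) for i in range(len(c_hat) // sp)]

def respread(a_hat, sp):
    """(4.32): every chip position receives the summed LLR of its coded bit."""
    return [a for a in a_hat for _ in range(sp)]

def extrinsic(c_hat, sp, perm):
    """(4.33): remove each chip's own contribution, then re-interleave."""
    c_tilde = respread(despread(c_hat, sp), sp)
    return [c_tilde[p] - c_hat[p] for p in perm]

c_hat = [1.0, 2.0, 3.0, 4.0]        # deinterleaved chip LLRs (toy values)
perm = [2, 0, 3, 1]                 # toy interleaver pattern
print(despread(c_hat, 2))
print(extrinsic(c_hat, 2, perm))
```

Subtracting the chip's own LLR ensures that the information fed back to the ESE for a chip comes only from the other chips of the same coded bit.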

At the final iteration, channel decoding of the data is performed to produce the estimate of the transmitted bits \( \hat{d}_n \). In this chapter, we use the Viterbi algorithm for the channel decoder.

### 4.3.3 Interleaver

The interleaver is a key component in the design of an IDMA system. The interleavers assigned to the users should be efficient and of low complexity. Interleaver indices have to be unique and distinguishable from each other as well as easy to implement. The interleaver used in this chapter is a random interleaver: the interleaving patterns for the users are generated randomly. These patterns allow the system to uniquely identify each user during the MUD process.

### 4.3.4 Antenna Diversity

To improve the performance of higher order modulation in the IDMA system, we apply antenna diversity with two receive antennas, whose received signals are \( y_1 \) and \( y_2 \). We use Maximal Ratio Combining (MRC) with post-FFT processing, combining the LLRs after de-interleaving. The detailed system is presented in Fig. 4.4.

The total signal on the \( n \)-th user at the output of de-interleaver with \( k \)-th antenna element can be given by (4.34)

\[ \hat{a}_n(i) = \sum_{k=1}^{K} \hat{c}_{n,k}(j) \]  

(4.34)

where \( K \) is the total number of antenna and \( \hat{c}_{n,k}(j) \) is the output value of the de-interleaver.

### 4.3.5 Soft mapper

An important part of the IDMA system is the soft mapper, which maps the bit LLRs to the constellation as described in Fig. 4.5. Its outputs are the mean and variance used in the next iteration.

The soft mapper operates in 4 steps:

- **Step 1**: Calculating the probability of each bit with known LLR values.
- **Step 2**: Calculating the probability of each symbol.
- **Step 3**: Mapping probability of each symbol to constellation.
- **Step 4**: Calculating the mean of bits.

The extrinsic LLRs obtained after de-spreading are then used to generate the updated mean as in (4.35) and the updated variance as in (4.36).

$$E(x_{n,k}(j)) = \sum_{Nc=0}^{Nbpsc-1} (p + iq)_{Nc} * \left( \frac{\epsilon_{n,k}(j)}{1 + \epsilon_{n,k}(j)} \right)_{Nc}$$  \(4.35\)

where $Nc$ is a number of points in constellation diagram, $Nbpsc$ is a number of bits per symbol, $p$ and $q$ are the values taken by the I and Q axes (e.g. the values are {-3, -1, +1, +3} for 16-QAM)

$$\text{Var}(x_{n,k}(j)) = \text{Var}(\alpha_{n,k}(j)) - E(x_{n,k}(j))^2$$  \(4.36\)

where $\text{Var}(\alpha_{n,k}(j))$ is the variance of the QAM symbol. $E(x_{n,k}(j))$ and $\text{Var}(x_{n,k}(j))$ are updated in (4.8) and (4.10) respectively to calculate the LLR for $x_{n,k}(j)$. 
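The four soft-mapper steps can be sketched for 16-QAM as follows. The sketch assumes that an LLR is defined as \( \log(P(b=1)/P(b=0)) \), that the bits of a symbol are independent, and that each axis uses the Gray labelling `PAM4` below; the function names and the LLR clipping value are illustrative choices.

```python
import math

# Assumed per-axis labelling (bit 1, bit 2) -> level.
PAM4 = {(0, 0): -3, (0, 1): -1, (1, 1): 1, (1, 0): 3}

def bit_prob_one(llr, clip=50.0):
    """Step 1: P(b = 1) from the LLR (clipped to avoid overflow in exp)."""
    llr = max(-clip, min(clip, llr))
    return 1.0 / (1.0 + math.exp(-llr))

def axis_mean_var(llrs):
    """Steps 2-4 for one axis: level probabilities, then mean and variance."""
    p1 = [bit_prob_one(l) for l in llrs]
    mean = 0.0
    second_moment = 0.0
    for bits, level in PAM4.items():
        p = 1.0
        for b, q in zip(bits, p1):
            p *= q if b == 1 else 1.0 - q   # product of independent bit probabilities
        mean += level * p
        second_moment += level ** 2 * p
    return mean, second_moment - mean ** 2

def soft_map_16qam(llr_i, llr_q):
    """Return the complex symbol mean and the total symbol variance."""
    mi, vi = axis_mean_var(llr_i)
    mq, vq = axis_mean_var(llr_q)
    return complex(mi, mq), vi + vq

print(soft_map_16qam([0.0, 0.0], [0.0, 0.0]))      # no prior knowledge
print(soft_map_16qam([50.0, 50.0], [-50.0, -50.0]))  # confident bits
```

With zero LLRs the mean is zero and the variance equals the average constellation energy; as the LLRs grow, the mean approaches a constellation point and the variance shrinks, which is what tightens the interference estimate in the next ESE iteration.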

Figure 4.4: IDMA system with antenna diversity

Table 4.1: Simulation Parameter of Higher Order QAM IDMA System

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Packet data size [bit]</td>
<td>128 (16-QAM), 192 (64-QAM), 256 (256-QAM)</td>
</tr>
<tr>
<td>Number of users</td>
<td>16, 10, 7</td>
</tr>
<tr>
<td>Spreading length</td>
<td>16</td>
</tr>
<tr>
<td>Number of iterations in MUD</td>
<td>10</td>
</tr>
<tr>
<td>Number of symbols</td>
<td>1024</td>
</tr>
<tr>
<td>Number of AP antennas</td>
<td>2</td>
</tr>
<tr>
<td>Channel model</td>
<td>Rayleigh channel (9 paths)</td>
</tr>
<tr>
<td>Modulation</td>
<td>QPSK, 16-QAM, 64-QAM and 256-QAM</td>
</tr>
<tr>
<td>Cyclic Prefix</td>
<td>64</td>
</tr>
<tr>
<td>Convolution Code</td>
<td>K=1/2, L=7, [171 133]</td>
</tr>
<tr>
<td>Number of block simulation</td>
<td>1000</td>
</tr>
</tbody>
</table>

4.4 Simulation Results of QAM IDMA System

The IDMA system is simulated to assess its performance with higher order QAM modulations such as 16-QAM, 64-QAM and 256-QAM. The detailed parameters of our simulation are given in Table 4.1.

In 16-QAM modulation, the data length is 128 bits. The data is encoded with the rate-1/2 convolutional code to produce 256 coded bits. With a spreading length of 16, the coded bits are spread to 4096 chips. All users employ the same spreading sequence, which contains a balanced number of +1 and -1. After spreading, the chips of each user are interleaved by a user-specific interleaver, which is randomly and independently generated with a length of 4096. Next, these chips are mapped to higher order QAM symbols. The OFDM symbol of each user is modulated onto multiple sub-carriers by using the IFFT. The total number of sub-carriers $N_c$ is set to 1024 for every type of modulation. A cyclic prefix of 64 is inserted. Multi-path Rayleigh fading channels are used in this simulation. At the receiver side, the FFT is performed prior to the iterative MUD. The iteration number is fixed at 10 to guarantee convergence.
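The frame sizes quoted above follow from simple arithmetic, which the sketch below reproduces for the three packet sizes of Table 4.1. The function name `frame_sizes` is hypothetical; the rate-1/2 code and spreading length 16 are taken from the table.

```python
def frame_sizes(data_bits, bits_per_symbol, inv_code_rate=2, spreading=16):
    """Return (coded bits, chips, QAM symbols) for one packet."""
    coded = data_bits * inv_code_rate
    chips = coded * spreading
    return coded, chips, chips // bits_per_symbol

for name, data_bits, bps in (("16-QAM", 128, 4), ("64-QAM", 192, 6), ("256-QAM", 256, 8)):
    print(name, frame_sizes(data_bits, bps))
```

In all three cases the packet maps to 1024 symbols, which matches the number of sub-carriers $N_c$, so each packet occupies one OFDM symbol.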

In Fig. 4.6, we compare the performance of 2-layer SCM-QPSK and 16-QAM modulation in an IDMA system with one antenna. Note that the total throughput per user is equal in both cases. However, because the convergence of the two methods differs, the number of users shown in this figure corresponds to the highest number of users for which each algorithm properly converges. The figure shows that the performance of the proposed algorithm differs by only about 1 dB to 2 dB from SCM-QPSK at a BER of $10^{-4}$, while its complexity is much lower than that of SCM-QPSK, as shown in the next section.

To compensate for the reduction in the number of simultaneous users when employing high order modulation such as 256-QAM, we supplement the system with antenna diversity as in (4.34). Antenna diversity is also especially effective in severe fading situations, which can degrade the performance of a wireless system.

Fig. 4.7 shows the performance of the proposed system with the high number of users made possible by using two antennas. In this system, 16-QAM, 64-QAM and 256-QAM modulations can support up to 16 users, 10 users and 7 users respectively with good performance. These advantages are mainly due to the use of MUD and antenna diversity.

In a realistic multiple access system, each user has a different channel condition and different capability, which leads to a multiple access transmission where each user selects its modulation order independently. To show the performance of the proposed system in this scenario, we simulate a system with mixed modulation consisting of QPSK, 16-QAM, 64-QAM and 256-QAM. We select a total of 24 users, of which 15 users use QPSK, 4 users use 16-QAM, 3 users use 64-QAM and 2 users use 256-QAM. The receiver is assumed to have 2 antennas. The result in Fig. 4.8 shows that
the OFDM-IDMA system can support the realistic scenario where users select their modulation independently.

4.5 Complexity Comparison between SCM and QAM Modulation

According to the ESE algorithm for QPSK, the complexity of SCM-QPSK modulation has 32 multiplications, 20 additions/subtractions and 2 divisions [4]. On the other hand, the simplified LLR higher order QAM modulation presented in this chapter has the following hardware complexities: 16-QAM modulation has 32 multiplications, 36 additions/subtractions, 2 divisions; 64-QAM modulation has 32 multiplications, 72 additions/subtractions, 2 divisions; and 256-QAM modulation has 32 multiplications, 136 additions/subtractions, 2 divisions per chip per user per iteration.

The summary of the comparison of the complexity of the IDMA receiver with 10 iterations is shown in Table 4.2. Note that the effect of the complexity of the approximate LLR in the proposed system is reflected in the number of multiplications.

In SCM-QPSK modulation with 6 users and 2 streams per user, we have a total of 12 streams. In QPSK modulation, there are 2 bits per symbol. Thus, the total number of bits is 24. This is equivalent to the proposed 16-QAM system with 6 users, the proposed 64-QAM system
Table 4.2: Complexity Comparison between SCM and QAM Modulation

<table>
<thead>
<tr>
<th>Parameters</th>
<th>SCM-QPSK</th>
<th>16-QAM</th>
<th>64-QAM</th>
<th>256-QAM</th>
</tr>
</thead>
<tbody>
<tr>
<td>Number of users</td>
<td>9 (x2 streams/user)</td>
<td>9</td>
<td>6</td>
<td>4</td>
</tr>
<tr>
<td>Multiplications</td>
<td>5760</td>
<td>2880</td>
<td>1920</td>
<td>1280</td>
</tr>
<tr>
<td>Additions/Subtractions</td>
<td>3600</td>
<td>3240</td>
<td>4320</td>
<td>3040</td>
</tr>
<tr>
<td>Divisions</td>
<td>360</td>
<td>180</td>
<td>120</td>
<td>80</td>
</tr>
</tbody>
</table>

with 4 users and the proposed 256-QAM system with 3 users.

According to the results in Table 4.2, we can conclude that the more bits per symbol carried by the higher order QAM modulation, the lower the overall complexity of the proposed IDMA system. For the same number of transmitted bits, the complexity of 256-QAM modulation is about 25% of that of SCM-QPSK modulation.
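
As a worked check, the multiplication row of Table 4.2 can be reproduced from the per-chip figures quoted above. The calculation assumes that each total is the per-chip, per-user, per-iteration count multiplied by the number of users and by 10 iterations, and that each SCM stream counts as one user-equivalent; the symbol $M$ is introduced here only for illustration.

$$
\begin{aligned}
M_{\text{SCM-QPSK}} &= 32 \times (9 \times 2) \times 10 = 5760,\\
M_{\text{16-QAM}} &= 32 \times 9 \times 10 = 2880,\\
M_{\text{64-QAM}} &= 32 \times 6 \times 10 = 1920,\\
M_{\text{256-QAM}} &= 32 \times 4 \times 10 = 1280,
\end{aligned}
\qquad
\frac{M_{\text{256-QAM}}}{M_{\text{SCM-QPSK}}} = \frac{1280}{5760} \approx 0.22
$$

Under these assumptions, the ratio of roughly one quarter is in line with the figure stated above.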

4.6 Summary

In this chapter, the principles of the IDMA scheme for higher order QAM modulation have been presented. The IDMA system uses turbo-type iterative interference cancellation, which improves the performance and supports many users. SCM-IDMA can be used to improve the efficiency, but its structure is very complex. We have proposed a simplified LLR computation to reduce the computational complexity of QAM modulation. One of the reasons why QAM modulation has not been implemented in IDMA systems so far is that the performance of QAM-IDMA is not good. This chapter has therefore also shown the effectiveness of antenna diversity in improving the performance of QAM-IDMA.
Chapter 5

Interleaved Domain Interference Canceller for Low Latency IDMA System

5.1 Introduction

IDMA is a special form of Code Division Multiple Access (CDMA). The receiver differentiates the stations (STAs) by their unique interleaving patterns instead of spreading codes. This leads to a low-complexity receiver whose complexity grows linearly with the number of parallel STAs supported [10]. In the simplest case, the hardware complexity of the IDMA transmitter is very similar to that of a regular OFDMA or multi-carrier CDMA transmitter. However, the receiver is recursive and requires deep memory hardware. The main problem that needs to be addressed in designing an IDMA system is the latency caused by the interleaving process. For the interleavers proposed in the literature so far, both the interleaving and deinterleaving operations permute sequences serially, which takes many hardware clock periods. This leads to high processing latency and low processing throughput, and has been the bottleneck of the system throughput, especially when the number of iterations is large. The interference cancellation updates the extrinsic log likelihood ratios (LLRs) to improve performance by using the previous LLR values. Reducing the latency of each iteration therefore has a significant effect, because parallel processing cannot be employed to hasten the interference cancellation. In addition, the reduction of latency is particularly important in the case of IEEE 802.11 systems. The standard defines a short interframe space (SIFS) of $16\mu s$, within which a wireless interface must process a received frame and respond with a response frame. In practical IDMA systems, however, each iteration of the interference cancellation consists of an interleaving and a deinterleaving process, which would cause a latency much higher than the defined SIFS. This problem is a major obstacle to the adoption of IDMA in commercial systems such as IEEE 802.11.

Several papers propose different methods to reduce the latency of IDMA [30, 31, 32]. In [30], the latency is reduced by using grouped spread IDMA to decrease the number of users who participate in the iteration process. Although grouped spread IDMA has low latency and low complexity, its bit error rate (BER) performance is worse than that of an IDMA system that uses a small number of iterations. Parallel interleavers for user separation are proposed in [31] to improve the throughput. However, the correlation properties of these interleavers are very poor, which degrades the BER performance [31]. In [32], the feasibility of implementing IDMA in current large scale integration (LSI) technology is demonstrated, and dual-frame processing is proposed to reduce the latency due to the waiting time which occurs in the interleaver and deinterleaver memory units. This is done by doubling the memory size of the random-access memory (RAM) blocks to process two frames simultaneously. The dual-frame processing in [32] uses the waiting time to transmit two frames and thereby doubles the throughput, but it cannot reduce the latency in the iteration of the interference cancellation. In contrast, our proposed architecture can reduce the latency by half by simplifying the architecture, without the need to double the memory size of the RAM. This architecture, called the interleaved domain architecture, calculates the updated extrinsic LLRs to detect users in the interleaved domain, without deinterleaving in each iteration of the interference canceller. As a result, the proposed architecture can increase the throughput by decreasing the latency to half without increasing the complexity.

The rest of the chapter is organized as follows. Section 5.2 analyzes the latency of the IDMA interference canceller. Section 5.3 describes the proposed interleaved domain architecture in detail. Section 5.4 presents the hardware implementation of the proposed architecture. The results are shown in Section 5.5. Lastly, Section 5.6 concludes this chapter.
5.2 Latency Analysis

In this section, we focus on the interference canceller of the IDMA receiver shown in Fig. 5.1. In the interference canceller, the extrinsic LLR is calculated to generate the updated variables for the ESE in the next iteration.

Each iteration of the interference cancellation involves the following processes:

– ESE
– Deinterleaver
– Despreader
– Spreader
– Extrinsic LLR computation
– Interleaver
– Soft mapper

From the received signal $y_k(j)$, the first process computes an initial estimate of each user's data bits using (4.29) to obtain $\hat{b}_{n,k}(j)$. The next step is the deinterleaver shown in (4.30). Because of the writing process of the deinterleaver, the memory operations need $J$ cycles, which is equal to the frame size. The despreading step expressed in (4.31) is an accumulator operation whose latency, equal to the spreading factor $SP$, is negligible. The computation of the extrinsic LLR shown in (4.33) includes the interleaving, which again needs $J$ cycles. Lastly, the feedback update of the variables in (4.35)–(4.36) also has negligible latency because it uses a lookup table. The sum of the soft mapper delay and the ESE delay is $Ctrl$ cycles. In our design, $Ctrl$ equals 14 cycles: 6 cycles caused by the soft mapper and 8 cycles caused by the ESE. Since the deinterleaving/interleaving length is very large compared to the spreading length and the arithmetic computation delay, the largest latency of the IDMA system is in the interleaver and deinterleaver, with $2 \times J$ delay cycles for the conventional architecture. Table 5.1 shows the summary of the latency.

### 5.3 Proposed Interleaved Domain Architecture

The relation between the interleaver and deinterleaver can be expressed as follows:

$$\hat{c}_{n,k}(j) = \hat{b}_{n,k}(\pi_n^{-1}(j)) \Leftrightarrow \hat{c}_{n,k}(\pi_n(j)) = \hat{b}_{n,k}(j)$$  \hspace{1cm} (5.1)

On the other hand, the extrinsic LLR can be calculated as

$$\epsilon(x_{n,k}(j)) = \sum_{sp=0}^{SP-1} \hat{c}_{n,k} \left( \left[ \frac{\pi_n(j)}{SP} \right] \times SP + sp \right) - \hat{b}_{n,k}(\pi_n^{-1}(\pi_n(j)))$$  \hspace{1cm} (5.2)

$$= \sum_{sp=0}^{SP-1} \hat{c}_{n,k} \left( \left[ \frac{\pi_n(j)}{SP} \right] \times SP + sp \right) - \hat{b}_{n,k}(j)$$  \hspace{1cm} (5.3)

$$= \sum_{sp=0}^{SP-1} \hat{c}_{n,k} \left( \left[ \frac{\pi_n(j)}{SP} \right] \times SP + sp \right) - \hat{c}_{n,k}(\pi_n(j))$$  \hspace{1cm} (5.4)
As shown in (5.4), the extrinsic LLR can be calculated by subtracting the current data from the sum of all data in one spreading codeword. The sum of the data in one spreading codeword is calculated by \( \sum_{sp=0}^{SP-1} \hat{c}_{n,k} \left( \left[ \frac{\pi_n(j)}{SP} \right] \times SP + sp \right) \) and the current data is \( \hat{c}_{n,k}(\pi_n(j)) \). The data \( \hat{c}_{n,k} \), which is the data after the deinterleaver, is used instead of both \( \tilde{c}_{n,k} \) and \( \hat{c}_{n,k} \) as in (4.33). The interleaver address \( \pi_n(j) \) can be calculated by the algebraic interleaver [34] from the sequential addresses \( j \). Note that the received signal \( y_k(j) \) and the channel \( H_{n,k}(j) \), which are used to calculate the ESE, are interleaved signals. If the interference canceller can be processed in the interleaved domain, the latency can be significantly reduced.

In the original IDMA system, the data has to be deinterleaved before being processed at the despreader, and it has to be interleaved again to calculate the updated LLRs. Thus, the deinterleaver and the interleaver have to be processed sequentially in each iteration. According to (5.4), the deinterleaver, the despreader, the spreader and the interleaver can be combined and processed concurrently. Instead of using deinterleaved addresses to read the LLRs for despreading, the interleaved domain architecture uses generated interleaved addresses to read these data and calculate the extrinsic LLR. Therefore, the output of the proposed extrinsic LLR calculation is interleaved data. Fig. 5.2 presents the interleaved domain architecture in the IDMA receiver. The deinterleaver, the despreader, the spreader and the interleaver in the interference canceller are replaced by the interleaved domain block to reduce the latency by half.

In (5.4), the data in one spreading codeword must be read simultaneously for despreading. This means that \( SP \) data are read at the same time. Although a multiple-port register is able to read \( SP \) data simultaneously, implementing it on a field programmable gate array (FPGA) is impractical because it requires a large amount of hardware resources. Therefore, we propose to use multiple RAMs instead of a multiple-port register for low complexity. The number of RAMs is equal to the spreading length \( SP \). The memory size of each RAM is \( \frac{J}{SP} \). Thus, the total memory size of the \( SP \) RAMs is \( J \).

By using the RAMs, (5.4) is rewritten as (5.5), where \( \hat{c}_{n,k,m} \) is the data of the \( n \)-th user at the \( k \)-th antenna in the \( m \)-th RAM. The modulo of \( \pi_n(j) \) with respect to \( SP \) determines the RAM which stores the current data.
Figure 5.2: Proposed architecture of IDMA receiver

$$\epsilon(x_{n,k}(j)) = \sum_{m=0}^{SP-1} \hat{c}_{n,k,m} \left( \left[ \frac{\pi_n(j)}{SP} \right] \right) - \hat{c}_{n,k,(\pi_n(j) \bmod SP)} \left( \left[ \frac{\pi_n(j)}{SP} \right] \right)$$  \hspace{1cm} (5.5)
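
The equivalence between the conventional processing chain and (5.5) can be checked numerically. The following Python sketch is illustrative only: it assumes a random permutation in place of the algebraic interleaver of [34], uses small hypothetical values of $J$ and $SP$, and its function names are not taken from the implementation. It computes the extrinsic LLRs once through the deinterleaver, despreader, spreader and interleaver, and once directly from $SP$ RAM banks addressed by the integer division and the modulo of $\pi_n(j)$ by $SP$.

```python
import random


def conventional_extrinsic(b_hat, pi, sp):
    """Deinterleave, despread, spread/subtract, then interleave again."""
    J = len(b_hat)
    c_hat = [0.0] * J
    for j in range(J):
        c_hat[pi[j]] = b_hat[j]  # deinterleaver: c_hat(pi(j)) = b_hat(j), cf. (5.1)
    # despreader: one sum per spreading codeword of length sp
    sums = [sum(c_hat[g * sp:(g + 1) * sp]) for g in range(J // sp)]
    # spreader + extrinsic LLR in the sequential (deinterleaved) domain
    ext_seq = [sums[i // sp] - c_hat[i] for i in range(J)]
    # interleaver: bring the result back to the interleaved domain
    return [ext_seq[pi[j]] for j in range(J)]


def interleaved_domain_extrinsic(b_hat, pi, sp):
    """Evaluate (5.5) from sp RAM banks: one addressed write pass, then direct reads."""
    J = len(b_hat)
    rams = [[0.0] * (J // sp) for _ in range(sp)]
    for j in range(J):
        rams[pi[j] % sp][pi[j] // sp] = b_hat[j]  # bank = pi mod SP, address = pi div SP
    out = []
    for j in range(J):
        addr = pi[j] // sp
        codeword_sum = sum(rams[m][addr] for m in range(sp))  # SP parallel reads in hardware
        out.append(codeword_sum - rams[pi[j] % sp][addr])  # remove the current data
    return out


def demo(J=64, sp=4, seed=0):
    """Return the largest difference between the two computations."""
    rng = random.Random(seed)
    pi = list(range(J))
    rng.shuffle(pi)  # stand-in for the algebraic interleaver (assumption)
    b_hat = [rng.uniform(-4.0, 4.0) for _ in range(J)]
    ref = conventional_extrinsic(b_hat, pi, sp)
    new = interleaved_domain_extrinsic(b_hat, pi, sp)
    return max(abs(a - b) for a, b in zip(ref, new))
```

Calling `demo()` returns the maximum absolute difference between the two results, which should be zero up to rounding because both functions sum the same values of each codeword.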

The deinterleaver and the interleaver are omitted in the proposed architecture. Thus, the reading for the extrinsic LLR calculation in the current iteration and the writing for the updated LLR calculation in the next iteration use the same RAM address. The data is read and written at the same time in two consecutive iterations, and each data item is read \(SP\) times in random order. If only one RAM were used, the data would be overwritten. Therefore, two RAMs are used to separate the reading and the writing processes in two adjacent iterations, and the total number of RAMs becomes \(2 \times SP\). Since the target FPGA has only dual-port RAM, the proposed architecture uses each dual-port RAM as two single-port RAMs. Thus, the number of dual-port RAMs is \(SP\), and the memory size of each dual-port RAM is \(\frac{2 \times J}{SP}\). In this paper, the terms “lower half” and “upper half” of the dual-port RAMs indicate the low addresses from 0 to \(\frac{J}{SP} - 1\) and the high addresses from \(\frac{J}{SP}\) to \(\frac{2 \times J}{SP} - 1\), respectively. The lower half and the upper half are used in two consecutive iterations.

5.4 Implementation of Proposed Architecture

5.4.1 Conventional Architecture

In the conventional architecture [35], the IDMA interference canceller processes the iteration sequence of deinterleaver, despreader, spreader, extrinsic LLR calculation, interleaver, soft mapper and ESE as shown in Fig. 4.1. In a hardware design, the processing of the interleaver and deinterleaver needs two RAMs with $2\times J$ cycles to write the data. The flow chart of the conventional architecture is presented in Fig. 5.3.

In the first iteration, the initialization values are the mean $E(x_{n,k}(j))=0$ and the variance $Var(x_{n,k}(j))=1$. The ESE calculation uses the received signal $y_k(j)$, the channel of each user $H_{n,k}(j)$ and the initialization values to calculate the estimated LLRs. The ESE calculation needs $Ctrl$ delay cycles. The deinterleaver is used to separate the users, each of which has a different interleaver pattern.

In the conventional architecture, two single-port RAMs are used in each iteration. In Fig. 5.3, the deinterleaver uses one single-port RAM called “RAM_0” and the interleaver uses the other single-port RAM called “RAM_1”. The deinterleaver writes the interleaved data into “RAM_0” at the interleaved write addresses (“De_IL_WRITE”). After $J$ cycles, the data is read with sequential read addresses (“De_IL_READ”). These sequential data are despread after $SP$ cycles. In the first iteration, these LLRs are not yet correct and need to be updated. The LLRs are spread and the extrinsic LLRs are calculated by subtracting the pre-despread data from the spread data. These extrinsic values are written into “RAM_1” for the interleaver (“IL_WRITE”). After $J$ cycles of writing, the interleaved data is read (“IL_READ”). These interleaved data are used to calculate the updated mean and variance at the soft mapper, which are fed back to the ESE calculation for the next iteration.

In the last iteration, the deinterleaver and the despreader are used to export the decoded bits. The despread data is written into “RAM_1” with sequential addresses to export the decoded bits (“SP_WRITE”). This process needs $J$ cycles to write the despread data. In total, the interference cancellation in the conventional architecture needs $I \times (2\times J + SP + Ctrl)$ operation cycles.

5.4.2 Proposed Architecture

Fig. 5.4 presents the flow chart of the proposed architecture. In each iteration, the ESE calculation and the soft mapper need $Ctrl$ cycles to produce the LLRs. In the first iteration, the proposed architecture writes the interleaved LLRs with the interleaved addresses into the lower half of all dual-port RAMs. In Fig. 5.4, $ID1$ and $ID2$ are used to select the lower half or the upper half of the dual-port RAMs, labelled “All RAMs $ID1$” and “All RAMs $ID2$”, where \( ID1 = \text{mod}(\text{Iteration}, 2) \) and \( ID2 = \text{mod}(ID1, 2) \). If \( ID1 \) and \( ID2 \) are equal to 0, the lower half of the dual-port RAMs is used. Otherwise, the upper half is used. Therefore, the writing and the reading are processed in two different parts of the RAM in one iteration to avoid overwriting. “All RAMs” means “RAM 1-st” to “RAM \( SP \)-th” as in Fig. 5.5. The deinterleaver writing, called “De_IL WRITE”, needs \( J \) cycles. After writing the deinterleaved data, the proposed architecture simultaneously reads \( SP \) data from the \( SP \) RAMs with the interleaved addresses (“IL READ”). In “Extrinsic LLR calculation”, the interleaved read data from the \( SP \) RAMs are added together simultaneously for despreading, which saves \( SP \) cycles compared to the conventional system. After that, the current data is subtracted from the despread data to obtain the extrinsic LLR. In the second iteration, the LLRs output from the ESE calculation are written into the upper half of the dual-port RAMs at the addresses which correspond to the read addresses in the first iteration. These iterations are processed in a loop until the last iteration, in which Iteration is equal to $I-1$.

In the last iteration, the sequential address is used for reading as in the normal deinterleaver (“De_IL READ”). Then the LLRs from the $SP$ RAMs are added together simultaneously for the despreading to export the decoded bits. In Fig. 5.5, the “Last” signal is used to select the sequential address and to enable the export of $\frac{J}{SP}$ decoded bits. Since the proposed architecture can skip the despreading and downsampling processes at the last iteration, it saves $J$ cycles compared to the conventional system. The proposed architecture needs $J+Ctrl$ cycles to process the data in each iteration, so $I\times(J+Ctrl)$ cycles are needed in total. The latency is thus reduced by about half compared to the $I\times(2\times J+SP+Ctrl)$ cycles of the conventional architecture.

Figure 5.3: Flow chart of the conventional architecture

The proposed architecture is shown in Fig. 5.5. The inputs are described in Table 5.2. Note that the write address (WA) and the read address (RA) are sequential addresses which are generated by a counter from 0 to $J-1$. The timing chart of the write enable ($WE_1$, $WE_2$), the read enable ($RE_1$, $RE_2$) and the Last signal is shown in Fig. 5.6. $WE_1$ and $RE_1$ enable the writing and the reading process in the lower half of the dual-port RAMs. $WE_2$ and $RE_2$ enable the writing and the reading process in the upper half of the dual-port RAMs. Therefore, the delay between $WE_1$ and $WE_2$, as well as between $RE_1$ and $RE_2$, is $J + \text{Ctrl}$. The Last signal is used right after the last iteration to control the export of the decoded bits. It is set to 1 for $\frac{J}{SP}$ cycles, which is equal to the length of the despread data, as shown in Fig. 5.6.

In Fig. 5.5, the algebraic interleaver is used to generate the interleaver index. The write address input ($wa$) and the read address input ($ra$) of the RAMs are calculated based on Eq. (5.5). If the upper half is chosen, which means that $WE_2$ and $RE_2$ are equal to 1, $\frac{J}{SP}$ is added to $wa$ and $ra$, as shown by the black blocks in the figure. In the proposed architecture, the data stored in the RAMs at the same address belong to the same spreading codeword. In other words, the order of the data in the spreading codeword corresponds to the RAM index. Thus, the write enables of the first RAM ($we_1$) to the $SP$-th RAM ($we_{SP}$) determine which RAM the current data is written to. At any time, one write enable signal is equal to 1 and the others are equal to 0. In contrast, since the reading is performed simultaneously in multiple RAMs for the despreading, the read enable ($re$) is the same for all RAMs. However, the extrinsic LLR calculation needs to exclude the current data from the despreading sum. The select signals $sel_1$ to $sel_{SP}$ are used for this purpose: the select signal of the current data is set to 0.

In the last iteration, the Last signal is used as a control signal to export the decoded bits. The additional processing for the last iteration is indicated by the dashed items in Fig. 5.5. The read address is the sequential address, which is used to read the data from all RAMs as in the normal deinterleaver. Since the extrinsic LLR calculation is skipped, all read data are added together for despreading. Thus, all select signals are set to 1.
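
The address generation described above can be summarised in a few lines. The Python sketch below is a simplified behavioural model under stated assumptions, not the RTL: the function name and the return layout are hypothetical, signals are modelled as 0/1 integers, and only the iterations before the last one are covered. It maps one interleaver index to the RAM bank, the RAM address (offset by $J/SP$ when the upper half is selected), the one-hot write enables and the select signals that mask the current data during the extrinsic LLR calculation.

```python
def ram_control(pi_j, sp, frame_len, upper_half):
    """Behavioural model of the per-sample RAM control signals (illustrative)."""
    rows_per_half = frame_len // sp   # J / SP addresses in each half
    bank = pi_j % sp                  # RAM that stores the current data
    addr = pi_j // sp                 # address shared by one spreading codeword
    if upper_half:                    # WE2/RE2 active: add J/SP to wa and ra
        addr += rows_per_half
    # exactly one write enable is 1, the others are 0
    we = [1 if m == bank else 0 for m in range(sp)]
    # select signals: 0 drops the current data from the despreading sum
    sel = [0 if m == bank else 1 for m in range(sp)]
    return bank, addr, we, sel
```

For example, with $SP = 16$ and $J = 8192$ as in Table 5.3, the interleaver index 37 maps to bank 5 and address 2 in the lower half, or address 514 in the upper half.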

5.5 FPGA Implementation Results of Interleaved Domain IDMA Receiver

In order to show the performance of the proposed system as well as to confirm the soundness of the chosen design architecture, we perform simulations of the BER performance and the latency comparison of the conventional architecture with the proposed architecture. The efficiency of the proposed system in hardware utilization is also shown in this section. The default simulation parameters are listed in Table 5.3.

Figure 5.5: Architecture of the proposed interleaved domain architecture using dual-port RAM

Figure 5.6: Timing chart of the proposed architecture

5.5.1 Simulation Results of Interleaved Domain IDMA Receiver

The BER performance results of the proposed architecture and the conventional architecture are shown in Fig. 5.7. The fixed-point word length is 24 bits, consisting of an integer length of 8 bits and a fraction length of 16 bits. The simulation is run for a maximum of 10,000 iterations with a 512-bit data frame. Since the calculations of the ESE and the soft mapper remain unchanged in the proposed architecture, the BER performance of the fixed-point proposed architecture is the same as that of the fixed-point conventional architecture in the hardware implementation. The comparison between the hardware implementation of the proposed system and the Matlab simulation of the conventional system is also shown in Fig. 5.7. Since the fixed-point word length chosen in the design is large enough to represent the LLR values, the BER performance of the proposed architecture is close to that of the floating-point conventional architecture. Migrating from floating-point to fixed-point representation results in a small (0.1 dB) loss in BER performance. The
Table 5.2: Input/Output Port Parameters

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Din</td>
<td>Received signal</td>
</tr>
<tr>
<td>Hin</td>
<td>Estimated channel</td>
</tr>
<tr>
<td>WE1/WE2</td>
<td>Write enable for lower/upper half in RAMs</td>
</tr>
<tr>
<td>RE1/RE2</td>
<td>Read enable for lower/upper half in RAMs</td>
</tr>
<tr>
<td>WA/RA</td>
<td>Write/read addresses generated by a counter (0 to J-1)</td>
</tr>
<tr>
<td>Last</td>
<td>Equal 1 right after the last iteration, otherwise equal 0</td>
</tr>
<tr>
<td>Dout</td>
<td>Output signal of decoded bit</td>
</tr>
</tbody>
</table>

Figure 5.7: BER performance of the proposed system vs SNR

small difference between the two curves also shows that the proposed system is robust to fixed-point arithmetic.
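
To make the word-length setting concrete, the following Python sketch quantizes a value to a 24-bit word with 16 fraction bits, as used above. It is only an illustration: a signed two's-complement representation, round-to-nearest and saturation on overflow are assumptions, since the rounding and overflow modes of the hardware are not stated here, and the function names are hypothetical.

```python
WORD_BITS = 24   # total fixed-point word length (W_d)
FRAC_BITS = 16   # fraction length


def to_fixed(x):
    """Quantize x to a signed 24-bit integer with 16 fraction bits (saturating)."""
    scaled = round(x * (1 << FRAC_BITS))
    lowest = -(1 << (WORD_BITS - 1))
    highest = (1 << (WORD_BITS - 1)) - 1
    return max(lowest, min(highest, scaled))


def to_float(q):
    """Convert the stored integer back to a real value."""
    return q / (1 << FRAC_BITS)
```

Under these assumptions the quantization step is $2^{-16}$, and values outside roughly $\pm 128$ saturate.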

Table 5.4 compares the conventional architecture [35], the dual-frame processing [32] and the proposed architecture. \( W_d \) denotes the bit length in fixed-point operation, \( F \) indicates the clock frequency (Hz) and \( N_b \) is the frame data size (bits). Although the number of RAMs in the proposed architecture is larger than the number of
Table 5.3: Simulation Parameters

<table>
<thead>
<tr>
<th>System</th>
<th>IDMA</th>
</tr>
</thead>
<tbody>
<tr>
<td>Modulation type</td>
<td>BPSK</td>
</tr>
<tr>
<td>Frame data size [bit] ($N_b$)</td>
<td>512</td>
</tr>
<tr>
<td>Repetition code length ($SP$)</td>
<td>16</td>
</tr>
<tr>
<td>Number of symbols ($J$)</td>
<td>8192</td>
</tr>
<tr>
<td>Number of users</td>
<td>20</td>
</tr>
<tr>
<td>Number of IDMA iterations ($I$)</td>
<td>10</td>
</tr>
<tr>
<td>Number of algebraic interleaver stage</td>
<td>3</td>
</tr>
<tr>
<td>Fixed-point word length [bit] ($W_d$)</td>
<td>24</td>
</tr>
<tr>
<td>Fixed-point fraction length [bit]</td>
<td>16</td>
</tr>
<tr>
<td>Channel model</td>
<td>One-path Rayleigh fading</td>
</tr>
<tr>
<td>Signal to noise ratio [dB]</td>
<td>12</td>
</tr>
<tr>
<td>Simulation iteration (times)</td>
<td>10,000</td>
</tr>
</tbody>
</table>

RAMs in the conventional architecture, the total memory size of the proposed architecture is the same as that of the conventional architecture. Moreover, the memory size of the proposed architecture is half that of the dual-frame processing. The throughput of the proposed architecture is twice that of the conventional method and is the same as that of the dual-frame processing. However, the latency of the proposed architecture is reduced by half compared to the conventional architecture, while the dual-frame processing [32] cannot reduce the latency.

As shown above, the main contribution to the latency reduction is the interleaved domain processing in the interference cancellation. Assuming a reference frequency of 640 MHz and an interleaver length of 900 bits, we plot the latency versus the number of iterations in Fig. 5.8. As the number of iterations increases, the number of interleaving and deinterleaving operations increases, which causes the latency to become large. By processing the updated LLR completely in the interleaved domain, the latency of the proposed architecture can be reduced by half compared to the conventional architecture. At the 10-th iteration, while the conventional architecture needs about $28\mu s$ to operate the system, the proposed architecture needs only $14\mu s$, which meets the $16\mu s$ SIFS requirement of IEEE 802.11 mentioned in the Introduction. While 640 MHz is too high for an FPGA implementation, an optimized application specific integrated circuit (ASIC) implementation of
<table>
<thead>
<tr>
<th>Type</th>
<th>Conventional architecture[35]</th>
<th>Dual-frame processing [32]</th>
<th>Proposed architecture</th>
</tr>
</thead>
<tbody>
<tr>
<td>Memory size (bits)</td>
<td>$2 \times J \times W_d$</td>
<td>$4 \times J \times W_d$</td>
<td>$2 \times J \times W_d$</td>
</tr>
<tr>
<td>Throughput (bits/second)</td>
<td>$F \times N_b$</td>
<td>$2 \times F \times N_b$</td>
<td>$F \times N_b$</td>
</tr>
<tr>
<td>Operation cycles</td>
<td>$I \times (2 \times J + SP + Ctrl) + V$</td>
<td>$I \times (2 \times J + SP + Ctrl) + V$</td>
<td>$I \times (J + Ctrl) + V$</td>
</tr>
</tbody>
</table>
the proposed architecture can come close. Additional techniques such as bit width and IDMA iteration optimization can provide additional latency reduction but are outside the scope of this paper.

This simulation does not include a channel decoder such as a Viterbi decoder or a low density parity check (LDPC) decoder. In the IDMA and turbo coding literature, the convolutional encoder is usually of the recursive type because it has better performance in iterative decoding when an a posteriori probability (APP) decoder is inside the iteration loop. However, since this would cause very high latency and hardware complexity, the proposed architecture opts for a simpler iteration loop in which only the repetition decoder is placed inside the loop, as in [25]. Hence, even if the effect of a channel decoder is added, the latency increases but remains below 16μs, so that the proposed IDMA architecture can still meet the SIFS time constraint. The effect of a channel decoder with a latency of V cycles on the operation cycles can be seen in Table 5.4. For example, the Viterbi decoder in [36] needs 54 clock cycles, which translates to a mere 0.08μs of additional latency at 640 MHz.
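
As a worked check of this margin, the latency can be evaluated from the operation-cycle expression of the proposed architecture in Table 5.4. The values combine the 640 MHz clock and the interleaver length of 900 assumed above with $I = 10$, $Ctrl = 14$ and the 54-cycle Viterbi decoder of [36]; reusing $Ctrl = 14$ from Section 5.2 for this configuration is an assumption.

$$
T_{\text{proposed}} = \frac{I \times (J + Ctrl) + V}{F} = \frac{10 \times (900 + 14) + 54}{640 \times 10^{6}} = \frac{9194}{640 \times 10^{6}} \approx 14.4\,\mu s < 16\,\mu s
$$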

In Fig. 5.9, the latency evaluations of the conventional architecture and the proposed architecture are shown on the same time scale. The interference canceller needs ten iterations to estimate the bit information for each user. The operation frequency is the same for the conventional and the proposed architectures because the ESE calculation, which has the longest path delay, is the same in both architectures. Since the operation frequency is the same, the latency of the two architectures can be compared using only the operation cycles. Using the simulation parameters in Table 5.3, the conventional architecture needs \(10 \times (2 \times 8192 + 16 + 14) = 164,140\) cycles, while the proposed architecture needs \(10 \times (8192 + 14) = 82,060\) cycles in the interference cancellation. Thus, the latency of the proposed architecture is reduced by half compared to the conventional architecture, in agreement with the expressions in Table 5.4.
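
The cycle counts above follow directly from the expressions in Table 5.4. The short Python sketch below reproduces them; the function names are hypothetical, and the optional `v` argument stands for the channel-decoder cycles $V$, set to zero here because the decoder is outside the interference cancellation.

```python
def conventional_cycles(iters, frame_len, sp, ctrl, v=0):
    """I x (2J + SP + Ctrl) + V for the conventional architecture."""
    return iters * (2 * frame_len + sp + ctrl) + v


def proposed_cycles(iters, frame_len, ctrl, v=0):
    """I x (J + Ctrl) + V for the interleaved domain architecture."""
    return iters * (frame_len + ctrl) + v


def latency_us(cycles, clock_hz):
    """Convert a cycle count to microseconds at the given clock frequency."""
    return cycles / clock_hz * 1e6


# Parameters of Table 5.3: I = 10, J = 8192, SP = 16, Ctrl = 14
conv = conventional_cycles(10, 8192, 16, 14)
prop = proposed_cycles(10, 8192, 14)
print(conv, prop, round(prop / conv, 3))  # 164140 82060 0.5
```

The same functions can be used with other parameters, for example `latency_us(proposed_cycles(10, 900, 14), 640e6)` for the 640 MHz case discussed earlier.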

### 5.5.2 Synthesis Results of Interleaved Domain IDMA Receiver

The synthesis results for the target FPGA, Xilinx Virtex 6 240TFF784, are presented in Table 5.5 and Table 5.6. Table 5.5 shows the hardware utilization of the conventional architecture, the proposed architecture using single-port RAM and the proposed architecture using dual-port RAM. Because the target FPGA has only dual-port RAM, the use of single-port RAM increases the number of RAM blocks. It also requires extra logic for the address decoder. Hence, its register and look-up table (LUT) usage is higher than that of the conventional architecture and of the dual-port RAM design. The difference between the conventional architecture and the proposed architecture using dual-port RAM is small, which demonstrates that the proposed architecture using dual-port RAM is effective for the IDMA system.

The proposed architecture using dual-port RAMs increases the slice registers by 14% while reducing the slice LUTs by 8% and the occupied slices by 1% compared to the conventional architecture. The RAM and digital signal processing (DSP) block usage of the proposed architecture is the same as that of the conventional architecture. Since the proposed architecture has to generate specific write and read addresses, the number of registers needed is slightly larger than in the conventional one. However, the number of slice LUTs and occupied slices is slightly smaller than in the conventional architecture because the despreading and the extrinsic LLR calculation are combined to use one adder in the proposed architecture. The number of RAMs is the same because the total memory size is the same. The evaluated frequency is 110 MHz, which is the same for the conventional architecture and the proposed architecture because the ESE calculation, which has the longest path delay, is the
Figure 5.9: Latency evaluations of the conventional architecture and the proposed architecture
Table 5.5: Synthesis Comparisons

<table>
<thead>
<tr>
<th>Type</th>
<th>Conventional system [35]</th>
<th>Proposed system Single-port RAM</th>
<th>Proposed system Dual-port RAM</th>
</tr>
</thead>
<tbody>
<tr>
<td>Frequency</td>
<td>110 MHz</td>
<td>110 MHz</td>
<td>110 MHz</td>
</tr>
<tr>
<td>Slice Registers</td>
<td>18,604</td>
<td>25,204</td>
<td>21,204</td>
</tr>
<tr>
<td>Slice LUTs</td>
<td>41,919</td>
<td>76,947</td>
<td>38,551</td>
</tr>
<tr>
<td>Occupied Slices</td>
<td>11,903</td>
<td>22,853</td>
<td>11,734</td>
</tr>
<tr>
<td>RAMB36E1</td>
<td>160</td>
<td>160</td>
<td>160</td>
</tr>
<tr>
<td>RAMB18E1</td>
<td>320</td>
<td>480</td>
<td>320</td>
</tr>
<tr>
<td>DSP48E1s</td>
<td>420</td>
<td>420</td>
<td>420</td>
</tr>
</tbody>
</table>

Table 5.6: Synthesis Results (Xilinx Virtex 6 240TFF784)

<table>
<thead>
<tr>
<th>Type</th>
<th>Proposed system</th>
<th>Available</th>
<th>Utilization (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Slice Registers</td>
<td>21,204</td>
<td>301,440</td>
<td>7%</td>
</tr>
<tr>
<td>Slice LUTs</td>
<td>38,551</td>
<td>150,720</td>
<td>25%</td>
</tr>
<tr>
<td>Occupied Slices</td>
<td>11,734</td>
<td>37,680</td>
<td>31%</td>
</tr>
<tr>
<td>RAMB36E1</td>
<td>160</td>
<td>416</td>
<td>38%</td>
</tr>
<tr>
<td>RAMB18E1</td>
<td>320</td>
<td>832</td>
<td>38%</td>
</tr>
<tr>
<td>DSP48E1s</td>
<td>420</td>
<td>768</td>
<td>54%</td>
</tr>
</tbody>
</table>

same in both architectures.

Table 5.6 shows the hardware utilization of the proposed architecture. The result indicates that the proposed architecture can fit the target FPGA board.

5.6 Summary

We have presented an interleaved domain architecture for interference cancellation in the IDMA receiver. It reduces the latency by about 50% and roughly doubles the throughput with almost the same hardware utilization. Because the interleaved domain architecture uses the same LLR calculation equation as the conventional IDMA, its BER performance is unchanged. The simulation results show that, with a frequency of 640 MHz and an interleaver symbol of 900 bits, the processing takes about 14 μs, which is below 16 μs and therefore satisfies the SIFS requirement of 802.11 systems. The design is implemented on the target FPGA, a Xilinx Virtex 6 240TFF784. The synthesis results also show the efficiency of the proposed architecture compared to the conventional architecture and confirm that this system can be implemented on the target FPGA board.
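The SIFS check can be restated in clock cycles. This is a minimal sketch using only the figures quoted in the summary (640 MHz, about 14 μs, 16 μs); the variable and function names are assumptions for illustration, and the 14 μs value is the approximate simulation result rather than an exact cycle count.

```python
# Figures quoted in the summary; "about 14 us" is an approximate simulation result.
CLOCK_MHZ = 640      # clock frequency used in the latency simulation
PROCESSING_US = 14   # approximate processing time reported in the text
SIFS_US = 16         # SIFS limit quoted in the text

def cycles(duration_us: int, clock_mhz: int) -> int:
    """Clock cycles elapsed in duration_us at clock_mhz (MHz times microseconds)."""
    return duration_us * clock_mhz

used_cycles = cycles(PROCESSING_US, CLOCK_MHZ)   # cycles spent on processing
budget_cycles = cycles(SIFS_US, CLOCK_MHZ)       # cycles available within SIFS
margin_us = SIFS_US - PROCESSING_US              # remaining slack in microseconds

print(used_cycles, budget_cycles, margin_us)     # prints: 8960 10240 2
```

Under these figures the processing uses roughly 8960 of the 10240 cycles available, leaving a margin of about 2 μs.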
Chapter 6

Conclusions and Future Works

6.1 Conclusions

The goal of this thesis is to make IDMA systems applicable to future MU-MIMO communication systems. The IDMA system has several advantages over other uplink multiple access schemes such as OFDMA and CDMA. However, because the latency of the IDMA system is high due to iterative processing, IDMA has not yet been proposed for any wireless standard. The interleaved domain IDMA system halves the latency and doubles the throughput, which makes a practical implementation possible. Moreover, the proposed higher order QAM modulation for the IDMA system achieves low complexity and also improves the throughput. Regardless of the wireless application, the proposed MU-MIMO channel emulator is important for testing that the IDMA system and current MU-MIMO systems work properly.

A comprehensive view of MU-MIMO wireless communication systems has been provided in Chapter 1 and Chapter 2.

We have presented the implementation of the MU-MIMO channel emulator in Chapter 3. This channel emulator also includes automatic CSI feedback, which is necessary for the evaluation of the MU-BF system. Our emulator is based on FPGA technology and rapid prototyping software tools. Synthesis results have also shown the efficiency of single path processing in the hardware implementation. In a parallel implementation, adding a feedback channel output would double the hardware complexity. A single path implementation, however, adds only a few non-sequential elements, even though sequential elements such as registers still double. In the single path implementation of IEEE 802.11ac channel model D, the logic utilization for both the feedforward and feedback channels is only 20%, while one feedforward channel alone takes 15%. Comparing the single path implementation with parallel processing shows how much more efficient the single path approach is. The estimated logic utilization of parallel processing is 16800%, which cannot fit into the implementation device. The single path implementation, however, requires only 15%, a reduction by a factor of 1120.

In Chapter 4, we have proposed a low complexity IDMA system that uses simplified higher order QAM modulations. For the same number of transmitted bits per symbol, the complexity of 256-QAM modulation is about 25% compared to SCM-QPSK modulation. By using higher order QAM modulations, the proposed IDMA system improves the throughput, but at the cost of degraded performance. We have compared the performance of SCM-QPSK and higher order QAM modulation for an IDMA system with one antenna. The proposed higher order QAM modulation performs about 1 dB to 2 dB worse than SCM-QPSK-IDMA at a BER of $10^{-4}$. We have shown the effectiveness of antenna diversity in improving the performance of the QAM-IDMA system. If two antennas are used in the proposed system, the performance of the higher order QAM IDMA system improves twofold compared to the one antenna IDMA system.

In Chapter 5, we have presented the low latency IDMA system, which uses a novel interleaved domain architecture. The proposed architecture performs multi-user detection directly, without deinterleaving the received frame in each interference canceller iteration. Interleaving is also no longer needed in the interference cancellation loop, which decreases the latency. The hardware implementation of this low latency IDMA system has been presented. By using a RAM-based design instead of registers, the proposed interleaved domain architecture for interference cancellation reduces the latency to 50% and doubles the throughput with almost the same hardware utilization. The simulation results show that, with a frequency of 640 MHz and an interleaver symbol of 900 bits, the processing takes about 14 μs and hence satisfies the SIFS requirement of 802.11 systems.

As a result of the low latency and low complexity IDMA architecture, the proposed IDMA is more feasible for practical implementation in future wireless communication systems. In addition, the MU-MIMO channel emulator can provide experimental tests for the proposed IDMA implementation.

6.2 Future Works

In our future work, we will thoroughly analyze the proposed system to improve its convergence. One way to do this is optimal power allocation for the IDMA system. Another avenue for improvement is a flexible spreading length and number of iterations that depend on the number of users. Since the latency is independent of the spreading length in the proposed architecture, the control signals for a flexible spreading length may be easier to implement than in the conventional IDMA architecture.

For the chip design, a VLSI implementation of the proposed IDMA architecture is necessary to obtain the power consumption and circuit area.

In the latency simulation in Chapter 5, we use a high frequency of 640 MHz because we want to achieve a low latency. At present, this frequency is very hard to meet. Additional technologies need to be considered to achieve such high frequencies.

Because of the design complexity of registers for the low latency IDMA, the current design shown in Chapter 5 uses dual-port RAM. If multi-port RAM is supported, the proposed interleaved domain IDMA can achieve lower complexity.

The combination of the IDMA system and the OFDMA system is an interesting topic for future work. The bandwidth resources are split orthogonally into identical sub-bands, as in OFDMA. Each sub-band carries a number of users that transmit their signals simultaneously within that sub-band using IDMA. The users in other sub-bands are decoded independently without any interference. The decoding complexity of multi-user detection is therefore lower than in the IDMA system. With this combination, we obtain greater spectral efficiency and reduce the number of multi-user detections at the receiver side. Because IDMA is used instead of the NOMA power allocation technique, grouping users by weak and high channel gains is unnecessary. This leads to a low complexity system in practical implementation.
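The sub-band grouping idea can be sketched in a few lines. Because the text states that grouping by channel gain is unnecessary, the example below simply assigns users round-robin. That assignment rule, the function name, and the user and sub-band counts are all assumptions chosen for illustration, not part of the proposed design.

```python
def assign_users_to_subbands(user_ids, num_subbands):
    """Split users across orthogonal sub-bands without looking at channel gains.

    Round-robin is a hypothetical rule used only for this sketch; any
    gain-agnostic assignment would illustrate the same point.
    """
    if num_subbands < 1:
        raise ValueError("num_subbands must be at least 1")
    groups = [[] for _ in range(num_subbands)]
    for index, user in enumerate(user_ids):
        groups[index % num_subbands].append(user)
    return groups

# Illustrative numbers: 16 users shared among 4 sub-bands.
groups = assign_users_to_subbands(list(range(16)), 4)

# Each sub-band runs its own IDMA multi-user detection over its own users only.
users_per_detector = [len(group) for group in groups]
print(groups)
print(users_per_detector)
```

With these illustrative numbers, each multi-user detector handles 4 users instead of all 16, which is the source of the reduced decoding complexity described above.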
Appendix A

Snapshots of the Designs

This appendix shows snapshots of our proposed designs. For the model based designs of the MU-MIMO channel emulator in Chapter 3, we show snapshots of the circuits. For the Verilog based designs of the low latency IDMA system, we show snapshots of simulation waveforms run in ModelSim.
Figure A.1: MU-MIMO channel emulator for 4x4 antenna and 35 taps
Figure A.2: MU-MIMO channel emulator with sounding feedback
Figure A.3: MU-MIMO channel emulator evaluation by using oscilloscope

Figure A.4: Spatial correlation block of MU-MIMO channel emulator
Figure A.5: Rician block of MU-MIMO channel emulator
Acknowledgment

I would like to thank Prof. Hiroshi Ochi, who has instructed and supported me during the Ph.D. course at Kyushu Institute of Technology. I would also like to thank Prof. Masayuki Kurosaki, Dr. Leonardo Lanante and Dr. Yuhei Nagao for their insightful comments and advice throughout my research.

To my parents and siblings who support me in every undertaking in my life.

I am also indebted to the reviewers, Prof. Masato Tsuru and Prof. Xiaoqing Wen, who took the time to read my thesis manuscript and give very helpful advice, and to Prof. Shuichi Ohno and Prof. Shigenori Kinjo, who traveled to Fukuoka for my thesis defense and also gave me very insightful comments.

I am also thankful to the Japanese Government (MEXT) Scholarship Program for giving me financial and moral support during my Ph.D. course.

I cannot thank all the lab members enough, especially my tutor Ms. Reina Hongyo, for helping me solve problems related to daily life in Japan.

Publication List

Journals


International Conferences


4. Tran Thi Thao Nguyen, Yuhei Nagao, Leonardo Lanante, Masayuki Kurosaki, and Hiroshi Ochi, “FPGA Implementation of a MIMO Channel Emulator for the IEEE


Domestic Conferences


4. Leonardo Lanante, Tran Thi Thao Nguyen, Tatsumi Uwai, Takafumi Tomiyasu, and

Proposals for IEEE 802.11ax standard


2. Leonardo Lanante, Tran Thi Thao Nguyen, Hiroshi Ochi, Tatsumi Uwai, and Yuhei Nagao, “MAC Efficiency Gain of Uplink Multi-user Transmission,” doc.:IEEE 802.11-15/0089r1, Atlanta Georgia, USA, Jan. 2015.
