This paper presents a VLSI design of a Tomlinson-Harashima (TH) precoder for multi-user MIMO (MU-MIMO) systems. The TH precoder consists of LQ decomposition (LQD), interference cancellation (IC), and weight coefficient multiplication (WCM) units. The LQD unit is based on an application specific instruction-set processor (ASIP) architecture with floating-point arithmetic for high-accuracy operations. In the IC and WCM units, which use fixed-point arithmetic, the proposed architecture employs an arrayed pipeline structure to shorten the critical-path delay of the circuit. Implementation results show that the proposed architecture reduces circuit area and power consumption by 11% and 15%, respectively.
Introduction
Multiple-input multiple-output (MIMO) systems are attracting attention. MIMO improves the speed and capacity of wireless communication by using multiple transmit and receive antennas, and it has been adopted in the IEEE 802.11n wireless LAN standard. The next-generation standard, IEEE 802.11ac, is being formulated to achieve throughputs of more than 1 Gbps and to support multi-user MIMO (MU-MIMO) to enhance communication capacity. MU-MIMO achieves larger communication capacity than single-user MIMO (SU-MIMO) by transmitting to multiple terminals in parallel. MU-MIMO requires precoding at the transmitter side to cancel interference among the receiving terminals. Precoding schemes for MU-MIMO systems are classified into linear and non-linear schemes. Linear precoding by the zero-forcing (ZF) and minimum mean square error (MMSE) methods simply multiplies the transmitted signals by precoding weights [1], [2]. Its communication quality tends to degrade under high spatial correlation among users. Non-linear precoding by vector perturbation (VP) and Tomlinson-Harashima precoding (THP) reduces transmission power by coding the signals and overcomes this weakness of linear precoding [3], [4]. Non-linear precoding offers better communication quality but requires larger computational complexity.
Hardware architectures of THP have been presented by Lin et al. [5] and Gu and Parhi [6]. Their architectures focus on single-input single-output (SISO) systems. A THP testbed using digital signal processors (DSPs) for MU-MIMO has been developed [7], but the testbed relies on a large number of DSP units, and its power consumption and LSI circuit-area efficiency have not been discussed. We present a TH precoder consisting of LQ decomposition (LQD), interference cancellation (IC), and weight coefficient multiplication (WCM) units. The LQD unit is based on an application specific instruction-set processor (ASIP) architecture with floating-point arithmetic for high-accuracy operations. It builds on the ASIP for singular value decomposition (SVD) designed in our previous work [8], which was developed for high-speed computation of SVD in SU-MIMO beamforming. In the IC and WCM units, which use fixed-point arithmetic, the conventional architecture incurs a long circuit delay in the successive interference cancellation of MIMO-THP. The proposed architecture uses an arrayed structure to achieve a shorter circuit delay than the conventional architecture. In the VLSI implementation, we show that the proposed architecture shortens the critical-path delay and reduces power consumption. This paper is organized as follows. Section 2 explains the theory of THP. Section 3 presents the design of the THP circuit. Section 4 describes the performance evaluation of the designed circuit. Section 5 concludes the paper.
Tomlinson-Harashima Precoding for MU-MIMO Systems
An MU-MIMO system with four transmit antennas and two double-antenna users, i.e., 4 × 2 MU-MIMO, is illustrated in Fig. 1. We assume that channel state information (CSI) is ideally fed back from the receivers to the transmitter. In THP, signals are transmitted after subtraction of the multi-user interference caused by the propagation channel. The channel matrix H, consisting of H_1 and H_2, is estimated at the receivers and decomposed into a lower triangular matrix L and a unitary matrix Q.
Copyright © 2013 The Institute of Electronics, Information and Communication Engineers
The transmitted signals x̃ are generated using l_ij in (1) as
The signals are transmitted from each antenna after weight coefficient multiplication by W = Q^H. The received signals ỹ through the propagation channel are expressed as
where n is white Gaussian noise. Here, the modulo operation in (2) is defined as
where M is a modulo window size. The purpose of modulo operation is to suppress amplitude of signals by trans-
The amplitude of the signals increases depending on l_ij. The power efficiency of the transmitted signals worsens if the signals are enlarged by the multiplication by l_ij. The modulo window size depends on the modulation level [2]. The Gram-Schmidt orthonormalization algorithm is adopted for the LQ decomposition (LQD). In this algorithm, orthonormal vectors are calculated from given linearly independent vectors. There are other methods for LQD, such as Givens rotation [9] and Householder transformation [10]. Compared with these methods, Gram-Schmidt orthonormalization is suitable for hardware implementation owing to its low computational complexity [11].
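The interference subtraction and modulo operation described in this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: Eqs. (2) and (4) are not reproduced in this text, so the code uses the commonly used floor-based modulo that folds the real and imaginary parts into [-M/2, M/2), and the successive subtraction weighted by l_ij / l_ii that is standard for THP. The function names `mod_window` and `th_precode` are hypothetical.

```python
import math


def mod_window(value: complex, m: float) -> complex:
    """Fold the real and imaginary parts into [-m/2, m/2) (assumed form of the modulo)."""
    def fold(v: float) -> float:
        return v - m * math.floor(v / m + 0.5)

    return complex(fold(value.real), fold(value.imag))


def th_precode(symbols, l_matrix, m):
    """Successive interference subtraction followed by the modulo operation.

    symbols  : list of complex data symbols, one per stream
    l_matrix : lower triangular matrix L as a list of rows
    m        : modulo window size M
    """
    x = []
    for i, s in enumerate(symbols):
        # Interference from the streams that have already been precoded.
        interference = sum(l_matrix[i][j] / l_matrix[i][i] * x[j] for j in range(i))
        x.append(mod_window(s - interference, m))
    return x
```

Without the modulo step, the subtraction could enlarge the signal amplitude without bound; with it, every output component stays inside the window regardless of the values of l_ij.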
We explain the Gram-Schmidt algorithm for an n × n square matrix A whose n-dimensional row vectors are linearly independent. A unitary matrix Q with row vectors q_n is generated by orthonormalizing each row vector a_n of A as
A lower triangular matrix L is calculated by multiplying A by Q^H, where Q is the matrix generated in (5), as
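As an illustration of the procedure just described, the following Python sketch orthonormalizes the rows of A with classical Gram-Schmidt and then forms L = A Q^H. It is a minimal software sketch, not the ASIP program of the paper; Eqs. (5) and (6) are not reproduced in this text, so the classical (unmodified) Gram-Schmidt recurrence is an assumption, and the function name `lq_gram_schmidt` is hypothetical.

```python
import math


def inner(u, v):
    """Inner product u v^H of two complex row vectors."""
    return sum(x * y.conjugate() for x, y in zip(u, v))


def lq_gram_schmidt(a):
    """Return (L, Q) with A = L Q, L lower triangular, rows of Q orthonormal.

    a : square matrix as a list of rows with linearly independent rows
    """
    n = len(a)
    q = []
    for row in a:
        u = list(row)
        # Remove the components along the rows that are already orthonormalized.
        for qj in q:
            proj = inner(row, qj)
            u = [ui - proj * qjk for ui, qjk in zip(u, qj)]
        norm = math.sqrt(sum(abs(ui) ** 2 for ui in u))
        q.append([ui / norm for ui in u])
    # L = A Q^H; entries above the diagonal vanish by orthogonality, so they are set to zero.
    l = [[inner(a[i], q[j]) if j <= i else 0j for j in range(n)] for i in range(n)]
    return l, q
```

Multiplying the returned L by Q reproduces A up to rounding error, which is a convenient check when comparing a fixed- or floating-point hardware model against a software reference.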
VLSI Design

Overall Structure
The TH precoder consists of the LQ decomposition (LQD), interference cancellation (IC), and weight coefficient multiplication (WCM) units, which are illustrated in Fig. 2. The LQD unit executes the Gram-Schmidt algorithm in Eqs. (5) and (6) with floating-point arithmetic because the matrix decomposition requires high accuracy and a large dynamic range. The IC unit performs the interference cancellation in Eq. (2), and the WCM unit performs the matrix multiplication Q^H x̃ in Eq. (3). The timing chart and the packet format of the TH precoder are illustrated in Fig. 3. The units have very different requirements for calculation accuracy and throughput. The throughput requirement of the LQD unit is not high because the CSI does not change frequently. A CSI update interval of 20 ms is reported by Shapira and Shany [12]. On the other hand, the IC and WCM units require a throughput as high as the baseband symbol rate. Since the IEEE 802.11ac standard supports 160-MHz channel utilization, a symbol rate of 160 Msymbols/s is required for real-time precoding; this processing corresponds to the data symbols in Fig. 3.
To reduce the workload of the IC and WCM units, we apply pre-computation to the lower triangular matrix L in Eq. (1). The divisions in Eq. (2) are moved from the IC unit to the LQD unit as
According to Eq. (7), Eq. (2) is rewritten as 
Since the frequency of updating the CSI is low, the update frequency of L is also low. The precomputation in Eq. (7) can be performed by the LQD unit.
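The effect of this pre-computation can be illustrated with a short Python sketch. It assumes, since Eqs. (7) and (8) are not reproduced in this text, that the pre-computed coefficients are the ratios l_ij / l_ii for j < i; the name `b` for these coefficients and the function names are hypothetical. The division is performed once per CSI update, and the per-symbol loop contains only multiplications, subtractions, and the modulo operation.

```python
import math


def precompute_coefficients(l_matrix):
    """Once per CSI update (LQD side): b[i][j] = l[i][j] / l[i][i] for j < i."""
    return [[l_matrix[i][j] / l_matrix[i][i] for j in range(i)]
            for i in range(len(l_matrix))]


def fold(v, m):
    """Floor-based modulo into [-m/2, m/2) (assumed form of the modulo operation)."""
    return v - m * math.floor(v / m + 0.5)


def interference_cancel(symbols, b, m):
    """Every symbol (IC side): multiply-subtract and modulo only, no division."""
    x = []
    for i, s in enumerate(symbols):
        t = s - sum(b[i][j] * x[j] for j in range(i))
        x.append(complex(fold(t.real, m), fold(t.imag, m)))
    return x
```

In this split, the costly divisions run at the CSI update rate, while the symbol-rate path needs only multipliers and subtracters.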
ASIP Implementation of LQD Unit
The LQD unit is based on an application specific instruction-set processor (ASIP) architecture, which is illustrated in Fig. 4. We utilize the ASIP designed by Iwaizumi et al. [8], which provides high-speed computation of SVD in SU-MIMO beamforming. The same arithmetic operation units and instruction set are effective for the LQ decomposition. The data and instructions are stored in separate memories, and the processing unit executes the instructions in order. The floating-point units (FPUs) in the processing unit handle IEEE 754 single-precision floating-point numbers. The circuit structure of the processing unit is illustrated in Fig. 5. Each FPU supports four types of arithmetic operations: addition, subtraction, multiplication, and division. The processing unit can execute complicated processing such as complex and accumulative operations by combining the eight FPUs. The four FPUs in the first stage are used for one complex and two real multiplications; a complex multiplication is executed by combining these FPUs. Since all the FPUs are pipelined, the cycles per instruction (CPI) of this processor approaches one as the block data size in pipeline processing increases. Table 1 enumerates the supported instructions. Each instruction consists of the memory addresses of input data A and B and output data C, and the operation type, as illustrated in Fig. 6. Each address field is log2 N bits long and the operation-type field is log2 N_op bits long, where N is the number of data memory words and N_op is the number of instructions. Dedicated high-speed division and square-root operation units, denoted by "FDIV" and "FQRT", are also implemented. By reducing computation cycles, these units reduce the total dissipated energy at the cost of a larger circuit area. Table 2 shows the circuit performance of the LQD unit, where N = 2,048 and N_op = 256 are set in the processor specification. The LQD unit has been synthesized with a 90-nm CMOS standard cell library at a supply voltage of 1.0 V. We set the clock frequency to 400 MHz.
This evaluation includes not only the LQ decomposition but also the precomputation of the divisions used in the IC unit of the proposed architecture. For the 160-MHz channel utilization in IEEE 802.11ac, the number of channel matrices (corresponding to OFDM data subcarriers) is 480. The total calculation time is 2.5 ns × 232.52 × 480 = 0.279 ms. This time is much shorter than the CSI update interval of 20 ms. This indicates that the LQD unit can provide real-time processing including the precomputation of division.
Table 1 (partially recovered): the supported instructions include conversion from integer to floating-point format and conversion from floating-point format to integer.
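The real-time margin of the LQD unit stated above can be checked with a few lines of arithmetic. The sketch below only reuses the figures quoted in the text (400-MHz clock, the per-matrix factor 232.52, 480 channel matrices, 20-ms CSI update interval); interpreting 232.52 as a per-matrix cycle count is an assumption, since Table 2 is not reproduced here, and the variable names are hypothetical.

```python
CLOCK_HZ = 400e6                 # LQD clock frequency from the text
CLOCK_PERIOD_S = 1.0 / CLOCK_HZ  # 2.5 ns
FACTOR_PER_MATRIX = 232.52       # figure quoted in the text (assumed: cycles per matrix)
NUM_MATRICES = 480               # channel matrices for the 160-MHz channel
CSI_UPDATE_INTERVAL_S = 20e-3    # CSI update interval [12]

# Total LQD processing time for all channel matrices of one CSI update.
total_time_s = CLOCK_PERIOD_S * FACTOR_PER_MATRIX * NUM_MATRICES

# Fraction of the CSI update interval occupied by the LQD unit.
occupancy = total_time_s / CSI_UPDATE_INTERVAL_S

print(f"total LQD time: {total_time_s * 1e3:.3f} ms")
print(f"share of CSI update interval: {occupancy * 100:.2f} %")
```

Under these figures the LQD unit is busy for well under 2% of each CSI update interval, which is the margin the text relies on.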
Conventional Architecture
The conventional architecture is a straightforward implementation of Eq. (8). The structure of the conventional IC unit is shown in Fig. 7. The IC unit has operation blocks "M1S1" and "M2S2" consisting of multipliers and subtracters. The modulo block "Mod" performs arithmetic and floor operations in accordance with Eq. (4). The detailed structure of the modulo operation block is shown in Fig. 8. The conventional architecture of the WCM unit is illustrated in Fig. 9. The matrix operation Q^H x̃ is performed by complex multiplications and additions. The block "A4" has four input ports and produces its result on one output port. The drawback is that the IC unit suffers from a long critical path to compute x̃_4, because the computation of x̃_4 requires the outputs x̃_1, x̃_2, and x̃_3. This long path either lowers the operating clock frequency or increases circuit area and power consumption, because logic synthesis then introduces many parallel structures of logic gates.
Proposed Architecture
The proposed IC unit with an arrayed structure is illustrated in Fig. 10. Three registers are inserted to shorten the critical path of the conventional IC unit. Due to these registers, the outputs x̃_2, x̃_3, and x̃_4 are delayed by clock cycles. In the WCM unit, the structure of which is shown in Fig. 11, we use the different timings at which x̃_1 to x̃_4 are generated. The timing charts of the conventional and the proposed architectures are compared in Fig. 12. Unlike the conventional architecture, the proposed architecture has a pipeline latency of three cycles. However, the throughput, which is equal to the sampling rate, does not change. Table 3 shows the circuit performance of the conventional and the proposed architectures in the IC and the WCM units. The IC and WCM units use 15-bit fixed-point arithmetic units. The target clock frequency is set to 160 MHz for all the units in logic synthesis and power consumption measurement. The gate-level power was measured using Synopsys Power Compiler at a supply voltage of 1.0 V. The proposed architecture exhibits smaller circuit area and power consumption because the conventional architecture needs many parallel structures of logic gates in logic synthesis to reduce the critical-path delay. The power consumption of the LQD unit is larger than the sum of those of the IC and the WCM units. However, the LQD unit has a much shorter computation time than the IC and the WCM units, as shown in Table 2 and Table 3. We assume that the processing time of the IC and the WCM units in Fig. 3 occupies 50% of the CSI update interval. Hence, the LQD unit consumes much less energy than the IC and the WCM units.
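The timing behaviour of the arrayed structure can be illustrated with a small cycle-level behavioural model in Python. This is a sketch under assumptions, not the circuit of Fig. 10: stage k is assumed to process symbol vector n at cycle n + k, using only outputs that were registered in earlier cycles, and the floor-based modulo and the pre-computed coefficients `b` are the same assumptions as in standard THP. All names are hypothetical.

```python
import math


def fold(v, m):
    """Floor-based modulo into [-m/2, m/2) (assumed form of the modulo operation)."""
    return v - m * math.floor(v / m + 0.5)


def stage(k, s_k, x_prev, b, m):
    """One IC stage: subtract the interference of streams 0..k-1, then apply the modulo."""
    t = s_k - sum(b[k][j] * x_prev[j] for j in range(k))
    return complex(fold(t.real, m), fold(t.imag, m))


def simulate_arrayed_ic(vectors, b, m):
    """Return (results, ready_cycle) for a stream of symbol vectors.

    In every cycle each stage works on a different symbol vector, so a stage
    depends only on values produced in earlier cycles (i.e. held in registers).
    """
    n_streams = len(b)
    results = [[None] * n_streams for _ in vectors]
    ready_cycle = [[None] * n_streams for _ in vectors]
    for cycle in range(len(vectors) + n_streams - 1):
        for k in range(n_streams):
            n = cycle - k  # symbol vector handled by stage k in this cycle
            if 0 <= n < len(vectors):
                results[n][k] = stage(k, vectors[n][k], results[n], b, m)
                ready_cycle[n][k] = cycle
    return results, ready_cycle
```

In this model the last output of a four-stream vector appears three cycles after the first one, yet a new symbol vector still enters every cycle, which matches the latency and throughput behaviour described in the text.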
Evaluation
Conclusion
We presented an arrayed pipelined TH precoder consisting of the LQD, the IC, and the WCM units for MU-MIMO systems. The LQD unit is designed using an ASIP architecture. The proposed architecture of the IC and the WCM units shortens the critical path and reduces circuit area and power consumption by 11% and 15%, respectively.
