Lattice reduction (LR) is a preprocessing technique for multiple-input multiple-output (MIMO) symbol detection that achieves better bit error-rate (BER) performance. In this paper, we propose a customized homogeneous multiprocessor for LR. Each individual core is based on the transport triggered architecture (TTA). We propose a few modifications of the popular LR algorithm, Lenstra-Lenstra-Lovász (LLL), for high throughput. High level programming is used to implement the control path of the TTA cores, and several special function units are designed to accelerate the program. The multiprocessor takes 187 cycles to reduce a single matrix. The architecture is synthesized on 90 nm technology and takes 405 kgates at 210 MHz.
I. INTRODUCTION
Multiple-input multiple-output (MIMO) is a key technology to utilize the available radio spectrum efficiently. The basic idea of MIMO is to send multiple independent data streams from multiple antennas in the same frequency band. A MIMO detector separates these independent streams at the receiver to identify the symbols that are being transmitted. Maximum likelihood (ML) is the optimal solution for the MIMO detection problem if all the transmitted symbol vectors are equally likely. However, the straightforward implementation of ML by an exhaustive search over all the possible transmitted symbol vectors can lead to prohibitive complexity for higher modulation orders and antenna dimensions. Linear detection methods are widely used for practical implementations. The linear MIMO detection algorithms are less complex, but suffer from a degraded bit error-rate (BER) performance.
Lattice reduction (LR) is a preprocessing technique that can be used with linear detection to significantly improve the BER performance and reduce the gap between the traditional linear detectors and the optimal ML detector. LR transforms the MIMO channel matrix into a near orthogonal matrix and thus helps to achieve a better BER performance. The most widely used LR algorithm is the Lenstra-Lenstra-Lovász (LLL) algorithm, named after its inventors [1]. The LLL algorithm poses many challenges due to its nondeterministic execution time and high computational complexity. We propose a few modifications of the original LLL algorithm in the complex domain to facilitate the hardware implementation and call the result modified LLL (MLLL) throughout the paper. We demonstrate by Matlab simulation that the BER performance loss of the MLLL is negligible compared to the original LLL.
There are several hardware accelerators proposed in [4] [5] [6] [7] for different LR algorithms. The fixed hardware implementations provide high data rate and consume less silicon area compared to the customized application specific processors (ASIP). The drawback of the fixed hardware implementation is that it operates only on a fixed set of parameters due to the hardwired control path and it is not possible to modify the control path in the future. An ASIP customized for a small set of algorithms is an attractive solution in terms of cost, silicon area and high throughput. Most importantly, an ASIP reduces the design risk with an instruction memory that can be used to load new programs or control instructions. The control instructions can be easily obtained by a retargetable compiler for that particular customized architecture.
A customized very long instruction word (VLIW) processor is implemented in [8] for LR. We take a different approach and design a customized multiprocessor based on the transport triggered architecture (TTA) paradigm. TTA is a processor design philosophy where the programmer can control the internal data transports between the different function units of the processor. TTA exploits instruction level parallelism (ILP) by processing several instructions in a single clock cycle. The TTA based codesign environment (TCE) tool is used in this work to design the TTA processor cores. TCE enables the designer to write an application in a high level language and design the target processor in a graphical user interface at the same time. A more detailed explanation of TCE and TTA can be found in [9], [10] and [11]. In this work, every core of the proposed multiprocessor is programmed in the C language to shorten the time-to-market. The multiprocessor takes 187 cycles and achieves a maximum clock frequency of 210 MHz on 90 nm technology. To our knowledge, this is the first TTA based customized architecture for LR.
II. SYSTEM MODEL

A. Conventional MIMO Detection
Consider a MIMO system consisting of $M_T$ transmit antennas, which send data over the channel, and $N_R$ receive antennas, which receive the transmitted signal from the channel. The modulation scheme used here is quadrature amplitude modulation (QAM), and we assume $N_R \geq M_T$. The received signal can be represented as
$$y = Hx + n, \tag{1}$$
where $y \in \mathbb{C}^{N_R}$ is the received signal vector, $x \in \mathbb{C}^{M_T}$ is the transmit symbol vector, $H \in \mathbb{C}^{N_R \times M_T}$ is the channel matrix and $n \in \mathbb{C}^{N_R}$ is the circularly symmetric complex white Gaussian noise vector with zero mean and variance $\sigma^2$.
At the receiver, the linear zero forcing (ZF) detector applies the pseudoinverse of the channel matrix to the received vector to estimate the transmitted symbol vector, which can be expressed as
$$\hat{x}_{ZF} = H^{\dagger} y, \tag{2}$$
where $H$ is the channel matrix and $(\cdot)^{\dagger}$ denotes the pseudoinverse. Typically, the channel matrix $H$ is QR decomposed into two parts as $H = QR$. Here $Q \in \mathbb{C}^{N_R \times N_R}$ denotes a unitary matrix and $R \in \mathbb{C}^{N_R \times N_R}$ denotes an upper triangular matrix.
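The ZF step described above can be illustrated with a minimal sketch in plain Python. It assumes two transmit antennas, unnormalized 16-QAM symbols with levels {−3, −1, 1, 3}, and a hand-picked channel and noise vector; the helper names (`pinv_two_columns`, `slice_16qam`, `zf_detect`) are hypothetical and not taken from the paper.

```python
# Minimal ZF detection sketch for a system with two transmit antennas.
# All names and numeric values are illustrative assumptions.

def conj_transpose(A):
    """Hermitian transpose of a matrix stored as a list of rows."""
    return [[A[r][c].conjugate() for r in range(len(A))]
            for c in range(len(A[0]))]

def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def inv2(A):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def pinv_two_columns(H):
    """Left pseudoinverse (H^H H)^-1 H^H for a full-rank H with two columns."""
    Hh = conj_transpose(H)
    return matmul(inv2(matmul(Hh, H)), Hh)

def slice_16qam(v):
    """Map a complex value to the nearest unnormalized 16-QAM point."""
    levels = (-3, -1, 1, 3)
    nearest = lambda t: min(levels, key=lambda p: abs(p - t))
    return complex(nearest(v.real), nearest(v.imag))

def zf_detect(H, y):
    """Apply the pseudoinverse of H to y, then slice each entry."""
    x_hat = matmul(pinv_two_columns(H), [[v] for v in y])
    return [slice_16qam(row[0]) for row in x_hat]

# Example channel, symbols and noise (all assumed for illustration).
H = [[1 + 1j, 0.5 + 0j], [0.2j, 1 - 0.5j]]
x = [3 - 1j, -1 + 3j]
n = [0.05 - 0.02j, -0.03 + 0.04j]
y = [sum(h * s for h, s in zip(row, x)) + w for row, w in zip(H, n)]
print(zf_detect(H, y))
```

With a well-conditioned channel and small noise, the sliced estimate matches the transmitted vector; the BER degradation of ZF mentioned in the introduction appears when the columns of $H$ are strongly correlated and the pseudoinverse amplifies the noise.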
B. Lattice Reduction
A lattice is a periodic arrangement of discrete points. A lattice can be characterized in terms of a set of basis vectors, where any point of the lattice can be represented by a superposition of integer multiples of the basis vectors. For simplicity, we call the set $B = (b_1, b_2, \ldots, b_n)$ the basis of the lattice.
A complex valued lattice in the $n$-dimensional complex space $\mathbb{C}^n$ can be defined as
$$L = \{\upsilon \mid \upsilon = B\omega\}, \tag{3}$$
where $B$ is the basis of the lattice and $\omega = [\omega_1, \omega_2, \ldots, \omega_n]$. Note that in (3), $\upsilon$, $\omega$ and the matrix $B$ can be replaced with $y$, $x$ and $H$ respectively to obtain $L = \{y \mid y = Hx\}$. In this case, the lattice $L$ is the set of all possible undisturbed received signal points. There are many bases that span $L$, and the aim of the LR algorithm is to find a basis with the shortest and least correlated basis vectors [12].
C. LR-based MIMO Detection
LR finds an improved basis for the lattice induced by the channel. The original basis and the reduced basis are related by a unimodular matrix $T$. Therefore, LR aided detection detects the symbols in the reduced basis and afterwards transforms the result back to the original lattice. The new channel matrix after LR can be expressed as $\tilde{H} = HT$, and the transmitted signal is correspondingly treated as multiplied by $T^{-1}$, that is, $z = T^{-1}x$ in the reduced basis. The received signal $y = Hx + n$ can be expressed as
$$y = HTT^{-1}x + n = \tilde{H}z + n. \tag{4}$$
LR aided detection operates on $\tilde{H}$ and $z$ instead of $H$ and $x$. The LR aided ZF detector can be expressed as
$$\hat{z}_{ZF} = \tilde{H}^{\dagger} y. \tag{5}$$
The LR algorithm is applied to the QR decomposed $H$ to obtain the modified $\tilde{Q}$ and $\tilde{R}$. Afterwards, the lattice reduced channel matrix can be obtained as $\tilde{H} = \tilde{Q}\tilde{R}$.
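The relation between the original and the reduced basis can be checked numerically. The following sketch assumes a hand-picked $2 \times 2$ channel and a hand-picked unimodular matrix $T$ with Gaussian-integer entries (it is not the output of any LR algorithm), and Gaussian-integer symbols so that plain rounding can serve as the slicer. It verifies that $\tilde{H}z = Hx$ and that detecting in the reduced basis and multiplying by $T$ recovers $x$.

```python
# LR-aided ZF on a toy 2x2 example. H, T, x and n are illustrative
# assumptions; T is chosen by hand and has determinant 1.

def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def inv2(A):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def round_gaussian(v):
    """Round real and imaginary parts to the nearest integers."""
    return complex(round(v.real), round(v.imag))

H = [[1 + 0.5j, 1.2 + 1.4j], [0.1j, 0.6 - 0.2j]]
T = [[1, -1 - 1j], [0, 1]]        # unimodular: Gaussian-integer entries, det = 1
T_inv = [[1, 1 + 1j], [0, 1]]     # inverse is again a Gaussian-integer matrix

H_red = matmul(H, T)              # reduced channel, H~ = H T
x = [[1 + 2j], [-1 + 1j]]         # transmitted symbols (column vector)
z = matmul(T_inv, x)              # same point expressed in the reduced basis

n = [[0.02 - 0.01j], [-0.01 + 0.02j]]
y = [[a[0] + b[0]] for a, b in zip(matmul(H, x), n)]

# Detect z with ZF on the reduced channel, then map back with T.
z_hat = [[round_gaussian(r[0])] for r in matmul(inv2(H_red), y)]
x_hat = matmul(T, z_hat)
print(z_hat, x_hat)
```

In this example the second column of $\tilde{H}$ is shorter than the second column of $H$, which is the effect LR aims for; a real system would obtain $T$ from the reduction algorithm rather than by hand.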
III. LATTICE REDUCTION ALGORITHM
The LLL algorithm is widely used to compute a suitable unimodular matrix $T$ and to obtain a reduced lattice basis. LLL was originally proposed for real valued LR [1]. However, the channel matrix is naturally complex valued, and therefore the complex version of LLL (CLLL) is used to reduce the complexity.
The CLLL algorithm suffers from irregular dataflow, which eventually leads to higher latency. Therefore, a fixed-complexity LLL (fcLLL) algorithm is proposed in [2]. The fcLLL alters the signal flow of the CLLL to follow a deterministic structure. It is also possible to use the less complex Siegel condition instead of the Lovász condition [3]. Finally, it is very important to use an early termination mechanism to meet the strict requirements. Applying all these modifications, we propose a modified-LLL (MLLL) algorithm for LR with less complexity and negligible BER performance loss. The MLLL implemented in this paper is summarized in Algorithm 1.
Algorithm 1 Modified CLLL Algorithm (MLLL)
The BER performance of the traditional ZF, the original CLLL aided ZF, the MLLL aided ZF and the optimal ML detectors is simulated for various signal-to-noise ratio (SNR) values in a Matlab simulator. An additive white Gaussian noise (AWGN) channel with 16-QAM modulation is used, and the BER is averaged over 10 000 Monte-Carlo trials. Fig. 1 shows the results for the MLLL algorithm with 5 iterations. It can be seen that the performance loss is negligible compared to the original CLLL algorithm.
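Two building blocks named in this section, the size reduction and the Siegel test, can be sketched as follows. This is only an illustration under stated assumptions and not a reproduction of Algorithm 1: the function names are hypothetical, the Siegel test is written in the generic form "swap if $\delta\,|r_{k-1,k-1}|^2 > |r_{k,k}|^2$", and the default $\delta = 0.75$ is an assumption chosen to match the 0.75 factor mentioned for the SIEGEL unit later in the paper. The column swap and the rotation that restores the triangular form are omitted.

```python
# Size reduction and Siegel test on an upper triangular R (list of rows).
# Names, the form of the test and delta = 0.75 are assumptions.

def round_gaussian(v):
    """Round real and imaginary parts to the nearest integers."""
    return complex(round(v.real), round(v.imag))

def size_reduce(R, T, k):
    """Subtract mu times column k-1 from column k of R and T, in place."""
    mu = round_gaussian(R[k - 1][k] / R[k - 1][k - 1])
    if mu != 0:
        for row in R:
            row[k] -= mu * row[k - 1]
        for row in T:
            row[k] -= mu * row[k - 1]
    return mu

def siegel_swap_needed(R, k, delta=0.75):
    """Generic Siegel-type test on two neighbouring diagonal entries."""
    return delta * abs(R[k - 1][k - 1]) ** 2 > abs(R[k][k]) ** 2

R = [[2 + 0j, 3.4 - 1.8j], [0j, 1 + 0j]]
T = [[1 + 0j, 0j], [0j, 1 + 0j]]      # starts as the identity

mu = size_reduce(R, T, 1)
print(mu, R[0][1], T[0][1], siegel_swap_needed(R, 1))
```

Because every update of $T$ adds a Gaussian-integer multiple of one column to another, $T$ stays unimodular throughout.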
IV. TTA MULTIPROCESSOR FOR MLLL
A. Special Function Units
Six special function units (SFUs) are designed and written in VHDL to accelerate each iteration of the MLLL algorithm. One SFU supports the complex multiplication (CMUL) operation. Two SFUs, for the μ calculation and the size reduction, are designed according to [8]. These SFUs are single-cycle and multiplier-less. The Matlab simulations show that the value of μ lies in the range [−4, 4], and thus the dynamic range of these SFUs is set accordingly. Another simple SFU computes the SIEGEL criterion. Instead of multiplying the input by 0.75, the SIEGEL SFU calculates the value with a combination of two shifters and one adder.
The most complex SFU in the datapath of a single TTA core is the CORDIC SFU. A master-slave CORDIC is considered in this work [4]. The master-slave CORDIC is a combination of two CORDIC blocks, in vectoring mode and rotation mode respectively. By setting the inputs of the rotation-mode CORDIC to 1 and 0, it is possible to calculate the cosine and sine values directly. Therefore, the angle calculation done in a conventional CORDIC block is not needed here. In every stage, the signum values are computed and the rotation-mode block adds or subtracts accordingly.
As we need a 16-bit CORDIC, there are two straightforward options to design it. The first is an iterative CORDIC that uses registers and iterates 16 times over a 1-stage datapath. However, it takes 16 clock cycles to compute the output, and for a processor based implementation a 16-cycle SFU is inconvenient, as there will be 15 NOP operations in the assembly code. The second is to fully unroll the CORDIC block without any registers, but then the critical path of the CORDIC block becomes too long. We find a compromise between the two approaches and design a 4-stage CORDIC datapath that is reused four times to create a 4-cycle master-slave CORDIC. The block diagram of the master-slave CORDIC is presented in Fig. 2, where the ellipse contains a single stage of the datapath.
Finally, an ARRANGE SFU is designed to rearrange the 32-bit variables.
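The shift-and-add idea of the SIEGEL unit and the master-slave CORDIC principle can be illustrated with a short behavioral sketch. It is a floating-point model under stated assumptions, not the 16-bit fixed-point VHDL of the paper: the function names are hypothetical, the input is assumed to have a positive real part, the slave is made to rotate in the opposite direction to the master so that it ends at $(\cos\theta, \sin\theta)$, and the CORDIC gain is removed by pre-scaling the slave input. The nested loops mirror four passes over a 4-stage datapath.

```python
import math

# Behavioral sketch only; names and floating-point arithmetic are assumptions.

def three_quarters(v):
    """0.75 * v for a non-negative integer, using two shifts and one add."""
    return (v >> 1) + (v >> 2)      # v/2 + v/4, each shift truncates

STAGES_PER_PASS = 4                  # stages in the datapath
PASSES = 4                           # times the datapath is reused
N_ITER = STAGES_PER_PASS * PASSES    # 16 micro-rotations in total
GAIN = math.prod(math.sqrt(1 + 2.0 ** (-2 * i)) for i in range(N_ITER))

def master_slave_cordic(x, y):
    """Return (magnitude, cos, sin) for the vector (x, y) with x > 0."""
    cx, sy = 1.0 / GAIN, 0.0         # slave starts at (1, 0), pre-scaled
    for p in range(PASSES):
        for s in range(STAGES_PER_PASS):
            i = p * STAGES_PER_PASS + s
            t = 2.0 ** -i
            d = -1.0 if y >= 0 else 1.0            # signum decided by the master
            x, y = x - d * y * t, y + d * x * t    # master: drive y towards 0
            cx, sy = cx + d * sy * t, sy - d * cx * t  # slave: opposite rotation
    return x / GAIN, cx, sy

mag, c, s = master_slave_cordic(3.0, 4.0)
print(three_quarters(64), mag, c, s)
```

No angle is ever accumulated: the slave only consumes the per-stage signs produced by the master, which is why the angle computation of a conventional CORDIC can be dropped.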
B. High Level Architecture of the multiprocessor
A 32-bit fixed point TTA processor is designed to support a single iteration of the MLLL algorithm, and five of these TTA cores are connected in a pipelined fashion to compute the LR matrix. Part of a single TTA processor core is illustrated inside the dotted block of Fig. 3. For readability, the whole processor is not shown. The blocks in the upper part of the core represent the function units and register files of the processor. The black horizontal straight lines represent the buses of the processor. The vertical rectangular blocks represent the sockets. Each core includes the load/store unit (LSU), arithmetic logic unit (ALU), global control unit (GCU), register files, several conventional function units and SFUs. The Q, R and T matrices are read from three separate first-in-first-out (FIFO) memory buffers by using the function units called STREAM. The STREAM units can read one input sample per clock cycle. Three STREAM units are used to read the inputs simultaneously. Three STREAM units are likewise used to write the outputs to the memory buffers.
Ten register files are used to save the intermediate results. A single boolean register file is included in the processor design. When the registers are not enough, the processor is able to access the data memory through the LSU to temporarily store data. The SFUs can be called by macros to accelerate the program code.
Eight buses are used in a single core, and therefore the core is able to process eight instructions in a single clock cycle. The connection between function units and buses is illustrated by black spots in the sockets of Fig. 3. The cores are connected to one another by FIFO buffer memories. In this way, five iterations of MLLL can be processed in parallel. The cores are identical, and all cores except the first use the same program image.
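The pipelined arrangement of the cores can be pictured with a toy behavioral model. The sketch below is an assumption-laden illustration, not a cycle-accurate model: each core is taken to need one abstract time step per matrix, the FIFOs are unbounded, and the `iteration` callback is a hypothetical stand-in for one MLLL iteration.

```python
from collections import deque

NUM_CORES = 5  # one core per MLLL iteration, as in the text

def run_pipeline(matrices, iteration):
    """Push items through NUM_CORES stages linked by FIFOs.

    Returns the outputs in order and the number of busy cores per step.
    """
    fifos = [deque(matrices)] + [deque() for _ in range(NUM_CORES)]
    busy = []
    while len(fifos[-1]) < len(matrices):
        active = 0
        # Visit the last core first so an item advances one core per step.
        for core in reversed(range(NUM_CORES)):
            if fifos[core]:
                item = fifos[core].popleft()
                fifos[core + 1].append(iteration(item, core))
                active += 1
        busy.append(active)
    return list(fifos[-1]), busy

# Stand-in for one MLLL iteration: record which core touched the item.
tag_core = lambda item, core: item + [core]

outputs, busy = run_pipeline([[m] for m in range(7)], tag_core)
print(outputs[0], busy)
```

Once the pipeline is filled, all five cores work on different matrices in the same step, which is the sense in which five iterations are processed in parallel.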
V. RESULTS AND DISCUSSION
The TCE tool reports that the TTA multiprocessor takes 187 cycles to compute the MLLL algorithm. The multiprocessor is synthesized in a 90 nm CMOS process. The maximum clock frequency achieved during synthesis is 210 MHz. The total gate count of the multiprocessor at 210 MHz is around 405 kgates.
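As a quick worked check of these figures, and assuming the 187 cycles are executed at the maximum clock frequency of 210 MHz, the time needed to reduce one matrix is approximately
$$t_{LR} = \frac{187}{210 \times 10^{6}\ \mathrm{Hz}} \approx 0.89\ \mu\mathrm{s}.$$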
A comparison with other implementations of LR is presented in Table I. Two important VLSI architectures for the LR algorithm, with low latency and area, can be found in [4] and [6]. The authors implemented the reverse-Siegel LLL (RS-LLL) and the hardware-optimized LLL (HOLL) in [4] and [6] respectively. A VLSI architecture for Clarkson's algorithm is provided in [5]. That architecture provides less throughput than ours despite using a hardwired control path. The latency of the VLSI architecture of [7] is the lowest, but at the price of a very low maximum achievable clock frequency of 37 MHz. Though most of the VLSI implementations take fewer cycles and less area, these architectures suffer from inflexibility, and as a consequence later field updates are not possible. As different variants of the LLL algorithm have been proposed in the literature, a flexible implementation is a necessity. Our customized multiprocessor is an example of such a flexible implementation with moderate latency and cost. It is possible to support different variants of the LLL algorithm by updating the instruction memory with a new binary program image. The updated binary instructions can be obtained by compiling the other LLL algorithms for our particular architecture with the help of a retargetable compiler.
The programmable VLIW core of [8] takes fewer clock cycles and is flexible. That implementation includes not only LR but also QR decomposition and detection. However, the amount of area needed for the LR alone is not clear. The total area is very high compared to the other implementations, even at 40 nm technology.
A VLSI architecture of another lattice reduction algorithm called Seysen's algorithm (SA) is presented in [13] . A unified architecture for SA and LLL can lead to an interesting research direction.
VI. CONCLUSION
