Abstract-Multiple-input multiple-output (MIMO) is an existing technique that can significantly increase throughput of the system by employing multiple antennas at the transmitter and the receiver. Realizing maximum benefit from this technique requires computationally intensive detectors which poses significant challenges to receiver design. Furthermore, a flexible detector or multiple detectors are needed to handle different configurations. Graphical Processor Unit (GPU), a highly parallel commodity programmable co-processor, can deliver extremely high computation throughput and is well suited for signal processing applications. However, careful architecture aware design is needed to leverage performance offered by GPU. We show we can achieve good performance while maintaining flexibility by employing an optimized trellis-based MIMO detector on GPU.
I. INTRODUCTION Wireless communication systems enable higher data rate services by providing higher spectral efficiency. Multiple-input multiple-output combined with orthogonal frequency division multiplexing (MIMO-OFDM) is used in current and upcoming standards such as WiMAX and 3GPP LTE. MIMO increases spectral efficiency by employing multiple antennas at the transmitter and at the receiver. OFDM divides the available bandwidth into a set of orthogonal subchannels or subcarriers. As the received signal at each antenna for each subcarrier is a signal that consists of a combination of multiple data streams from multiple transmit antennas, a higher complexity detector is required to recover the transmitted vector compared to single antenna systems. Although an exhaustive search based MIMO detector would be optimal, complexity would be prohibitive. Fortunately, a suboptimal MIMO detector can provide close to optimal performance with significantly lower complexity.
The typical suboptimal MIMO detectors are fixed ASIC designs operating with a specific configuration and a predetermined workload in mind. For example, Burg et al. [1] implemented a depth-first 4 x 4 16-QAM detector in ASIC with an average throughput of 73 Mbps at a signal-to-noise ratio (SNR) of 20 dB. Wong et ale [2] first introduced a K-best (with K=10) 4 x 4 16-QAM detector in ASIC achieving 10 Mbps. Later on Guo and Nilsson [3] developed a KSE (with K=5) 4 x 4 16-QAM detector in ASIC with a higher throughput of 53.3 Mbps. In addition, researchers have investigated many other ways of hardware implementations. Huang et al. [4] respectively. Qi and Chakrabarti [5] mapped a depth-first detector on a multi-core software radio architecture (SDR). Antikainen et al. [6] presented an application-specific instruction set processor (ASIP) implementation of K-best detector. Beside these traditional solutions, graphic processor unit (GPU) has become a viable alternative to high performance accelerators for several reasons. First, GPU delivers extremely high computation throughput by employing many cores to execute a common set of operations on a large set of data in parallel. Many communication algorithms are inherently data parallel and computationally intensive, and can take advantage of highly parallel computation offered by GPU. Second, although custom ASIC and FPGA are capable of delivering higher throughput than GPU, GPU can deliver real-time throughput and continues to grow exponentially in term of performance. Combined with the fact that these types of processors are extremely cost-effective and ubiquitous in mobile and desktop devices, communication algorithms in the future can be offloaded onto this type of processor in place of custom ASIC or FPGA. Finally, unlike ASIC, GPU is extremely flexible and can be reconfigured on the fly to handle different workloads.
However, careful architecture aware algorithm design is needed to achieve high performance on the GPU. For example, due to the limited amount of resource on GPU, such as on-chip memory and/or long latency due to software sorting [7] , many existing algorithms such as depth first sphere detector and Kbest detector do not map very efficiently onto this architecture. Furthermore, designing a detection algorithm that scales well, keeping the cores fully utilized to achieve peak throughput across different combinations of number of antennas and different modulation, is a difficult task.
The most intensive baseband processing blocks in a MIMO receiver are the detector and the channel decoder. Falcao et al. [8] presented an LDPC decoder capable of real time throughput on GPU. In this paper, we will explore a new design paradigm by also applying the GPU for real time MIMO detection. We aim to show that an optimized trellis-based MIMO detector can achieve good performance while maintaining flexibility offered by programmable hardware. In section II we will cover the system model and in section III give an overview of the detection algorithm. This is followed by a discussion on the CUDA architecture in section IV and an overview of the implementation in section V. Finally, we will give performance results in section VI and conclude in section VII. 
III. TRELLIS -BASED MIMO DETECTION
The hard decision maximum likelihood(ML) detector finds the transmit vector, s, that minimizes the euclidean distance between the received vector, 9 and the expected received vector Rs. It can be expressed as: 
B. Greedy Shortest Path Algorithm
One way of finding the most likely transmit vector is through exhaustive search. Searching through all possible transmit vectors is a time intensive process . For example, if the transmitter is utilizing 64-QAM, then the total number of possible transmit vectors is 64 4 = 16,777,216. Therefore, instead of finding the best transmit candidate by evaluating all complete paths through the trellis, a greedy shortest path algorithm [9, 10] can approximately solve the hard detection problem in this section . In this greedy algorithm, we prune the incoming paths at each vertex.
Edge reduction reduces the number of paths by pruning incoming paths. First we connect the Q surviving path, h~, ..., h'q_1'
to a vertex i.
The search space of all transmit vectors is shown in Figure I , where each transmit vector is a path through the trellis. The new cumulative weight for path k, d k , is generated by adding the path weight to vertex i, w~~>, to the old cumulative weight for path k, d~. '
Hs +w
QRs +w Rs +w The hard decision maximum likelihood(ML) detector can also be represented using a trellis graph.
A. Graph construction
Without loss of generality, we use a 4 x 4 QPSK system in this section to construct our graph . To find the transmit vector with the smallest euclidean distance, we need to calculate the euclidean distance of each transmit vector. Given a transmit vector, s, the euclidean distance is defined as:
Bestinca . 
m E{O,...,Q -l } and discard the other Q -1 subpaths.
The goal is finding the shortest path through the trellis . The search process can be expressed as a series of edge reductions. We perform edge reductions until we have completely traversed the trellis. Figure 3 shows an example of the result graph after applying edge reductions at each vertex. At stage 3, we have four surviving paths, one path per vertex at stage 3. The best path is chosen among the four remaining paths at the toor.
thread can select a set of data using its own unique ID and executes the kernel function on the set of data. Threads executes independently in this model. However, threads within a block can synchronize through a barrier and writing to shared memory. In contrast, thread blocks are completely independent and can be synchronized by terminating the kernel and writing to global memory.
During kernel execution, multiple thread blocks can be assigned to a SM and is executed concurrently. CUDA divides threads within a thread block into blocks of 32 threads . These 32 threads are executed as a group using the same common instruction, a warp instruction, at each step. A SM issues a warp instruction whose operands are ready. To mask the pipeline latency and memory stalls, a SM can switch and issue an independent warp instruction from the the same thread block or another concurrent thread block with zero-overhead. Therefore, to achieve peak performance, one would like to map many concurrent threads onto a SM. The five types of memory that a Compute Unified Device Architecture [11] is a software programming model that exposes the massive computation potential offered by the programmable GPU. The raw computation power is enabled by many cores on the programmable GPU. A GPU can have multiple stream multiprocessors (SM), where each stream multiprocessor consists of eight pipelined cores. During execution, all cores in a SM execute a single 32-bit integer or float operation on multiple data in parallel. A CUDA device has a large amount (up to 1GB) of off-chip device memory (global memory). In addition, fast on-chip resources, such as registers, shared memory and constant memory can be used in place of offchip global memory to keep the computation throughput high.
In this model , the programmer defines the kernel function, a set of common operations. At runtime, the kernel spawns a large number of threads blocks. Each thread block contains multiple threads, up to a 512 threads per thread block. Each Table I ). Before the start of the program, the host copies data from system memory into global memory and constant memory which all threads on the device can access. This is slow as the GPU is often connected to the host through PCI-express bus. Fetching data from global memory, the external DRAM, also results in a latency penalty. Registers, shared memory, and constant memory can reduce memory access time by reducing global memory access. Registers and share memory are on-chip resources. Shared memory is slower than registers, but can be accessed by threads within a thread block . However, shared memory on each SM has 16 access ports. If 16 threads, half of a warp, are scheduled to access shared memory at the same time, they must have certain conditions to allow the instruction to execute with one load or store. It takes one load or store if all threads access the same port (broadcast) or none of the threads accesses the same port. However, random layout with some broadcast and some one-to-one accesses will be serialized and cause a stall. Although constant memory resides in global memory, memory access to this memory space is cached. Like shared memory, it takes one cached read if all threads access the same location. Unlike shared memory, all other constant memory access patterns will be serialized and need global memory access.
Both shared memory and registers are divided among concurrent threads on a SM. Since there are only 16 KB of shared memory and 8196 registers per stream multiprocessor, designing an algorithm that effectively partitions registers and shared memory such that at least a few blocks can be mapped onto the same processor, each thread has an efficient memory access where Sj is the jth element of a kth subpath h~. ,.
(II) where Sj is the jth element of k t h subpath from the previous level. And can be expressed as {h~, q}.
To reduce complexity, the calculation can be done in two steps.
N -2 L R (N -l -k ,j ) Sj j =N -l -k IIYN-t-l -8k -R(N-l ,N-l)q l lĩ
nto shared memory. In the second step, the detector does edge reduction iteratively until there is one path per trellis vertex at stage N -1. Each successive iterations of edge reduction can be divided into three stages. In the first stage, each thread calculates the partial distance vector of an incoming path. This immediate result is stored into the shared memory, which is shared across all threads for the next stage. In the second stage, each thread evaluates Q possible paths by calculating Q partial distances by using vectors calculated in previous iteration. The best path, one with the lowest partial distance , is picked. In the third stage, Q best paths, one per state, are stored into shared memory for the next iteration. In the last step, our detector picks the path through the trellis that has the lowest partial distance . The algorithm is summarized in Figure 4 .
I) Calculate the Initial Partial Distance (Stage 0):
There are Q edges between the root node and vertices at stage O. For the initial calculation, thread q calculates the partial distance for a edge between the root node and qth vertex. A. Memory Utilization A warp will not execute until all operands are ready, therefore it is desirable to reduce memory access time. Threads in our detector need to fetch R , y, s , and fetch/write path history, partial euclidean vector, and cumulative euclidean distance . The trellisbased detection algorithm memory access is regular enough such that memory access often can be parallelized, leading to low latency memory access.
In our detector, shared memory stores data that needs be shared across threads: path history, partial euclidean vector, and cumulative euclidean distance . The addressing pattern for vectors in shared memory is either linear or broadcast, causing no conflicts. Inputs to the detector, channel matrix R, the received vector y, and fl, are stored in the constant memory. The memory access pattern for Rand y is broadcast in our detector for Q = 16, Q = 64 and Q = 256. For these cases, constant memory access is cached since each thread within a half of warp always access the same element in Rand y. However, a thread block for Q = 4 consists of multiple Viterbi MIMO detectors . In this case, each half warp needs simultaneous access to four different elements in R and in y. This is less efficient as the first constant memory read is cached and the subsequent memory accesses are serialized.
V. MAPPING MIMO DETECTION ON CUDA
Typically there are hundreds of subcarriers in many high data rate standards such as LTE. In our implementation, each stream multiprocessor handles a set of subcarriers independently . There is no need to synchronize across stream multiprocessors.
The algorithm proposed in section III maps efficiently onto a single multiprocessor. The data parallelism of this algorithm is Q-there are Q edge reductions we can do at each stage.
Therefore, computation can be balanced evenly across Q threads .
However, the minimum number of threads within a warp is 32. When Q = 4 and Q = 16, a warp is not completely occupied, reducing the effective throughput. To increase efficiency of the detector, we allow each thread block to consist of more than one MIMO detector, which in turn allows each thread block to have 32 threads and fully occupy one warp.
Therefore, the implemented MIMO detection kernel creates many thread blocks that executes in parallel. Depending on Q, each thread block consists of one or more MIMO detectors . Each MIMO detector, has Q threads, performs edge reduction iteratively to estimate the input vector given a unique channe l matrix and a receive vector. To keep the detectors operating at peak utilization, on-chip memory is used to reduce access penalty. In addition, computation is shared among threads to reduce complexity.
pattern is a non-trivial task. Trellis based detection algorithm works particularly well with this architecture .
B. Algorithm Implementation
The detection process can be divided into several steps. In the first step, each thread calculates the initial partial distance for a edge from root to a vertex in trellis stage O. The results are stored Thread q first calculates b q and stores the result (an intermediate partial distance vector) into shared memory. We synchronize the threads for the next step. The partial distance vector calculated is used to speed up the Q partial distance calculation for each vertex. This reduces complexity, the original algorithm requires N x Q2 complex multiplies, while the reduced complexity algorithms only require N x Q complex multiplies.
Each thread finds the the path with the minimum partial distance using edge reduction. Each thread calculates Q partial distances serially and finds the path with the minimum weight. At the end of the iteration, there are Q paths, one path per thread. The paths are written to the shared memory for the next iteration. The steps in the algorithm are summarized in algorithm 1. The GPU used is an Nvidia 9600GT graphic card, which has 64 stream processors running at 1900MHz and 512MB of DDR3 memory running at 2000 MHz. The test code first generates the random input symbols and a random channel. After passing the input symbols through this channel, it performs QRdecomposition on the channel matrix H to generate Rand y. Both Rand yare fed into the detection kernel running on GPU.
Various modulation schemes are compared: 4-QAM, 16-QAM, 64-QAM and 256-QAM. The detector kernel detects 300 symbols for each modulation. To keep utilization high, each thread block detects 16 symbols for 4-QAM, 4 symbols for 16-QAM, and 1 symbol for 64-QAM and 256-QAM. The execution time is averaged over 1000 runs. As the GPU is often connected 307 to the host through the PCI-express bus, transfer data via this bus results in a measurable and non-negligible latency penalty. Table II shows the results for 2 x 2 and 4 x 4 MIMO system with and without transport time. MIMO-OFDM is used to achieve high data rate in a real time system such as 3GPP LTE and WiMAX. The number of data subcarriers per symbol is 300 for a 5 MHz LTE MIMO system. Since each slot is 0.5ms and consists of 7 MIMO symbols, our detector needs to detect a 300 subcarrier MIMO OFDM symbol in 0.0714 ms to handle maximum throughput for a particular configuration. If we consider peak throughput without the transfer overhead, our detector can handle 4-QAM and 16-QAM for 2 x 2 and 4 x 4 5MHz LTE MIMO system. The detector can support other standards such as WiMAX by changing the number of symbols fed into the detector.
The current implementation attempts to maximize efficiency by ensuring each thread block has an integer multiple of 32 threads and employing a regular algorithm with very regular memory access. To measure the efficiency of the mapping, we can use the warp instruction count to generate an upper bound limit on execution time (without transport). The number of warp instructions executed on a SM to detect this set of MIMO symbols is recorded for each modulation using the CUDA Visual Profiler. The estimated execution time, Test, can be calculated as: (12) where F c is clock frequency, C is the number of executed warp instructions, and C P I is the average number of cycles per instruction.
In CUDA, the average CPI is 4 cycles per instruction and each SM is clocked at 1900MHz. The estimated runtime is shown in Table III . The estimated detection time is shorter than the actual execution time. This is expected since CPI of 1 corresponds to the maximum instruction throughput. For example, a MAD instruction with one shared memory operand achieves only 67% of the maximum instruction throughput [12] . Furthermore, occupancy for all four cases is below 50% due to the high number of registers each thread uses. This affects 4-QAM the most, as the additional stalls due to inefficient constant memory access pattern can not always be masked through fast thread switching.
A. Compared to ASICIFPGAIASIP
Although a conventional MIMO ASIC detector could achieve higher throughput with fewer silicon resources, it lacks the necessary flexibility to support different modulation orders and different number of antennas. Moreover, the fixed-point arithmetic employed by the ASIC has to be designed very carefully to avoid large performance degradation. For example, the internal bit width could be large due to the correlation of the channel matrices and the "colored noise". The GPU, on the other hand, will never encounter performance loss due to its floating point computation capability. Table IV compares our GPU design with state-of-the-art ASICIFPGAIASIP designs in terms of throughput. In [13] , a depth-first search detector with 256 searches per level is implemented. In [3] , a K-best detector with K == 5 and real decomposition is implemented. In [14] , a relaxed K-best detector with K == 48 is implemented. In [6] , a K-best with K == 7 detector is implemented. We also list our early ASIC design [9] based on the same trellis detection algorithm described above. As can be seen, the proposed detection algorithm is not only suitable for parallel ASIC implementation but also suitable for GPU-based parallel software implementation. Compared to ASICIFPGA solutions from [3, 13, 14] , our GPU design can achieve comparable throughput. Compared to ASIP solution from [6] , our GPU design has a higher throughput. In summary, the GPU design has more flexibility to support different MIMO system configurations and has the capability to support floatingpoint signal processing which can greatly simplify the fixed-point design issues. This paper presents a Trellis MIMO detector implementation using a floating-point GPU. The algorithm was designed to fully utilize the multiple stream processors in GPU. Compared to the conventional fixed-point VLSI implementations, the GPU based MIMO detector has more flexibility in supporting different MIMO system configurations while still maintaining good performance and achieving high throughput. The GPU based MIMO detector implementation proposed in this paper open up a new opportunity for MIMO software defined radio.
VIII. ACKNOWLEDGMENT

