130 research outputs found

    High Throughput VLSI Architecture for Soft-Output MIMO Detection Based on A Greedy Graph Algorithm

    Maximum-likelihood (ML) decoding is a computationally intensive task in multiple-input multiple-output (MIMO) wireless channel detection. This paper presents a new graph-based algorithm that achieves near-ML performance for soft MIMO detection. Instead of using the traditional tree-search structure, we represent the search space of the MIMO signals with a directed graph and apply a greedy algorithm to compute the a posteriori probability (APP) of each transmitted bit. The proposed detector has two advantages: 1) it keeps a fixed throughput and has a regular, parallel datapath structure, which makes it amenable to high-speed VLSI implementation, and 2) it attempts to maximize the a posteriori probability by making the locally optimal choice at each stage, in the hope of finding the global minimum Euclidean distance for every transmitted bit x_k ∈ {-1, +1}. Compared to the soft K-best detector, the proposed solution significantly reduces complexity because sorting is not required, while still maintaining good bit error rate (BER) performance. The proposed greedy detection algorithm has been designed and synthesized for a 4 x 4 16-QAM MIMO system in a TSMC 65 nm CMOS technology. The detector achieves a maximum throughput of 600 Mbps with a 0.79 mm² core area. Sponsors: Nokia Corporation; National Science Foundation.

    High-Throughput Soft-Output MIMO Detector Based on Path-Preserving Trellis-Search Algorithm

    In this paper, we propose a novel path-preserving trellis-search (PPTS) algorithm and its high-speed VLSI architecture for soft-output multiple-input multiple-output (MIMO) detection. We represent the search space of the MIMO signal with an unconstrained trellis, where each node in a stage of the trellis maps to a possible complex-valued symbol transmitted by the corresponding antenna. Based on the trellis model, we convert the soft-output MIMO detection problem into a multiple-shortest-paths problem, subject to the constraint that every trellis node must be covered by this set of paths. The PPTS detector is guaranteed to have soft information for every possible symbol transmitted on every antenna, so the log-likelihood ratio (LLR) for each transmitted data bit can be formed more accurately. Simulation results show that the PPTS algorithm can achieve near-optimal error performance with a low search complexity. The PPTS algorithm is a hardware-friendly, data-parallel algorithm because the search operations are evenly distributed among multiple trellis nodes for parallel processing. As a case study, we have designed and synthesized a fully parallel systolic-array detector and two folded detectors for a 4x4 16-QAM system using a 1.08 V TSMC 65-nm CMOS technology. With a 1.18 mm² core area, the folded detector can achieve a throughput of 2.1 Gbps. With a 3.19 mm² core area, the fully parallel systolic-array detector can achieve a throughput of 6.4 Gbps.
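To illustrate why having a path metric for every symbol on every antenna helps, here is a minimal sketch of a max-log LLR computation for one antenna. It is not taken from the paper: the function name `max_log_llrs`, the sign convention (positive values favor bit 1), the division by `noise_var`, and the QPSK-style bit labels are all assumptions made for illustration.

```python
def max_log_llrs(min_dist, bit_labels, noise_var=1.0):
    """Max-log LLRs for the bits carried by one antenna (illustrative sketch).

    min_dist[s]   : smallest path metric found with symbol index s on this antenna
    bit_labels[s] : tuple of 0/1 bits mapped to symbol index s
    noise_var     : assumed scaling factor; the true scaling depends on the noise model
    """
    num_bits = len(bit_labels[0])
    llrs = []
    for b in range(num_bits):
        # Best (smallest) metric among symbols whose b-th bit is 0, and among those with 1.
        d0 = min(d for d, bits in zip(min_dist, bit_labels) if bits[b] == 0)
        d1 = min(d for d, bits in zip(min_dist, bit_labels) if bits[b] == 1)
        # Assumed convention: a positive LLR means bit value 1 is more likely.
        llrs.append((d0 - d1) / noise_var)
    return llrs


# Hypothetical 4-symbol alphabet with 2 bits per symbol and made-up metrics.
labels = [(0, 0), (0, 1), (1, 0), (1, 1)]
metrics = [0.2, 1.5, 2.0, 3.1]
llrs = max_log_llrs(metrics, labels)
```

Because every symbol index has an entry in `min_dist`, neither `min` call ever runs over an empty set, which is the situation a detector with missing candidates would have to patch with a clipping value.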

    Reconfigurable Real-time MIMO Detector on GPU

    In a high-performance multiple-input multiple-output (MIMO) system, a soft-output MIMO detector combined with a channel decoder is often used at the receiver to maximize performance gain. A graphics processing unit (GPU) is a low-cost, parallel, programmable co-processor that can deliver extremely high computation throughput and is well suited to signal processing applications. We propose and implement a novel soft MIMO detection algorithm and show that it meets real-time performance on a GPU while maintaining flexibility. Sponsors: Nokia; Nokia Siemens Networks (NSN); Texas Instruments; Xilinx; National Science Foundation.

    Implementation of a High Throughput Soft MIMO Detector on GPU

    Multiple-input multiple-output (MIMO) significantly increases the throughput of a communication system by employing multiple antennas at the transmitter and the receiver. To extract maximum performance from a MIMO system, a computationally intensive search-based detector is needed. To meet the challenge of MIMO detection, typical suboptimal MIMO detectors are ASIC or FPGA designs. We aim to show that a MIMO detector on a graphics processing unit (GPU), a low-cost, parallel, programmable co-processor, can achieve high throughput and can serve as an alternative to ASIC/FPGA designs. However, careful architecture-aware software design is needed to leverage the performance offered by the GPU. We propose a novel soft MIMO detection algorithm, multi-pass trellis traversal (MTT), and show that we can achieve ASIC/FPGA-like performance and handle different configurations in software on a GPU. The proposed design can be used to accelerate wireless physical layer simulations and to offload MIMO detection processing in wireless testbed platforms. Sponsors: Nokia; Nokia Siemens Networks (NSN); Texas Instruments; Xilinx; National Science Foundation.

    Low-Complexity and High-Performance Soft MIMO Detection Based on Distributed M-Algorithm Through Trellis-Diagram

    This paper presents a novel low-complexity multiple-input multiple-output (MIMO) detection scheme that uses a distributed M-algorithm (DM) to achieve high-performance soft MIMO detection. To reduce the search complexity, we build a MIMO trellis graph and split the search operations among different nodes, where each node applies the M-algorithm. Instead of keeping a global candidate list as the traditional detector does, this algorithm keeps multiple small candidate lists to generate soft information. Since the DM algorithm can achieve good BER performance with a small M, its sorting cost is lower than that of the conventional K-best MIMO algorithm. The proposed algorithm is well suited to high-speed parallel processing. Sponsors: Nokia; Nokia Siemens Networks (NSN); Xilinx; National Science Foundation.
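The idea of keeping many small per-node lists instead of one global list can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function `distributed_m_search`, the `metric_increment` callback, and the toy BPSK metric in the example are all hypothetical names and choices introduced here.

```python
import heapq


def distributed_m_search(num_stages, symbols, metric_increment, m):
    """Keep the m lowest-metric partial paths at every trellis node (sketch).

    num_stages       : number of trellis stages (one per transmit antenna)
    symbols          : candidate symbols; each one is a node in every stage
    metric_increment : hypothetical callback (stage, path) -> non-negative cost
                       added when `path` is extended to its last symbol
    m                : per-node list size
    Returns a dict mapping each final-stage symbol to its list of (metric, path).
    """
    # Stage 0: every node starts with the single path that consists of itself.
    lists = {s: [(metric_increment(0, (s,)), (s,))] for s in symbols}
    for stage in range(1, num_stages):
        # Every node of the new stage may extend any survivor of the previous stage.
        incoming = [item for node_list in lists.values() for item in node_list]
        new_lists = {}
        for s in symbols:
            extended = [
                (metric + metric_increment(stage, path + (s,)), path + (s,))
                for metric, path in incoming
            ]
            # Small local selection per node instead of one global sort.
            new_lists[s] = heapq.nsmallest(m, extended, key=lambda item: item[0])
        lists = new_lists
    return lists


# Toy example (assumption): BPSK symbols and an interference-free channel,
# so the increment is just the squared error on the current stage.
received = [0.9, -1.2, 0.3]
toy_increment = lambda stage, path: (received[stage] - path[-1]) ** 2
final_lists = distributed_m_search(3, [-1, 1], toy_increment, m=2)
best_metric, best_path = min(
    (item for node_list in final_lists.values() for item in node_list),
    key=lambda item: item[0],
)
```

Each selection here runs over at most `len(symbols) * m` candidates, which is where a small M keeps the sorting cost low. In a real detector the callback would typically compute a partial Euclidean distance that depends on the whole partial path, which is why the sketch passes the full `path` rather than only the newest symbol.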

    Algorithm Development and VLSI Implementation of Energy Efficient Decoders of Polar Codes

    With their low error-floor performance, polar codes have attracted significant attention as a potential standard error correction code (ECC) for future communication and data storage. However, the VLSI implementation complexity of polar code decoders is largely determined by the serial nature of decoding. This dissertation presents optimal decoder architectures for polar codes. It addresses several structural properties of polar codes and key properties of decoding algorithms that were not dealt with in prior research. The underlying concept of the proposed architectures is a paradigm that simplifies and schedules the computations such that hardware is simplified, latency is minimized and bandwidth is maximized. In pursuit of this, throughput-centric successive cancellation (TCSC) and overlapping path list successive cancellation (OPLSC) VLSI architectures and express journey BP (XJBP) decoders for polar codes are presented. An arbitrary polar code can be decomposed into a set of shorter polar codes with special characteristics; those shorter polar codes are referred to as constituent polar codes. By exploiting the homogeneity between the decoding processes of different constituent polar codes, TCSC reduces the decoding latency of the SC decoder by 60% for codes with length n = 1024. The error correction performance of SC decoding is inferior to that of list successive cancellation (LSC) decoding. The LSC decoding algorithm delivers the most reliable decoding results; however, it consumes the most hardware resources and decoding cycles. Instead of using multiple instances of decoding cores as in LSC decoders, the OPLSC architecture uses a single SC decoder. The computations of each path in the LSC are arranged to occupy the decoder hardware stages serially in a streamlined fashion. This yields a significant reduction in hardware complexity. 
The OPLSC decoder achieves about a 1.4 times improvement in hardware efficiency compared with traditional LSC decoders. Hardware-efficient VLSI architectures for TCSC and OPLSC polar code decoders are also introduced. Decoders based on SC or LSC algorithms suffer from high latency and limited throughput due to their serial decoding nature. An alternative approach to decoding polar codes is the belief propagation (BP) based algorithm. In the BP algorithm, a graph, usually referred to as a factor graph, is set up to guide how beliefs are propagated and refined. The BP decoding algorithm allows decoding in parallel to achieve much higher throughput. The XJBP decoder facilitates belief propagation by exploiting the specific constituent codes that exist in the conventional factor graph, which results in an express journey (XJ) decoder. Compared with the conventional BP decoding algorithm for polar codes, the proposed decoder reduces the computational complexity by about 40.6%. This enables an energy-efficient hardware implementation. To further explore the hardware consumption of the proposed XJBP decoder, the scheduling of computations is modeled and analyzed in this dissertation. With discussions of different hardware scenarios, optimal scheduling plans are developed. A novel memory-distributed micro-architecture of the XJBP decoder is proposed and analyzed to solve the potential memory access problems of the proposed scheduling strategy. Register-transfer level (RTL) models of the XJBP decoder are set up for comparison with other state-of-the-art BP decoders. The results show that the power efficiency of BP decoders is improved by about 3 times.

    Low complexity scalable MIMO sphere detection through antenna detection reordering

    This paper describes a novel low-complexity, scalable multiple-input multiple-output (MIMO) detector that requires neither preprocessing nor the optimal squared l2-norm computations to achieve good bit error rate (BER) performance. Unlike existing detectors such as Flexsphere, which use preprocessing before MIMO detection to improve performance, the proposed detector performs multiple search passes, where each search pass detects the transmit streams in a different permuted detection order. In addition, to reduce the number of multipliers required in the design, we use the l1-norm in place of the optimal squared l2-norm. To ameliorate the BER performance loss due to the l1-norm, we propose squaring and then scaling the l1-norm. By changing the number of parallel search passes and using norm scaling, we show that this design achieves performance comparable to Flexsphere with reduced resource requirements, or BER performance close to exhaustive search with increased resource requirements. Sponsor: National Science Foundation.
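The norm substitution described above can be made concrete with a small sketch. The abstract does not give the scaling factor, so the `scale=0.5` default below is purely an assumption, as are the function names and the sample residual vector.

```python
def squared_l2(residual):
    """Optimal metric: sum of |z|^2, which needs multiplications per element."""
    return sum(abs(z) ** 2 for z in residual)


def l1(residual):
    """l1-norm over real and imaginary parts: additions and absolute values only."""
    return sum(abs(z.real) + abs(z.imag) for z in residual)


def scaled_squared_l1(residual, scale=0.5):
    """Square the l1-norm once, then scale it (scale value is an assumption)."""
    return scale * l1(residual) ** 2


# Hypothetical residual y - H*s for one candidate symbol vector.
residual = [0.3 - 0.4j, -1.0 + 0.2j]
exact = squared_l2(residual)
approx = scaled_squared_l1(residual)
```

For a vector with n real components, the squared l1-norm always lies between the squared l2-norm and n times the squared l2-norm, so a scale factor below 1 pulls the approximation back toward the exact metric. A power-of-two scale would reduce to a shift in hardware, though whether the paper uses one is not stated in the abstract.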

    On the application of graphics processor to wireless receiver design

    In many 4G and beyond wireless standards, a Turbo decoder is often combined with a soft-output multiple-input multiple-output (MIMO) detector at the receiver to maximize performance. Although custom application-specific designs are usually used to meet this challenge, programmable graphics processing units (GPUs) have become an alternative to the traditional ASIC and FPGA solutions for wireless applications. However, careful architecture-aware algorithm design and mapping are required to maximize the performance of a communication block on a GPU. For MIMO soft detection, we implemented a new algorithm, multi-pass trellis traversal (MTT). For Turbo decoding, we used a parallel window algorithm. We showed that our implementations can achieve high throughput while maintaining good performance. This work will allow us to implement a complete iterative MIMO receiver in software on a GPU in the future.