109 research outputs found

    Configurable and Scalable High Throughput Turbo Decoder Architecture for Multiple 4GWireless Standards

    Get PDF
    In this paper, we propose a novel multi-code turbo decoder architecture for 4G wireless systems. To support various 4G standards, a configurable multi-mode MAP (maximum a posteriori) decoder is designed for both binary and duo-binary turbo codes with small resource overhead (less than 10%) compared to the single-mode architecture. To achieve high data rates in 4G, we present a parallel turbo decoder architecture with scalable parallelism tailored to the given throughput requirements. High-level parallelism is achieved by employing contention-free interleavers. Multi-banked memory structure and routing network among memories and MAP decoders are designed to operate at full speed with parallel interleavers. We designed a very low-complexity recursive on-line address generator supporting multiple interleaving patterns, which avoids the interleaver address memory. Design trade-offs in terms of area and power efficiency are explored to find the optimal architectures. A 711 Mbps data rate is feasible with 32 Radix-4 MAP decoders running at 200 MHz clock rate.Texas Instruments Incorporate

    Configurable and Scalable Turbo Decoder for 4G Wireless Receivers

    Get PDF
    The increasing requirements of high data rates and quality of service (QoS) in fourth-generation (4G) wireless communication require the implementation of practical capacity approaching codes. In this chapter, the application of Turbo coding schemes that have recently been adopted in the IEEE 802.16e WiMax standard and 3GPP Long Term Evolution (LTE) standard are reviewed. In order to process several 4G wireless standards with a common hardware module, a reconfigurable and scalable Turbo decoder architecture is presented. A parallel Turbo decoding scheme with scalable parallelism tailored to the target throughput is applied to support high data rates in 4G applications. High-level decoding parallelism is achieved by employing contention-free interleavers. A multi-banked memory structure and routing network among memories and MAP decoders are designed to operate at full speed with parallel interleavers. A new on-line address generation technique is introduced to support multiple Turbo interleaving patterns, which avoids the interleaver address memory that is typically necessary in the traditional designs. Design trade-offs in terms of area and power efficiency are analyzed for different parallelism and clock frequency goals

    A Flexible LDPC/Turbo Decoder Architecture

    Get PDF
    Low-density parity-check (LDPC) codes and convolutional Turbo codes are two of the most powerful error correcting codes that are widely used in modern communication systems. In a multi-mode baseband receiver, both LDPC and Turbo decoders may be required. However, the different decoding approaches for LDPC and Turbo codes usually lead to different hardware architectures. In this paper we propose a unified message passing algorithm for LDPC and Turbo codes and introduce a flexible soft-input soft-output (SISO) module to handle LDPC/Turbo decoding. We employ the trellis-based maximum a posteriori (MAP) algorithm as a bridge between LDPC and Turbo codes decoding. We view the LDPC code as a concatenation of n super-codes where each super-code has a simpler trellis structure so that the MAP algorithm can be easily applied to it. We propose a flexible functional unit (FFU) for MAP processing of LDPC and Turbo codes with a low hardware overhead (about 15% area and timing overhead). Based on the FFU, we propose an area-efficient flexible SISO decoder architecture to support LDPC/Turbo codes decoding. Multiple such SISO modules can be embedded into a parallel decoder for higher decoding throughput. As a case study, a flexible LDPC/Turbo decoder has been synthesized on a TSMC 90 nm CMOS technology with a core area of 3.2 mm2. The decoder can support IEEE 802.16e LDPC codes, IEEE 802.11n LDPC codes, and 3GPP LTE Turbo codes. Running at 500 MHz clock frequency, the decoder can sustain up to 600 Mbps LDPC decoding or 450 Mbps Turbo decoding.NokiaNokia Siemens Networks (NSN)XilinxTexas InstrumentsNational Science Foundatio

    Reconfigurable architectures for beyond 3G wireless communication systems

    Get PDF

    Implementation of a High Throughput 3GPP Turbo Decoder on GPU

    Get PDF
    Turbo code is a computationally intensive channel code that is widely used in current and upcoming wireless standards. General-purpose graphics processor unit (GPGPU) is a programmable commodity processor that achieves high performance computation power by using many simple cores. In this paper, we present a 3GPP LTE compliant Turbo decoder accelerator that takes advantage of the processing power of GPU to offer fast Turbo decoding throughput. Several techniques are used to improve the performance of the decoder. To fully utilize the computational resources on GPU, our decoder can decode multiple codewords simultaneously, divide the workload for a single codeword across multiple cores, and pack multiple codewords to fit the single instruction multiple data (SIMD) instruction width. In addition, we use shared memory judiciously to enable hundreds of concurrent multiple threads while keeping frequently used data local to keep memory access fast. To improve efficiency of the decoder in the high SNR regime, we also present a low complexity early termination scheme based on average extrinsic LLR statistics. Finally, we examine how different workload partitioning choices affect the error correction performance and the decoder throughput.Renesas MobileTexas InstrumentsXilinxNational Science Foundatio

    Architecture and Analysis for Next Generation Mobile Signal Processing.

    Full text link
    Mobile devices have proliferated at a spectacular rate, with more than 3.3 billion active cell phones in the world. With sales totaling hundreds of billions every year, the mobile phone has arguably become the dominant computing platform, replacing the personal computer. Soon, improvements to today’s smart phones, such as high-bandwidth internet access, high-definition video processing, and human-centric interfaces that integrate voice recognition and video-conferencing will be commonplace. Cost effective and power efficient support for these applications will be required. Looking forward to the next generation of mobile computing, computation requirements will increase by one to three orders of magnitude due to higher data rates, increased complexity algorithms, and greater computation diversity but the power requirements will be just as stringent to ensure reasonable battery lifetimes. The design of the next generation of mobile platforms must address three critical challenges: efficiency, programmability, and adaptivity. The computational efficiency of existing solutions is inadequate and straightforward scaling by increasing the number of cores or the amount of data-level parallelism will not suffice. Programmability provides the opportunity for a single platform to support multiple applications and even multiple standards within each application domain. Programmability also provides: faster time to market as hardware and software development can proceed in parallel; the ability to fix bugs and add features after manufacturing; and, higher chip volumes as a single platform can support a family of mobile devices. Lastly, hardware adaptivity is necessary to maintain efficiency as the computational characteristics of the applications change. Current solutions are tailored specifically for wireless signal processing algorithms, but lose their efficiency when other application domains like high definition video are processed. This thesis addresses these challenges by presenting analysis of next generation mobile signal processing applications and proposing an advanced signal processing architecture to deal with the stringent requirements. An application-centric design approach is taken to design our architecture. First, a next generation wireless protocol and high definition video is analyzed and algorithmic characterizations discussed. From these characterizations, key architectural implications are presented, which form the basis for the advanced signal processor architecture, AnySP.Ph.D.Electrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/86344/1/mwoh_1.pd

    Datacenter Design for Future Cloud Radio Access Network.

    Full text link
    Cloud radio access network (C-RAN), an emerging cloud service that combines the traditional radio access network (RAN) with cloud computing technology, has been proposed as a solution to handle the growing energy consumption and cost of the traditional RAN. Through aggregating baseband units (BBUs) in a centralized cloud datacenter, C-RAN reduces energy and cost, and improves wireless throughput and quality of service. However, designing a datacenter for C-RAN has not yet been studied. In this dissertation, I investigate how a datacenter for C-RAN BBUs should be built on commodity servers. I first design WiBench, an open-source benchmark suite containing the key signal processing kernels of many mainstream wireless protocols, and study its characteristics. The characterization study shows that there is abundant data level parallelism (DLP) and thread level parallelism (TLP). Based on this result, I then develop high performance software implementations of C-RAN BBU kernels in C++ and CUDA for both CPUs and GPUs. In addition, I generalize the GPU parallelization techniques of the Turbo decoder to the trellis algorithms, an important family of algorithms that are widely used in data compression and channel coding. Then I evaluate the performance of commodity CPU servers and GPU servers. The study shows that the datacenter with GPU servers can meet the LTE standard throughput with 4× to 16× fewer machines than with CPU servers. A further energy and cost analysis show that GPU servers can save on average 13× more energy and 6× more cost. Thus, I propose the C-RAN datacenter be built using GPUs as a server platform. Next I study resource management techniques to handle the temporal and spatial traffic imbalance in a C-RAN datacenter. I propose a “hill-climbing” power management that combines powering-off GPUs and DVFS to match the temporal C-RAN traffic pattern. Under a practical traffic model, this technique saves 40% of the BBU energy in a GPU-based C-RAN datacenter. For spatial traffic imbalance, I propose three workload distribution techniques to improve load balance and throughput. Among all three techniques, pipelining packets has the most throughput improvement at 10% and 16% for balanced and unbalanced loads, respectively.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120825/1/qizheng_1.pd

    On the application of graphics processor to wireless receiver design

    Get PDF
    In many wireless systems, a Turbo decoder is often combined with a soft-output multiple-input and multiple-output (MIMO) detector at the receiver to maximize performance in many 4G and beyond wireless standards. Although custom application specific designs are usually used to meet this challenge, programmable graphics processing units (GPU) has become an alternative to the traditional ASIC and FPGA solution for wireless applications. However, careful architecture-aware algorithm design and mapping are required to maximize performance of a communication block on GPU. For MIMO soft detection, we implemented a new MIMO soft detection algorithm, multi-pass trellis traversal (MTT). For Turbo decoding, we used a parallel window algorithm. We showed that our implementations can achieve high throughput while maintaining good performance. This work will allow us to implement a complete iterative MIMO receiver in software on GPU in the future
    corecore