38 research outputs found

    GPU Accelerated Scalable Parallel Decoding of LDPC Codes

    This paper proposes a flexible low-density parity-check (LDPC) decoder which leverages graphics processing units (GPUs) to provide high decoding throughput. LDPC codes are widely adopted by newly emerging standards for wireless communication systems and storage applications due to their near-capacity error-correcting performance. To achieve high decoding throughput on the GPU, we leverage the parallelism embedded in the check-node and variable-node computations and propose a parallel strategy for partitioning the decoding jobs among the GPU's multiprocessors. In addition, we propose a scalable multi-codeword decoding scheme to fully utilize the computational resources of the GPU. Furthermore, we develop a novel adaptive performance-tuning method to make our decoder implementation more flexible and scalable. The experimental results show that our LDPC decoder is scalable and flexible, and that the adaptive performance-tuning method can deliver peak performance for the given GPU architecture. (Funding: Renesas Mobile, Samsung, National Science Foundation.)
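    The abstract does not spell out the decoding algorithm, but the combination it describes, parallel check-node updates plus multi-codeword batching, can be illustrated with a minimal NumPy sketch. The min-sum check-node rule is a common choice assumed here, and the function and variable names are illustrative, not taken from the paper:

        import numpy as np

        def min_sum_check_update(var_to_chk, chk_neighbors):
            """One min-sum check-node update, vectorized over a batch of codewords.

            var_to_chk: dict mapping (check, var) -> array of shape (batch,) holding the
                        incoming variable-to-check messages for every codeword in the batch.
            chk_neighbors: list giving, for each check node, its variable indices.
            Returns the check-to-variable messages in the same dict layout.
            """
            chk_to_var = {}
            for c, neighbors in enumerate(chk_neighbors):
                msgs = np.stack([var_to_chk[(c, v)] for v in neighbors])   # (degree, batch)
                signs = np.where(msgs < 0, -1.0, 1.0)
                total_sign = np.prod(signs, axis=0)                        # (batch,)
                mags = np.abs(msgs)
                order = np.argsort(mags, axis=0)                           # locate two smallest magnitudes
                min1 = np.take_along_axis(mags, order[:1], axis=0)[0]
                min2 = np.take_along_axis(mags, order[1:2], axis=0)[0]
                argmin1 = order[0]
                for i, v in enumerate(neighbors):
                    excl_min = np.where(argmin1 == i, min2, min1)          # minimum over all other edges
                    chk_to_var[(c, v)] = total_sign * signs[i] * excl_min  # sign product excludes edge i
            return chk_to_var

    The batch dimension plays the role of the multi-codeword scheme; on a GPU, each codeword or group of check nodes would be mapped to its own thread block.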

    Massively parallel implementation of cyclic LDPC codes on a general purpose graphic processing unit

    2009 IEEE Workshop on Signal Processing Systems (SiPS), Tampere, Finland, 2009-10-07 to 2009-10-09. Simulation of low-density parity-check (LDPC) codes frequently takes several days, so the use of general-purpose graphics processing units (GPGPUs) is very promising. However, GPGPUs are designed for compute-intensive applications and are not optimized for data caching or control management. In LDPC decoding, the parity-check matrix H needs to be accessed at every node-updating step, and the size of the H matrix is often larger than the GPU on-chip memory, especially when the code length is long or the weight is high. In this work, the parity-check matrix of cyclic or quasi-cyclic LDPC codes is greatly compressed by exploiting the periodic property of the matrix. In our experiments, Nvidia's Compute Unified Device Architecture (CUDA) is used. With the (1057, 813) and (4161, 3431) projective geometry (PG)-LDPC codes, the execution speed of the proposed method is more than twice that of reference implementations that do not exploit the cyclic property of the parity-check matrices.
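    The compression rests on the fact that every row of a cyclic code's parity-check matrix is a cyclic shift of the first row, so only that first row has to be stored. A minimal sketch of the idea, using a toy length-7 code rather than the PG-LDPC codes of the paper:

        import numpy as np

        def row_support_of_cyclic_H(first_row_support, row_index, n):
            """Column indices of the ones in row `row_index` of a length-n cyclic
            parity-check matrix, derived by cyclically shifting the stored first row."""
            return np.sort((np.asarray(first_row_support) + row_index) % n)

        # Toy example: the first row has ones at columns 0, 1, 3 of a length-7 code.
        print(row_support_of_cyclic_H([0, 1, 3], 2, 7))   # -> [2 3 5]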

    Next generation earth‑to‑space telecommand coding and synchronization: ground system design, optimization and software implementation

    The Consultative Committee for Space Data Systems, followed by all national and international space agencies, has updated the Telecommand Coding and Synchronization sublayer to introduce new, powerful low-density parity-check (LDPC) codes. Their large coding gains significantly improve system performance and allow new Telecommand services and profiles with higher bit rates and volumes. In this paper, we focus on the Telecommand transmitter implementation in the Ground Station baseband segment. First, we discuss the most important blocks and focus on the most critical one, the LDPC encoder. We present and analyze two techniques, one based on a Shift Register Adder Accumulator and the other on Winograd convolution, both exploiting the block-circulant nature of the LDPC matrix. We show that these techniques provide a significant complexity reduction with respect to the usual encoder mapping, thus making high uplink bit rates attainable. We then discuss the choice of a proper hardware or software platform, and we show that a Central Processing Unit-based software solution is able to achieve the high bit rates requested by the new Telecommand applications. Finally, we present the results of a set of tests on the real-time software implementation of the new system, comparing the performance achievable with the different encoding options.
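    The core operation that both encoder techniques exploit is multiplication by a circulant block, which can be carried out by accumulating cyclic shifts of a single stored column instead of storing the whole matrix. A minimal GF(2) sketch of that operation, illustrative only; the actual CCSDS encoders combine several such blocks to compute the parity bits:

        import numpy as np

        def circulant_times_vector_gf2(first_col, u):
            """Multiply a binary circulant block, defined by its first column, by the
            message block u over GF(2). Each message bit selects a cyclic shift of the
            stored column, accumulated by XOR (the shift-register adder accumulator idea)."""
            acc = np.zeros_like(first_col)
            col = first_col.copy()
            for bit in u:
                if bit:
                    acc ^= col              # accumulate the current column
                col = np.roll(col, 1)       # next column is a cyclic shift of the previous one
            return acc

        # Toy example with a 4x4 circulant whose first column is [1, 0, 1, 0].
        print(circulant_times_vector_gf2(np.array([1, 0, 1, 0]), np.array([1, 1, 0, 0])))   # -> [1 1 1 1]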

    New Algorithms for High-Throughput Decoding with Low-Density Parity-Check Codes using Fixed-Point SIMD Processors

    Most digital signal processors contain one or more functional units with a single-instruction, multiple-data architecture that supports saturating fixed-point arithmetic with two or more options for the arithmetic precision. The processors designed for the highest performance contain many such functional units connected through an on-chip network. The selection of the arithmetic precision provides a trade-off between task-level throughput and the output quality of many signal-processing algorithms, and utilization of the interconnection network during execution introduces a latency that can also limit the algorithm's throughput. In this dissertation, we consider the turbo-decoding message-passing algorithm for iterative decoding of low-density parity-check codes and investigate its performance in parallel execution on a processor of interconnected functional units employing fast, low-precision fixed-point arithmetic. It is shown that the frequent occurrence of saturation when 8-bit signed arithmetic is used severely degrades the performance of the algorithm compared with decoding using higher-precision arithmetic. A technique of limiting the magnitude of certain intermediate variables of the algorithm, the extrinsic values, is proposed and shown to eliminate most occurrences of saturation, resulting in 8-bit decoding performance nearly equal to that achieved with higher-precision decoding. We show that the interconnection latency can have a significant detrimental effect on the throughput of the turbo-decoding message-passing algorithm, which is illustrated for a type of high-performance digital signal processor known as a stream processor. Two alternatives to the standard schedule of message-passing and parity-check operations are proposed for the algorithm. Both alternatives markedly reduce the interconnection latency, and both result in substantially greater throughput than the standard schedule with no increase in the probability of error.
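    The interaction between saturating 8-bit arithmetic and a magnitude limit on extrinsic values can be sketched as follows; the clip threshold below is an illustrative assumption, not the value derived in the dissertation:

        import numpy as np

        Q8_MIN, Q8_MAX = -128, 127
        EXTRINSIC_LIMIT = 96   # illustrative magnitude limit, not the dissertation's value

        def saturating_add_q8(a, b):
            """8-bit signed addition that saturates instead of wrapping, as provided by
            fixed-point SIMD functional units."""
            wide = a.astype(np.int16) + b.astype(np.int16)
            return np.clip(wide, Q8_MIN, Q8_MAX).astype(np.int8)

        def limit_extrinsic(extrinsic):
            """Clamp extrinsic values so that later additions rarely reach saturation."""
            return np.clip(extrinsic, -EXTRINSIC_LIMIT, EXTRINSIC_LIMIT).astype(np.int8)

        # Two large values saturate when added in 8-bit arithmetic, losing information.
        a = np.array([120, -30], dtype=np.int8)
        b = np.array([100, -20], dtype=np.int8)
        print(saturating_add_q8(a, b))   # -> [127 -50]  (the first sum saturates at 127)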

    Energy-Efficient Computing for Mobile Signal Processing

    Mobile devices have rapidly proliferated, and deployment of handheld devices continues to increase at a spectacular rate. As today's devices not only support advanced signal processing of wireless communication data but also provide rich sets of applications, contemporary mobile computing requires both demanding computation and efficiency. Most mobile processors combine general-purpose processors, digital signal processors, and hardwired application-specific integrated circuits to satisfy their high-performance and low-power requirements. However, such a heterogeneous platform is inefficient in area, power, and programmability. Improving the efficiency of programmable mobile systems is a critical challenge and an active area of computer systems research.

    SIMD (single instruction, multiple data) architectures are very effective for the data-level-parallelism-intensive algorithms in mobile signal processing. However, new characteristics of advanced wireless/multimedia algorithms require architectural re-evaluation to achieve better energy efficiency. Therefore, fourth-generation wireless protocol and high-definition mobile video algorithms are analyzed to enhance a wide-SIMD architecture. The key enhancements include 1) a programmable crossbar to support complex data alignment, 2) SIMD partitioning to support fine-grain SIMD computation, and 3) fused operations to accelerate frequently used instruction pairs.

    Near-threshold computation has been attractive in low-power architecture research because it balances performance and power. To further improve energy efficiency in mobile computing, near-threshold computation is applied to a wide SIMD architecture. The proposed near-threshold wide SIMD architecture, Diet SODA, presents interesting architectural design decisions such as 1) a very wide SIMD datapath to compensate for the degraded performance induced by near-threshold computation and 2) a scatter-gather data prefetcher to exploit the large latency gap between memory and the SIMD datapath. Although near-threshold computation provides excellent energy efficiency, it suffers from increased delay variations. A systematic study of delay variations in near-threshold computing is performed, and simple techniques, structural duplication and voltage/frequency margining, are explored to tolerate and mitigate the delay variations in near-threshold wide SIMD architectures.

    This dissertation analyzes representative wireless/multimedia mobile signal processing algorithms, proposes an energy-efficient programmable platform, and evaluates performance and power. A main theme of this dissertation is that the performance and efficiency of programmable embedded systems can be significantly improved with a combination of parallel SIMD and near-threshold computations.

    Ph.D., Electrical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/86356/1/swseo_1.pd

    Software Defined Radio Solutions for Wireless Communications Systems

    Wireless technologies have been advancing rapidly, especially in recent years. Design, implementation, and manufacturing of devices supporting the continuously evolving technologies require great effort. Thus, building platforms compatible with different generations of standards and technologies has gained a lot of interest. As a result, software defined radios (SDRs) are investigated to offer more flexibility and scalability and to reduce the design effort compared to conventional fixed-function hardware-based solutions.

    This thesis mainly addresses the challenges related to SDR-based implementation of today's wireless devices. One of the main targets of most wireless standards has been to improve the achievable data rates, which imposes strict requirements on the processing platforms. Realizing real-time processing of high-throughput signal processing algorithms on SDR-based platforms, while maintaining energy consumption close to that of conventional approaches, is a challenging topic that is addressed in this thesis.

    Firstly, this thesis concentrates on the challenges of a real-time software-based implementation of the very high throughput (VHT) Institute of Electrical and Electronics Engineers (IEEE) 802.11ac amendment from the wireless local area network (WLAN) family, where an SDR-based solution is introduced for the frequency-domain baseband processing of a multiple-input multiple-output (MIMO) transmitter and receiver. The feasibility of the implementation is evaluated with respect to the number of clock cycles and the consumed power. Furthermore, a digital front-end (DFE) concept is developed for the IEEE 802.11ac receiver, where the 80 MHz waveform is divided into two 40 MHz signals. This is carried out through time-domain digital filtering and decimation, which is challenging due to the latency and cyclic prefix (CP) budget of the receiver. Different multi-rate channelization architectures are developed, and the software implementation is presented and evaluated in terms of execution time, number of clock cycles, power, and energy consumption on different multi-core platforms.

    Secondly, this thesis addresses selected advanced techniques developed to realize in-band full-duplex (IBFD) systems, which aim at improving spectral efficiency in today's congested radio spectrum. IBFD refers to concurrent transmission and reception on the same frequency band, where the main challenge to combat is the strong self-interference (SI). In this thesis, an SDR-based solution is introduced which is capable of real-time mitigation of the SI signal. The implementation results show the possibility of achieving sufficient real-time SI suppression under time-varying environments using low-power, mobile-scale multi-core processing platforms.

    To investigate the challenges associated with SDR implementations for mobile-scale devices with limited processing and power resources, processing platforms suitable for hand-held devices are selected in this thesis work. On the baseband processing side, a very long instruction word (VLIW) processor optimized for wireless communication applications is utilized. Furthermore, in the solutions presented for the DFE processing and the digital SI canceller, commercial off-the-shelf (COTS) multi-core central processing units (CPUs) and graphics processing units (GPUs) are used with the aim of investigating the performance enhancement achieved by utilizing parallel processing.

    Overall, this thesis provides solutions to the challenges of low-power, real-time, software-based implementation of computationally intensive signal processing algorithms for current and future communications systems.
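    The DFE step described above, splitting an 80 MHz received waveform into two 40 MHz streams by time-domain filtering and decimation, can be sketched as follows. Filter length, cutoff, and function names are illustrative assumptions, not the thesis's channelizer design:

        import numpy as np
        from scipy.signal import firwin, lfilter

        def split_80mhz_into_two_40mhz(x, fs=80e6, ntaps=63):
            """Toy digital front-end: mix each 40 MHz half of an 80 MHz-wide complex
            baseband signal down to DC, low-pass filter it, and decimate by 2."""
            n = np.arange(len(x))
            lowpass = firwin(ntaps, cutoff=20e6, fs=fs)          # keeps a 40 MHz-wide band
            outputs = []
            for f_center in (-20e6, +20e6):                      # centers of the two halves
                mixed = x * np.exp(-2j * np.pi * f_center * n / fs)
                outputs.append(lfilter(lowpass, 1.0, mixed)[::2])   # filter, then decimate by 2
            return outputs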

    Linear-time encoding and decoding of low-density parity-check codes

    Low-density parity-check (LDPC) codes had a renaissance when they were rediscovered in the 1990s. Since then, LDPC codes have been an important part of the field of error-correcting codes and have been shown to approach the Shannon capacity, the limit at which we can reliably transmit information over noisy channels. Following this, many modern communications standards have adopted LDPC codes. Error correction is as important for protecting data from corruption on a hard drive as it is for deep-space communications; one of its most common uses is reliable wireless transmission of data to mobile devices. For practical purposes, both encoding and decoding need to be of low complexity to achieve high throughput and low power consumption. This thesis provides a literature review of the current state of the art in encoding and decoding of LDPC codes. Message-passing decoders are still capable of achieving the best error-correcting performance, while more recently considered bit-flipping decoders provide a low-complexity alternative, albeit with some loss in error-correcting performance. An implementation of a low-complexity stochastic bit-flipping decoder is also presented. It is implemented for Graphics Processing Units (GPUs) in a parallel fashion, providing a peak throughput of 1.2 Gb/s, which is significantly higher than previous decoder implementations on GPUs. The error-correcting performance of a range of decoders has also been tested, showing that the stochastic bit-flipping decoder provides relatively good error-correcting performance with low complexity. Finally, a brief comparison of encoding complexities for two code ensembles is also presented.
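    For reference, the basic deterministic bit-flipping rule that stochastic variants build on can be written in a few lines; this sketch is a plain Gallager-style decoder, not the thesis's stochastic GPU implementation:

        import numpy as np

        def bit_flip_decode(H, y, max_iters=50):
            """Hard-decision bit-flipping decoding: repeatedly flip the bits that
            take part in the largest number of unsatisfied parity checks.
            H: binary parity-check matrix of shape (m, n); y: received hard bits."""
            x = y.copy()
            for _ in range(max_iters):
                syndrome = H.dot(x) % 2            # which checks are unsatisfied
                if not syndrome.any():
                    break                          # all checks satisfied: valid codeword
                votes = H.T.dot(syndrome)          # unsatisfied-check count per bit
                x = np.where(votes == votes.max(), x ^ 1, x)   # flip the worst bits
            return x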

    A Study on Low-Complexity Block Turbo Code Decoding for Soft-Decision Error Correction

    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2016. Advisor: Wonyong Sung.

    As the throughput needed for communication systems and storage devices increases, high-performance forward error correction (FEC), especially soft-decision (SD) based techniques, becomes essential. In particular, block turbo codes (BTCs) and low-density parity-check (LDPC) codes are considered candidate FEC codes for next-generation systems, such as beyond-100 Gbps optical networks and sub-20 nm NAND flash memory devices, which require capacity-approaching performance and a very low error floor. BTCs have definite strengths in diversity and encoding complexity because they generally employ a two-dimensional structure, which enables sub-frame-level decoding of the row or column code-words. This sub-frame-level decoding gives a strong advantage for parallel processing. The BTC decoding throughput can be improved by applying a low-complexity algorithm to the sub-frame-level decoding or by running multiple sub-frame decoding modules simultaneously. In this dissertation, we develop high-throughput BTC decoding software that pursues these advantages.

    The first part of this dissertation is devoted to finding efficient test patterns for the Chase-Pyndiah algorithm. Although the complexity of this algorithm increases linearly with the number of test patterns, it naively considers all possible patterns over the least reliable positions. As a result, considering one more position nearly doubles the complexity. To solve this issue, we first introduce a new position-selection criterion that excludes selected positions having relatively large reliability. This technique avoids selecting sufficiently reliable positions, which greatly reduces the complexity. Secondly, we propose a pattern-selection scheme that considers the error coverage. We define an error-coverage factor that represents the influence on the error-correcting performance and compute it by analyzing error events. Based on the computed factor, we select the patterns with a greedy algorithm. By using these methods, we can flexibly balance complexity and performance.

    The second part of this dissertation develops low-complexity soft-output processing methods needed for BTC decoding. In the Chase-Pyndiah algorithm, the soft output is updated in two different ways according to whether competing code-words exist at the positions being updated. If competing code-words exist, the Euclidean distance between the soft-input signal and the code-words generated from the test patterns is used. However, the cost of the distance computation is very high and increases linearly with the sub-frame length. We identify computationally redundant positions and optimize the computing process by ignoring them. If no competing code-words exist, a reliability factor that must be pre-determined by an extensive search is required. To avoid this, we propose adaptive determination methods, which provide even better error-correcting performance. In addition, we investigate Pyndiah's soft-output computation and identify drawbacks that appear during the approximation process. To remove these drawbacks, we replace the update at the positions expected to be seriously damaged by the approximation with the much simpler reliability-factor-based update, even when competing code-words exist.
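    The test patterns discussed above are, in the baseline Chase-Pyndiah decoder, all 2^p flips of the p least-reliable positions (LRPs) of the hard decision; the dissertation's contribution is to prune this set. A minimal sketch of the naive pattern generation, with illustrative names only:

        import numpy as np
        from itertools import product

        def chase_test_patterns(llr, p=4):
            """All 2^p Chase test patterns: the hard decision on the soft input with
            every subset of its p least-reliable positions flipped."""
            hard = (llr < 0).astype(np.uint8)          # hard decision per position
            lrps = np.argsort(np.abs(llr))[:p]         # p least-reliable positions
            patterns = []
            for flips in product((0, 1), repeat=p):
                candidate = hard.copy()
                candidate[lrps] ^= np.array(flips, dtype=np.uint8)
                patterns.append(candidate)
            return patterns                            # each pattern is then algebraically decoded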
    This dissertation also develops a graphics processing unit (GPU) based BTC decoding program. In order to hide the latency of arithmetic and memory-access operations, the software applies a kernel structure that processes multiple BTC-words and allocates multiple sub-frames to each thread-block. Global memory access optimization and data compression, which demands less shared-memory space, are also employed. For efficient mapping of the Chase-Pyndiah algorithm onto GPUs, we propose parallel processing schemes employing efficient reduction algorithms and provide step-by-step parallel algorithms for the algebraic decoding.

    The last part of this dissertation summarizes the developed decoding method and compares it with the decoding of the LDPC convolutional code (CC), which is currently reported as the most powerful candidate for the 100 Gbps optical network. We first investigate the complexity reduction and the error-rate performance improvement of the developed method. Then, we analyze the complexity of LDPC-CC decoding and compare it with the developed BTC decoding for the 20% overhead codes.

    This dissertation is intended to develop high-throughput SD decoding software by introducing complexity-reduction techniques for the Chase-Pyndiah algorithm and efficient parallel processing methods, and to emphasize the competitiveness of the BTC. The proposed decoding methods and parallel processing algorithms, verified on GPU-based systems, are also applicable to hardware-based ones. By implementing hardware-based decoders that employ the methods developed in this dissertation, significant improvements in throughput and energy efficiency can be obtained. Moreover, thanks to the wide rate coverage of the BTC, the developed techniques can be applied to many high-throughput error-correction applications, such as next-generation optical networks and storage device systems.

    Contents:
    Chapter 1 Introduction: 1.1 Turbo Codes; 1.2 Applications of Turbo Codes; 1.3 Outline of the Dissertation
    Chapter 2 Encoding and Iterative Decoding of Block Turbo Codes: 2.1 Introduction; 2.2 Encoding Procedure of Shortened-Extended BTCs; 2.3 Scheduling Methods for Iterative Decoding (2.3.1 Serial Scheduling; 2.3.2 Parallel Scheduling; 2.3.3 Replica Scheduling); 2.4 Elementary Decoding with Chase-Pyndiah Algorithm (2.4.1 Chase-Pyndiah Algorithm for Extended BTCs; 2.4.2 Reliability Computation of the ML Code-Word; 2.4.3 Algebraic Decoding for SEC and DEC BCH Codes); 2.5 Issues of Chase-Pyndiah Algorithm
    Chapter 3 Complexity Reduction Techniques for Code-Word Set Generation of the Chase-Pyndiah Algorithm: 3.1 Introduction; 3.2 Adaptive Selection of LRPs (3.2.1 Selection Constraints of LRPs; 3.2.2 Simulation Results); 3.3 Test Pattern Selection (3.3.1 The Error Coverage Factor of Test Patterns; 3.3.2 Greedy Selection of Test Patterns; 3.3.3 Simulation Results); 3.4 Concluding Remarks
    Chapter 4 Complexity Reduction Techniques for Soft-Output Update of the Chase-Pyndiah Algorithm: 4.1 Introduction; 4.2 Distance Computation (4.2.1 Position-Index List Based Method; 4.2.2 Double Index Set-Based Method; 4.2.3 Complexity Analysis; 4.2.4 Simulation Results); 4.3 Reliability Factor Determination (4.3.1 Refinement of Distance-Based Reliability Factor; 4.3.2 Adaptive Determination of the Reliability Factor; 4.3.3 Simulation Results); 4.4 Accuracy Improvement in Extrinsic Information Update (4.4.1 Drawbacks of the Sub-Optimal Update; 4.4.2 Low-Complexity Extrinsic Information Update; 4.4.3 Simulation Results); 4.5 Concluding Remarks
    Chapter 5 High-Throughput BTC Decoding on GPUs: 5.1 Introduction; 5.2 BTC Decoder Architecture for GPU Implementations; 5.3 Memory Optimization (5.3.1 Global Memory Access Reduction; 5.3.2 Improvement of Global Memory Access Coalescing; 5.3.3 Efficient Shared Memory Control with Data Compression; 5.3.4 Index Parity Check Scheme); 5.4 Parallel Algorithms with the CUDA Shuffle Function; 5.5 Implementation of Algebraic Decoder (5.5.1 Galois Field Operations with Look-Up Tables; 5.5.2 Error-Locator Polynomial Setting with the LUTs; 5.5.3 Parallel Chien Search with the LUTs); 5.6 Simulation Results; 5.7 Concluding Remarks
    Chapter 6 Competitiveness of BTCs as FEC Codes for the Next-Generation Optical Networks: 6.1 Introduction; 6.2 The Complexity Reduction of the Modified Chase-Pyndiah Algorithm (6.2.1 Summary of the Complexity Reduction; 6.2.2 The Error-Correcting Performance); 6.3 Comparison of BTCs and LDPC-CCs (6.3.1 Complexity Analysis of the LDPC-CC Decoding; 6.3.2 Comparison of the 20% Overhead BTC and LDPC-CC); 6.4 Concluding Remarks
    Chapter 7 Conclusion
    Bibliography
    Abstract in Korean (국문 초록)
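    The two soft-output update rules contrasted in the abstract can be sketched roughly as follows. This is a generic rendering of the Chase-Pyndiah update, with the distance-difference scaling and the reliability factor beta as assumptions, not the dissertation's optimized method:

        import numpy as np

        def pyndiah_soft_output(r, d, competitors, beta=0.5):
            """Soft output of one row/column decoding in the Chase-Pyndiah style.
            r: soft input; d: decided codeword in {-1,+1}; competitors: other candidate
            codewords in {-1,+1}. Where a competitor disagrees with d at position j, the
            Euclidean-distance difference is used; elsewhere the reliability factor beta
            scales the decision. Scaling and beta value are illustrative choices."""
            d = d.astype(float)
            dist_d = np.sum((r - d) ** 2)                 # distance to the decided codeword
            soft = beta * d                               # default when no competitor differs
            for j in range(len(d)):
                rivals = [c for c in competitors if c[j] != d[j]]
                if rivals:
                    dist_c = min(np.sum((r - c) ** 2) for c in rivals)
                    soft[j] = (dist_c - dist_d) / 4.0 * d[j]
            return soft                                   # extrinsic value = soft - r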