
    Spatially Coupled Turbo Codes: Principles and Finite Length Performance

    In this paper, we give an overview of spatially coupled turbo codes (SC-TCs), the spatial coupling of parallel and serially concatenated convolutional codes, recently introduced by the authors. For presentation purposes, we focus on spatially coupled serially concatenated codes (SC-SCCs). We review the main principles of SC-TCs and discuss their exact density evolution (DE) analysis on the binary erasure channel. We also consider the construction of a family of rate-compatible SC-SCCs with simple 4-state component encoders. For all considered code rates, threshold saturation of the belief propagation (BP) threshold to the maximum a posteriori (MAP) threshold of the uncoupled ensemble is demonstrated, and it is shown that the BP threshold approaches the Shannon limit as the coupling memory increases. Finally, we give some simulation results for finite lengths.
    Comment: Invited paper, IEEE Int. Symp. Wireless Communications Systems (ISWCS), Aug. 201
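    The abstract does not reproduce the SC-SCC transfer functions, so the sketch below illustrates the same threshold-saturation mechanism with a spatially coupled (3,6)-regular LDPC ensemble as a stand-in: a one-dimensional DE recursion on the BEC whose terminated boundary drives a decoding wave toward the middle of the chain. The degrees, coupling width, and chain length are illustrative assumptions, not values from the paper.

```python
import numpy as np

def coupled_bp_threshold(dv=3, dc=6, w=3, L=50, max_iter=20000, tol=1e-9):
    """Approximate BP threshold (BEC) of a (dv, dc)-regular spatially coupled
    LDPC ensemble with coupling width w and chain length L, found by bisection
    over the channel erasure probability eps."""
    k = np.ones(w) / w                            # uniform coupling window
    def decodes(eps):
        x = np.full(L, eps)                       # erasure prob. per variable position
        xp = np.zeros(L + 2 * (w - 1))            # positions outside the chain are known (termination)
        for _ in range(max_iter):
            xp[w - 1:w - 1 + L] = x
            a = np.convolve(xp, k, mode="valid")  # avg. erasure prob. entering each check position
            c = 1.0 - (1.0 - a) ** (dc - 1)       # check-to-variable erasure prob.
            x_new = eps * np.convolve(c, k, mode="valid") ** (dv - 1)
            if np.max(np.abs(x_new - x)) < tol:
                break
            x = x_new
        return x_new.max() < 1e-6                 # chain fully decoded?
    lo, hi = 0.0, 1.0
    for _ in range(20):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if decodes(mid) else (lo, mid)
    return lo

# Threshold saturation: the result should be close to the (3,6) MAP threshold
# (about 0.488) rather than the uncoupled BP threshold (about 0.429).
print(coupled_bp_threshold())
```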

    NVIDIA Tensor Core Programmability, Performance & Precision

    The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called the "Tensor Core", that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to programming NVIDIA Tensor Cores, their performance, and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision, respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While precision loss due to matrix multiplication with half-precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can benefit strongly from the use of NVIDIA Tensor Cores.
    Comment: This paper has been accepted by the Eighth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 201
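    Since the compensation scheme is only summarized above, the numpy sketch below merely emulates the two effects the abstract describes: the error introduced by rounding the inputs to half precision before an fp32-accumulated multiply, and how one residual correction (extra multiplications with the rounded-off parts) reduces that error at the cost of more computation. It is a CPU emulation, not Tensor Core code, and the matrix size and the specific correction step are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
C_ref = A @ B                                   # float64 reference

# Emulated mixed-precision mode: fp16 inputs, fp32 accumulation.
A16, B16 = A.astype(np.float16), B.astype(np.float16)
C_mixed = A16.astype(np.float32) @ B16.astype(np.float32)

# One residual (error-compensation) step: also multiply by the parts of A and B
# that were lost when rounding the inputs to fp16.
dA16 = (A - A16.astype(np.float64)).astype(np.float16)
dB16 = (B - B16.astype(np.float64)).astype(np.float16)
C_comp = (C_mixed
          + A16.astype(np.float32) @ dB16.astype(np.float32)
          + dA16.astype(np.float32) @ B16.astype(np.float32))

for name, C in [("mixed", C_mixed), ("compensated", C_comp)]:
    err = np.linalg.norm(C - C_ref) / np.linalg.norm(C_ref)
    print(f"{name:12s} relative error: {err:.2e}")
```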

    Binary Message Passing Decoding of Product-like Codes

    We propose a novel binary message passing decoding algorithm for product-like codes based on bounded distance decoding (BDD) of the component codes. The algorithm, dubbed iterative BDD with scaled reliability (iBDD-SR), exploits the channel reliabilities and is therefore soft in nature. However, the messages exchanged by the component decoders are binary (hard) messages, which significantly reduces the decoder data flow. The exchanged binary messages are obtained by combining the channel reliability with the BDD decoder output reliabilities, properly conveyed by a scaling factor applied to the BDD decisions. We perform a density evolution analysis for generalized low-density parity-check (GLDPC) code ensembles and spatially coupled GLDPC code ensembles, from which the scaling factors of the iBDD-SR for product and staircase codes, respectively, can be obtained. For the additive white Gaussian noise channel, we show performance gains of up to 0.29 dB and 0.31 dB for product and staircase codes compared to conventional iterative BDD (iBDD) with the same decoder data flow. Furthermore, we show that iBDD-SR approaches the performance of ideal iBDD that prevents miscorrections.
    Comment: Accepted for publication in the IEEE Transactions on Communications
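    The message rule can be sketched directly from the description above: scale the ±1 BDD decision by a reliability factor, add the channel log-likelihood ratio, and forward only the sign. In the sketch below the value of the scaling factor and the convention of passing 0 when BDD fails are illustrative assumptions; in the paper the scaling factors are obtained from the density evolution analysis.

```python
import numpy as np

def ibdd_sr_message(bdd_decision, channel_llr, w):
    """Form the binary (hard) message exchanged between component decoders.

    bdd_decision: +1/-1 hard decision from bounded distance decoding of the
                  component code, or 0 if BDD failed (illustrative convention).
    channel_llr:  channel log-likelihood ratio for the same bit.
    w:            scaling factor conveying the BDD output reliability.
    """
    soft = w * bdd_decision + channel_llr   # combine BDD decision and channel reliability
    return np.where(soft >= 0, +1, -1)      # only the sign (one bit) is exchanged

# Toy usage: 8 bits, BDD succeeded on six of them and failed (0) on two.
bdd = np.array([+1, -1, +1, +1, 0, -1, 0, +1])
llr = np.array([2.1, -0.3, -1.5, 0.4, 0.8, -2.2, -0.1, 0.9])
print(ibdd_sr_message(bdd, llr, w=1.8))
```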

    High Performance Computing of Gene Regulatory Networks using a Message-Passing Model

    Gene regulatory network reconstruction is a fundamental problem in computational biology. We recently developed an algorithm, called PANDA (Passing Attributes Between Networks for Data Assimilation), that integrates multiple sources of 'omics data and estimates regulatory network models. This approach was initially implemented in the C++ programming language and has since been applied to a number of biological systems. In our current research we are beginning to expand the algorithm to incorporate larger and more diverse data sets, to reconstruct networks that contain increasing numbers of elements, and to build not only single network models but sets of networks. In order to accomplish these "Big Data" applications, it has become critical that we increase the computational efficiency of the PANDA implementation. In this paper we show how to recast PANDA's similarity equations as matrix operations. This allows us to implement a highly readable version of the algorithm using the MATLAB/Octave programming language. We find that the resulting M-code is much shorter (103 lines compared to 1128) and more easily modifiable for potential future applications. The new implementation also runs significantly faster, with increasing efficiency as the network models increase in size. Tests comparing the C-code and M-code versions of PANDA demonstrate a speed-up on the order of 20-80 times for networks of dimensions similar to those found in current biological applications.
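    PANDA's actual similarity equations are not given in the abstract, so the snippet below uses a generic cosine similarity between all row pairs of two matrices purely to illustrate the kind of rewrite the paper describes: replacing a nested loop over element pairs with a single matrix product plus a normalization, which is what makes the MATLAB/Octave (or numpy) version short and fast.

```python
import numpy as np

def pairwise_similarity_loops(X, Y):
    """Naive double loop: similarity between every row of X and every row of Y."""
    S = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            S[i, j] = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    return S

def pairwise_similarity_matrix(X, Y):
    """Same computation recast as matrix operations (one GEMM plus normalization)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

rng = np.random.default_rng(1)
X, Y = rng.standard_normal((200, 50)), rng.standard_normal((300, 50))
assert np.allclose(pairwise_similarity_loops(X, Y), pairwise_similarity_matrix(X, Y))
```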

    Asymptotic and Finite Frame Length Analysis of Frame Asynchronous Coded Slotted ALOHA

    We consider a frame-asynchronous coded slotted ALOHA (FA-CSA) system where users become active according to a Poisson random process. In contrast to standard frame-synchronous CSA (FS-CSA), users transmit a first replica of their message in the slot following their activation and other replicas uniformly at random in a number of subsequent slots. We derive the (approximate) density evolution that characterizes the asymptotic performance of FA-CSA when the frame length goes to infinity. We show that, if users can monitor the system before they start transmitting, a boundary effect similar to that of spatially coupled codes occurs, which greatly improves the decoding threshold compared to FS-CSA. We also derive analytical approximations of the error floor (EF) in the finite frame length regime. We show that FA-CSA yields, in general, a lower EF, better performance in the waterfall region, and lower average delay than FS-CSA.
    Comment: 5 pages, 6 figures. Updated notation, terminology, and typos
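    The access rule stated above (Poisson activations, first replica in the slot after activation, remaining replicas uniformly at random over a window of later slots) is straightforward to simulate; the arrival rate, replica degree, and window length below are placeholder values, not parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def fa_csa_slots(arrival_rate=0.6, n_slots=200, degree=3, window=50):
    """Simulate slot occupancy under frame-asynchronous CSA replica placement.

    Users become active according to a Poisson process (arrival_rate users per
    slot on average); each transmits its first replica in the next slot and the
    remaining degree-1 replicas uniformly at random in the following `window`
    slots. All parameter values are illustrative.
    """
    slots = [[] for _ in range(n_slots + window + 1)]   # users transmitting in each slot
    n_users = 0
    for slot in range(n_slots):
        for _ in range(rng.poisson(arrival_rate)):      # activations in this slot
            user = n_users
            n_users += 1
            first = slot + 1                            # first replica: next slot
            others = rng.choice(np.arange(first + 1, first + 1 + window),
                                size=degree - 1, replace=False)
            for s in [first, *others]:
                slots[s].append(user)
    return slots, n_users

slots, n_users = fa_csa_slots()
print(n_users, "users; mean replicas per slot:",
      np.mean([len(s) for s in slots[:200]]))
```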