Spatially Coupled Turbo Codes: Principles and Finite Length Performance
In this paper, we give an overview of spatially coupled turbo codes (SC-TCs),
the spatial coupling of parallel and serially concatenated convolutional codes,
recently introduced by the authors. For presentation purposes, we focus on
spatially coupled serially concatenated codes (SC-SCCs). We review the main
principles of SC-TCs and discuss their exact density evolution (DE) analysis on
the binary erasure channel. We also consider the construction of a family of
rate-compatible SC-SCCs with simple 4-state component encoders. For all
considered code rates, saturation of the belief propagation (BP) threshold to
the maximum a posteriori (MAP) threshold of the uncoupled ensemble is demonstrated,
and it is shown that the BP threshold approaches the Shannon limit as the
coupling memory increases. Finally we give some simulation results for finite
lengths.Comment: Invited paper, IEEE Int. Symp. Wireless Communications Systems
(ISWCS), Aug. 201
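The threshold-saturation effect the abstract demonstrates can be illustrated with a short density-evolution sketch on the BEC. The recursion below is for a generic spatially coupled (dv, dc)-regular LDPC ensemble, not the authors' exact SC-TC density evolution, and every parameter (dv=3, dc=6, coupling width w=3, chain length L=50, the erasure rate 0.45) is illustrative:

```python
# Density evolution (DE) on the binary erasure channel.
# NOTE: illustrative sketch only -- a generic spatially coupled
# (dv, dc)-regular LDPC ensemble, which shows the same threshold-saturation
# phenomenon as SC-TCs, not the paper's exact SC-TC DE.

def uncoupled_de(eps, dv=3, dc=6, iters=2000):
    """Fixed point of x = eps * (1 - (1 - x)^(dc-1))^(dv-1)."""
    x = eps
    for _ in range(iters):
        x = eps * (1.0 - (1.0 - x) ** (dc - 1)) ** (dv - 1)
    return x

def coupled_de(eps, dv=3, dc=6, L=50, w=3, max_iters=20000, tol=1e-10):
    """DE for a coupled chain of L positions with coupling width w.
    Positions outside [0, L) are perfectly known (erasure prob. 0);
    this boundary effect drives the decoding wave inward."""
    x = [eps] * L

    def get(xs, i):
        return xs[i] if 0 <= i < L else 0.0

    for _ in range(max_iters):
        new = []
        for i in range(L):
            outer = 0.0
            for j in range(w):
                inner = sum(get(x, i + j - k) for k in range(w)) / w
                outer += 1.0 - (1.0 - inner) ** (dc - 1)
            new.append(eps * (outer / w) ** (dv - 1))
        if max(abs(a - b) for a, b in zip(new, x)) < tol:
            x = new
            break
        x = new
    return max(x)

eps = 0.45  # above the uncoupled BP threshold (~0.4294) for (3,6) ensembles
print("uncoupled residual erasure:", uncoupled_de(eps))  # stuck near 0.355
print("coupled residual erasure:  ", coupled_de(eps))    # decodes to ~0
```

At an erasure rate between the uncoupled BP threshold and the MAP threshold, the uncoupled recursion gets stuck at a nonzero fixed point while the coupled chain decodes completely, which is precisely the saturation behavior described above.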
NVIDIA Tensor Core Programmability, Performance & Precision
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called a
"Tensor Core", that performs one matrix-multiply-and-accumulate on 4x4 matrices
per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta
microarchitecture, provides 640 Tensor Cores with a theoretical peak
performance of 125 Tflop/s in mixed precision. In this paper, we investigate
current approaches to programming NVIDIA Tensor Cores, their performance, and
the precision loss due to computation in mixed precision.
Currently, NVIDIA provides three different ways of programming
matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply
Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS
GEMM. After experimenting with different approaches, we found that NVIDIA
Tensor Cores can deliver up to 83 Tflop/s in mixed precision on a Tesla V100
GPU, seven and three times the performance in single and half precision,
respectively. A WMMA implementation of batched GEMM reaches a performance of 4
Tflop/s. While precision loss due to matrix multiplication with half-precision
input might be critical in many HPC applications, it can be considerably
reduced at the cost of increased computation. Our results indicate that HPC
applications using matrix multiplications can strongly benefit from using
NVIDIA Tensor Cores.

Comment: This paper has been accepted by the Eighth International Workshop on
Accelerators and Hybrid Exascale Systems (AsHES) 201
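The precision-loss mitigation the abstract mentions can be sketched in plain Python: round the inputs to half precision, accumulate products at higher precision, then refine by splitting each input into a half-precision high part plus a half-precision residual. This follows the general refinement idea, but the code is an illustrative stand-in — Python floats play the role of the fp32 accumulator, and no Tensor Core hardware or NVIDIA API is involved:

```python
# Sketch of mixed-precision matrix-multiply error and a refinement step.
# Python floats stand in for the wide (fp32) accumulator; struct's 'e'
# format provides IEEE-754 half-precision rounding.
import struct
import random

def to_half(x):
    """Round a float to IEEE-754 half precision and back."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

random.seed(1)
n = 1024
a = [random.random() for _ in range(n)]
b = [random.random() for _ in range(n)]

exact = sum(x * y for x, y in zip(a, b))

# Plain mixed precision: fp16 inputs, wide accumulator.
plain = sum(to_half(x) * to_half(y) for x, y in zip(a, b))

# Refinement: x = hi + lo with both parts in fp16; the extra cross
# products recover most of the input-rounding error at the cost of
# additional multiplications.
def split(x):
    hi = to_half(x)
    return hi, to_half(x - hi)

refined = 0.0
for x, y in zip(a, b):
    xh, xl = split(x)
    yh, yl = split(y)
    refined += xh * yh + xh * yl + xl * yh  # xl * yl is negligible

print("error, plain  :", abs(plain - exact))
print("error, refined:", abs(refined - exact))
```

The refined dot product trades roughly three times the multiplications for an error several orders of magnitude below the plain half-precision-input version, mirroring the "reduced at the cost of increased computation" trade-off stated above.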
Binary Message Passing Decoding of Product-like Codes
We propose a novel binary message passing decoding algorithm for product-like
codes based on bounded distance decoding (BDD) of the component codes. The
algorithm, dubbed iterative BDD with scaled reliability (iBDD-SR), exploits the
channel reliabilities and is therefore soft in nature. However, the messages
exchanged by the component decoders are binary (hard) messages, which
significantly reduces the decoder data flow. The exchanged binary messages are
obtained by combining the channel reliability with the BDD decoder output
reliabilities, properly conveyed by a scaling factor applied to the BDD
decisions. We perform a density evolution analysis for generalized low-density
parity-check (GLDPC) code ensembles and spatially coupled GLDPC code ensembles,
from which the scaling factors of the iBDD-SR for product and staircase codes,
respectively, can be obtained. For the additive white Gaussian noise channel,
we show performance gains of up to dB and dB for product and
staircase codes, respectively, compared to conventional iterative BDD (iBDD)
with the same decoder data flow. Furthermore, we show that iBDD-SR approaches
the performance of ideal iBDD that prevents miscorrections.

Comment: Accepted for publication in the IEEE Transactions on Communication
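A minimal sketch of the message-combining step described above, assuming the rule message = sign(w * BDD decision + channel LLR), with the BDD decision taken as +1/-1 for a decoded bit and 0 on a decoding failure. The scaling factor value below is arbitrary for illustration; the paper obtains it from density evolution:

```python
# Toy sketch of the iBDD-SR binary message formation (illustrative only):
# combine the scaled BDD decision with the channel LLR, then take the sign,
# so the exchanged message stays hard (binary) while still exploiting
# channel reliability.

def ibdd_sr_message(bdd_decision, channel_llr, w):
    """bdd_decision: +1/-1 for a decoded bit, 0 if BDD failed.
    channel_llr: channel log-likelihood ratio for the bit.
    w: scaling factor (obtained via density evolution in the paper)."""
    s = w * bdd_decision + channel_llr
    return 1 if s >= 0 else -1

# A confident BDD decision overrides a weakly contradicting channel LLR...
print(ibdd_sr_message(+1, -0.5, w=2.0))  # -> 1
# ...but a strong channel LLR wins against it,
print(ibdd_sr_message(+1, -3.0, w=2.0))  # -> -1
# and on BDD failure the message falls back to the channel sign.
print(ibdd_sr_message(0, -0.7, w=2.0))   # -> -1
```

This captures why the algorithm is "soft in nature" yet exchanges only binary messages: the reliability enters only at the point where each hard message is formed.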
High Performance Computing of Gene Regulatory Networks using a Message-Passing Model
Gene regulatory network reconstruction is a fundamental problem in
computational biology. We recently developed an algorithm, called PANDA
(Passing Attributes Between Networks for Data Assimilation), that integrates
multiple sources of 'omics data and estimates regulatory network models. This
approach was initially implemented in the C++ programming language and has
since been applied to a number of biological systems. In our current research
we are beginning to expand the algorithm to incorporate larger and more diverse
datasets, to reconstruct networks that contain increasing numbers of elements,
and to build not only single network models, but sets of networks. In order to
accomplish these "Big Data" applications, it has become critical that we
increase the computational efficiency of the PANDA implementation. In this
paper we show how to recast PANDA's similarity equations as matrix operations.
This allows us to implement a highly readable version of the algorithm using
the MATLAB/Octave programming language. We find that the resulting M-code is
much shorter (103 lines compared to 1,128) and more easily modifiable for potential
future applications. The new implementation also runs significantly faster,
with increasing efficiency as the network models increase in size. Tests
comparing the C-code and M-code versions of PANDA demonstrate that the speed-up
is on the order of 20-80 times for networks of dimensions similar to those
found in current biological applications.
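The recasting of per-pair similarity loops into matrix operations can be illustrated as follows. The Tanimoto-style formula here is only a stand-in for PANDA's actual update equations, and pure Python is used for self-containment; in MATLAB/Octave (or NumPy) the matrix form maps onto optimized BLAS routines, which is where the reported speed-up comes from:

```python
# Illustrative recasting of a pairwise similarity: nested loops vs. one
# Gram-style matrix product plus elementwise normalization. The formula is
# a generic Tanimoto-style similarity, not PANDA's exact update equations.

def tanimoto_loops(A, B):
    """A: n x m, B: p x m (lists of rows). Returns the n x p similarity."""
    out = []
    for a in A:
        row = []
        for b in B:
            dot = sum(x * y for x, y in zip(a, b))
            na = sum(x * x for x in a)
            nb = sum(y * y for y in b)
            row.append(dot / (na + nb - dot))
        out.append(row)
    return out

def tanimoto_matrix(A, B):
    """Same result from one matrix product G = A @ B.T plus row/column
    norms. In MATLAB/Octave this is roughly one line:
        S = (A*B') ./ (sum(A.^2,2) + sum(B.^2,2)' - A*B');"""
    G = [[sum(x * y for x, y in zip(a, b)) for b in B] for a in A]
    na = [sum(x * x for x in a) for a in A]
    nb = [sum(y * y for y in b) for b in B]
    return [[G[i][j] / (na[i] + nb[j] - G[i][j]) for j in range(len(B))]
            for i in range(len(A))]

A = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]]
B = [[1.0, 1.0, 1.0], [2.0, 0.0, 2.0]]
assert tanimoto_loops(A, B) == tanimoto_matrix(A, B)
```

The matrix form also avoids recomputing the row norms inside the inner loop, which is part of why the vectorized implementation scales better as the networks grow.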
Asymptotic and Finite Frame Length Analysis of Frame Asynchronous Coded Slotted ALOHA
We consider a frame-asynchronous coded slotted ALOHA (FA-CSA) system where
users become active according to a Poisson random process. In contrast to
standard frame-synchronous CSA (FS-CSA), users transmit a first replica of
their message in the slot following their activation and other replicas
uniformly at random in a number of subsequent slots. We derive the
(approximate) density evolution that characterizes the asymptotic performance
of FA-CSA when the frame length goes to infinity. We show that, if users can
monitor the system before they start transmitting, a boundary effect similar to
that of spatially coupled codes occurs, which greatly improves the decoding
threshold as compared to FS-CSA. We also derive analytical approximations of
the error floor (EF) in the finite frame length regime. We show that FA-CSA in
general yields a lower EF, better performance in the waterfall region, and
lower average delay than FS-CSA.

Comment: 5 pages, 6 figures. Updated notation, terminology, and typo
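A simplified Monte-Carlo sketch of the FA-CSA access mechanism described above: Poisson user activations, a first replica in the slot after activation, remaining replicas uniform over the following slots, and iterative successive interference cancellation (SIC) that repeatedly resolves singleton slots. All parameter values are illustrative, and the model omits refinements from the paper (e.g., users monitoring the system before transmitting):

```python
# Monte-Carlo sketch of frame-asynchronous coded slotted ALOHA with
# iterative SIC decoding. Illustrative parameters, not the paper's.
import math
import random

random.seed(7)

T = 4000   # simulated slots
g = 0.3    # mean user activations per slot (well below threshold)
reps = 3   # replicas per user
vf = 100   # window of slots for the non-initial replicas

def poisson(lam):
    """Knuth's method, adequate for small means."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

# Generate users: first replica in slot t+1, the rest uniform in (t+1, t+vf].
slot_users = [set() for _ in range(T)]
user_slots = []
for t in range(T - vf - 1):
    for _ in range(poisson(g)):
        uid = len(user_slots)
        s = [t + 1] + random.sample(range(t + 2, t + 1 + vf), reps - 1)
        user_slots.append(s)
        for sl in s:
            slot_users[sl].add(uid)

# Iterative SIC: decode any singleton slot, cancel that user's replicas.
decoded = set()
progress = True
while progress:
    progress = False
    for sl in range(T):
        if len(slot_users[sl]) == 1:
            u = slot_users[sl].pop()
            decoded.add(u)
            for s2 in user_slots[u]:
                slot_users[s2].discard(u)
            progress = True

plr = 1.0 - len(decoded) / len(user_slots)
print(f"{len(user_slots)} users, packet loss rate {plr:.4f}")
```

At this low load nearly all users are resolved; sweeping the activation rate g upward reveals the waterfall and error-floor regions the abstract analyzes, and replacing the asynchronous activations with frame-aligned ones gives the FS-CSA baseline for comparison.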