Super-FEC Codes for 40/100 Gbps Networking
This paper presents a simple approach to evaluating the performance bound in the very low bit-error-rate (BER) range for binary pseudo-product codes and true product codes. Moreover, it introduces a super-product BCH code that can achieve near-Shannon-limit performance with very low decoding complexity.
Comment: This work has been accepted by IEEE Communications Letters for future publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
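As an illustrative reference for this kind of low-BER evaluation (the paper's actual bound is not reproduced here), the sketch below computes the textbook block-error probability of a hard-decision bounded-distance decoder over a binary symmetric channel; the (1023, t = 3) component-code parameters are assumptions chosen for illustration.

    # Minimal sketch (not the paper's method): block-error probability of a
    # hard-decision bounded-distance decoder over a binary symmetric channel.
    # The (n, t) values below are illustrative, not taken from the paper.
    from math import comb

    def block_error_prob(n: int, t: int, p: float) -> float:
        """P(decoder failure) for an n-bit codeword correcting up to t errors."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(t + 1, n + 1))

    # Example: a (1023, 993) BCH-like component code correcting t = 3 errors at p = 1e-4.
    print(block_error_prob(1023, 3, 1e-4))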
Offloading Optimization for Low-Latency Secure Mobile Edge Computing Systems
This paper proposes a low-latency secure mobile edge computing (MEC) system in which multiple users offload computing tasks to a base station in the presence of an eavesdropper. We jointly optimize the users’ transmit power, computing capacity allocation, and user association to minimize the computing and transmission latencies across all users, subject to security and computing-resource constraints. Numerical results show that our proposed algorithm outperforms baseline strategies. Furthermore, we highlight a novel trade-off between the latency and security of MEC systems.
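As a rough illustration of the quantities such a joint optimization trades off (not the paper's system model), the sketch below evaluates the transmission-plus-computing latency of one offloading user under a secrecy-rate constraint; the channel gains, noise power, and workload parameters are assumed values.

    # Hedged sketch of the quantities typically traded off in secure MEC
    # offloading: transmission + edge-computing latency, and a secrecy-rate
    # constraint. The channel gains and all parameters below are assumptions,
    # not the paper's system model.
    import math

    def offload_latency(bits, p_tx, gain_bs, gain_eve, bandwidth, noise,
                        cycles_per_bit, f_alloc, min_secrecy_rate):
        """Return total latency (s) if the secrecy constraint holds, else None."""
        rate_bs  = bandwidth * math.log2(1 + p_tx * gain_bs  / noise)   # user -> base station
        rate_eve = bandwidth * math.log2(1 + p_tx * gain_eve / noise)   # user -> eavesdropper
        secrecy_rate = max(rate_bs - rate_eve, 0.0)
        if secrecy_rate < min_secrecy_rate:
            return None                                    # offloading not secure enough
        t_tx   = bits / rate_bs                            # upload latency
        t_comp = bits * cycles_per_bit / f_alloc           # edge execution latency
        return t_tx + t_comp

    # Example: 1 Mbit task, 0.1 W transmit power, 1 MHz bandwidth, 1 GHz CPU share.
    print(offload_latency(1e6, 0.1, 1e-6, 1e-8, 1e6, 1e-13, 500, 1e9, 1e5))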
A Computationally Efficient Neural Video Compression Accelerator Based on a Sparse CNN-Transformer Hybrid Network
Video compression is widely used in digital television, surveillance systems,
and virtual reality. Real-time video decoding is crucial in practical
scenarios. Recently, neural video compression (NVC), which combines traditional coding with deep learning, has achieved impressive compression efficiency. Nevertheless,
the NVC models involve high computational costs and complex memory access
patterns, challenging real-time hardware implementations. To relieve this
burden, we propose an algorithm and hardware co-design framework named NVCA for
video decoding on resource-limited devices. Firstly, a CNN-Transformer hybrid
network is developed to improve compression performance by capturing
multi-scale non-local features. In addition, we propose a fast-algorithm-based sparse strategy that leverages the dual advantages of pruning and fast algorithms, significantly reducing computational complexity while maintaining video compression efficiency. Secondly, a reconfigurable sparse computing core
is designed to flexibly support sparse convolutions and deconvolutions based on
the fast algorithm-based sparse strategy. Furthermore, a novel heterogeneous
layer chaining dataflow is incorporated to reduce off-chip memory traffic
stemming from extensive inter-frame motion and residual information. Thirdly,
the overall architecture of NVCA is designed and synthesized in TSMC 28nm CMOS
technology. Extensive experiments demonstrate that our design provides superior
coding quality and up to 22.7x decoding speed improvements over other video
compression designs. Meanwhile, our design achieves up to 2.2x improvements in
energy efficiency compared to prior accelerators.
Comment: Accepted by DATE 202
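A minimal sketch of the CNN-Transformer hybrid idea, i.e., fusing local convolutional features with non-local self-attention, is given below; the layer sizes, fusion scheme, and PyTorch modules are illustrative assumptions rather than the NVCA network itself.

    # Hedged sketch of a CNN-Transformer hybrid block (local conv features +
    # non-local self-attention). Layer sizes and the residual fusion scheme are
    # assumptions for illustration, not the NVCA network described above.
    import torch
    import torch.nn as nn

    class HybridBlock(nn.Module):
        def __init__(self, channels=64, heads=4):
            super().__init__()
            self.conv = nn.Sequential(                     # local feature branch
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1),
            )
            self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
            self.norm = nn.LayerNorm(channels)

        def forward(self, x):                              # x: (B, C, H, W)
            local = self.conv(x)
            b, c, h, w = x.shape
            tokens = self.norm(x.flatten(2).transpose(1, 2))   # (B, H*W, C) for attention
            non_local, _ = self.attn(tokens, tokens, tokens)
            non_local = non_local.transpose(1, 2).reshape(b, c, h, w)
            return x + local + non_local                   # residual fusion

    # Example: one block on a 64-channel feature map.
    print(HybridBlock()(torch.randn(1, 64, 32, 32)).shape)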
Efficient N:M Sparse DNN Training Using Algorithm, Architecture, and Dataflow Co-Design
Sparse training is one of the promising techniques to reduce the
computational cost of DNNs while retaining high accuracy. In particular, N:M
fine-grained structured sparsity, where only N out of every M consecutive elements
can be nonzero, has attracted attention due to its hardware-friendly pattern
and capability of achieving a high sparse ratio. However, the potential to
accelerate N:M sparse DNN training has not been fully exploited, and there is a
lack of efficient hardware supporting N:M sparse training. To tackle these
challenges, this paper presents a computation-efficient training scheme for N:M
sparse DNNs using algorithm, architecture, and dataflow co-design. At the
algorithm level, a bidirectional weight pruning method, dubbed BDWP, is
proposed to leverage the N:M sparsity of weights during both forward and
backward passes of DNN training, which can significantly reduce the
computational cost while maintaining model accuracy. At the architecture level,
a sparse accelerator for DNN training, namely SAT, is developed to neatly
support both the regular dense operations and the computation-efficient N:M
sparse operations. At the dataflow level, multiple optimization methods, including interleave mapping, pre-generation of N:M sparse weights, and offline scheduling, are proposed to boost the computational efficiency of SAT. Finally,
the effectiveness of our training scheme is evaluated on a Xilinx VCU1525 FPGA
card using various DNN models and datasets. Experimental results show the SAT
accelerator with the BDWP sparse training method under a 2:8 sparse ratio achieves an average speedup of 1.75x over dense training,
accompanied by a negligible accuracy loss of 0.56% on average. Furthermore, our
proposed training scheme significantly improves the training throughput by
2.97~25.22x and the energy efficiency by 1.36~3.58x over prior FPGA-based
accelerators.
Comment: To appear in the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD).
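The N:M sparsity pattern itself is easy to illustrate: in every group of M consecutive weights, only the N largest-magnitude entries survive. The sketch below shows this pattern in plain NumPy; it is not the paper's BDWP method, which leverages N:M sparsity in both the forward and backward passes.

    # Minimal sketch of N:M fine-grained structured pruning: in every group of
    # M consecutive weights, keep only the N largest-magnitude entries.
    # This illustrates the sparsity pattern itself, not the paper's BDWP scheme,
    # which exploits N:M sparsity in both the forward and backward passes.
    import numpy as np

    def nm_prune(weights: np.ndarray, n: int = 2, m: int = 8) -> np.ndarray:
        """Zero out all but the n largest-magnitude values in each group of m."""
        flat = weights.reshape(-1, m)                    # assumes size divisible by m
        keep = np.argsort(np.abs(flat), axis=1)[:, -n:]  # indices of the n survivors
        mask = np.zeros_like(flat, dtype=bool)
        np.put_along_axis(mask, keep, True, axis=1)
        return (flat * mask).reshape(weights.shape)

    # Example: a 4x8 weight matrix pruned to 2:8 sparsity (75% zeros).
    w = np.random.randn(4, 8)
    print(nm_prune(w, n=2, m=8))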
An Efficient FPGA-based Accelerator for Deep Forest
Deep Forest is a prominent machine learning algorithm known for its high predictive accuracy. Compared with deep neural networks, Deep Forest has
almost no multiplication operations and has better performance on small
datasets. However, due to its deep structure and large number of forests, it incurs heavy computation and memory consumption. In this
paper, an efficient hardware accelerator is proposed for deep forest models,
which is also the first work to implement Deep Forest on FPGA. Firstly, a
carefully designed node computing unit (NCU) is introduced to improve inference speed. Secondly, based on the NCU, an efficient architecture and an adaptive dataflow are proposed to alleviate the problem of node-computing imbalance in the
classification process. Moreover, an optimized storage scheme in this design
also improves hardware utilization and power efficiency. The proposed design is
implemented on an FPGA board, Intel Stratix V, and it is evaluated by two
typical datasets, ADULT and Face Mask Detection. The experimental results show
that the proposed design achieves around a 40x speedup compared to a 40-core high-performance x86 CPU.
Comment: 5 pages, 5 figures, conference
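As a small illustration of why tree inference needs no multiplications (each internal node only compares one feature against a threshold), here is a plain-Python traversal of a flattened tree; the node layout and example tree are assumptions for illustration, not the paper's NCU design.

    # Hedged sketch of multiplication-free tree inference: each internal node only
    # compares one feature against a threshold, so traversal needs no multiplies.
    # The flat node layout is an illustrative assumption, not the paper's NCU.

    def predict(nodes, x):
        """nodes[i] = (feature, threshold, left, right, leaf_label or None)."""
        i = 0
        while True:
            feature, threshold, left, right, label = nodes[i]
            if label is not None:          # reached a leaf
                return label
            # comparison-only decision, no multiplication
            i = left if x[feature] <= threshold else right

    # Example: a three-node stump splitting on feature 0.
    tree = [(0, 0.5, 1, 2, None),
            (None, None, None, None, "A"),
            (None, None, None, None, "B")]
    print(predict(tree, [0.3]), predict(tree, [0.9]))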
S2R: Exploring a Double-Win Transformer-Based Framework for Ideal and Blind Super-Resolution
Nowadays, deep-learning-based methods have demonstrated impressive performance on ideal super-resolution (SR) datasets, but most of these methods suffer dramatic performance drops when directly applied to real-world SR reconstruction tasks with unpredictable blur kernels. To tackle this issue, blind SR methods have been proposed to improve visual results under random blur kernels, but they in turn yield unsatisfactory reconstructions on ideal low-resolution images. In this paper, we propose a double-win
framework for ideal and blind SR tasks, named S2R, including a light-weight transformer-based SR model (the S2R transformer) and a novel coarse-to-fine training strategy, which achieves excellent visual results under both ideal and random blur conditions. At the algorithm level, the S2R transformer combines efficient, light-weight blocks to enhance the representation ability of the extracted features with a relatively small number of parameters. For the training
strategy, a coarse-level learning process is first performed to improve the generalization of the network with the help of a large-scale external dataset, and then a fast fine-tuning process is used to transfer the pre-trained model to real-world SR tasks by mining the internal features of the image. Experimental results show that the proposed S2R outperforms other single-image SR models under the ideal SR condition with only 578K parameters. Meanwhile, it achieves better visual results than regular blind SR models under blind blur conditions with only 10 gradient updates, which improves convergence speed by 300 times and significantly accelerates the transfer-learning process in real-world situations.
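As a rough sketch of the internal fine-tuning idea (the exact S2R recipe is not reproduced here), the snippet below builds an LR/HR training pair from the test image itself and runs a handful of gradient steps on a pre-trained model; the x2 scale, L1 loss, and optimizer settings are illustrative assumptions.

    # Hedged sketch of internal fine-tuning for blind SR: build an LR/HR pair from
    # the test image itself and run a few gradient steps on the pre-trained model.
    # The x2 scale, L1 loss, and optimizer settings are illustrative assumptions,
    # not the S2R training recipe. Assumes the model upscales by `scale` and that
    # the image height and width are divisible by `scale`.
    import torch
    import torch.nn.functional as F

    def internal_finetune(model, test_image, steps=10, scale=2, lr=1e-4):
        """test_image: (1, C, H, W) tensor used as the HR target of its own downscale."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        hr = test_image
        lr_img = F.interpolate(hr, scale_factor=1 / scale, mode="bicubic",
                               align_corners=False)
        for _ in range(steps):                 # e.g. only ~10 updates before deployment
            optimizer.zero_grad()
            loss = F.l1_loss(model(lr_img), hr)
            loss.backward()
            optimizer.step()
        return model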
- …