
    KyberMat: Efficient Accelerator for Matrix-Vector Polynomial Multiplication in CRYSTALS-Kyber Scheme via NTT and Polyphase Decomposition

    CRYSTALS-Kyber (Kyber) is one of the post-quantum cryptography (PQC) key-encapsulation mechanism (KEM) schemes selected during the standardization process. This paper addresses optimization of the Kyber architecture with respect to latency and throughput constraints. Specifically, matrix-vector multiplication and number theoretic transform (NTT)-based polynomial multiplication are the critical operations and bottlenecks that require optimization. To address this challenge, we propose an algorithm and hardware co-design approach that systematically optimizes matrix-vector multiplication and NTT-based polynomial multiplication by employing a novel sub-structure sharing technique, reducing computational complexity, i.e., the number of modular multiplications and modular additions/subtractions required. The sub-structure sharing approach is inspired by prior fast parallel approaches based on polyphase decomposition. The proposed feed-forward architecture achieves high speed, low latency, and full utilization of all hardware components, significantly enhancing the overall efficiency of the Kyber scheme. FPGA implementation results show that the proposed design, using the fast two-parallel structure, reduces execution time by approximately 90% and improves throughput by a factor of 66.

    Comment: Proc. 2023 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Francisco, CA, Oct. 29 - Nov. 2, 2023
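
    For orientation, a minimal reference sketch of the negacyclic polynomial multiplication that the NTT accelerates, using Kyber's ring parameters (q = 3329, n = 256). This is the O(n^2) schoolbook baseline, not the paper's NTT/polyphase architecture.

```python
# Reference (schoolbook) negacyclic multiplication in the ring
# R_q = Z_q[x] / (x^n + 1) used by Kyber (q = 3329, n = 256).
# This is the O(n^2) baseline; the paper's NTT/polyphase hardware
# computes the same product with O(n log n) modular multiplications.

Q, N = 3329, 256

def negacyclic_mul(a, b):
    """Multiply coefficient lists a, b of length N modulo (x^N + 1, Q)."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                c[k] = (c[k] + a[i] * b[j]) % Q
            else:
                # x^N = -1 in R_q: wrap around with a sign flip
                c[k - N] = (c[k - N] - a[i] * b[j]) % Q
    return c

# Sanity check: x^255 * x = x^256 = -1 (mod x^256 + 1)
a = [0] * N; a[255] = 1
b = [0] * N; b[1] = 1
assert negacyclic_mul(a, b)[0] == Q - 1
```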

    High-Performance VLSI Architectures for Lattice-Based Cryptography

    Lattice-based cryptography is a class of cryptographic primitives built upon hard problems on point lattices. Cryptosystems relying on lattice-based cryptography have attracted considerable attention over the last decade, owing to their post-quantum security and remarkable algorithmic constructions. In particular, homomorphic encryption (HE) and post-quantum cryptography (PQC) are the two main applications of lattice-based cryptography. Meanwhile, efficient hardware implementations of these advanced cryptographic schemes are in demand to achieve high performance. This dissertation investigates novel, high-performance very large-scale integration (VLSI) architectures for lattice-based cryptography, covering both HE and PQC schemes. It first presents different architectures for number-theoretic transform (NTT)-based polynomial multiplication, a crucial part of the fundamental arithmetic in lattice-based HE and PQC schemes. A high-speed modular integer multiplier is then proposed, tailored to lattice-based cryptography. In addition, a novel modular polynomial multiplier is presented that exploits fast finite impulse response (FIR) filter architectures to reduce the computational complexity of schoolbook modular polynomial multiplication for lattice-based PQC schemes. Finally, an NTT- and Chinese remainder theorem (CRT)-based high-speed modular polynomial multiplier is presented for HE schemes whose moduli are large integers.
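
    As a sketch of the CRT idea behind such a large-modulus multiplier: a wide modulus is split into pairwise-coprime factors, multiplication is performed residue-wise (cheap and parallel in hardware), and the result is recombined. The moduli below are toy values chosen for illustration, not the dissertation's parameters.

```python
from math import prod

# Toy pairwise-coprime moduli standing in for the CRT limbs of a large
# HE modulus; real designs use machine-word-sized NTT-friendly primes.
PRIMES = [97, 101, 103, 107]
M = prod(PRIMES)

def to_residues(x):
    return [x % p for p in PRIMES]

def crt_recombine(residues):
    """Reconstruct x mod M from its residues (Chinese remainder theorem)."""
    x = 0
    for r, p in zip(residues, PRIMES):
        Mp = M // p                            # product of the other moduli
        x = (x + r * Mp * pow(Mp, -1, p)) % M  # pow(..., -1, p): inverse mod p
    return x

# Wide multiplication done limb-wise: each residue channel is an
# independent small modular multiply (parallel in hardware).
a, b = 123456, 654321
prod_residues = [(ra * rb) % p
                 for ra, rb, p in zip(to_residues(a), to_residues(b), PRIMES)]
assert crt_recombine(prod_residues) == (a * b) % M
```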

    Convolutional Deblurring for Natural Imaging

    In this paper, we propose a novel design for image deblurring in the form of one-shot convolution filtering that can be directly convolved with naturally blurred images for restoration. Optical blurring is a common drawback in imaging applications that suffer from optical imperfections. Although numerous deconvolution methods blindly estimate blur in either inclusive or exclusive forms, they are practically challenging due to high computational cost and low image reconstruction quality. Both high accuracy and high speed are prerequisites for high-throughput imaging platforms in digital archiving, where deblurring is required after image acquisition and before images are stored, previewed, or processed for high-level interpretation. On-the-fly correction of such images is therefore important to avoid time delays, mitigate computational expense, and increase perceived image quality. We bridge this gap by synthesizing a deconvolution kernel as a linear combination of Finite Impulse Response (FIR) even-derivative filters that can be directly convolved with blurry input images to boost the frequency fall-off of the Point Spread Function (PSF) associated with the optical blur. We employ a Gaussian low-pass filter to decouple the image denoising problem from image edge deblurring. Furthermore, we propose a blind approach to estimate the PSF statistics for the Gaussian and Laplacian models that are common in many imaging pipelines. Thorough experiments test and validate the efficiency of the proposed method on 2054 naturally blurred images across six imaging applications, against seven state-of-the-art deconvolution methods.

    Comment: 15 pages, for publication in IEEE Transactions on Image Processing
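
    A minimal 1-D sketch of the central idea: a one-shot FIR deblurring kernel built as a linear combination of even Gaussian-derivative filters and applied by direct convolution. The weights alpha and beta here are illustrative placeholders, not the blindly estimated values from the paper.

```python
import numpy as np

def gaussian_even_derivs(sigma, radius):
    """A sampled Gaussian and its 2nd and 4th derivatives (even-symmetric FIRs)."""
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()
    g2 = g * (x**2 - sigma**2) / sigma**4                            # ~ d^2G/dx^2
    g4 = g * (x**4 - 6 * sigma**2 * x**2 + 3 * sigma**4) / sigma**8  # ~ d^4G/dx^4
    return g, g2, g4

g, g2, g4 = gaussian_even_derivs(sigma=1.5, radius=6)

# One-shot kernel: identity tap plus weighted even derivatives, which
# boosts the high-frequency fall-off caused by the blur. alpha and beta
# are illustrative; the paper estimates the combination blindly from
# the PSF statistics (Gaussian or Laplacian models).
alpha, beta = -0.8, 0.1
kernel = alpha * g2 + beta * g4
kernel[len(kernel) // 2] += 1.0

blurry = np.convolve(np.random.rand(256), g, mode="same")  # synthetic Gaussian blur
deblurred = np.convolve(blurry, kernel, mode="same")       # single convolution pass
```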

    An equalization technique for high rate OFDM systems

    In a typical orthogonal frequency division multiplexing (OFDM) broadband wireless communication system, a guard interval using a cyclic prefix is inserted to avoid inter-symbol interference and inter-carrier interference. This guard interval must be at least as long as the maximum channel delay spread. The method is simple, but it reduces transmission efficiency, and the efficiency is especially low in systems that exhibit a long channel delay spread with a small number of sub-carriers, such as the IEEE 802.11a wireless LAN (WLAN). To increase transmission efficiency, a time-domain equalizer (TEQ) is usually included in an OFDM system to shorten the effective channel impulse response to within the guard interval. Many TEQ algorithms have been developed for low-rate OFDM applications such as the asymmetric digital subscriber line (ADSL), but their drawback is a high computational load, and most popular TEQ algorithms are unsuitable for the IEEE 802.11a system, a high-data-rate wireless LAN based on the OFDM technique.

    In this thesis, a TEQ algorithm based on the minimum mean square error criterion is investigated for the high-rate IEEE 802.11a system. The algorithm has comparatively low computational complexity, making it practical for high-data-rate OFDM systems. In forming the model to design the TEQ, a reduced convolution matrix is exploited to lower the computational complexity. Mathematical analysis and simulation results demonstrate the validity and advantages of the algorithm. In particular, a high performance gain at a data rate of 54 Mbps is obtained with a moderate-order TEQ finite impulse response (FIR) filter.

    The algorithm is implemented in a field programmable gate array (FPGA). The characteristics and regularities of the matrix elements are further exploited to reduce the hardware complexity of the matrix multiplication implementation. The optimum TEQ coefficients can be found in less than 4 µs for a 7th-order TEQ FIR filter; this is one OFDM symbol interval in the IEEE 802.11a system. To compensate for the effective channel impulse response, a 64-point radix-4 pipeline fast Fourier transform block is implemented in the FPGA to perform zero-forcing equalization in the frequency domain. The offsets between the hardware implementation and the mathematical calculations are provided and analyzed, and the system performance loss introduced by the hardware implementation is tested. Hardware implementation output and simulation results verify that the chips function properly and satisfy the requirements of a system running at a data rate of 54 Mbps.
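
    A minimal numpy sketch of the frequency-domain zero-forcing step described above: once the TEQ has shortened the effective channel to fit within the cyclic prefix, each of the 64 sub-carriers is equalized with a single complex division. The channel taps and QPSK symbols are illustrative, not the thesis's test vectors.

```python
import numpy as np

N_FFT = 64
h_eff = np.array([1.0, 0.45, -0.2, 0.08])    # shortened effective channel (toy taps)
H = np.fft.fft(h_eff, N_FFT)                 # per-sub-carrier frequency response

# One OFDM symbol: QPSK on 64 sub-carriers, IFFT, then cyclic prefix.
tx = np.random.choice(np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]), N_FFT)
time_sig = np.fft.ifft(tx)
cp_len = len(h_eff) - 1
with_cp = np.concatenate([time_sig[-cp_len:], time_sig])

# Channel convolution, then CP removal turns it into a circular convolution.
rx = np.convolve(with_cp, h_eff)[cp_len:cp_len + N_FFT]

# Zero forcing: one complex division per sub-carrier bin.
eq = np.fft.fft(rx) / H
assert np.allclose(eq, tx)
```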

    An Introduction to Digital Signal Processing

    An Introduction to Digital Signal Processing is aimed at undergraduate students with basic knowledge of C programming, circuit theory, systems and simulation, and spectral analysis. The book focuses on the basic concepts of digital signal processing, MATLAB simulation, and implementation on selected DSP hardware; the reader is introduced to the basic concepts before embarking on the practical material in the later chapters. Digital signal processing initially evolved as a postgraduate course that slowly filtered into the undergraduate curriculum as a simplified version of the former. The goal was to study DSP concepts and provide a foundation for further research in which new and more efficient concepts and algorithms could be developed. Though valuable, this did not arm the student with all the tools that many industries using DSP technology require to develop applications. This book is an attempt to bridge that gap; the objective is to encourage the student to use a variety of development tools to develop applications.

    Contents:
    • Introduction to Digital Signal Processing
    • The transform domain analysis: the Discrete-Time Fourier Transform
    • The transform domain analysis: the Discrete Fourier Transform
    • The transform domain analysis: the z-transform
    • Review of Analogue Filters
    • Digital filter design
    • Digital Signal Processing Implementation Issues
    • Digital Signal Processing Hardware and Software
    • Examples of DSK Filter Implementation

    Low power JPEG2000 5/3 discrete wavelet transform algorithm and architecture

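    No abstract is listed for this entry. As context for the title, here is a minimal sketch of the standard reversible 5/3 lifting step used in JPEG2000 (integer predict/update), with simplified boundary handling; it illustrates the transform itself, not the paper's low-power architecture.

```python
def dwt53_forward(x):
    """One 1-D level of the reversible integer 5/3 DWT: returns (low, high)."""
    n = len(x)
    xe = lambda i: x[2 * (n - 1) - i if i >= n else abs(i)]  # symmetric extension
    # Predict step: detail (high-pass) coefficients at odd sample positions
    d = [x[2*k + 1] - (xe(2*k) + xe(2*k + 2)) // 2 for k in range(n // 2)]
    de = lambda k: d[min(max(k, 0), len(d) - 1)]             # clamped extension of d
    # Update step: smoothed (low-pass) coefficients at even sample positions
    s = [x[2*k] + (de(k - 1) + de(k) + 2) // 4 for k in range((n + 1) // 2)]
    return s, d

# The isolated spike at index 3 lands in the high-pass band.
low, high = dwt53_forward([10, 12, 14, 200, 16, 18, 20, 22])
```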

    Energy-Efficient Digital Signal Processing for Fiber-Optic Communication Systems

    Modern fiber-optic communication systems rely on complex digital signal processing (DSP) and forward error correction (FEC), which contribute a significant share of the overall link power dissipation. Bandwidth demands are ever-growing, and circuit technology scaling will, for fundamental reasons, come to an end; energy-efficient DSP design is therefore necessary from both a sustainability and a technical perspective. This thesis explores energy-efficient design of the sub-systems estimated to contribute the majority of the receiver application-specific integrated circuit's power dissipation: chromatic-dispersion compensation, dynamic equalization, nonlinearity mitigation, and forward error correction. With a focus on real-time circuit implementations of the considered algorithms, aspects such as finite-precision effects, pipelining, and parallel processing are explored, their impact on compensation and correction performance is investigated, and energy-efficient circuit implementations are developed. The sub-systems are investigated both individually and in a system context. DSP designs showing significant energy-efficiency improvements are presented, as well as very high-throughput, energy-efficient FEC designs. The sub-systems are also considered in the context of datacenter interconnect links, where it is shown that DSP-based coherent systems are feasible even in power-constrained settings.
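
    A minimal sketch of frequency-domain chromatic-dispersion compensation, one of the power-hungry receiver DSP blocks discussed above: the fiber's quadratic phase is inverted by an all-pass filter applied via FFT. The fiber and sampling parameters below are typical illustrative values, not the thesis's designs.

```python
import numpy as np

FS = 64e9          # sampling rate, 64 GS/s (illustrative)
BETA2 = -21.7e-27  # group-velocity dispersion, s^2/m (typical SMF at 1550 nm)
L = 80e3           # fiber length: one 80 km span

def cd_transfer(n, sign):
    """All-pass CD response exp(sign * j * beta2/2 * omega^2 * L) on an n-point grid."""
    omega = 2 * np.pi * np.fft.fftfreq(n, d=1 / FS)
    return np.exp(sign * 1j * (BETA2 / 2) * omega**2 * L)

def cd_apply(samples, sign):
    # sign = -1 models the fiber, sign = +1 the receiver-side compensator
    return np.fft.ifft(np.fft.fft(samples) * cd_transfer(len(samples), sign))

# Round trip: dispersing and then compensating recovers the input.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096) + 1j * rng.standard_normal(4096)
assert np.allclose(x, cd_apply(cd_apply(x, -1), +1))
```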

    A high-performance inner-product processor for real and complex numbers.

    A novel, high-performance fixed-point inner-product processor based on a redundant binary number system is investigated in this dissertation. The scheme reduces the number of partial products by 50% while achieving better speed and area performance and providing opportunities for pipeline extension. When modified Booth coding is used, partial products are reduced by almost 75%, significantly reducing the multiplier addition depth. The design is applicable to digital signal and image processing applications that require real and/or complex inner-product arithmetic, such as digital filters, correlation, and convolution. It is well suited to VLSI implementation and can also be embedded as an inner-product core inside a general-purpose or DSP FPGA-based processor. Dynamic control of the computing structure permits different computations: a variety of real- and complex-valued inner products, parallel multiplication for real and complex numbers, and real and complex division. The same structure can also be controlled to accept redundant binary inputs for multiplication and inner-product computations. An improved 2's-complement to redundant binary converter is also presented.
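
    A small sketch of the radix-4 modified Booth recoding mentioned above, which maps overlapping 3-bit windows of the multiplier to digits in {-2, -1, 0, 1, 2} and roughly halves the number of partial products. The dissertation pairs this with a redundant binary adder structure, which is not modeled here.

```python
# Radix-4 modified Booth recoding table, keyed by the window
# (b[2i+1], b[2i], b[2i-1]) with b[-1] = 0.
BOOTH = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
         (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0}

def booth_digits(y, nbits=8):
    """Radix-4 Booth digits of y (two's complement, nbits even), LSB digit first."""
    bits = [(y >> i) & 1 for i in range(nbits)]
    digits, prev = [], 0
    for i in range(0, nbits, 2):
        digits.append(BOOTH[(bits[i + 1], bits[i], prev)])
        prev = bits[i + 1]
    return digits  # digit at index i carries weight 4**i

def booth_multiply(x, y, nbits=8):
    """Sum the (shifted) partial products selected by the Booth digits."""
    return sum(d * x * 4**i for i, d in enumerate(booth_digits(y, nbits)))

assert booth_multiply(13, 11) == 143
assert booth_multiply(13, -7) == -91
```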