Search CORE

769 research outputs found

Number theoretic techniques applied to algorithms and architectures for digital signal processing

Author: Ward Jeremy S.
Publication venue
Publication date: 01/01/1983
Field of study

Many of the techniques for the computation of a two-dimensional convolution of a small fixed window with a picture are reviewed. It is demonstrated that Winograd's cyclic convolution and Fourier Transform Algorithms, together with Nussbaumer's two-dimensional cyclic convolution algorithms, have a common general form. Many of these algorithms use the theoretical minimum number of general multiplications. A novel implementation of these algorithms is proposed which is based upon one-bit systolic arrays. These systolic arrays are networks of identical cells with each cell sharing a common control and timing function. Each cell is only connected to its nearest neighbours. These are all attractive features for implementation using Very Large Scale Integration (VLSI). The throughput rate is only limited by the time to perform a one-bit full addition. In order to assess the usefulness to these systolic arrays a 'cost function' is developed to compare them with more conventional techniques, such as the Cooley-Tukey radix-2 Fast Fourier Transform (FFT). The cost function shows that these systolic arrays offer a good way of implementing the Discrete Fourier Transform for transforms up to about 30 points in length. The cost function is a general tool and allows comparisons to be made between different implementations of the same algorithm and between dissimilar algorithms. Finally a technique is developed for the derivation of Discrete Cosine Transform (DCT) algorithms from the Winograd Fourier Transform Algorithm. These DCT algorithms may be implemented by modified versions of the systolic arrays proposed earlier, but requiring half the number of cells

Durham e-Theses

OpenGrey Repository

Residue Arithmetic VLSI Array Architecture for Manipulator Pseudo-Inverse Jacobian Computation

Author: Chang P. R.
Lee C. S. G.
Publication venue: 'Purdue University (bepress)'
Publication date: 01/11/1987
Field of study

Most Cartesian-based control strategies require the computation of the manipulator inverse Jacobian in real time at every sampling period. In some cases, the Jacobian matrix is not of full column or row rank due to singularity or redundant robot configuration. This requires the computation of the manipulator pseudo-inverse Jacobian in real time. The calculation of the pseudo-inverse Jacobian may become extremely sensitive to small perturbation in the data and numerical instabilities, when the Jacobian matrix is not of full column or row rank. Even if the Jacobian matrix is of full rank, the ill-conditioned problem may still plague the computation of the pseudoinverse Jacobian. This paper presents the use of residue arithmetic for the exact computation of the manipulator pseudo-inverse Jacobian to obviate the roundoff errors normally associated with the computations. A two-level macro-pipelined residue arithmetic array architecture implementing the Decell’s pseudo-inverse algorithm has been developed to overcome the ill-conditioned problem of the pseudo-inverse computation. Furthermore, the Decell algorithm is quite suitable for VLSI array implementation to achieve the real-time computation requirement. The first-level arrays are data-driven, wavefront-like arrays and perform the matrix multiplications, matrix diagonal additions, and trace computations. A pool or sequence of the first-level arrays are then configured into a second-level macro-pipeline with outputs of one array acting as inputs to another array in the pipe. The proposed architecture can calculate the pseudoinverse Jacobian with a pipelined time in the same computational complexity order as evaluating a matrix product in a wavefront array

Purdue E-Pubs

Resolution of Linear Algebra for the Discrete Logarithm Problem Using GPU and Multi-core Architectures

Author: A. Joux
B. Schmidt
C. Lanczos
C. Pomerance
D. Coppersmith
D.H. Wiedemann
E. Thomé
K. Aoki
L.M. Adleman
R. Barbulescu
R. Barbulescu
T. ElGamal
T. Kleinjung
V. Strassen
W. Diffie
Publication venue
Publication date: 01/01/2014
Field of study

In cryptanalysis, solving the discrete logarithm problem (DLP) is key to assessing the security of many public-key cryptosystems. The index-calculus methods, that attack the DLP in multiplicative subgroups of finite fields, require solving large sparse systems of linear equations modulo large primes. This article deals with how we can run this computation on GPU- and multi-core-based clusters, featuring InfiniBand networking. More specifically, we present the sparse linear algebra algorithms that are proposed in the literature, in particular the block Wiedemann algorithm. We discuss the parallelization of the central matrix--vector product operation from both algorithmic and practical points of view, and illustrate how our approach has contributed to the recent record-sized DLP computation in GF(

2^{809}

).Comment: Euro-Par 2014 Parallel Processing, Aug 2014, Porto, Portugal. \<http://europar2014.dcc.fc.up.pt/\&gt

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

PaReNTT: Low-Latency Parallel Residue Number System and NTT-Based Long Polynomial Modular Multiplication for Homomorphic Encryption

Author: Chiu Sin-Wei
Lao Yingjie
Parhi Keshab K.
Tan Weihang
Wang Antian
Publication venue
Publication date: 06/07/2023
Field of study

High-speed long polynomial multiplication is important for applications in homomorphic encryption (HE) and lattice-based cryptosystems. This paper addresses low-latency hardware architectures for long polynomial modular multiplication using the number-theoretic transform (NTT) and inverse NTT (iNTT). Chinese remainder theorem (CRT) is used to decompose the modulus into multiple smaller moduli. Our proposed architecture, namely PaReNTT, makes four novel contributions. First, parallel NTT and iNTT architectures are proposed to reduce the number of clock cycles to process the polynomials. This can enable real-time processing for HE applications, as the number of clock cycles to process the polynomial is inversely proportional to the level of parallelism. Second, the proposed architecture eliminates the need for permuting the NTT outputs before their product is input to the iNTT. This reduces latency by n/4 clock cycles, where n is the length of the polynomial, and reduces buffer requirement by one delay-switch-delay circuit of size n. Third, an approach to select special moduli is presented where the moduli can be expressed in terms of a few signed power-of-two terms. Fourth, novel architectures for pre-processing for computing residual polynomials using the CRT and post-processing for combining the residual polynomials are proposed. These architectures significantly reduce the area consumption of the pre-processing and post-processing steps. The proposed long modular polynomial multiplications are ideal for applications that require low latency and high sample rate as these feed-forward architectures can be pipelined at arbitrary levels

arXiv.org e-Print Archive

Coarse-grained reconfigurable array architectures

Author: A Lambrechts
B Bougard
B Bougard
B Mei
B Mei
B Mei
B Sutter De
G Venkataramani
H Park
H Park
J Lee
JMP Cardoso
JW Waerdt van de
K Berkel van
K Bondalapati
K Sankaralingam
KE Coons
LH Lee
M Ahn
M Gebhart
M Schlansker
M Taylor
M Woh
MD Galanis
MH Lee
S Friedman
SA Mahlke
T Oh
Y Kim
Y Kim
Y Kim
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010
Field of study

Coarse-Grained Reconﬁgurable Array (CGRA) architectures accelerate the same inner loops that beneﬁt from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efﬁciently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on ﬂexibility, performance, and power-efﬁciency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual ﬁne-tuning of source code

Crossref

Ghent University Academic Bibliography

A New Modular Division Algorithm and Applications

Author: Lavault Christian
Sedjelmaci Sidi Mohamed
Publication venue: 'World Scientific Pub Co Pte Lt'
Publication date: 01/01/1998
Field of study

12 pagesInternational audienceThe present paper proposes a new parallel algorithm for the modular division

u/v\bmod \beta^s

, where

u,\; v,\; \beta

and

s

are positive integers

(\beta\ge 2)

. The algorithm combines the classical add-and-shift multiplication scheme with a new propagation carry technique. This ''Pen and Paper Inverse'' ({\em PPI}) algorithm, is better suited for systolic parallelization in a ''least-significant digit first'' pipelined manner. Although it is equivalent to Jebelean's modular division algorithm~\cite{jeb2} in terms of performance (time complexity, work, efficiency), the linear parallelization of the {\em PPI} algorithm improves on the latter when the input size is large. The parallelized versions of the {\em PPI} algorithm leads to various applications, such as the exact division and the digit modulus operation (dmod) of two long integers. It is also applied to the determination of the periods of rational numbers as well as their

p

-adic expansion in any radix

\beta \ge 2

CiteSeerX

HAL-Paris 13

High-Performance VLSI Architectures for Lattice-Based Cryptography

Author: Tan Weihang
Publication venue: Clemson University Libraries
Publication date: 01/12/2022
Field of study

Lattice-based cryptography is a cryptographic primitive built upon the hard problems on point lattices. Cryptosystems relying on lattice-based cryptography have attracted huge attention in the last decade since they have post-quantum-resistant security and the remarkable construction of the algorithm. In particular, homomorphic encryption (HE) and post-quantum cryptography (PQC) are the two main applications of lattice-based cryptography. Meanwhile, the efficient hardware implementations for these advanced cryptography schemes are demanding to achieve a high-performance implementation. This dissertation aims to investigate the novel and high-performance very large-scale integration (VLSI) architectures for lattice-based cryptography, including the HE and PQC schemes. This dissertation first presents different architectures for the number-theoretic transform (NTT)-based polynomial multiplication, one of the crucial parts of the fundamental arithmetic for lattice-based HE and PQC schemes. Then a high-speed modular integer multiplier is proposed, particularly for lattice-based cryptography. In addition, a novel modular polynomial multiplier is presented to exploit the fast finite impulse response (FIR) filter architecture to reduce the computational complexity of the schoolbook modular polynomial multiplication for lattice-based PQC scheme. Afterward, an NTT and Chinese remainder theorem (CRT)-based high-speed modular polynomial multiplier is presented for HE schemes whose moduli are large integers

Clemson University: TigerPrints

Application of the residue number system to the matrix multiplication problem

Author: Bischof Cornelia
Bram Martin
Fukuyama
Kawabuchi Mari
Miura Yohei
Opitz Alexander Karl
Schafbauer Wolfgang
Takemiya Satoshi
Taniguchi Schunsuke
Thaler Florian
Udomsilp David
Publication venue: Texas A&M University
Publication date: 01/01/1989
Field of study

Due to the character of the original source materials and the nature of batch digitization, quality control issues may be present in this document. Please report any quality issues you encounter to [email protected], referencing the URI of the item.Includes bibliographical references.Not availabl

Juelich Shared Electronic Resources

Fast Computation of Small Cuts via Cycle Space Sampling

Author: Pritchard David
Thurimella Ramakrishna
Publication venue
Publication date: 21/07/2010
Field of study

We describe a new sampling-based method to determine cuts in an undirected graph. For a graph (V, E), its cycle space is the family of all subsets of E that have even degree at each vertex. We prove that with high probability, sampling the cycle space identifies the cuts of a graph. This leads to simple new linear-time sequential algorithms for finding all cut edges and cut pairs (a set of 2 edges that form a cut) of a graph. In the model of distributed computing in a graph G=(V, E) with O(log V)-bit messages, our approach yields faster algorithms for several problems. The diameter of G is denoted by Diam, and the maximum degree by Delta. We obtain simple O(Diam)-time distributed algorithms to find all cut edges, 2-edge-connected components, and cut pairs, matching or improving upon previous time bounds. Under natural conditions these new algorithms are universally optimal --- i.e. a Omega(Diam)-time lower bound holds on every graph. We obtain a O(Diam+Delta/log V)-time distributed algorithm for finding cut vertices; this is faster than the best previous algorithm when Delta, Diam = O(sqrt(V)). A simple extension of our work yields the first distributed algorithm with sub-linear time for 3-edge-connected components. The basic distributed algorithms are Monte Carlo, but they can be made Las Vegas without increasing the asymptotic complexity. In the model of parallel computing on the EREW PRAM our approach yields a simple algorithm with optimal time complexity O(log V) for finding cut pairs and 3-edge-connected components.Comment: Previous version appeared in Proc. 35th ICALP, pages 145--160, 200

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne