769 research outputs found

    Number theoretic techniques applied to algorithms and architectures for digital signal processing

    Get PDF
    Many of the techniques for the computation of a two-dimensional convolution of a small fixed window with a picture are reviewed. It is demonstrated that Winograd's cyclic convolution and Fourier Transform Algorithms, together with Nussbaumer's two-dimensional cyclic convolution algorithms, have a common general form. Many of these algorithms use the theoretical minimum number of general multiplications. A novel implementation of these algorithms is proposed which is based upon one-bit systolic arrays. These systolic arrays are networks of identical cells with each cell sharing a common control and timing function. Each cell is only connected to its nearest neighbours. These are all attractive features for implementation using Very Large Scale Integration (VLSI). The throughput rate is only limited by the time to perform a one-bit full addition. In order to assess the usefulness to these systolic arrays a 'cost function' is developed to compare them with more conventional techniques, such as the Cooley-Tukey radix-2 Fast Fourier Transform (FFT). The cost function shows that these systolic arrays offer a good way of implementing the Discrete Fourier Transform for transforms up to about 30 points in length. The cost function is a general tool and allows comparisons to be made between different implementations of the same algorithm and between dissimilar algorithms. Finally a technique is developed for the derivation of Discrete Cosine Transform (DCT) algorithms from the Winograd Fourier Transform Algorithm. These DCT algorithms may be implemented by modified versions of the systolic arrays proposed earlier, but requiring half the number of cells

    Residue Arithmetic VLSI Array Architecture for Manipulator Pseudo-Inverse Jacobian Computation

    Get PDF
    Most Cartesian-based control strategies require the computation of the manipulator inverse Jacobian in real time at every sampling period. In some cases, the Jacobian matrix is not of full column or row rank due to singularity or redundant robot configuration. This requires the computation of the manipulator pseudo-inverse Jacobian in real time. The calculation of the pseudo-inverse Jacobian may become extremely sensitive to small perturbation in the data and numerical instabilities, when the Jacobian matrix is not of full column or row rank. Even if the Jacobian matrix is of full rank, the ill-conditioned problem may still plague the computation of the pseudoinverse Jacobian. This paper presents the use of residue arithmetic for the exact computation of the manipulator pseudo-inverse Jacobian to obviate the roundoff errors normally associated with the computations. A two-level macro-pipelined residue arithmetic array architecture implementing the Decell’s pseudo-inverse algorithm has been developed to overcome the ill-conditioned problem of the pseudo-inverse computation. Furthermore, the Decell algorithm is quite suitable for VLSI array implementation to achieve the real-time computation requirement. The first-level arrays are data-driven, wavefront-like arrays and perform the matrix multiplications, matrix diagonal additions, and trace computations. A pool or sequence of the first-level arrays are then configured into a second-level macro-pipeline with outputs of one array acting as inputs to another array in the pipe. The proposed architecture can calculate the pseudoinverse Jacobian with a pipelined time in the same computational complexity order as evaluating a matrix product in a wavefront array

    Resolution of Linear Algebra for the Discrete Logarithm Problem Using GPU and Multi-core Architectures

    Get PDF
    In cryptanalysis, solving the discrete logarithm problem (DLP) is key to assessing the security of many public-key cryptosystems. The index-calculus methods, that attack the DLP in multiplicative subgroups of finite fields, require solving large sparse systems of linear equations modulo large primes. This article deals with how we can run this computation on GPU- and multi-core-based clusters, featuring InfiniBand networking. More specifically, we present the sparse linear algebra algorithms that are proposed in the literature, in particular the block Wiedemann algorithm. We discuss the parallelization of the central matrix--vector product operation from both algorithmic and practical points of view, and illustrate how our approach has contributed to the recent record-sized DLP computation in GF(28092^{809}).Comment: Euro-Par 2014 Parallel Processing, Aug 2014, Porto, Portugal. \<http://europar2014.dcc.fc.up.pt/\&gt

    PaReNTT: Low-Latency Parallel Residue Number System and NTT-Based Long Polynomial Modular Multiplication for Homomorphic Encryption

    Full text link
    High-speed long polynomial multiplication is important for applications in homomorphic encryption (HE) and lattice-based cryptosystems. This paper addresses low-latency hardware architectures for long polynomial modular multiplication using the number-theoretic transform (NTT) and inverse NTT (iNTT). Chinese remainder theorem (CRT) is used to decompose the modulus into multiple smaller moduli. Our proposed architecture, namely PaReNTT, makes four novel contributions. First, parallel NTT and iNTT architectures are proposed to reduce the number of clock cycles to process the polynomials. This can enable real-time processing for HE applications, as the number of clock cycles to process the polynomial is inversely proportional to the level of parallelism. Second, the proposed architecture eliminates the need for permuting the NTT outputs before their product is input to the iNTT. This reduces latency by n/4 clock cycles, where n is the length of the polynomial, and reduces buffer requirement by one delay-switch-delay circuit of size n. Third, an approach to select special moduli is presented where the moduli can be expressed in terms of a few signed power-of-two terms. Fourth, novel architectures for pre-processing for computing residual polynomials using the CRT and post-processing for combining the residual polynomials are proposed. These architectures significantly reduce the area consumption of the pre-processing and post-processing steps. The proposed long modular polynomial multiplications are ideal for applications that require low latency and high sample rate as these feed-forward architectures can be pipelined at arbitrary levels

    Coarse-grained reconfigurable array architectures

    Get PDF
    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code

    A New Modular Division Algorithm and Applications

    No full text
    12 pagesInternational audienceThe present paper proposes a new parallel algorithm for the modular division u/vmodβsu/v\bmod \beta^s, where u,  v,  βu,\; v,\; \beta and ss are positive integers (β2)(\beta\ge 2). The algorithm combines the classical add-and-shift multiplication scheme with a new propagation carry technique. This ''Pen and Paper Inverse'' ({\em PPI}) algorithm, is better suited for systolic parallelization in a ''least-significant digit first'' pipelined manner. Although it is equivalent to Jebelean's modular division algorithm~\cite{jeb2} in terms of performance (time complexity, work, efficiency), the linear parallelization of the {\em PPI} algorithm improves on the latter when the input size is large. The parallelized versions of the {\em PPI} algorithm leads to various applications, such as the exact division and the digit modulus operation (dmod) of two long integers. It is also applied to the determination of the periods of rational numbers as well as their pp-adic expansion in any radix β2\beta \ge 2

    High-Performance VLSI Architectures for Lattice-Based Cryptography

    Get PDF
    Lattice-based cryptography is a cryptographic primitive built upon the hard problems on point lattices. Cryptosystems relying on lattice-based cryptography have attracted huge attention in the last decade since they have post-quantum-resistant security and the remarkable construction of the algorithm. In particular, homomorphic encryption (HE) and post-quantum cryptography (PQC) are the two main applications of lattice-based cryptography. Meanwhile, the efficient hardware implementations for these advanced cryptography schemes are demanding to achieve a high-performance implementation. This dissertation aims to investigate the novel and high-performance very large-scale integration (VLSI) architectures for lattice-based cryptography, including the HE and PQC schemes. This dissertation first presents different architectures for the number-theoretic transform (NTT)-based polynomial multiplication, one of the crucial parts of the fundamental arithmetic for lattice-based HE and PQC schemes. Then a high-speed modular integer multiplier is proposed, particularly for lattice-based cryptography. In addition, a novel modular polynomial multiplier is presented to exploit the fast finite impulse response (FIR) filter architecture to reduce the computational complexity of the schoolbook modular polynomial multiplication for lattice-based PQC scheme. Afterward, an NTT and Chinese remainder theorem (CRT)-based high-speed modular polynomial multiplier is presented for HE schemes whose moduli are large integers

    Application of the residue number system to the matrix multiplication problem

    Get PDF
    Due to the character of the original source materials and the nature of batch digitization, quality control issues may be present in this document. Please report any quality issues you encounter to [email protected], referencing the URI of the item.Includes bibliographical references.Not availabl

    Fast Computation of Small Cuts via Cycle Space Sampling

    Full text link
    We describe a new sampling-based method to determine cuts in an undirected graph. For a graph (V, E), its cycle space is the family of all subsets of E that have even degree at each vertex. We prove that with high probability, sampling the cycle space identifies the cuts of a graph. This leads to simple new linear-time sequential algorithms for finding all cut edges and cut pairs (a set of 2 edges that form a cut) of a graph. In the model of distributed computing in a graph G=(V, E) with O(log V)-bit messages, our approach yields faster algorithms for several problems. The diameter of G is denoted by Diam, and the maximum degree by Delta. We obtain simple O(Diam)-time distributed algorithms to find all cut edges, 2-edge-connected components, and cut pairs, matching or improving upon previous time bounds. Under natural conditions these new algorithms are universally optimal --- i.e. a Omega(Diam)-time lower bound holds on every graph. We obtain a O(Diam+Delta/log V)-time distributed algorithm for finding cut vertices; this is faster than the best previous algorithm when Delta, Diam = O(sqrt(V)). A simple extension of our work yields the first distributed algorithm with sub-linear time for 3-edge-connected components. The basic distributed algorithms are Monte Carlo, but they can be made Las Vegas without increasing the asymptotic complexity. In the model of parallel computing on the EREW PRAM our approach yields a simple algorithm with optimal time complexity O(log V) for finding cut pairs and 3-edge-connected components.Comment: Previous version appeared in Proc. 35th ICALP, pages 145--160, 200
    corecore