
    A Multi-level Blocking Distinct Degree Factorization Algorithm

    We give a new algorithm for performing the distinct-degree factorization of a polynomial P(x) over GF(2), using a multi-level blocking strategy. The coarsest level of blocking replaces GCD computations by multiplications, as suggested by Pollard (1975), von zur Gathen and Shoup (1992), and others. The novelty of our approach is that a finer level of blocking replaces multiplications by squarings, which speeds up the computation in GF(2)[x]/P(x) of certain interval polynomials when P(x) is sparse. As an application we give a fast algorithm to search for all irreducible trinomials x^r + x^s + 1 of degree r over GF(2), while producing a certificate that can be checked in less time than the full search. Naive algorithms cost O(r^2) per trinomial, thus O(r^3) to search over all trinomials of a given degree r. Under a plausible assumption about the distribution of factors of trinomials, the new algorithm has complexity O(r^2 (log r)^{3/2} (log log r)^{1/2}) for the search over all trinomials of degree r. Our implementation achieves a speedup of more than 560 over the naive algorithm in the case r = 24036583 (a Mersenne exponent). Using our program, we have found two new primitive trinomials of degree 24036583 over GF(2) (the previous record degree was 6972593).
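
    A minimal Python sketch of the naive distinct-degree scan that this blocked algorithm speeds up may help fix ideas; it is not the authors' algorithm, and the names (pmod, modmul, pgcd, smallest_factor_degree) and the int-as-bit-vector representation (bit i holds the coefficient of x^i) are illustrative choices only.

        def pmod(a, p):
            # remainder of a modulo p in GF(2)[x]; polynomials are Python ints
            d = p.bit_length() - 1
            while a and a.bit_length() - 1 >= d:
                a ^= p << (a.bit_length() - 1 - d)
            return a

        def modmul(a, b, p):
            # a * b mod p in GF(2)[x], by shift-and-xor, reducing as we go
            r, a = 0, pmod(a, p)
            while b:
                if b & 1:
                    r ^= a
                a = pmod(a << 1, p)
                b >>= 1
            return r

        def pgcd(a, b):
            # polynomial gcd in GF(2)[x]
            while b:
                a, b = b, pmod(a, b)
            return a

        def smallest_factor_degree(r, s):
            # Naive distinct-degree scan for P = x^r + x^s + 1: one modular squaring
            # plus one gcd per candidate degree d -- the O(r^2)-per-trinomial baseline
            # that the multi-level blocking strategy above is designed to beat.
            p = (1 << r) | (1 << s) | 1
            h = 2                              # the polynomial x
            for d in range(1, r // 2 + 1):
                h = modmul(h, h, p)            # h = x^(2^d) mod P
                if pgcd(h ^ 2, p) != 1:        # gcd(x^(2^d) + x, P) finds factors of degree dividing d
                    return d                   # the first hit is the smallest factor degree
            return r                           # no factor of degree <= r/2, so P is irreducible

        print(smallest_factor_degree(7, 1))    # x^7 + x + 1 is irreducible: prints 7
        print(smallest_factor_degree(8, 1))    # x^8 + x + 1 has the factor x^2 + x + 1: prints 2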

    Fast Polynomial Multiplication over F_(2^60)

    Can post-Schönhage–Strassen multiplication algorithms be competitive in practice for large input sizes? So far, the GMP library still outperforms all implementations of the recent, asymptotically more efficient algorithms for integer multiplication by Fürer, De–Kurur–Saha–Saptharishi, and ourselves. In this paper, we show how central ideas of our recent asymptotically fast algorithms turn out to be of practical interest for the multiplication of polynomials over finite fields of characteristic two. Our Mathemagix implementation is based on the automatic generation of assembly codelets. It outperforms existing implementations at large degrees, especially for polynomial matrix multiplication over finite fields.

    A comprehensive analysis of constant-time polynomial inversion for post-quantum cryptosystems

    Post-quantum cryptosystems have seen a surge in interest thanks to the ongoing standardization initiative by the U.S. National Institute of Standards and Technology (NIST). A common primitive in post-quantum cryptosystems, in particular in code-based ones, is the computation of the inverse of a binary polynomial in a binary polynomial ring. In this work, we analyze, realize in software, and benchmark a broad spectrum of binary polynomial inversion algorithms, targeting operand sizes that are relevant for the current second-round candidates in the NIST standardization process. We evaluate the advantages and shortcomings of the different inversion algorithms, including their capability to run in constant time, thus preventing timing side-channel attacks.
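
    As one concrete illustration of the constant-time concern, the sketch below shows exponentiation-based inversion via Fermat's little theorem, a^(-1) = a^(2^m - 2) in GF(2^m) = GF(2)[x]/f(x): since the exponent is a fixed public constant, the square-and-multiply schedule does not depend on the secret operand. This is a generic, hedged example, not necessarily one of the algorithms benchmarked in the paper; a genuinely constant-time implementation would additionally need branch-free, fixed-width field arithmetic, which plain Python does not provide.

        def pmod(a, f):
            # remainder of a modulo f in GF(2)[x]; polynomials are ints, bit i = coeff of x^i
            d = f.bit_length() - 1
            while a and a.bit_length() - 1 >= d:
                a ^= f << (a.bit_length() - 1 - d)
            return a

        def modmul(a, b, f):
            # a * b mod f in GF(2)[x]; note the data-dependent branch below, which a
            # real constant-time implementation would replace by masking
            r, a = 0, pmod(a, f)
            while b:
                if b & 1:
                    r ^= a
                a = pmod(a << 1, f)
                b >>= 1
            return r

        def fermat_inverse(a, f):
            # inverse of a in GF(2)[x]/f(x), with f irreducible of degree m
            m = f.bit_length() - 1
            e = (1 << m) - 2                   # fixed public exponent 2^m - 2
            r = 1
            for i in reversed(range(m)):       # fixed-length, data-independent schedule
                r = modmul(r, r, f)
                if (e >> i) & 1:               # depends only on m, never on the secret a
                    r = modmul(r, a, f)
            return r

        # sanity check in GF(2^8) with the polynomial x^8 + x^4 + x^3 + x + 1
        f = 0b1_0001_1011
        a = 0x53
        print(hex(modmul(a, fermat_inverse(a, f), f)))   # prints 0x1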

    Even faster integer multiplication

    We give a new proof of Fürer's bound for the cost of multiplying n-bit integers in the bit complexity model. Unlike Fürer, our method does not require constructing special coefficient rings with "fast" roots of unity. Moreover, we prove the more explicit bound O(n log n K^(log^* n)) with K = 8. We show that an optimised variant of Fürer's algorithm achieves only K = 16, suggesting that the new algorithm is faster than Fürer's by a factor of 2^(log^* n). Assuming standard conjectures about the distribution of Mersenne primes, we give yet another algorithm that achieves K = 4.
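
    Written out (a plain restatement, with $\log^*$ the iterated logarithm), the bounds being compared are $M(n) = O(n \log n \cdot K^{\log^* n})$ with $K = 16$ for the optimised Fürer variant, $K = 8$ for the new unconditional algorithm, and $K = 4$ under the Mersenne-prime assumption; the claimed gap between the first two is the ratio $(16/8)^{\log^* n} = 2^{\log^* n}$.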

    Fast in-place accumulated bilinear formulae

    Bilinear operations are ubiquitous in computer science, and in particular in computer algebra and symbolic computation. One of the most fundamental arithmetic operations is multiplication, and when applied to, e.g., polynomials or matrices, its result is a bilinear function of its inputs. In terms of arithmetic operations, many sub-quadratic (resp. sub-cubic) algorithms have been developed for these tasks. But these fast algorithms come at the expense of (potentially large) extra temporary space to perform the computation. By contrast, classical quadratic (resp. cubic) algorithms, when run sequentially, quite often require very few (constant) extra registers. Further work then proposed simultaneously "fast" and "in-place" algorithms, for both matrix and polynomial operations. We here propose algorithms that extend the latter line of work to accumulated algorithms arising from a bilinear formula. Indeed, one of the main ingredients of that line of work is to use the (free) space of the output as intermediate storage. When the result has to be accumulated, i.e., when the output is also part of the input, this free space does not even exist. To be able to design accumulated in-place algorithms, we relax the in-place model and allow algorithms to also modify their input, and thus to use it as intermediate storage, provided that it is restored to its initial state after completion of the procedure. This is in fact a natural possibility in many programming environments. Furthermore, this restoration allows for recursive combinations of such procedures, as the (non-concurrent) recursive calls will not corrupt the state of their callers. We propose a generic technique transforming any bilinear algorithm into an in-place algorithm under this model. It applies directly to polynomial and matrix multiplication algorithms, including fast ones.
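
    As a minimal sketch of the relaxed in-place model just described (an illustration only, not the paper's generic transformation), the function below accumulates a degree-1 Karatsuba product, c += a*b, with three multiplications and O(1) extra registers: the inputs are used as temporary storage and restored before returning. The function name and list-based representation are ad hoc.

        def karatsuba1_accumulate(c, a, b):
            # c: 3 coefficients, a, b: 2 coefficients; computes c += a * b where
            # a*b = a0*b0 + (a0*b1 + a1*b0)*x + a1*b1*x^2
            p0 = a[0] * b[0]
            p2 = a[1] * b[1]
            a[0] += a[1]                      # temporarily overwrite a0 with a0 + a1
            b[0] += b[1]                      # temporarily overwrite b0 with b0 + b1
            c[1] += a[0] * b[0] - p0 - p2     # middle coefficient (a0+a1)(b0+b1) - p0 - p2
            a[0] -= a[1]                      # restore a ...
            b[0] -= b[1]                      # ... and b to their initial state
            c[0] += p0
            c[2] += p2

        # (3 + 2x)(5 + 7x) = 15 + 31x + 14x^2, accumulated onto c = [1, 1, 1]
        c, a, b = [1, 1, 1], [3, 2], [5, 7]
        karatsuba1_accumulate(c, a, b)
        print(c, a, b)                        # [16, 32, 15] [3, 2] [5, 7]: inputs restored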

    Loss of Precision in Implementations of the Toom-Cook Algorithm

    Historically, polynomial multiplication has required a quadratic number of operations. Several algorithms in the past century have improved upon this. In this work, we focus on the Toom-Cook algorithm. Devised by Toom in 1963, it is a family of algorithms parameterized by an integer, n. The algorithm multiplies two polynomials by recursively dividing them into smaller polynomials, multiplying many small polynomials, and interpolating to obtain the product. While it is no longer the asymptotically fastest method of multiplication, there is a range of intermediate degrees (typically less than 1000) where it performs best. Some applications, like quantum-resistant cryptosystems, require the use of polynomials whose coefficients belong to the ring of integers modulo a power of 2. A problem arises with using the Toom-Cook algorithm to multiply these polynomials because the interpolation step of the algorithm requires division by even numbers. This results in a loss of 2-adic precision. If too many bits of precision are lost, the product will be incorrect. Interpolating a polynomial from some of its values is generally easy, and different works have solved the interpolation step of the Toom-Cook algorithm with different equations. In order to track the loss of precision, it is necessary to establish and prove the general form of the solution to the system of equations. We present three sets of interpolation formulas: the matrix, natural, and efficient formulas. For any integer n > 2, we seek to find a general expression for each of the three sets of formulas, and to prove the respective loss of precision. First, for the efficient interpolation, we prove the general set of formulas. Then, for the natural interpolation, we conjecture a general set of formulas that depends on two combinatorial identities. We prove the first identity and some cases of the second identity. Finally, we prove the loss of precision of the matrix interpolation formulas.
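
    The source of the 2-adic loss can be seen already in classical Toom-3 interpolation at the evaluation points {0, 1, -1, 2, infinity}: recovering the middle coefficient uses c2 = (r(1) + r(-1))/2 - r(0) - r(inf), and while the division by 2 is exact over the integers, modulo 2^k it only determines c2 modulo 2^(k-1). The small Python demonstration below (with arbitrary made-up coefficients, not data from this work) shows one bit being lost.

        k = 16
        M = 1 << k                                    # coefficients live in Z / 2^k

        # arbitrary "true" coefficients c0..c4 of a product of two quadratics
        c = [0x1234, 0xBEEF, 0x8001, 0x0042, 0x0777]

        def r(x):                                     # evaluate the product over the integers
            return sum(ci * x**i for i, ci in enumerate(c))

        r0, r1, rm1, rinf = r(0) % M, r(1) % M, r(-1) % M, c[4] % M

        # Over Z, r(1) + r(-1) = 2*(c0 + c2 + c4), so the division by 2 is exact;
        # modulo 2^k we only know the sum mod 2^k, hence the quotient mod 2^(k-1).
        half_sum = ((r1 + rm1) % M) // 2              # parity survives reduction, so // 2 is well defined
        c2_recovered = (half_sum - r0 - rinf) % M

        print(hex(c[2]), hex(c2_recovered))           # 0x8001 vs 0x1: the top bit is gone
        print((c[2] - c2_recovered) % (M // 2) == 0)  # True: the low k-1 bits agree

    In practice one compensates by carrying a few guard bits of extra precision or by choosing interpolation formulas with fewer divisions by 2; this is precisely the loss the thesis tracks for each of the three formula sets.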

    Towards Comprehensive Parametric Code Generation Targeting Graphics Processing Units in Support of Scientific Computation

    The most popular multithreaded languages based on the fork-join concurrency model (CilkPlus, OpenMP) are currently being extended to support other forms of parallelism (vectorization, pipelining, and single-instruction-multiple-data (SIMD)). In the SIMD case, the objective is to execute the corresponding code on a many-core device, such as a GPGPU, for which the CUDA language is a natural choice. Since the programming concepts of CilkPlus and OpenMP are very different from those of CUDA, it is desirable to automatically generate optimized CUDA-like code from CilkPlus or OpenMP. In this thesis, we propose an accelerator model for annotated C/C++ code together with an implementation that allows the automatic generation of CUDA code. One of the key features of this CUDA code generator is that it supports the generation of CUDA kernel code where program parameters (like the number of threads per block) and machine parameters (like the shared memory size) are treated as unknown symbols. Hence, these parameters need not be known at code-generation time: machine parameters and program parameters can instead be determined when the generated code is installed on the target machine. In addition, we show how these parametric CUDA programs can be optimized at compile time in the form of a case discussion, where the cases depend on the values of machine parameters (e.g. hardware resource limits) and program parameters (e.g. dimension sizes of thread blocks). This generation of parametric CUDA kernels requires dealing with non-linear polynomial expressions during the dependence analysis and tiling phases. To carry out these algebraic calculations, we take advantage of techniques from computer algebra, in particular the RegularChains library of Maple. Various illustrative examples are provided together with a performance evaluation.
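
    The "parameters as unknown symbols" idea can be illustrated with a deliberately small toy (hypothetical names and kernel, not the thesis' generator, which starts from annotated CilkPlus/OpenMP code): a CUDA kernel is emitted with the threads-per-block count left as an unresolved symbol, and the symbol is only bound at install time, once the target GPU's resource limits (threads per block, shared memory per block) are known.

        from string import Template

        # code-generation time: the block size stays symbolic in the emitted CUDA source
        KERNEL = Template("""
        #define THREADS_PER_BLOCK $tpb  /* program parameter: symbolic until install time */

        __global__ void saxpy_staged(int n, float a, const float *x, float *y) {
            /* shared staging buffer sized by the still-symbolic block size; its
               footprint is what the machine's shared-memory limit constrains */
            __shared__ float tile[THREADS_PER_BLOCK];
            int i = blockIdx.x * THREADS_PER_BLOCK + threadIdx.x;
            if (i < n) {
                tile[threadIdx.x] = x[i];
                y[i] = a * tile[threadIdx.x] + y[i];
            }
        }
        """)

        def generate():
            return KERNEL.safe_substitute()            # leaves $tpb untouched

        def install(src, threads_per_block):
            # install time: bind the symbol for a concrete device
            return Template(src).substitute(tpb=threads_per_block)

        parametric = generate()
        print(install(parametric, threads_per_block=256))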

    High Performance Sparse Multivariate Polynomials: Fundamental Data Structures and Algorithms

    Polynomials may be represented sparsely in an effort to conserve memory and provide a succinct and natural representation. Moreover, polynomials which are themselves sparse (having very few non-zero terms) waste memory and computation time if represented, and operated on, densely. This waste is exacerbated as the number of variables increases. We provide practical implementations of sparse multivariate data structures focused on data locality and cache complexity. Using these sparse data structures, we develop high-performance algorithms and implementations of fundamental polynomial operations such as arithmetic (addition, subtraction, multiplication, and division) and interpolation. We revisit a sparse arithmetic scheme introduced by Johnson in 1974, adapting and optimizing these algorithms for modern computer architectures, with our implementations over the integers and rational numbers vastly outperforming the current widespread implementations. We develop a new algorithm for sparse pseudo-division based on the sparse polynomial division algorithm, with very encouraging results. Polynomial interpolation is explored through univariate, dense multivariate, and sparse multivariate methods. Arithmetic and interpolation together form a solid high-performance foundation from which many higher-level and more interesting algorithms can be built.
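
    For context on the arithmetic scheme mentioned above, here is a compact Python rendering of Johnson's 1974 heap-based sparse multiplication (a textbook-style sketch, not the thesis' cache-optimized implementation; the function name and the (exponent, coefficient) pair representation are illustrative). Each term of one operand drives a stream of products with the other operand, and a heap merges the streams in decreasing exponent order so that like terms can be combined as they are produced.

        import heapq

        def sparse_mul(A, B):
            # A, B: lists of (exponent, coefficient) pairs with distinct exponents,
            # sorted by decreasing exponent; returns the product in the same form.
            if not A or not B:
                return []
            # one stream per term of A; entry (-(exponent), i, j) represents A[i] * B[j],
            # negated so that heapq's min-heap pops the largest exponent first
            heap = [(-(A[i][0] + B[0][0]), i, 0) for i in range(len(A))]
            heapq.heapify(heap)
            result = []
            while heap:
                negexp, i, j = heapq.heappop(heap)
                exp, coef = -negexp, A[i][1] * B[j][1]
                if result and result[-1][0] == exp:
                    result[-1] = (exp, result[-1][1] + coef)     # combine like terms
                else:
                    result.append((exp, coef))
                if j + 1 < len(B):                               # advance this stream through B
                    heapq.heappush(heap, (-(A[i][0] + B[j + 1][0]), i, j + 1))
            return [(e, c) for e, c in result if c != 0]         # drop terms that cancelled

        # (x^5 + 3x^2 + 1) * (x^3 - x) = x^8 - x^6 + 3x^5 - 2x^3 - x
        print(sparse_mul([(5, 1), (2, 3), (0, 1)], [(3, 1), (1, -1)]))

    The multivariate case runs the same loop once each exponent vector is packed into a single comparable key under a monomial order, which is where the data-structure and locality choices described above start to matter.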