    Big Prime Field FFT on the GPU

    We consider prime fields of large characteristic, typically fitting on k machine words, where k is a power of 2. When the characteristic of these fields is restricted to a subclass of the generalized Fermat numbers, we show that arithmetic operations in such fields offer attractive performance both in terms of algebraic complexity and parallelism. In particular, these operations can be vectorized, leading to efficient implementations of fast Fourier transforms on graphics processing units.
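
    As a toy illustration of the representation at work here, the Python sketch below uses a small generalized Fermat prime (the paper targets primes p = r^k + 1 where the radix r fits a machine word and k is a power of 2; the concrete values below are purely illustrative). The point is that r is a 2k-th root of unity modulo p, so multiplying by r is a negacyclic shift of the radix-r digits rather than a full product.

        # Toy parameters: p = r**k + 1 must be prime (here 101).
        r, k = 10, 2
        p = r**k + 1

        # r is a 2k-th root of unity mod p, since r**k = -1 (mod p).
        assert pow(r, k, p) == p - 1
        assert pow(r, 2 * k, p) == 1

        def digits(x):
            # k radix-r digits of x, least significant first
            return [(x // r**i) % r for i in range(k)]

        x = 37
        top = digits(x)[-1]                 # digit that wraps around
        shifted = [0] + digits(x)[:-1]      # digit shift = multiply by r
        val = sum(d * r**i for i, d in enumerate(shifted))
        assert (val - top) % p == (x * r) % p   # negacyclic shift, no full product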

    On The Parallelization Of Integer Polynomial Multiplication

    With the advent of hardware accelerator technologies, multi-core processors and GPUs, much effort has been made to take advantage of these architectures by designing parallel algorithms. Achieving this goal requires attention to both algebraic complexity and parallelism, together with efficient use of memory traffic and caches, and reduced overheads in the implementations. Polynomial multiplication is at the core of many algorithms in symbolic computation, such as real root isolation, which is our main application here. In this thesis, we first investigate the multiplication of dense univariate polynomials with integer coefficients, targeting multi-core processors. Some of the proposed methods are based on well-known serial classical algorithms, whereas a novel algorithm is designed to make efficient use of the targeted hardware. Experimentation confirms our theoretical analysis. Second, we report on the first implementation of subproduct tree techniques on many-core architectures. These techniques are essentially another application of polynomial multiplication, but over a prime field; they are used in multi-point evaluation and interpolation of polynomials with coefficients in a prime field.
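
    To make the subproduct tree technique concrete, here is a minimal sequential Python sketch of multi-point evaluation over a small prime field. The thesis relies on FFT-based multiplication and many-core parallelism; the naive arithmetic, the prime 97 and the power-of-two number of points below are simplifying assumptions.

        P = 97  # illustrative prime

        def poly_mul(a, b):
            # product of two polynomials mod P; coefficients lowest degree first
            out = [0] * (len(a) + len(b) - 1)
            for i, ai in enumerate(a):
                for j, bj in enumerate(b):
                    out[i + j] = (out[i + j] + ai * bj) % P
            return out

        def poly_rem(a, b):
            # remainder of a modulo the monic polynomial b
            a = a[:]
            while len(a) >= len(b):
                c = a[-1]
                for i in range(len(b)):
                    a[len(a) - len(b) + i] = (a[len(a) - len(b) + i] - c * b[i]) % P
                a.pop()
            return a

        def subproduct_tree(points):
            # leaves are the linear polynomials X - x_i; len(points) a power of 2
            level = [[(-x) % P, 1] for x in points]
            tree = [level]
            while len(level) > 1:
                level = [poly_mul(level[i], level[i + 1])
                         for i in range(0, len(level), 2)]
                tree.append(level)
            return tree

        def eval_at_points(f, points):
            # reduce f down the tree; the leaf remainders are the values f(x_i)
            tree = subproduct_tree(points)
            rems = [poly_rem(f, tree[-1][0])]
            for level in reversed(tree[:-1]):
                rems = [poly_rem(rems[i // 2], node) for i, node in enumerate(level)]
            return [r[0] if r else 0 for r in rems]

        assert eval_at_points([0, 0, 1], [1, 2, 3, 4]) == [1, 4, 9, 16]  # f = X^2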

    Towards Comprehensive Parametric Code Generation Targeting Graphics Processing Units in Support of Scientific Computation

    The most popular multithreaded languages based on the fork-join concurrency model (CilkPlus, OpenMP) are currently being extended to support other forms of parallelism (vectorization, pipelining and single-instruction-multiple-data (SIMD)). In the SIMD case, the objective is to execute the corresponding code on a many-core device, like a GPGPU, for which the CUDA language is a natural choice. Since the programming concepts of CilkPlus and OpenMP are very different from those of CUDA, it is desirable to automatically generate optimized CUDA-like code from CilkPlus or OpenMP. In this thesis, we propose an accelerator model for annotated C/C++ code together with an implementation that allows the automatic generation of CUDA code. One of the key features of this CUDA code generator is that it supports the generation of CUDA kernel code where program parameters (like the number of threads per block) and machine parameters (like the shared memory size) are treated as unknown symbols. Hence, these parameters need not be known at code-generation time: machine parameters and program parameters can be determined once the generated code is installed on the target machine. In addition, we show how these parametric CUDA programs can be optimized at compile time in the form of a case discussion, where cases depend on the values of machine parameters (e.g. hardware resource limits) and program parameters (e.g. dimension sizes of thread blocks). Generating parametric CUDA kernels requires dealing with non-linear polynomial expressions during the dependence analysis and tiling phases. To carry out these algebraic calculations, we take advantage of techniques from computer algebra, in particular the RegularChains library of Maple. Various illustrative examples are provided together with a performance evaluation.
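
    The Python toy below (not the thesis's actual pipeline, which builds on the RegularChains library) illustrates what a parametric kernel means in practice: the generated CUDA source keeps the threads-per-block count as a symbol, and a miniature case discussion on a machine parameter fixes it at install time.

        # Generated kernel text with the program parameter B left symbolic.
        KERNEL = """\
        __global__ void vec_add(const float *x, const float *y, float *z, int n) {
            int i = blockIdx.x * @B@ + threadIdx.x;   /* @B@ fixed at install time */
            if (i < n) z[i] = x[i] + y[i];
        }
        """

        def instantiate(max_threads_per_block):
            # Miniature case discussion on a machine parameter: choose the
            # thread-block size once the target machine is known.
            B = 1024 if max_threads_per_block >= 1024 else 256
            return KERNEL.replace("@B@", str(B))

        print(instantiate(max_threads_per_block=1024))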

    Implementation Techniques for the Truncated Fourier Transform

    We study various algorithms for the Truncated Fourier Transform (TFT), a variation of the Discrete Fourier Transform (DFT) that allows one to work with an input vector of arbitrary size without zero padding. After a review of the original algorithms for the forward and inverse TFT introduced by J. van der Hoeven, we consider the variation of D. Harvey as well as that of J. Johnson and L.C. Meng. Both variations are based on Cooley-Tukey-like formulas. The former is called strict general radix, as it strictly follows the specifications proposed by J. van der Hoeven, while the latter is called relaxed general radix, as it requires some zero padding so as to improve data flow, which supports full vectorization and parallelization. In this thesis, we report on an implementation of the relaxed general radix forward TFT and a strict general radix inverse TFT. We have three objectives. First, obtaining a software tool generating optimized code for the forward and inverse TFT, extending the previous work of S. Covanov dedicated to FFT code generation. Second, comparing the practical efficiency of the strict and relaxed general radix schemes. Third, investigating the parallelization of one-dimensional TFT algorithms. Our experimental results show that, in practice, the relaxed general radix forward TFT can reach performance similar (in terms of running time, clock cycles and cache misses) to the optimized FFT code of the BPAS library, on input vectors to which both codes apply without zero padding. Moreover, for an input vector whose size lies between two consecutive sizes for which the FFT requires no zero padding, our relaxed TFT generated code provides an effective implementation. Unfortunately, the same satisfactory observation does not hold for the strict radix scheme when comparing the inverse TFT and FFT. As for parallelization, here again the relaxed general radix scheme is satisfactory while the strict general radix is not. For instance, with respect to the FFT code, the parallel forward TFT code achieves speedup factors of 5.31 and 6.78 for input vectors of size 2^23 and 2^26, respectively.
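
    As a reference point for what the TFT computes, the naive Python sketch below pins down its specification: the first n entries of a size-N DFT, N being the next power of two. Van der Hoeven's algorithm produces the same answer without materializing the padding this sketch performs; the prime 257 and the primitive root 3 are illustrative choices.

        p = 257                    # Fermat prime; 3 is a primitive root mod p

        def dft(a, N):
            w = pow(3, (p - 1) // N, p)        # N-th root of unity, N a power of 2
            a = a + [0] * (N - len(a))         # the zero padding the TFT avoids
            return [sum(aj * pow(w, i * j, p) for j, aj in enumerate(a)) % p
                    for i in range(N)]

        def tft_spec(a, n):
            # Truncated transform: first n outputs of the size-N DFT
            # (assumes len(a) <= n).
            N = 1 << (n - 1).bit_length()
            return dft(a, N)[:n]

        print(tft_spec([1, 2, 3], 3))          # arbitrary size, no output padding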

    Cache-Friendly, Modular and Parallel Schemes For Computing Subresultant Chains

    The RegularChains library in Maple offers a collection of commands for solving polynomial systems symbolically, taking advantage of the theory of regular chains. The primary goal of this thesis is algorithmic contributions, in particular high-performance computational schemes for subresultant chains and the underlying routines, extending those of RegularChains in an open-source C/C++ library. Subresultants are among the most fundamental tools in computer algebra. They are at the core of numerous algorithms including, but not limited to, polynomial GCD computations, polynomial system solving, and symbolic integration. When the subresultant chain of two polynomials is used in a client procedure, not all polynomials of the chain, or not all coefficients of a given subresultant, may be needed. Based on this observation, we design so-called speculative and caching strategies which yield substantial performance improvements within our polynomial system solver. Our implementation of these techniques has been highly optimized. We have implemented optimized core arithmetic routines and multithreaded subresultant algorithms for univariate, bivariate and multivariate polynomials. We further examine memory access patterns and data locality for computing subresultants of multivariate polynomials, and study different optimization techniques for the fraction-free LU decomposition algorithm to compute subresultants based on determinants of Bézout matrices. Our code is publicly available at www.bpaslib.org as part of the Basic Polynomial Algebra Subprograms (BPAS) library, which is mainly written in C, with concurrency support and user interfaces written in C++.
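
    The caching idea can be illustrated independently of the optimized C/C++ schemes: compute a remainder chain lazily and memoize what has been computed, so a client that inspects only a few entries never pays for the rest. The Python sketch below uses a plain remainder sequence over Q for brevity; the actual subresultant chain adds the well-known determinantal scalings, and the thesis's speculative strategy goes further still.

        from fractions import Fraction

        def poly_rem(a, b):
            # remainder of a by b over Q; coefficients highest degree first
            a = [Fraction(c) for c in a]
            while len(a) >= len(b) and any(a):
                q = a[0] / Fraction(b[0])
                for i in range(len(b)):
                    a[i] -= q * b[i]
                a.pop(0)                     # leading coefficient is now zero
            while a and a[0] == 0:
                a.pop(0)
            return a

        class LazyRemainderChain:
            def __init__(self, f, g):
                self._chain = [f, g]          # cache of entries computed so far

            def __getitem__(self, i):
                while len(self._chain) <= i:  # extend the cache only on demand
                    r = poly_rem(self._chain[-2], self._chain[-1])
                    if not r:
                        raise IndexError("remainder chain exhausted")
                    self._chain.append(r)
                return self._chain[i]

        chain = LazyRemainderChain([1, 0, 1], [1, -1])   # x^2 + 1 and x - 1
        print(chain[2])                                  # computes only what is asked for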

    Dense Arithmetic over Finite Fields with the CUMODP Library

    CUMODP is a CUDA library for exact computations with dense polynomials over finite fields. A variety of operations, such as multiplication, division, computation of subresultants, multi-point evaluation and interpolation, are provided. These routines are primarily designed to offer GPU support to polynomial system solvers, and a bivariate system solver is part of the library. The algorithms combine FFT-based and plain arithmetic, while the implementation strategy emphasizes reducing parallelism overheads and optimizing hardware usage.
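
    The combination of plain and FFT-based arithmetic mentioned above amounts to a threshold dispatch; the Python sketch below shows the shape of such a dispatch (the crossover value and the function names are illustrative, not CUMODP's API).

        PLAIN_THRESHOLD = 64    # assumed crossover, tuned per machine in practice

        def mul_mod_p(a, b, p, fft_mul):
            # Plain (schoolbook) product for small operands, where its low
            # overhead wins; FFT-based product once the sizes justify it.
            if min(len(a), len(b)) < PLAIN_THRESHOLD:
                out = [0] * (len(a) + len(b) - 1)
                for i, ai in enumerate(a):
                    for j, bj in enumerate(b):
                        out[i + j] = (out[i + j] + ai * bj) % p
                return out
            return fft_mul(a, b, p)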