    The Basic Polynomial Algebra Subprograms

    The Basic Polynomial Algebra Subprograms (BPAS) provides arithmetic operations (multiplication, division, root isolation, etc.) for univariate and multivariate polynomials over common types of coefficients (prime fields, complex rational numbers, rational functions, etc.). The code is mainly written in CilkPlus [10] targeting multicore processors. The current distribution focuses on dense polynomials and the sparse case is work in progress. A strong emphasis is put on adaptive algorithms as the library aims at supporting a wide variety of situations in terms of problem sizes and available computing resources. The BPAS library is publicly available in source at www.bpaslib.org.

    An Implementation of Power Series in the BPAS Library

    We discuss the design and implementation of lazy multivariate power series, univariate polynomials over power series, and their associated arithmetic within the Basic Polynomial Algebra Subprograms (BPAS) Library. This implementation is employed by lazy variations of Weierstrass preparation and the factorization of univariate polynomials over power series following Hensel's lemma. Our implementation is lazy in that power series terms are only computed when explicitly requested. The precision of a power series is dynamically extended upon request, without requiring any re-computation of existing terms. This design extends into an "ancestry" of power series whereby power series created from the result of arithmetic or Weierstrass preparation automatically hold on to enough information to dynamically update themselves to higher precision using information from their "parents".
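
The lazy, cached evaluation described in this abstract can be illustrated with a minimal sketch. It is not the BPAS implementation: it is univariate rather than multivariate, and the class name `LazySeries`, its methods, and the `parents` field are hypothetical names chosen for illustration.

```python
class LazySeries:
    """Univariate power series whose coefficients are computed on demand.

    Hypothetical illustration only; not the BPAS data structure.
    """

    def __init__(self, gen, parents=()):
        self._gen = gen          # gen(k) returns the coefficient of x^k
        self.parents = parents   # the series this one was built from
        self._terms = []         # cache of the coefficients computed so far

    def computed(self):
        """Number of coefficients currently held in the cache."""
        return len(self._terms)

    def coeff(self, k):
        # Extend the cache up to degree k; cached terms are never recomputed.
        while len(self._terms) <= k:
            self._terms.append(self._gen(len(self._terms)))
        return self._terms[k]

    def __add__(self, other):
        return LazySeries(lambda k: self.coeff(k) + other.coeff(k),
                          (self, other))

    def __mul__(self, other):
        # Cauchy product: coefficient k needs parent terms up to degree k only.
        return LazySeries(
            lambda k: sum(self.coeff(i) * other.coeff(k - i)
                          for i in range(k + 1)),
            (self, other),
        )


g = LazySeries(lambda k: 1)   # 1 + x + x^2 + ... = 1/(1 - x)
sq = g * g                    # 1/(1 - x)^2, nothing computed yet
```

Asking for `sq.coeff(3)` forces only the first four terms of `g`; a later request for `sq.coeff(5)` extends both caches by two terms without touching the ones already stored.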

    Applying Front End Compiler Process to Parse Polynomials in Parallel

    Parsing large expressions, in particular large polynomial expressions, is an important task for computer algebra systems. Despite the apparent simplicity of the problem, its efficient software implementation brings various challenges. Among them is the fact that this is a memory-bound application for which a multi-threaded implementation is necessarily limited by the characteristics of the memory organization of the supporting hardware. In this thesis, we design, implement and experiment with a multi-threaded parser for large polynomial expressions. We extract parallelism by splitting the input character string into meaningful sub-strings that can be parsed concurrently before being merged into a single polynomial. Our implementation targeting multi-core processors is realized with the Basic Polynomial Algebra Subprograms (BPAS). Experimental results show that the approach is promising both in terms of speedup factors and memory consumption.
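
The split, parse, and merge strategy named in this abstract can be sketched as follows. This is a toy illustration under strong assumptions, not the thesis's parser: the input is an expanded univariate polynomial in `x` with integer coefficients, chunks are cut only at term boundaries, and the function names `split_terms`, `parse_chunk`, and `parse_parallel` are hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# One signed term: optional integer coefficient, optional x, optional ^exponent.
TERM = re.compile(r"([+-])(?:(\d+)\*?)?(?:(x)(?:\^(\d+))?)?")


def split_terms(text, pieces):
    """Cut the string into roughly `pieces` chunks, only at a '+' or '-'."""
    s = text.replace(" ", "")
    if s and s[0] not in "+-":
        s = "+" + s
    cuts = [i for i, ch in enumerate(s) if ch in "+-"]
    step = max(1, len(cuts) // pieces)
    bounds = cuts[::step] + [len(s)]
    return [s[a:b] for a, b in zip(bounds, bounds[1:])]


def parse_chunk(chunk):
    """Parse one chunk into a dict mapping exponent to coefficient."""
    poly = {}
    for term in re.findall(r"[+-][^+-]*", chunk):
        m = TERM.fullmatch(term)
        if m is None or not (m.group(2) or m.group(3)):
            raise ValueError(f"bad term: {term!r}")
        sign, coef, var, exp = m.groups()
        c = int(coef or "1") * (-1 if sign == "-" else 1)
        e = int(exp) if exp else (1 if var else 0)
        poly[e] = poly.get(e, 0) + c
    return poly


def parse_parallel(text, workers=4):
    """Parse chunks concurrently, then merge the partial polynomials."""
    chunks = split_terms(text, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(parse_chunk, chunks))
    result = {}
    for part in partials:
        for e, c in part.items():
            result[e] = result.get(e, 0) + c
    return {e: c for e, c in result.items() if c != 0}
```

For example, `parse_parallel("3*x^2 - 5*x + 7 + x^2 - 7")` yields `{2: 4, 1: -5}`. The sketch shows the structure only: in standard CPython the global interpreter lock keeps these threads from running Python code simultaneously, so no speedup should be expected from it.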

    Implementation Techniques for the Truncated Fourier Transform

    We study various algorithms for the Truncated Fourier Transform (TFT), which is a variation of the Discrete Fourier Transform (DFT) that allows one to work with an input vector of arbitrary size without zero padding. After a review of the original algorithms for the forward and inverse TFT introduced by J. van der Hoeven, we consider the variation of D. Harvey as well as that of J. Johnson and L.C. Meng. Both variations are based on Cooley-Tukey-like formulas. The former is called strict general radix as it strictly follows the specifications proposed by J. van der Hoeven, while the latter is called relaxed general radix as it requires some zero padding so as to improve data flow, which supports full vectorization and parallelization. In this thesis, we report on an implementation of the relaxed general radix forward TFT and a strict general radix inverse TFT. We have three objectives. First, obtaining a software tool generating optimized code for the forward and inverse TFT, extending the previous work of S. Covanov dedicated to FFT code generation. Second, comparing the practical efficiency of the strict and relaxed general radix schemes. Third, investigating the parallelization of one-dimensional TFT algorithms. Our experimental results show that, in practice, the relaxed general radix forward TFT can reach performance similar (in terms of running time, clock cycles and cache misses) to that of the optimized FFT code of the BPAS library (on input vectors on which both codes apply without zero padding). Moreover, for an input vector whose size ranges between two consecutive values for which FFT does not require zero padding, our relaxed TFT generated code provides an effective implementation. Unfortunately, the same satisfactory observation does not hold for the strict radix scheme when comparing the inverse TFT and FFT. As for parallelization, here again the relaxed general radix scheme is satisfactory while the strict general radix is not. For instance, relative to the FFT code, the parallel forward TFT code has a speedup factor of 5.31 and 6.78 for an input vector of size 2^23 and 2^26 respectively.

    Cache-Friendly, Modular and Parallel Schemes For Computing Subresultant Chains

    The RegularChains library in Maple offers a collection of commands for solving polynomial systems symbolically by taking advantage of the theory of regular chains. The primary goal of this thesis is to make algorithmic contributions, in particular to high-performance computational schemes for subresultant chains and the underlying routines, extending those of RegularChains in an open-source C/C++ library. Subresultants are one of the most fundamental tools in computer algebra. They are at the core of numerous algorithms including, but not limited to, polynomial GCD computations, polynomial system solving, and symbolic integration. When the subresultant chain of two polynomials is involved in a client procedure, not all polynomials of the chain, or not all coefficients of a given subresultant, may be needed. Based on that observation, we design so-called speculative and caching strategies which yield great performance improvements within our polynomial system solver. Our implementation of these techniques has been highly optimized. We have implemented optimized core arithmetic routines and multithreaded subresultant algorithms for univariate, bivariate and multivariate polynomials. We further examine memory access patterns and data locality for computing subresultants of multivariate polynomials, and study different optimization techniques for the fraction-free LU decomposition algorithm to compute subresultants based on determinants of Bezout matrices. Our code is publicly available at www.bpaslib.org as part of the Basic Polynomial Algebra Subprograms (BPAS) library that is mainly written in C, with concurrency support and user interfaces written in C++.
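
To give a feel for what "fraction-free" means in this abstract, here is a minimal sketch of Bareiss-style fraction-free elimination computing a determinant over the integers. It is a swapped-in, closely related technique rather than the thesis's algorithm: it works on plain integer matrices instead of Bezout matrices of polynomials, and the function name `bareiss_det` is hypothetical.

```python
def bareiss_det(matrix):
    """Determinant of a non-empty square integer matrix, using integers only.

    Each update divides by the previous pivot; in Bareiss elimination that
    division is always exact, so no fractions ever appear.
    """
    a = [row[:] for row in matrix]   # work on a copy
    n = len(a)
    sign, prev = 1, 1
    for k in range(n - 1):
        if a[k][k] == 0:
            # Look for a row below with a nonzero entry in column k.
            swap = next((i for i in range(k + 1, n) if a[i][k] != 0), None)
            if swap is None:
                return 0             # whole column is zero: singular matrix
            a[k], a[swap] = a[swap], a[k]
            sign = -sign             # a row swap flips the determinant's sign
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                a[i][j] = (a[i][j] * a[k][k] - a[i][k] * a[k][j]) // prev
        prev = a[k][k]
    return sign * a[n - 1][n - 1]
```

The same update rule applies when the entries are polynomials, provided exact division is available in the coefficient ring; that setting is where controlling intermediate expression growth matters most.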

    Towards Comprehensive Parametric Code Generation Targeting Graphics Processing Units in Support of Scientific Computation

    The most popular multithreaded languages based on the fork-join concurrency model (CilkPlus, OpenMP) are currently being extended to support other forms of parallelism (vectorization, pipelining and single-instruction-multiple-data (SIMD)). In the SIMD case, the objective is to execute the corresponding code on a many-core device, like a GPGPU, for which the CUDA language is a natural choice. Since the programming concepts of CilkPlus and OpenMP are very different from those of CUDA, it is desirable to automatically generate optimized CUDA-like code from CilkPlus or OpenMP. In this thesis, we propose an accelerator model for annotated C/C++ code together with an implementation that allows the automatic generation of CUDA code. One of the key features of this CUDA code generator is that it supports the generation of CUDA kernel code where program parameters (like number of threads per block) and machine parameters (like shared memory size) are treated as unknown symbols. Hence, these parameters need not be known at code-generation time: machine parameters and program parameters can be respectively determined when the generated code is installed on the target machine. In addition, we show how these parametric CUDA programs can be optimized at compile time in the form of a case discussion, where cases depend on the values of machine parameters (e.g. hardware resource limits) and program parameters (e.g. dimension sizes of thread-blocks). This generation of parametric CUDA kernels requires dealing with non-linear polynomial expressions during the dependence analysis and tiling phase. To achieve these algebraic calculations, we take advantage of techniques from computer algebra, in particular those of the RegularChains library of Maple. Various illustrative examples are provided together with performance evaluation.

    Big Prime Field FFT on the GPU

    We consider prime fields of large characteristic, typically fitting on k machine words, where k is a power of 2. When the characteristic of these fields is restricted to a subclass of the generalized Fermat numbers, we show that arithmetic operations in such fields offer attractive performance both in terms of algebraic complexity and parallelism. In particular, these operations can be vectorized, leading to efficient implementation of fast Fourier transforms on graphics processing units.
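
One standard observation about moduli of the form p = r^k + 1 helps explain why such fields can be attractive: since r^k is congruent to -1 modulo p, multiplying an element written in radix r by r amounts to a digit shift with one sign change. The sketch below illustrates this with the small Fermat prime 65537 = 16^4 + 1. The abstract does not say how the paper exploits this property, and the parameters and function names here are assumptions chosen for illustration.

```python
R, K = 16, 4          # illustrative radix and digit count (assumed values)
P = R**K + 1          # 65537, a Fermat prime


def to_digits(x):
    """Radix-R digits (least significant first) of x, for 0 <= x < R**K."""
    digits = []
    for _ in range(K):
        x, d = divmod(x, R)
        digits.append(d)
    return digits


def from_digits(digits):
    """Value modulo P of a digit vector; digits may be negative."""
    return sum(d * R**i for i, d in enumerate(digits)) % P


def mul_by_r(digits):
    """Multiply by R using only a shift and a negation.

    (a_0 + a_1 R + ... + a_{K-1} R^(K-1)) * R moves every digit up one place;
    the top digit lands on R^K, which is -1 modulo P, so it wraps around
    to position 0 with its sign flipped.
    """
    return [-digits[-1]] + digits[:-1]
```

For instance, `from_digits(mul_by_r(to_digits(x)))` agrees with `(x * R) % P` for every `x` below `R**K`. The element `R**K` itself, equal to `P - 1`, does not fit in `K` digits and would need separate handling in a full implementation.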